Hashing Encoding:
It is also known as feature hashing. This technique converts categorical features into numerical values using a hash function. It is useful for high-cardinality features, where one-hot encoding creates one column per unique category and can become impractical because of memory constraints. The resulting growth in dimensionality (the curse of dimensionality) makes one-hot encoding a poor fit for such features.
Hash Function:
A function that maps a large input (like a string representing a category) to a smaller, fixed-size output (a hash value).
Collision: The possibility that two different input values (categories) map to the same hash value.
High Cardinality: A feature with a large number of unique values.
How it works:
Input: A categorical feature with a large number of unique categories (e.g., different product names, cities).
Hashing: A hash function is applied to each category, producing a hash value.
Numeric Representation: The hash values, typically reduced to a fixed number of output columns, are then used as numerical representations of the categories, suitable for use in machine learning models.
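The three steps above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: the function names `hash_bucket` and `hash_encode`, the choice of MD5, and the bucket count of 8 are all assumptions made for the example.

```python
import hashlib

def hash_bucket(category, n_buckets=8):
    """Map a category string to a bucket index in [0, n_buckets)."""
    # MD5 is used only because it is deterministic across runs (assumed choice).
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(category, n_buckets=8):
    """Return a fixed-length vector with a 1 in the category's bucket."""
    vector = [0] * n_buckets
    vector[hash_bucket(category, n_buckets)] = 1
    return vector

# Input: a categorical feature (hypothetical city names).
cities = ["Paris", "Tokyo", "Lagos", "Paris"]

# Hashing + numeric representation: every row becomes a vector of length 8,
# no matter how many distinct cities exist.
encoded = [hash_encode(city) for city in cities]
```

The same category always lands in the same bucket, so both "Paris" rows receive identical vectors, and the vector length is fixed by `n_buckets` rather than by the number of unique categories.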
Advantages:
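Memory Efficiency: Hashing reduces the number of features needed to represent categorical data, especially when you deal with high-cardinality data.
Scalable: Allows a model to handle a large number of unique categories without significant memory overhead.
Handles Unseen Values: A category that was not present in the training data is still hashed to one of the existing columns, so no lookup table of known categories is required.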
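The point about unseen values can be illustrated with a small comparison. This is a sketch under assumptions: the city names, the helper `hash_bucket`, and the bucket count are hypothetical and chosen only for the example.

```python
import hashlib

N_BUCKETS = 8  # illustrative bucket count (assumption)

def hash_bucket(category, n_buckets=N_BUCKETS):
    """Map a category string to a bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

train_cities = ["Paris", "Tokyo", "Lagos"]

# Vocabulary-based encoding (as in one-hot encoding): the column index of each
# category is learned from the training data only.
vocabulary = {city: i for i, city in enumerate(train_cities)}

new_city = "Lima"  # not present in the training data

# The lookup table has no entry for the new value.
in_vocabulary = new_city in vocabulary

# The hash function needs no lookup table, so it still returns a valid column.
bucket = hash_bucket(new_city)
```

Here `in_vocabulary` is `False`, while `bucket` is still a valid index below `N_BUCKETS`, which is why a hashed feature can be computed for new categories at prediction time.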
Disadvantages:
Collisions:
The risk of different categories hashing to the same value, which can impact model performance.
Potential for Information Loss:
Collisions can lead to a loss of information about the original categories.
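A rough way to see collisions is to hash more categories than there are buckets. The sketch below assumes synthetic category names (`product_0` to `product_99`) and an MD5-based helper; neither comes from the original data.

```python
import hashlib

def hash_bucket(category, n_buckets):
    """Map a category string to a bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def count_shared(categories, n_buckets):
    """Number of categories that could not get a bucket of their own."""
    used_buckets = {hash_bucket(c, n_buckets) for c in categories}
    return len(categories) - len(used_buckets)

# 100 synthetic categories (hypothetical names).
categories = ["product_%d" % i for i in range(100)]

# With only 16 buckets, at least 84 categories must share a bucket with
# another one, because 100 items cannot fit into 16 slots one-to-one.
shared_small = count_shared(categories, n_buckets=16)

# With many more buckets than categories, sharing becomes much less likely.
shared_large = count_shared(categories, n_buckets=4096)

print(shared_small, shared_large)
```

Once two categories share a bucket, the model can no longer tell them apart through this feature, which is the information loss described above. Increasing the number of buckets trades memory for fewer collisions.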
Load Libraries
Read data
Use BinaryEncoder to convert the gender category into numerical values
Use hash encoding to convert the categorical feature ‘card’ into hash values
Concatenate hashed_df with the original DataFrame and drop the original feature ‘card’
The Type feature remains to be encoded. It contains 4 categories.