Hashing Encoding:

It is also known as feature hashing. This technique converts categorical features into numerical values using a hash function. It is useful when you deal with high-cardinality features, where one-hot encoding may not be practical: that technique creates one column per category, which strains memory and runs into the curse of dimensionality.

Hash Function:

A function that maps a large input (like a string representing a category) to a smaller, fixed-size output (a hash value).

Collision: The possibility that two different input values (categories) map to the same hash value.

High Cardinality: A feature with a large number of unique values.

How It Works:

Input: A categorical feature with a large number of unique categories (e.g., different product names, cities).

Hashing: A hash function is applied to each category, producing a hash value.

Numeric Representation: The hash values are then used as numerical representations of the categories, suitable for use in machine learning models.
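The three steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the names `bucket_index` and `hash_encode`, the choice of SHA-256, and the 8 buckets are all assumptions made for the example. Python's built-in `hash()` is avoided because, by default, its string hashes change between interpreter runs.

```python
import hashlib


def bucket_index(category: str, n_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(categories, n_buckets=8):
    """Turn each category into a row with a single 1 in its hashed bucket."""
    rows = []
    for category in categories:
        row = [0] * n_buckets
        row[bucket_index(category, n_buckets)] = 1
        rows.append(row)
    return rows


# Input: a categorical feature (example city names, chosen for illustration).
cities = ["Paris", "Tokyo", "Lagos", "Paris"]

# Hashing + numeric representation: every row has exactly n_buckets columns,
# no matter how many distinct cities exist.
encoded = hash_encode(cities, n_buckets=8)
```

A repeated category always lands in the same column, so both "Paris" rows are identical.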

Advantages:

Memory Efficiency: Hashing reduces the number of features needed to represent categorical data, especially when you deal with high-cardinality data.

Scalable: Allows the model to handle a large number of unique categories without significant memory overhead.

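Can Handle Unseen Values: A category that did not appear in the training data is still hashed to one of the existing columns, so no new column is needed.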
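The point about unseen values can be illustrated with a small comparison. This is a sketch under assumptions: the city names, the `bucket_index` helper, and the 8 buckets are hypothetical, and the one-hot side is reduced to a plain column lookup.

```python
import hashlib


def bucket_index(category: str, n_buckets: int = 8) -> int:
    """Stable bucket index for a category; needs no fitted vocabulary."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# One-hot encoding fixes its columns from the training categories.
train_cities = ["Paris", "Tokyo", "Lagos"]
one_hot_columns = {city: i for i, city in enumerate(sorted(set(train_cities)))}

new_city = "Lima"  # not present in the training data

# One-hot has no column for the new value ...
has_one_hot_column = new_city in one_hot_columns

# ... while hashing still returns a valid column index.
hashed_bucket = bucket_index(new_city)
```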

Disadvantages:

Collisions:

The risk of different categories hashing to the same value, which can impact model performance.

Potential for Information Loss:

Collisions can lead to a loss of information about the original categories.
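How much information is lost depends on how many output columns you allow. The sketch below counts, for several bucket sizes, how many categories share a bucket with at least one other category. The synthetic product names, the `count_collided` helper, and the bucket sizes are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def bucket_index(category: str, n_buckets: int) -> int:
    """Stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def count_collided(categories, n_buckets: int) -> int:
    """Number of categories that share their bucket with another category."""
    per_bucket = Counter(bucket_index(c, n_buckets) for c in categories)
    return sum(n for n in per_bucket.values() if n > 1)


# 1000 synthetic category names standing in for a high-cardinality feature.
categories = [f"product_{i}" for i in range(1000)]

for n_buckets in (16, 256, 4096, 65536):
    print(n_buckets, count_collided(categories, n_buckets))
```

Fewer buckets save memory but merge more categories together; more buckets reduce collisions at the cost of a wider feature matrix.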

Load Libraries

Read data

Using BinaryEncoder, convert the gender category into numerical values
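The original code for this step is not shown, and BinaryEncoder presumably refers to the class in the category_encoders library. The sketch below does not call that library; it reproduces the idea by hand with pandas so the mechanics are visible. The data frame, its values, and the `binary_encode` helper are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset is not shown in the text.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "amount": [12.5, 80.0, 43.2, 5.0, 19.9],
})


def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Encode a column as the binary digits of an integer code per category."""
    codes, uniques = pd.factorize(series)  # 0-based code, in order of appearance
    codes = codes + 1                      # start at 1, leaving 0 unused
    width = max(1, len(uniques).bit_length())
    bits = {
        f"{series.name}_{i}": (codes >> (width - 1 - i)) & 1
        for i in range(width)
    }
    return pd.DataFrame(bits, index=series.index)


gender_bits = binary_encode(df["gender"])

# Replace the original text column with its binary columns.
df = pd.concat([df.drop(columns="gender"), gender_bits], axis=1)
```

With two categories this yields two 0/1 columns, `gender_0` and `gender_1`.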

 

Using hash encoding, convert the categorical feature ‘card’ into hashed values

Concatenate hashed_df with the original df and drop the original feature card

The type feature remains. It contains 4 categories.

 

Concatenate hashed_df with the original df and drop the original feature “type”
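The card and type steps above can be sketched as follows. The original code and dataset are not shown, so this is only an approximation: the rows are invented, `hash_column` is a hypothetical hand-written helper (a library hashing encoder was presumably used instead), and `n_components=4` is an arbitrary choice.

```python
import hashlib

import pandas as pd

# Hypothetical stand-in rows; column names follow the text, values are invented.
df = pd.DataFrame({
    "card": ["visa", "mastercard", "amex", "visa", "discover"],
    "type": ["debit", "credit", "credit", "prepaid", "charge"],
    "amount": [12.5, 80.0, 43.2, 5.0, 19.9],
})


def hash_column(frame: pd.DataFrame, column: str, n_components: int) -> pd.DataFrame:
    """Return one 0/1 column per hash bucket for the given feature."""
    def bucket(value) -> int:
        digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        return int(digest, 16) % n_components

    buckets = frame[column].map(bucket)
    data = {
        f"{column}_hash_{i}": (buckets == i).astype(int)
        for i in range(n_components)
    }
    return pd.DataFrame(data, index=frame.index)


for column in ("card", "type"):
    # Hash the feature, attach the new columns, then drop the original.
    hashed_df = hash_column(df, column, n_components=4)
    df = pd.concat([df, hashed_df], axis=1).drop(columns=[column])
```

After the loop, `df` holds the untouched numeric column plus four hashed columns for each of the two features.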