Ordinal encoding is a method for converting categorical data into numerical data while preserving the inherent order or ranking of the categories. It’s useful when working with machine learning models that expect numerical input features, especially when the categories have a natural order (like “low,” “medium,” “high”).
Ordinal encoding assigns each category a unique number that follows the order of the categories, whereas one-hot encoding creates a separate binary column for each category. Ordinal encoding maintains the natural order of the categories, which can be crucial for models that can interpret this order (e.g., linear models, SVMs). It’s more compact than one-hot encoding, especially for categorical features with many categories, because it produces a single column and doesn’t inflate the dimensionality of the feature space. It also suits models that require numerical input, like neural networks, which cannot work with raw category labels.
Advantages:
- Preserves the order
- Reduces dimensionality
- Suitable for various models that require numerical input
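The contrast with one-hot encoding can be sketched on a toy column. This is only an illustration: the `priority` column and its values are made up, and the mapping is written by hand with pandas rather than with any particular encoder class.

```python
import pandas as pd

# Toy feature with a natural order (hypothetical data, not from the dataset used later).
df = pd.DataFrame({"priority": ["low", "high", "medium", "low"]})

# Ordinal encoding: a single integer column whose codes follow the stated order.
order = {"low": 0, "medium": 1, "high": 2}
ordinal = df["priority"].map(order)

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["priority"])

print(ordinal.tolist())   # one code per row
print(one_hot.shape)      # (rows, number of categories)
```

The ordinal version stays one column wide no matter how many categories exist, while the one-hot version grows by one column per category.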
Download from Kaggle
This is another encoding technique, meant for categories that have an inherent, meaningful order. Say the feature education_level contains ‘Secondary’, ‘Primary’, ‘Postgraduate’, ‘University’, ‘College’ and ‘High School’. Let us download addiction_population_data.csv from the Kaggle page Addiction Population EDA & Visulation.
Store the file on any one of your drives. In my case, I have stored it on my D drive as addiction.csv.
Load libraries and read data
The dataset contains 25 columns.
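A minimal sketch of the loading step follows. So that it runs anywhere, it reads a tiny in-memory stand-in instead of the real file. The rows, the `id` and `age` columns, and the values are made up for illustration; only the three column names education_level, social_support and therapy_history come from the dataset description.

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV (made-up rows). With the real file, this would
# instead be something like pd.read_csv(r"D:\addiction.csv").
sample_csv = io.StringIO(
    "id,age,education_level,social_support,therapy_history\n"
    "1,34,Secondary,Strong,\n"
    "2,51,,Weak,Past\n"
    "3,27,University,,Current\n"
    "4,45,College,Moderate,\n"
)

df = pd.read_csv(sample_csv)  # empty fields are read as NaN

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
```

On the real file, `df.shape` is where the column count of 25 would show up.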
Information about data
The following features contain missing values:
- education_level
- social_support
- therapy_history
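One way to find such columns is to combine `DataFrame.info()` with a per-column count of missing values. The sketch below uses a small made-up dataframe as a stand-in for the real data; the values are assumptions.

```python
import numpy as np
import pandas as pd

# Small stand-in dataframe with some missing entries (made-up values).
df = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "education_level": ["Primary", np.nan, "University", "College"],
    "social_support": ["Strong", "Weak", np.nan, "Moderate"],
    "therapy_history": [np.nan, "Past", "Current", np.nan],
})

# Column names, non-null counts and dtypes.
df.info()

# Number of missing values per column, keeping only the columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])
```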
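These features hold text, so their Dtype is shown as object, and their missing values (NaN or NA) show up as a non-null count lower than the number of rows. The first task is to either fill the missing values or remove them. It is usually better to fill them; here, the missing education_level values are filled with the class “College”.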
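A possible way to do the filling is sketched below on a made-up dataframe. Filling education_level with “College” follows the choice described in this article. Filling the other two columns with their most frequent value is an assumption made for the example, not necessarily what was done originally.

```python
import numpy as np
import pandas as pd

# Small stand-in dataframe with some missing entries (made-up values).
df = pd.DataFrame({
    "education_level": ["Primary", np.nan, "University", "College"],
    "social_support": ["Strong", "Weak", np.nan, "Strong"],
    "therapy_history": [np.nan, "Past", "Current", "Past"],
})

# Fill education_level with a fixed class.
df["education_level"] = df["education_level"].fillna("College")

# Assumption: fill the remaining columns with their most frequent value (the mode).
for col in ["social_support", "therapy_history"]:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Dropping the affected rows with `df.dropna()` is the alternative, at the cost of losing data.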
Now check whether any features still contain missing values
There are now no missing values.
Categories (classes) as they appear in the dataset.
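The categories can be listed in order of first appearance with `Series.unique()`. The rows below are a made-up stand-in that uses the six education levels named earlier.

```python
import pandas as pd

# Stand-in column containing the six education levels (row order is made up).
df = pd.DataFrame({
    "education_level": [
        "Secondary", "Primary", "Postgraduate", "University",
        "College", "High School", "Primary", "College",
    ]
})

# Distinct categories, in the order they first appear in the column.
categories = df["education_level"].unique()
print(categories)

# How many rows fall into each category.
print(df["education_level"].value_counts())
```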
Ordinal Encoder
Alphabetical Order
By default, OrdinalEncoder does not encode education_level in the hierarchy we use in real life. It sorts the categories and assigns the codes in alphabetical order.
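The default behaviour can be checked with a short sketch like the one below. The dataframe is a made-up stand-in and the `education_code` column name is chosen only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in column containing the six education levels.
df = pd.DataFrame({
    "education_level": [
        "Secondary", "Primary", "Postgraduate",
        "University", "College", "High School",
    ]
})

# With no `categories` argument, the encoder sorts the categories it finds.
encoder = OrdinalEncoder()

# fit_transform expects a 2D input, hence the double brackets.
df["education_code"] = encoder.fit_transform(df[["education_level"]]).ravel()

print(encoder.categories_[0])  # the sorted categories; position = code
print(df)
```

In this sketch “College” receives code 0 and “Primary” receives code 3, which does not reflect the real-life progression of education levels.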
To get codes that follow the real-life hierarchy of education levels:
Once again, load and read the data into another dataframe
Find the categories/classes of education_level
Since education_level had many missing values, we filled them with the class “College”.
You have to import OrdinalEncoder from the sklearn.preprocessing module.
Manually create the category order as a list. Then use OrdinalEncoder to encode the categories in the order we have defined.
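A minimal sketch of this step follows. The specific ranking in `education_order` (Primary lowest, Postgraduate highest) is an assumption made for the example and should be adjusted to the hierarchy you actually intend. The dataframe is again a made-up stand-in.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in column containing the six education levels.
df = pd.DataFrame({
    "education_level": [
        "Secondary", "Primary", "Postgraduate",
        "University", "College", "High School",
    ]
})

# Assumed real-life order, from lowest to highest (adjust as needed).
education_order = [
    "Primary", "Secondary", "High School",
    "College", "University", "Postgraduate",
]

# `categories` takes one list per encoded column, hence the outer brackets.
encoder = OrdinalEncoder(categories=[education_order])
df["education_code"] = encoder.fit_transform(df[["education_level"]]).ravel()

print(df.sort_values("education_code"))
```

Each category's code is its position in the list, so changing the list changes the codes. A value that is not in the list raises an error at fit time under the encoder's default settings.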
Programmed Order
Now the categories of the education_level feature are coded as we desired.