
How to Handle Categorical Data?



Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will be discussing the following topics:

  1. What is categorical data?

  2. How to handle categorical data?

Now let us begin!


What is Categorical Data?


Categorical data, or non-numerical data, is a type of data that represents characteristics or attributes with distinct categories or groups. It comprises qualitative information that does not have a natural numerical value. Instead, it classifies or categorizes the data into groups or classes based on specific attributes. These categories can be nominal or ordinal:

  1. Nominal data: This type of categorical data does not have an inherent order or ranking. It simply represents different categories without any inherent sequence. For example, colors (red, blue, green) or types of animals (dog, cat, bird) are nominal categorical data.

  2. Ordinal data: In contrast, ordinal data has a specific order or ranking among its categories. It categorizes data into groups that can be arranged in a particular sequence or rank. An example of ordinal data might be size categories (small, medium, large) or education levels (high school, bachelor's, master's, Ph.D.).
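To make the distinction concrete, here is a minimal sketch (with made-up values) of how pandas represents the two kinds of categorical data; the ordered flag is what separates ordinal from nominal:

import pandas as pd

# Nominal: distinct categories with no inherent ranking
colors = pd.Categorical(["red", "blue", "green", "blue"])

# Ordinal: categories with an explicit order (small < medium < large)
sizes = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(colors.ordered)            # False: comparing colors is meaningless
print(sizes.min(), sizes.max())  # small large: ordering is meaningful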

Although categorical data is common, most machine learning models are not inherently equipped to handle it (with the exception of models such as XGBoost, which support categorical features natively). As a result, for both regression and classification tasks, encoding categorical data is a pivotal step that comes right after handling missing data, as indicated in Table 1.


Table 1. The Data Preprocessing Pipeline for Regression and Classification.

Step                                                 Regression   Classification
Loading data                                         Y            Y
Splitting data into training, validation, and test   Y            Y
Handling uncommon features                           Y            Y
Handling identifiers                                 Y            Y
Handling missing data                                Y            Y
Encoding categorical data                            Y            Y
Scaling data                                         Y            Y
Handling class imbalance                             N            Y

How to Handle Categorical Data?


The concept of dealing with categorical data is quite simple: transforming non-numeric values of a categorical variable into numerical formats to make them suitable for machine learning models. However, the method of encoding these categorical variables varies based on the characteristics of the variables. In this section, we will explore strategies for encoding both categorical features and categorical targets.


Encoding Categorical Features


One popular method for encoding categorical features is One-Hot-Encoding (OHE). This technique transforms a categorical feature into multiple binary columns, each representing a unique category value. For example, as shown in Figure 1, applying OHE to the Color feature with three unique values, Red, Blue, and Green, yields three separate binary columns, Color_Red, Color_Blue, and Color_Green. This process effectively transforms the categorical information into a numerical format.


Figure 1. Using One-Hot-Encoding for categorical feature Color.


The code snippet below shows how to use the pandas get_dummies function to encode categorical features.

import pandas as pd
import numpy as np

# One-hot-encode every column in df except the target
# (this assumes all non-target columns are categorical and that
# name_of_the_target holds the name of the target column)
df = pd.get_dummies(
    df,
    columns=np.setdiff1d(df.columns, [name_of_the_target])
)
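As a quick usage example, here is the encoding from Figure 1 applied to a tiny made-up DataFrame (the Price column is hypothetical, included only to show that columns not being encoded pass through unchanged):

import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green"],
    "Price": [10, 12, 9],
})

# Encode only the Color column; dtype=int yields 0/1 instead of booleans
print(pd.get_dummies(df, columns=["Color"], dtype=int))
#    Price  Color_Blue  Color_Green  Color_Red
# 0     10           0            0          1
# 1     12           0            1          0
# 2      9           1            0          0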

It is crucial to emphasize that, because One-Hot-Encoding (OHE) transforms a categorical feature with k distinct values into k binary columns, it runs into trouble when a feature has a large number of unique values (i.e., when k is large). The resulting explosion in dimensionality hurts the performance and efficiency of the subsequent machine learning models. Strategies such as consolidating values into groups through feature engineering (sketched below), or tools such as XGBoost that support categorical features natively, can alleviate these challenges.
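Here is a minimal sketch of the consolidation strategy, assuming a hypothetical high-cardinality feature named City and an arbitrary frequency threshold of 10 occurrences:

import pandas as pd

# Count how often each category of the hypothetical City feature occurs
counts = df["City"].value_counts()

# Fold categories seen fewer than 10 times into a catch-all "Other" group
rare = counts[counts < 10].index
df["City"] = df["City"].where(~df["City"].isin(rare), "Other")

# OHE now produces one column per surviving group instead of one per city
df = pd.get_dummies(df, columns=["City"])

Alternatively, recent versions of XGBoost can consume pandas category-typed columns directly when the estimator is constructed with enable_categorical=True, sidestepping OHE altogether.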


Note that it may be tempting to assign numerical values directly to the categories of a categorical feature. For example, in the illustration above, there might be an inclination to encode Red, Blue, and Green as 1, 2, and 3, a method commonly known as Label Encoding (LE), as demonstrated below.


Figure 2. Using Label Encoding for categorical feature Color.


As illustrated in Figures 1 and 2, while OHE increases dimensionality, LE does not. This might initially seem advantageous, implying that LE avoids the dimensionality issue associated with OHE. However, despite its allure, using LE in this context is incorrect. This is because the categorical feature Color is nominal, lacking inherent order or ranking, as discussed earlier. Employing LE for encoding would assign numerical values with an unintended order. For instance, Blue might be interpreted as twice the magnitude of Red, and Green as three times the magnitude of Red. The question is, why is this problematic?


To exemplify, let us consider a scenario where we are constructing a linear regression model to predict the sale price of houses based on their color. The model's equation takes the form:


Sale Price = W × Color,


where W represents the weight assigned to feature Color. If we were to employ Label Encoding (LE) for encoding Color (as illustrated in Figure 2), the predicted price for a blue house would erroneously be twice the magnitude of a red one, and the predicted price for a green house would inaccurately be three times the magnitude of a red one. This discrepancy clearly highlights the inappropriateness of such an approach.
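To put numbers on it: with a hypothetical weight of W = 100, the model would predict 100 for a red house, 200 for a blue house, and 300 for a green house, purely as an artifact of the arbitrary codes 1, 2, and 3 rather than of any real relationship between color and price.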

Now, you might wonder if adopting LE for an ordinal categorical feature is acceptable. This might seem reasonable because unlike nominal variables, which lack inherent order or ranking, ordinal variables possess a specific order or ranking among their categories. Consequently, one might consider encoding the categories in the Size feature, such as Small, Medium, and Large, as 1, 2, and 3, as illustrated in Figure 3.



Figure 3. Using Label Encoding for categorical feature Size.


However, despite its apparent reasonableness, encoding an ordinal feature in this manner is still incorrect. Continuing with the example of predicting house prices, let us imagine constructing a linear regression model based on the size of houses, where the model's equation is:


Sale Price = W × Size.


If we were to employ LE for encoding Size, the predicted price for a medium-sized house would mistakenly be twice the magnitude of a small one, and the predicted price for a large house would inaccurately be three times the magnitude of a small one. It is precisely due to such pitfalls that One-Hot-Encoding (OHE) is the most widely adopted approach for encoding categorical features, irrespective of whether they are nominal or ordinal.
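For contrast, with OHE the model becomes

Sale Price = W_Small × Size_Small + W_Medium × Size_Medium + W_Large × Size_Large,

so each category receives its own independently learned weight, and no spurious ratio between categories is baked into the encoding.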


Encoding Categorical Targets

As with categorical features, the target variable may need to be encoded when it is categorical. However, in contrast to feature encoding, where Label Encoding (LE) is unsuitable, LE is exactly what should be applied to a categorical target. For instance, if Color in Figure 2 and Size in Figure 3 were target variables, LE should be used instead of One-Hot-Encoding (OHE). The rationale behind this choice lies in the fact that, unlike categorical features, where the exact encoded values play a role in the prediction process, the encoded values of a categorical target are used solely to distinguish unique categories. Therefore, as long as the encoded values are distinct, the encoding is appropriate.

The code snippet below demonstrates how to use the scikit-learn LabelEncoder class to encode a categorical target.

from sklearn.preprocessing import LabelEncoder

# The LabelEncoder maps each unique category to an integer (0, 1, 2, ...)
le = LabelEncoder()

# Encode the categorical target in df
# (name_of_the_target holds the name of the target column)
df[name_of_the_target] = le.fit_transform(df[name_of_the_target])
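A convenient follow-up: LabelEncoder keeps the learned mapping around, so a model's integer predictions can be converted back to the original category labels. In the sketch below, model and X_test are hypothetical stand-ins for a fitted classifier and a test set.

# The learned label order; index i corresponds to encoded value i
print(le.classes_)

# Map integer predictions back to the original category labels
predictions = model.predict(X_test)
print(le.inverse_transform(predictions))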

Takeaways


In this blog post, we have covered the concept of categorical data and explored methods for handling it. Here are the key takeaways:

  • The majority of machine learning models do not possess built-in capabilities to handle categorical data, with the exception of a few, such as XGBoost, which inherently supports categorical features.

  • One-Hot-Encoding (OHE) is suitable for encoding categorical features. However, when dealing with categorical features with an extensive range of unique values, it might be necessary to consolidate values into groups before employing OHE, a step taken to address the challenges associated with high dimensionality.

  • Label Encoding is suitable for encoding categorical targets.

Thank you for reading. We will see you next time!


