How to Handle Identifiers?

Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will be discussing the following topics:

What are identifiers?
How to handle identifiers?

Now let us begin!

What are Identifiers?

In blog post 4, we discussed an important rule about features: the training, validation and test data must have the same set of features.

In this blog post, we introduce another important rule, which concerns a specific kind of feature known as an "identifier". Identifiers are features that take a unique value for each sample in the data. In simpler terms, they serve as the "identifying tags" for data entries. For instance, in a dataset of student records, the "Student ID" column is an identifier, since it has a different value for each student.
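The definition above can be turned into a simple check. The sketch below is only one possible heuristic, not a method prescribed by this series: it flags any column whose number of distinct values equals the number of rows. The helper name find_identifiers and the student data are made up for illustration, and float columns are skipped on the assumption that a continuous measurement such as GPA can be all-unique without being an identifier.

```python
import pandas as pd


def find_identifiers(df: pd.DataFrame) -> list:
    """Return candidate identifier columns (hypothetical helper).

    A column is flagged when every row holds a distinct value.
    Float columns are skipped: this is an assumption of the sketch,
    since continuous features are often all-unique by chance.
    """
    candidates = []
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            continue
        # dropna=False counts a missing value as a value of its own
        if df[col].nunique(dropna=False) == len(df):
            candidates.append(col)
    return candidates


# Made-up student records for illustration
students = pd.DataFrame({
    "Student ID": [1001, 1002, 1003, 1004],
    "Major": ["Math", "Physics", "Math", "Biology"],
    "GPA": [3.1, 3.8, 2.9, 3.5],
})

identifiers = find_identifiers(students)
print(identifiers)
```

A check like this only produces candidates. Whether a flagged column really is an identifier is still a judgment based on what the column means.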

How to Handle Identifiers?

Now that we have a clear understanding of what identifiers are, let us turn to the guideline: remove identifiers from the training, validation and test data, particularly in supervised learning. The reason is that identifiers exist only to tell samples apart, so they should have no influence on the target variable. For instance, "Student ID" should not be related to a student's future salary. If we include identifiers in the training data, the model may still assign some predictive influence to them, because it cannot tell a meaningful feature from an arbitrary tag. Any such influence reflects a coincidence in the training sample rather than a real relationship, so even a small amount of it can hurt the model's performance on new data.
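In pandas, removing identifiers amounts to dropping the same columns from all three splits. The sketch below assumes the splits are DataFrames named df_train, df_val and df_test and that the identifier list has already been decided (for example, from the training data). The names and values are hypothetical.

```python
import pandas as pd

# Made-up splits; in practice these come from the data-splitting step
df_train = pd.DataFrame({
    "Student ID": [1001, 1002, 1003],
    "GPA": [3.1, 3.8, 2.9],
    "Future Salary": [52000, 61000, 48000],
})
df_val = pd.DataFrame({
    "Student ID": [1004],
    "GPA": [3.5],
    "Future Salary": [57000],
})
df_test = pd.DataFrame({
    "Student ID": [1005],
    "GPA": [3.3],
    "Future Salary": [55000],
})

# Assumed to be known already, e.g. identified on the training data
identifiers = ["Student ID"]

# Drop the same columns from every split so the feature sets stay aligned
df_train, df_val, df_test = (
    df.drop(columns=identifiers) for df in (df_train, df_val, df_test)
)

print(list(df_train.columns))
```

Dropping the same list from every split keeps the rule from blog post 4 intact: the training, validation and test data still share one set of features.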

To illustrate this point further, let us consider an example. Suppose our training data comprises the features "Student ID" and "GPA", and the target variable "Future Salary". For the sake of illustration, let us assume we intend to train a linear regression model, utilizing these features to predict the target. In this scenario, the model can be represented as follows:

Future Salary = b + w1 × Student ID + w2 × GPA

Here, w1 is the weight assigned to the feature "Student ID", and it reflects how much the model relies on that feature to predict the target "Future Salary". Ideally, w1 should be exactly zero. In practice, however, the model is likely to assign a non-zero value to w1 because of noise in the training data, which can compromise its performance. This is why, for both regression and classification tasks, handling identifiers is a crucial step that comes right after the "Handling uncommon features" step, as shown in Table 1.
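To see why w1 rarely comes out as exactly zero, here is a small sketch on synthetic data. Everything in it is an assumption made for illustration: the salary is generated from GPA plus random noise, the student IDs are consecutive integers that play no role in the salary, and the model is fitted by ordinary least squares with NumPy rather than with any particular library used in this series.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # small, made-up training set

# Synthetic data: the salary depends on GPA and noise only, never on the ID
student_id = np.arange(1001, 1001 + n)
gpa = rng.uniform(2.0, 4.0, size=n)
salary = 30000 + 10000 * gpa + rng.normal(0, 5000, size=n)

# Design matrix with an intercept column, matching
# Future Salary = b + w1 * Student ID + w2 * GPA
X = np.column_stack([np.ones(n), student_id, gpa])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
b, w1, w2 = coef

print(f"w1 (Student ID) = {w1:.4f}")
print(f"w2 (GPA)        = {w2:.4f}")
```

Although the ID has no part in generating the salary, the fitted w1 is not exactly zero: the model uses the ID column to absorb some of the noise in the training data. Removing the column before training rules this out.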

Table 1. The Data Preprocessing Pipeline for Regression and Classification.

Step	Regression	Classification
Loading data	Y	Y
Splitting data into training, validation and test	Y	Y
Handling uncommon features	Y	Y
Handling identifiers	Y	Y
Handling missing data	Y	Y
Encoding categorical data	Y	Y
Scaling data	Y	Y
Handling class imbalance	N	Y