
How to Handle Identifiers?



Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will be discussing the following topics:

  1. What are identifiers?

  2. How to handle identifiers?

Now let us begin!


What are Identifiers?


In blog post 4, we discussed an important rule about features: the training, validation and test data must have the same set of features.


In this blog post, we introduce another important rule concerning a specific kind of feature known as an "identifier". Identifiers are features with a unique value for each sample in the data. In simpler terms, they serve as the "identifying tags" for data entries. For instance, in a dataset of student records, the "Student ID" column is an identifier, because it has a different value for each student.
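The definition above suggests a simple check: a column whose number of distinct values equals the number of rows is a candidate identifier. The following is a minimal sketch under that assumption. The function name `find_identifier_candidates` and the toy data are hypothetical and not part of this series' code. Note that a continuous feature can also happen to be all-unique, so the result is a list of candidates to review, not a final verdict.

```python
import pandas as pd


def find_identifier_candidates(df: pd.DataFrame) -> list:
    """Return the columns whose values are all distinct (one value per row)."""
    n_rows = len(df)
    if n_rows == 0:
        # With no rows, uniqueness is meaningless, so report nothing
        return []
    return [col for col in df.columns if df[col].nunique(dropna=False) == n_rows]


# Hypothetical toy data for illustration only
df_students = pd.DataFrame({
    "Student ID": [1001, 1002, 1003, 1004],
    "GPA": [3.2, 3.8, 3.2, 2.9],
    "Major": ["Math", "CS", "CS", "Math"],
})

candidates = find_identifier_candidates(df_students)
print(candidates)
```

Here only "Student ID" is flagged, since "GPA" and "Major" both contain repeated values.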


Now that we understand what identifiers are, let us turn to the guideline: remove identifiers from the training, validation and test data, particularly in supervised learning. The reason is that an identifier is an arbitrary tag, so it should have no influence on the target variable. For instance, "Student ID" should not be related to a student's future salary. If we include identifiers in the training data, the model may still assign some predictive influence to them by fitting chance patterns. Even if this influence is small, it can undermine the model's performance on unseen data.


To illustrate this point further, let us consider an example. Suppose our training data comprises the features "Student ID" and "GPA", and the target variable "Future Salary". For the sake of illustration, let us assume we intend to train a linear regression model, utilizing these features to predict the target. In this scenario, the model can be represented as follows:


Future Salary = b + w1 × Student ID + w2 × GPA


Here, w1 is the weight assigned to the feature "Student ID", representing its predictive power regarding the target "Future Salary". Ideally, w1 should be exactly zero. In practice, however, the model may assign a non-zero value to w1, which compromises its performance. This is why, for both regression and classification tasks, handling identifiers is a crucial step that comes immediately after handling uncommon features, as shown in Table 1.
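The claim that w1 can come out non-zero is easy to check on synthetic data. The sketch below is an illustration under stated assumptions, not a result from real student records: the data are randomly generated, the salary depends on GPA only, and the model is fit with ordinary least squares via `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic, made-up data: the salary depends on GPA and noise only
student_id = np.arange(1, n + 1, dtype=float)  # arbitrary unique tags
gpa = rng.uniform(2.0, 4.0, size=n)
salary = 30000 + 10000 * gpa + rng.normal(0, 5000, size=n)

# Design matrix with columns [1, Student ID, GPA], matching
# Future Salary = b + w1 * Student ID + w2 * GPA
X = np.column_stack([np.ones(n), student_id, gpa])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
b, w1, w2 = coef

print(f"b = {b:.2f}, w1 (Student ID) = {w1:.4f}, w2 (GPA) = {w2:.2f}")
```

Although "Student ID" plays no role in how the salaries were generated, the fitted w1 is not exactly zero, because least squares uses every available column to fit the noise in the training data.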

Table 1. The Data Preprocessing Pipeline for Regression and Classification.

| Step                                               | Regression | Classification |
|----------------------------------------------------|------------|----------------|
| Loading data                                       | Y          | Y              |
| Splitting data into training, validation and test  | Y          | Y              |
| Handling uncommon features                         | Y          | Y              |
| Handling identifiers                               | Y          | Y              |
| Handling missing data                              | Y          | Y              |
| Encoding categorical data                          | Y          | Y              |
| Scaling data                                       | Y          | Y              |
| Handling class imbalance                           | N          | Y              |

How to Handle Identifiers?


As discussed earlier, we should eliminate identifiers from the data. The code snippet below demonstrates how to remove the identifier "Student ID" from the training, validation and test data.

# Remove the identifier ('Student ID') from the training data
df_train = df_train.drop(columns=['Student ID'])

# Remove the identifier ('Student ID') from the validation data
df_val = df_val.drop(columns=['Student ID'])

# Remove the identifier ('Student ID') from the test data
df_test = df_test.drop(columns=['Student ID'])
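When there are several identifiers, or when we want to guard against dropping a column from one split but not the others, the three calls above can be folded into a loop. The following is a minimal sketch on hypothetical toy data. The `identifiers` list and the `splits` dictionary are names introduced here for illustration, and the final check ties back to the rule from blog post 4 that all three splits must share the same features.

```python
import pandas as pd

# Hypothetical toy splits for illustration only
df_train = pd.DataFrame({"Student ID": [1, 2, 3], "GPA": [3.1, 3.6, 2.8],
                         "Future Salary": [52000, 61000, 48000]})
df_val = pd.DataFrame({"Student ID": [4, 5], "GPA": [3.3, 3.9],
                       "Future Salary": [55000, 66000]})
df_test = pd.DataFrame({"Student ID": [6, 7], "GPA": [2.5, 3.4],
                        "Future Salary": [45000, 57000]})

# Columns we have decided are identifiers
identifiers = ["Student ID"]

# Drop the identifiers from every split in one pass.
# By default, drop raises a KeyError if a listed column is missing.
splits = {"train": df_train, "val": df_val, "test": df_test}
splits = {name: df.drop(columns=identifiers) for name, df in splits.items()}

# Sanity check: all three splits still have the same columns
columns_train = list(splits["train"].columns)
assert columns_train == list(splits["val"].columns) == list(splits["test"].columns)

df_train, df_val, df_test = splits["train"], splits["val"], splits["test"]
print(columns_train)
```

Keeping the identifier names in a single list means the same columns are removed from every split, so the splits cannot drift apart.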

Takeaways


In this blog post, we have covered the concept of identifiers and explored methods for handling them. Here are the key takeaways:

  • It is recommended to remove identifiers from the training, validation and test data, especially in supervised learning tasks, because the model may otherwise assign them spurious predictive influence.

Thank you for reading. We will see you next time!
