How to Handle Uncommon Features?



Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will be discussing the following topics:

  1. What are uncommon features?

  2. How to handle uncommon features?

Now let us begin!


What are Uncommon Features?


In blog post 3 of this series, we discussed an important rule regarding the data: while the model may use the training and validation data repeatedly for refinement, it should interact with the test data only once.


In this blog post, we introduce another important rule, this time regarding the features. Specifically, the training, validation, and test data must have the same set of features. That is, there must be no uncommon features, i.e., features that appear in one or two of the datasets but not in all three.


To illustrate, consider the following scenario:

  • Training data includes features x1 and x2

  • Validation data includes feature x1

  • Test data includes feature x1

In this case, x2 is an uncommon feature as it exists in the training data but not in the validation and test data.


It is essential to address uncommon features to ensure successful model fine-tuning and evaluation on validation and test data. For instance, consider the scenario where we train a model using training data containing features x1 and x2, and then attempt to fine-tune and evaluate it using validation and test data with only x1. In such a case, the fine-tuning and evaluation process will fail because the model expects two features but only one is available. This is why, for both regression and classification tasks, the management of uncommon features is a crucial step immediately following the data splitting stage, as depicted in Table 1.
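The failure mode above can be sketched in a few lines (a minimal illustration using plain NumPy, where the "trained model" is just a weight vector with one weight per feature; the numbers are made up):

```python
import numpy as np

# Suppose a trained linear model is simply a learned weight vector,
# with one weight per feature (two weights, for x1 and x2)
w = np.array([0.5, -0.2])

# Validation data that contains only x1 (one column instead of two)
X_val = np.array([[1.5], [2.5]])

# The prediction X_val @ w fails: the model expects two features
try:
    X_val @ w
except ValueError as e:
    print("Prediction failed:", e)
```

The same shape mismatch occurs, in one form or another, in any library that checks the number of input features at prediction time.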

Table 1. The Data Preprocessing Pipeline for Regression and Classification.

| Step | Regression | Classification |
| --- | --- | --- |
| Loading data | Y | Y |
| Splitting data into training, validation and test | Y | Y |
| Handling uncommon features | Y | Y |
| Handling identifiers | Y | Y |
| Handling missing data | Y | Y |
| Encoding categorical data | Y | Y |
| Scaling data | Y | Y |
| Handling class imbalance | N | Y |

How to Handle Uncommon Features?


In this section, we will first explore the process of identifying uncommon features and then delve into strategies for addressing them.


Identifying Uncommon Features


You might be wondering why we need to discuss the identification of uncommon features. Can't we simply visually inspect the training, validation, and test data, much like we did in the toy example earlier (where we had x1 and x2 in the training data and only x1 in the validation and test data) to spot the uncommon feature (x2)?


While this approach may work for data with just a few features, manually pinpointing uncommon features in data containing a large number of features can be quite arduous. Imagine having to do this for a dataset with 100 features; it would involve meticulously checking each feature for its presence in all of the training, validation, and test data, a task that is far from enjoyable.


Fortunately, we can automate this process with the help of the code snippet below. The concept behind it is to initially identify the common features shared by the training, validation, and test data. Subsequently, any feature not classified as a common feature can be categorized as an uncommon one.

import numpy as np

def uncommon_feature_checker(
        df_train,
        df_val,
        df_test,
        target=None
    ):

    """
    The uncommon feature checker

    Parameters
    ----------
    df_train : the dataframe of training data
    df_val : the dataframe of validation data
    df_test : the dataframe of test data
    target : the name of the target, None by default

    Returns
    ----------
    The uncommon features in the training, validation and test data
    """

    # Get the features in the training data (excluding the target, if given)
    feature_train = df_train.columns if target is None else df_train.columns[df_train.columns != target]

    # Get the features in the validation data
    feature_val = df_val.columns if target is None else df_val.columns[df_val.columns != target]

    # Get the features in the test data
    feature_test = df_test.columns if target is None else df_test.columns[df_test.columns != target]

    # Get the common features shared by the training, validation and test data
    common_feature = np.intersect1d(
        np.intersect1d(feature_train, feature_val),
        feature_test
    )

    # Get the features in the training data that are not shared by all three datasets
    uncommon_feature_train = np.setdiff1d(feature_train, common_feature)

    # Get the features in the validation data that are not shared by all three datasets
    uncommon_feature_val = np.setdiff1d(feature_val, common_feature)

    # Get the features in the test data that are not shared by all three datasets
    uncommon_feature_test = np.setdiff1d(feature_test, common_feature)

    return [uncommon_feature_train, uncommon_feature_val, uncommon_feature_test]
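To see the logic at work on the toy scenario from earlier, here is a self-contained sketch that inlines the body of the checker on small made-up dataframes (the column names x1, x2, and the target y are the ones from the example):

```python
import numpy as np
import pandas as pd

# Toy data matching the earlier scenario: training has x1 and x2,
# validation and test have only x1 (plus a shared target column y)
df_train = pd.DataFrame({"x1": [1, 2], "x2": [3, 4], "y": [0, 1]})
df_val = pd.DataFrame({"x1": [5], "y": [1]})
df_test = pd.DataFrame({"x1": [6], "y": [0]})

# Features (target column excluded) in each split
target = "y"
feature_train = df_train.columns[df_train.columns != target]
feature_val = df_val.columns[df_val.columns != target]
feature_test = df_test.columns[df_test.columns != target]

# Common features, then the leftovers in each split
common_feature = np.intersect1d(
    np.intersect1d(feature_train, feature_val),
    feature_test
)
print(np.setdiff1d(feature_train, common_feature))  # ['x2']
print(np.setdiff1d(feature_val, common_feature))    # []
print(np.setdiff1d(feature_test, common_feature))   # []
```

As expected, x2 is flagged as uncommon in the training data, while the validation and test data contain no uncommon features.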

Handling Uncommon Features


Once uncommon features have been identified, there are several ways to handle them. The right choice depends on whether the features are relevant to our objectives and whether they can feasibly be created. If a feature holds no value, we remove it. If it is valuable, we decide whether generating it is practical and reasonable: if so, we create it in the datasets that lack it; if not, we remove it.


The code snippet below demonstrates how to remove the identified uncommon features from the training, validation and test data.

# Remove the identified uncommon features from the training data
df_train = df_train.drop(columns=uncommon_feature_train)

# Remove the identified uncommon features from the validation data
df_val = df_val.drop(columns=uncommon_feature_val)

# Remove the identified uncommon features from the test data
df_test = df_test.drop(columns=uncommon_feature_test)
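If an uncommon feature is valuable and can reasonably be generated, the alternative to dropping it is to create it in the splits that lack it. The right way to generate a feature is problem-specific; the sketch below only illustrates the mechanics, using a simple placeholder strategy (filling the missing column with the training mean) on made-up data:

```python
import pandas as pd

# Toy data: x2 exists in the training data but not in the validation data
df_train = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})
df_val = pd.DataFrame({"x1": [1.5]})

# Generate the missing feature x2 in the validation data instead of
# dropping it from the training data; here we fill with the training mean
df_val["x2"] = df_train["x2"].mean()

print(df_val)
```

Whether such a fill is reasonable depends entirely on what the feature represents; when no sensible generation strategy exists, removal (as shown above) is the safer choice.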

Takeaways


In this blog post, we have covered the concept of uncommon features and explored methods for identifying and handling such features. Here are the key takeaways:

  • It is crucial to guarantee that the training, validation, and test data share an identical set of features

  • We can use the provided code snippet to automatically identify uncommon features

  • We can either generate or remove uncommon features depending on their relevance to our objectives and the feasibility of their creation

Thank you for reading. We will see you next time!
