How to Handle Missing Data?
- datascienceflow
- Nov 5, 2023
- 4 min read

Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will be discussing the following topics:
- What is missing data?
- How to handle missing data?
Now let us begin!
What is Missing Data?
Missing data refers to values that are not recorded or present in a dataset, even though we might expect them to be. These missing values can be represented in various ways, such as blank spaces, "NA" or "N/A" (short for "Not Available" or "Not Applicable"), or "NaN" (meaning "Not a Number").
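To make this concrete, here is a minimal sketch of how missing values typically surface once data is loaded into pandas (the column names are illustrative):
import numpy as np
import pandas as pd
# A toy DataFrame where np.nan marks the missing entries
df = pd.DataFrame({"Age": [25, np.nan, 40], "Degree": ["Bachelor", "Master", np.nan]})
# Count the missing values in each column
print(df.isna().sum())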
However, it is important to recognize that while the symbols mentioned can function as placeholders for missing data, they can also represent meaningful values that are not missing. To illustrate, consider the "Degree" column in an employee information dataset, which can contain values such as "Bachelor", "Master", or "Doctor". In this context, the "NA" in the column may not indicate missing data but rather signify that the employee does not possess a degree.
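If such placeholder strings are actually meaningful categories, you can tell pandas not to parse them as missing values. A minimal sketch, assuming a hypothetical employees.csv file:
import pandas as pd
# By default, read_csv parses strings like "NA" and "N/A" as missing (NaN);
# keep_default_na=False preserves them as ordinary string values instead
df = pd.read_csv("employees.csv", keep_default_na=False)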
Notably, some tree-based implementations like XGBoost have built-in capabilities to handle missing values, so you can leave gaps in the data without explicitly handling them (see the sketch after Table 1). However, as the majority of machine learning models do not accommodate missing values, addressing these gaps in the data preprocessing stage becomes essential. Consequently, for both regression and classification tasks, handling missing data is a pivotal step right after handling identifiers, as indicated in Table 1.
Table 1. The Data Preprocessing Pipeline for Regression and Classification.
| Step | Regression | Classification |
| --- | --- | --- |
| Loading data | Y | Y |
| Splitting data into training, validation, and test | Y | Y |
| Handling uncommon features | Y | Y |
| Handling identifiers | Y | Y |
| Handling missing data | Y | Y |
| Encoding categorical data | Y | Y |
| Scaling data | Y | Y |
| Handling class imbalance | N | Y |
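As noted above, a model like XGBoost can consume data containing gaps directly. A minimal sketch, assuming the xgboost package is installed:
import numpy as np
from xgboost import XGBClassifier
# Toy training data where np.nan marks the missing entries
X_train = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y_train = np.array([0, 1, 0, 1])
# XGBoost routes missing values down a learned default branch at each split,
# so no imputation is required before fitting
model = XGBClassifier(n_estimators=10)
model.fit(X_train, y_train)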
How to Handle Missing Data?
The approach to dealing with missing data is contingent upon various factors within the data, such as the characteristics of the variable containing missing values, the proportion of missing values, and the specific machine learning model being used. In this segment, we will delve into two methods for managing missing data: removing missing data and imputing missing data.
Removing Missing Data
The most straightforward way to handle missing data is to remove rows that contain missing values. The code snippet below demonstrates how to do so using the pandas dropna function.
# Remove rows with missing values from the training data
df_train = df_train.dropna()
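If only certain columns matter, dropna also accepts a subset argument so that rows are dropped only when those particular columns are missing. A minimal sketch with a hypothetical "Salary" column:
# Drop rows only when the "Salary" column is missing
df_train = df_train.dropna(subset=["Salary"])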
Nevertheless, it is important to note that as we remove rows with missing values, the non-missing data within these rows is also discarded. This might not be ideal, considering that some of these non-missing values could carry significant predictive power and might have been collected at considerable expense.
Another possible disadvantage of eliminating rows containing missing values is the risk of losing a significant amount of data, especially when certain features exhibit a high ratio of missing values. For example, if 90% of a feature's values are missing, employing this method would result in the removal of 90% of the data, which is clearly less than ideal.
To address this concern, one potential solution is to first remove features with a high ratio of missing values and then eliminate rows. Returning to the earlier example, and assuming no other features contain missing values, this approach would retain all the data samples, whereas removing rows directly would discard 90% of the data. The following code snippet illustrates the two-stage approach.
# Stage 1: Drop features with a high ratio of missing values from the training data
# (the 0.5 threshold is an illustrative choice; tune it to your dataset)
features_with_high_ratio_of_missing_values = df_train.columns[df_train.isna().mean() > 0.5]
df_train = df_train.drop(columns=features_with_high_ratio_of_missing_values)
# Stage 2: Remove rows with missing values from the training data
df_train = df_train.dropna()
Imputing Missing Data
Besides removing missing data, imputation is an alternative approach to handling missing data. The most straightforward form of imputation involves using the statistics of a feature's non-missing values to estimate and fill in the missing values. Depending on the nature of the feature, these statistics might encompass the mean (solely for numerical data), median (solely for numerical data), or mode (applicable for both numerical and categorical data).
Nevertheless, it is essential to compute these statistics on the training data only and then apply them to impute missing values across the training, validation, and test data. Fitting on the training data alone prevents information from the validation and test sets from leaking into the preprocessing step; moreover, the training set is typically the largest of the three, so the statistics derived from it are the most representative. The code snippet below demonstrates how to do so using sklearn's SimpleImputer.
from sklearn.impute import SimpleImputer
import numpy as np
# Create a SimpleImputer that treats np.nan as the placeholder for missing
# values and imputes with the mean (so the columns must be numerical)
si = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on the training data only, then impute all three sets
# (note that fit_transform and transform return NumPy arrays, not DataFrames)
df_train = si.fit_transform(df_train)
df_val = si.transform(df_val)
df_test = si.transform(df_test)
Please note that we applied the fit_transform function only to the training data, as explained above: it computes the means from the training data and uses them to impute the training data. For the validation and test data, we used the transform function, which applies the means calculated from the training data without recalculating them.
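For categorical features, the mean and median are undefined, so the mode mentioned earlier is the natural choice. A minimal sketch using SimpleImputer's most_frequent strategy, where df_train_cat, df_val_cat, and df_test_cat are hypothetical DataFrames holding only the categorical columns:
from sklearn.impute import SimpleImputer
# most_frequent imputes each column's missing values with its mode and
# works for both numerical and categorical data
si_cat = SimpleImputer(strategy='most_frequent')
df_train_cat = si_cat.fit_transform(df_train_cat)
df_val_cat = si_cat.transform(df_val_cat)
df_test_cat = si_cat.transform(df_test_cat)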
Takeaways
In this blog post, we have covered the concept of missing data and explored methods for handling it. Here are the key takeaways:
- While symbols like blank spaces, "NA", "N/A", or "NaN" may serve as placeholders for missing values, they can equally denote meaningful values that are not missing.
- Certain tree-based implementations, such as XGBoost, come equipped with built-in capabilities to manage missing values, enabling you to leave gaps in the data without explicit handling.
- The simplest approach to handling missing data is discarding rows that contain missing values. However, when some features have a high proportion of missing values, it is advisable to first remove those features and then eliminate the rows with missing values.
- When using the statistics of a feature to fill in its missing values, it is crucial to compute these statistics on the training data only and then use them to impute missing values across the training, validation, and test data.
Thank you for reading. We will see you next time!