
What is Data Preprocessing?



Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will be discussing the following topics:

  1. What is data preprocessing?

  2. Why do we need data preprocessing?

  3. The data preprocessing pipeline

Now let us begin!


What is Data Preprocessing?


Data preprocessing is the process of cleaning and transforming raw data before feeding it into machine learning models. It involves several steps that together put the data in a format suitable for training and testing the models.


Why Do We Need Data Preprocessing?


In practice, data scientists are often said to spend roughly 80% of their time on data preprocessing, before model training even begins. Why is this step so crucial? Because skipping it typically causes downstream steps such as model training either to fail outright (for example, when an algorithm cannot accept missing values or text categories) or to perform poorly.


The Data Preprocessing Pipeline


The data preprocessing pipeline differs depending on the machine learning task at hand. In this blog post, we will outline the major steps involved in two common scenarios: regression and classification. In subsequent posts, we will delve deeper into individual steps of this process.


Regression


As mentioned in our previous blog post in this series (Blog Post 0), regression falls under the category of supervised learning in which the target variable, such as house price, is continuous and can take an infinite number of values. Typically, the data preprocessing pipeline for regression comprises seven essential steps:

  1. Loading data

  2. Splitting data into training, validation, and test sets

  3. Handling uncommon features

  4. Handling identifiers

  5. Handling missing data

  6. Encoding categorical data

  7. Scaling data
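The seven steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: it assumes pandas and scikit-learn, uses a made-up toy dataset (the column names `house_id`, `area_sqft`, `neighborhood`, and `price` are hypothetical), and interprets "uncommon features" as columns that do not appear in every split, which is our assumption here. Note that imputation values and scaling statistics are learned from the training set only and then applied to the validation and test sets.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for a real file; all column names are made up.
CSV = """house_id,area_sqft,neighborhood,price
1,850,north,200000
2,900,south,210000
3,,north,250000
4,1200,east,300000
5,1500,south,360000
6,700,north,180000
7,1100,east,290000
8,1300,,330000
9,950,south,220000
10,1600,east,400000
"""
target = "price"
id_cols = ["house_id"]

# 1. Loading data
df = pd.read_csv(io.StringIO(CSV))

# 2. Splitting data into training (60%), validation (20%), and test (20%) sets
train, rest = train_test_split(df, test_size=0.4, random_state=0)
val, test = train_test_split(rest, test_size=0.5, random_state=0)

# 3. Handling uncommon features (assumed meaning: keep only columns present
#    in every split; a no-op for this toy data, since all splits share one source)
common = train.columns.intersection(val.columns).intersection(test.columns)
train, val, test = (part[common].copy() for part in (train, val, test))

# 4. Handling identifiers: drop columns that only label rows
train, val, test = (part.drop(columns=id_cols) for part in (train, val, test))

# 5. Handling missing data: impute with statistics from the training set only
fill = {
    "area_sqft": train["area_sqft"].median(),
    "neighborhood": train["neighborhood"].mode()[0],
}
train, val, test = (part.fillna(fill) for part in (train, val, test))

# 6. Encoding categorical data: one-hot encode, aligned to the training columns
train = pd.get_dummies(train, columns=["neighborhood"], dtype=float)
val, test = (
    pd.get_dummies(part, columns=["neighborhood"], dtype=float)
    .reindex(columns=train.columns, fill_value=0.0)
    for part in (val, test)
)

# 7. Scaling data: fit the scaler on training features, then reuse it
features = [c for c in train.columns if c != target]
scaler = StandardScaler().fit(train[features])
X_train, X_val, X_test = (scaler.transform(p[features]) for p in (train, val, test))
y_train, y_val, y_test = (p[target].to_numpy() for p in (train, val, test))

print(X_train.shape, X_val.shape, X_test.shape)
```

Splitting early (step 2) is what makes it possible to fit the imputation and scaling on the training set alone, so that no information from the validation or test sets leaks into preprocessing.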


Classification


Like regression, classification is a form of supervised learning. The difference is that the target variable, such as the outcome of a flu test, is discrete and can only take a finite number of values. Despite this distinction, the data preprocessing pipeline for classification usually mirrors that of regression, except for one additional step at the end: handling class imbalance.
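To make the extra step concrete, here is a minimal sketch of one possible way to handle class imbalance: random oversampling, which duplicates minority-class examples until every class is as large as the biggest one. This post does not prescribe a particular technique, so treat the choice (and the helper name `oversample` and the toy flu-test data) as assumptions made for illustration. The sketch uses only the Python standard library.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    largest = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All rows belonging to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (with replacement) as many extra rows as this class is short by
        out_rows.extend(rng.choices(pool, k=largest - n))
        out_labels.extend([cls] * (largest - n))
    return out_rows, out_labels


# Hypothetical training split: body temperatures with 8 negative and 2 positive tests
X_train = [[36.5], [36.6], [36.7], [36.8], [36.9], [37.0], [36.4], [36.6], [38.5], [39.0]]
y_train = ["negative"] * 8 + ["positive"] * 2

X_balanced, y_balanced = oversample(X_train, y_train)
print(Counter(y_train))     # class counts before balancing
print(Counter(y_balanced))  # class counts after balancing
```

Resampling of this kind is normally applied to the training set only, so that the validation and test sets keep the class distribution the model will face in practice.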


Regression vs Classification


Table 1 places the regression and classification pipelines side by side. In this table, "Y" means that the step is needed for the corresponding task (regression or classification), whereas "N" means that it is not. As the table shows, regression and classification share the same steps, with one exception: the last step (handling class imbalance) is required only for classification.


Table 1. The Data Preprocessing Pipeline for Regression and Classification.

| Step | Regression | Classification |
|---|---|---|
| Loading data | Y | Y |
| Splitting data into training, validation, and test sets | Y | Y |
| Handling uncommon features | Y | Y |
| Handling identifiers | Y | Y |
| Handling missing data | Y | Y |
| Encoding categorical data | Y | Y |
| Scaling data | Y | Y |
| Handling class imbalance | N | Y |


Takeaways


In this blog post, we have introduced the concept of data preprocessing and highlighted its importance in the machine learning workflow. We have also outlined the data preprocessing pipeline for both regression and classification. Here are the key takeaways:

  • Without data preprocessing, downstream steps such as model training would either fail completely or perform poorly.

  • Unlike regression, classification requires one extra preprocessing step: handling class imbalance.

Thank you for reading. We will see you next time!
