
How to Split Data?

Updated: Oct 18, 2023



Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will be discussing the following topics:

  1. What is splitting data?

  2. How to split data?

Now let us begin!


What is Splitting Data?


As mentioned in blog post 2 of this series, splitting data is the second step in data preprocessing. Table 1 also shows this: for both regression and classification, data splitting immediately follows the data loading stage.


Table 1. The Data Preprocessing Pipeline for Regression and Classification.

Step | Regression | Classification
Loading data | Y | Y
Splitting data into training, validation, and test | Y | Y
Handling uncommon features | Y | Y
Handling identifiers | Y | Y
Handling missing data | Y | Y
Encoding categorical data | Y | Y
Scaling data | Y | Y
Handling class imbalance | N | Y

To illustrate the concept of data splitting, envision our dataset as a treasure trove of information. Data splitting, then, is the process of dividing this treasure into three completely distinct chests:

  • Training data: This chest contains the data used to train our machine learning model. It is where the model learns patterns in the data.

  • Validation data: This chest contains the data used to fine-tune our model. It is where we experiment with different hyperparameters to enhance the model's performance.

  • Test data: This chest contains the data used to evaluate our model. It is where we assess how well the model generalizes to new, unseen data.

We can consider the connection between the model, training, validation, and test data in the following way: Picture the model as a student attending a class, the training data as the primary textbook for that class, the validation data as an additional workbook, and the test data as the final exam. This parallels the process of the model being trained on the training data, fine-tuned on the validation data, and evaluated on the test data, much like how a student learns from the textbook, practices using the workbook, and faces the final exam.


The analogy mentioned above can also provide insights into some essential guidelines concerning training, validation, and test data:

  • They should be mutually exclusive (i.e., devoid of any overlap) since any intersection between training and validation data or between training and test data would lead to an inflated assessment of the model's performance. This resembles the scenario where any overlap between the textbook and workbook or between the textbook and final exam would result in an overestimation of the student's grasp of the subject.

  • While the model may utilize the training and validation data repeatedly for refinement, it should interact with the test data only once. To be precise, we evaluate the model's performance only after the training and fine-tuning phases are complete, and we do this evaluation only once. This mirrors the student's ability to review the textbook and workbook multiple times but sit for the final exam only once, which occurs after the learning and practice phases are finished.
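The mutual-exclusivity rule above can be verified mechanically: no row index should appear in more than one subset. Here is a minimal sketch on a tiny synthetic dataset, using scikit-learn's train_test_split (the split ratios are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny synthetic dataset: 10 rows identified by their indices
indices = np.arange(10)

# First carve out the test set, then split the remainder into training and validation
idx_train_val, idx_test = train_test_split(indices, test_size=0.2, random_state=42)
idx_train, idx_val = train_test_split(idx_train_val, test_size=0.25, random_state=42)

# The three subsets must be mutually exclusive and jointly cover all rows
assert set(idx_train).isdisjoint(idx_val)
assert set(idx_train).isdisjoint(idx_test)
assert set(idx_val).isdisjoint(idx_test)
assert len(idx_train) + len(idx_val) + len(idx_test) == len(indices)
```

Splitting indices rather than the rows themselves also makes it easy to trace exactly which samples ended up in which subset.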

How to Split Data?


The approach for splitting data varies based on several factors, such as whether the data is a time series and how large it is. Now let us get to the specifics.


Time Series vs. Non-Time Series


The process of splitting time series data differs significantly from that of non-time series data. When working with time series data, it is essential to partition the data in chronological order: the training data must be collected before the validation data, and the validation data before the test data. Failing to do so can lead to data leakage, since the training data might include future information that would not be available at prediction time in a production setting. As a result, the model's performance on the test data would be an inaccurate reflection of its performance in production.

For example, let us assume we are dealing with data gathered in the previous year. In this case, we might choose two specific time points, such as August and October, to delineate the data. More precisely, the training data would comprise data collected before August, the validation data would comprise data collected between August and October, and the test data would comprise data collected after October.
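To make the chronological split concrete, here is a minimal sketch using pandas on a hypothetical daily dataset from 2022, with August 1 and October 1 as the (assumed) cut points; the date and value column names are illustrative:

```python
import pandas as pd

# Hypothetical dataset: one row per day of 2022
df = pd.DataFrame({"date": pd.date_range("2022-01-01", "2022-12-31", freq="D")})
df["value"] = range(len(df))

# Split chronologically at the two cut points
train = df[df["date"] < "2022-08-01"]
val = df[(df["date"] >= "2022-08-01") & (df["date"] < "2022-10-01")]
test = df[df["date"] >= "2022-10-01"]
```

Because the boundaries are dates rather than random row indices, every training sample is guaranteed to predate every validation sample, and every validation sample to predate every test sample.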

For non-time series data, random splitting is typically suitable, since there is no chronological order to preserve and hence no risk of leaking future information. However, when we want to preserve the distribution of a particular variable after the split (in regression and classification, this is typically the target variable), a stratified approach should be employed.


For instance, consider a dataset where the target is the sale price of residential homes. It is essential to ensure that the distribution of the target remains consistent post-split. Neglecting this practice could lead to a scenario where the training data lacks representation of extremely low or high prices (which, being infrequent, might be omitted from the training data if we were to split it randomly). This situation could be particularly problematic for tree-based models such as Random Forest, which is currently among the most extensively employed models in the field.


To elaborate, if such a model were trained exclusively on data lacking very low or high prices, its predictive ability in a real-world production environment would be limited: its predictions would always fall within a range that excludes very low or high prices, which is clearly inaccurate and incomplete.


The code snippet below shows how to use scikit-learn's train_test_split function to split the data while preserving the distribution of a continuous target variable. Since stratification operates on discrete labels, the first step is to bin the continuous target values into discrete intervals; the data can then be split with stratification on these intervals.

from sklearn.model_selection import train_test_split
import pandas as pd

# Bin the continuous target values into 5 discrete intervals
intervals = pd.cut(y, 5, labels=False)

# Split the data into combined training and validation data and test data
X_train_val, X_test, y_train_val, y_test, intervals_train_val, intervals_test = train_test_split(
    X,
    y,
    intervals,
    test_size=0.2,
    random_state=42,
    stratify=intervals
)

# Split the combined training and validation data into training and validation data
X_train, X_val, y_train, y_val, intervals_train, intervals_val = train_test_split(
    X_train_val,
    y_train_val,
    intervals_train_val,
    test_size=0.25,
    random_state=42,
    stratify=intervals_train_val
)

As in regression, classification tasks (where the target variable is discrete and comprises distinct classes) call for a stratified split, since we want to preserve the distribution of the classes after the split. The code below demonstrates how to do this. Note that, unlike in regression, there is no need to create intervals, as the target is already discrete.

from sklearn.model_selection import train_test_split

# Split the data into combined training and validation data and test data
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Split the combined training and validation data into training and validation data
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val,
    y_train_val,
    test_size=0.25,
    random_state=42,
    stratify=y_train_val
)

Large vs. Small


In the code snippets provided above, we divided the data into training, validation, and test sets with proportions of 60%, 20%, and 20%, respectively. While there is no one-size-fits-all rule for these ratios, a common guideline is to allocate a larger portion to the training set when dealing with larger datasets. On the one hand, a larger training sample gives the model more data to learn from, ultimately leading to better performance. On the other hand, on a large dataset even a relatively small proportion reserved for validation and test can be adequate. For example, with a dataset of 1 million samples, allocating only 1% each for validation and test would still yield 10,000 samples each, which is often sufficient for fine-tuning and evaluation.
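As a sketch of such a split on a large dataset, the code below (on synthetic data) reserves 1% each for validation and test by holding out 2% first and then splitting the holdout in half:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset with 1 million samples
X = np.arange(1_000_000).reshape(-1, 1)
y = np.arange(1_000_000)

# Hold out 2% of the data, leaving 98% for training
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.02, random_state=42
)

# Split the holdout in half: 1% validation, 1% test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42
)
```

This leaves 980,000 samples for training and 10,000 each for validation and test.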


Takeaways


In this blog post, we have covered the concept of splitting data and explored how to split data using scikit-learn's train_test_split function. Here are the key takeaways:

  • While the model may utilize the training and validation data repeatedly for refinement, it should interact with the test data only once

  • For time series data, it is crucial to partition the data based on chronological order

  • For non-time series data, it is generally preferred to utilize a stratified method for data partitioning in order to maintain the distribution of the target variable

  • Typically, the larger the dataset the larger the proportion for training

Thank you for reading. We will see you next time!
