How to Load Data?
- datascienceflow
- Oct 2, 2023
- 3 min read

Welcome back to the "Machine Learning Fundamentals" series at Data Science Flow! In this blog post, we will discuss the following topics:
- What is loading data?
- How to use pandas functions to load Comma-Separated Values (CSV) files
Now let us begin!
What is Loading Data?
As mentioned in the first blog post of this series, loading data is the first step in data preprocessing. We can also see this in Table 1, where "Y" signifies that a step is necessary for the corresponding task (regression or classification), while "N" denotes that the step is not needed. As the table shows, loading data is the initial stage for both regression and classification.
Table 1. The Data Preprocessing Pipeline for Regression and Classification.
| Step | Regression | Classification |
| --- | --- | --- |
| Loading data | Y | Y |
| Splitting data into training, validation and test | Y | Y |
| Handling uncommon features | Y | Y |
| Handling identifiers | Y | Y |
| Handling missing data | Y | Y |
| Encoding categorical data | Y | Y |
| Scaling data | Y | Y |
| Handling class imbalance | N | Y |
Specifically, loading data is the process of reading data into memory so that downstream tasks can use it. The approach to loading data depends on the format and size of the data. In the following section, we will demonstrate how to use pandas functions to load CSV files, one of the most common data formats.
How to Load CSV Files?
Before we get into the specifics, let us begin by providing an explanation of what a CSV file is.
Comma-separated values (CSV) is a text file format that uses commas to separate values. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file.
-- Wikipedia
As an example, consider Table 1 displayed above, which represents the type of tabular data typically stored in a CSV file. In this format, rows correspond to individual records, and columns represent the fields within those records.
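To make this concrete, here is a minimal sketch that writes a small CSV file mirroring part of Table 1 using Python's standard library, then prints its raw contents. The filename `pipeline_steps.csv` and the subset of rows are illustrative choices, not part of the original post.

```python
import csv

# Illustrative rows mirroring part of Table 1 (filename is an assumption)
rows = [
    ["Step", "Regression", "Classification"],
    ["Loading data", "Y", "Y"],
    ["Handling class imbalance", "N", "Y"],
]

# Write the rows as comma-separated records, one per line
with open("pipeline_steps.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Show the resulting plain-text file: each line is one record
with open("pipeline_steps.csv") as f:
    print(f.read())
```

Opening the file in a text editor would show the same thing: a header line followed by one comma-separated record per row.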
The pandas read_csv function reads a CSV file into a DataFrame. First, we will demonstrate how to use the function when the data is small enough to fit into memory. Then, we will address the more intricate situation where the data exceeds the available memory capacity.
Small Files
In the case of a small CSV file, it is feasible to load the entire dataset into memory. The following code snippet provides an example of how to employ this function to load a CSV file named "small_data.csv."
import pandas as pd
# Load csv file named 'small_data.csv'
df = pd.read_csv('small_data.csv')
It is important to note that we are assuming both the CSV file and the code snippet (located in a .py or .ipynb file) are in the same directory. However, if this is not the case, we will need to specify the data file's path when using the function, as demonstrated in the following code.
# Load csv file named 'small_data.csv'
df = pd.read_csv('/path/to/small_data.csv')
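After loading, it is good practice to sanity-check the resulting DataFrame. The sketch below assumes no file on disk: read_csv also accepts any file-like object, so an in-memory buffer stands in for "small_data.csv", and the column names and values are illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for 'small_data.csv' (contents are illustrative)
csv_text = "age,income,label\n25,40000,0\n32,65000,1\n47,82000,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks after loading: dimensions, column names, and dtypes
print(df.shape)          # (3, 3)
print(list(df.columns))  # ['age', 'income', 'label']
print(df.dtypes)
```

Checking `shape`, `columns`, and `dtypes` right after loading catches common surprises early, such as a misread header row or a numeric column parsed as text.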
Large Files
When the CSV file is too large to fit into memory, we can opt for a "Divide-and-Conquer" strategy by breaking the data into small chunks and loading one portion at a time. This concept is exemplified in the following code, where the chunksize parameter dictates the maximum number of rows to be read from the file in each chunk. Note that when chunksize is specified, read_csv returns an iterator over DataFrame chunks rather than a single DataFrame. Within the for-loop, we load each chunk containing no more than 1000 rows at a time (and display the row count for each chunk).
import pandas as pd
# Create an iterator over 'large_data.csv', reading at most 1000 rows per chunk
reader = pd.read_csv('large_data.csv', chunksize=1000)
# Iterate over the chunks (and display each chunk's row count)
for df_chunk in reader:
    print(df_chunk.shape[0])
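A typical use of chunked loading is computing an aggregate over the whole file while holding only one chunk in memory at a time. The sketch below uses an in-memory buffer in place of "large_data.csv", and the column name `value` and chunk size are illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for 'large_data.csv': a single 'value' column with 1..10
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11))

# With chunksize set, read_csv yields DataFrame chunks of at most 4 rows each
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Accumulate a running total and row count across chunks
total = 0
n_rows = 0
for chunk in reader:
    total += chunk["value"].sum()
    n_rows += len(chunk)

print(n_rows, total)  # 10 55
```

The same pattern works for per-chunk filtering or transformation: process each chunk, then combine only the (much smaller) results.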
Takeaways
In this blog post, we covered the concept of loading data and explored how to load small and large CSV files into memory using the pandas read_csv function. Here are the key takeaways:
- For a small CSV file, we can use the function to load the entire dataset into memory
- For a CSV file that is too large to fit into memory, we can use the function's chunksize parameter to break the data into small chunks and load one portion at a time
Thank you for reading. We will see you next time!