Processing Tabular Data
In this section, you will learn how to load tabular data into a Python development environment so that it can be used for TensorFlow modeling. You will use the pandas and scikit-learn libraries, which provide classes and functions that are useful for processing data. You will also explore methods that can be used to preprocess this data.
Tabular data can be loaded into memory by using the pandas read_csv function and passing in the path to the dataset. The function is well suited to and easy to use for loading tabular data, and can be used as follows:
df = pd.read_csv('path/to/dataset')
In order to normalize the data, you can use a scaler that is available in scikit-learn. There are multiple scalers that can be applied: StandardScaler will normalize the data so that the fields of the dataset have a mean of 0 and a standard deviation of 1. Another common scaler is MinMaxScaler, which will rescale the dataset so that the fields have a minimum value of 0 and a maximum value of 1.
To use a scaler, it must be initialized and fit to the dataset; the dataset can then be transformed by the scaler. In fact, the fitting and transformation processes can be performed in one step by using the fit_transform method, as follows:
scaler = StandardScaler()
transformed_df = scaler.fit_transform(df)
In the first exercise, you will learn how to use pandas and scikit-learn to load a dataset and preprocess it so that it is suitable for modeling.
Exercise 2.01: Loading Tabular Data and Rescaling Numerical Fields
The dataset, Bias_correction_ucl.csv, contains information for bias correction of the next-day maximum and minimum air temperature forecast for Seoul, South Korea. The fields represent temperature measurements for the given date, the weather station at which the metrics were measured, model forecasts of weather-related metrics such as humidity, and projections for the temperature of the following day. You are required to preprocess the data to make all the columns normally distributed with a mean of 0 and a standard deviation of 1. You will demonstrate the effects with the Present_Tmax column, which represents the maximum temperature on the given date at a given weather station.
Note
The dataset can be found here: https://packt.link/l83pR.
Perform the following steps to complete this exercise:
- Open a new Jupyter notebook to implement this exercise. Save the file as Exercise2-01.ipynb.
- In a new Jupyter Notebook cell, import the pandas library, as follows:
import pandas as pd
Note
You can find the documentation for pandas at the following link: https://pandas.pydata.org/docs/.
- Create a new pandas DataFrame named df and read the Bias_correction_ucl.csv file into it. Examine whether your data is properly loaded by printing the resultant DataFrame:
df = pd.read_csv('Bias_correction_ucl.csv')
df
Note
Make sure you change the path to the CSV file based on its location on your system. If you're running the Jupyter notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.
The output will be as follows:
- Drop the Date column using the drop method of the DataFrame and pass in the name of the column. The Date column will be dropped as it is a non-numerical field, and rescaling is not possible when non-numerical fields exist. Since you are dropping a column, both the axis=1 argument and the inplace=True argument should be passed:
df.drop('Date', inplace=True, axis=1)
- Plot a histogram of the Present_Tmax column, which represents the maximum temperature across dates and weather stations within the dataset:
ax = df['Present_Tmax'].hist(color='gray')
ax.set_xlabel("Temperature")
ax.set_ylabel("Frequency")
The output will be as follows:
The resultant histogram shows the distribution of values for the Present_Tmax column. You can see that the temperature values vary from 20 to 38 degrees Celsius. Plotting a histogram of the feature values is a good way to view the distribution of values to understand whether scaling is required as a preprocessing step.
- Import the StandardScaler class from scikit-learn's preprocessing package. Initialize the scaler, fit the scaler, and transform the DataFrame using the scaler's fit_transform method. Create a new DataFrame, df2, from the transformed result since the output of the fit_transform method is a NumPy array. The standard scaler will transform the numerical fields so that the mean of each field is 0 and the standard deviation is 1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df2 = scaler.fit_transform(df)
df2 = pd.DataFrame(df2, columns=df.columns)
Note
After fitting, the mean and standard deviation values that the scaler learned from the data can be retrieved from its mean_ and scale_ attributes.
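To inspect the statistics a fitted StandardScaler uses, you can read its mean_ and scale_ attributes; here is a short sketch on toy values (not the exercise data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column data (hypothetical values)
data = np.array([[10.0], [20.0], [30.0]])
scaler = StandardScaler().fit(data)

print(scaler.mean_)    # per-column mean learned from the data
print(scaler.scale_)   # per-column standard deviation

# The transformation is reversible via inverse_transform:
restored = scaler.inverse_transform(scaler.transform(data))
print(np.allclose(restored, data))  # True
```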
- Plot a histogram of the transformed Present_Tmax column:
ax = df2['Present_Tmax'].hist(color='gray')
ax.set_xlabel("Normalized Temperature")
ax.set_ylabel("Frequency")
The output will be as follows:
The resulting histogram shows that the values now range from around -3 to 3, as evidenced by the range on the x axis of the histogram; after standardization, the values are no longer in degrees Celsius. By using the standard scaler, the values will always have a mean of 0 and a standard deviation of 1. Having the features normalized can speed up the model training process.
In this exercise, you successfully imported tabular data using the pandas library and performed some preprocessing using the scikit-learn library. The preprocessing of the data included dropping the Date column and scaling the numerical fields so that they have a mean value of 0 and a standard deviation of 1.
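The steps of the exercise above can be sketched end to end; since the CSV file may not be at hand, a small inline DataFrame with hypothetical values stands in for Bias_correction_ucl.csv so the snippet is self-contained:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Inline stand-in for pd.read_csv('Bias_correction_ucl.csv');
# the values are hypothetical, the column names mirror the dataset.
df = pd.DataFrame({'Date': ['2013-06-30', '2013-07-01', '2013-07-02'],
                   'Present_Tmax': [28.7, 31.9, 31.6],
                   'Present_Tmin': [21.4, 21.6, 23.3]})

df.drop('Date', axis=1, inplace=True)              # drop the non-numerical field
scaler = StandardScaler()
df2 = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(round(df2['Present_Tmax'].mean(), 6))        # effectively 0
print(round(df2['Present_Tmax'].std(ddof=0), 6))   # effectively 1
```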
In the following activity, you will load tabular data using the pandas library and scale that data using the MinMaxScaler class present in scikit-learn. You will do so on the same dataset that you used in the prior exercise, which describes the bias correction of air temperature forecasts for Seoul, South Korea.
Activity 2.01: Loading Tabular Data and Rescaling Numerical Fields with a MinMax Scaler
In this activity, you are required to load tabular data and rescale the data using a MinMax scaler. The dataset, Bias_correction_ucl.csv, contains information for bias correction of the next-day maximum and minimum air temperature forecast for Seoul, South Korea. The fields represent temperature measurements for the given date, the weather station at which the metrics were measured, model forecasts of weather-related metrics such as humidity, and projections for the temperature of the following day. You are required to scale the columns so that the minimum value of each column is 0 and the maximum value is 1.
Perform the following steps to complete this activity:
- Open a new Jupyter notebook to implement this activity.
- Import pandas and the Bias_correction_ucl.csv dataset.
- Read the dataset using the pandas read_csv function.
- Drop the Date column of the DataFrame.
- Plot a histogram of the Present_Tmax column.
- Import MinMaxScaler and fit it to and transform the feature DataFrame.
- Plot a histogram of the transformed Present_Tmax column. You should get an output similar to the following:
Note
The solution to this activity can be found via this link.
One method of converting non-numerical fields, such as categorical or date fields, is to one-hot encode them. The one-hot encoding process creates a new column for each unique value in the provided column; in each row, every new column has a value of 0 except for the one that corresponds to that row's value, which has a value of 1. The column headers of the newly created dummy columns correspond to the unique values. One-hot encoding can be achieved by using the get_dummies function of the pandas library and passing in the column to be encoded. An optional prefix argument adds a prefix to the column headers, which can be useful for referencing the columns:
dummies = pd.get_dummies(df['feature1'], prefix='feature1')
Note
When using the get_dummies function, NaN values are converted into rows of all zeros.
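A minimal sketch of this behavior, using a hypothetical series with a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column containing a missing value
s = pd.Series(['a', 'b', np.nan])
dummies = pd.get_dummies(s, prefix='feature1')

print(dummies.columns.tolist())   # ['feature1_a', 'feature1_b']
print(dummies.iloc[2].sum())      # 0 -- the NaN row is all zeros
# Pass dummy_na=True if you want an explicit column for missing values.
```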
In the following exercise, you'll learn how to preprocess non-numerical fields. You will utilize the same dataset that you used in the previous exercise and activity, which describes the bias correction of air temperature forecasts for Seoul, South Korea.
Exercise 2.02: Preprocessing Non-Numerical Data
In this exercise, you will preprocess the Date column by one-hot encoding the year and the month from the Date column using the get_dummies function. You will join the one-hot encoded columns with the original DataFrame and ensure that all the fields in the resultant DataFrame are numerical.
Perform the following steps to complete this exercise:
- Open a new Jupyter notebook to implement this exercise. Save the file as Exercise2-02.ipynb.
- In a new Jupyter Notebook cell, import the pandas library, as follows:
import pandas as pd
- Create a new pandas DataFrame named df and read the Bias_correction_ucl.csv file into it. Examine whether your data is properly loaded by printing the resultant DataFrame:
df = pd.read_csv('Bias_correction_ucl.csv')
df
Note
Make sure you change the path to the CSV file based on its location on your system. If you're running the Jupyter notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.
- Change the data type of the Date column to datetime using the pandas to_datetime function:
df['Date'] = pd.to_datetime(df['Date'])
- Create dummy columns for the year using the pandas get_dummies function. Pass the year of the Date column in as the first argument and add a prefix to the columns of the resultant DataFrame. Print out the resultant DataFrame:
year_dummies = pd.get_dummies(df['Date'].dt.year, prefix='year')
year_dummies
The output will be as follows:
The resultant DataFrame contains only 0s and 1s; a 1 corresponds to the value present in the original Date column. Null values will have 0s for all columns in the newly created DataFrame.
- Repeat this for the month by creating dummy columns from the month of the Date column. Print out the resulting DataFrame:
month_dummies = pd.get_dummies(df['Date'].dt.month, prefix='month')
month_dummies
The output will be as follows:
The resultant DataFrame now contains only 0s and 1s for the month in the Date column.
- Concatenate the original DataFrame and the dummy DataFrames you created in Steps 5 and 6:
df = pd.concat([df, month_dummies, year_dummies], axis=1)
- Drop the original Date column since it is now redundant:
df.drop('Date', axis=1, inplace=True)
- Verify that all the columns are now of the numerical data type:
df.dtypes
The output will be as follows:
Here, you can see that all the data types of the resultant DataFrame are numerical. This means they can now be passed into an ANN for modeling.
In this exercise, you successfully imported tabular data and preprocessed the Date column using the pandas library. You utilized the get_dummies function to convert categorical data into numerical data types.
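The full flow of this exercise can be sketched as follows; a small inline DataFrame with hypothetical dates stands in for Bias_correction_ucl.csv so the snippet is self-contained:

```python
import pandas as pd

# Inline stand-in for the Date column of Bias_correction_ucl.csv
# (hypothetical dates and temperatures).
df = pd.DataFrame({'Date': ['2013-06-30', '2013-07-01', '2014-06-30'],
                   'Present_Tmax': [28.7, 31.9, 31.6]})
df['Date'] = pd.to_datetime(df['Date'])

# One-hot encode the year and month, then join and drop the redundant column
year_dummies = pd.get_dummies(df['Date'].dt.year, prefix='year')
month_dummies = pd.get_dummies(df['Date'].dt.month, prefix='month')
df = pd.concat([df, month_dummies, year_dummies], axis=1)
df.drop('Date', axis=1, inplace=True)

print(df.columns.tolist())
# ['Present_Tmax', 'month_6', 'month_7', 'year_2013', 'year_2014']
```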
Note
Another method of attaining a numerical data type from date data types is by using the pandas.Series.dt accessor object. More information about the available options can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html.
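For example, the accessor exposes numeric attributes such as year, month, and dayofweek directly, as this brief sketch on hypothetical dates shows:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2013-06-30', '2013-12-25']))

# A few of the numeric attributes exposed by the .dt accessor:
print(dates.dt.year.tolist())       # [2013, 2013]
print(dates.dt.month.tolist())      # [6, 12]
print(dates.dt.dayofweek.tolist())  # Monday=0 ... Sunday=6
```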
Processing non-numerical data is an important step in creating performant models. If possible, any domain knowledge should be imparted to the training data features. For example, when forecasting the temperature using the date, like the dataset used in the prior exercises and activity of this chapter, encoding the month would be helpful since the temperature is likely highly correlated with the month of the year. Encoding the day of the week, however, may not be useful as there is likely no correlation between the day of the week and temperature. Using this domain knowledge can aid the model to learn the underlying relationship between the features and the target.
In the next section, you will learn how to process image data so that it can be input into machine learning models.