Cleaning up raw datasets with fastai
Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.
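To make the two cleanup operations concrete before turning to fastai, here is a minimal sketch in plain pandas on a made-up four-row table. The column names are borrowed from ADULT_SAMPLE, but the values are invented. Filling with the median and shifting the category codes by one are assumptions chosen for illustration, not a description of what fastai does internally.

```python
import pandas as pd

# Made-up rows; only the column names come from ADULT_SAMPLE
toy = pd.DataFrame({
    "education-num": [13.0, None, 9.0, 7.0],
    "occupation": ["Sales", None, "Tech-support", "Sales"],
})

cleaned = toy.copy()

# Missing numeric values: fill with the column median (an illustrative choice)
median = cleaned["education-num"].median()
cleaned["education-num"] = cleaned["education-num"].fillna(median)

# Categorical values: replace each category with an integer code.
# pandas assigns -1 to missing entries, so adding 1 reserves 0 for "missing".
cleaned["occupation"] = cleaned["occupation"].astype("category").cat.codes + 1

print(cleaned)
print(cleaned.isnull().sum())
```

After these two steps every cell holds a number, which is the state a tabular dataset needs to be in before training.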
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.
How to do it…
In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and replace categorical values with numeric identifiers.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Recall the Examining tabular datasets with fastai section of this chapter. When you checked which columns in the ADULT_SAMPLE dataset had missing values, you found that some columns did. We are going to identify the columns in ADULT_SAMPLE that have missing values, use the facilities of fastai to apply transformations that deal with the missing values in those columns, and then replace the values in the categorical columns with numeric identifiers.
- First, let's ingest the ADULT_SAMPLE curated dataset again:
  path = untar_data(URLs.ADULT_SAMPLE)
- Now, create a pandas DataFrame for the dataset and check the number of missing values in each column. Note which columns have missing values:
  df = pd.read_csv(path/'adult.csv')
  df.isnull().sum()
- To deal with these missing values (and prepare categorical columns), we will use the fastai TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:
  a) procs is the list of transformations that TabularPandas will apply to the dataset. Here, we specify that we want missing values to be filled (FillMissing) and values in categorical columns to be replaced with numeric identifiers (Categorify).
  b) dep_var specifies which column is the dependent variable; that is, the target that we ultimately want to predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.
  c) cont and cat are lists of the continuous and categorical columns in the dataset, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical columns contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() function (https://docs.fast.ai/tabular.core.html#cont_cat_split) to automatically identify the continuous and categorical columns:
  procs = [FillMissing, Categorify]
  dep_var = 'salary'
  cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
- Now, create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:
  df_no_missing = TabularPandas(df, procs, cat, cont, y_names=dep_var)
- Apply the show API to df_no_missing to display samples of its contents. Note that the categorical columns still show their original values when the object is displayed using show(). What about replacing the categorical values with numeric identifiers? Don't worry, we'll see that result in the next step.
- Now, display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This is an example of a benefit provided by fastai: the switch between the original categorical values and the numeric identifiers is handled elegantly. If you need to see the original values, you can use the show() API, which transforms the numeric values in categorical columns back into their original values, while the items.head() API shows the actual numeric identifiers in the categorical columns.
- Finally, let's confirm that the missing values were handled correctly. In the notebook output, the two columns that originally had missing values no longer have missing values in df_no_missing.
By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.
How it works…
In this section, you saw several ways that fastai makes it easy to perform common data preparation steps. The TabularPandas class lets you execute the common steps needed to prepare a tabular dataset (including replacing missing values and dealing with categorical columns) with a single object. The cont_cat_split() function automatically identifies continuous and categorical columns in your dataset. In conclusion, fastai makes the cleanup process easier and less error prone than it would be if you had to hand code all the functions required to accomplish these dataset cleanup steps.
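To give a feel for what automatically identifying continuous and categorical columns can involve, here is a simplified sketch of one possible rule, written in plain pandas. The function split_cont_cat is a hypothetical stand-in, not fastai's source code: it assumes that float columns, and integer columns with more than max_card distinct values, count as continuous, and that everything else counts as categorical.

```python
import pandas as pd


def split_cont_cat(df, max_card=20, dep_var=None):
    """Hypothetical helper: return (continuous, categorical) column names."""
    cont_names, cat_names = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # the target column is not an input feature
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col.dtype)
        is_wide_int = (pd.api.types.is_integer_dtype(col.dtype)
                       and col.nunique() > max_card)
        if is_float or is_wide_int:
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


# Made-up rows for illustration only
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours": [40.0, 13.0, None, 40.0],
    "workclass": ["Private", "State-gov", "Private", None],
    "salary": ["<50k", ">=50k", "<50k", "<50k"],
})

cont, cat = split_cont_cat(toy, max_card=1, dep_var="salary")
print("continuous:", cont)
print("categorical:", cat)
```

With a low max_card, almost every integer column is treated as continuous; a higher value would push low-cardinality integer columns (for example, a numeric code with only a handful of distinct values) into the categorical list instead.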