Removing observations with missing data
Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. CCA preserves the distribution of the variables after removing the incomplete observations, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.
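As a quick illustration of the idea (a minimal sketch using a toy DataFrame with made-up column names, not the credit approval dataset used below), list-wise deletion keeps only the rows with no missing values:

```python
import numpy as np
import pandas as pd

# Toy data: two of the four rows contain a missing value
df = pd.DataFrame({
    "A1": [1.0, 2.0, np.nan, 4.0],
    "A2": ["a", None, "c", "d"],
})

complete_cases = df.dropna()  # list-wise deletion

print(len(df))              # 4
print(len(complete_cases))  # 2
```

Half the rows are discarded even though only two individual values were missing, which is why CCA becomes costly when many variables have gaps.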
Note
Use CCA only when a small number of observations are missing and you have good reasons to believe that they are not important to your model.
How to do it...
Let’s begin by making some imports and loading the dataset:
- Let’s import `pandas`, `matplotlib`, and the train/test split function from scikit-learn:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
```
- Let’s load and display the dataset described in the Technical requirements section:
```python
data = pd.read_csv("credit_approval_uci.csv")
data.head()
```
In the following image, we see the first 5 rows of data:
Figure 1.1 – First 5 rows of the dataset
- Let’s proceed as we normally would if we were preparing the data to train machine learning models; by splitting the data into a training and a test set:
```python
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.30,
    random_state=42,
)
```
- Let’s now make a bar plot with the proportion of missing data per variable in the training and test sets:
```python
fig, axes = plt.subplots(
    2, 1, figsize=(15, 10), squeeze=False)
X_train.isnull().mean().plot(
    kind='bar', color='grey', ax=axes[0, 0], title="train")
X_test.isnull().mean().plot(
    kind='bar', color='black', ax=axes[1, 0], title="test")
axes[0, 0].set_ylabel('Fraction of NAN')
axes[1, 0].set_ylabel('Fraction of NAN')
plt.show()
```
The previous code block returns the following bar plots with the fraction of missing data per variable in the training (top) and test sets (bottom):
Figure 1.2 – Proportion of missing data per variable
- Now, we’ll remove observations if they have missing values in any variable:
```python
train_cca = X_train.dropna()
test_cca = X_test.dropna()
```
Note
pandas’ `dropna()` drops observations with any missing value by default. We can remove observations with missing data in a subset of variables like this: `data.dropna(subset=["A3", "A4"])`.
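To see how the `subset` argument changes the result (a toy example with hypothetical column names), compare dropping on all columns versus a subset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A3": [1.0, np.nan, 3.0],
    "A4": [10.0, 20.0, np.nan],
    "A5": [np.nan, 200.0, 300.0],
})

# Any missing value in any column removes the row
dropped_any = df.dropna()

# Only missing values in A3 or A4 remove the row
dropped_subset = df.dropna(subset=["A3", "A4"])

print(len(dropped_any))     # 0
print(len(dropped_subset))  # 1
```

The first row survives the subset-based deletion because its only missing value is in `A5`, which is not part of the subset.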
- Let’s print and compare the size of the original and complete case datasets:
```python
print(f"Total observations: {len(X_train)}")
print(f"Observations without NAN: {len(train_cca)}")
```
We removed more than 200 observations with missing data from the training set, as shown in the following output:
```
Total observations: 483
Observations without NAN: 264
```
- After removing observations from the training and test sets, we need to align the target variables:
```python
y_train_cca = y_train.loc[train_cca.index]
y_test_cca = y_test.loc[test_cca.index]
```
Now, the datasets and target variables contain the rows without missing data.
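The re-alignment relies on pandas label-based indexing: `.loc` selects the target rows whose index labels survived the row removal. A minimal sketch with toy data (hypothetical index labels and values):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"A1": [1.0, np.nan, 3.0, 4.0]},
                 index=[10, 11, 12, 13])
y = pd.Series([0, 1, 0, 1], index=[10, 11, 12, 13])

X_cca = X.dropna()
y_cca = y.loc[X_cca.index]  # keep only targets for the remaining rows

print(list(X_cca.index))  # [10, 12, 13]
print(y_cca.tolist())     # [0, 0, 1]
```

Because `dropna()` preserves the original index labels, `.loc` can match the target to the remaining rows without any positional bookkeeping.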
- To drop observations with missing data using `feature-engine`, let’s import the required transformer:

```python
from feature_engine.imputation import DropMissingData
```
- Let’s set up the imputer to automatically find the variables with missing data:
```python
cca = DropMissingData(variables=None, missing_only=True)
```
- Let’s fit the transformer so that it finds the variables with missing data:
```python
cca.fit(X_train)
```
- Let’s inspect the variables with NAN that the transformer found:
```python
cca.variables_
```
The previous command returns the names of the variables with missing data:
```
['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A14']
```
- Let’s remove the rows with missing data in the training and test sets:
```python
train_cca = cca.transform(X_train)
test_cca = cca.transform(X_test)
```
Use `train_cca.isnull().sum()` to corroborate the absence of missing data in the complete case dataset.

`DropMissingData` can automatically adjust the target after removing missing data from the training set:

```python
train_c, y_train_c = cca.transform_x_y(
    X_train, y_train)
test_c, y_test_c = cca.transform_x_y(X_test, y_test)
```
The previous code removed rows with `nan` from the training and test sets and then re-aligned the target variables.
Note
To remove observations with missing data in a subset of variables, use `DropMissingData(variables=['A3', 'A4'])`. To remove rows with `nan` in at least 5% of the variables, use `DropMissingData(threshold=0.95)`.
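A similar row-wise rule can be reproduced with plain pandas through the `thresh` argument of `dropna()`, which keeps rows containing at least a given number of non-missing values (a sketch on made-up data; `thresh` counts values rather than fractions, so the 95% requirement is converted into a column count):

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 20)))
df.iloc[0, 0] = np.nan   # 1 of 20 values missing (5%)
df.iloc[1, :3] = np.nan  # 3 of 20 values missing (15%)

# Keep rows with at least 95% of their values present
kept = df.dropna(thresh=math.ceil(0.95 * df.shape[1]))

print(len(kept))  # 2
```

Only the row missing 15% of its values is removed; rows at or above the 95%-complete mark are retained.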
How it works...
In this recipe, we plotted the proportion of missing data in each variable and then removed all observations with missing values.
We used pandas’ `isnull()` and `mean()` methods to determine the proportion of missing observations in each variable. The `isnull()` method created a Boolean vector per variable, with `True` and `False` values indicating whether a value was missing. The `mean()` method took the average of these values and returned the proportion of missing data.
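The mechanics can be verified on a single toy column: averaging the Boolean mask gives the fraction of `True` values, which is exactly the fraction of missing entries:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = s.isnull()

print(mask.tolist())  # [False, True, False, True]
print(mask.mean())    # 0.5
```

This works because pandas treats `True` as 1 and `False` as 0 when computing the mean.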
We used pandas’ `plot.bar()` to create a bar plot of the fraction of missing data per variable. In Figure 1.2, we saw the fraction of `nan` per variable in the training and test sets.
To remove observations with missing values in any variable, we used pandas’ `dropna()`, thereby obtaining a complete case dataset.
Finally, we removed missing data using Feature-engine’s `DropMissingData()`. This imputer automatically identified and stored the variables with missing data from the train set when we called the `fit()` method. With the `transform()` method, the imputer removed observations with `nan` in those variables. With `transform_x_y()`, the imputer removed rows with `nan` from the datasets and then realigned the target variables.
See also
If you want to use `DropMissingData()` within a pipeline together with other Feature-engine or scikit-learn transformers, check out Feature-engine’s `Pipeline`: https://Feature-engine.trainindata.com/en/latest/user_guide/pipeline/Pipeline.html. This pipeline can align the target with the training and test sets after removing rows.