Replacing missing values with an arbitrary number
We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.
When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.
Note
We’d use arbitrary number imputation when data is not missing at random, when we use non-linear models, or when the percentage of missing data is high. This imputation technique distorts the original variable distribution.
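Before deciding on the arbitrary value, it helps to inspect the summary statistics of the variables so that the chosen number falls well outside their usual range. The following is a minimal sketch, assuming the dataset has already been loaded into a pandas DataFrame called data, as we do in the steps below:

import pandas as pd

data = pd.read_csv("credit_approval_uci.csv")
# Summary statistics of the candidate variables; pick an arbitrary
# number that lies well above the maximum (or below the minimum).
print(data[["A2", "A3", "A8", "A11"]].describe())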
In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.
How to do it...
Let’s begin by importing the necessary tools and loading the data:
1. Import pandas and the required functions and classes:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import ArbitraryNumberImputer
2. Let’s load the dataset described in the Technical requirements section:
data = pd.read_csv("credit_approval_uci.csv")
3. Let’s separate the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
We will select arbitrary values greater than the maximum value of the distribution.
4. Let’s find the maximum value of four numerical variables:

X_train[["A2", "A3", "A8", "A11"]].max()
The previous command returns the following output:
A2     76.750
A3     26.335
A8     28.500
A11    67.000
dtype: float64
We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.
5. Let’s make a copy of the original DataFrames:

X_train_t = X_train.copy()
X_test_t = X_test.copy()
6. Now, we replace the missing values with 99:

X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[
    ["A2", "A3", "A8", "A11"]].fillna(99)
X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[
    ["A2", "A3", "A8", "A11"]].fillna(99)
Note
To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.
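As a quick sketch of that note, fillna() accepts such a dictionary directly, so each variable receives its own arbitrary number in a single call (applied here to the train and test sets from step 3):

imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}
# Each key is a column name and each value is the number used to
# fill the missing values in that column.
X_train_t = X_train.fillna(value=imputation_dict)
X_test_t = X_test.fillna(value=imputation_dict)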
Now, we’ll impute missing values with an arbitrary number using scikit-learn.
7. Let’s set up imputer to replace missing values with 99:

imputer = SimpleImputer(strategy='constant', fill_value=99)
Note
If your dataset contains categorical variables, SimpleImputer() will add 99 to those variables as well if any values are missing.
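If you want to impute only specific numerical columns while leaving the rest of the dataset untouched, one option, sketched here as a suggestion rather than as part of this recipe, is to wrap SimpleImputer() in scikit-learn’s ColumnTransformer():

from sklearn.compose import ColumnTransformer

# Apply the constant-value imputer only to the four numerical
# variables and pass all remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[
        ("arbitrary", SimpleImputer(strategy="constant", fill_value=99),
         ["A2", "A3", "A8", "A11"]),
    ],
    remainder="passthrough",
)
X_train_ct = ct.fit_transform(X_train)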
8. Let’s fit imputer to a slice of the train set containing the variables to impute:

vars = ["A2", "A3", "A8", "A11"]
imputer.fit(X_train[vars])
9. Replace the missing values with 99 in the desired variables:

X_train_t[vars] = imputer.transform(X_train[vars])
X_test_t[vars] = imputer.transform(X_test[vars])
Go ahead and check the lack of missing values by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().
To finish, let’s impute missing values using feature-engine.
10. Let’s set up the imputer to replace missing values with 99 in 4 specific variables:

imputer = ArbitraryNumberImputer(
    arbitrary_number=99,
    variables=["A2", "A3", "A8", "A11"],
)
Note
ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.
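For instance, a minimal sketch of that default behavior, imputing every numerical variable instead of only the four used in this recipe, could look like this:

# With variables=None (the default), all numerical variables
# in the train set are imputed with the arbitrary number.
imputer_all = ArbitraryNumberImputer(arbitrary_number=99)
X_train_all = imputer_all.fit_transform(X_train)
print(imputer_all.variables_)  # the variables that were imputed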
11. Finally, let’s replace the missing values with 99:

X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
Note
To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputer_dict={"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).
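A short sketch of that setup follows (the dictionary values come from the note above); the variables to impute are then taken from the dictionary keys, so each one gets its own number:

imputer = ArbitraryNumberImputer(
    imputer_dict={"A2": -1, "A3": -1, "A8": 999, "A11": 9999},
)
# A2 and A3 are filled with -1, A8 with 999, and A11 with 9999.
X_train_t = imputer.fit_transform(X_train)
X_test_t = imputer.transform(X_test)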
We have now replaced missing data with arbitrary numbers using three different open-source libraries.
How it works...
In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.
To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. In step 6, we used pandas fillna() to replace the missing data.
To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().
To replace missing values with feature-engine, we used ArbitraryNumberImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.