Performing multivariate imputation by chained equations
Multivariate imputation methods, as opposed to univariate imputation, use multiple variables to estimate the missing values. Multivariate Imputation by Chained Equations (MICE) models each variable with missing values as a function of the remaining variables in the dataset. The output of that function is used to replace missing data.
MICE involves the following steps:
1. First, it performs a simple univariate imputation of every variable with missing data, for example, median imputation.
2. Next, it selects one specific variable, say, `var_1`, and sets its missing values back to missing.
3. It trains a model to predict `var_1` using the other variables as input features.
4. Finally, it replaces the missing values of `var_1` with the output of the model.
MICE repeats steps 2 to 4 for each of the remaining variables.
An imputation cycle concludes once all the variables have been modeled. MICE carries out multiple imputation cycles, typically 10. That is, we repeat steps 2 to 4 for each variable 10 times. The idea is that by the end of the cycles, we should have found the best possible estimates of the missing data for each variable.
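To make these steps concrete, here is a minimal sketch of the procedure in plain pandas and scikit-learn. It is a simplification (a single imputation per variable, a plain linear regression, and an illustrative helper name, `mice_impute`); the scikit-learn transformer we use later in this recipe handles all of this internally:

```
import pandas as pd
from sklearn.linear_model import LinearRegression

def mice_impute(df: pd.DataFrame, n_cycles: int = 10) -> pd.DataFrame:
    """Simplified MICE for a DataFrame of numerical variables."""
    na_mask = df.isna()
    # Step 1: start with a simple univariate (median) imputation
    imputed = df.fillna(df.median())
    cols_with_na = df.columns[na_mask.any()].tolist()
    for _ in range(n_cycles):  # one pass over all variables = one cycle
        for col in cols_with_na:
            missing = na_mask[col]
            X = imputed.drop(columns=col)
            # Steps 2-3: model the variable on the originally observed rows
            model = LinearRegression().fit(
                X[~missing], imputed.loc[~missing, col])
            # Step 4: replace the originally missing entries with predictions
            imputed.loc[missing, col] = model.predict(X[missing])
    return imputed
```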
Note
Multivariate imputation can be a useful alternative to univariate imputation in situations where we don’t want to distort the variable distributions. Multivariate imputation is also useful when we are interested in having good estimates of the missing data.
In this recipe, we will implement MICE using scikit-learn.
How to do it...
To begin the recipe, let’s import the required libraries and load the data:
- Let’s import the required Python libraries, classes, and functions:
```
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.experimental import (
    enable_iterative_imputer
)
from sklearn.impute import (
    IterativeImputer,
    SimpleImputer
)
```
- Let’s load some numerical variables from the dataset described in the Technical requirements section:
```
variables = [
    "A2", "A3", "A8", "A11", "A14", "A15", "target"]
data = pd.read_csv(
    "credit_approval_uci.csv",
    usecols=variables)
```
- Let’s divide the data into train and test sets:
```
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```
- Let’s create a MICE imputer using Bayes regression, specifying the number of iteration cycles and setting `random_state` for reproducibility:

```
imputer = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=10,
    random_state=0,
).set_output(transform="pandas")
```
Note

`IterativeImputer()` has other useful arguments. For example, we can specify the initial imputation strategy using the `initial_strategy` parameter, choosing from mean, median, mode, or constant (arbitrary) imputation. We can also specify how to cycle over the variables through the `imputation_order` parameter, either randomly or from the one with the fewest missing values to the one with the most.
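For instance, a variant of the imputer above with these options set explicitly (illustrative values, same Bayes regressor) might look as follows:

```
imputer = IterativeImputer(
    estimator=BayesianRidge(),
    initial_strategy="median",     # first fill with the median
    imputation_order="ascending",  # fewest missing values first
    max_iter=10,
    random_state=0,
).set_output(transform="pandas")
```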
- Let’s fit `IterativeImputer()` so that it trains the estimators to predict the missing values in each variable:

```
imputer.fit(X_train)
```
Note

We can use any regression model to estimate the missing data with `IterativeImputer()`.
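For example, we could swap the Bayes regressor for random forests; this sketch keeps the rest of the setup unchanged (the hyperparameter values are illustrative):

```
from sklearn.ensemble import RandomForestRegressor

imputer_rf = IterativeImputer(
    estimator=RandomForestRegressor(
        n_estimators=10, random_state=0),
    max_iter=10,
    random_state=0,
).set_output(transform="pandas")
```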
- Finally, let’s fill in the missing values in both the train and test sets:
```
X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)
```
Note

To corroborate the lack of missing data, we can execute `X_train_t.isnull().sum()`.
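For example, as a quick sanity check over both datasets:

```
# Both imputed datasets should be free of missing values
assert X_train_t.isnull().sum().sum() == 0
assert X_test_t.isnull().sum().sum() == 0
```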
To wrap up the recipe, let’s impute the variables with a simple univariate imputation method and compare the effect on the variables’ distributions.
- Let’s set up scikit-learn’s `SimpleImputer()` to perform mean imputation, and then transform the datasets:

```
imputer_simple = SimpleImputer(
    strategy="mean").set_output(transform="pandas")
X_train_s = imputer_simple.fit_transform(X_train)
X_test_s = imputer_simple.transform(X_test)
```
- Let’s now make a histogram of the `A3` variable after MICE imputation, followed by a histogram of the same variable after mean imputation:

```
fig, axes = plt.subplots(
    2, 1, figsize=(10, 10), squeeze=False)
X_test_t["A3"].hist(
    bins=50, ax=axes[0, 0], color="blue")
X_test_s["A3"].hist(
    bins=50, ax=axes[1, 0], color="green")
axes[0, 0].set_ylabel("Number of observations")
axes[1, 0].set_ylabel("Number of observations")
axes[0, 0].set_xlabel("A3")
axes[1, 0].set_xlabel("A3")
axes[0, 0].set_title("MICE")
axes[1, 0].set_title("Mean imputation")
plt.show()
```
In the following plot, we see that mean imputation distorts the variable distribution, with more observations toward the mean value:
Figure 1.10 – Histogram of variable A3 after MICE imputation (top) or mean imputation (bottom), showing the distortion in the variable distribution caused by the latter
How it works...
In this recipe, we performed multivariate imputation using `IterativeImputer()` from scikit-learn. When we fit the model, `IterativeImputer()` carried out the steps that we described in the introduction to the recipe: it imputed all variables with the mean, then selected one variable and set its missing values back to missing, and finally fitted a Bayes regressor to estimate that variable based on the others. It repeated this procedure for each variable, which completed one cycle of imputation, and we set it to run 10 such cycles. By the end of this process, `IterativeImputer()` had one Bayes regressor trained to predict the values of each variable based on the other variables in the dataset. With `transform()`, it uses the predictions of these Bayes models to impute the missing data.
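We can corroborate this by peeking at the fitted state. The following sketch relies on two attributes documented in scikit-learn’s API, `n_iter_` and `imputation_sequence_`:

```
# Number of imputation cycles that actually ran (at most max_iter)
print(imputer.n_iter_)

# One (feature, estimator) triplet is stored per variable per cycle
triplet = imputer.imputation_sequence_[0]
print(triplet.feat_idx, type(triplet.estimator).__name__)
```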
`IterativeImputer()` can only impute missing data in numerical variables based on numerical variables. If you want to use categorical variables as input, you need to encode them first. However, keep in mind that it only carries out regression; hence, it is not suitable for estimating missing data in discrete or categorical variables.
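As a sketch of that encoding step, with a hypothetical categorical column called `A4` used purely as an input feature, we could map its categories to numbers before imputing:

```
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: a complete categorical column, A4, used as an input
# feature to help impute the numerical column A2
df = pd.DataFrame({
    "A4": ["u", "y", "y", "u"],
    "A2": [21.0, np.nan, 30.5, 19.0],
})

# Encode the categories as numbers so IterativeImputer() can use them
enc = OrdinalEncoder()
df["A4"] = enc.fit_transform(df[["A4"]]).ravel()
# df is now all-numerical and can be passed to IterativeImputer()
```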
See also
To learn more about MICE, take a look at the following resources:
- A multivariate technique for multiply imputing missing values using a sequence of regression models: https://www.researchgate.net/publication/244959137
- Multiple Imputation by Chained Equations: What is it and how does it work?: https://www.jstatsoft.org/article/download/v045i03/550