Implementing missing value imputation algorithms
From here on, all missing value analysis is done in R, because this language offers statistically specialized, easy-to-use packages for the task that have no direct equivalent in the Python ecosystem.
Suppose we need to calculate the Pearson correlation coefficient between two numerical variables, Age and Fare, of the Titanic disaster dataset. Let's first consider the case where missing values are eliminated.
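To make the starting point concrete, here is a minimal sketch of what "eliminating missing values" means for a single Pearson coefficient. The vectors `age` and `fare` hold made-up values (they are not taken from the Titanic data), and the names `keep`, `r_manual`, and `r_builtin` are purely illustrative. The sketch shows that `cor()` returns `NA` by default when any value is missing, and that dropping incomplete pairs by hand gives the same result as `use = "complete.obs"`.

```r
# Hypothetical values for illustration only (not the Titanic data)
age  <- c(22, 38, NA, 35, 54, 27)
fare <- c(7.25, 71.28, 7.92, 53.10, NA, 11.13)

# With the default use = "everything", any NA propagates to the result
r_default <- cor(age, fare)

# Keep only the observations where both variables are present
keep <- !is.na(age) & !is.na(fare)
x <- age[keep]
y <- fare[keep]

# Pearson's r computed from its definition on the retained pairs
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The same deletion delegated to cor()
r_builtin <- cor(age, fare, use = "complete.obs", method = "pearson")
```

Here `r_default` is `NA`, while `r_manual` and `r_builtin` agree because both are computed on the same four complete pairs.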
Removing missing values
The impact of applying listwise and pairwise deletion techniques is evident in the calculation of Pearson's correlation between numerical variables in the Titanic dataset. Let's load the data and select only numeric features:
library(dplyr)

# Load the Titanic dataset from the remote CSV file
dataset_url <- 'http://bit.ly/titanic-data-csv'
tbl <- readr::read_csv(dataset_url)

# Keep only the numeric columns
tbl_num <- tbl %>%
  select(where(is.numeric))
If you now calculate the correlation matrix for the two techniques...
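As a rough sketch of how the two techniques differ, the example below uses a small made-up data frame, `toy`, rather than `tbl_num`, so that it runs without downloading anything. The column names and values are assumptions for illustration. It relies on the `use` argument of base R's `cor()`: `"complete.obs"` performs listwise deletion and `"pairwise.complete.obs"` performs pairwise deletion.

```r
# Hypothetical data frame for illustration only (not the Titanic data)
toy <- data.frame(
  age   = c(22, 38, NA, 35, 54, NA),
  fare  = c(7.25, 71.28, 7.92, 53.10, 51.86, 8.46),
  sibsp = c(1, 1, 0, NA, 0, 0)
)

# Listwise deletion: a row with an NA in any column is dropped
# before every coefficient is computed
cor_listwise <- cor(toy, use = "complete.obs", method = "pearson")

# Pairwise deletion: each coefficient uses every row that is
# complete for that particular pair of columns
cor_pairwise <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Number of rows each approach actually uses for the age/fare pair
n_listwise <- sum(complete.cases(toy))
n_pairwise_age_fare <- sum(complete.cases(toy[, c("age", "fare")]))
```

In this toy case listwise deletion keeps three rows for every coefficient, whereas pairwise deletion keeps four rows for the `age`/`fare` pair, so the two matrices generally differ. The same two `cor()` calls can be applied to `tbl_num`.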