What missing values are and how to deal with them
Data describing real-world phenomena often contains missing values. These gaps cannot be ignored, especially when the analyst wants to study the dataset in depth, for example to understand how strongly its variables are correlated.
Mishandling missing values can have several consequences:
- Statistical power is reduced for any analysis involving variables with missing values, especially when a large share of the values of a single variable is missing.
- The dataset may become less representative: the observations that remain may no longer reflect the substantive characteristics of the full set of observations of the phenomenon.
- Statistical estimates computed from the incomplete data may not converge to the true population values, which introduces bias.
- As a result, the conclusions drawn from the analysis may be incorrect.
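As a rough illustration of the sample-size and bias points above, the following sketch uses made-up income figures and assumes that the highest earners are the ones who leave the question blank. The data, the threshold, and the helper name `summarize` are hypothetical and chosen only for this example.

```python
from statistics import mean

# Hypothetical complete data: monthly incomes of ten respondents.
incomes = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 4500, 6000, 9000]

# Assumed missingness mechanism: anyone earning 4000 or more does not answer.
# Missing answers are represented as None.
THRESHOLD = 4000
observed = [x if x < THRESHOLD else None for x in incomes]


def summarize(values):
    """Count missing entries and average only the values that are present."""
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)
    return {
        "n_total": len(values),
        "n_observed": len(present),
        "missing_fraction": n_missing / len(values),
        "mean_observed": mean(present),
    }


true_mean = mean(incomes)
summary = summarize(observed)

print("true mean:", true_mean)
print("observed mean:", summary["mean_observed"])
print("usable observations:", summary["n_observed"], "of", summary["n_total"])
```

Because the missing answers here depend on the value itself, the mean of the observed incomes understates the true mean, and fewer observations remain to support any test.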
But first, let’s look at the possible causes of missing values.