Chapter 3: Identifying and Fixing Missing Values
I think I speak for many data scientists when I say that few things that seem so small and trivial are of as much consequence as the missing value. We spend a good deal of our time worrying about missing values because they can have a dramatic, and surprising, effect on our analysis. This is most likely to happen when missing values are not random; that is, when they are correlated with a feature or target. For example, let's say we are doing a longitudinal study of earnings, but individuals with lower education are more likely to skip the earnings question each year. There is a decent chance that this will bias our parameter estimate for education.
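To make the earnings example concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the column names, the earnings formula, and the skip rates (50 percent for respondents with fewer than 12 years of education, 5 percent otherwise) are invented, not taken from any real study. It looks only at the sample mean of earnings; how a parameter estimate is affected depends on the model and on what drives the missingness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: years of education, and earnings that rise with education.
education = rng.integers(8, 21, size=n)  # integer years from 8 to 20
earnings = 10_000 + 3_000 * education + rng.normal(0, 8_000, size=n)
df = pd.DataFrame({"education": education, "earnings": earnings})
df["group"] = np.where(df["education"] < 12, "under_12_years", "12_plus_years")

# Assumed skip behavior: lower-education respondents skip the question more often.
p_skip = np.where(df["education"] < 12, 0.50, 0.05)
skipped = rng.random(n) < p_skip

# Keep earnings where the question was answered; set it to NaN where it was skipped.
df["earnings_reported"] = df["earnings"].where(~skipped)

# Share of missing earnings values within each education group.
missing_rate = (
    df.assign(missing=df["earnings_reported"].isna())
    .groupby("group")["missing"]
    .mean()
)

true_mean = df["earnings"].mean()               # mean if everyone had answered
observed_mean = df["earnings_reported"].mean()  # pandas skips NaN by default

print(missing_rate)
print(f"true mean: {true_mean:,.0f}  observed mean: {observed_mean:,.0f}")
```

Because the respondents who skip the question are disproportionately lower earners in this setup, the mean computed from the reported values comes out higher than the mean we would have seen with complete data.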
Of course, identifying missing values is not even half of the battle. We then need to decide how to handle them. Do we remove any observation with a missing value for one or more features? Do we impute a value based on a sample-wide statistic such as the...