Identifying and Fixing Missing Values
I think I speak for many data analysts and scientists when I say that rarely is something so seemingly small and trivial of as much consequence as a missing value. We spend a good deal of our time worrying about missing values because they can have a dramatic, and surprising, effect on our analysis. This is most likely to happen when values are not missing at random, but when missingness is instead correlated with a dependent variable. For example, if we are doing a longitudinal study of earnings, but individuals with lower education are more likely to skip the earnings question each year, there is a decent chance that this will bias our parameter estimate for education.
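One quick way to check whether missing values look nonrandom is to count them per column and then compare the share of missing values across groups. The following is a minimal sketch using pandas, which is an assumption about the tooling. The DataFrame, its column names (education and earnings), and all of its values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; column names and values are made up for illustration.
df = pd.DataFrame({
    "education": ["high school", "high school", "high school", "high school",
                  "college", "college", "college", "college"],
    "earnings": [np.nan, 31000.0, np.nan, 28000.0,
                 52000.0, 61000.0, np.nan, 58000.0],
})

# Count the missing values in each column.
missing_counts = df.isna().sum()

# Share of missing earnings within each education group.
# A large gap between groups suggests the values are not missing at random.
missing_rate_by_education = (
    df["earnings"].isna().groupby(df["education"]).mean()
)

print(missing_counts)
print(missing_rate_by_education)
```

In this toy data, half of the high school respondents skipped the earnings question compared with a quarter of the college respondents. A gap like that in real data would be a reason to look more closely before dropping or imputing anything.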
Of course, identifying missing values is not even half of the battle. We then need to decide how to handle them. Do we remove any observation with a missing value for one or more variables? Do we impute a value based on a sample-wide statistic like the mean? Or assign a value based on a...