Cleaning missing values
In this recipe, we go over some of the most straightforward approaches for handling missing values. These include dropping observations that have missing values; assigning a sample-wide summary statistic, such as the mean, to the missing values; and assigning values based on the mean for an appropriate subset of the data.
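As a rough illustration of these three approaches, here is a minimal sketch on a small made-up DataFrame. The column names `degree` and `gpa` and all of the values are assumptions chosen for the example; they are not taken from the NLS data.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; not the NLS data.
df = pd.DataFrame(
    {
        "degree": ["BA", "BA", "HS", "HS", "HS"],
        "gpa": [3.5, np.nan, 2.0, 3.0, np.nan],
    }
)

# 1. Drop observations where gpa is missing.
dropped = df.dropna(subset=["gpa"])

# 2. Assign the sample-wide mean to the missing values.
overall_filled = df["gpa"].fillna(df["gpa"].mean())

# 3. Assign each degree group's mean to that group's missing values.
#    transform("mean") returns a Series aligned with df's index.
group_means = df.groupby("degree")["gpa"].transform("mean")
group_filled = df["gpa"].fillna(group_means)

print(len(dropped))
print(overall_filled.tolist())
print(group_filled.tolist())
```

Filling with the group mean keeps an imputed value closer to what similar observations look like, at the cost of assuming that the grouping variable is itself observed and meaningful.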
How to do it...
We will find and then remove observations from the NLS data that are missing data for most of the key variables. We will also use pandas methods to replace missing values with alternative values, such as the variable mean:
- Let’s load the NLS data and select some of the educational data.
import pandas as pd
nls97 = pd.read_csv("data/nls97g.csv", low_memory=False)
nls97.set_index("personid", inplace=True)
schoolrecordlist = ['satverbal','satmath','gpaoverall', 'gpaenglish', 'gpamath','gpascience','highestdegree...
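The idea of finding and removing observations that are missing most of their key variables can be sketched on a small stand-in DataFrame. Everything here is an assumption made for the example: the `scores` data, the `key_vars` list, and the threshold of two non-missing values are hypothetical and are not the recipe's actual data or cutoffs.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data indexed by a hypothetical person ID.
scores = pd.DataFrame(
    {
        "verbal": [500.0, np.nan, np.nan, 620.0],
        "math": [550.0, np.nan, 480.0, np.nan],
        "gpa": [3.1, np.nan, np.nan, 3.6],
    },
    index=pd.Index([101, 102, 103, 104], name="personid"),
)
key_vars = ["verbal", "math", "gpa"]

# Count how many key variables are missing for each observation.
missing_count = scores[key_vars].isnull().sum(axis=1)
print(missing_count.value_counts().sort_index())

# Keep only observations with at least 2 non-missing key variables.
kept = scores.dropna(subset=key_vars, thresh=2)
print(kept)
```

Note that `thresh` counts non-missing values, so it expresses the minimum amount of data a row must have rather than the maximum number of gaps it may contain.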