Cleaning missing values
In this section, we'll go over some of the most straightforward approaches for handling missing values. These include dropping observations that have missing values; assigning a sample-wide summary statistic, such as the mean, to the missing values; and assigning values based on the mean for an appropriate subset of the data:
- Let's load the NLS data and select some of the educational data:
import pandas as pd
nls97 = pd.read_csv("data/nls97b.csv")
nls97.set_index("personid", inplace=True)
schoolrecordlist = ['satverbal','satmath','gpaoverall','gpaenglish',
  'gpamath','gpascience','highestdegree',
  'highestgradecompleted']
schoolrecord = nls97[schoolrecordlist]
schoolrecord.shape
(8984, 8)
- We can use the techniques we explored in the previous section to identify missing values:
schoolrecord.isnull...
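
The three approaches named at the start of this section (dropping observations, filling with a sample-wide mean, and filling with a subgroup mean) can be sketched on a small made-up DataFrame. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are hypothetical, and only the column names `gpaoverall` and `highestdegree` are borrowed from the NLS selection above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values and degree labels are made up for illustration.
toy = pd.DataFrame(
    {
        "highestdegree": ["High School", "High School",
                          "Bachelors", "Bachelors", "Bachelors"],
        "gpaoverall": [2.5, np.nan, 3.5, np.nan, 3.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="personid"),
)

# 1. Drop observations where gpaoverall is missing.
dropped = toy.dropna(subset=["gpaoverall"])

# 2. Fill missing values with the sample-wide mean of the non-missing values.
overall_mean = toy["gpaoverall"].mean()
filled_overall = toy["gpaoverall"].fillna(overall_mean)

# 3. Fill missing values with the mean of each person's highestdegree group.
#    transform("mean") returns a Series aligned to the original index.
group_mean = toy.groupby("highestdegree")["gpaoverall"].transform("mean")
filled_group = toy["gpaoverall"].fillna(group_mean)

print(dropped.shape)
print(filled_overall)
print(filled_group)
```

In this sketch, the sample-wide mean gives both missing rows the same value, while the group mean gives each missing row the average of its own degree category. Which of the two is more appropriate depends on how strongly the grouping variable relates to the variable being filled.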