Now that we have an understanding of the data we are working with, let's take a look at our missing values:
- To do this, we can use the isnull method that pandas provides for DataFrames. It returns a DataFrame of the same shape, filled with booleans that indicate whether each value is null.
- We then sum these booleans column by column (each True counts as 1) to see which columns have missing data:
X.isnull().sum()
>>>>
boolean                1
city                   1
ordinal_column         0
quantitative_column    1
dtype: int64
Here, we can see that three of our four columns (boolean, city, and quantitative_column) are each missing one value, while ordinal_column is complete. Our course of action will be to impute these missing values.
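To make the counting step concrete, here is a minimal, self-contained sketch. The DataFrame below is a hypothetical stand-in for X: its column names match the output above, but the row values are assumptions chosen only so that the missing-value counts agree with that output.

```python
import pandas as pd

# Hypothetical stand-in for X: the column names come from the output above,
# the row values are assumed for illustration only.
X = pd.DataFrame({
    "boolean": ["yes", "no", None, "no", "no", "yes"],
    "city": ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    "ordinal_column": ["somewhat like", "like", "somewhat like",
                       "like", "somewhat like", "dislike"],
    "quantitative_column": [1, 11, -0.5, 10, None, 20],
})

# isnull() returns a boolean DataFrame with the same shape as X
null_mask = X.isnull()

# Summing down each column counts the True entries (True is treated as 1)
missing_counts = null_mask.sum()
print(missing_counts)

# Keep only the names of the columns that have at least one missing value
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)
```

Filtering the counts, as in the last step, is a convenient way to get the list of columns that need imputing when a DataFrame has too many columns to check by eye.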
If you recall, we used scikit-learn's Imputer class in a previous chapter to fill in numerical data. Imputer does have a categorical option, most_frequent; however, it only works on categorical data that has been encoded as integers...
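As an illustration of what a most-frequent fill looks like on string categories, the sketch below uses plain pandas rather than Imputer (which newer scikit-learn releases have replaced with SimpleImputer). This is one possible approach under assumed data, not necessarily the one this chapter goes on to use; the city values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry (values are assumed)
city = pd.Series(
    ["tokyo", None, "london", "seattle", "san francisco", "tokyo"],
    name="city",
)

# mode() ignores missing values by default and returns the most common
# value(s); [0] takes the first one if there is a tie
most_frequent = city.mode()[0]

# Replace every missing entry with that most common category
city_filled = city.fillna(most_frequent)
print(city_filled)
```

Because this works directly on the string labels, no integer encoding is needed before imputing.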