Exploring the data
Jumping straight into modeling is a misstep almost every new data scientist makes: eager to reach the rewarding stage, we forget that most of our time is actually spent on the unglamorous work of cleaning the data and getting familiar with it. In this recipe, we will explore the census dataset.
Getting ready
To execute this recipe, you need to have a working Spark environment. You should have already gone through the previous recipe where we loaded the census data into a DataFrame.
No other prerequisites are required.
How to do it...
First, we list all the columns we want to keep:
cols_to_keep = census.dtypes
cols_to_keep = (
    ['label', 'age', 'capital-gain', 'capital-loss', 'hours-per-week']
    + [e[0] for e in cols_to_keep[:-1] if e[1] == 'string']
)
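To see what the list comprehension above does, it helps to recall that a Spark DataFrame's dtypes attribute is a list of (column name, type) tuples. The following is a minimal sketch in plain Python that needs no Spark session: sample_dtypes is a hypothetical stand-in for census.dtypes, and its column names and types are assumptions for illustration, not the actual census schema.

```python
# Hypothetical stand-in for census.dtypes: a list of (column name, type) pairs.
# These names and types are assumed for illustration only.
sample_dtypes = [
    ('age', 'int'),
    ('workclass', 'string'),
    ('education', 'string'),
    ('capital-gain', 'double'),
    ('capital-loss', 'double'),
    ('hours-per-week', 'double'),
    ('label', 'string'),
]

# Columns named explicitly, as in the recipe.
numeric_and_label = [
    'label', 'age', 'capital-gain', 'capital-loss', 'hours-per-week'
]

# Drop the last entry (assumed here to be the label) so it is not listed
# twice, then keep only the names of the string-typed columns.
categorical = [
    name for name, dtype in sample_dtypes[:-1] if dtype == 'string'
]

cols_to_keep = numeric_and_label + categorical
print(cols_to_keep)
```

In this sketch the [:-1] slice only matters because the last column is assumed to be a string-typed label that is already listed explicitly; without the slice it would appear twice in the result.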
Next, we select the numerical and categorical features as we will be exploring these separately:
census_subset = census.select(cols_to_keep)
cols_num...