Transforming the data
Machine learning (ML) is a field of study that uses computers to model real-world phenomena and predict their behavior. To build an ML model, all of our data needs to be numeric. Since almost all of our features are categorical, we need to transform them. In this recipe, we will learn how to use the hashing trick and dummy encoding.
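To make the two techniques concrete before we turn to Spark, here is a minimal plain-Python sketch of what each one does to a single categorical value. The helper names (dummy_encode, hash_encode), the category values, and the use of MD5 as the hash function are assumptions made for illustration only; they are not the functions this recipe uses, and Spark's own transformers may hash differently.

```python
import hashlib


def dummy_encode(value, categories):
    """Dummy encoding: one slot per known category, with a 1.0 in the matching slot."""
    return [1.0 if value == cat else 0.0 for cat in categories]


def hash_encode(value, num_features):
    """Hashing trick: hash the value and use the result to pick one of num_features slots."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()  # MD5 is a stand-in hash
    index = int(digest, 16) % num_features
    vector = [0.0] * num_features
    vector[index] = 1.0
    return vector


# Hypothetical category values, chosen only for this example
categories = ["Private", "Self-emp", "Federal-gov", "Local-gov"]

dummy = dummy_encode("Federal-gov", categories)  # vector length equals the number of categories
hashed = hash_encode("Federal-gov", 2)           # vector length is fixed up front
```

Dummy encoding needs as many slots as there are distinct values, while the hashing trick lets us choose the vector length ourselves, at the cost of possible collisions when two values land in the same slot.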
Getting ready
To execute this recipe, you need to have a working Spark environment. You should have already gone through the Loading the data recipe, where we loaded the census data into a DataFrame.
No other prerequisites are required.
How to do it...
We will be reducing the dimensionality of our dataset roughly by half, so first we need to extract the total number of distinct values in each column:
len_ftrs = []

for col in cols_cat:
    (
        len_ftrs
        .append(
            (col
             , census
                 .select(col)
                 .distinct()
                 .count()
            )
...
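The counting step can be illustrated without Spark. The following is a small plain-Python sketch that builds the same kind of list of (column, distinct count) tuples from a toy dataset, and then derives a target size of about half the distinct count for each column. The rows, the names toy_cols, toy_len_ftrs, and halved, and the halving rule itself are assumptions made for this example; they show one possible reading of "roughly by half" and are not taken from the recipe's code.

```python
# Toy stand-in for the census DataFrame: a list of rows with hypothetical values
rows = [
    {"workclass": "Private", "sex": "Male"},
    {"workclass": "Private", "sex": "Female"},
    {"workclass": "Local-gov", "sex": "Male"},
    {"workclass": "Federal-gov", "sex": "Female"},
    {"workclass": "Self-emp", "sex": "Male"},
]
toy_cols = ["workclass", "sex"]

# Collect (column name, number of distinct values) pairs, one per column
toy_len_ftrs = []
for col in toy_cols:
    distinct_values = {row[col] for row in rows}
    toy_len_ftrs.append((col, len(distinct_values)))

# Assumed rule: aim for about half as many slots as distinct values, never fewer than one
halved = {col: max(1, n // 2) for col, n in toy_len_ftrs}
```

A column with many distinct values benefits most from this kind of reduction, since dummy encoding it directly would add one new column per value.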