Standardizing the data
Data standardization (often loosely called normalization) is important for a number of reasons:
- Some algorithms, such as those that rely on gradient descent, converge faster on standardized data
- If your input variables are on vastly different scales, the model's coefficients can be hard to interpret, and the conclusions you draw from them might be wrong
- For some models, such as k-means clustering or regularized regression, the optimal solution itself depends on the scale of the features, so skipping standardization can give you a genuinely worse model, as the short sketch after this list illustrates
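To see why the last point matters, consider a minimal sketch (plain NumPy, with made-up numbers, not drawn from the census data): two people described by income in dollars and age in years. The Euclidean distance between them is dominated almost entirely by income until both features are standardized:
import numpy as np

# Two made-up points: (income in dollars, age in years)
a = np.array([50000.0, 25.0])
b = np.array([52000.0, 65.0])

# On the raw scale the distance is dominated by income; the
# 40-year age difference barely registers
print(np.linalg.norm(a - b))  # ~2000.4

# Standardize each feature with (made-up) means and standard deviations
mean = np.array([51000.0, 45.0])
std = np.array([1000.0, 20.0])
a_std = (a - mean) / std
b_std = (b - mean) / std

# After standardization, both features contribute comparably
print(np.linalg.norm(a_std - b_std))  # ~2.83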
In this recipe, we will show you how to standardize data with MLlib so that, whenever your modeling project requires standardized input, you will know how to do it.
Getting ready
To execute this recipe, you need to have a working Spark environment. You should have already gone through the previous recipe, where we encoded the census data.
No other prerequisites are required.
How to do it...
MLlib's StandardScaler does most of this work for us. Even though the following code might look confusing at first, we will walk through it step by step:
import pyspark.mllib.feature as feat

# StandardScaler(withMean=True, withStd=True) centers every feature
# at zero and rescales it to unit standard deviation
standardizer = feat.StandardScaler(True, True)

# Fit the scaler on the feature vectors only; here we assume (as in the
# previous recipe) that final_data is an RDD of (label, features) pairs,
# so row[1] selects the feature vector
sModel = standardizer.fit(final_data.map(lambda row: row[1]))

# Apply the fitted model to obtain the standardized feature vectors
data_std = sModel.transform(final_data.map(lambda row: row[1]))
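At this point, it is worth sanity-checking the result. The following is a quick sketch (not part of the original recipe) that assumes final_data is the (label, features) RDD from the previous recipe and that you are on Spark 2.0 or later, where the fitted model exposes the mean and std properties it learned:
# Inspect the statistics the scaler learned (Spark 2.0+)
print(sModel.mean)  # per-feature means used for centering
print(sModel.std)   # per-feature standard deviations used for scaling

# Reattach the labels to the standardized vectors; zip() pairs elements
# one-to-one, which is safe here because both RDDs derive from final_data
# via map() and therefore keep the same partitioning and ordering
final_data_std = final_data.map(lambda row: row[0]).zip(data_std)
print(final_data_std.take(1))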