In this chapter, we move on to the currently supported machine learning module of PySpark: the ML module. Like MLlib, the ML module exposes a wide range of machine learning models, covering nearly all of the most commonly used (and usable) ones. Unlike MLlib, however, it operates on Spark DataFrames, which makes it considerably more performant because it can leverage the Tungsten execution optimizations.
This chapter covers the following recipes:
- Introducing Transformers
- Introducing Estimators
- Introducing Pipelines
- Selecting the most predictable features
- Predicting forest coverage types
- Estimating forest elevation
- Clustering forest cover types
- Tuning hyperparameters
- Extracting features from text
- Discretizing continuous variables
- Standardizing continuous variables
- Topic mining
In this chapter, we will use data we downloaded...