In this chapter, we will cover how to build machine learning models with PySpark's MLlib module. Although the RDD-based MLlib API is now in maintenance mode and most of its models are being moved to the DataFrame-based ML module, you can still use MLlib to do machine learning if you store your data in RDDs. You will learn the following recipes:
- Loading the data
- Exploring the data
- Testing the data
- Transforming the data
- Standardizing the data
- Creating an RDD for training
- Predicting hours of work for census respondents
- Forecasting the income level of census respondents
- Building a clustering model
- Computing performance statistics