You may want to train regression models on datasets far larger than a single machine can handle, beyond what scikit-learn supports. Apache Spark is a good candidate for this scenario. As we mentioned in the previous chapter, Apache Spark can easily run training algorithms on a cluster of machines through Elastic MapReduce (EMR) on AWS. We will explain how to set up EMR clusters in the next chapter. In this section, we'll explain how you can use the Spark ML library to train linear regression algorithms:
- The first step is to create a dataframe from our training data:
# sql is the SQLContext (or SparkSession) created earlier in the chapter
housing_df = sql.read.csv(SRC_PATH + 'train.csv', header=True, inferSchema=True)
The first few rows of the dataset can be inspected directly from the dataframe:
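A minimal sketch of this inspection, assuming the `housing_df` dataframe loaded above:

```python
# Display the first five rows of the dataframe in tabular form
housing_df.show(5)
```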
- Typically, Apache Spark requires the input dataset to have a single column with a vector of numbers...
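The usual way to build that single vector column in Spark ML is with the `VectorAssembler` transformer, whose output can then feed a `LinearRegression` estimator. The sketch below assumes the `housing_df` dataframe loaded earlier and a hypothetical label column named `price`; substitute the actual target column of `train.csv`:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# 'price' is a hypothetical label column; use the actual target in train.csv
label_col = 'price'
feature_cols = [c for c in housing_df.columns if c != label_col]

# Pack the individual feature columns into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
train_df = assembler.transform(housing_df)

# Train a linear regression model on the assembled features
lr = LinearRegression(featuresCol='features', labelCol=label_col)
lr_model = lr.fit(train_df)

# Inspect the learned coefficients and intercept
print(lr_model.coefficients, lr_model.intercept)
```

Keeping the assembler and the estimator as separate stages also makes it straightforward to chain them in a Spark ML `Pipeline` later on.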