Creating an RDD for training
Before we can train an ML model, we need to create an RDD where each element is a labeled point. In this recipe, we will use the final_data RDD we created in the previous recipe to prepare our RDD for training.
Getting ready
To execute this recipe, you need to have a working Spark environment. You should have already gone through the previous recipe, where we standardized the encoded census data.
No other prerequisites are required.
How to do it...
Many of the MLlib models require an RDD of labeled points for training. The next code snippets create such RDDs so that we can build classification and regression models.
Classification
Here's the snippet to create the classification RDD of labeled points that we will be using to predict whether someone is making more than $50,000:
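# reg is assumed to be an alias for the pyspark.mllib.regression module
import pyspark.mllib.regression as reg

# Each row becomes a labeled point: row[0] is the label, row[1:] are the features
final_data_income = (
    final_data
    .map(lambda row: reg.LabeledPoint(row[0], row[1:]))
)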
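To make the row-to-labeled-point split concrete, here is a minimal sketch in plain Python that runs without a Spark session. The LabeledRow class and the to_labeled helper are hypothetical stand-ins for LabeledPoint and the lambda above, and the sample rows are made up for illustration; they are not the census data.

```python
from typing import List, NamedTuple, Sequence


class LabeledRow(NamedTuple):
    """Hypothetical stand-in for a labeled point: one label plus a feature list."""
    label: float
    features: List[float]


def to_labeled(row: Sequence[float]) -> LabeledRow:
    # Mirrors the lambda in the recipe: row[0] is the label, row[1:] are the features.
    return LabeledRow(label=float(row[0]), features=[float(v) for v in row[1:]])


# Made-up rows for illustration only (label first, then three features).
sample_rows = [
    [1.0, 0.5, -1.2, 0.3],
    [0.0, -0.7, 0.4, 1.1],
]

# On an RDD this step would be final_data.map(...); here we map over a plain list.
labeled = list(map(to_labeled, sample_rows))

for point in labeled:
    print(point.label, point.features)
```

The same indexing applies to the real RDD: whichever column holds the target goes in as the label, and the remaining columns form the feature vector.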
Regression
Here's the snippet to create the regression RDD of labeled points that...