How do we predict customer churn with Spark?
Predicting customer churn in Apache Spark is similar to predicting any other binary outcome, and Spark provides a number of algorithms for this kind of classification. While we'll focus on Random Forest here, you can also look at the other algorithms available in the MLlib library. We'll follow the typical steps of building a machine learning pipeline that we discussed in the earlier MLlib chapter; a brief code sketch follows the stage list below.
The typical stages include:
- Stage 1: Loading data/defining schema
- Stage 2: Exploring/visualizing the data set
- Stage 3: Performing necessary transformations
- Stage 4: Feature engineering
- Stage 5: Model training
- Stage 6: Model evaluation
- Stage 7: Model monitoring
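To make these stages concrete, here is a minimal sketch in Scala of how stages 1 and 3 through 6 might be wired together with Spark's Pipeline API and a RandomForestClassifier. The file name, label column, and feature columns below are placeholders assumed for illustration; they are not necessarily the columns of the data set described next.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object ChurnPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("churn-prediction-sketch")
      .getOrCreate()

    // Stage 1: load the data and infer the schema.
    // "churn.csv" is a placeholder path, not the actual data set file.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("churn.csv")

    // Stages 3-4: index the string label and assemble numeric columns
    // into a single feature vector (column names are assumed).
    val labelIndexer = new StringIndexer()
      .setInputCol("churn")
      .setOutputCol("label")

    val assembler = new VectorAssembler()
      .setInputCols(Array("total_day_minutes", "total_eve_minutes", "customer_service_calls"))
      .setOutputCol("features")

    // Stage 5: train a Random Forest classifier on the assembled features.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)

    val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, rf))

    val Array(train, test) = raw.randomSplit(Array(0.8, 0.2), seed = 42)
    val model = pipeline.fit(train)

    // Stage 6: evaluate on the held-out split using area under the ROC curve.
    val predictions = model.transform(test)
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .evaluate(predictions)
    println(s"Test AUC = $auc")

    spark.stop()
  }
}
```

Bundling the indexer, assembler, and classifier into a single Pipeline means the same transformations are applied consistently at training and prediction time, which is what we will rely on in the later stages of this chapter.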
Data set description
Since we are targeting the telecom industry, we'll use one of the data sets popularly used for telecommunications demonstrations. It was originally published in Discovering Knowledge in Data (http://www.dataminingconsultant.com/DKD.htm)...