Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Scala Machine Learning Projects

You're reading from   Scala Machine Learning Projects Build real-world machine learning and deep learning projects with Scala

Arrow left icon
Product type Paperback
Published in Jan 2018
Publisher Packt
ISBN-13 9781788479042
Length 470 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Md. Rezaul Karim Md. Rezaul Karim
Author Profile Icon Md. Rezaul Karim
Md. Rezaul Karim
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Analyzing Insurance Severity Claims FREE CHAPTER 2. Analyzing and Predicting Telecommunication Churn 3. High Frequency Bitcoin Price Prediction from Historical and Live Data 4. Population-Scale Clustering and Ethnicity Prediction 5. Topic Modeling - A Better Insight into Large-Scale Texts 6. Developing Model-based Movie Recommendation Engines 7. Options Trading Using Q-learning and Scala Play Framework 8. Clients Subscription Assessment for Bank Telemarketing using Deep Neural Networks 9. Fraud Analytics Using Autoencoders and Anomaly Detection 10. Human Activity Recognition using Recurrent Neural Networks 11. Image Classification using Convolutional Neural Networks 12. Other Books You May Enjoy

GBT regressor for predicting insurance severity claims

In order to minimize a loss function, Gradient Boosting Trees (GBTs) iteratively train many decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance.

Then the raw predictions are compared with the true labels. Thus, in the next iteration, the decision tree will help correct previous mistakes if the dataset is re-labeled to put more emphasis on training instances with poor predictions.

Since we are talking about regression, it would be more meaningful to discuss the regression strength of GBTs and its losses computation. Suppose we have the following settings:

  • N data instances
  • yi = label of instance i
  • xi = features of instance i

Then the F(xi) function is the model's predicted label; for instance, it tries to minimize the error, that is, loss:

Now, similar to decision trees, GBTs also:

  • Handle categorical features (and of course numerical features too)
  • Extend to the multiclass classification setting
  • Perform both the binary classification and regression (multiclass classification is not yet supported)
  • Do not require feature scaling
  • Capture non-linearity and feature interactions, which are greatly missing in LR, such as linear models
Validation while training: Gradient boosting can overfit, especially when you have trained your model with more trees. In order to prevent this issue, it is useful to validate while carrying out the training.

Since we have already prepared our dataset, we can directly jump into implementing a GBT-based predictive model for predicting insurance severity claims. Let's start with importing the necessary packages and libraries:

import org.apache.spark.ml.regression.{GBTRegressor, GBTRegressionModel} 
import org.apache.spark.ml.{Pipeline, PipelineModel} 
import org.apache.spark.ml.evaluation.RegressionEvaluator 
import org.apache.spark.ml.tuning.ParamGridBuilder 
import org.apache.spark.ml.tuning.CrossValidator 
import org.apache.spark.sql._ 
import org.apache.spark.sql.functions._ 
import org.apache.spark.mllib.evaluation.RegressionMetrics 

Now let's define and initialize the hyperparameters needed to train the GBTs, such as the number of trees, number of max bins, number of folds to be used during cross-validation, number of maximum iterations to iterate the training, and finally max tree depth:

val NumTrees = Seq(5, 10, 15) 
val MaxBins = Seq(5, 7, 9) 
val numFolds = 10 
val MaxIter: Seq[Int] = Seq(10) 
val MaxDepth: Seq[Int] = Seq(10) 

Then, again we instantiate a Spark session and implicits as follows:

val spark = SparkSessionCreate.createSession() 
import spark.implicits._ 

Now that we care an estimator algorithm, that is, GBT:

val model = new GBTRegressor()
.setFeaturesCol("features")
.setLabelCol("label")

Now, we build the pipeline by chaining the transformations and predictor together as follows:

val pipeline = new Pipeline().setStages((Preproessing.stringIndexerStages :+ Preproessing.assembler) :+ model) 

Before we start performing the cross-validation, we need to have a paramgrid. So let's start creating the paramgrid by specifying the number of maximum iteration, max tree depth, and max bins as follows:

val paramGrid = new ParamGridBuilder() 
      .addGrid(model.maxIter, MaxIter) 
      .addGrid(model.maxDepth, MaxDepth) 
      .addGrid(model.maxBins, MaxBins) 
      .build() 

Now, for a better and stable performance, let's prepare the K-fold cross-validation and grid search as a part of model tuning. As you can guess, I am going to perform 10-fold cross-validation. Feel free to adjust the number of folds based on you settings and dataset:

println("Preparing K-fold Cross Validation and Grid Search") 
val cv = new CrossValidator() 
      .setEstimator(pipeline) 
      .setEvaluator(new RegressionEvaluator) 
      .setEstimatorParamMaps(paramGrid) 
      .setNumFolds(numFolds) 

Fantastic, we have created the cross-validation estimator. Now it's time to train the GBT model:

println("Training model with GradientBoostedTrees algorithm ") 
val cvModel = cv.fit(Preproessing.trainingData) 

Now that we have the fitted model, that means it is now capable of making predictions. So let's start evaluating the model on the train and validation set, and calculating RMSE, MSE, MAE, R-squared, and so on:

println("Evaluating model on train and test data and calculating RMSE") 
val trainPredictionsAndLabels = cvModel.transform(Preproessing.trainingData).select("label", "prediction").map { case Row(label: Double, prediction: Double) => (label, prediction) }.rdd 

val validPredictionsAndLabels = cvModel.transform(Preproessing.validationData).select("label", "prediction").map { case Row(label: Double, prediction: Double) => (label, prediction) }.rdd val trainRegressionMetrics = new RegressionMetrics(trainPredictionsAndLabels) val validRegressionMetrics = new RegressionMetrics(validPredictionsAndLabels)

Great! We have managed to compute the raw prediction on the train and the test set. Let's hunt for the best model:

val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel] 

As already stated, by using GBT it is possible to measure feature importance so that at a later stage we can decide which features are to be used and which ones are to be dropped from the DataFrame. Let's find the feature importance of the best model we just created previously, for all features in ascending order as follows:

val featureImportances = bestModel.stages.last.asInstanceOf[GBTRegressionModel].featureImportances.toArray 
val FI_to_List_sorted = featureImportances.toList.sorted.toArray  

Once we have the best fitted and cross-validated model, we can expect good prediction accuracy. Now let's observe the results on the train and the validation set:

val output = "n=====================================================================n" + s"Param trainSample: ${Preproessing.trainSample}n" + 
      s"Param testSample: ${Preproessing.testSample}n" + 
      s"TrainingData count: ${Preproessing.trainingData.count}n" + 
      s"ValidationData count: ${Preproessing.validationData.count}n" + 
      s"TestData count: ${Preproessing.testData.count}n" +      "=====================================================================n" +   s"Param maxIter = ${MaxIter.mkString(",")}n" + 
      s"Param maxDepth = ${MaxDepth.mkString(",")}n" + 
      s"Param numFolds = ${numFolds}n" +      "=====================================================================n" +   s"Training data MSE = ${trainRegressionMetrics.meanSquaredError}n" + 
      s"Training data RMSE = ${trainRegressionMetrics.rootMeanSquaredError}n" + 
      s"Training data R-squared = ${trainRegressionMetrics.r2}n" + 
      s"Training data MAE = ${trainRegressionMetrics.meanAbsoluteError}n" + 
      s"Training data Explained variance = ${trainRegressionMetrics.explainedVariance}n" +      "=====================================================================n" +    s"Validation data MSE = ${validRegressionMetrics.meanSquaredError}n" + 
      s"Validation data RMSE = ${validRegressionMetrics.rootMeanSquaredError}n" + 
      s"Validation data R-squared = ${validRegressionMetrics.r2}n" + 
      s"Validation data MAE = ${validRegressionMetrics.meanAbsoluteError}n" + 
      s"Validation data Explained variance = ${validRegressionMetrics.explainedVariance}n" +      "=====================================================================n" +   s"CV params explained: ${cvModel.explainParams}n" + 
      s"GBT params explained: ${bestModel.stages.last.asInstanceOf[GBTRegressionModel].explainParams}n" + s"GBT features importances:n ${Preproessing.featureCols.zip(FI_to_List_sorted).map(t => s"t${t._1} = ${t._2}").mkString("n")}n" +      "=====================================================================n" 

Now, we print the preceding results as follows:

println(results)
>>> ===================================================================== Param trainSample: 1.0 Param testSample: 1.0 TrainingData count: 141194 ValidationData count: 47124 TestData count: 125546 ===================================================================== Param maxIter = 10 Param maxDepth = 10 Param numFolds = 10 ===================================================================== Training data MSE = 2711134.460296872 Training data RMSE = 1646.5522950385973 Training data R-squared = 0.4979619968485668 Training data MAE = 1126.582534126603 Training data Explained variance = 8336528.638733303 ===================================================================== Validation data MSE = 4796065.983773314 Validation data RMSE = 2189.9922337244293 Validation data R-squared = 0.13708582379658474 Validation data MAE = 1289.9808960385383 Validation data Explained variance = 8724866.468978886 ===================================================================== CV params explained: estimator: estimator for selection (current: pipeline_9889176c6eda) estimatorParamMaps: param maps for the estimator (current: [Lorg.apache.spark.ml.param.ParamMap;@87dc030) evaluator: evaluator used to select hyper-parameters that maximize the validated metric (current: regEval_ceb3437b3ac7) numFolds: number of folds for cross validation (>= 2) (default: 3, current: 10) seed: random seed (default: -1191137437) GBT params explained: cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. (default: false) checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations (default: 10) featuresCol: features column name (default: features, current: features) impurity: Criterion used for information gain calculation (case-insensitive). Supported options: variance (default: variance) labelCol: label column name (default: label, current: label) lossType: Loss function which GBT tries to minimize (case-insensitive). Supported options: squared, absolute (default: squared) maxBins: Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature. (default: 32) maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 5, current: 10) maxIter: maximum number of iterations (>= 0) (default: 20, current: 10) maxMemoryInMB: Maximum memory in MB allocated to histogram aggregation. (default: 256) minInfoGain: Minimum information gain for a split to be considered at a tree node. (default: 0.0) minInstancesPerNode: Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1. (default: 1) predictionCol: prediction column name (default: prediction) seed: random seed (default: -131597770) stepSize: Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator. (default: 0.1) subsamplingRate: Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0) GBT features importance: idx_cat1 = 0.0 idx_cat2 = 0.0 idx_cat3 = 0.0 idx_cat4 = 3.167169394850417E-5 idx_cat5 = 4.745749854188828E-5 ... idx_cat111 = 0.018960701085054904 idx_cat114 = 0.020609596772820878 idx_cat115 = 0.02281267960792931 cont1 = 0.023943087007850663 cont2 = 0.028078353534251005 ... cont13 = 0.06921704925937068 cont14 = 0.07609111789104464 =====================================================================

So our predictive model shows an MAE of about 1126.582534126603 and 1289.9808960385383 for the training and test sets respectively. The last result is important for understanding the feature importance (the preceding list is abridged to save space but you should receive the full list). Especially, we can see that the first three features are not important at all so we can safely drop them from the DataFrame. We will provide more insight in the next section.

Now finally, let us run the prediction over the test set and generate the predicted loss for each claim from the clients:

println("Run prediction over test dataset") 
cvModel.transform(Preproessing.testData) 
      .select("id", "prediction") 
      .withColumnRenamed("prediction", "loss") 
      .coalesce(1) 
      .write.format("com.databricks.spark.csv") 
      .option("header", "true") 
      .save("output/result_GBT.csv") 

The preceding code should generate a CSV file named result_GBT.csv. If we open the file, we should observe the loss against each ID, that is, claim. We will see the contents for both LR, RF, and GBT at the end of this chapter. Nevertheless, it is always a good idea to stop the Spark session by invoking the spark.stop() method.

You have been reading a chapter from
Scala Machine Learning Projects
Published in: Jan 2018
Publisher: Packt
ISBN-13: 9781788479042
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image