Predicting forest coverage types
In this recipe, we will learn how to process data and build two classification models that predict the forest coverage type: a benchmark logistic regression model and a random forest classifier. The problem at hand is multinomial; that is, there are more than two classes into which we want to classify our observations.
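To make the term multinomial more concrete, here is a minimal pure-Python sketch of how a multinomial logistic regression turns one raw score per class into class probabilities (the softmax function) and then picks the most probable class. The scores, the four-class setup, and the helper names softmax and predict_class are hypothetical; they are not taken from the forest data or from Spark's implementation.

```python
import math


def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the class with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for a single observation across four classes
example_scores = [1.2, 0.3, 2.5, -0.7]
example_probs = softmax(example_scores)
example_label = predict_class(example_scores)
```

With two classes this reduces to the familiar binary logistic regression; with more classes, each observation gets a full probability distribution over all the classes.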
Getting ready
To execute this recipe, you need a working Spark environment, and the data should already be loaded into the forest DataFrame.
No other prerequisites are required.
How to do it...
Here's the code that will help us build the logistic regression model:
forest_train, forest_test = (
    forest
    .randomSplit([0.7, 0.3], seed=666)
)

vectorAssembler = feat.VectorAssembler(
    inputCols=forest.columns[0:-1]
    , outputCol='features'
)

selector = feat.ChiSqSelector(
    labelCol='CoverType'
    , numTopFeatures=10
    , outputCol='selected'
)

logReg_obj = cl.LogisticRegression(
    labelCol='CoverType'
    ...
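The listing starts with a seeded 70/30 random split. As a rough illustration of what such a split means, the following pure-Python sketch assigns each row independently to one of the splits according to the normalized weights. This is an assumption-laden simplification, not Spark's actual implementation; the function name random_split and the toy data are hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split, in proportion to weights."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    total = sum(weights)

    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits


# Toy data standing in for the rows of a DataFrame
toy_rows = list(range(1000))
train, test = random_split(toy_rows, [0.7, 0.3], seed=666)
```

Because every row is drawn independently, the resulting sizes are only approximately 70% and 30% of the data, while reusing the same seed reproduces the same split.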