ML pipelines for real-life machine learning applications
This is the first of two recipes that cover the ML pipeline in Spark 2.0. For a more advanced treatment of ML pipelines, with additional details such as the API and parameter extraction, see later chapters in this book.
In this recipe, we build a single pipeline that tokenizes text, uses HashingTF (the long-established hashing trick) to map terms to term-frequency vectors, fits a logistic regression model, and then predicts which group a new piece of text belongs to (for example, news filtering, gesture classification, and so on).
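Before walking through the steps, the overall shape of such a pipeline can be sketched as follows. This is a minimal illustration rather than the recipe's own listing: the toy sentences, the column names (`text`, `words`, `features`, `label`), the object name `PipelineSketch`, the helper `buildPipeline`, and the parameter values (`numFeatures`, `maxIter`, `regParam`) are all assumptions chosen for the example, and it assumes Spark 2.x running in local mode.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {

  // Chains the three stages: tokenize -> hash term frequencies -> logistic regression.
  // All parameter values here are illustrative assumptions, not tuned settings.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PipelineSketch")
      .getOrCreate()

    // Hypothetical labeled training data: (id, text, label).
    val training = spark.createDataFrame(Seq(
      (0L, "spark ml pipeline", 1.0),
      (1L, "weather today", 0.0),
      (2L, "spark streaming job", 1.0),
      (3L, "football scores", 0.0)
    )).toDF("id", "text", "label")

    // Fitting the pipeline runs every stage in order and returns a PipelineModel.
    val model = buildPipeline().fit(training)

    // Hypothetical unlabeled documents to classify.
    val test = spark.createDataFrame(Seq(
      (4L, "spark cluster"),
      (5L, "rainy weather")
    )).toDF("id", "text")

    // The fitted model applies the same tokenizing and hashing before predicting.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .show(false)

    spark.stop()
  }
}
```

The point of bundling the stages is that the fitted model carries the feature extraction with it, so new text passes through exactly the same tokenizing and hashing as the training data.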
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
...