Creating and using Datasets from RDDs and back again
In this recipe, we explore how RDDs and Datasets interact when building a multi-stage machine learning pipeline. Even though the Dataset (conceptually, an RDD with strong type safety) is the way forward, you still need to interoperate with machine learning algorithms or code that returns or operates on RDDs, whether for legacy or implementation reasons. In this recipe, we also explore how to convert from a Dataset to an RDD and back again.
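The round trip described above can be sketched as follows. This is a minimal, self-contained example, not the recipe's full pipeline; the Car case class, its fields, and the local master are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object RddDatasetRoundTrip {
  // Hypothetical record type for illustration only
  case class Car(make: String, mpg: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")          // assumption: local run, not a cluster
      .appName("RddDatasetRoundTrip")
      .getOrCreate()
    import spark.implicits._       // brings in the Encoder needed by toDS()

    // Start from an RDD, as legacy code often does
    val rdd = spark.sparkContext.parallelize(
      Seq(Car("volt", 40.0), Car("prius", 50.0)))

    // RDD -> Dataset: toDS() requires an implicit Encoder for Car
    val ds = rdd.toDS()

    // Dataset -> RDD: the .rdd accessor hands back the underlying RDD
    val backToRdd = ds.rdd

    println(backToRdd.count())     // prints 2
    spark.stop()
  }
}
```

The key moving parts are the implicit Encoder (supplied by `import spark.implicits._` for case classes) going one way, and the `.rdd` accessor going the other.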
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3
- Import the necessary packages: SparkSession to get access to the cluster, and Log4j Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
- Define a Scala case class to model data for processing...
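Such a case class might look like the following; the name and fields are hypothetical placeholders, to be replaced with whatever matches your data:

```scala
// Hypothetical schema for illustration; substitute your own fields.
// A case class gives the Dataset its strongly typed schema, and Spark
// can derive an Encoder for it automatically via spark.implicits._
case class Car(make: String, model: String, price: Double)
```

Keeping the case class at the top level (outside any method) avoids encoder-derivation issues in the Scala shell and in some Spark versions.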