Working with the Dataset API using a Scala Sequence
In this recipe, we examine the new Dataset API and how it works with the Scala Seq data structure. We often see a relationship between the LabeledPoint data structure used with the ML libraries and a Scala sequence (that is, the Seq data structure), which plays nicely with Dataset.
The Dataset is being positioned as a unifying API going forward. It is important to note that DataFrame is still available as an alias for Dataset[Row]. We have covered the SQL examples extensively in the DataFrame recipes, so here we concentrate on other variations of Dataset.
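To make the relationship between a Scala Seq, a typed Dataset, and a DataFrame concrete, here is a minimal sketch. It is not the recipe's own code: the `Sample` case class, the object name, and the local master setting are assumptions chosen for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Sample(label: Double, feature: Double)

object SeqToDatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SeqToDatasetSketch")
      .getOrCreate()

    // Brings the encoders and the toDS()/toDF() conversions into scope.
    import spark.implicits._

    // An ordinary Scala sequence of case class instances.
    val samples: Seq[Sample] =
      Seq(Sample(1.0, 0.5), Sample(0.0, 1.5), Sample(1.0, 2.5))

    // Seq -> typed Dataset; the schema is derived from the case class fields.
    val ds: Dataset[Sample] = samples.toDS()

    // Typed transformation: the lambda works on Sample objects, not on Rows.
    val positives: Dataset[Sample] = ds.filter((s: Sample) => s.label == 1.0)

    // DataFrame is a type alias for Dataset[Row], so this conversion only
    // drops the static element type.
    val df: DataFrame = ds.toDF()

    ds.printSchema()
    positives.show()
    df.show()

    spark.stop()
  }
}
```

The case class is declared at the top level so that Spark can derive an encoder for it. The same `toDS()` pattern applies to any Seq whose element type has an encoder available.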
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3
- Import the necessary packages for the Spark session to gain access to the cluster, and Log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger...