Common operations with the new Dataset API
In this recipe, we cover the Dataset API, which is the way forward for data wrangling in Spark 2.0 and beyond. In Chapter 3, Spark's Three Data Musketeers for Machine Learning - Perfect Together, we covered three detailed recipes for Datasets; in this chapter, we cover some of the common, repetitive operations required to work with this new API. Additionally, we demonstrate the query plan generated by the Spark SQL Catalyst optimizer.
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- We will use a JSON data file named cars.json, which has been created for this example:
name,city
Bears,Chicago
Packers,Green Bay
Lions,Detroit
Vikings,Minnesota
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for the Spark session to get access to the cluster and log4j.Logger to reduce the amount of output produced...
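The package and import steps above could look like the following minimal sketch. It assumes a local Spark 2.x setup with the log4j 1.x-style API available on the classpath; the object name DatasetSetupSketch, the local[*] master, and the choice of the "org" and "akka" loggers are illustrative assumptions, not taken from the recipe.

```scala
package spark.ml.cookbook.chapter4

// log4j 1.x-style API, assumed to be available through Spark's dependencies
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch
object DatasetSetupSketch {
  def main(args: Array[String]): Unit = {
    // Log only errors from the "org" and "akka" logger hierarchies
    // so that the console output stays readable
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    // The SparkSession is the entry point to the Dataset API;
    // local[*] runs Spark in-process using all available cores (an assumption)
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("DatasetSetupSketch")
      .getOrCreate()

    println(s"Spark version: ${spark.version}")

    // Release resources when done
    spark.stop()
  }
}
```

Setting the log level before the session is created keeps most of Spark's startup messages out of the console as well.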