Creating DataFrames from Scala data structures
In this recipe, we explore the DataFrame API, which provides a higher level of abstraction than RDDs for working with data. The API is similar to the data frame facilities found in R and in Python's pandas library.
The DataFrame API simplifies coding and lets you use standard SQL to retrieve and manipulate data. Spark keeps additional metadata about each DataFrame, which helps the API manipulate frames with ease. Every DataFrame has a schema (either inferred from the data or explicitly defined), which allows us to view the frame like a SQL table. The secret sauce of Spark SQL and DataFrames is the Catalyst optimizer, which works behind the scenes to optimize access by rearranging the calls in the pipeline.
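To make schema inference and SQL access concrete before the steps begin, here is a minimal, self-contained sketch; the object name, session settings, and sample data are illustrative assumptions, not code from this recipe:

import org.apache.spark.sql.SparkSession

object DataFramePreview {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are illustrative
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("DataFramePreview")
      .getOrCreate()

    import spark.implicits._ // enables toDF() on Scala collections

    // Spark infers the schema (name: string, age: int) from the tuple types
    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    df.printSchema()

    // Registering a temp view lets standard SQL query the DataFrame
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 26").show()

    spark.stop()
  }
}

Running this prints the inferred schema and then the names of the rows matching the SQL predicate, showing both halves of the claim above: a schema derived from Scala types, and standard SQL over the result.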
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary Spark JAR files are on the project classpath.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3
- Set up the imports related to DataFrames and the required...
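A typical set of imports for working with DataFrames in Spark (an illustrative assumption rather than the recipe's exact list) looks like this:

// Entry point for the DataFrame API (assumed; standard in Spark 2.x+)
import org.apache.spark.sql.SparkSession
// Only needed if the schema is defined explicitly rather than inferred
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row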