Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
In this recipe, we explore the differences in creating an RDD, a DataFrame, and a Dataset from a text file, and their relationship to each other, via a short code sample:
- Dataset: spark.read.textFile()
- RDD: spark.sparkContext.textFile()
- DataFrame: spark.read.text()
Note
Assume spark is the session name.
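To make the contrast concrete, here is a minimal sketch of all three calls side by side, with their result types. It assumes the spark session already exists and uses a hypothetical input file path:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}

val path = "../data/sparkml2/chapter4/beatles.txt" // hypothetical path

val ds: Dataset[String] = spark.read.textFile(path)      // typed Dataset of lines
val rdd: RDD[String] = spark.sparkContext.textFile(path) // low-level RDD of lines
val df: DataFrame = spark.read.text(path)                // untyped DataFrame with a single "value" column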
How to do it...
- Start a new project in IntelliJ or an IDE of your choice. Make sure the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for the Spark session to gain access to the cluster, and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
- We also define a case class to host the data used:
case class Beatle(id: Long, name: String)
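As a hedged sketch (the "id,name" line format and the parsing logic are assumptions, not part of the recipe so far), the case class lets us turn the Dataset of raw lines into a typed Dataset[Beatle]:

import org.apache.spark.sql.Dataset
import spark.implicits._ // encoders for case classes

// Assumes each line looks like "1,John"
val beatles: Dataset[Beatle] = spark.read.textFile("../data/sparkml2/chapter4/beatles.txt")
  .map { line =>
    val Array(id, name) = line.split(",")
    Beatle(id.trim.toLong, name.trim)
  }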
- Set the output level to ERROR to reduce Spark's logging output:
Logger.getLogger("org").setLevel(Level.ERROR)
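For the snippets above to run, a SparkSession named spark must already exist. A minimal sketch, assuming local execution (the app name is illustrative, not the recipe's):

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("DatasetVsRDDVsDataFrame") // hypothetical app name
  .getOrCreate()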