Analytics with the Dataset API
Datasets are similar to RDDs; however, instead of using Java or Kryo serialization, they use a specialized Encoder to serialize objects for processing or for transmission over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are generated dynamically at runtime and use a format that lets Spark perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into an object. Source: https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets.
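To make the distinction concrete, the following sketch (the Person class is a hypothetical example; it is meant to be run in the spark-shell, where the spark session and its implicits are predefined) obtains the generated encoder for a case class explicitly and shows a Dataset operation that runs against the encoded form:

scala> case class Person(name: String, age: Int)

scala> import org.apache.spark.sql.Encoders
scala> // An encoder is generated at runtime for the case class; its
scala> // schema mirrors the class fields.
scala> val personEncoder = Encoders.product[Person]
scala> println(personEncoder.schema)   // prints the inferred schema

scala> import spark.implicits._
scala> val people = Seq(Person("Ann", 34), Person("Bob", 28)).toDS()
scala> // The filter below can operate on the encoded representation;
scala> // rows are deserialized back into Person objects only when needed.
scala> people.filter($"age" > 30).show()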
Creating Datasets
The following Scala example creates a Dataset and a DataFrame from an RDD. Enter the Scala shell with the spark-shell command:
scala> case class Dept(dept_id: Int, dept_name: String)
defined class Dept

scala> val deptRDD = sc.makeRDD(Seq(Dept(1,"Sales"),Dept(2,"HR")))
deptRDD: org.apache.spark.rdd.RDD[Dept] = ParallelCollectionRDD[0] at makeRDD at <console>:26

scala> val...
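The transcript above is truncated. A plausible continuation (a sketch based on standard Spark behavior, not the original output) converts the RDD to a Dataset with toDS and to a DataFrame with toDF; in the spark-shell, the required spark.implicits._ are already in scope:

scala> val deptDS = deptRDD.toDS()
deptDS: org.apache.spark.sql.Dataset[Dept] = [dept_id: int, dept_name: string]

scala> val deptDF = deptRDD.toDF()
deptDF: org.apache.spark.sql.DataFrame = [dept_id: int, dept_name: string]

scala> deptDS.show()
+-------+---------+
|dept_id|dept_name|
+-------+---------+
|      1|    Sales|
|      2|       HR|
+-------+---------+

Note that toDS preserves the Dept type, so compile-time checked operations such as deptDS.filter(_.dept_id == 1) remain available, whereas the DataFrame exposes the same data as untyped Rows.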