Dataset API
Dataset is a newer interface, added in Spark 1.6. It is a distributed collection of data. The Dataset API is available in Java and Scala, but not in Python or R. It combines the benefits of Resilient Distributed Datasets (RDDs), such as strong typing and the ability to use lambda functions, with Spark SQL’s optimized execution engine for faster queries.
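To make the strong typing concrete, here is a minimal Scala sketch. It assumes a local Spark installation; the `Person` case class, the sample rows, and the application name are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real job would configure this differently.
    val spark = SparkSession.builder()
      .appName("DatasetSketch")
      .master("local[*]")
      .getOrCreate()

    // Brings in the encoders needed to turn a Seq of case classes into a Dataset.
    import spark.implicits._

    // A Dataset[Person]: every element is a typed Person object.
    val people = Seq(Person("Ana", 34), Person("Raj", 19)).toDS()

    // Typed lambda: p is a Person, so a misspelled field name or a
    // type mismatch on p.age is rejected by the compiler.
    val adults = people.filter(p => p.age >= 21)

    adults.show()
    spark.stop()
  }
}
```

Because Python is dynamically typed, this kind of compile-time check has no direct counterpart in PySpark, where mistakes such as a misspelled column name surface only when the code runs.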
Since much of the data engineering and data science community is already familiar with Python and uses it extensively in production data architectures, PySpark provides the DataFrame API as the equivalent interface for Python users. Let’s take a look at it in the next section.