When to use RDDs, Datasets, and DataFrames?
The following table summarizes which API (RDDs, Datasets, or DataFrames) suits each scenario:
Scenario | What to use? |
---|---|
Python programming language | RDDs or DataFrames |
R programming language | DataFrames |
Java or Scala programming language | RDDs, Datasets, or DataFrames |
Unstructured data such as images and videos | RDDs |
Low-level transformations and actions, with programmatic control over data flow | RDDs |
High-level domain-specific APIs | Datasets and DataFrames |
Functional programming constructs to process data | RDDs |
Higher-level expressions, including SQL | Datasets and DataFrames |
Neither an imposed structure nor low-level optimizations are needed | RDDs |
High compile-time safety and rich optimizations | Datasets |
Compile-time safety is not needed, but rich optimizations are | DataFrames |
Unification across Spark libraries is needed | Datasets or DataFrames |
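
As a rough illustration, a few rows of the table can be encoded as a small lookup. This is a minimal sketch in plain Python, not part of Spark: the names `APIS_BY_LANGUAGE` and `recommend_api` and the two boolean parameters are hypothetical, and only the language, unstructured-data, and compile-time-safety rows are modeled.

```python
# Hypothetical helper (not a Spark API): encodes a few rows of the table.

# Language rows: which APIs each language can use, per the table.
APIS_BY_LANGUAGE = {
    "python": {"RDD", "DataFrame"},
    "r": {"DataFrame"},
    "java": {"RDD", "Dataset", "DataFrame"},
    "scala": {"RDD", "Dataset", "DataFrame"},
}


def recommend_api(language, structured=True, needs_compile_time_safety=False):
    """Return the API the table suggests for a simplified scenario."""
    available = APIS_BY_LANGUAGE[language.lower()]

    if not structured:
        wanted = "RDD"        # unstructured data such as images and videos
    elif needs_compile_time_safety:
        wanted = "Dataset"    # high compile-time safety and rich optimizations
    else:
        wanted = "DataFrame"  # rich optimizations without compile-time safety

    if wanted not in available:
        raise ValueError(f"{wanted} is not available in {language}")
    return wanted


print(recommend_api("scala", needs_compile_time_safety=True))  # Dataset
print(recommend_api("python"))                                  # DataFrame
print(recommend_api("python", structured=False))                # RDD
```

Raising an error for combinations the table rules out (for example, Datasets from Python) keeps the sketch from silently suggesting an API that the chosen language does not offer.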