Transforming data by using Apache Spark
Apache Spark supports transformations through three different application programming interfaces (APIs): Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. We will learn about RDD and DataFrame transformations in this chapter. Datasets are an extension of DataFrames that adds compile-time type safety (the compiler strictly checks data types) and an object-oriented (OO) interface.
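To make that distinction concrete, here is a minimal Scala sketch; the Trip case class, its fields, and the sample values are illustrative assumptions rather than examples from the chapter. A DataFrame query that references a misspelled column fails only at runtime, whereas the same mistake against a typed Dataset is caught by the compiler.

```scala
// Minimal sketch contrasting an untyped DataFrame with a typed Dataset.
// The Trip case class, column names, and values are illustrative assumptions.
import org.apache.spark.sql.SparkSession

case class Trip(id: Long, distanceKm: Double)

object DatasetVsDataFrame {
  def main(args: Array[String]): Unit = {
    // In a Synapse, Databricks, or HDInsight notebook, a SparkSession named
    // spark already exists; getOrCreate simply returns it.
    val spark = SparkSession.builder().appName("DatasetVsDataFrame").getOrCreate()
    import spark.implicits._

    // DataFrame: untyped rows; a wrong column name only fails at runtime.
    val df = Seq((1L, 12.5), (2L, 3.2)).toDF("id", "distanceKm")
    df.filter($"distanceKm" > 5).show()

    // Dataset[Trip]: typed; the compiler checks field names and types.
    val ds = Seq(Trip(1L, 12.5), Trip(2L, 3.2)).toDS()
    ds.filter(trip => trip.distanceKm > 5).show()
    // ds.filter(trip => trip.distanceKilometers > 5)  // would not compile

    spark.stop()
  }
}
```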
The information in this section applies to all flavors of Spark available on Azure: Synapse Spark, Azure Databricks Spark, and HDInsight Spark.
What are RDDs?
RDDs are immutable, fault-tolerant collections of data objects that can be operated on in parallel by Spark. They are the most fundamental data structure that Spark operates on. RDDs support a wide variety of data formats, such as JSON, comma-separated values (CSV), Parquet, and so on.
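As a concrete illustration of these properties, the following minimal Scala sketch distributes a small in-memory collection as an RDD; the sample numbers and the commented-out file path are assumptions for illustration only. Transformations such as map never modify the source RDD but return a new one, whose recorded lineage lets Spark recompute lost partitions, while actions such as reduce execute the work in parallel across partitions.

```scala
// Minimal sketch of RDD behavior; the data and the commented-out file path
// are illustrative assumptions, not taken from the chapter.
import org.apache.spark.sql.SparkSession

object RddBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddBasics").getOrCreate()
    val sc = spark.sparkContext  // predefined as sc in Azure Spark notebooks

    // Distribute an in-memory collection across the cluster as an RDD.
    val numbers = sc.parallelize(1 to 10)

    // Transformations return a new, immutable RDD; numbers is left unchanged.
    val doubled = numbers.map(_ * 2)

    // Actions trigger the parallel computation across partitions.
    println(doubled.reduce(_ + _))

    // RDDs can also be built from files, for example CSV read as text lines:
    // val lines = sc.textFile("abfss://data@account.dfs.core.windows.net/trips.csv")

    spark.stop()
  }
}
```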
Creating RDDs
There are many ways to create an RDD. Here is...