Programming in PySpark
This section provides a quick introduction to programming with Python in Spark. We will start with the basic data structures in Spark.
The Resilient Distributed Dataset (RDD) is the primary data structure in Spark. It is a distributed collection of objects and has the following three main features:
- Resilient: When any node fails, affected partitions will be reassigned to healthy nodes, which makes Spark fault-tolerant
- Distributed: Data resides on one or more nodes in a cluster, which can be operated on in parallel
- Dataset: This contains a collection of partitioned data with their values or metadata
RDD was the main data structure in Spark before version 2.0. Since then, the DataFrame has become the primary interface, although RDDs remain available as a lower-level API. A DataFrame is also a distributed collection of data, but it is organized into named columns and utilizes the optimized execution engine of Spark SQL. It is therefore conceptually similar to a table in a relational database.