Apache Spark core
The resilient distributed dataset (RDD) is the core data structure of the Apache Spark architecture. RDDs store data that is partitioned and distributed across multiple processors and servers, so operations can be executed concurrently on each partition.
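The idea of partitioned data processed concurrently can be illustrated with a minimal sketch in plain Python. This is a toy model, not Spark's API: the helper names `partition`, `map_partitions`, and `collect` are hypothetical, and a local thread pool stands in for the executors of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def map_partitions(chunks, fn):
    """Apply fn to every element, submitting one task per partition."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(lambda chunk: [fn(x) for x in chunk], chunks))


def collect(chunks):
    """Concatenate the partitions back into a single list, preserving order."""
    return [x for chunk in chunks for x in chunk]


data = list(range(10))
parts = partition(data, 3)                        # hypothetical: 3 partitions
squared = map_partitions(parts, lambda x: x * x)  # per-partition transformation
result = collect(squared)                         # gather results on the "driver"
print(parts)
print(result)
```

Each partition is transformed independently of the others, which is what allows the work to be spread across workers. An actual RDD additionally tracks how each partition was derived so that a lost partition can be recomputed, which this sketch does not attempt to model.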
Data frames were added later to extend RDDs with SQL functionality. The original Apache Spark machine learning library, MLlib, is built on RDDs and operates at a lower API level. The more recent ML library builds on data frames and allows data scientists to describe transformations and actions using SQL.
Note
Deprecation of the RDD-based API for MLlib
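The RDD-based classes and methods in MLlib moved to maintenance mode in Spark 2.0 and were expected to be removed in Spark 3.0. The removal did not take place: the RDD-based API is still available, in maintenance mode, in the Spark 3.x releases.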
Why Spark?
The introduction of the Hadoop ecosystem more than 10 years ago opened the door to large-scale data processing and analytics. The Hadoop framework relies on a very effective distributed filesystem, HDFS, suitable for processing a large number of files containing sequential data. However, this reliance on the distributed...