Before we get going on creating any kind of pipeline, we should take a minute to familiarize ourselves with what Spark is and what it offers us.
Spark is an open source engine designed for large-scale data processing, built for both speed and ease of use.
Through its advanced Directed Acyclic Graph (DAG) execution engine, which supports acyclic data flow and in-memory computing, Spark programs can run up to 100 times faster than Hadoop MapReduce when data fits in memory, or up to 10 times faster on disk.
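To get a feel for what in-memory computing looks like in practice, here is a minimal PySpark sketch (assuming the `pyspark` package is installed and a local Spark is available; the application name and the numbers are placeholders). A distributed dataset is cached the first time an action computes it, so the second action reads the results from memory instead of recomputing them.

```python
# Minimal PySpark sketch of in-memory computing: the squared values are
# cached after the first action, so the second action reuses them from
# memory rather than recomputing the whole lineage.
from pyspark import SparkContext

sc = SparkContext("local[*]", "in-memory-demo")   # hypothetical app name

numbers = sc.parallelize(range(1000000))          # a distributed collection of numbers
squares = numbers.map(lambda n: n * n).cache()    # mark the results to be kept in memory

print(squares.count())   # first action: computes the data and fills the cache
print(squares.sum())     # second action: served from the in-memory cache

sc.stop()
```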
Spark consists of the following components:
- Spark Core: This is the underlying engine of Spark, built around its fundamental programming abstraction, the Resilient Distributed Dataset (RDD). An RDD is a fault-tolerant collection of objects partitioned across the nodes of the cluster, which Spark can operate on in parallel.
- Spark SQL: This provides a new data abstraction called the DataFrame (formerly SchemaRDD), which adds support for structured and semi-structured data.