Understanding the basics of Apache Spark
Apache Spark is an open-source engine designed for large-scale distributed data processing. This means that for smaller datasets, say tens or even a few hundred gigabytes, a well-tuned traditional database may deliver faster processing times. Apache Spark's main differentiator is its ability to hold intermediate computations in memory, which makes it much faster than Hadoop MapReduce, which writes intermediate results to disk between stages.
Apache Spark is built for speed, ease of use, and flexibility. It offers more than 70 high-level data processing operators that make it easy for data engineers to build data applications, so processing logic can be expressed concisely through the Spark APIs. Flexibility means that Spark acts as a unified data processing engine across several types of workloads: batch applications, streaming applications, interactive queries, and even machine learning algorithms.
Figure 5.26 shows the Apache Spark components...