Spark is a distributed in-memory processing engine that runs machine learning algorithms in distributed mode through abstract APIs. With the Spark machine learning framework, machine learning algorithms can be applied to large volumes of data represented as resilient distributed datasets (RDDs). The Spark machine learning libraries come with a rich set of utilities, components, and tools that let you write distributed code that is processed in memory in an efficient and fault-tolerant manner. The following diagram represents the Spark architecture at a high level:
There are three Java virtual machine (JVM) based components in Spark: the Driver, the Spark executor, and the Cluster Manager. These are explained as follows:
- Driver: The Driver Program runs on a logically or physically segregated node as a separate process and is responsible for launching the...