Understanding the Apache Spark ML machine learning library
Apache Spark is a distributed framework for large-scale data processing. It lets Spark applications load and process data in memory across a cluster of machines, which significantly reduces processing time.
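As a minimal sketch of this in-memory processing model, the following PySpark snippet loads a dataset, caches it in executor memory, and runs two actions against it. The file name events.csv and the status column are hypothetical placeholders, not part of any real dataset:

from pyspark.sql import SparkSession

# Entry point for a Spark application; the builder returns an existing
# session if one is already active in this process.
spark = SparkSession.builder.appName("in-memory-sketch").getOrCreate()

# Load a dataset as a distributed DataFrame (path is a placeholder).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# cache() keeps the DataFrame's partitions in executor memory, so
# subsequent actions avoid re-reading the file from disk.
df.cache()

print(df.count())  # first action: reads the file and populates the cache
print(df.filter(df["status"] == "error").count())  # served from memory

spark.stop()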
A Spark cluster consists of a master node and worker nodes that run Spark applications. Each application running on the cluster has a driver program and its own set of processes, coordinated by the SparkSession object in the driver program. The SparkSession object connects to a cluster manager (for example, Mesos, YARN, Kubernetes, or Spark's standalone cluster manager), which is responsible for allocating cluster resources to the application. Specifically, the cluster manager acquires processes on worker nodes called executors, which run computations and store data for the application. Each executor is configured with resources such as CPU cores and memory.
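Executor resources can be specified when the SparkSession is created. The sketch below assumes PySpark with an illustrative standalone master URL and example resource values; in practice these settings are often supplied via spark-submit or spark-defaults.conf instead, and some of them behave differently depending on the cluster manager:

from pyspark.sql import SparkSession

# A sketch of requesting executor resources at session creation time.
# The master URL and all values here are illustrative, not recommendations.
spark = (
    SparkSession.builder
    .appName("resource-config-sketch")
    .master("spark://master-host:7077")       # standalone cluster manager
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.cores", "2")      # CPU cores per executor
    .config("spark.executor.memory", "4g")    # heap memory per executor
    .getOrCreate()
)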