Apache Spark is one of the largest open source projects in data processing. Since its release, it has seen rapid adoption by enterprises across a wide range of industries. Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently execute streaming, ML, and SQL workloads that require fast iterative access to datasets.
This chapter focuses on Apache Spark as an open source system for fast, large-scale data processing and ML.
The Data Science Virtual Machine provides you with a standalone (single-node, in-process) instance of the Apache Spark platform.
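To make "single node, in-process" concrete, the following is a minimal PySpark sketch of starting Spark in local mode, where the driver and executors run inside one process on one machine. It assumes PySpark and a Java runtime are available on the machine. The helper names `local_master` and `run_local_example` and the application name are hypothetical, chosen only for illustration; they are not part of the Data Science Virtual Machine or of Spark itself.

```python
def local_master(cores=None):
    """Build a Spark master URL for local (single-node, in-process) mode.

    "local[*]" asks Spark to use as many worker threads as there are
    logical cores; "local[N]" uses exactly N worker threads.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return f"local[{cores}]"


def run_local_example(cores=2):
    """Start a local Spark session, sum a small range, and return the total."""
    # Imported here so that local_master() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(local_master(cores))      # run everything in this process
        .appName("dsvm-local-example")    # hypothetical application name
        .getOrCreate()
    )
    try:
        df = spark.range(5)  # one column named "id" holding 0, 1, 2, 3, 4
        row = df.selectExpr("sum(id) AS total").first()
        return row["total"]
    finally:
        # Release the local Spark resources when done.
        spark.stop()
```

Calling `run_local_example()` starts the session, runs the aggregation, and shuts Spark down again. Because no cluster manager is involved, the same code can later be pointed at a cluster by changing only the master URL.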