Spark for data analytics
Soon after the Spark project proved successful in UC Berkeley's AMPLab, it was open sourced in 2010 and donated to the Apache Software Foundation in 2013. Much of its ongoing development is driven by Databricks, the company founded by Spark's creators.
Spark offers many distinct advantages over other distributed computing platforms, such as:
- A faster execution platform for both iterative machine learning and interactive data analysis
- Single stack for batch processing, SQL queries, real-time stream processing, graph processing, and complex data analytics
- A high-level API that hides the complexities of distributed programming, making it easy to develop a diverse range of distributed applications
- Seamless support for various data sources such as RDBMS, HBase, Cassandra, Parquet, MongoDB, HDFS, and Amazon S3
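As a minimal sketch of that last point, the same `DataFrameReader` API covers several of these sources. The file path, JDBC URL, table names, and join key below are placeholders, not values from any real deployment:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: reading from different data sources through one unified API.
// All paths, URLs, and table/column names are illustrative placeholders.
object DataSourcesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-sources-sketch")
      .master("local[*]")  // local mode, suitable for experimentation
      .getOrCreate()

    // Parquet files on HDFS (or S3, by changing the URI scheme)
    val parquetDF = spark.read.parquet("hdfs:///data/orders.parquet")

    // A table in an RDBMS, read over JDBC
    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")  // placeholder
      .option("dbtable", "customers")
      .load()

    // Data from heterogeneous sources combines with ordinary operations
    parquetDF.join(jdbcDF, "customer_id").show()

    spark.stop()
  }
}
```

The point is that switching sources changes only the `read` call, not the downstream analysis code.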
The following is a pictorial representation of in-memory data sharing for iterative algorithms:
Spark hides the complexities of writing low-level MapReduce jobs and exposes most functionality through simple function calls. This simplicity lets it cater to a wider audience, including data scientists, data engineers, statisticians, and R/Python/Scala/Java developers.
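Word count, the canonical MapReduce example, illustrates this brevity: what takes a mapper class, a reducer class, and driver boilerplate in classic MapReduce collapses into a handful of function calls. The input path here is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Word count in a few chained function calls. Compare with classic
// Hadoop MapReduce, which requires separate Mapper and Reducer classes.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///data/input.txt")  // placeholder path
      .flatMap(line => line.split("\\s+"))  // map phase: split into words
      .map(word => (word, 1))               // emit (word, 1) pairs
      .reduceByKey(_ + _)                   // reduce phase: sum the counts

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```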
The Spark architecture broadly consists of a data storage layer, a management framework, and an API. It is designed to work on top of HDFS, thereby leveraging the existing Hadoop ecosystem. Spark can be deployed as a standalone cluster or on resource managers such as Apache Mesos or YARN. An API is provided for Scala, the language in which Spark is written, along with Java, R, and Python.
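In practice, the deployment mode is selected through the master URL in the application's configuration; the application code itself is unchanged. A sketch, with hypothetical host names:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The same application runs under different deployment modes; only the
// master URL differs. Host names below are illustrative placeholders.
val localConf      = new SparkConf().setAppName("app").setMaster("local[4]")
val standaloneConf = new SparkConf().setAppName("app").setMaster("spark://master-host:7077")
val mesosConf      = new SparkConf().setAppName("app").setMaster("mesos://mesos-host:5050")
// On YARN the master is simply "yarn"; cluster details come from the
// Hadoop configuration files found on the classpath.
val yarnConf       = new SparkConf().setAppName("app").setMaster("yarn")

val sc = new SparkContext(localConf)  // pick one configuration to launch
```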