Pros and cons
An increasing number of organizations are adopting Spark as their distributed data processing platform for real-time or pseudo-real-time operations.
There are several reasons for the fast adoption of Spark:
- Supported by a large and dedicated community of developers
- In-memory persistence is ideal for the iterative computations found in machine learning and statistical inference algorithms
- Excellent performance and scalability, which the streaming library extends to pseudo-real-time or continuously running (infinite loop) computation
- Apache Spark leverages Scala's functional capabilities and a large number of open source Java libraries
- Spark can leverage the Mesos or YARN cluster managers, which reduces the complexity of defining fault tolerance and load balancing between worker nodes
- Spark is being integrated into the distributions of commercial Hadoop vendors such as Cloudera
However, no platform is perfect, and Spark is no exception. The most common complaints or concerns regarding Spark are:
- Creating a Spark application...