Apache Spark is a distributed, in-memory data processing system. It provides a rich set of APIs in Java, Scala, and Python. The Spark API can be used to develop applications that perform batch and real-time data processing and analytics, machine learning, and graph processing over huge volumes of data on a single cluster platform.
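As a quick illustration of the API, here is a minimal sketch of a batch word count in Scala. The input path `input.txt` and the `local[*]` master are assumptions chosen for a standalone local run, not part of the original text:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for the DataFrame, SQL, and RDD APIs
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")           // run locally using all available cores (assumed for this sketch)
      .getOrCreate()

    // Read a hypothetical text file and count word occurrences in batch
    val counts = spark.read.textFile("input.txt")
      .rdd
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

The same application could be written with the Java or Python APIs; only the language bindings change, not the processing model.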
Spark development was started in 2009 by a team at UC Berkeley's AMPLab with the goal of improving on the performance of the MapReduce framework.
MapReduce is another distributed batch processing framework, whose open-source implementation was developed at Yahoo based on a Google research paper.
What the team found was that applications which take an iterative approach to solving certain problems can be improved by reducing disk I/O. Spark allows us to cache a large dataset in memory, so applications that apply transformations iteratively can benefit from caching instead of re-reading the data from disk on every pass.
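The following is a minimal Scala sketch of that idea, assuming a hypothetical `points.txt` input file and a toy iterative computation. The point is that `cache()` keeps the parsed data in memory, so each iteration of the loop reuses the cached RDD rather than re-reading and re-parsing the file from disk, which is where a pure MapReduce pipeline pays its repeated I/O cost:

```scala
import org.apache.spark.sql.SparkSession

object IterativeCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeCaching")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical dataset of numeric values, loaded and parsed once
    val points = spark.sparkContext
      .textFile("points.txt")
      .map(_.toDouble)
      .cache()                      // keep the parsed data in memory across iterations

    // A toy iterative refinement: each pass reuses the cached RDD
    // instead of hitting the disk again.
    var estimate = 0.0
    for (_ <- 1 to 10) {
      estimate = points.map(p => math.abs(p - estimate)).mean()
    }

    println(s"Final estimate: $estimate")
    spark.stop()
  }
}
```

Without the `cache()` call, every iteration would trigger a full re-read of the source file; with it, only the first action pays the disk cost and subsequent passes run against memory.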