Understanding Apache Spark
Spark is an open source framework built to perform analytics on large datasets. Unlike tools such as R, Python, and MATLAB, which process data in memory on a single machine, Spark lets you scale out across a cluster. Thanks to its expressiveness and interactivity, it also improves developer productivity.
There are entire books dedicated to Spark. It has a vast number of components and many areas to explore. In this book, we aim to get you started with the fundamentals. You should then be comfortable exploring the documentation on your own.
The purpose of Spark is to perform analytics on a collection. This collection could live in memory, and you could run your analytics over it using multiple threads, but if the collection becomes too large, you will reach the memory limit of your system.
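To make this concrete, here is a minimal sketch (plain Scala, not Spark) of that single-machine approach: the whole collection lives in one JVM's heap, and parallel collections spread the work across local threads. The data and the average computation are illustrative.

```scala
// Single-machine, multi-threaded analytics on an in-memory collection.
// Scala 2.13+ needs the scala-parallel-collections module for .par;
// in Scala 2.12 and earlier, .par is part of the standard library.
import scala.collection.parallel.CollectionConverters._

object LocalAnalytics {
  def main(args: Array[String]): Unit = {
    // The entire collection must fit in this one JVM's heap.
    val measurements = (1 to 1000000).map(_.toDouble)

    // .par runs the aggregation on multiple threads,
    // but only on this machine's cores.
    val mean = measurements.par.sum / measurements.size
    println(s"mean = $mean")
  }
}
```

This works well until the data outgrows the heap: adding threads speeds up the computation, but it does nothing about the memory ceiling.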
Spark solves this issue by creating an object to hold all of this data. Instead of keeping everything in the local computer's memory, Spark chunks the data into multiple partitions and spreads them across the nodes of a cluster.
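As a sketch of what that looks like in code, the following program computes the same kind of aggregate on a distributed Dataset. The application name, the local[*] master, and the range of numbers are illustrative assumptions; on a real cluster, the master URL would point at the cluster manager instead.

```scala
// A minimal Spark sketch: the same average, but computed on a
// distributed Dataset that Spark splits into partitions, so the
// data no longer has to fit on a single machine.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object DistributedAnalytics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistributedAnalytics")
      .master("local[*]") // illustrative; a real cluster supplies its own master URL
      .getOrCreate()

    // spark.range creates a distributed Dataset of longs; Spark decides
    // how to chunk it into partitions behind the scenes.
    val numbers = spark.range(1, 1000001)
    println(s"partitions = ${numbers.rdd.getNumPartitions}")

    // The aggregation runs partition by partition on the executors,
    // and Spark combines the partial results into one value.
    val mean = numbers.agg(avg("id")).first().getDouble(0)
    println(s"mean = $mean")

    spark.stop()
  }
}
```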