A brief overview
How can we use Cassandra and Spark together for data analysis? How can we apply MapReduce-style processing in Spark? What general transformations can be performed on data stored in Cassandra using Spark? This is a very brief overview of these capabilities. All the Spark-related discussion centers on the programming aspects; clustering, deployment, methods of running jobs, and so on are beyond the scope of this chapter.
The most important data abstraction in Spark is the Resilient Distributed Dataset (RDD). For all practical purposes, an RDD can be thought of as an in-memory table of data drawn from a data source. The data source can be a text file, files stored in HDFS, a Cassandra column family, an HBase column family, and so on.
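As a quick illustration, the following is a minimal sketch of building RDDs from two of these sources, assuming Apache Spark and the DataStax spark-cassandra-connector are on the classpath; the connection host, file path, keyspace, and table names are placeholders, not values from this chapter:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._   // brings in sc.cassandraTable(...)

    object RddSources {
      def main(args: Array[String]): Unit = {
        // The connection host below is a placeholder for a real Cassandra node.
        val conf = new SparkConf()
          .setAppName("RddSources")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        // An RDD backed by a plain text file
        val lines = sc.textFile("events.log")

        // An RDD backed by a Cassandra column family
        // ("my_keyspace" and "my_table" are hypothetical names)
        val rows = sc.cassandraTable("my_keyspace", "my_table")

        println(s"lines: ${lines.count()}, rows: ${rows.count()}")
        sc.stop()
      }
    }

Whatever the source, the resulting RDD is manipulated through the same set of transformations and actions, which is what makes it such a convenient abstraction.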
Note
An RDD is immutable, and hence it is highly reusable and can be safely cached. Because no other process can change an RDD's contents, the results computed from it are fully deterministic.
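To make the point concrete, here is a minimal sketch, assuming a spark-shell session where the SparkContext sc is predefined: transformations such as map and filter never modify the source RDD; each returns a new RDD, which is why caching the original is always safe:

    // Assuming a spark-shell session, where `sc` is predefined
    val numbers = sc.parallelize(1 to 10)
    numbers.cache()                          // safe to cache: the contents can never change
    val doubled = numbers.map(_ * 2)         // a new RDD; `numbers` itself is untouched
    val evens   = numbers.filter(_ % 2 == 0) // another independent derivation
    println(doubled.collect().mkString(", "))
    println(evens.collect().mkString(", "))

Both derived RDDs read from the same cached `numbers`, and neither affects the other, which is exactly the reuse that immutability enables.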