Architecture
Remember that, although Spark is used for the speed of its in-memory distributed processing, it doesn't provide storage. You can use the Host (local) filesystem to read and write your data, but if your data volumes are big enough to be described as big data, then it makes sense to use a cloud-based distributed storage system such as OpenStack Swift Object Storage, which can be found in many cloud environments and can also be installed in private data centers.
Note
In case very high I/O is needed, HDFS would also be an option. More information on HDFS can be found here: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
The development environment
The Scala language will be used for the coding samples in this book. This is because, as a scripting language, it produces less code than Java. It can also be used from the Spark shell as well as compiled with Apache Spark applications. We will be using the sbt tool to compile the Scala code, which we...