Summary
In this chapter, we explored briefly how to understand a data processing architecture. First, we explored how to interact with a distributed file system. Then, we provided installation instructions for a Cloudera VM and how to get started in a data environment. Finally, we described the main features of Apache Spark and ran a practical example of how to implement a word count algorithm.
Apache Spark is highly recommended for all the data community, because it provides a robust and fast data processing tool. It also provides a library for Sql-like querying, graph processing, and machine learning algorithms.