Chapter 15. Understanding Data Processing using Apache Spark
In this chapter, we will present the main features of a data processing architecture and the Cloudera platform distribution. Then, we will explore how to use a distributed file system and how to manage files from the terminal and through a web interface. Finally, we will describe the use of Apache Spark, an open source big data processing framework built to be fast and easy to use. Apache Spark provides a unified framework for big data processing requirements such as data streaming, machine learning, and analytics.
In this chapter, we will cover these topics:
- Understanding data processing
- Platform for data processing
- An introduction to the distributed file system
- An introduction to Apache Spark
Understanding data processing
Since the first edition of this book in 2013, there have been big changes in the data-driven scene. With the emergence of buzzwords such as big data, data science, and deep...