Hadoop is an open-source framework for working with large quantities of data spread across anywhere from a single computer to thousands of computers. Hadoop is composed of four modules:
- Hadoop Common (also known as Hadoop Core)
- Hadoop Distributed File System (HDFS)
- Yet Another Resource Negotiator (YARN)
- MapReduce
Hadoop Common provides the libraries and utilities needed to run the other three modules. HDFS is a Java-based distributed file system designed to store large files across many machines. By large files, we are talking terabytes. YARN manages resources and job scheduling across your Hadoop cluster. The MapReduce engine allows you to process data in parallel.
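To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets any program that reads standard input and writes standard output act as a mapper or reducer. The file names mapper.py and reducer.py are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- emits one (word, 1) pair, tab-separated, per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word. Hadoop sorts mapper
# output by key before the reduce phase, so identical words arrive
# on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You can test the pair locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py`; on a cluster, the same two scripts would typically be submitted through the hadoop-streaming JAR that ships with Hadoop.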
Several other projects can be installed to work with the Hadoop framework. In this chapter, you will use Hive and Ambari. Hive allows you to read and write data using SQL, and Ambari provides a web-based interface for provisioning, managing, and monitoring Hadoop clusters. You will use Hive to run the...
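As a preview of the kind of access Hive provides, here is a minimal sketch of querying Hive from Python through the third-party PyHive library (assumed installed; the host, port, and table name are illustrative, not taken from this chapter):

```python
# A hypothetical query against a HiveServer2 instance via PyHive.
from pyhive import hive

# Connection parameters are assumptions; adjust for your cluster.
# Port 10000 is the default HiveServer2 port.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL looks like ordinary SQL; Hive translates the query into
# jobs that run over data stored in HDFS.
cursor.execute("SELECT word, COUNT(*) AS n FROM words GROUP BY word")
for word, n in cursor.fetchall():
    print(word, n)

cursor.close()
conn.close()
```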