In the last chapter, we learned how to create and configure a Hadoop cluster, and we covered the HDFS architecture, various file formats, and best practices for a Hadoop cluster. We also learned about Hadoop high availability techniques.
Now that we know how to create and configure a Hadoop cluster, in this chapter we will learn about various techniques for ingesting data into it. We know the advantages of Hadoop, but we need data in our Hadoop cluster to utilize its real power.
Data ingestion is considered the very first step in the Hadoop data life cycle. Data can be ingested into Hadoop either as a batch or as a (real-time) stream of records. Hadoop is a complete ecosystem, and MapReduce is its batch processing framework.
The following diagram shows various data ingestion tools:
We will learn about each tool in detail in the next few sections...