Chapter 3. Configuring the Hadoop Ecosystem
Hadoop is a powerful distributed data processing system. The cluster that we configured in the previous chapter is ready to use, but if you start applying Hadoop in this configuration to any real-life workload, you will quickly discover that MapReduce offers only a very low-level way to access and process data, leaving you to figure out many things on your own: how to extract data from external sources and load it into Hadoop efficiently, what format to store the data in, and how to write the Java code that implements your processing logic in the MapReduce paradigm. The Hadoop ecosystem includes a number of side projects created to address these different aspects of loading, processing, and extracting data. In this chapter, we will go over setting up and configuring several popular and important Hadoop ecosystem projects:
Sqoop for extracting data from external data sources...