Summary
This chapter discussed the problem of retrieving data from across the network and making it available for processing in Hadoop. As we saw, this is actually a more general challenge, and though we may use Hadoop-specific tools such as Flume, the principles are not unique to Hadoop. In particular, we gave an overview of the types of data we may want to write to Hadoop, broadly categorizing it as network data or file data. We explored some approaches to such retrieval using existing command-line tools. Though functional, these approaches lacked sophistication and were not well suited to extension into more complex scenarios.
We looked at Flume as a flexible framework for defining and managing the routing and delivery of data, particularly from log files, and learned the Flume architecture, in which data arrives at sources, passes through channels, and is then written to sinks.
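As a reminder of how these pieces fit together, a minimal Flume agent configuration wiring a source to a sink through a channel might look like the following sketch (the agent name, component names, and HDFS path are hypothetical examples, not values from this chapter):

```properties
# Name the components of this agent (agent name "a1" is arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to HDFS (path is a placeholder)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An agent started with this configuration would accept text on port 44444, buffer each line as an event in the memory channel, and drain those events to files under the given HDFS path.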
We then explored many of Flume's capabilities, such as how to use the different types of sources, sinks, and channels. We saw how the...