Configuring Flume
In this recipe, we will cover how to configure Flume for data ingestion. Flume is a general-purpose tool for consuming data from streaming sources such as web server logs or Twitter feeds.
In any organization, we might have hundreds of web servers serving web pages, and we may need to parse these logs quickly for ad targeting or event triggering. These Apache web server logs can be streamed to Flume, from where they can be continuously uploaded to HDFS for processing.
In simple terms, Flume is a distributed, reliable, and efficient way of collecting and aggregating data into HDFS. It is built around Flume agents, each hosting sources, channels, and sinks, which together make a robust system. An agent can have multiple sources and channels, and its sinks can write to HDFS, to a non-HDFS filesystem, or hand events off to other consumers downstream.
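To make the source-channel-sink pipeline concrete, the following is a minimal sketch of a single-agent configuration for the web server log use case described above. The agent name (agent1), the component names, the log path, and the HDFS path are all assumptions chosen for illustration, not values prescribed by this recipe:

# Hypothetical agent "agent1": one source, one channel, one sink
agent1.sources = weblog-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink

# Source: tail the Apache access log (path is an assumption)
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog-source.channels = mem-channel

# Channel: buffer events in memory between the source and the sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000
agent1.channels.mem-channel.transactionCapacity = 100

# Sink: write events to HDFS as plain text, bucketed by date
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-channel
agent1.sinks.hdfs-sink.hdfs.path = /flume/weblogs/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

If this were saved as weblog.conf, the agent could be started with the standard flume-ng launcher, naming the agent so Flume knows which properties to load:

flume-ng agent --conf conf --conf-file weblog.conf --name agent1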
Getting ready
For this recipe, make sure that you have completed the Hadoop cluster setup recipe and have, at a minimum, a healthy HDFS. Flume can be installed on any node in the cluster, but it is good...