Understanding the CDH components
As mentioned earlier, there are several top-level Apache open source projects that are part of CDH. Let's discuss these components in detail.
Apache Hadoop
CDH comes with Apache Hadoop, the high-volume storage and computing system introduced earlier. The subcomponents that are part of Hadoop are HDFS, Fuse-DFS, MapReduce, and MapReduce 2 (YARN). Fuse-DFS is a module that mounts HDFS as a filesystem in user space (FUSE). Once mounted, HDFS is accessible like any other traditional filesystem.
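To illustrate what "accessible like any other traditional filesystem" means in practice, here is a minimal Python sketch that uses only ordinary file calls. It assumes HDFS has already been mounted through Fuse-DFS; the mount point /mnt/hdfs and the helper names summarize_directory and read_first_line are hypothetical and chosen only for this example. Nothing in the code is specific to Hadoop, which is the point: once the mount exists, standard tools and libraries can work with the data.

```python
import os

# Hypothetical mount point (an assumption for this sketch); the real path
# depends on how Fuse-DFS was mounted on your machine.
HDFS_MOUNT = "/mnt/hdfs"


def summarize_directory(path):
    """Return {file name: size in bytes} for regular files directly under path."""
    sizes = {}
    for name in sorted(os.listdir(path)):
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path):
            sizes[name] = os.path.getsize(full_path)
    return sizes


def read_first_line(path):
    """Read the first line of a text file with the ordinary open() call."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")


if __name__ == "__main__":
    if os.path.isdir(HDFS_MOUNT):
        # These are plain filesystem calls; the FUSE layer forwards them to HDFS.
        print(summarize_directory(HDFS_MOUNT))
    else:
        print(f"{HDFS_MOUNT} is not mounted on this machine")
```

The same functions work unchanged on a local directory, so the sketch can be tried out before a Fuse-DFS mount is available.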
Apache Flume NG
Apache Flume NG Version 1.x is a distributed framework for collecting and aggregating large amounts of log data. The project was built primarily to handle streaming data. Flume is robust, reliable, and fault tolerant. Although Flume was built to stream log data, its flexibility in handling multiple data sources makes it easy to configure for event data as well. Flume can handle almost any kind of data...