Apache Flume: Distributed Log Collection for Hadoop


Flume events


The basic payload of data transported by Flume is called an event. An event is composed of zero or more headers and a body.

The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or hostname of the server where the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.

The body is an array of bytes that contains the actual payload. If your input consists of tailed logfiles, the array is most likely a UTF-8-encoded string containing a line of text.

Flume may add additional headers automatically (for example, when a source adds the hostname where the data originated or creates the event's timestamp), but the body is mostly untouched unless you edit it en route using interceptors.
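
To make the header/body split concrete, here is a minimal configuration sketch, assuming an agent named a1 with a channel c1 and an HDFS sink k1 (all names and the path are illustrative assumptions, not from this chapter), in which the sink builds its output path from the host and timestamp headers of each event:

    # Sketch only: an HDFS sink whose output path is built from event headers.
    # %{host} expands to the value of the "host" header; %Y/%m/%d are derived
    # from the "timestamp" header. Agent/component names are assumptions.
    a1.sinks = k1
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%{host}/%Y/%m/%d

If events arrive without a timestamp header, an interceptor (covered in the next section) is one way to add it before the sink needs it.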

Interceptors, channel selectors, and sink processors

An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event or before a sink sends the event wherever it is destined. If you are familiar with the Spring AOP framework, an interceptor is similar to a MethodInterceptor; in Java Servlets, it is similar to a ServletFilter. Here is an example of what chaining four interceptors on a source might look like.
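
As a rough sketch of that chaining in configuration form (the agent name agent1, the source name s1, and the particular interceptor choices are assumptions for illustration), four of the built-in interceptor types could be declared like this:

    # Sketch only: four interceptors chained on one source, applied in the
    # order listed. i1 adds a timestamp header, i2 adds the agent's hostname
    # as a header, i3 adds a fixed key/value header, and i4 drops events
    # whose body matches a regular expression. All names are assumptions.
    agent1.sources.s1.interceptors = i1 i2 i3 i4
    agent1.sources.s1.interceptors.i1.type = timestamp
    agent1.sources.s1.interceptors.i2.type = host
    agent1.sources.s1.interceptors.i3.type = static
    agent1.sources.s1.interceptors.i3.key = datacenter
    agent1.sources.s1.interceptors.i3.value = dc1
    agent1.sources.s1.interceptors.i4.type = regex_filter
    agent1.sources.s1.interceptors.i4.regex = ^DEBUG
    agent1.sources.s1.interceptors.i4.excludeEvents = true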

Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors that cover most use cases, although you can write your own if needed. A replicating channel selector (the default) simply puts a copy of the event into each channel, assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on certain header information. Combined with interceptor logic, this duo forms the foundation for routing input to different channels.
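
A hedged configuration sketch of the multiplexing case (the header name shape and all component names are assumptions chosen for illustration) might look like this:

    # Sketch only: route events to different channels based on a "shape" header.
    # Events with shape=square go to c_square, shape=triangle to c_triangle,
    # and anything else falls back to the default channel.
    agent1.sources.s1.channels = c_square c_triangle
    agent1.sources.s1.selector.type = multiplexing
    agent1.sources.s1.selector.header = shape
    agent1.sources.s1.selector.mapping.square = c_square
    agent1.sources.s1.selector.mapping.triangle = c_triangle
    agent1.sources.s1.selector.default = c_square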

Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.
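
For instance, a minimal failover sketch, assuming two sinks k1 and k2 grouped on an agent named agent1 (names and priorities are illustrative assumptions), could look like this:

    # Sketch only: group two sinks behind a failover sink processor.
    # Events go to the highest-priority sink (k1) until it fails, then to k2.
    # Using processor.type = load_balance instead spreads events across both.
    agent1.sinkgroups = g1
    agent1.sinkgroups.g1.sinks = k1 k2
    agent1.sinkgroups.g1.processor.type = failover
    agent1.sinkgroups.g1.processor.priority.k1 = 10
    agent1.sinkgroups.g1.processor.priority.k2 = 5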

Tiered data collection (multiple flows and/or agents)

You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely, your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.

In the following diagram, you can see there are two places where data is created (on the left) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see that in the lower-left agent we use a multiplexing channel selector to split the two kinds of data into different channels. The square data channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent), and the combined volume of events is written together to HDFS in datacenter 1. Meanwhile, the triangle data is sent to the agent that writes to ElasticSearch in datacenter 2. Keep in mind that data transformations can occur after any source or before any sink. How all of these components can be used to build complicated data workflows will become clear as the book proceeds.
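
To give a feel for how two tiers are wired together, here is a hedged sketch of the hand-off between an edge agent and a collector agent over Avro RPC (hostnames, ports, and component names are assumptions, not taken from the diagram):

    # Sketch only: the edge agent's Avro sink forwards events to the Avro
    # source of a collector-tier agent that sits closer to the Hadoop cluster.

    # Edge agent (runs near where the data is created)
    edge.sinks.k1.type = avro
    edge.sinks.k1.channel = c1
    edge.sinks.k1.hostname = collector01.example.com
    edge.sinks.k1.port = 4141

    # Collector agent (plenty of disk, writes onward to HDFS)
    collector.sources.r1.type = avro
    collector.sources.r1.channels = c1
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4141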
