Apache Flume: Distributed Log Collection for Hadoop


Flume events


The basic payload of data transported by Flume is called an event. An event is composed of zero or more headers and a body.

The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or hostname of the server where the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.

The body is an array of bytes that contains the actual payload. If your input consists of tailed logfiles, the array is most likely a UTF-8-encoded string containing a line of text.

Flume may add additional headers automatically (for example, when a source adds the hostname where the data originated or creates the event's timestamp), but the body is mostly untouched unless you edit it en route using interceptors.
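
To make the header/body split concrete, here is a minimal configuration sketch, assuming an agent named a1 with a channel c1 and an HDFS sink k1 (all names and the path are illustrative assumptions, not from this chapter), in which the sink builds its output path from the host and timestamp headers of each event:

    # Sketch only: an HDFS sink whose output path is built from event headers.
    # %{host} expands to the value of the "host" header; %Y/%m/%d are derived
    # from the "timestamp" header. Agent/component names are assumptions.
    a1.sinks = k1
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%{host}/%Y/%m/%d

If events arrive without a timestamp header, an interceptor (covered in the next section) is one way to add it before the sink needs it.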

Interceptors, channel selectors, and sink processors

An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event or before a sink sends the event wherever it is destined. If you are familiar with the Spring AOP framework, an interceptor is similar to a MethodInterceptor; in Java Servlets, it is similar to a ServletFilter. Here is an example of what chaining four interceptors on a source might look like.
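
As a rough sketch of that chaining in configuration form (the agent name agent1, the source name s1, and the particular interceptor choices are assumptions for illustration), four of the built-in interceptor types could be declared like this:

    # Sketch only: four interceptors chained on one source, applied in the
    # order listed. i1 adds a timestamp header, i2 adds the agent's hostname
    # as a header, i3 adds a fixed key/value header, and i4 drops events
    # whose body matches a regular expression. All names are assumptions.
    agent1.sources.s1.interceptors = i1 i2 i3 i4
    agent1.sources.s1.interceptors.i1.type = timestamp
    agent1.sources.s1.interceptors.i2.type = host
    agent1.sources.s1.interceptors.i3.type = static
    agent1.sources.s1.interceptors.i3.key = datacenter
    agent1.sources.s1.interceptors.i3.value = dc1
    agent1.sources.s1.interceptors.i4.type = regex_filter
    agent1.sources.s1.interceptors.i4.regex = ^DEBUG
    agent1.sources.s1.interceptors.i4.excludeEvents = true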

Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors that cover most use cases, although you can write your own if needed. A replicating channel selector (the default) simply puts a copy of the event into each channel, assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on certain header information. Combined with interceptor logic, this duo forms the foundation for routing input to different channels.
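
A hedged configuration sketch of the multiplexing case (the header name shape and all component names are assumptions chosen for illustration) might look like this:

    # Sketch only: route events to different channels based on a "shape" header.
    # Events with shape=square go to c_square, shape=triangle to c_triangle,
    # and anything else falls back to the default channel.
    agent1.sources.s1.channels = c_square c_triangle
    agent1.sources.s1.selector.type = multiplexing
    agent1.sources.s1.selector.header = shape
    agent1.sources.s1.selector.mapping.square = c_square
    agent1.sources.s1.selector.mapping.triangle = c_triangle
    agent1.sources.s1.selector.default = c_square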

Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.
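
For instance, a minimal failover sketch, assuming two sinks k1 and k2 grouped on an agent named agent1 (names and priorities are illustrative assumptions), could look like this:

    # Sketch only: group two sinks behind a failover sink processor.
    # Events go to the highest-priority sink (k1) until it fails, then to k2.
    # Using processor.type = load_balance instead spreads events across both.
    agent1.sinkgroups = g1
    agent1.sinkgroups.g1.sinks = k1 k2
    agent1.sinkgroups.g1.processor.type = failover
    agent1.sinkgroups.g1.processor.priority.k1 = 10
    agent1.sinkgroups.g1.processor.priority.k2 = 5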

Tiered data collection (multiple flows and/or agents)

You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely, your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.

In the following diagram, you can see there are two places where data is created (on the left) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see that in the lower-left agent we use a multiplexing channel selector to split the two kinds of data into different channels. The square data channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent), and the combined volume of events is written together to HDFS in datacenter 1. Meanwhile, the triangle data is sent to the agent that writes to ElasticSearch in datacenter 2. Keep in mind that data transformations can occur after any source or before any sink. How all of these components can be used to build complicated data workflows will become clear as the book proceeds.
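
To give a feel for how two tiers are wired together, here is a hedged sketch of the hand-off between an edge agent and a collector agent over Avro RPC (hostnames, ports, and component names are assumptions, not taken from the diagram):

    # Sketch only: the edge agent's Avro sink forwards events to the Avro
    # source of a collector-tier agent that sits closer to the Hadoop cluster.

    # Edge agent (runs near where the data is created)
    edge.sinks.k1.type = avro
    edge.sinks.k1.channel = c1
    edge.sinks.k1.hostname = collector01.example.com
    edge.sinks.k1.port = 4141

    # Collector agent (plenty of disk, writes onward to HDFS)
    collector.sources.r1.type = avro
    collector.sources.r1.channels = c1
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4141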
