You're reading from Apache Flume: Distributed Log Collection for Hadoop

Product type Book

Published in Jul 2013

Publisher Packt

ISBN-13 9781782167914

Pages 108 pages

Edition 1st Edition

Languages

Java

Concepts

Data Processing

Preface

Hadoop is a great open source tool for sifting tons of unstructured data into something manageable, so that your business can gain better insight into your customers' needs. It is cheap (can be mostly free), scales horizontally as long as you have space and power in your data center, and can handle problems your traditional data warehouse would be crushed under. That said, a little-known secret is that your Hadoop cluster requires you to feed it with data; otherwise, you just have a very expensive heat generator. You will quickly find, once you get past the “playing around” phase with Hadoop, that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with your own solution for this problem, but no more! Flume started as a project out of Cloudera when their integration engineers had to keep writing tools over and over again for their customers to import data automatically. Today, the project lives with the Apache Software Foundation, is under active development, and boasts users who have been using it in their production environments for years.

In this book I hope to get you up and running quickly with an architectural overview of Flume and a quick start guide. After that we’ll deep-dive into the details on many of the more useful Flume components, including the very important File Channel for persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS, the Hadoop Distributed File System. Since Flume comes with a wide variety of modules, chances are that the only tool you’ll need to get started is a text editor for the configuration file.

By the end of the book, you should know enough to build out a highly available, fault-tolerant streaming data pipeline feeding your Hadoop cluster.

What this book covers

Chapter 1, Overview and Architecture, introduces the reader to Flume and the problem space that it is trying to address (specifically with regard to Hadoop). An architectural overview is given of the various components to be covered in the later chapters.

Chapter 2, Flume Quick Start, serves to get you up and running quickly, including downloading Flume, creating a “Hello World” configuration, and running it.

Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered to create a more robust data pipeline.

Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Switching between different channels based on data content is covered, allowing for the creation of complex data flows.

Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in flight as well as extract information from the payload to use with channel selectors to make routing decisions. Tiering Flume agents is covered using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.

Chapter 7, Monitoring Flume, discusses various options available to monitor Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.

Chapter 8, There Is No Spoon – The Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.

What you need for this book

You’ll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don’t have Java on your computer, you can download it from http://java.com/.

You will also need an Internet connection so you can download Flume to run the Quick Start example.

This book covers Apache Flume 1.3.0, including a few items back-ported into Cloudera’s Flume CDH4 distribution.

Who this book is for

This book is for people responsible for implementing the automatic movement of data from various systems into a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you code yourself out of manual monkey-work and spare you from writing a custom tool you’ll be supporting for as long as you work at your company.

Only basic Hadoop knowledge of HDFS is required. Some custom implementations are covered should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.

Finally, you’ll need your favorite text editor since most of this book covers how to configure various Flume components via the agent’s text configuration file.
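
To give a feel for what such a configuration file looks like, here is a minimal sketch of a single agent that wires one source to one sink through one channel. It is an illustration only, not the configuration used later in the book: the names agent, s1, and k1 follow the naming used in this book's examples, while the channel name c1, the bind address, and the port number are assumptions chosen for this sketch.

```
# Declare the components that make up the agent named "agent"
agent.sources = s1
agent.channels = c1
agent.sinks = k1

# Source: listen for lines of text on a TCP port (address and port are assumed values)
agent.sources.s1.type = netcat
agent.sources.s1.bind = 0.0.0.0
agent.sources.s1.port = 12345
agent.sources.s1.channels = c1

# Channel: hold events in memory between the source and the sink
agent.channels.c1.type = memory

# Sink: write each event to the agent's log
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1
```

Note that a source lists its channels in the plural (channels) because it can feed more than one, while a sink reads from exactly one channel (channel). Assuming the file is saved under a hypothetical name such as conf/hw.conf, an agent can typically be started from the Flume directory with something like bin/flume-ng agent -n agent -c conf -f conf/hw.conf, where the value passed to -n must match the agent name used as the prefix in the file.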

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: “We can include other contexts through the use of the include directive.”

A block of code is set as follows:

agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

Any command-line input or output is written as follows:

$ tar -zxf apache-flume-1.3.1.tar.gz
$ cd apache-flume-1.3.1

New terms and important words are shown in bold.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.