You're reading from Hadoop Beginner's Guide. Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial.

Product type Paperback

Published in Feb 2013

Publisher Packt

ISBN-13 9781849517300

Length 398 pages

Edition 1st Edition

Tools

Hadoop

Concepts

Data Processing

Table of Contents (19) Chapters

Hadoop Beginner's Guide

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

1. What It's All About

2. Getting Hadoop Up and Running

3. Understanding MapReduce

4. Developing MapReduce Programs

5. Advanced MapReduce Techniques

6. When Things Break

7. Keeping Things Running

8. A Relational View on Data with Hive

9. Working with Relational Databases

10. Data Collection with Flume

11. Where to Go Next

Pop Quiz Answers

Index

A

AccountRecordMapper class / What just happened?
add jar command / What just happened?
advanced techniques, MapReduce
- about / Simple, advanced, and in-between
- joins / Joins
- graph algorithms / Graph algorithms
- language-independent data structures, using / Using language-independent data structures
agent
- about / What just happened?
- writing, to multiple sinks / Time for action – writing to multiple sinks, What just happened?
alternative distributions
- about / Alternative distributions
- reasons / Why alternative distributions?
- bundling / Bundling
- free and commercial extensions / Free and commercial extensions
- Cloudera Distribution / Cloudera Distribution for Hadoop
- Hortonworks Data Platform / Hortonworks Data Platform
- MapR / MapR
- IBM InfoSphere Big Insights / IBM InfoSphere Big Insights
- selecting / Choosing a distribution
Apache projects
- HBase / HBase
- Oozie / Oozie
- Whirr / Whir
- Mahout / Mahout
- MRUnit / MRUnit
Apache Software Foundation / A better way – introducing Sqoop
ApplicationManager
- about / Upcoming Hadoop changes
array wrapper classes
- about / Array wrapper classes
- ArrayWritable / Array wrapper classes
- TwoDArrayWritable / Array wrapper classes
alternative schedulers, MapReduce management
- Capacity Scheduler / Capacity Scheduler
- Fair Scheduler / Fair Scheduler
- enabling / Enabling alternative schedulers
- using / When to use alternative schedulers
Avro
- about / Candidate technologies, Sources, Sinks
- URL / Introducing Avro
- downloading / Time for action – getting and installing Avro
- installing / Time for action – getting and installing Avro
- setting up / What just happened?
- advantages / Avro and schemas
- schemas / Avro and schemas
- schema, defining / Time for action – defining the schema
- source Avro data, creating with Ruby / Time for action – creating the source Avro data with Ruby, What just happened?
- data, consuming with Java / Time for action – consuming the Avro data with Java
- using, within MapReduce / Using Avro within MapReduce
- graphs / Have a go hero – graphs in Avro
- features / Going forward with Avro
Avro, within MapReduce
- shape summaries, generating / Time for action – generating shape summaries in MapReduce, What just happened?
- output data, examining with Ruby / Time for action – examining the output data with Ruby, What just happened?
- output data, examining with Java / Time for action – examining the output data with Java, What just happened?
Avro-mapred JAR files / What just happened?
Avro client
- about / What just happened?
Avro code
- about / What just happened?
Avro data
- creating, with Ruby / Time for action – creating the source Avro data with Ruby, What just happened?
- consuming, with Java / Time for action – consuming the Avro data with Java
AvroJob
- about / Using Avro within MapReduce
AvroKey
- about / Using Avro within MapReduce
AvroMapper
- about / Using Avro within MapReduce
AvroReducer
- about / Using Avro within MapReduce
AvroValue
- about / Using Avro within MapReduce
AWS
- about / AWS – infrastructure on demand from Amazon, A note about AWS
- Elastic Compute Cloud (EC2) / Elastic Compute Cloud (EC2)
- Simple Storage Service (S3) / Simple Storage Service (S3)
- Elastic MapReduce (EMR) / Elastic MapReduce (EMR)
- considerations / AWS considerations
AWS account
- creating / Creating an AWS account
- needed services, signing up / Signing up for the necessary services
- management console / Time for action – WordCount on EMR using the management console
AWS credentials
- about / AWS credentials
- account ID / AWS credentials
- access key / AWS credentials
- secret access key / AWS credentials
- key pairs / AWS credentials
AWS developer forums
- URL / Mailing lists and forums
AWS ecosystem
- about / The AWS ecosystem
- URL / The AWS ecosystem
AWS management console
- used, for WordCount on EMR / Time for action – WordCount on EMR using the management console
- URL / Time for action – running UFO analysis on EMR
AWS resources
- about / AWS resources
- HBase on EMR / HBase on EMR
- SimpleDB / SimpleDB
- DynamoDB / DynamoDB

B

BackupNameNode
- about / Upcoming Hadoop changes
base HDFS directory
- changing / Time for action – changing the base HDFS directory
big data processing
- about / Big data processing
- aspects / The value of data
- historical trends / Historically for the few and not the many
- different approach / A different approach, Share nothing, Expect failure, Move processing, not data, Build applications, not infrastructure
Bloom filter
- about / Using a data representation instead of raw data
breadth-first search (BFS) / Graphs and MapReduce – a match made somewhere

C

C++ interface
- using / Using languages other than Java with Hadoop
candidate technologies
- about / Candidate technologies
- Protocol Buffers / Candidate technologies
- Thrift / Candidate technologies
- Avro / Candidate technologies
capacity
- adding, to local Hadoop cluster / Adding capacity to a local Hadoop cluster
- adding, to EMR job flow / Adding capacity to an EMR job flow
Capacity Scheduler
- about / Capacity Scheduler
capacityScheduler directory / Enabling alternative schedulers
Cascading
- about / Cascading
- URL / Cascading
CDH
- about / Cloudera Distribution for Hadoop
ChainMapper class
- using / Time for action – using ChainMapper for field validation/analysis
channels
- about / Channels
CheckpointNameNode
- about / Upcoming Hadoop changes
city() function / What just happened?
classic data processing systems
- about / Classic data processing systems
- scale-up / Scale-up
- scale-out approach / Early approaches to scale-out
Cloud computing, with AWS
- about / Cloud computing with Amazon Web Services
- third approach / A third way
- types of cost / Different types of costs
Cloudera
- about / A better way – introducing Sqoop
- URL / A better way – introducing Sqoop
Cloudera Distribution
- about / Cloudera Distribution for Hadoop
- URL / Cloudera Distribution for Hadoop
cluster access control
- about / Cluster access control
- Hadoop security model / The Hadoop security model
cluster masters, killing
- JobTracker, killing / Time for action – killing the JobTracker, What just happened?
- replacement JobTracker, starting / Starting a replacement JobTracker
- JobTracker, moving / Have a go hero – moving the JobTracker to a new host
- NameNode process, killing / Time for action – killing the NameNode process
- replacement NameNode, starting / Starting a replacement NameNode
- NameNode process / The role of the NameNode in more detail
- files / File systems, files, blocks, and nodes
- filesystem / File systems, files, blocks, and nodes
- blocks / File systems, files, blocks, and nodes
- nodes / File systems, files, blocks, and nodes
- fsimage / The single most important piece of data in the cluster – fsimage
- DataNode start-up / DataNode startup
- safe mode / Safe mode
- SecondaryNameNode / SecondaryNameNode
- NameNode failure / So what to do when the NameNode process has a critical failure?
- BackupNode / BackupNode/CheckpointNode and NameNode HA
- CheckpointNode / BackupNode/CheckpointNode and NameNode HA
- NameNode HA / BackupNode/CheckpointNode and NameNode HA
column-oriented databases
- about / Pruning data to fit in the cache
combiner class
- about / Apart from the combiner…maybe
- features / Why have a combiner?
- adding, to WordCount / Time for action – WordCount with a combiner
command line job management
- about / Command line job management
command output
- capturing, to flat file / Time for action – capturing the output of a command to a flat file, What just happened?
commodity hardware
- about / What is commodity hardware anyway?
commodity versus enterprise class storage
- about / Commodity versus enterprise class storage
common architecture, Hadoop
- about / Common architecture
- advantages / What it is and isn't good for
- disadvantages / What it is and isn't good for
CompressedWritable wrapper class
- about / Other wrapper classes
conferences
- about / Conferences
- URL / Conferences
configuration, Flume / Time for action – installing and configuring Flume, What just happened?
configuration, MySQL
- for remote connections / Time for action – configuring MySQL to allow remote connections, What just happened?
configuration, Sqoop / Time for action – downloading and configuring Sqoop, What just happened?
configuration files, Flume / Understanding the Flume configuration files
considerations, AWS / AWS considerations
correlated failures
- about / The risk of correlated failures
counters
- adding / Counters, status, and other output
CPU / memory / storage ratio, Hadoop cluster
- about / Processor / memory / storage ratio
CREATE DATABASE statement / What just happened?
CREATE FUNCTION command / What just happened?
CREATE TABLE command
- about / What just happened?
cron
- about / Scheduling
curl utility / Getting network traffic into Hadoop, What just happened?

D

data
- getting, into Hadoop / Getting data into Hadoop
- exporting, from MySQL to HDFS / Time for action – exporting data from MySQL to HDFS, What just happened?
- importing, into Hive / Importing data into Hive using Sqoop
- exporting, from MySQL into Hive / Time for action – exporting data from MySQL into Hive, What just happened?
- importing, from raw query / Time for action – importing data from a raw query, What just happened?
- getting, out of Hadoop / Getting data out of Hadoop
- writing, from within reducer / Writing data from within the reducer
- importing, from Hadoop into MySQL / Time for action – importing data from Hadoop into MySQL, What just happened?
- about / Data data everywhere...
- types / Types of data
- copying, from web server into HDFS / Time for action – getting web server data into Hadoop, What just happened?
- hidden issues / Hidden issues, A common framework approach
- lifecycle / Data lifecycle
- staging / Staging data
- scheduling / Scheduling
data, types
- network traffic / Types of data
- file data / Types of data
database
- accessing, from mapper / Accessing the database from the mapper
data import
- improving, type mapping used / Time for action – using a type mapping, What just happened?
data input/output formats
- about / Input/output
- files / Files, splits, and records
- splits / Files, splits, and records
- records / Files, splits, and records
- InputFormat / InputFormat and RecordReader
- RecordReaders / InputFormat and RecordReader
- Hadoop-provided input formats / Hadoop-provided InputFormat
- Hadoop-provided record readers / Hadoop-provided RecordReader
- OutputFormats / OutputFormat and RecordWriter
- RecordWriters / OutputFormat and RecordWriter
- Hadoop-provided OutputFormats / Hadoop-provided OutputFormat
- Sequence files / Don't forget Sequence files
DataJoinMapperBase class
- about / DataJoinMapper and TaggedMapperOutput
data lifecycle management
- about / The bigger picture
DataNode
- about / Location of the master nodes
data paths
- about / Common data paths
dataset analysis
- UFO sighting dataset / Getting the UFO sighting dataset
- Java shape and location analysis / Java shape and location analysis
datatype issues
- about / Datatype issues
datatypes, HiveQL
- Boolean types / What just happened?
- Integer types / What just happened?
- Floating point types / What just happened?
- Textual types / What just happened?
datum
- about / What just happened?
default properties
- about / Default values
- browsing / Time for action – browsing default properties
default security, Hadoop security model
- demonstrating / Time for action – demonstrating the default security
default storage location, Hadoop configuration properties
- about / Default storage location
depth-first search (DFS) / Graphs and MapReduce – a match made somewhere
DESCRIBE TABLE command / What just happened?
description property element
- about / Additional property elements
dfs.data.dir property / Where to write data
dfs.default.name variable
- about / What just happened?
dfs.name.dir property / Where to write data
dfs.replication variable
- about / What just happened?
different approach, big data processing
- about / A different approach
dirty data, Hive tables
- handling / Handling dirty data in Hive
- query output, exporting / Time for action – exporting query output, What just happened?
Distributed Cache
- used, for improving Java location data output / Time for action – using the Distributed Cache to improve location output, What just happened?
driver class, 0.20 MapReduce Java API
- about / The Driver class
dual approach
- about / A dual approach
DynamoDB
- about / Integration with other AWS products, DynamoDB
- URL / Integration with other AWS products, DynamoDB

E

EC2 / Considering RDS
edges
- about / Graph 101
Elastic Compute Cloud (EC2)
- about / Elastic Compute Cloud (EC2), Signing up for the necessary services
- URL / Elastic Compute Cloud (EC2)
Elastic MapReduce
- about / Using Elastic MapReduce
- using / Using Elastic MapReduce
Elastic MapReduce (EMR)
- URL / Elastic MapReduce (EMR)
- about / Elastic MapReduce (EMR), Signing up for the necessary services
employee database
- setting up / Time for action – setting up the employee database, What just happened?
employee table
- exporting, into HDFS / Have a go hero – exporting the employee table into HDFS
EMR / Considering RDS
- about / A note on EMR
- benefits / A note on EMR
- as prototyping platform / EMR as a prototyping platform
EMR command-line tools
- about / The EMR command-line tools
EMR Hadoop
- versus local Hadoop / Comparison of local versus EMR Hadoop
EMR job flow
- capacity, adding / Adding capacity to an EMR job flow
- expanding / Expanding a running job flow
Enterprise Application Integration (EAI)
- about / A common framework approach
ETL tools
- about / Oozie
- Pentaho Kettle / Oozie
- Spring Batch / Oozie
evaluate methods / What just happened?
events
- about / It's all about events
exec
- about / Sources
export command / What just happened?

F

failover sink processor
- about / Handling sink failure
failure types, Hadoop
- about / Types of failure
- Hadoop node failures / Hadoop node failure
- cluster masters, killing / Killing the cluster masters
Fair Scheduler
- about / Fair Scheduler
fairScheduler directory / Enabling alternative schedulers
features, Sqoop
- incremental merge / Incremental merge
- partial exports, avoiding / Avoiding partial exports
- code generator / Sqoop as a code generator
file channel
- about / Channels
file data
- about / Types of data
FileInputFormat
- about / Hadoop-provided InputFormat
FileOutputFormat
- about / Hadoop-provided OutputFormat
files
- getting, into Hadoop / Getting files into Hadoop
- versus logs / Logs versus files
file_roll sink
- about / What just happened?
final property element
- about / Additional property elements
First In, First Out (FIFO) queue
- about / Job priorities and scheduling
flat file
- command output, capturing to / Time for action – capturing the output of a command to a flat file, What just happened?
Flume
- about / A common framework approach, Introducing Apache Flume, To Sqoop or to Flume..., Cloudera Distribution for Hadoop
- URL / Introducing Apache Flume
- versioning / A note on versioning
- configuring / Time for action – installing and configuring Flume, What just happened?
- installing / Time for action – installing and configuring Flume, What just happened?
- used, for capturing network data / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
- logging, into console / Time for action – logging to the console, What just happened?
- network data, writing to log files / Writing network data to log files, What just happened?
- source / Sources
- sinks / Sinks
- channels / Channels
- configuration files / Understanding the Flume configuration files
- timestamps, adding / Time for action – adding timestamps, What just happened?
- sink failure, handling / Handling sink failure
- features / Next, the world
flume.root.logger variable
- about / What just happened?
Flume NG
- about / A note on versioning
Flume OG
- about / A note on versioning
FLUSH PRIVILEGES command / What just happened?
fsimage class / Configuring multiple locations for the fsimage class
fsimage location
- adding, to NameNode / Time for action – adding an additional fsimage location
fully distributed mode
- about / Three modes

G

GenericRecord class
- about / What just happened?
Google File System (GFS)
- URL / Thanks, Google
GRANT statement / What just happened?
granular access control, Hadoop security model
- about / More granular access control
graph algorithms
- about / Graph algorithms
- Graph 101 / Graph 101
- Graphs and MapReduce / Graphs and MapReduce – a match made somewhere
- nodes / Graphs and MapReduce – a match made somewhere
- graph, representing / Representing a graph, What just happened?
- pointer-based representations / Representing a graph
- adjacency matrix representations / Representing a graph
- adjacency list representations / Representing a graph
- common coloring technique / Representing a graph
- white nodes / Representing a graph
- gray nodes / Representing a graph
- black nodes / Representing a graph
- overview / Overview of the algorithm
- states, for node / Overview of the algorithm
- mapper / The mapper
- reducer / The reducer
- iterative application / Iterative application
- source code, creating / Time for action – creating the source code
- first run / Time for action – the first run, What just happened?
- second run / Time for action – the second run, What just happened?
- third run / Time for action – the third run, What just happened?
- fourth run / Time for action – the fourth and last run, What just happened?
- multiple jobs, running / Running multiple jobs
- final thoughts / Final thoughts on graphs
graphs, Avro
- about / Have a go hero – graphs in Avro

H

Hadoop
- about / Hadoop
- components / Parts of Hadoop
- common building blocks / Common building blocks
- architectural principles / Common building blocks
- HDFS / HDFS
- MapReduce / MapReduce
- HDFS and MapReduce / Better together
- common architecture / Common architecture
- on local Ubuntu host / Hadoop on a local Ubuntu host
- on Windows / Other operating systems
- on Mac OS X / Other operating systems
- prerequisites / Time for action – checking the prerequisites
- setting up / Setting up Hadoop
- versions / A note on versions, Sqoop and Hadoop versions
- downloading / Time for action – downloading Hadoop
- SSH, setting up / Time for action – setting up SSH
- configuring / Configuring and running Hadoop
- used, for calculating Pi / Time for action – using Hadoop to calculate Pi
- running / Time for action – using Hadoop to calculate Pi
- modes / Three modes
- base folder, configuring / Configuring the base directory and formatting the filesystem
- filesystem, formatting / Configuring the base directory and formatting the filesystem
- base HDFS directory, changing / Time for action – changing the base HDFS directory
- NameNode, formatting / Time for action – formatting the NameNode
- starting / Time for action – starting Hadoop
- HDFS, using / Time for action – using HDFS
- WordCount, running / Time for action – WordCount, the Hello World of MapReduce
- WordCount, executing on larger body of text / Have a go hero – WordCount on a larger body of text
- monitoring / Monitoring Hadoop from the browser
- HDFS web UI / The HDFS web UI
- MapReduce web UI / The MapReduce web UI
- failure / Failure
- embrace failure / Embrace failure
- failure, types / Types of failure
- scaling / Scaling
- data paths / Common data paths
- as archive store / Hadoop as an archive store
- as preprocessing step / Hadoop as a preprocessing step
- as data input tool / Hadoop as a data input tool
- data, getting into / Getting data into Hadoop
- network traffic, getting into / Getting network traffic into Hadoop, What just happened?
- web server data, getting into / Time for action – getting web server data into Hadoop, What just happened?
- files, getting into / Getting files into Hadoop
- alternative distributions / Alternative distributions
- programming abstractions / Other programming abstractions
Hadoop, into MySQL
- data, importing from / Time for action – importing data from Hadoop into MySQL, What just happened?
Hadoop-provided input formats
- about / Hadoop-provided InputFormat
- FileInputFormat / Hadoop-provided InputFormat
- SequenceFileInputFormat / Hadoop-provided InputFormat
- TextInputFormat / Hadoop-provided InputFormat
Hadoop-provided OutputFormats
- about / Hadoop-provided OutputFormat
- FileOutputFormat / Hadoop-provided OutputFormat
- NullOutputFormat / Hadoop-provided OutputFormat
- SequenceFileOutputFormat / Hadoop-provided OutputFormat
- TextOutputFormat / Hadoop-provided OutputFormat
Hadoop-provided record readers
- about / Hadoop-provided RecordReader
- LineRecordReader / Hadoop-provided RecordReader
- SequenceFileRecordReader / Hadoop-provided RecordReader
Hadoop-specific data types
- about / Hadoop-specific data types
- Writable interface / The Writable and WritableComparable interfaces
- wrapper classes / Introducing the wrapper classes
hadoop/lib directory / Enabling alternative schedulers
Hadoop changes
- about / Upcoming Hadoop changes
- MapReduce 2.0 or MRV2 / Upcoming Hadoop changes
- YARN (Yet Another Resource Negotiator) / Upcoming Hadoop changes
Hadoop cluster
- setting up / Setting up a cluster
- hosts / How many hosts?
- usable space on node, calculating / Calculating usable space on a node
- master nodes, location / Location of the master nodes
- hardware, sizing / Sizing hardware
- processor / memory / storage ratio / Processor / memory / storage ratio
- EMR, as prototyping platform / EMR as a prototyping platform
- special node requirements / Special node requirements
- storage types / Storage types
- networking configuration / Hadoop networking configuration
- commodity hardware / What is commodity hardware anyway?
- node and running balancer, adding / Have a go hero – adding a node and running balancer
Hadoop community
- about / Sources of information
- source code / Source code
- mailing lists and forums / Mailing lists and forums
- LinkedIn groups / LinkedIn groups
- HUGs / HUGs
- conferences / Conferences
Hadoop configuration properties
- about / Hadoop configuration properties
- default properties / Default values
- property elements / Additional property elements
- default storage location / Default storage location
- setting / Where to set properties
Hadoop dependencies / Hadoop dependencies
Hadoop failure
- hardware failures / Hardware failure
- host failures / Host failure
- host corruption / Host corruption
- correlated failures / The risk of correlated failures
Hadoop FAQ
- URL / Other operating systems
hadoop fs command / What just happened?
Hadoop Java API, for MapReduce
- about / The Hadoop Java API for MapReduce
- 0.20 MapReduce Java API / The 0.20 MapReduce Java API
hadoop job -history command / What just happened?
hadoop job -kill command / What just happened?
hadoop job -list all command / What just happened?
hadoop job -set-priority command / Job priorities and scheduling, What just happened?
hadoop job -status command / What just happened?
Hadoop networking configuration
- about / Hadoop networking configuration
- blocks, placing / How blocks are placed
- rack-awareness script / Rack awareness
- default rack configuration, examining / Time for action – examining the default rack configuration
- rack awareness script, adding / Time for action – adding a rack awareness script, What just happened?
Hadoop node failures
- dfsadmin command / The dfsadmin command
- cluster setup / Cluster setup, test files, and block sizes
- test files / Cluster setup, test files, and block sizes
- block sizes / Cluster setup, test files, and block sizes
- fault tolerance / Fault tolerance and Elastic MapReduce
- Elastic MapReduce / Fault tolerance and Elastic MapReduce
- DataNode process, killing / Time for action – killing a DataNode process, What just happened?
- NameNode and DataNode communication / NameNode and DataNode communication
- NameNode log delving / Have a go hero – NameNode log delving
- replication factor / Time for action – the replication factor in action, What just happened?
- missing blocks, causing intentionally / Time for action – intentionally causing missing blocks, What just happened?
- data loss / When data may be lost
- block corruption / Block corruption
- TaskTracker process, killing / Time for action – killing a TaskTracker process, What just happened?
- DataNode and TaskTracker failures, comparing / Comparing the DataNode and TaskTracker failures
- permanent failure / Permanent failure
Hadoop Pipes
- about / Using languages other than Java with Hadoop
Hadoop security model
- about / The Hadoop security model
- default security, demonstrating / Time for action – demonstrating the default security
- user identity / User identity
- granular access control / More granular access control
- working around, via physical access control / Working around the security model via physical access control
Hadoop Streaming
- about / Using languages other than Java with Hadoop
- working / How Hadoop Streaming works
- advantages / Why to use Hadoop Streaming, Differences in jobs when using Streaming
- using, in WordCount / Time for action – implementing WordCount using Streaming, What just happened?
Hadoop Summit
- about / Conferences
Hadoop versioning
- about / A note on versions
hardware failure
- about / Hardware failure
HBase
- about / What it is and isn't good for, Sinks, HBase
- URL / HBase
HBase on EMR
- about / HBase on EMR
HDFS
- about / Parts of Hadoop, HDFS
- features / HDFS
- using / Time for action – using HDFS, What just happened?
- managing / Managing HDFS
- data, writing / Where to write data
- balancer, using / Using balancer
- rebalancing / When to rebalance
- employee table, exporting into / Have a go hero – exporting the employee table into HDFS
- and Sqoop / Sqoop and HDFS
- network traffic, writing onto / Time for action – writing network traffic onto HDFS, What just happened?
HDFS web UI
- about / The HDFS web UI
hidden issues, data
- about / Hidden issues
- network data, keeping on network / Keeping network data on the network
- Hadoop dependencies / Hadoop dependencies
- reliability / Reliability
- common framework approach / A common framework approach
historical trends, big data processing
- about / Historically for the few and not the many
- classic data processing systems / Classic data processing systems
- limiting factors / Limiting factors
Hive
- overview / Overview of Hive
- benefits / Why use Hive?
- setting up / Setting up Hive
- prerequisites / Prerequisites
- downloading / Getting Hive
- installing / Time for action – installing Hive
- using / Using Hive
- table for UFO data, creating / Time for action – creating a table for the UFO data, What just happened?
- UFO data, adding to table / Time for action – inserting the UFO data
- data, validating / Validating the data
- table, validating / Time for action – validating the table, What just happened?
- bucketing / Bucketing, clustering, and sorting... oh my!
- clustering / Bucketing, clustering, and sorting... oh my!
- sorting / Bucketing, clustering, and sorting... oh my!
- user-defined functions / User-Defined Function
- versus Pig / Hive versus Pig
- features / What we didn't cover
- data, importing into / Importing data into Hive using Sqoop
Hive, on AWS
- UFO analysis, running on EMR / Time for action – running UFO analysis on EMR, What just happened?
- interactive job flows, using for development / Using interactive job flows for development
- interactive EMR cluster, using / Have a go hero – using an interactive EMR cluster
Hive and SQL views
- about / Hive and SQL views
- using / Time for action – using views, What just happened?
Hive data
- importing, into MySQL / Time for action – importing Hive data into MySQL, What just happened?
Hive exports
- and Sqoop / Sqoop and Hive exports
Hive partitions
- about / Sqoop and Hive partitions
- and Sqoop / Sqoop and Hive partitions
HiveQL
- about / What just happened?
- datatypes / What just happened?
HiveQL command
- about / Hive versus Pig
HiveQL query planner
- about / Hive versus Pig
Hive tables
- about / Hive tables – real or not?
- creating, from existing file / Time for action – creating a table from an existing file, What just happened?
- join, performing / Time for action – performing a join, What just happened?
- join, improving / Have a go hero – improve the join to use regular expressions
- dirty data, handling / Handling dirty data in Hive
- partitioning / Partitioning the table
- partitioned UFO sighting table, creating / Time for action – making a partitioned UFO sighting table, What just happened?
Hive transforms / User-Defined Function
Hortonworks
- about / Hortonworks Data Platform
Hortonworks Data Platform
- about / Hortonworks Data Platform
- URL / Hortonworks Data Platform
host failure
- about / Host failure
HTTPClient
- about / Have a go hero
HTTP Components / Have a go hero
HTTP protocol
- about / What just happened?
HUGs
- about / HUGs

I

IBM InfoSphere Big Insights
- about / IBM InfoSphere Big Insights
- URL / IBM InfoSphere Big Insights
InputFormat class
- about / InputFormat and RecordReader, Using Avro within MapReduce
INSERT command / What just happened?
insert statement
- versus update statement / Inserts versus updates
installation, Flume / Time for action – installing and configuring Flume, What just happened?
installation, MySQL / Time for action – installing and setting up MySQL, What just happened?
installation, Sqoop / Time for action – downloading and configuring Sqoop, What just happened?
interactive EMR cluster
- using / Have a go hero – using an interactive EMR cluster
interactive job flows
- using, for development / Using interactive job flows for development
Iterator object / What just happened?

J

java.sql.Date / What just happened?
Java Development Kit (JDK) / Time for action – checking the prerequisites
Java HDFS interface / Have a go hero
Java IllegalArgumentExceptions / What just happened?
Java shape and location analysis
- about / Java shape and location analysis
- ChainMapper, using for record validation / Time for action – using ChainMapper for field validation/analysis, What just happened?
- issues, with output data / Too many abbreviations
- Distributed Cache, using / Using the Distributed Cache, Time for action – using the Distributed Cache to improve location output
JDBC / Writing data from within the reducer
JDBC channel
- about / Channels
JobConf class / Where to set properties
job priorities, MapReduce management
- changing / Job priorities and scheduling, Time for action – changing job priorities and killing a job
- scheduling / Time for action – changing job priorities and killing a job
JobTracker
- about / Location of the master nodes
JobTracker UI
- about / The MapReduce web UI
joins
- about / Joins
- disadvantages / When this is a bad idea
- map-side, versus reduce-side joins / Map-side versus reduce-side joins
- account and sales information, matching / Matching account and sales information
- reduce-side join, implementing / Matching account and sales information
- map-side joins, implementing / Implementing map-side joins
- limitations / To join or not to join...

K

key/value data
- about / Why key/value data?
- real-world examples / Some real-world examples
- MapReduce, using / MapReduce as a series of key/value transformations
key/value pairs
- about / Key/value pairs, What it mean
- key/value data / Why key/value data?

L

language-independent data structures
- about / Using language-independent data structures
- candidate technologies / Candidate technologies
- Avro / Introducing Avro
LineCounters / What just happened?
LineRecordReader
- about / Hadoop-provided RecordReader
LinkedIn groups
- about / LinkedIn groups
- URL / LinkedIn groups
list jars command / What just happened?
load balancing sink processor
- about / Handling sink failure
LOAD DATA statement / Be careful with data file access rights
local flat file
- remote file, capturing to / Time for action – capturing a remote file in a local flat file, What just happened?
local Hadoop
- versus EMR Hadoop / Comparison of local versus EMR Hadoop
local standalone mode
- about / Three modes
log file
- network traffic, capturing to / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
logrotate
- about / Scheduling
logs
- versus files / Logs versus files

M

0.20 MapReduce Java API
- about / The 0.20 MapReduce Java API
- Mapper class / The Mapper class
- Reducer class / The Reducer class
- driver class / The Driver class
Mahout
- about / Mahout
- URL / Mahout
map-side joins
- about / Map-side versus reduce-side joins
- implementing, Distributed Cache used / Using the Distributed Cache
- data pruning, for fitting cache / Pruning data to fit in the cache
- data representation, using / Using a data representation instead of raw data
- multiple mappers, using / Using multiple mappers
mapper
- database, accessing from / Accessing the database from the mapper
mapper and reducer implementations
- about / Hadoop-provided mapper and reducer implementations
Mapper class, 0.20 MapReduce Java API
- about / The Mapper class
- setup method / The Mapper class
- map method / The Mapper class
- cleanup method / The Mapper class
mappers / MapReduce
- about / Mappers and primary key columns
MapR
- about / MapR
- URL / MapR
mapred.job.tracker property / What about MapReduce?
mapred.job.tracker variable
- about / What just happened?
mapred.map.max.attempts
- about / Hadoop's handling of failing tasks
mapred.max.tracker.failures
- about / Hadoop's handling of failing tasks
mapred.reduce.max.attempts
- about / Hadoop's handling of failing tasks
MapReduce
- about / Parts of Hadoop, MapReduce
- features / MapReduce
- used, as key/value transformations / MapReduce as a series of key/value transformations
- Hadoop Java API / The Hadoop Java API for MapReduce
- advanced techniques / Simple, advanced, and in-between
/ Staging data
MapReduce 2.0 or MRV2
- about / Upcoming Hadoop changes
MapReduce job analysis
- developing / Counters, status, and other output, Time for action – creating counters, task states, and writing log output
MapReduce management
- about / MapReduce management
- command line job management / Command line job management
- job priorities / Job priorities and scheduling
- scheduling / Job priorities and scheduling
- alternative schedulers / Alternative schedulers
- alternative schedulers, enabling / Enabling alternative schedulers
- alternative schedulers, using / When to use alternative schedulers
MapReduce programs
- writing / Writing MapReduce programs
- classpath, setting up / Time for action – setting up the classpath, What just happened?
- WordCount, implementing / Time for action – implementing WordCount, What just happened?
- JAR file, building / Time for action – building a JAR file, What just happened?
- WordCount, on local Hadoop cluster / Time for action – running WordCount on a local Hadoop cluster
- WordCount, running on EMR / Time for action – running WordCount on EMR
- pre-0.20 Java MapReduce API / The pre-0.20 Java MapReduce API
- Hadoop-provided mapper and reducer implementations / Hadoop-provided mapper and reducer implementations
MapReduce programs development
- languages, using / Using languages other than Java with Hadoop
- large dataset, analyzing / Analyzing a large dataset
- counters / Counters, status, and other output
- status / Counters, status, and other output
- job analysis workflow, developing / Counters, status, and other output
- counters, creating / Time for action – creating counters, task states, and writing log output
- task states / Time for action – creating counters, task states, and writing log output
MapReduce web UI
- about / The MapReduce web UI
map wrapper classes
- AbstractMapWritable / Map wrapper classes
- MapWritable / Map wrapper classes
- SortedMapWritable / Map wrapper classes
master nodes
- location / Location of the master nodes
mean time between failures (MTBF) / Commodity versus enterprise class storage
memory channel
- about / Channels
Message Passing Interface (MPI)
- about / Upcoming Hadoop changes
MetaStore
- about / What we didn't cover
modes
- local standalone mode / Three modes
- pseudo-distributed mode / Three modes
- fully distributed mode / Three modes
MRUnit
- URL / MRUnit
- about / MRUnit
multi-level Flume networks
- about / Time for action – multi level Flume networks, What just happened?
MultipleInputs class / What just happened?
multiple sinks
- agent, writing to / Time for action – writing to multiple sinks, What just happened?
multiplexing
- about / Selectors replicating and multiplexing
multiplexing source selector
- about / Selectors replicating and multiplexing
MySQL
- setting up / Setting up MySQL, Time for action – installing and setting up MySQL, What just happened?
- installing / Time for action – installing and setting up MySQL, What just happened?
- configuring, for remote connections / Time for action – configuring MySQL to allow remote connections, What just happened?
- Hive data, importing into / Time for action – importing Hive data into MySQL, What just happened?
MySQL, into Hive
- data, exporting from / Time for action – exporting data from MySQL into Hive, What just happened?
MySQL, to HDFS
- data, exporting from / Time for action – exporting data from MySQL to HDFS, What just happened?
mysql command / To Sqoop or to Flume...
mysql command-line utility
- about / What just happened?
- options / What just happened?
mysqldump utility
- about / Using MySQL tools and manual import
MySQL tools
- used, for exporting data into Hadoop / Using MySQL tools and manual import

N

NameNode
- formatting / Time for action – formatting the NameNode
- about / Location of the master nodes
- managing / Managing the NameNode
- multiple locations, configuring / Configuring multiple locations for the fsimage class
- fsimage location, adding / Time for action – adding an additional fsimage location
- fsimage copies, writing / Where to write the fsimage copies
- host, swapping / Swapping to another NameNode host
NameNode host, swapping
- disaster recovery / Having things ready before disaster strikes
- swapping, to new NameNode host / Time for action – swapping to a new NameNode host, What just happened?
Netcat
- about / What just happened?, Sources
network
- data, keeping on / Keeping network data on the network
network data
- keeping, on network / Keeping network data on the network
- capturing, Flume used / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
- writing, to log files / Writing network data to log files, What just happened?
Network File System (NFS) / Network storage
network storage
- about / Network storage
network traffic
- about / Types of data
- getting, into Hadoop / Getting network traffic into Hadoop, What just happened?
- capturing, to log file / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
- writing, onto HDFS / Time for action – writing network traffic onto HDFS, What just happened?
Node inner class / What just happened?
NullOutputFormat
- about / Hadoop-provided OutputFormat
NullWritable wrapper class
- about / Other wrapper classes

O

ObjectWritable wrapper class
- about / Other wrapper classes
Oozie
- about / Oozie
- URL / Oozie
Open JDK / Time for action – checking the prerequisites
OutputFormat class
- about / OutputFormat and RecordWriter

P

partitioned UFO sighting table
- creating / Time for action – making a partitioned UFO sighting table, What just happened?
Pentaho Kettle
- URL / Oozie
Pi
- calculating, Hadoop used / Time for action – using Hadoop to calculate Pi
Pig
- about / Hive versus Pig, Pig
- URL / Pig
Pig Latin
- about / Hive versus Pig
pre-0.20 Java MapReduce API
- about / The pre-0.20 Java MapReduce API
primary key column
- about / Mappers and primary key columns
primitive wrapper classes
- about / Primitive wrapper classes
- BooleanWritable / Primitive wrapper classes
- ByteWritable / Primitive wrapper classes
- DoubleWritable / Primitive wrapper classes
- FloatWritable / Primitive wrapper classes
- IntWritable / Primitive wrapper classes
- LongWritable / Primitive wrapper classes
- VIntWritable / Primitive wrapper classes
- VLongWritable / Primitive wrapper classes
process ID (PID) / Time for action – killing a DataNode process
programming abstractions
- about / Other programming abstractions
- Pig / Pig
- Cascading / Cascading
Project Gutenberg
- URL / Have a go hero – WordCount on a larger body of text
property elements
- about / Additional property elements
- description / Additional property elements
- final / Additional property elements
Protocol Buffers
- URL / Candidate technologies
- about / Candidate technologies, Introducing Apache Flume
pseudo-distributed mode
- about / Three modes
- configuring / Time for action – configuring the pseudo-distributed mode
- configuration variables / What just happened?

Q

query output, Hive
- exporting / Time for action – exporting query output, What just happened?

R

raw query
- data, importing from / Time for action – importing data from a raw query, What just happened?
RDBMS
- about / Hadoop as an archive store
RDS
- considering / Considering RDS
real-world examples, key/value data
- about / Some real-world examples
RecordReader class
- about / InputFormat and RecordReader
RecordWriter class
- about / OutputFormat and RecordWriter
reduce-side join
- about / Map-side versus reduce-side joins
- implementing / Matching account and sales information
- implementing, MultipleInputs used / Time for action – reduce-side join using MultipleInputs
- DataJoinMapper class / DataJoinMapper and TaggedMapperOutput
- TaggedMapperOutput class / DataJoinMapper and TaggedMapperOutput
ReduceJoinReducer class / What just happened?
reducer
- data, writing from / Writing data from within the reducer
- SQL import files, writing from / Writing SQL import files from the reducer
Reducer class, 0.20 MapReduce Java API
- about / The Reducer class
- reduce method / The Reducer class
- setup method / The Reducer class
- run method / The Reducer class
- cleanup method / The Reducer class
reducers / MapReduce
Redundant Arrays of Inexpensive Disks (RAID) / Single disk versus RAID
remote connections
- MySQL, configuring for / Time for action – configuring MySQL to allow remote connections, What just happened?
remote file
- capturing, to local flat file / Time for action – capturing a remote file in a local flat file, What just happened?
remote procedure call (RPC) framework
- about / Going forward with Avro
replicating
- about / Selectors replicating and multiplexing
ResourceManager
- about / Upcoming Hadoop changes
Ruby API
- URL / What just happened?

S

SalesRecordMapper class / What just happened?
scale-out approach
- about / Early approaches to scale-out
- benefits / Early approaches to scale-out
scale-up approach
- about / Scale-up
- advantages / Scale-up
scaling
- capacity, adding to local Hadoop cluster / Adding capacity to a local Hadoop cluster
- capacity, adding to EMR job flow / Adding capacity to an EMR job flow
schemas, Avro
- defining / Time for action – defining the schema
- Sighting_date field / What just happened?
- City field / What just happened?
- Shape field / What just happened?
- Duration field / What just happened?
SecondaryNameNode
- about / Location of the master nodes
selective import
- performing / Time for action – a more selective import, What just happened?
SELECT statement
- about / Using MySQL tools and manual import
SequenceFile class
- about / Don't forget Sequence files
SequenceFileInputFormat
- about / Hadoop-provided InputFormat
SequenceFileOutputFormat
- about / Hadoop-provided OutputFormat
SequenceFileRecordReader
- about / Hadoop-provided RecordReader
SerDe
- about / What we didn't cover
SimpleDB / What just happened?
- about / SimpleDB
- URL / SimpleDB
Simple Storage Service (S3)
- about / Simple Storage Service (S3), Signing up for the necessary services
- URL / Simple Storage Service (S3)
single disk versus RAID
- about / Single disk versus RAID
sink
- about / What just happened?, Sinks
sink failure
- handling / Handling sink failure
skip mode
- about / Using Hadoop's skip mode
source
- about / What just happened?, Sources
source code
- about / Source code
special node requirements, Hadoop cluster
- about / Special node requirements
Spring Batch
- URL / Oozie
SQL import files
- writing, from reducer / Writing SQL import files from the reducer
Sqoop
- URL, for homepage / A better way – introducing Sqoop
- installing / Time for action – downloading and configuring Sqoop, What just happened?
- configuring / Time for action – downloading and configuring Sqoop, What just happened?
- downloading / Time for action – downloading and configuring Sqoop, What just happened?
- versions / Sqoop and Hadoop versions
- and HDFS / Sqoop and HDFS
- primary key columns / Mappers and primary key columns
- mappers / Mappers and primary key columns
- architecture / Sqoop's architecture
- used, for importing data into Hive / Importing data into Hive using Sqoop
- and Hive partitions / Sqoop and Hive partitions
- field and line terminators / Field and line terminators
- and Hive exports / Sqoop and Hive exports
- export, re-running / Time for action – fixing the mapping and re-running the export, What just happened?
- mapping, fixing / Time for action – fixing the mapping and re-running the export, What just happened?
- features / Incremental merge, Sqoop as a code generator
- as code generator / Sqoop as a code generator
- about / To Sqoop or to Flume..., Cloudera Distribution for Hadoop
sqoop command-line utility / What just happened?
Sqoop exports
- versus Sqoop imports / Differences between Sqoop imports and exports
Sqoop imports
- versus Sqoop exports / Differences between Sqoop imports and exports
start-balancer.sh script / Using balancer
stop-balancer.sh script / Using balancer
Storage Area Network (SAN) / Network storage
storage types, Hadoop cluster
- about / Storage types
- commodity, versus enterprise class storage / Commodity versus enterprise class storage
- single disk, versus RAID / Single disk versus RAID
- balancing / Finding the balance
- network storage / Network storage
Streaming WordCount mapper
- about / Differences in jobs when using Streaming
syslogd
- about / Sources

T

TaggedMapperOutput class
- about / DataJoinMapper and TaggedMapperOutput
task failures, due to data
- about / Task failure due to data
- dirty data, handling through code / Handling dirty data through code
- skip mode, using / Using Hadoop's skip mode
- dirty data, handling by skip mode / Time for action – handling dirty data by using skip mode, What just happened?
task failures, due to software
- about / Task failure due to software
- slow running tasks / Failure of slow running tasks, Time for action – causing task failure
- HDFS programmatic access / Have a go hero – HDFS programmatic access
- slow-running tasks, handling / Hadoop's handling of slow-running tasks
- speculative execution / Speculative execution
- failing tasks, handling / Hadoop's handling of failing tasks
TextInputFormat
- about / Hadoop-provided InputFormat
TextOutputFormat
- about / Hadoop-provided OutputFormat
Thrift
- about / Candidate technologies, Introducing Apache Flume
- URL / Candidate technologies
timestamp() function / What just happened?
TimestampInterceptor class / What just happened?
timestamps
- used, for writing data into directory / Time for action – adding timestamps, What just happened?
- adding / Time for action – adding timestamps, What just happened?
traditional relational databases
- about / Pruning data to fit in the cache
type mapping
- used, for improving data import / Time for action – using a type mapping, What just happened?

U

Ubuntu
- about / What just happened?
UDFMethodResolver interface / What just happened?
UDP syslogd source / It's all about events
UFO analysis
- running, on EMR / Time for action – running UFO analysis on EMR
ufodata / What just happened?
UFO dataset
- UFO data, summarizing / Time for action – summarizing the UFO data, What just happened?
- UFO shapes, examining / Examining UFO shapes
- shape data, summarizing / Time for action – summarizing the shape data, What just happened?
- sighting duration, correlating to UFO shape / Time for action – correlating of sighting duration to UFO shape, What just happened?
- Streaming scripts, using outside Hadoop / Using Streaming scripts outside Hadoop
- shape/time analysis, performing from command line / Time for action – performing the shape/time analysis from the command line, What just happened?
UFO data table, Hive
- creating / Time for action – creating a table for the UFO data, What just happened?
- data, loading / Time for action – inserting the UFO data, What just happened?
- data, validating / Validating the data, What just happened?
- redefining, with correct column separator / Time for action – redefining the table with the correct column separator, What just happened?
UFO sighting dataset
- getting / Getting the UFO sighting dataset
UFO sighting records
- sighting date / Getting the UFO sighting dataset
- recorded date / Getting the UFO sighting dataset
- location / Getting the UFO sighting dataset
- shape / Getting the UFO sighting dataset
- duration / Getting the UFO sighting dataset
- description / Getting the UFO sighting dataset
Unix chmod / What just happened?
update statement
- versus insert statement / Inserts versus updates
user-defined functions (UDF)
- about / User-Defined Function
- adding / Time for action – adding a new User Defined Function (UDF), What just happened?
user identity, Hadoop security model
- about / User identity
- super user / The super user
USE statement / What just happened?

V

VersionedWritable wrapper class
- about / Other wrapper classes
versioning
- about / A note on versioning

W

web server data
- getting, into Hadoop / Time for action – getting web server data into Hadoop, What just happened?
WHERE clause / What just happened?
Whir
- about / Whir
- URL / Whir
WordCount example
- executing / Time for action – WordCount, the Hello World of MapReduce, What just happened?, Have a go hero – WordCount on a larger body of text
- mapper and reducer implementations, using / Time for action – WordCount the easy way
- start-up / Startup
- input, splitting / Splitting the input
- task assignment / Task assignment
- task start-up / Task startup
- JobTracker monitoring / Ongoing JobTracker monitoring
- mapper input / Mapper input
- mapper execution / Mapper execution
- mapper output / Mapper output and reduce input
- reduce input / Mapper output and reduce input
- partitioning / Partitioning
- optional partition function / The optional partition function
- reducer input / Reducer input
- reducer execution / Reducer execution
- reducer output / Reducer output
- shutdown / Shutdown
- combiner class, using / Apart from the combiner…maybe, Time for action – WordCount with a combiner
- reducer, using as combiner / When you can use the reducer as the combiner
- fixing, to work with combiner / Time for action – fixing WordCount to work with a combiner
- implementing, Streaming used / Time for action – implementing WordCount using Streaming, What just happened?
WordCount example, on EMR
- AWS management console used / Time for action – WordCount on EMR using the management console
wrapper classes
- about / Introducing the wrapper classes
- primitive wrapper classes / Primitive wrapper classes
- array wrapper classes / Array wrapper classes
- map wrapper classes / Map wrapper classes
- writable wrapper classes / Time for action – using the Writable wrapper classes
- CompressedWritable / Other wrapper classes
- ObjectWritable / Other wrapper classes
- NullWritable / Other wrapper classes
- VersionedWritable / Other wrapper classes
writable wrapper classes
- about / Time for action – using the Writable wrapper classes
- exercises / Have a go hero – playing with Writables