Hadoop Beginner's Guide

Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial.

Product type Paperback
Published in Feb 2013
Publisher Packt
ISBN-13 9781849517300
Length 398 pages
Edition 1st Edition

Table of Contents (19 chapters)

Hadoop Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. What It's All About
2. Getting Hadoop Up and Running
3. Understanding MapReduce
4. Developing MapReduce Programs
5. Advanced MapReduce Techniques
6. When Things Break
7. Keeping Things Running
8. A Relational View on Data with Hive
9. Working with Relational Databases
10. Data Collection with Flume
11. Where to Go Next
Pop Quiz Answers
Index

Index

A

  • AccountRecordMapper class / What just happened?
  • add jar command / What just happened?
  • advanced techniques, MapReduce
    • about / Simple, advanced, and in-between
    • joins / Joins
    • graph algorithms / Graph algorithms
    • language-independent data structures, using / Using language-independent data structures
  • agent
    • about / What just happened?
    • writing, to multiple sinks / Time for action – writing to multiple sinks, What just happened?
  • alternative distributions
    • about / Alternative distributions
    • reasons / Why alternative distributions?
    • bundling / Bundling
    • free and commercial extensions / Free and commercial extensions
    • Cloudera Distribution / Cloudera Distribution for Hadoop
    • Hortonworks Data Platform / Hortonworks Data Platform
    • MapR / MapR
    • IBM InfoSphere Big Insights / IBM InfoSphere Big Insights
    • selecting / Choosing a distribution
  • Apache projects
    • HBase / HBase
    • Oozie / Oozie
    • Whirr / Whirr
    • Mahout / Mahout
    • MRUnit / MRUnit
  • Apache Software Foundation / A better way – introducing Sqoop
  • ApplicationManager
    • about / Upcoming Hadoop changes
  • array wrapper classes
    • about / Array wrapper classes
    • ArrayWritable / Array wrapper classes
    • TwoDArrayWritable / Array wrapper classes
  • alternative schedulers, MapReduce management
    • Capacity Scheduler / Capacity Scheduler
    • Fair Scheduler / Fair Scheduler
    • enabling / Enabling alternative schedulers
    • using / When to use alternative schedulers
  • Avro
    • about / Candidate technologies, Sources, Sinks
    • URL / Introducing Avro
    • downloading / Time for action – getting and installing Avro
    • installing / Time for action – getting and installing Avro
    • setting up / What just happened?
    • advantages / Avro and schemas
    • schemas / Avro and schemas
    • schema, defining / Time for action – defining the schema
    • source Avro data, creating with Ruby / Time for action – creating the source Avro data with Ruby, What just happened?
    • data, consuming with Java / Time for action – consuming the Avro data with Java
    • using, within MapReduce / Using Avro within MapReduce
    • graphs / Have a go hero – graphs in Avro
    • features / Going forward with Avro
  • Avro, within MapReduce
    • shape summaries, generating / Time for action – generating shape summaries in MapReduce, What just happened?
    • output data, examining with Ruby / Time for action – examining the output data with Ruby, What just happened?
    • output data, examining with Java / Time for action – examining the output data with Java, What just happened?
  • Avro-mapred JAR files / What just happened?
  • Avro client
    • about / What just happened?
  • Avro code
    • about / What just happened?
  • Avro data
    • creating, with Ruby / Time for action – creating the source Avro data with Ruby, What just happened?
    • consuming, with Java / Time for action – consuming the Avro data with Java
  • AvroJob
    • about / Using Avro within MapReduce
  • AvroKey
    • about / Using Avro within MapReduce
  • AvroMapper
    • about / Using Avro within MapReduce
  • AvroReducer
    • about / Using Avro within MapReduce
  • AvroValue
    • about / Using Avro within MapReduce
  • AWS
    • about / AWS – infrastructure on demand from Amazon, A note about AWS
    • Elastic Compute Cloud (EC2) / Elastic Compute Cloud (EC2)
    • Simple Storage Service (S3) / Simple Storage Service (S3)
    • Elastic MapReduce (EMR) / Elastic MapReduce (EMR)
    • considerations / AWS considerations
  • AWS account
    • creating / Creating an AWS account
    • needed services, signing up / Signing up for the necessary services
    • management console / Time for action – WordCount on EMR using the management console
  • AWS credentials
    • about / AWS credentials
    • account ID / AWS credentials
    • access key / AWS credentials
    • secret access key / AWS credentials
    • key pairs / AWS credentials
  • AWS developer forums
    • URL / Mailing lists and forums
  • AWS ecosystem
    • about / The AWS ecosystem
    • URL / The AWS ecosystem
  • AWS management console
    • used, for WordCount on EMR / Time for action – WordCount on EMR using the management console
    • URL / Time for action – running UFO analysis on EMR
  • AWS resources
    • about / AWS resources
    • HBase on EMR / HBase on EMR
    • SimpleDB / SimpleDB
    • DynamoDB / DynamoDB

B

  • BackupNameNode
    • about / Upcoming Hadoop changes
  • base HDFS directory
    • changing / Time for action – changing the base HDFS directory
  • big data processing
    • about / Big data processing
    • aspects / The value of data
    • historical trends / Historically for the few and not the many
    • different approach / A different approach, Share nothing, Expect failure, Move processing, not data, Build applications, not infrastructure
  • Bloom filter
    • about / Using a data representation instead of raw data
  • breadth-first search (BFS) / Graphs and MapReduce – a match made somewhere

C

  • C++ interface
    • using / Using languages other than Java with Hadoop
  • candidate technologies
    • about / Candidate technologies
    • Protocol Buffers / Candidate technologies
    • Thrift / Candidate technologies
    • Avro / Candidate technologies
  • capacity
    • adding, to local Hadoop cluster / Adding capacity to a local Hadoop cluster
    • adding, to EMR job flow / Adding capacity to an EMR job flow
  • Capacity Scheduler
    • about / Capacity Scheduler
  • capacityScheduler directory / Enabling alternative schedulers
  • Cascading
    • about / Cascading
    • URL / Cascading
  • CDH
    • about / Cloudera Distribution for Hadoop
  • ChainMapper class
    • using / Time for action – using ChainMapper for field validation/analysis
  • channels
    • about / Channels
  • CheckpointNameNode
    • about / Upcoming Hadoop changes
  • city() function / What just happened?
  • classic data processing systems
    • about / Classic data processing systems
    • scale-up / Scale-up
    • scale-out approach / Early approaches to scale-out
  • Cloud computing, with AWS
    • about / Cloud computing with Amazon Web Services
    • third approach / A third way
    • types of cost / Different types of costs
  • Cloudera
    • about / A better way – introducing Sqoop
    • URL / A better way – introducing Sqoop
  • Cloudera Distribution
    • about / Cloudera Distribution for Hadoop
    • URL / Cloudera Distribution for Hadoop
  • cluster access control
    • about / Cluster access control
    • Hadoop security model / The Hadoop security model
  • cluster masters, killing
    • JobTracker, killing / Time for action – killing the JobTracker, What just happened?
    • replacement JobTracker, starting / Starting a replacement JobTracker
    • JobTracker, moving / Have a go hero – moving the JobTracker to a new host
    • NameNode process, killing / Time for action – killing the NameNode process
    • replacement NameNode, starting / Starting a replacement NameNode
    • NameNode process / The role of the NameNode in more detail
    • files / File systems, files, blocks, and nodes
    • filesystem / File systems, files, blocks, and nodes
    • blocks / File systems, files, blocks, and nodes
    • nodes / File systems, files, blocks, and nodes
    • fsimage / The single most important piece of data in the cluster – fsimage
    • DataNode start-up / DataNode startup
    • safe mode / Safe mode
    • SecondaryNameNode / SecondaryNameNode
    • NameNode failure / So what to do when the NameNode process has a critical failure?
    • BackupNode / BackupNode/CheckpointNode and NameNode HA
    • CheckpointNode / BackupNode/CheckpointNode and NameNode HA
    • NameNode HA / BackupNode/CheckpointNode and NameNode HA
  • column-oriented databases
    • about / Pruning data to fit in the cache
  • combiner class
    • about / Apart from the combiner…maybe
    • features / Why have a combiner?
    • adding, to WordCount / Time for action – WordCount with a combiner
  • command line job management
    • about / Command line job management
  • command output
    • capturing, to flat file / Time for action – capturing the output of a command to a flat file, What just happened?
  • commodity hardware
    • about / What is commodity hardware anyway?
  • commodity versus enterprise class storage
    • about / Commodity versus enterprise class storage
  • common architecture, Hadoop
    • about / Common architecture
    • advantages / What it is and isn't good for
    • disadvantages / What it is and isn't good for
  • CompressedWritable wrapper class
    • about / Other wrapper classes
  • conferences
    • about / Conferences
    • URL / Conferences
  • configuration, Flume / Time for action – installing and configuring Flume, What just happened?
  • configuration, MySQL
    • for remote connections / Time for action – configuring MySQL to allow remote connections, What just happened?
  • configuration, Sqoop / Time for action – downloading and configuring Sqoop, What just happened?
  • configuration files, Flume / Understanding the Flume configuration files
  • considerations, AWS / AWS considerations
  • correlated failures
    • about / The risk of correlated failures
  • counters
    • adding / Counters, status, and other output
  • CPU / memory / storage ratio, Hadoop cluster
    • about / Processor / memory / storage ratio
  • CREATE DATABASE statement / What just happened?
  • CREATE FUNCTION command / What just happened?
  • CREATE TABLE command
    • about / What just happened?
  • cron
    • about / Scheduling
  • curl utility / Getting network traffic into Hadoop, What just happened?

D

  • data
    • getting, into Hadoop / Getting data into Hadoop
    • exporting, from MySQL to HDFS / Time for action – exporting data from MySQL to HDFS, What just happened?
    • importing, into Hive / Importing data into Hive using Sqoop
    • exporting, from MySQL into Hive / Time for action – exporting data from MySQL into Hive, What just happened?
    • importing, from raw query / Time for action – importing data from a raw query, What just happened?
    • getting, out of Hadoop / Getting data out of Hadoop
    • writing, from within reducer / Writing data from within the reducer
    • importing, from Hadoop into MySQL / Time for action – importing data from Hadoop into MySQL, What just happened?
    • about / Data data everywhere...
    • types / Types of data
    • copying, from web server into HDFS / Time for action – getting web server data into Hadoop, What just happened?
    • hidden issues / Hidden issues, A common framework approach
    • lifecycle / Data lifecycle
    • staging / Staging data
    • scheduling / Scheduling
  • data, types
    • network traffic / Types of data
    • file data / Types of data
  • database
    • accessing, from mapper / Accessing the database from the mapper
  • data import
    • improving, type mapping used / Time for action – using a type mapping, What just happened?
  • data input/output formats
    • about / Input/output
    • files / Files, splits, and records
    • splits / Files, splits, and records
    • records / Files, splits, and records
    • InputFormat / InputFormat and RecordReader
    • RecordReaders / InputFormat and RecordReader
    • Hadoop-provided input formats / Hadoop-provided InputFormat
    • Hadoop-provided record readers / Hadoop-provided RecordReader
    • OutputFormats / OutputFormat and RecordWriter
    • RecordWriters / OutputFormat and RecordWriter
    • Hadoop-provided OutputFormats / Hadoop-provided OutputFormat
    • Sequence files / Don't forget Sequence files
  • DataJoinMapperBase class
    • about / DataJoinMapper and TaggedMapperOutput
  • data lifecycle management
    • about / The bigger picture
  • DataNode
    • about / Location of the master nodes
  • data paths
    • about / Common data paths
  • dataset analysis
    • UFO sighting dataset / Getting the UFO sighting dataset
    • Java shape and location analysis / Java shape and location analysis
  • datatype issues
    • about / Datatype issues
  • datatypes, HiveQL
    • Boolean types / What just happened?
    • Integer types / What just happened?
    • Floating point types / What just happened?
    • Textual types / What just happened?
  • datum
    • about / What just happened?
  • default properties
    • about / Default values
    • browsing / Time for action – browsing default properties
  • default security, Hadoop security model
    • demonstrating / Time for action – demonstrating the default security
  • default storage location, Hadoop configuration properties
    • about / Default storage location
  • depth-first search (DFS) / Graphs and MapReduce – a match made somewhere
  • DESCRIBE TABLE command / What just happened?
  • description property element
    • about / Additional property elements
  • dfs.data.dir property / Where to write data
  • dfs.default.name variable
    • about / What just happened?
  • dfs.name.dir property / Where to write data
  • dfs.replication variable
    • about / What just happened?
  • different approach, big data processing
    • about / A different approach
  • dirty data, Hive tables
    • handling / Handling dirty data in Hive
    • query output, exporting / Time for action – exporting query output, What just happened?
  • Distributed Cache
    • used, for improving Java location data output / Time for action – using the Distributed Cache to improve location output, What just happened?
  • driver class, 0.20 MapReduce Java API
    • about / The Driver class
  • dual approach
    • about / A dual approach
  • DynamoDB
    • about / Integration with other AWS products, DynamoDB
    • URL / Integration with other AWS products, DynamoDB

E

  • EC2 / Considering RDS
  • edges
    • about / Graph 101
  • Elastic Compute Cloud (EC2)
    • about / Elastic Compute Cloud (EC2), Signing up for the necessary services
    • URL / Elastic Compute Cloud (EC2)
  • Elastic MapReduce
    • about / Using Elastic MapReduce
    • using / Using Elastic MapReduce
  • Elastic MapReduce (EMR)
    • URL / Elastic MapReduce (EMR)
    • about / Elastic MapReduce (EMR), Signing up for the necessary services
  • employee database
    • setting up / Time for action – setting up the employee database, What just happened?
  • employee table
    • exporting, into HDFS / Have a go hero – exporting the employee table into HDFS
  • EMR / Considering RDS
    • about / A note on EMR
    • benefits / A note on EMR
    • as prototyping platform / EMR as a prototyping platform
  • EMR command-line tools
    • about / The EMR command-line tools
  • EMR Hadoop
    • versus local Hadoop / Comparison of local versus EMR Hadoop
  • EMR job flow
    • capacity, adding / Adding capacity to an EMR job flow
    • expanding / Expanding a running job flow
  • Enterprise Application Integration (EAI)
    • about / A common framework approach
  • ETL tools
    • about / Oozie
    • Pentaho Kettle / Oozie
    • Spring Batch / Oozie
  • evaluate methods / What just happened?
  • events
    • about / It's all about events
  • exec
    • about / Sources
  • export command / What just happened?

F

  • failover sink processor
    • about / Handling sink failure
  • failure types, Hadoop
    • about / Types of failure
    • Hadoop node failures / Hadoop node failure
    • cluster masters, killing / Killing the cluster masters
  • Fair Scheduler
    • about / Fair Scheduler
  • fairScheduler directory / Enabling alternative schedulers
  • features, Sqoop
    • incremental merge / Incremental merge
    • partial exports, avoiding / Avoiding partial exports
    • code generator / Sqoop as a code generator
  • file channel
    • about / Channels
  • file data
    • about / Types of data
  • FileInputFormat
    • about / Hadoop-provided InputFormat
  • FileOutputFormat
    • about / Hadoop-provided OutputFormat
  • files
    • getting, into Hadoop / Getting files into Hadoop
    • versus logs / Logs versus files
  • file_roll sink
    • about / What just happened?
  • final property element
    • about / Additional property elements
  • First In, First Out (FIFO) queue
    • about / Job priorities and scheduling
  • flat file
    • command output, capturing to / Time for action – capturing the output of a command to a flat file, What just happened?
  • Flume
    • about / A common framework approach, Introducing Apache Flume, To Sqoop or to Flume..., Cloudera Distribution for Hadoop
    • URL / Introducing Apache Flume
    • versioning / A note on versioning
    • configuring / Time for action – installing and configuring Flume, What just happened?
    • installing / Time for action – installing and configuring Flume, What just happened?
    • used, for capturing network data / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
    • logging, into console / Time for action – logging to the console, What just happened?
    • network data, writing to log files / Writing network data to log files, What just happened?
    • source / Sources
    • sinks / Sinks
    • channels / Channels
    • configuration files / Understanding the Flume configuration files
    • timestamps, adding / Time for action – adding timestamps, What just happened?
    • sink failure, handling / Handling sink failure
    • features / Next, the world
  • flume.root.logger variable
    • about / What just happened?
  • Flume NG
    • about / A note on versioning
  • Flume OG
    • about / A note on versioning
  • FLUSH PRIVILEGES command / What just happened?
  • fsimage class / Configuring multiple locations for the fsimage class
  • fsimage location
    • adding, to NameNode / Time for action – adding an additional fsimage location
  • fully distributed mode
    • about / Three modes

G

  • GenericRecord class
    • about / What just happened?
  • Google File System (GFS)
    • URL / Thanks, Google
  • GRANT statement / What just happened?
  • granular access control, Hadoop security model
    • about / More granular access control
  • graph algorithms
    • about / Graph algorithms
    • Graph 101 / Graph 101
    • Graphs and MapReduce / Graphs and MapReduce – a match made somewhere
    • nodes / Graphs and MapReduce – a match made somewhere
    • graph, representing / Representing a graph, What just happened?
    • pointer-based representations / Representing a graph
    • adjacency matrix representations / Representing a graph
    • adjacency list representations / Representing a graph
    • common coloring technique / Representing a graph
    • white nodes / Representing a graph
    • gray nodes / Representing a graph
    • black nodes / Representing a graph
    • overview / Overview of the algorithm
    • states, for node / Overview of the algorithm
    • mapper / The mapper
    • reducer / The reducer
    • iterative application / Iterative application
    • source code, creating / Time for action – creating the source code
    • first run / Time for action – the first run, What just happened?
    • second run / Time for action – the second run, What just happened?
    • third run / Time for action – the third run, What just happened?
    • fourth run / Time for action – the fourth and last run, What just happened?
    • multiple jobs, running / Running multiple jobs
    • final thoughts / Final thoughts on graphs
  • graphs, Avro
    • about / Have a go hero – graphs in Avro

H

  • Hadoop
    • about / Hadoop
    • components / Parts of Hadoop
    • common building blocks / Common building blocks
    • architectural principles / Common building blocks
    • HDFS / HDFS
    • MapReduce / MapReduce
    • HDFS and MapReduce / Better together
    • common architecture / Common architecture
    • on local Ubuntu host / Hadoop on a local Ubuntu host
    • on Windows / Other operating systems
    • on Mac OS X / Other operating systems
    • prerequisites / Time for action – checking the prerequisites
    • setting up / Setting up Hadoop
    • versions / A note on versions, Sqoop and Hadoop versions
    • downloading / Time for action – downloading Hadoop
    • SSH, setting up / Time for action – setting up SSH
    • configuring / Configuring and running Hadoop
    • used, for calculating Pi / Time for action – using Hadoop to calculate Pi
    • running / Time for action – using Hadoop to calculate Pi
    • modes / Three modes
    • base folder, configuring / Configuring the base directory and formatting the filesystem
    • filesystem, formatting / Configuring the base directory and formatting the filesystem
    • base HDFS directory, changing / Time for action – changing the base HDFS directory
    • NameNode, formatting / Time for action – formatting the NameNode
    • starting / Time for action – starting Hadoop
    • HDFS, using / Time for action – using HDFS
    • WordCount, running / Time for action – WordCount, the Hello World of MapReduce
    • WordCount, executing on larger body of text / Have a go hero – WordCount on a larger body of text
    • monitoring / Monitoring Hadoop from the browser
    • HDFS web UI / The HDFS web UI
    • MapReduce web UI / The MapReduce web UI
    • failure / Failure
    • embrace failure / Embrace failure
    • failure, types / Types of failure
    • scaling / Scaling
    • data paths / Common data paths
    • as archive store / Hadoop as an archive store
    • as preprocessing step / Hadoop as a preprocessing step
    • as data input tool / Hadoop as a data input tool
    • data, getting into / Getting data into Hadoop
    • network traffic, getting into / Getting network traffic into Hadoop, What just happened?
    • web server data, getting into / Time for action – getting web server data into Hadoop, What just happened?
    • files, getting into / Getting files into Hadoop
    • alternative distributions / Alternative distributions
    • programming abstractions / Other programming abstractions
  • Hadoop, into MySQL
    • data, importing from / Time for action – importing data from Hadoop into MySQL, What just happened?
  • Hadoop-provided input formats
    • about / Hadoop-provided InputFormat
    • FileInputFormat / Hadoop-provided InputFormat
    • SequenceFileInputFormat / Hadoop-provided InputFormat
    • TextInputFormat / Hadoop-provided InputFormat
  • Hadoop-provided OutputFormats
    • about / Hadoop-provided OutputFormat
    • FileOutputFormat / Hadoop-provided OutputFormat
    • NullOutputFormat / Hadoop-provided OutputFormat
    • SequenceFileOutputFormat / Hadoop-provided OutputFormat
    • TextOutputFormat / Hadoop-provided OutputFormat
  • Hadoop-provided record readers
    • about / Hadoop-provided RecordReader
    • LineRecordReader / Hadoop-provided RecordReader
    • SequenceFileRecordReader / Hadoop-provided RecordReader
  • Hadoop-specific data types
    • about / Hadoop-specific data types
    • Writable interface / The Writable and WritableComparable interfaces
    • wrapper classes / Introducing the wrapper classes
  • hadoop/lib directory / Enabling alternative schedulers
  • Hadoop changes
    • about / Upcoming Hadoop changes
    • MapReduce 2.0 or MRV2 / Upcoming Hadoop changes
    • YARN (Yet Another Resource Negotiator) / Upcoming Hadoop changes
  • Hadoop cluster
    • setting up / Setting up a cluster
    • hosts / How many hosts?
    • usable space on node, calculating / Calculating usable space on a node
    • master nodes, location / Location of the master nodes
    • hardware, sizing / Sizing hardware
    • processor / memory / storage ratio / Processor / memory / storage ratio
    • EMR, as prototyping platform / EMR as a prototyping platform
    • special node requirements / Special node requirements
    • storage types / Storage types
    • networking configuration / Hadoop networking configuration
    • commodity hardware / What is commodity hardware anyway?
    • node and running balancer, adding / Have a go hero – adding a node and running balancer
  • Hadoop community
    • about / Sources of information
    • source code / Source code
    • mailing lists and forums / Mailing lists and forums
    • LinkedIn groups / LinkedIn groups
    • HUGs / HUGs
    • conferences / Conferences
  • Hadoop configuration properties
    • about / Hadoop configuration properties
    • default properties / Default values
    • property elements / Additional property elements
    • default storage location / Default storage location
    • setting / Where to set properties
  • Hadoop dependencies / Hadoop dependencies
  • Hadoop failure
    • hardware failures / Hardware failure
    • host failures / Host failure
    • host corruption / Host corruption
    • correlated failures / The risk of correlated failures
  • Hadoop FAQ
    • URL / Other operating systems
  • hadoop fs command / What just happened?
  • Hadoop Java API, for MapReduce
    • about / The Hadoop Java API for MapReduce
    • 0.20 MapReduce Java API / The 0.20 MapReduce Java API
  • hadoop job -history command / What just happened?
  • hadoop job -kill command / What just happened?
  • hadoop job -list all command / What just happened?
  • hadoop job -set-priority command / Job priorities and scheduling, What just happened?
  • hadoop job -status command / What just happened?
  • Hadoop networking configuration
    • about / Hadoop networking configuration
    • blocks, placing / How blocks are placed
    • rack-awareness script / Rack awareness
    • default rack configuration, examining / Time for action – examining the default rack configuration
    • rack awareness script, adding / Time for action – adding a rack awareness script, What just happened?
  • Hadoop node failures
    • dfsadmin command / The dfsadmin command
    • cluster setup / Cluster setup, test files, and block sizes
    • test files / Cluster setup, test files, and block sizes
    • block sizes / Cluster setup, test files, and block sizes
    • fault tolerance / Fault tolerance and Elastic MapReduce
    • Elastic MapReduce / Fault tolerance and Elastic MapReduce
    • DataNode process, killing / Time for action – killing a DataNode process, What just happened?
    • NameNode and DataNode communication / NameNode and DataNode communication
    • NameNode log delving / Have a go hero – NameNode log delving
    • replication factor / Time for action – the replication factor in action, What just happened?
    • missing blocks, causing intentionally / Time for action – intentionally causing missing blocks, What just happened?
    • data loss / When data may be lost
    • block corruption / Block corruption
    • TaskTracker process, killing / Time for action – killing a TaskTracker process, What just happened?
    • DataNode and TaskTracker failures, comparing / Comparing the DataNode and TaskTracker failures
    • permanent failure / Permanent failure
  • Hadoop Pipes
    • about / Using languages other than Java with Hadoop
  • Hadoop security model
    • about / The Hadoop security model
    • default security, demonstrating / Time for action – demonstrating the default security
    • user identity / User identity
    • granular access control / More granular access control
    • working around, via physical access control / Working around the security model via physical access control
  • Hadoop Streaming
    • about / Using languages other than Java with Hadoop
    • working / How Hadoop Streaming works
    • advantages / Why to use Hadoop Streaming, Differences in jobs when using Streaming
    • using, in WordCount / Time for action – implementing WordCount using Streaming, What just happened?
  • Hadoop Summit
    • about / Conferences
  • Hadoop versioning
    • about / A note on versions
  • hardware failure
    • about / Hardware failure
  • HBase
    • about / What it is and isn't good for, Sinks, HBase
    • URL / HBase
  • HBase on EMR
    • about / HBase on EMR
  • HDFS
    • about / Parts of Hadoop, HDFS
    • features / HDFS
    • using / Time for action – using HDFS, What just happened?
    • managing / Managing HDFS
    • data, writing / Where to write data
    • balancer, using / Using balancer
    • rebalancing / When to rebalance
    • employee table, exporting into / Have a go hero – exporting the employee table into HDFS
    • and Sqoop / Sqoop and HDFS
    • network traffic, writing onto / Time for action – writing network traffic onto HDFS, What just happened?
  • HDFS web UI
    • about / The HDFS web UI
  • hidden issues, data
    • about / Hidden issues
    • network data, keeping on network / Keeping network data on the network
    • Hadoop dependencies / Hadoop dependencies
    • reliability / Reliability
    • common framework approach / A common framework approach
  • historical trends, big data processing
    • about / Historically for the few and not the many
    • classic data processing systems / Classic data processing systems
    • limiting factors / Limiting factors
  • Hive
    • overview / Overview of Hive
    • benefits / Why use Hive?
    • setting up / Setting up Hive
    • prerequisites / Prerequisites
    • downloading / Getting Hive
    • installing / Time for action – installing Hive
    • using / Using Hive
    • table for UFO data, creating / Time for action – creating a table for the UFO data, What just happened?
    • UFO data, adding to table / Time for action – inserting the UFO data
    • data, validating / Validating the data
    • table, validating / Time for action – validating the table, What just happened?
    • bucketing / Bucketing, clustering, and sorting... oh my!
    • clustering / Bucketing, clustering, and sorting... oh my!
    • sorting / Bucketing, clustering, and sorting... oh my!
    • user-defined functions / User-Defined Function
    • versus Pig / Hive versus Pig
    • features / What we didn't cover
    • data, importing into / Importing data into Hive using Sqoop
  • Hive, on AWS
    • UFO analysis, running on EMR / Time for action – running UFO analysis on EMR, What just happened?
    • interactive job flows, using for development / Using interactive job flows for development
    • interactive EMR cluster, using / Have a go hero – using an interactive EMR cluster
  • Hive and SQL views
    • about / Hive and SQL views
    • using / Time for action – using views, What just happened?
  • Hive data
    • importing, into MySQL / Time for action – importing Hive data into MySQL, What just happened?
  • Hive exports
    • and Sqoop / Sqoop and Hive exports
  • Hive partitions
    • about / Sqoop and Hive partitions
    • and Sqoop / Sqoop and Hive partitions
  • HiveQL
    • about / What just happened?
    • datatypes / What just happened?
  • HiveQL command
    • about / Hive versus Pig
  • HiveQL query planner
    • about / Hive versus Pig
  • Hive tables
    • about / Hive tables – real or not?
    • creating, from existing file / Time for action – creating a table from an existing file, What just happened?
    • join, performing / Time for action – performing a join, What just happened?
    • join, improving / Have a go hero – improve the join to use regular expressions
    • dirty data, handling / Handling dirty data in Hive
    • partitioning / Partitioning the table
    • partitioned UFO sighting table, creating / Time for action – making a partitioned UFO sighting table, What just happened?
  • Hive transforms / User-Defined Function
  • Hortonworks
    • about / Hortonworks Data Platform
  • Hortonworks Data Platform
    • about / Hortonworks Data Platform
    • URL / Hortonworks Data Platform
  • host failure
    • about / Host failure
  • HTTPClient
    • about / Have a go hero
  • HTTP Components / Have a go hero
  • HTTP protocol
    • about / What just happened?
  • HUGs
    • about / HUGs

I

  • IBM InfoSphere Big Insights
    • about / IBM InfoSphere Big Insights
    • URL / IBM InfoSphere Big Insights
  • InputFormat class
    • about / InputFormat and RecordReader, Using Avro within MapReduce
  • INSERT command / What just happened?
  • insert statement
    • versus update statement / Inserts versus updates
  • installation, Flume / Time for action – installing and configuring Flume, What just happened?
  • installation, MySQL / Time for action – installing and setting up MySQL, What just happened?
  • installation, Sqoop / Time for action – downloading and configuring Sqoop, What just happened?
  • interactive EMR cluster
    • using / Have a go hero – using an interactive EMR cluster
  • interactive job flows
    • using, for development / Using interactive job flows for development
  • Iterator object / What just happened?

J

  • java.sql.Date / What just happened?
  • Java Development Kit (JDK) / Time for action – checking the prerequisites
  • Java HDFS interface / Have a go hero
  • Java IllegalArgumentExceptions / What just happened?
  • Java shape and location analysis
    • about / Java shape and location analysis
    • ChainMapper, using for record validation / Time for action – using ChainMapper for field validation/analysis, What just happened?
    • issues, with output data / Too many abbreviations
    • Distributed Cache, using / Using the Distributed Cache, Time for action – using the Distributed Cache to improve location output
  • JDBC / Writing data from within the reducer
  • JDBC channel
    • about / Channels
  • JobConf class / Where to set properties
  • job priorities, MapReduce management
    • changing / Job priorities and scheduling, Time for action – changing job priorities and killing a job
    • scheduling / Time for action – changing job priorities and killing a job
  • JobTracker
    • about / Location of the master nodes
  • JobTracker UI
    • about / The MapReduce web UI
  • joins
    • about / Joins
    • disadvantages / When this is a bad idea
    • map-side, versus reduce-side joins / Map-side versus reduce-side joins
    • account and sales information, matching / Matching account and sales information
    • reduce-side join, implementing / Matching account and sales information
    • map-side joins, implementing / Implementing map-side joins
    • limitations / To join or not to join...

K

  • key/value data
    • about / Why key/value data?
    • real-world examples / Some real-world examples
    • MapReduce, using / MapReduce as a series of key/value transformations
  • key/value pairs
    • about / Key/value pairs, What it mean
    • key/value data / Why key/value data?

L

  • language-independent data structures
    • about / Using language-independent data structures
    • candidate technologies / Candidate technologies
    • Avro / Introducing Avro
  • LineCounters / What just happened?
  • LineRecordReader
    • about / Hadoop-provided RecordReader
  • LinkedIn groups
    • about / LinkedIn groups
    • URL / LinkedIn groups
  • list jars command / What just happened?
  • load balancing sink processor
    • about / Handling sink failure
  • LOAD DATA statement / Be careful with data file access rights
  • local flat file
    • remote file, capturing to / Time for action – capturing a remote file in a local flat file, What just happened?
  • local Hadoop
    • versus EMR Hadoop / Comparison of local versus EMR Hadoop
  • local standalone mode
    • about / Three modes
  • log file
    • network traffic, capturing to / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
  • logrotate
    • about / Scheduling
  • logs
    • versus files / Logs versus files

M

  • 0.20 MapReduce Java API
    • about / The 0.20 MapReduce Java API
    • Mapper class / The Mapper class
    • Reducer class / The Reducer class
    • driver class / The Driver class
  • Mahout
    • about / Mahout
    • URL / Mahout
  • map-side joins
    • about / Map-side versus reduce-side joins
    • implementing, Distributed Cache used / Using the Distributed Cache
    • data pruning, for fitting cache / Pruning data to fit in the cache
    • data representation, using / Using a data representation instead of raw data
    • multiple mappers, using / Using multiple mappers
  • mapper
    • database, accessing from / Accessing the database from the mapper
  • mapper and reducer implementations
    • about / Hadoop-provided mapper and reducer implementations
  • Mapper class, 0.20 MapReduce Java API
    • about / The Mapper class
    • setup method / The Mapper class
    • map method / The Mapper class
    • cleanup method / The Mapper class
  • mappers / MapReduce
    • about / Mappers and primary key columns
  • MapR
    • about / MapR
    • URL / MapR
  • mapred.job.tracker property / What about MapReduce?
  • mapred.job.tracker variable
    • about / What just happened?
  • mapred.map.max.attempts
    • about / Hadoop's handling of failing tasks
  • mapred.max.tracker.failures
    • about / Hadoop's handling of failing tasks
  • mapred.reduce.max.attempts
    • about / Hadoop's handling of failing tasks
  • MapReduce
    • about / Parts of Hadoop, MapReduce
    • features / MapReduce
    • used, as key/value transformations / MapReduce as a series of key/value transformations
    • Hadoop Java API / The Hadoop Java API for MapReduce
    • advanced techniques / Simple, advanced, and in-between
    / Staging data
  • MapReduce 2.0 or MRV2
    • about / Upcoming Hadoop changes
  • MapReduce job analysis
    • developing / Counters, status, and other output, Time for action – creating counters, task states, and writing log output
  • MapReduce management
    • about / MapReduce management
    • command line job management / Command line job management
    • job priorities / Job priorities and scheduling
    • scheduling / Job priorities and scheduling
    • alternative schedulers / Alternative schedulers
    • alternative schedulers, enabling / Enabling alternative schedulers
    • alternative schedulers, using / When to use alternative schedulers
  • MapReduce programs
    • writing / Writing MapReduce programs
    • classpath, setting up / Time for action – setting up the classpath, What just happened?
    • WordCount, implementing / Time for action – implementing WordCount, What just happened?
    • JAR file, building / Time for action – building a JAR file, What just happened?
    • WordCount, on local Hadoop cluster / Time for action – running WordCount on a local Hadoop cluster
    • WordCount, running on EMR / Time for action – running WordCount on EMR
    • pre-0.20 Java MapReduce API / The pre-0.20 Java MapReduce API
    • Hadoop-provided mapper and reducer implementations / Hadoop-provided mapper and reducer implementations
  • MapReduce programs development
    • languages, using / Using languages other than Java with Hadoop
    • large dataset, analyzing / Analyzing a large dataset
    • counters / Counters, status, and other output
    • status / Counters, status, and other output
    • job analysis workflow, developing / Counters, status, and other output
    • counters, creating / Time for action – creating counters, task states, and writing log output
    • task states / Time for action – creating counters, task states, and writing log output
  • MapReduce web UI
    • about / The MapReduce web UI
  • map wrapper classes
    • AbstractMapWritable / Map wrapper classes
    • MapWritable / Map wrapper classes
    • SortedMapWritable / Map wrapper classes
  • master nodes
    • location / Location of the master nodes
  • mean time between failures (MTBF) / Commodity versus enterprise class storage
  • memory channel
    • about / Channels
  • Message Passing Interface (MPI)
    • about / Upcoming Hadoop changes
  • MetaStore
    • about / What we didn't cover
  • modes
    • local standalone mode / Three modes
    • pseudo-distributed mode / Three modes
    • fully distributed mode / Three modes
  • MRUnit
    • URL / MRUnit
    • about / MRUnit
  • multi-level Flume networks
    • about / Time for action – multi level Flume networks, What just happened?
  • MultipleInputs class / What just happened?
  • multiple sinks
    • agent, writing to / Time for action – writing to multiple sinks, What just happened?
  • multiplexing
    • about / Selectors replicating and multiplexing
  • multiplexing source selector
    • about / Selectors replicating and multiplexing
  • MySQL
    • setting up / Setting up MySQL, Time for action – installing and setting up MySQL, What just happened?
    • installing / Time for action – installing and setting up MySQL, What just happened?
    • configuring, for remote connections / Time for action – configuring MySQL to allow remote connections, What just happened?
    • Hive data, importing into / Time for action – importing Hive data into MySQL, What just happened?
  • MySQL, into Hive
    • data, exporting from / Time for action – exporting data from MySQL into Hive, What just happened?
  • MySQL, to HDFS
    • data, exporting from / Time for action – exporting data from MySQL to HDFS, What just happened?
  • mysql command / To Sqoop or to Flume...
  • mysql command-line utility
    • about / What just happened?
    • options / What just happened?
  • mysqldump utility
    • about / Using MySQL tools and manual import
  • MySQL tools
    • used, for exporting data into Hadoop / Using MySQL tools and manual import

N

  • NameNode
    • formatting / Time for action – formatting the NameNode
    • about / Location of the master nodes
    • managing / Managing the NameNode
    • multiple locations, configuring / Configuring multiple locations for the fsimage class
    • fsimage location, adding / Time for action – adding an additional fsimage location
    • fsimage copies, writing / Where to write the fsimage copies
    • host, swapping / Swapping to another NameNode host
  • NameNode host, swapping
    • disaster recovery / Having things ready before disaster strikes
    • swapping, to new NameNode host / Time for action – swapping to a new NameNode host, What just happened?
  • Netcat
    • about / What just happened?, Sources
  • network
    • data, keeping on / Keeping network data on the network
  • network data
    • keeping, on network / Keeping network data on the network
    • capturing, Flume used / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
    • writing, to log files / Writing network data to log files, What just happened?
  • Network File System (NFS) / Network storage
  • network storage
    • about / Network storage
  • network traffic
    • about / Types of data
    • getting, into Hadoop / Getting network traffic into Hadoop, What just happened?
    • capturing, to log file / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
    • writing, onto HDFS / Time for action – writing network traffic onto HDFS, What just happened?
  • Node inner class / What just happened?
  • NullOutputFormat
    • about / Hadoop-provided OutputFormat
  • NullWritable wrapper class
    • about / Other wrapper classes

O

  • ObjectWritable wrapper class
    • about / Other wrapper classes
  • Oozie
    • about / Oozie
    • URL / Oozie
  • Open JDK / Time for action – checking the prerequisites
  • OutputFormat class
    • about / OutputFormat and RecordWriter

P

  • partitioned UFO sighting table
    • creating / Time for action – making a partitioned UFO sighting table, What just happened?
  • Pentaho Kettle
    • URL / Oozie
  • Pi
    • calculating, Hadoop used / Time for action – using Hadoop to calculate Pi
  • Pig
    • about / Hive versus Pig, Pig
    • URL / Pig
  • Pig Latin
    • about / Hive versus Pig
  • pre-0.20 Java MapReduce API
    • about / The pre-0.20 Java MapReduce API
  • primary key column
    • about / Mappers and primary key columns
  • primitive wrapper classes
    • about / Primitive wrapper classes
    • BooleanWritable / Primitive wrapper classes
    • ByteWritable / Primitive wrapper classes
    • DoubleWritable / Primitive wrapper classes
    • FloatWritable / Primitive wrapper classes
    • IntWritable / Primitive wrapper classes
    • LongWritable / Primitive wrapper classes
    • VIntWritable / Primitive wrapper classes
    • VLongWritable / Primitive wrapper classes
  • process ID (PID) / Time for action – killing a DataNode process
  • programming abstractions
    • about / Other programming abstractions
    • Pig / Pig
    • Cascading / Cascading
  • Project Gutenberg
    • URL / Have a go hero – WordCount on a larger body of text
  • property elements
    • about / Additional property elements
    • description / Additional property elements
    • final / Additional property elements
  • Protocol Buffers
    • URL / Candidate technologies
    • about / Candidate technologies, Introducing Apache Flume
  • pseudo-distributed mode
    • about / Three modes
    • configuring / Time for action – configuring the pseudo-distributed mode
    • configuration variables / What just happened?

Q

  • query output, Hive
    • exporting / Time for action – exporting query output, What just happened?

R

  • raw query
    • data, importing from / Time for action – importing data from a raw query, What just happened?
  • RDBMS
    • about / Hadoop as an archive store
  • RDS
    • considering / Considering RDS
  • real-world examples, key/value data
    • about / Some real-world examples
  • RecordReader class
    • about / InputFormat and RecordReader
  • RecordWriter class
    • about / OutputFormat and RecordWriter
  • reduce-side join
    • about / Map-side versus reduce-side joins
    • implementing / Matching account and sales information
    • implementing, MultipleInputs used / Time for action – reduce-side join using MultipleInputs
    • DataJoinMapper class / DataJoinMapper and TaggedMapperOutput
    • TaggedMapperOutput class / DataJoinMapper and TaggedMapperOutput
  • ReduceJoinReducer class / What just happened?
  • reducer
    • data, writing from / Writing data from within the reducer
    • SQL import files, writing from / Writing SQL import files from the reducer
  • Reducer class, 0.20 MapReduce Java API
    • about / The Reducer class
    • reduce method / The Reducer class
    • setup method / The Reducer class
    • run method / The Reducer class
    • cleanup method / The Reducer class
  • reducers / MapReduce
  • Redundant Arrays of Inexpensive Disks (RAID) / Single disk versus RAID
  • remote connections
    • MySQL, configuring for / Time for action – configuring MySQL to allow remote connections, What just happened?
  • remote file
    • capturing, to local flat file / Time for action – capturing a remote file in a local flat file, What just happened?
  • remote procedure call (RPC) framework
    • about / Going forward with Avro
  • replicating
    • about / Selectors replicating and multiplexing
  • ResourceManager
    • about / Upcoming Hadoop changes
  • Ruby API
    • URL / What just happened?

S

  • SalesRecordMapper class / What just happened?
  • scale-out approach
    • about / Early approaches to scale-out
    • benefits / Early approaches to scale-out
  • scale-up approach
    • about / Scale-up
    • advantages / Scale-up
  • scaling
    • capacity, adding to local Hadoop cluster / Adding capacity to a local Hadoop cluster
    • capacity, adding to EMR job flow / Adding capacity to an EMR job flow
  • schemas, Avro
    • defining / Time for action – defining the schema
    • Sighting_date field / What just happened?
    • City field / What just happened?
    • Shape field / What just happened?
    • Duration field / What just happened?
  • SecondaryNameNode
    • about / Location of the master nodes
  • selective import
    • performing / Time for action – a more selective import, What just happened?
  • SELECT statement
    • about / Using MySQL tools and manual import
  • SequenceFile class
    • about / Don't forget Sequence files
  • SequenceFileInputFormat
    • about / Hadoop-provided InputFormat
  • SequenceFileOutputFormat
    • about / Hadoop-provided OutputFormat
  • SequenceFileRecordReader
    • about / Hadoop-provided RecordReader
  • SerDe
    • about / What we didn't cover
  • SimpleDB / What just happened?
    • about / SimpleDB
    • URL / SimpleDB
  • Simple Storage Service (S3)
    • about / Simple Storage Service (S3), Signing up for the necessary services
    • URL / Simple Storage Service (S3)
  • single disk versus RAID
    • about / Single disk versus RAID
  • sink
    • about / What just happened?, Sinks
  • sink failure
    • handling / Handling sink failure
  • skip mode
    • about / Using Hadoop's skip mode
  • source
    • about / What just happened?, Sources
  • source code
    • about / Source code
  • special node requirements, Hadoop cluster
    • about / Special node requirements
  • Spring Batch
    • URL / Oozie
  • SQL import files
    • writing, from reducer / Writing SQL import files from the reducer
  • Sqoop
    • URL, for homepage / A better way – introducing Sqoop
    • installing / Time for action – downloading and configuring Sqoop, What just happened?
    • configuring / Time for action – downloading and configuring Sqoop, What just happened?
    • downloading / Time for action – downloading and configuring Sqoop, What just happened?
    • versions / Sqoop and Hadoop versions
    • and HDFS / Sqoop and HDFS
    • primary key columns / Mappers and primary key columns
    • mappers / Mappers and primary key columns
    • architecture / Sqoop's architecture
    • used, for importing data into Hive / Importing data into Hive using Sqoop
    • and Hive partitions / Sqoop and Hive partitions
    • field and line terminators / Field and line terminators
    • and Hive exports / Sqoop and Hive exports
    • export, re-running / Time for action – fixing the mapping and re-running the export, What just happened?
    • mapping, fixing / Time for action – fixing the mapping and re-running the export, What just happened?
    • features / Incremental merge, Sqoop as a code generator
    • as code generator / Sqoop as a code generator
    • about / To Sqoop or to Flume..., Cloudera Distribution for Hadoop
  • sqoop command-line utility / What just happened?
  • Sqoop exports
    • versus Sqoop imports / Differences between Sqoop imports and exports
  • Sqoop imports
    • versus Sqoop exports / Differences between Sqoop imports and exports
  • start-balancer.sh script / Using balancer
  • stop-balancer.sh script / Using balancer
  • Storage Area Network (SAN) / Network storage
  • storage types, Hadoop cluster
    • about / Storage types
    • commodity, versus enterprise class storage / Commodity versus enterprise class storage
    • single disk, versus RAID / Single disk versus RAID
    • balancing / Finding the balance
    • network storage / Network storage
  • Streaming WordCount mapper
    • about / Differences in jobs when using Streaming
  • syslogd
    • about / Sources

T

  • TaggedMapperOutput class
    • about / DataJoinMapper and TaggedMapperOutput
  • task failures, due to data
    • about / Task failure due to data
    • dirty data, handling through code / Handling dirty data through code
    • skip mode, using / Using Hadoop's skip mode
    • dirty data, handling by skip mode / Time for action – handling dirty data by using skip mode, What just happened?
  • task failures, due to software
    • about / Task failure due to software
    • slow running tasks / Failure of slow running tasks, Time for action – causing task failure
    • HDFS programmatic access / Have a go hero – HDFS programmatic access
    • slow-running tasks, handling / Hadoop's handling of slow-running tasks
    • speculative execution / Speculative execution
    • failing tasks, handling / Hadoop's handling of failing tasks
  • TextInputFormat
    • about / Hadoop-provided InputFormat
  • TextOutputFormat
    • about / Hadoop-provided OutputFormat
  • Thrift
    • about / Candidate technologies, Introducing Apache Flume
    • URL / Candidate technologies
  • timestamp() function / What just happened?
  • TimestampInterceptor class / What just happened?
  • timestamps
    • used, for writing data into directory / Time for action – adding timestamps, What just happened?
    • adding / Time for action – adding timestamps, What just happened?
  • traditional relational databases
    • about / Pruning data to fit in the cache
  • type mapping
    • used, for improving data import / Time for action – using a type mapping, What just happened?

U

  • Ubuntu
    • about / What just happened?
  • UDFMethodResolver interface / What just happened?
  • UDP syslogd source / It's all about events
  • UFO analysis
    • running, on EMR / Time for action – running UFO analysis on EMR
  • ufodata / What just happened?
  • UFO dataset
    • UFO data, summarizing / Time for action – summarizing the UFO data, What just happened?
    • UFO shapes, examining / Examining UFO shapes
    • shape data, summarizing / Time for action – summarizing the shape data, What just happened?
    • sighting duration, correlating to UFO shape / Time for action – correlating of sighting duration to UFO shape, What just happened?
    • Streaming scripts, using outside Hadoop / Using Streaming scripts outside Hadoop
    • shape/time analysis, performing from command line / Time for action – performing the shape/time analysis from the command line, What just happened?
  • UFO data table, Hive
    • creating / Time for action – creating a table for the UFO data, What just happened?
    • data, loading / Time for action – inserting the UFO data, What just happened?
    • data, validating / Validating the data, What just happened?
    • redefining, with correct column separator / Time for action – redefining the table with the correct column separator, What just happened?
  • UFO sighting dataset
    • getting / Getting the UFO sighting dataset
  • UFO sighting records
    • sighting date / Getting the UFO sighting dataset
    • recorded date / Getting the UFO sighting dataset
    • location / Getting the UFO sighting dataset
    • shape / Getting the UFO sighting dataset
    • duration / Getting the UFO sighting dataset
    • description / Getting the UFO sighting dataset
  • Unix chmod / What just happened?
  • update statement
    • versus insert statement / Inserts versus updates
  • user-defined functions (UDF)
    • about / User-Defined Function
    • adding / Time for action – adding a new User Defined Function (UDF), What just happened?
  • user identity, Hadoop security model
    • about / User identity
    • super user / The super user
  • USE statement / What just happened?

V

  • VersionedWritable wrapper class
    • about / Other wrapper classes
  • versioning
    • about / A note on versioning

W

  • web server data
    • getting, into Hadoop / Time for action – getting web server data into Hadoop, What just happened?
  • WHERE clause / What just happened?
  • Whirr
    • about / Whir
    • URL / Whir
  • WordCount example
    • executing / Time for action – WordCount, the Hello World of MapReduce, What just happened?, Have a go hero – WordCount on a larger body of text
    • mapper and reducer implementations, using / Time for action – WordCount the easy way
    • start-up / Startup
    • input, splitting / Splitting the input
    • task assignment / Task assignment
    • task start-up / Task startup
    • JobTracker monitoring / Ongoing JobTracker monitoring
    • mapper input / Mapper input
    • mapper execution / Mapper execution
    • mapper output / Mapper output and reduce input
    • reduce input / Mapper output and reduce input
    • partitioning / Partitioning
    • optional partition function / The optional partition function
    • reducer input / Reducer input
    • reducer execution / Reducer execution
    • reducer output / Reducer output
    • shutdown / Shutdown
    • combiner class, using / Apart from the combiner…maybe, Time for action – WordCount with a combiner
    • reducer, using as combiner / When you can use the reducer as the combiner
    • fixing, to work with combiner / Time for action – fixing WordCount to work with a combiner
    • implementing, Streaming used / Time for action – implementing WordCount using Streaming, What just happened?
  • WordCount example, on EMR
    • AWS management console used / Time for action – WordCount on EMR using the management console
  • wrapper classes
    • about / Introducing the wrapper classes
    • primitive wrapper classes / Primitive wrapper classes
    • array wrapper classes / Array wrapper classes
    • map wrapper classes / Map wrapper classes
    • writable wrapper classes / Time for action – using the Writable wrapper classes
    • CompressedWritable / Other wrapper classes
    • ObjectWritable / Other wrapper classes
    • NullWritable / Other wrapper classes
    • VersionedWritable / Other wrapper classes
  • writable wrapper classes
    • about / Time for action – using the Writable wrapper classes
    • exercises / Have a go hero – playing with Writables

Y

  • YARN
    • about / Upcoming Hadoop changes