Index
A
- AccountRecordMapper class / What just happened?
- add jar command / What just happened?
- advanced techniques, MapReduce
- about / Simple, advanced, and in-between
- joins / Joins
- graph algorithms / Graph algorithms
- language-independent data structures, using / Using language-independent data structures
- agent
- about / What just happened?
- writing, to multiple sinks / Time for action – writing to multiple sinks, What just happened?
- alternative distributions
- about / Alternative distributions
- reasons / Why alternative distributions?
- bundling / Bundling
- free and commercial extensions / Free and commercial extensions
- Cloudera Distribution / Cloudera Distribution for Hadoop
- Hortonworks Data Platform / Hortonworks Data Platform
- MapR / MapR
- IBM InfoSphere Big Insights / IBM InfoSphere Big Insights
- selecting / Choosing a distribution
- Apache projects
- HBase / HBase
- Oozie / Oozie
- Whirr / Whir
- Mahout / Mahout
- MRUnit / MRUnit
- Apache Software Foundation / A better way – introducing Sqoop
- ApplicationManager
- about / Upcoming Hadoop changes
- array wrapper classes
- about / Array wrapper classes
- ArrayWritable / Array wrapper classes
- TwoDArrayWritable / Array wrapper classes
- alternative schedulers, MapReduce management
- Capacity Scheduler / Capacity Scheduler
- Fair Scheduler / Fair Scheduler
- enabling / Enabling alternative schedulers
- using / When to use alternative schedulers
- Avro
- about / Candidate technologies, Sources, Sinks
- URL / Introducing Avro
- downloading / Time for action – getting and installing Avro
- installing / Time for action – getting and installing Avro
- setting up / What just happened?
- advantages / Avro and schemas
- schemas / Avro and schemas
- schema, defining / Time for action – defining the schema
- source Avro data, creating with Ruby / Time for action – creating the source Avro data with Ruby, What just happened?
- data, consuming with Java / Time for action – consuming the Avro data with Java
- using, within MapReduce / Using Avro within MapReduce
- graphs / Have a go hero – graphs in Avro
- features / Going forward with Avro
- Avro, within MapReduce
- shape summaries, generating / Time for action – generating shape summaries in MapReduce, What just happened?
- output data, examining with Ruby / Time for action – examining the output data with Ruby, What just happened?
- output data, examining with Java / Time for action – examining the output data with Java, What just happened?
- Avro-mapred JAR files / What just happened?
- Avro client
- about / What just happened?
- Avro code
- about / What just happened?
- Avro data
- creating, with Ruby / Time for action – creating the source Avro data with Ruby, What just happened?
- consuming, with Java / Time for action – consuming the Avro data with Java
- AvroJob
- about / Using Avro within MapReduce
- AvroKey
- about / Using Avro within MapReduce
- AvroMapper
- about / Using Avro within MapReduce
- AvroReducer
- about / Using Avro within MapReduce
- AvroValue
- about / Using Avro within MapReduce
- AWS
- about / AWS – infrastructure on demand from Amazon, A note about AWS
- Elastic Compute Cloud (EC2) / Elastic Compute Cloud (EC2)
- Simple Storage Service (S3) / Simple Storage Service (S3)
- Elastic MapReduce (EMR) / Elastic MapReduce (EMR)
- considerations / AWS considerations
- AWS account
- creating / Creating an AWS account
- needed services, signing up / Signing up for the necessary services
- management console / Time for action – WordCount on EMR using the management console
- AWS credentials
- about / AWS credentials
- account ID / AWS credentials
- access key / AWS credentials
- secret access key / AWS credentials
- key pairs / AWS credentials
- AWS developer forums
- URL / Mailing lists and forums
- AWS ecosystem
- about / The AWS ecosystem
- URL / The AWS ecosystem
- AWS management console
- used, for WordCount on EMR / Time for action – WordCount on EMR using the management console
- URL / Time for action – running UFO analysis on EMR
- AWS resources
- about / AWS resources
- HBase on EMR / HBase on EMR
- SimpleDB / SimpleDB
- DynamoDB / DynamoDB
B
- BackupNameNode
- about / Upcoming Hadoop changes
- base HDFS directory
- changing / Time for action – changing the base HDFS directory
- big data processing
- about / Big data processing
- aspects / The value of data
- historical trends / Historically for the few and not the many
- different approach / A different approach, Share nothing, Expect failure, Move processing, not data, Build applications, not infrastructure
- Bloom filter
- about / Using a data representation instead of raw data
- breadth-first search (BFS) / Graphs and MapReduce – a match made somewhere
C
- C++ interface
- using / Using languages other than Java with Hadoop
- candidate technologies
- about / Candidate technologies
- Protocol Buffers / Candidate technologies
- Thrift / Candidate technologies
- Avro / Candidate technologies
- capacity
- adding, to local Hadoop cluster / Adding capacity to a local Hadoop cluster
- adding, to EMR job flow / Adding capacity to an EMR job flow
- Capacity Scheduler
- about / Capacity Scheduler
- capacityScheduler directory / Enabling alternative schedulers
- Cascading
- about / Cascading
- URL / Cascading
- CDH
- about / Cloudera Distribution for Hadoop
- ChainMapper class
- using / Time for action – using ChainMapper for field validation/analysis
- channels
- about / Channels
- CheckpointNameNode
- about / Upcoming Hadoop changes
- city() function / What just happened?
- classic data processing systems
- about / Classic data processing systems
- scale-up / Scale-up
- scale-out approach / Early approaches to scale-out
- Cloud computing, with AWS
- about / Cloud computing with Amazon Web Services
- third approach / A third way
- types of cost / Different types of costs
- Cloudera
- about / A better way – introducing Sqoop
- URL / A better way – introducing Sqoop
- Cloudera Distribution
- about / Cloudera Distribution for Hadoop
- URL / Cloudera Distribution for Hadoop
- cluster access control
- about / Cluster access control
- Hadoop security model / The Hadoop security model
- cluster masters, killing
- JobTracker, killing / Time for action – killing the JobTracker, What just happened?
- replacement JobTracker, starting / Starting a replacement JobTracker
- JobTracker, moving / Have a go hero – moving the JobTracker to a new host
- NameNode process, killing / Time for action – killing the NameNode process
- replacement NameNode, starting / Starting a replacement NameNode
- NameNode process / The role of the NameNode in more detail
- files / File systems, files, blocks, and nodes
- filesystem / File systems, files, blocks, and nodes
- blocks / File systems, files, blocks, and nodes
- nodes / File systems, files, blocks, and nodes
- fsimage / The single most important piece of data in the cluster – fsimage
- DataNode start-up / DataNode startup
- safe mode / Safe mode
- SecondaryNameNode / SecondaryNameNode
- NameNode failure / So what to do when the NameNode process has a critical failure?
- BackupNode / BackupNode/CheckpointNode and NameNode HA
- CheckpointNode / BackupNode/CheckpointNode and NameNode HA
- NameNode HA / BackupNode/CheckpointNode and NameNode HA
- column-oriented databases
- about / Pruning data to fit in the cache
- combiner class
- about / Apart from the combiner…maybe
- features / Why have a combiner?
- adding, to WordCount / Time for action – WordCount with a combiner
- command line job management
- about / Command line job management
- command output
- capturing, to flat file / Time for action – capturing the output of a command to a flat file, What just happened?
- commodity hardware
- about / What is commodity hardware anyway?
- commodity versus enterprise class storage
- about / Commodity versus enterprise class storage
- common architecture, Hadoop
- about / Common architecture
- advantages / What it is and isn't good for
- disadvantages / What it is and isn't good for
- CompressedWritable wrapper class
- about / Other wrapper classes
- conferences
- about / Conferences
- URL / Conferences
- configuration, Flume / Time for action – installing and configuring Flume, What just happened?
- configuration, MySQL
- for remote connections / Time for action – configuring MySQL to allow remote connections, What just happened?
- configuration, Sqoop / Time for action – downloading and configuring Sqoop, What just happened?
- configuration files, Flume / Understanding the Flume configuration files
- considerations, AWS / AWS considerations
- correlated failures
- about / The risk of correlated failures
- counters
- adding / Counters, status, and other output
- CPU / memory / storage ratio, Hadoop cluster
- about / Processor / memory / storage ratio
- CREATE DATABASE statement / What just happened?
- CREATE FUNCTION command / What just happened?
- CREATE TABLE command
- about / What just happened?
- cron
- about / Scheduling
- curl utility / Getting network traffic into Hadoop, What just happened?
D
- data
- getting, into Hadoop / Getting data into Hadoop
- exporting, from MySQL to HDFS / Time for action – exporting data from MySQL to HDFS, What just happened?
- importing, into Hive / Importing data into Hive using Sqoop
- exporting, from MySQL into Hive / Time for action – exporting data from MySQL into Hive, What just happened?
- importing, from raw query / Time for action – importing data from a raw query, What just happened?
- getting, out of Hadoop / Getting data out of Hadoop
- writing, from within reducer / Writing data from within the reducer
- importing, from Hadoop into MySQL / Time for action – importing data from Hadoop into MySQL, What just happened?
- about / Data data everywhere...
- types / Types of data
- copying, from web server into HDFS / Time for action – getting web server data into Hadoop, What just happened?
- hidden issues / Hidden issues, A common framework approach
- lifecycle / Data lifecycle
- staging / Staging data
- scheduling / Scheduling
- data, types
- network traffic / Types of data
- file data / Types of data
- database
- accessing, from mapper / Accessing the database from the mapper
- data import
- improving, type mapping used / Time for action – using a type mapping, What just happened?
- data input/output formats
- about / Input/output
- files / Files, splits, and records
- splits / Files, splits, and records
- records / Files, splits, and records
- InputFormat / InputFormat and RecordReader
- RecordReaders / InputFormat and RecordReader
- Hadoop-provided input formats / Hadoop-provided InputFormat
- Hadoop-provided record readers / Hadoop-provided RecordReader
- OutputFormats / OutputFormat and RecordWriter
- RecordWriters / OutputFormat and RecordWriter
- Hadoop-provided OutputFormats / Hadoop-provided OutputFormat
- Sequence files / Don't forget Sequence files
- DataJoinMapperBase class
- about / DataJoinMapper and TaggedMapperOutput
- data lifecycle management
- about / The bigger picture
- DataNode
- about / Location of the master nodes
- data paths
- about / Common data paths
- dataset analysis
- UFO sighting dataset / Getting the UFO sighting dataset
- Java shape and location analysis / Java shape and location analysis
- datatype issues
- about / Datatype issues
- datatypes, HiveQL
- Boolean types / What just happened?
- Integer types / What just happened?
- Floating point types / What just happened?
- Textual types / What just happened?
- datum
- about / What just happened?
- default properties
- about / Default values
- browsing / Time for action – browsing default properties
- default security, Hadoop security model
- demonstrating / Time for action – demonstrating the default security
- default storage location, Hadoop configuration properties
- about / Default storage location
- depth-first search (DFS) / Graphs and MapReduce – a match made somewhere
- DESCRIBE TABLE command / What just happened?
- description property element
- about / Additional property elements
- dfs.data.dir property / Where to write data
- dfs.default.name variable
- about / What just happened?
- dfs.name.dir property / Where to write data
- dfs.replication variable
- about / What just happened?
- different approach, big data processing
- about / A different approach
- dirty data, Hive tables
- handling / Handling dirty data in Hive
- query output, exporting / Time for action – exporting query output, What just happened?
- Distributed Cache
- used, for improving Java location data output / Time for action – using the Distributed Cache to improve location output, What just happened?
- driver class, 0.20 MapReduce Java API
- about / The Driver class
- dual approach
- about / A dual approach
- DynamoDB
- about / Integration with other AWS products, DynamoDB
- URL / Integration with other AWS products, DynamoDB
E
- EC2 / Considering RDS
- edges
- about / Graph 101
- Elastic Compute Cloud (EC2)
- about / Elastic Compute Cloud (EC2), Signing up for the necessary services
- URL / Elastic Compute Cloud (EC2)
- Elastic MapReduce
- about / Using Elastic MapReduce
- using / Using Elastic MapReduce
- Elastic MapReduce (EMR)
- URL / Elastic MapReduce (EMR)
- about / Elastic MapReduce (EMR), Signing up for the necessary services
- employee database
- setting up / Time for action – setting up the employee database, What just happened?
- employee table
- exporting, into HDFS / Have a go hero – exporting the employee table into HDFS
- EMR
- about / A note on EMR
- benefits / A note on EMR
- as, prototyping platform / EMR as a prototyping platform
- EMR command-line tools
- about / The EMR command-line tools
- EMR Hadoop
- versus, local Hadoop / Comparison of local versus EMR Hadoop
- EMR job flow
- capacity, adding / Adding capacity to an EMR job flow
- expanding / Expanding a running job flow
- Enterprise Application Integration (EAI)
- about / A common framework approach
- ETL tools
- about / Oozie
- Pentaho Kettle / Oozie
- Spring Batch / Oozie
- evaluate methods / What just happened?
- events
- about / It's all about events
- exec
- about / Sources
- export command / What just happened?
F
- failover sink processor
- about / Handling sink failure
- failure types, Hadoop
- about / Types of failure
- Hadoop node failures / Hadoop node failure
- cluster masters, killing / Killing the cluster masters
- Fair Scheduler
- about / Fair Scheduler
- fairScheduler directory / Enabling alternative schedulers
- features, Sqoop
- incremental merge / Incremental merge
- partial exports, avoiding / Avoiding partial exports
- code generator / Sqoop as a code generator
- file channel
- about / Channels
- file data
- about / Types of data
- FileInputFormat
- about / Hadoop-provided InputFormat
- FileOutputFormat
- about / Hadoop-provided OutputFormat
- files
- getting, into Hadoop / Getting files into Hadoop
- versus logs / Logs versus files
- file_roll sink
- about / What just happened?
- final property element
- about / Additional property elements
- First In, First Out (FIFO) queue
- about / Job priorities and scheduling
- flat file
- command output, capturing to / Time for action – capturing the output of a command to a flat file, What just happened?
- Flume
- about / A common framework approach, Introducing Apache Flume, To Sqoop or to Flume..., Cloudera Distribution for Hadoop
- URL / Introducing Apache Flume
- versioning / A note on versioning
- configuring / Time for action – installing and configuring Flume, What just happened?
- installing / Time for action – installing and configuring Flume, What just happened?
- used, for capturing network data / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
- logging, into console / Time for action – logging to the console, What just happened?
- network data, writing to log files / Writing network data to log files, What just happened?
- source / Sources
- sinks / Sinks
- channels / Channels
- configuration files / Understanding the Flume configuration files
- timestamps, adding / Time for action – adding timestamps, What just happened?
- sink failure, handling / Handling sink failure
- features / Next, the world
- flume.root.logger variable
- about / What just happened?
- Flume NG
- about / A note on versioning
- Flume OG
- about / A note on versioning
- FLUSH PRIVILEGES command / What just happened?
- fsimage class / Configuring multiple locations for the fsimage class
- fsimage location
- adding, to NameNode / Time for action – adding an additional fsimage location
- fully distributed mode
- about / Three modes
G
- GenericRecord class
- about / What just happened?
- Google File System (GFS)
- URL / Thanks, Google
- GRANT statement / What just happened?
- granular access control, Hadoop security model
- about / More granular access control
- graph algorithms
- about / Graph algorithms
- Graph 101 / Graph 101
- Graphs and MapReduce / Graphs and MapReduce – a match made somewhere
- nodes / Graphs and MapReduce – a match made somewhere
- graph, representing / Representing a graph, What just happened?
- pointer-based representations / Representing a graph
- adjacency matrix representations / Representing a graph
- adjacency list representations / Representing a graph
- common coloring technique / Representing a graph
- white nodes / Representing a graph
- gray nodes / Representing a graph
- black nodes / Representing a graph
- overview / Overview of the algorithm
- states, for node / Overview of the algorithm
- mapper / The mapper
- reducer / The reducer
- iterative application / Iterative application
- source code, creating / Time for action – creating the source code
- first run / Time for action – the first run, What just happened?
- second run / Time for action – the second run, What just happened?
- third run / Time for action – the third run, What just happened?
- fourth run / Time for action – the fourth and last run, What just happened?
- multiple jobs, running / Running multiple jobs
- final thoughts / Final thoughts on graphs
- graphs, Avro
- about / Have a go hero – graphs in Avro
H
- Hadoop
- about / Hadoop
- components / Parts of Hadoop
- common building blocks / Common building blocks
- architectural principles / Common building blocks
- HDFS / HDFS
- MapReduce / MapReduce
- HDFS and MapReduce / Better together
- common architecture / Common architecture
- on local Ubuntu host / Hadoop on a local Ubuntu host
- on Windows / Other operating systems
- on Mac OS X / Other operating systems
- prerequisites / Time for action – checking the prerequisites
- setting up / Setting up Hadoop
- versions / A note on versions, Sqoop and Hadoop versions
- downloading / Time for action – downloading Hadoop
- SSH, setting up / Time for action – setting up SSH
- configuring / Configuring and running Hadoop
- used, for calculating Pi / Time for action – using Hadoop to calculate Pi
- running / Time for action – using Hadoop to calculate Pi
- modes / Three modes
- base folder, configuring / Configuring the base directory and formatting the filesystem
- filesystem, formatting / Configuring the base directory and formatting the filesystem
- base HDFS directory, changing / Time for action – changing the base HDFS directory
- NameNode, formatting / Time for action – formatting the NameNode
- starting / Time for action – starting Hadoop
- HDFS, using / Time for action – using HDFS
- WordCount, running / Time for action – WordCount, the Hello World of MapReduce
- WordCount, executing on larger body of text / Have a go hero – WordCount on a larger body of text
- monitoring / Monitoring Hadoop from the browser
- HDFS web UI / The HDFS web UI
- MapReduce web UI / The MapReduce web UI
- failure / Failure
- embrace failure / Embrace failure
- failure, types / Types of failure
- scaling / Scaling
- data paths / Common data paths
- as archive store / Hadoop as an archive store
- as preprocessing step / Hadoop as a preprocessing step
- as data input tool / Hadoop as a data input tool
- data, getting into / Getting data into Hadoop
- network traffic, getting into / Getting network traffic into Hadoop, What just happened?
- web server data, getting into / Time for action – getting web server data into Hadoop, What just happened?
- files, getting into / Getting files into Hadoop
- alternative distributions / Alternative distributions
- programming abstractions / Other programming abstractions
- Hadoop, into MySQL
- data, importing from / Time for action – importing data from Hadoop into MySQL, What just happened?
- Hadoop-provided input formats
- about / Hadoop-provided InputFormat
- FileInputFormat / Hadoop-provided InputFormat
- SequenceFileInputFormat / Hadoop-provided InputFormat
- TextInputFormat / Hadoop-provided InputFormat
- Hadoop-provided OutputFormats
- about / Hadoop-provided OutputFormat
- FileOutputFormat / Hadoop-provided OutputFormat
- NullOutputFormat / Hadoop-provided OutputFormat
- SequenceFileOutputFormat / Hadoop-provided OutputFormat
- TextOutputFormat / Hadoop-provided OutputFormat
- Hadoop-provided record readers
- about / Hadoop-provided RecordReader
- LineRecordReader / Hadoop-provided RecordReader
- SequenceFileRecordReader / Hadoop-provided RecordReader
- Hadoop-specific data types
- about / Hadoop-specific data types
- Writable interface / The Writable and WritableComparable interfaces
- wrapper classes / Introducing the wrapper classes
- hadoop/lib directory / Enabling alternative schedulers
- Hadoop changes
- about / Upcoming Hadoop changes
- MapReduce 2.0 or MRV2 / Upcoming Hadoop changes
- YARN (Yet Another Resource Negotiator) / Upcoming Hadoop changes
- Hadoop cluster
- setting up / Setting up a cluster
- hosts / How many hosts?
- usable space on node, calculating / Calculating usable space on a node
- master nodes, location / Location of the master nodes
- hardware, sizing / Sizing hardware
- processor / memory / storage ratio / Processor / memory / storage ratio
- EMR, as prototyping platform / EMR as a prototyping platform
- special node requirements / Special node requirements
- storage types / Storage types
- networking configuration / Hadoop networking configuration
- commodity hardware / What is commodity hardware anyway?
- node and running balancer, adding / Have a go hero – adding a node and running balancer
- Hadoop community
- about / Sources of information
- source code / Source code
- mailing lists and forums / Mailing lists and forums
- LinkedIn groups / LinkedIn groups
- HUGs / HUGs
- conferences / Conferences
- Hadoop configuration properties
- about / Hadoop configuration properties
- default properties / Default values
- property elements / Additional property elements
- default storage location / Default storage location
- setting / Where to set properties
- Hadoop dependencies / Hadoop dependencies
- Hadoop failure
- hardware failures / Hardware failure
- host failures / Host failure
- host corruption / Host corruption
- correlated failures / The risk of correlated failures
- Hadoop FAQ
- URL / Other operating systems
- hadoop fs command / What just happened?
- Hadoop Java API, for MapReduce
- about / The Hadoop Java API for MapReduce
- 0.20 MapReduce Java API / The 0.20 MapReduce Java API
- hadoop job -history command / What just happened?
- hadoop job -kill command / What just happened?
- hadoop job -list all command / What just happened?
- hadoop job -set-priority command / Job priorities and scheduling, What just happened?
- hadoop job -status command / What just happened?
- Hadoop networking configuration
- about / Hadoop networking configuration
- blocks, placing / How blocks are placed
- rack-awareness script / Rack awareness
- default rack configuration, examining / Time for action – examining the default rack configuration
- rack awareness script, adding / Time for action – adding a rack awareness script, What just happened?
- Hadoop node failures
- dfsadmin command / The dfsadmin command
- cluster setup / Cluster setup, test files, and block sizes
- test files / Cluster setup, test files, and block sizes
- block sizes / Cluster setup, test files, and block sizes
- fault tolerance / Fault tolerance and Elastic MapReduce
- Elastic MapReduce / Fault tolerance and Elastic MapReduce
- DataNode process, killing / Time for action – killing a DataNode process, What just happened?
- NameNode and DataNode communication / NameNode and DataNode communication
- NameNode log delving / Have a go hero – NameNode log delving
- replication factor / Time for action – the replication factor in action, What just happened?
- missing blocks, causing intentionally / Time for action – intentionally causing missing blocks, What just happened?
- data loss / When data may be lost
- block corruption / Block corruption
- TaskTracker process, killing / Time for action – killing a TaskTracker process, What just happened?
- DataNode and TaskTracker failures, comparing / Comparing the DataNode and TaskTracker failures
- permanent failure / Permanent failure
- Hadoop Pipes
- about / Using languages other than Java with Hadoop
- Hadoop security model
- about / The Hadoop security model
- default security, demonstrating / Time for action – demonstrating the default security
- user identity / User identity
- granular access control / More granular access control
- working around, via physical access control / Working around the security model via physical access control
- Hadoop Streaming
- about / Using languages other than Java with Hadoop
- working / How Hadoop Streaming works
- advantages / Why to use Hadoop Streaming, Differences in jobs when using Streaming
- using, in WordCount / Time for action – implementing WordCount using Streaming, What just happened?
- Hadoop Summit
- about / Conferences
- Hadoop versioning
- about / A note on versions
- hardware failure
- about / Hardware failure
- HBase
- about / What it is and isn't good for, Sinks, HBase
- URL / HBase
- HBase on EMR
- about / HBase on EMR
- HDFS
- about / Parts of Hadoop, HDFS
- features / HDFS
- using / Time for action – using HDFS, What just happened?
- managing / Managing HDFS
- data, writing / Where to write data
- balancer, using / Using balancer
- rebalancing / When to rebalance
- employee table, exporting into / Have a go hero – exporting the employee table into HDFS
- and Sqoop / Sqoop and HDFS
- network traffic, writing onto / Time for action – writing network traffic onto HDFS, What just happened?
- HDFS web UI
- about / The HDFS web UI
- hidden issues, data
- about / Hidden issues
- network data, keeping on network / Keeping network data on the network
- Hadoop dependencies / Hadoop dependencies
- reliability / Reliability
- common framework approach / A common framework approach
- historical trends, big data processing
- about / Historically for the few and not the many
- classic data processing systems / Classic data processing systems
- limiting factors / Limiting factors
- Hive
- overview / Overview of Hive
- benefits / Why use Hive?
- setting up / Setting up Hive
- prerequisites / Prerequisites
- downloading / Getting Hive
- installing / Time for action – installing Hive
- using / Using Hive
- table for UFO data, creating / Time for action – creating a table for the UFO data, What just happened?
- UFO data, adding to table / Time for action – inserting the UFO data
- data, validating / Validating the data
- table, validating / Time for action – validating the table, What just happened?
- bucketing / Bucketing, clustering, and sorting... oh my!
- clustering / Bucketing, clustering, and sorting... oh my!
- sorting / Bucketing, clustering, and sorting... oh my!
- user-defined functions / User-Defined Function
- versus, Pig / Hive versus Pig
- features / What we didn't cover
- data, importing into / Importing data into Hive using Sqoop
- Hive, on AWS
- UFO analysis, running on EMR / Time for action – running UFO analysis on EMR, What just happened?
- interactive job flows, using for development / Using interactive job flows for development
- interactive EMR cluster, using / Have a go hero – using an interactive EMR cluster
- Hive and SQL views
- about / Hive and SQL views
- using / Time for action – using views, What just happened?
- Hive data
- importing, into MySQL / Time for action – importing Hive data into MySQL, What just happened?
- Hive exports
- and Sqoop / Sqoop and Hive exports
- Hive partitions
- about / Sqoop and Hive partitions
- and Sqoop / Sqoop and Hive partitions
- HiveQL
- about / What just happened?
- datatypes / What just happened?
- HiveQL command
- about / Hive versus Pig
- HiveQL query planner
- about / Hive versus Pig
- Hive tables
- about / Hive tables – real or not?
- creating, from existing file / Time for action – creating a table from an existing file, What just happened?
- join, performing / Time for action – performing a join, What just happened?
- join, improving / Have a go hero – improve the join to use regular expressions
- dirty data, handling / Handling dirty data in Hive
- partitioning / Partitioning the table
- partitioned UFO sighting table, creating / Time for action – making a partitioned UFO sighting table, What just happened?
- Hive transforms / User-Defined Function
- Hortonworks
- about / Hortonworks Data Platform
- Hortonworks Data Platform
- about / Hortonworks Data Platform
- URL / Hortonworks Data Platform
- host failure
- about / Host failure
- HTTPClient
- about / Have a go hero
- HTTP Components / Have a go hero
- HTTP protocol
- about / What just happened?
- HUGs
- about / HUGs
I
- IBM InfoSphere Big Insights
- about / IBM InfoSphere Big Insights
- URL / IBM InfoSphere Big Insights
- InputFormat class
- about / InputFormat and RecordReader, Using Avro within MapReduce
- INSERT command / What just happened?
- insert statement
- versus update statement / Inserts versus updates
- installation, Flume / Time for action – installing and configuring Flume, What just happened?
- installation, MySQL / Time for action – installing and setting up MySQL, What just happened?
- installation, Sqoop / Time for action – downloading and configuring Sqoop, What just happened?
- interactive EMR cluster
- using / Have a go hero – using an interactive EMR cluster
- interactive job flows
- using, for development / Using interactive job flows for development
- Iterator object / What just happened?
J
- java.sql.Date / What just happened?
- Java Development Kit (JDK) / Time for action – checking the prerequisites
- Java HDFS interface / Have a go hero
- Java IllegalArgumentExceptions / What just happened?
- Java shape and location analysis
- about / Java shape and location analysis
- ChainMapper, using for record validation / Time for action – using ChainMapper for field validation/analysis, What just happened?
- issues, with output data / Too many abbreviations
- Distributed Cache, using / Using the Distributed Cache, Time for action – using the Distributed Cache to improve location output
- JDBC / Writing data from within the reducer
- JDBC channel
- about / Channels
- JobConf class / Where to set properties
- job priorities, MapReduce management
- changing / Job priorities and scheduling, Time for action – changing job priorities and killing a job
- scheduling / Time for action – changing job priorities and killing a job
- JobTracker
- about / Location of the master nodes
- JobTracker UI
- about / The MapReduce web UI
- joins
- about / Joins
- disadvantages / When this is a bad idea
- map-side, versus reduce-side joins / Map-side versus reduce-side joins
- account and sales information, matching / Matching account and sales information
- reduce-side join, implementing / Matching account and sales information
- map-side joins, implementing / Implementing map-side joins
- limitations / To join or not to join...
K
- key/value data
- about / Why key/value data?
- real-world examples / Some real-world examples
- MapReduce, using / MapReduce as a series of key/value transformations
- key/value pairs
- about / Key/value pairs, What it mean
- key/value data / Why key/value data?
L
- language-independent data structures
- about / Using language-independent data structures
- candidate technologies / Candidate technologies
- Avro / Introducing Avro
- LineCounters / What just happened?
- LineRecordReader
- about / Hadoop-provided RecordReader
- LinkedIn groups
- about / LinkedIn groups
- URL / LinkedIn groups
- list jars command / What just happened?
- load balancing sink processor
- about / Handling sink failure
- LOAD DATA statement / Be careful with data file access rights
- local flat file
- remote file, capturing to / Time for action – capturing a remote file in a local flat file, What just happened?
- local Hadoop
- versus EMR Hadoop / Comparison of local versus EMR Hadoop
- local standalone mode
- about / Three modes
- log file
- network traffic, capturing to / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
- logrotate
- about / Scheduling
- logs
- versus files / Logs versus files
M
- 0.20 MapReduce Java API
- about / The 0.20 MapReduce Java API
- Mapper class / The Mapper class
- Reducer class / The Reducer class
- driver class / The Driver class
- Mahout
- about / Mahout
- URL / Mahout
- map-side joins
- about / Map-side versus reduce-side joins
- implementing, Distributed Cache used / Using the Distributed Cache
- data pruning, for fitting cache / Pruning data to fit in the cache
- data representation, using / Using a data representation instead of raw data
- multiple mappers, using / Using multiple mappers
- mapper
- database, accessing from / Accessing the database from the mapper
- mapper and reducer implementations
- about / Hadoop-provided mapper and reducer implementations
- Mapper class, 0.20 MapReduce Java API
- about / The Mapper class
- setup method / The Mapper class
- map method / The Mapper class
- cleanup method / The Mapper class
- mappers / MapReduce
- about / Mappers and primary key columns
- MapR
- about / MapR
- URL / MapR
- mapred.job.tracker property / What about MapReduce?
- mapred.job.tracker variable
- about / What just happened?
- mapred.map.max.attempts
- about / Hadoop's handling of failing tasks
- mapred.max.tracker.failures
- about / Hadoop's handling of failing tasks
- mapred.reduce.max.attempts
- about / Hadoop's handling of failing tasks
- MapReduce
- about / Parts of Hadoop, MapReduce
- features / MapReduce
- used, as key/value transformations / MapReduce as a series of key/value transformations
- Hadoop Java API / The Hadoop Java API for MapReduce
- advanced techniques / Simple, advanced, and in-between
- MapReduce 2.0 or MRv2
- about / Upcoming Hadoop changes
- MapReduce job analysis
- developing / Counters, status, and other output, Time for action – creating counters, task states, and writing log output
- MapReduce management
- about / MapReduce management
- command line job management / Command line job management
- job priorities / Job priorities and scheduling
- scheduling / Job priorities and scheduling
- alternative schedulers / Alternative schedulers
- alternative schedulers, enabling / Enabling alternative schedulers
- alternative schedulers, using / When to use alternative schedulers
- MapReduce programs
- writing / Writing MapReduce programs
- classpath, setting up / Time for action – setting up the classpath, What just happened?
- WordCount, implementing / Time for action – implementing WordCount, What just happened?
- JAR file, building / Time for action – building a JAR file, What just happened?
- WordCount, on local Hadoop cluster / Time for action – running WordCount on a local Hadoop cluster
- WordCount, running on EMR / Time for action – running WordCount on EMR
- pre-0.20 Java MapReduce API / The pre-0.20 Java MapReduce API
- Hadoop-provided mapper and reducer implementations / Hadoop-provided mapper and reducer implementations
- MapReduce programs development
- languages, using / Using languages other than Java with Hadoop
- large dataset, analyzing / Analyzing a large dataset
- counters / Counters, status, and other output
- status / Counters, status, and other output
- job analysis workflow, developing / Counters, status, and other output
- counters, creating / Time for action – creating counters, task states, and writing log output
- task states / Time for action – creating counters, task states, and writing log output
- MapReduce web UI
- about / The MapReduce web UI
- map wrapper classes
- AbstractMapWritable / Map wrapper classes
- MapWritable / Map wrapper classes
- SortedMapWritable / Map wrapper classes
- master nodes
- location / Location of the master nodes
- mean time between failures (MTBF) / Commodity versus enterprise class storage
- memory channel
- about / Channels
- Message Passing Interface (MPI)
- about / Upcoming Hadoop changes
- MetaStore
- about / What we didn't cover
- modes
- local standalone mode / Three modes
- pseudo-distributed mode / Three modes
- fully distributed mode / Three modes
- MRUnit
- URL / MRUnit
- about / MRUnit
- multi-level Flume networks
- about / Time for action – multi level Flume networks, What just happened?
- MultipleInputs class / What just happened?
- multiple sinks
- agent, writing to / Time for action – writing to multiple sinks, What just happened?
- multiplexing
- about / Selectors replicating and multiplexing
- multiplexing source selector
- about / Selectors replicating and multiplexing
- MySQL
- setting up / Setting up MySQL, Time for action – installing and setting up MySQL, What just happened?
- installing / Time for action – installing and setting up MySQL, What just happened?
- configuring, for remote connections / Time for action – configuring MySQL to allow remote connections, What just happened?
- Hive data, importing into / Time for action – importing Hive data into MySQL, What just happened?
- MySQL, into Hive
- data, exporting from / Time for action – exporting data from MySQL into Hive, What just happened?
- MySQL, to HDFS
- data, exporting from / Time for action – exporting data from MySQL to HDFS, What just happened?
- mysql command / To Sqoop or to Flume...
- mysql command-line utility
- about / What just happened?
- options / What just happened?
- mysqldump utility
- about / Using MySQL tools and manual import
- MySQL tools
- used, for exporting data into Hadoop / Using MySQL tools and manual import
N
- NameNode
- formatting / Time for action – formatting the NameNode
- about / Location of the master nodes
- managing / Managing the NameNode
- multiple locations, configuring / Configuring multiple locations for the fsimage class
- fsimage location, adding / Time for action – adding an additional fsimage location
- fsimage copies, writing / Where to write the fsimage copies
- host, swapping / Swapping to another NameNode host
- NameNode host, swapping
- disaster recovery / Having things ready before disaster strikes
- swapping, to new NameNode host / Time for action – swapping to a new NameNode host, What just happened?
- Netcat
- about / What just happened?, Sources
- network
- data, keeping on / Keeping network data on the network
- network data
- keeping, on network / Keeping network data on the network
- capturing, Flume used / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
- writing, to log files / Writing network data to log files, What just happened?
- Network File System (NFS) / Network storage
- network storage
- about / Network storage
- network traffic
- about / Types of data
- getting, into Hadoop / Getting network traffic into Hadoop, What just happened?
- capturing, to log file / Using Flume to capture network data, Time for action – capturing network traffic in a log file, What just happened?
- writing, onto HDFS / Time for action – writing network traffic onto HDFS, What just happened?
- Node inner class / What just happened?
- NullOutputFormat
- about / Hadoop-provided OutputFormat
- NullWritable wrapper class
- about / Other wrapper classes
O
- ObjectWritable wrapper class
- about / Other wrapper classes
- Oozie
- about / Oozie
- URL / Oozie
- OpenJDK / Time for action – checking the prerequisites
- OutputFormat class
- about / OutputFormat and RecordWriter
P
- partitioned UFO sighting table
- creating / Time for action – making a partitioned UFO sighting table, What just happened?
- Pentaho Kettle
- URL / Oozie
- Pi
- calculating, Hadoop used / Time for action – using Hadoop to calculate Pi
- Pig
- about / Hive versus Pig, Pig
- URL / Pig
- Pig Latin
- about / Hive versus Pig
- pre-0.20 Java MapReduce API
- about / The pre-0.20 Java MapReduce API
- primary key column
- about / Mappers and primary key columns
- primitive wrapper classes
- about / Primitive wrapper classes
- BooleanWritable / Primitive wrapper classes
- ByteWritable / Primitive wrapper classes
- DoubleWritable / Primitive wrapper classes
- FloatWritable / Primitive wrapper classes
- IntWritable / Primitive wrapper classes
- LongWritable / Primitive wrapper classes
- VIntWritable / Primitive wrapper classes
- VLongWritable / Primitive wrapper classes
- process ID (PID) / Time for action – killing a DataNode process
- programming abstractions
- about / Other programming abstractions
- Pig / Pig
- Cascading / Cascading
- Project Gutenberg
- URL / Have a go hero – WordCount on a larger body of text
- property elements
- about / Additional property elements
- description / Additional property elements
- final / Additional property elements
- Protocol Buffers
- URL / Candidate technologies
- about / Candidate technologies, Introducing Apache Flume
- pseudo-distributed mode
- about / Three modes
- configuring / Time for action – configuring the pseudo-distributed mode
- configuration variables / What just happened?
Q
- query output, Hive
- exporting / Time for action – exporting query output, What just happened?
R
- raw query
- data, importing from / Time for action – importing data from a raw query, What just happened?
- RDBMS
- about / Hadoop as an archive store
- RDS
- considering / Considering RDS
- real-world examples, key/value data
- about / Some real-world examples
- RecordReader class
- about / InputFormat and RecordReader
- RecordWriter class
- about / OutputFormat and RecordWriter
- reduce-side join
- about / Map-side versus reduce-side joins
- implementing / Matching account and sales information
- implementing, MultipleInputs used / Time for action – reduce-side join using MultipleInputs
- DataJoinMapper class / DataJoinMapper and TaggedMapperOutput
- TaggedMapperOutput class / DataJoinMapper and TaggedMapperOutput
- ReduceJoinReducer class / What just happened?
- reducer
- data, writing from / Writing data from within the reducer
- SQL import files, writing from / Writing SQL import files from the reducer
- Reducer class, 0.20 MapReduce Java API
- about / The Reducer class
- reduce method / The Reducer class
- setup method / The Reducer class
- run method / The Reducer class
- cleanup method / The Reducer class
- reducers / MapReduce
- Redundant Arrays of Inexpensive Disks (RAID) / Single disk versus RAID
- remote connections
- MySQL, configuring for / Time for action – configuring MySQL to allow remote connections, What just happened?
- remote file
- capturing, to local flat file / Time for action – capturing a remote file in a local flat file, What just happened?
- remote procedure call (RPC) framework
- about / Going forward with Avro
- replicating
- about / Selectors replicating and multiplexing
- ResourceManager
- about / Upcoming Hadoop changes
- Ruby API
- URL / What just happened?
S
- SalesRecordMapper class / What just happened?
- scale-out approach
- about / Early approaches to scale-out
- benefits / Early approaches to scale-out
- scale-up approach
- about / Scale-up
- advantages / Scale-up
- scaling
- capacity, adding to local Hadoop cluster / Adding capacity to a local Hadoop cluster
- capacity, adding to EMR job flow / Adding capacity to an EMR job flow
- schemas, Avro
- defining / Time for action – defining the schema
- Sighting_date field / What just happened?
- City field / What just happened?
- Shape field / What just happened?
- Duration field / What just happened?
- SecondaryNameNode
- about / Location of the master nodes
- selective import
- performing / Time for action – a more selective import, What just happened?
- SELECT statement
- about / Using MySQL tools and manual import
- SequenceFile class
- about / Don't forget Sequence files
- SequenceFileInputFormat
- about / Hadoop-provided InputFormat
- SequenceFileOutputFormat
- about / Hadoop-provided OutputFormat
- SequenceFileRecordReader
- about / Hadoop-provided RecordReader
- SerDe
- about / What we didn't cover
- SimpleDB / What just happened?
- about / SimpleDB
- URL / SimpleDB
- Simple Storage Service (S3)
- about / Simple Storage Service (S3), Signing up for the necessary services
- URL / Simple Storage Service (S3)
- single disk versus RAID
- about / Single disk versus RAID
- sink
- about / What just happened?, Sinks
- sink failure
- handling / Handling sink failure
- skip mode
- about / Using Hadoop's skip mode
- source
- about / What just happened?, Sources
- source code
- about / Source code
- special node requirements, Hadoop cluster
- about / Special node requirements
- Spring Batch
- URL / Oozie
- SQL import files
- writing, from reducer / Writing SQL import files from the reducer
- Sqoop
- URL, for homepage / A better way – introducing Sqoop
- installing / Time for action – downloading and configuring Sqoop, What just happened?
- configuring / Time for action – downloading and configuring Sqoop, What just happened?
- downloading / Time for action – downloading and configuring Sqoop, What just happened?
- versions / Sqoop and Hadoop versions
- and HDFS / Sqoop and HDFS
- primary key columns / Mappers and primary key columns
- mappers / Mappers and primary key columns
- architecture / Sqoop's architecture
- used, for importing data into Hive / Importing data into Hive using Sqoop
- and Hive partitions / Sqoop and Hive partitions
- field and line terminators / Field and line terminators
- and Hive exports / Sqoop and Hive exports
- export, re-running / Time for action – fixing the mapping and re-running the export, What just happened?
- mapping, fixing / Time for action – fixing the mapping and re-running the export, What just happened?
- features / Incremental merge, Sqoop as a code generator
- as code generator / Sqoop as a code generator
- about / To Sqoop or to Flume..., Cloudera Distribution for Hadoop
- sqoop command-line utility / What just happened?
- Sqoop exports
- versus Sqoop imports / Differences between Sqoop imports and exports
- Sqoop imports
- versus Sqoop exports / Differences between Sqoop imports and exports
- start-balancer.sh script / Using balancer
- stop-balancer.sh script / Using balancer
- Storage Area Network (SAN) / Network storage
- storage types, Hadoop cluster
- about / Storage types
- commodity, versus enterprise class storage / Commodity versus enterprise class storage
- single disk, versus RAID / Single disk versus RAID
- balancing / Finding the balance
- network storage / Network storage
- Streaming WordCount mapper
- about / Differences in jobs when using Streaming
- syslogd
- about / Sources
T
- TaggedMapperOutput class
- about / DataJoinMapper and TaggedMapperOutput
- task failures, due to data
- about / Task failure due to data
- dirty data, handling through code / Handling dirty data through code
- skip mode, using / Using Hadoop's skip mode
- dirty data, handling by skip mode / Time for action – handling dirty data by using skip mode, What just happened?
- task failures, due to software
- about / Task failure due to software
- slow-running tasks / Failure of slow running tasks, Time for action – causing task failure
- HDFS programmatic access / Have a go hero – HDFS programmatic access
- slow-running tasks, handling / Hadoop's handling of slow-running tasks
- speculative execution / Speculative execution
- failing tasks, handling / Hadoop's handling of failing tasks
- TextInputFormat
- about / Hadoop-provided InputFormat
- TextOutputFormat
- about / Hadoop-provided OutputFormat
- Thrift
- about / Candidate technologies, Introducing Apache Flume
- URL / Candidate technologies
- timestamp() function / What just happened?
- TimestampInterceptor class / What just happened?
- timestamps
- used, for writing data into directory / Time for action – adding timestamps, What just happened?
- adding / Time for action – adding timestamps, What just happened?
- traditional relational databases
- about / Pruning data to fit in the cache
- type mapping
- used, for improving data import / Time for action – using a type mapping, What just happened?
U
- Ubuntu
- about / What just happened?
- UDFMethodResolver interface / What just happened?
- UDP syslogd source / It's all about events
- UFO analysis
- running, on EMR / Time for action – running UFO analysis on EMR
- ufodata / What just happened?
- UFO dataset
- UFO data, summarizing / Time for action – summarizing the UFO data, What just happened?
- UFO shapes, examining / Examining UFO shapes
- shape data, summarizing / Time for action – summarizing the shape data, What just happened?
- sighting duration, correlating to UFO shape / Time for action – correlating of sighting duration to UFO shape, What just happened?
- Streaming scripts, using outside Hadoop / Using Streaming scripts outside Hadoop
- shape/time analysis, performing from command line / Time for action – performing the shape/time analysis from the command line, What just happened?
- UFO data table, Hive
- creating / Time for action – creating a table for the UFO data, What just happened?
- data, loading / Time for action – inserting the UFO data, What just happened?
- data, validating / Validating the data, What just happened?
- redefining, with correct column separator / Time for action – redefining the table with the correct column separator, What just happened?
- UFO sighting dataset
- getting / Getting the UFO sighting dataset
- UFO sighting records
- sighting date / Getting the UFO sighting dataset
- recorded date / Getting the UFO sighting dataset
- location / Getting the UFO sighting dataset
- shape / Getting the UFO sighting dataset
- duration / Getting the UFO sighting dataset
- description / Getting the UFO sighting dataset
- Unix chmod / What just happened?
- update statement
- versus insert statement / Inserts versus updates
- user-defined functions (UDF)
- about / User-Defined Function
- adding / Time for action – adding a new User Defined Function (UDF), What just happened?
- user identity, Hadoop security model
- about / User identity
- super user / The super user
- USE statement / What just happened?
V
- VersionedWritable wrapper class
- about / Other wrapper classes
- versioning
- about / A note on versioning
W
- web server data
- getting, into Hadoop / Time for action – getting web server data into Hadoop, What just happened?
- WHERE clause / What just happened?
- Whir
- about / Whir
- URL / Whir
- WordCount example
- executing / Time for action – WordCount, the Hello World of MapReduce, What just happened?, Have a go hero – WordCount on a larger body of text
- mapper and reducer implementations, using / Time for action – WordCount the easy way
- start-up / Startup
- input, splitting / Splitting the input
- task assignment / Task assignment
- task start-up / Task startup
- JobTracker monitoring / Ongoing JobTracker monitoring
- mapper input / Mapper input
- mapper execution / Mapper execution
- mapper output / Mapper output and reduce input
- reduce input / Mapper output and reduce input
- partitioning / Partitioning
- optional partition function / The optional partition function
- reducer input / Reducer input
- reducer execution / Reducer execution
- reducer output / Reducer output
- shutdown / Shutdown
- combiner class, using / Apart from the combiner…maybe, Time for action – WordCount with a combiner
- reducer, using as combiner / When you can use the reducer as the combiner
- fixing, to work with combiner / Time for action – fixing WordCount to work with a combiner
- implementing, Streaming used / Time for action – implementing WordCount using Streaming, What just happened?
- WordCount example, on EMR
- AWS management console used / Time for action – WordCount on EMR using the management console
- wrapper classes
- about / Introducing the wrapper classes
- primitive wrapper classes / Primitive wrapper classes
- array wrapper classes / Array wrapper classes
- map wrapper classes / Map wrapper classes
- writable wrapper classes / Time for action – using the Writable wrapper classes
- CompressedWritable / Other wrapper classes
- ObjectWritable / Other wrapper classes
- NullWritable / Other wrapper classes
- VersionedWritable / Other wrapper classes
- writable wrapper classes
- about / Time for action – using the Writable wrapper classes
- exercises / Have a go hero – playing with Writables
Y
- YARN
- about / Upcoming Hadoop changes