Fundamentals of Apache Pig
The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.
Pig Latin programs are generally organized as follows:
A
LOAD
statement reads data from HDFSA series of statements aggregates and manipulates data
A
STORE
statement writes output to the filesystemAlternatively, a
DUMP
statement displays the output to the terminal
The following example shows a sequence of statements that outputs the top 10 hashtags ordered by the frequency, extracted from the dataset of tweets:
tweets = LOAD 'tweets.json' USING JsonLoader('created_at:chararray, id:long, id_str:chararray, text:chararray'); hashtags = FOREACH tweets { GENERATE FLATTEN( REGEX_EXTRACT( text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', 1) ) as tag; } hashtags_grpd = GROUP hashtags BY tag; hashtags_count = FOREACH hashtags_grpd { GENERATE group, COUNT(hashtags) as occurrencies; } hashtags_count_sorted...