Prerequisites
Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this chapter. We'll create a modified version of an earlier Pig script as the main mechanism for this. The script in this chapter assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/blob/master/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:
-- load JSON data
tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Tweets: project the attributes we care about into a flat tuple
tweets_tsv = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str',
        (chararray)$0#'text' as text,
        (chararray)$0#'in_reply_to',
        (boolean)$0#'retweeted' as is_retweeted,
        (chararray)$0#'user'#'id_str' as user_id,
        (chararray)$0#'place'#'id' as place_id;
}
store tweets_tsv ...
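The script expects the Elephant Bird JARs in the /jar directory on HDFS, as noted earlier. If they are not there yet, a copy along these lines will stage them (the local filenames are illustrative and should match the JARs you used previously):

$ hdfs dfs -mkdir -p /jar
$ hdfs dfs -put elephant-bird-*.jar /jar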
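With the JARs in place, the script can be run with Pig, supplying values for the $inputDir and $outputDir parameters referenced above. The paths here are placeholders; substitute the HDFS locations of your tweet data and the desired output directory:

$ pig -f extract_for_hive.pig \
    -param inputDir=/data/tweets.json \
    -param outputDir=/data/ch7/tweets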