DEFINE: The DEFINE statement is used to assign an alias to an external executable or a UDF. Use this statement when you want a crisp name for a function that has a lengthy package name.
For a STREAM command, DEFINE plays an important role in transferring the executable to the task nodes of the Hadoop cluster. This is accomplished using the SHIP clause of the DEFINE operator. This is not part of our example and will be illustrated in later chapters.
In our example, we define the aliases ApacheCommonLogLoader, DayMonExtractor, and DayExtractor for the corresponding fully qualified class names.
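A minimal sketch of such DEFINE statements; the jar name and the package com.mycompany.pig are placeholders for the actual fully qualified class names used in the example:

```pig
-- Register the jar that contains the UDFs, then alias the lengthy class names
REGISTER 'myudfs.jar';  -- placeholder jar name
DEFINE ApacheCommonLogLoader com.mycompany.pig.ApacheCommonLogLoader();
DEFINE DayMonExtractor com.mycompany.pig.DayMonExtractor();
DEFINE DayExtractor com.mycompany.pig.DayExtractor();
```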
LOAD: This operator loads data from a file or directory. If a directory name is specified, it loads all the files in the directory into the relation. If Pig is run in local mode, it searches for the directories on the local filesystem; in MapReduce mode, it searches for the files on HDFS. In our example, the usage is as follows:
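A minimal sketch of such a LOAD, assuming the ApacheCommonLogLoader alias defined earlier and a hypothetical path:

```pig
-- Load July's Apache access logs with the aliased custom loader
raw_logs_Jul = LOAD '/logs/Jul' USING ApacheCommonLogLoader();
```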
The content of the relation raw_logs_Jul is as follows:
By using globs (such as *.txt, *.csv, and so on), you can read multiple files (all the files or selective files) in the same directory. In the following example, the files under the folders Jul and Aug will be loaded as a union.
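A sketch of such a glob-based load; the path and file extension are assumptions:

```pig
-- The glob {Jul,Aug} matches both folders, so their files load as one relation
raw_logs = LOAD '/logs/{Jul,Aug}/*.csv' USING ApacheCommonLogLoader();
```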
STORE: The STORE operator serves a dual purpose: it writes the results into the filesystem after the data pipeline processing completes, and it actually commences the execution of the preceding Pig Latin statements. This is an important feature of the language: the logical, physical, and MapReduce plans are created only after the script encounters the STORE operator. In our example, the following code demonstrates its usage:
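A sketch of such STORE statements, assuming the result relations jcountd and acountd and hypothetical output paths:

```pig
-- Writing the results also triggers execution of the whole pipeline
STORE jcountd INTO '/output/jcountd';
STORE acountd INTO '/output/acountd';
```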
DUMP: The DUMP operator is similar to the STORE operator, but it is used specifically to display results on the command prompt rather than storing them in a filesystem. DUMP behaves exactly like STORE in that the Pig Latin statements actually begin execution only after it is encountered. This operator is targeted at the interactive execution of statements and viewing the output in real time.
In our example, the following code demonstrates the usage of the DUMP operator:
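A sketch, assuming the result relations from the example:

```pig
-- Print the per-day distinct-hit counts to the console
DUMP jcountd;
DUMP acountd;
```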
UNION: The UNION operator merges the contents of more than one relation without preserving the order of tuples, as the relations involved are treated as unordered bags. In our example, we use UNION to merge the two relations raw_logs_Jul and raw_logs_Aug into a relation called combined_raw_logs.
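A sketch of the merge described above:

```pig
-- Order of tuples is not preserved; the inputs are unordered bags
combined_raw_logs = UNION raw_logs_Jul, raw_logs_Aug;
```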
The content of the relation combined_raw_logs is as follows:
SAMPLE: The SAMPLE operator is useful when you want to work on a very small subset of data to quickly test whether the data flow processing gives correct results. It returns a random sample drawn from the entire input, with the sample size passed as a parameter. Because the SAMPLE operator internally uses a probability-based algorithm, it is not guaranteed to return the same number of tuples every time it is used. In our example, the SAMPLE operator returns, at most, 1 percent of the data as an illustration.
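A sketch of the 1 percent sample:

```pig
-- 0.01 asks for roughly 1 percent of the tuples; the exact count varies per run
sample_combined_raw_logs = SAMPLE combined_raw_logs 0.01;
```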
The content of the relation sample_combined_raw_logs is as follows:
GROUP: The GROUP operator is used to collect all records with the same key value into a bag. This operator creates a nested structure of output tuples. The following snippet of code from our example illustrates grouping the logs by day of the month.
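A sketch of the grouping, assuming the DayMonExtractor alias defined earlier and date fields named jdt and adt (the field names are assumptions):

```pig
-- Group each month's logs by the day-of-month key extracted from the timestamp
jgrpd = GROUP raw_logs_Jul BY DayMonExtractor(jdt);
agrpd = GROUP raw_logs_Aug BY DayMonExtractor(adt);
```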
Schema content of jgrpd: The following output shows the schema of the relation jgrpd. We can see that it has created a nested structure with two fields: the key and the bag of collected records. The key is named group, and the value takes the name of the alias that was grouped (raw_logs_Jul and raw_logs_Aug, in this case).
FOREACH: The FOREACH operator is also known as a projection. It applies a set of expressions to each record in the bag, similar to applying an expression to every row of a table. The result of this operator is another relation. In our example, FOREACH is used to iterate through each grouped record to get the count of distinct IP addresses.
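A sketch of such a nested FOREACH for the July logs, assuming an IP-address field named jaddr (the field and alias names are assumptions):

```pig
-- For each day, project the IPs, de-duplicate them, and count the result
jcountd = FOREACH jgrpd {
    juserIP = raw_logs_Jul.jaddr;
    juniqIPs = DISTINCT juserIP;
    GENERATE FLATTEN(group) AS jdate, COUNT(juniqIPs) AS jcount;
};
```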
Contents of the tuples: The following output shows the tuples in the relations jcountd and acountd. The first field is the date in the DD-MMM format, and the second field is the count of distinct hits.
DISTINCT: The DISTINCT operator removes duplicate records from a relation. DISTINCT should not be used where you need to preserve the order of the contents. The following example code demonstrates the usage of DISTINCT to remove duplicate IP addresses and of FLATTEN to remove the nesting of jgrpd and agrpd.
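A sketch of the August counterpart, combining DISTINCT inside a nested FOREACH with FLATTEN on the group key; the field names are assumptions:

```pig
-- DISTINCT removes duplicate IPs; FLATTEN(group) un-nests the grouping key
acountd = FOREACH agrpd {
    auserIP = raw_logs_Aug.aaddr;
    auniqIPs = DISTINCT auserIP;
    GENERATE FLATTEN(group) AS adate, COUNT(auniqIPs) AS acount;
};
```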
Content of the tuples: The following output shows the schema of the relations jcountd and acountd. We can see that the nesting created by GROUP is now removed.
JOIN: The JOIN operator joins more than one relation based on shared keys. In our example, we join two relations by day of the month; the join returns all the records where the day of the month matches, and records for which no match is found are dropped.
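A sketch of the join, assuming the DayExtractor alias pulls the day of the month out of the DD-MMM field, and assuming the field names jdate and adate from the earlier projections:

```pig
-- Inner join on day of the month: only days present in both months survive
joind = JOIN jcountd BY DayExtractor(jdate), acountd BY DayExtractor(adate);
```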
Content of tuples: The following output shows the resulting values after the JOIN is performed. For example, as seen in the sample output of FOREACH, jcountd shows 4774 hits on 2-Jul, while acountd has no record for 2-Aug. Hence, after the JOIN, the tuple with the 2-Jul hits is omitted, as no match is found for that day in August.
DESCRIBE: The DESCRIBE operator is a diagnostic operator in Pig, used to view and understand the schema of an alias or a relation. It is a kind of command-line log that lets us understand how the preceding operators in the data pipeline are changing the data. The output of the DESCRIBE operator is a description of the schema. In our example, we use DESCRIBE to understand the schema.
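A sketch, using relations from the example:

```pig
-- Print the schema of an intermediate and a final relation
DESCRIBE jgrpd;
DESCRIBE jcountd;
```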
The output is as follows:
FILTER: The FILTER operator allows you to select or filter out records from a relation based on a condition. This operator works on tuples, or rows, of data. The following example selects the records whose count is greater than 2,600:
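A sketch of the filter, assuming the joined relation and count field names from the earlier steps:

```pig
-- Keep only the days where both monthly counts exceed 2,600
filtered = FILTER joind BY jcount > 2600 AND acount > 2600;
```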
Content of the filtered tuples: All the records with a count of 2,600 or less are filtered out.
ILLUSTRATE: The ILLUSTRATE operator is the debugger's best friend. It is used to understand how data passes through the Pig Latin statements and gets transformed, and it enables us to create good test data: small datasets that are representative samples for exercising the flow of statements. ILLUSTRATE internally uses an algorithm that takes a small sample of the entire input data and propagates it through all the statements in the Pig Latin script. This algorithm intelligently generates sample data when it encounters operators such as FILTER, which can remove rows from the data and leave no data flowing through the remaining Pig statements. In our example, the ILLUSTRATE operator is used as shown in the following code snippet:
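A sketch, assuming the filtered relation from the FILTER step:

```pig
-- Propagate a generated sample through every statement up to filtered
ILLUSTRATE filtered;
```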
The dataset used by us does not have records where the count is less than 2,600, so ILLUSTRATE has manufactured a record with two counts below 2,600 to exercise that case. This record is caught by the FILTER condition and removed; hence, no values are shown in the relation filtered.
The following screenshot shows the output:
ORDER BY: The ORDER BY operator is used to sort a relation on the specified sort key. As of today, Pig supports sorting on fields with simple types, not on complex types or expressions. In the following example, we sort on two fields, the July date and the August date.
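A sketch of the sort; the relation and field names are assumptions:

```pig
-- Sort the joined counts first by the July date, then by the August date
srtd = ORDER joind BY jdate, adate;
```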
PARALLEL: The PARALLEL clause controls reduce-side parallelism by specifying the number of reducers. It defaults to one while running in local mode. This clause can be used with operators that force a reduce phase, such as ORDER, DISTINCT, LIMIT, JOIN, GROUP, COGROUP, and CROSS.
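A sketch of the clause attached to a reduce-phase operator; the reducer count here is illustrative:

```pig
-- Ask for 10 reducers for this GROUP's reduce phase
jgrpd = GROUP raw_logs_Jul BY DayMonExtractor(jdt) PARALLEL 10;
```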
LIMIT: The LIMIT operator sets an upper limit on the number of output records generated. The output is determined randomly, and there is no guarantee that it will be the same if the LIMIT operator is executed repeatedly. To request a particular group of rows, consider using the ORDER operator immediately followed by the LIMIT operator. In our example, this operator returns five records as an illustration:
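A sketch, assuming a sorted relation named srtd:

```pig
-- Return at most five tuples; precede with ORDER for a deterministic set
limitd = LIMIT srtd 5;
```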
The content of the limitd relation is given as follows:
FLATTEN: The FLATTEN operator is used to flatten nested bags and tuples by removing the nesting in them. Please refer to the example code under DISTINCT for the sample output and usage of FLATTEN.