Performing Group By queries in Pig
In this recipe, we will use the Group By operator in a Pig script to group the records of a dataset by a common field.
Getting ready
To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Pig installed on it.
How to do it...
Group By is a very useful operator for data analysis. Pig supports this operator so that we can perform aggregations at the group level. We will use the same employee dataset as in the previous recipe:
1 Tanmay ENGINEERING 5000
2 Sneha PRODUCTION 8000
3 Sakalya ENGINEERING 7000
4 Avinash SALES 6000
5 Manisha SALES 5700
6 Vinit FINANCE 6200
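If the file does not exist yet, the dataset above can be written to a local file as sketched here. This is a minimal sketch that assumes the fields are meant to be tab-separated, because PigStorage (the loader Pig uses when LOAD has no USING clause) splits fields on tabs by default. The file name emps.txt matches the one uploaded to HDFS in the next step; the awk check is only an illustrative sanity test and is not part of the recipe.

```shell
# Write the six employee records to emps.txt, one record per line.
# Assumption: fields are tab-separated so that Pig's default loader can split them.
# printf reuses the format string for each group of four arguments.
printf '%s\t%s\t%s\t%s\n' \
  1 Tanmay ENGINEERING 5000 \
  2 Sneha PRODUCTION 8000 \
  3 Sakalya ENGINEERING 7000 \
  4 Avinash SALES 6000 \
  5 Manisha SALES 5700 \
  6 Vinit FINANCE 6200 > emps.txt

# Sanity check: every line should have exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad + 0 }' emps.txt \
  && echo "emps.txt looks well-formed"
```

If the file uses a different delimiter, such as a comma, the LOAD statement would need a matching USING PigStorage(',') clause instead.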
First of all, load the data into HDFS:
hadoop fs -mkdir /pig/emps_data
hadoop fs -put emps.txt /pig/emps_data
Next, we load the data into a bag called emps, and then perform the Group By operation on this data by department:
emps = LOAD '/pig/emps_data/emps.txt' AS (id, name, dept, salary);
by_dept = GROUP emps BY dept;
DUMP by_dept;
This will start a MapReduce...