Chapter 6. Data Analysis with Apache Pig
In the previous chapters, we explored a number of APIs for data processing. MapReduce, Spark, Tez and Samza are rather low-level, and writing non-trivial business logic with them often requires significant Java development. Moreover, different users will have different needs. It might be impractical for an analyst to write MapReduce code or build a DAG of inputs and outputs to answer some simple queries. At the same time, a software engineer or a researcher might want to prototype ideas and algorithms using high-level abstractions before jumping into low-level implementation details.
In this chapter and the following one, we will explore tools that process data on HDFS using higher-level abstractions. This chapter focuses on Apache Pig and, in particular, covers the following topics:
- What Apache Pig is and the dataflow model it provides
- Pig Latin's data types and functions
- How Pig can be easily enhanced...