Extending Pig (UDFs)
Functions can be used in almost every Pig operator. User-defined functions (UDFs) differ from built-in functions in two main ways. First, UDFs must be registered with the REGISTER keyword to make them available to Pig. Second, they must be qualified when used, for example with a package name or a namespace. Pig UDFs can currently be implemented in Java, Python, Ruby, JavaScript, and Groovy. The most extensive support is provided for Java functions, which let you customize all parts of the process, including data load/store, transformation, and aggregation. Java functions are also more efficient, both because they are implemented in the same language as Pig and because additional interfaces are supported, such as the Algebraic and Accumulator interfaces. On the other hand, the Ruby and Python APIs allow more rapid prototyping.
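To make the prototyping point concrete, here is a minimal sketch of a Python UDF. The file name `udfs.py`, the function name `to_upper`, and the namespace `myfuncs` are hypothetical and chosen only for illustration. The sketch assumes the script is run through Pig's `streaming_python` engine, where `outputSchema` is imported from `pig_util`; the fallback decorator is an assumption added here so that the file can also be imported and tested outside Pig.

```python
# udfs.py (hypothetical file name)
try:
    # Provided by Pig when the script is registered with streaming_python.
    from pig_util import outputSchema
except ImportError:
    # Assumed fallback for local testing only: a stand-in decorator that
    # just records the schema string on the function.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("upper:chararray")
def to_upper(text):
    """Return the upper-cased input; null (None) values pass through."""
    if text is None:
        return None
    return text.upper()
```

In a Pig script, such a file could be registered with a statement along the lines of `REGISTER 'udfs.py' USING streaming_python AS myfuncs;` and the function then called in qualified form as `myfuncs.to_upper(name)`, which illustrates both differences from built-in functions noted above.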
The integration of UDFs with the Pig environment is mainly managed by the following two statements, REGISTER and DEFINE:
REGISTER registers a JAR file so that the UDFs in the...