Streaming
Hive can also use Hadoop's streaming feature as an alternative way to transform data. The streaming API opens an I/O pipe to an external process (a script). The process reads rows from standard input and writes its results to standard output. In Hive, we can use TRANSFORM clauses directly in HQL and embed mapper and reducer scripts written as command-line utilities, shell scripts, Java programs, or programs in other languages. Although streaming adds overhead because rows must be serialized and deserialized as they pass between processes, it offers a simpler programming model for developers, especially those who do not work in Java. The syntax of the TRANSFORM clause is as follows:
FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'map_user_script'
    (AS colName (',' colName)*)?
    (outRowFormat)? (outRecordReader)?
    (CLUSTER BY?|DISTRIBUTE BY? SORT BY?) src_alias
)
SELECT TRANSFORM '(' expression (',' expression)* ')'
(inRowFormat)?
USING...
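To make the I/O pipe concrete, here is a minimal sketch of a script that could be named in the USING part of a TRANSFORM clause. It assumes Hive's default streaming row format, in which columns arrive as tab-separated strings, one row per line, with NULL written as \N. The script name upper_name.py, the function transform_line, the employee table, and the two-column layout (id, name) are hypothetical and chosen only for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical streaming script: upper-cases the second column of each row.

A possible invocation from HQL (file path, table, and columns are assumed):

    ADD FILE /tmp/upper_name.py;
    SELECT TRANSFORM (id, name)
    USING 'python3 upper_name.py'
    AS (id, name)
    FROM employee;
"""
import sys


def transform_line(line):
    """Return the row with its second tab-separated field upper-cased."""
    fields = line.rstrip("\n").split("\t")
    # Leave the NULL marker (\N) untouched so it stays NULL on the Hive side.
    if len(fields) > 1 and fields[1] != "\\N":
        fields[1] = fields[1].upper()
    return "\t".join(fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read rows from standard input and write one output row per input row.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Because the script only touches standard input and standard output, it can be checked outside Hive by piping a tab-separated text file through it before it is referenced from a query.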