Implementing HDInsight Hive and Pig activities
Azure HDInsight is a managed Platform as a Service (PaaS) offering that lets you create big data clusters running Apache Hadoop, Spark, and Kafka. Clusters can be scaled up or down on demand.
Apache Hive, built on top of Apache Hadoop, facilitates querying big data on Hadoop clusters using a SQL-like syntax (HiveQL). Using Hive, we can expose files stored in the Hadoop Distributed File System (HDFS) as an external table, apply transformations to that table, and write the results back to HDFS as files.
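As a minimal sketch of that workflow, the following HiveQL script defines an external table over delimited files, aggregates them, and writes the result back out. The table name, columns, and `wasbs://` storage paths are hypothetical placeholders:

```sql
-- Hypothetical example: table, columns, and storage paths are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS trip_data (
    vendor_id    STRING,
    total_amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasbs://data@mystorageaccount.blob.core.windows.net/input/';

-- Aggregate per vendor and write the result back to storage as files
INSERT OVERWRITE DIRECTORY
    'wasbs://data@mystorageaccount.blob.core.windows.net/output/'
SELECT vendor_id, SUM(total_amount)
FROM trip_data
GROUP BY vendor_id;
```

On HDInsight, the `wasbs://` scheme lets Hive treat Azure Blob storage containers as if they were HDFS paths, which is what makes the read-transform-write pattern in this recipe possible.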
Apache Pig, built on top of Apache Hadoop, provides a high-level scripting language, Pig Latin, for performing Extract, Transform, and Load (ETL) operations on big data. Using Pig, we can read, transform, and write data stored in HDFS.
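The same read-aggregate-write pattern can be sketched in Pig Latin. The relation names, schema, and `wasbs://` paths below are hypothetical:

```pig
-- Hypothetical example: relation names, schema, and paths are placeholders.
trips = LOAD 'wasbs://data@mystorageaccount.blob.core.windows.net/input/'
        USING PigStorage(',')
        AS (vendor_id:chararray, total_amount:double);

-- Group by vendor and sum the amounts
by_vendor = GROUP trips BY vendor_id;
totals    = FOREACH by_vendor
            GENERATE group AS vendor_id, SUM(trips.total_amount) AS total;

STORE totals INTO 'wasbs://data@mystorageaccount.blob.core.windows.net/output/'
      USING PigStorage(',');
```

Note that Pig builds a dataflow of named relations (`trips`, `by_vendor`, `totals`) rather than declaring tables, which many find more natural for multi-step ETL than nested SQL.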
In this recipe, we'll use Azure Data Factory, HDInsight Hive, and Pig activities to read data from Azure Blob storage, aggregate the data, and write it back to Azure Blob storage.
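In Azure Data Factory, each script runs as an activity inside a pipeline. The fragment below is a sketch of an HDInsight Hive activity definition; the activity name, linked service names, and script path are hypothetical and would be replaced by the ones you create in the recipe:

```json
{
    "name": "RunHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": {
        "referenceName": "HDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scriptPath": "scripts/aggregate.hql",
        "scriptLinkedService": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        }
    }
}
```

A Pig activity follows the same shape with `"type": "HDInsightPig"`, pointing at a Pig Latin script instead of a Hive script.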