Connecting to the HDInsight cluster and getting unstructured data
Big data is now an expected part of BI solutions, so semi-structured and unstructured data that was previously ignored is being brought into them. Typically, the ETL solution uses a platform such as Hadoop to transform semi-structured or unstructured data into a structured format and then loads the transformed data into the data warehouse. However, there are instances where we keep the unstructured data in an environment such as Hadoop and query it directly when it is needed for analysis and reporting, using Hadoop ecosystem projects such as Hive or Pig.
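To make the idea of turning semi-structured data into a structured format concrete, here is a minimal sketch in Python. It is not how Hadoop, Hive, or Pig do the work; it only illustrates the kind of transformation involved. The log format, the pattern, and the function names (`parse_line`, `to_rows`) are assumptions made up for this illustration.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical log format, assumed for illustration only:
#   <YYYY-MM-DD> <LEVEL> <free-text message>
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return a structured record for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def to_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only the lines that fit the expected shape, as column/value records."""
    rows = []
    for line in lines:
        row = parse_line(line)
        if row is not None:
            rows.append(row)
    return rows


sample = [
    "2015-03-01 ERROR Disk quota exceeded",
    "not a log line",
    "2015-03-02 INFO Job finished",
]
rows = to_rows(sample)
for row in rows:
    print(row["date"], row["level"], row["message"])
```

Each resulting record has fixed columns (`date`, `level`, `message`), which is the shape a data warehouse table expects; lines that do not fit the pattern are simply skipped in this sketch.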
Microsoft provides the Hadoop environment as a cloud service in Azure, known as HDInsight. It supports not only Hadoop, but also HBase, Spark, and Storm. You can create your Hadoop cluster in Azure as an HDInsight cluster to hold semi-structured and unstructured data, and query it using either Pig or Hive.
In this recipe, let's see how...