Leveraging an HDInsight big data cluster
So far, we've managed blob data using SSIS. In those recipes, the data was at rest and SSIS was used to manipulate it; in Azure parlance, SSIS was the orchestration service. As stated in the introduction, SSIS can only be used on-premises and, so far, on a single machine.
The goal of this recipe is to use Azure HDInsight computation services. These services allow us to use (rent) powerful resources as a cluster of machines. These machines can run Linux or Windows, according to user choice, but be aware that Windows will be deprecated for the newest version of HDInsight. Such clusters, fast and powerful as they are, are very expensive to use. This is to be expected, since we're talking about a potentially large amount of hardware.
For this reason, we rarely want these computing resources running continuously. SSIS provides a way to create a cluster on demand and drop it once it is no longer needed. The following recipe will show you how to do it.
Getting ready
You...