Creating an on-demand Azure HDInsight cluster
So far, we have installed the Azure Feature Pack in SSIS and created a storage account. It is now time to create a compute service in Azure so that we can manipulate some data.
An HDInsight cluster is a compute resource in Azure. It is essentially a Hortonworks (now Cloudera) distribution offered as a service in Azure. It is composed of Linux virtual machines that have Apache Hadoop or Spark installed on them. Hadoop has been around for more than a decade and was one of the first widely adopted big data compute platforms. Hadoop's MapReduce engine writes (stages) data to disk at almost every stage of a program's execution. Spark is a newer technology. Instead of staging data on disk, it keeps data in memory where possible while a program executes. It is therefore often much faster than Hadoop MapReduce.
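To make the difference between staging on disk and working in memory more concrete, here is a minimal sketch in plain Python. It is only a toy illustration and does not use the Hadoop or Spark APIs. The function names (`map_stage`, `reduce_stage`, `run_staged_on_disk`, `run_in_memory`) and the sample data are hypothetical and exist only for this example. Both runs perform the same two-stage word count; the first writes the intermediate results to a temporary file between stages, while the second passes them along in memory.

```python
import json
import os
import tempfile
from collections import Counter


def map_stage(lines):
    """Stage 1: emit a (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_stage(pairs):
    """Stage 2: sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_staged_on_disk(lines):
    """Disk-staged style: persist the intermediate pairs, then read them back."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        # Write the output of stage 1 to disk before stage 2 starts.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(map_stage(lines), handle)
        # Stage 2 reads its input back from disk.
        with open(path, encoding="utf-8") as handle:
            pairs = json.load(handle)
        return reduce_stage(pairs)
    finally:
        os.remove(path)


def run_in_memory(lines):
    """In-memory style: hand the intermediate pairs directly to stage 2."""
    return reduce_stage(map_stage(lines))


if __name__ == "__main__":
    sample = ["big data big", "data lake"]
    print(run_staged_on_disk(sample))
    print(run_in_memory(sample))
```

Both functions return the same counts; the only difference is the extra round trip to disk between the stages. On a real cluster, that round trip is repeated for large volumes of data across many machines, which is where the performance gap comes from.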
We will use Hadoop clusters in this chapter because SSIS uses this type of cluster. HDInsight clusters can be very expensive if we create them manually and leave them running continuously...