Designing a batch processing solution
In Chapter 2, Designing a Data Storage Structure, we learned about the data lake architecture. I've reproduced the diagram here for convenience. It has two branches, one for batch processing and the other for real-time processing, and the part highlighted in green is the batch processing solution for a data lake. Batch processing usually deals with larger amounts of data than stream processing and therefore takes more time to produce results.
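To make the batch branch more concrete, here is a minimal sketch of a batch job in plain Python. It is only an illustration, not how a production pipeline would be built: the in-memory CSV text stands in for files landed in a storage system, an in-memory SQLite database stands in for an analytical data store, and the names `RAW_SALES_CSV`, `extract`, `transform`, `load`, `run_batch`, and `sales_by_region` are all hypothetical. The point it shows is that a batch job reads a complete, bounded dataset, processes it in one pass, and writes the result out for querying.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for raw files sitting in a storage system.
RAW_SALES_CSV = """region,amount
east,100
west,250
east,50
"""


def extract(raw_text):
    """Read the whole bounded dataset at once (the 'batch')."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Aggregate the total amount per region."""
    totals = {}
    for row in rows:
        region = row["region"]
        totals[region] = totals.get(region, 0) + int(row["amount"])
    return totals


def load(totals, conn):
    """Write the aggregated result to a table that can be queried later."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total INTEGER)"
    )
    # INSERT OR REPLACE keeps the job safe to rerun on the same input.
    conn.executemany(
        "INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()


def run_batch(raw_text, conn):
    """Run the stages in order, as a scheduler would trigger them."""
    load(transform(extract(raw_text)), conn)


conn = sqlite3.connect(":memory:")
run_batch(RAW_SALES_CSV, conn)
print(conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall())
# prints [('east', 150), ('west', 250)]
```

In a real data lake, each of these stand-ins would typically be replaced by a dedicated service, and the job would be triggered on a schedule rather than run by hand.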
A batch processing solution typically consists of five major components:
- Storage systems such as Azure Blob Storage, ADLS Gen2, HDFS, or similar
- Transformation/batch processing systems such as Spark, SQL, or Hive (via Azure HDInsight)
- Analytical data stores such as Azure Synapse dedicated SQL pool, Cosmos DB, or HBase (via Azure HDInsight)
- Orchestration systems such as ADF and Oozie (via Azure...