Reference architecture for batch ETL workloads
When data analysts receive data from different sources, the first step is to transform it into a format suitable for analysis or reporting. This transformation may involve several steps to bring the data to the desired state; once it is ready, it is loaded into a data warehousing system or data lake, where data analysts or data scientists can consume it.
To make the data available for consumption, you extract it from the source, transform it through a series of steps, and then load it into the target storage layer, hence the term Extract, Transform, and Load (ETL). In other use cases, when the raw data is already in a structured format, you can load it directly into a relational database or data warehouse and then transform it with SQL; this pattern is called Extract, Load, and Transform (ELT).
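The extract, transform, and load steps can be sketched as three small functions chained together. This is a minimal illustration, not a production pipeline: the CSV source text, the `orders` table schema, and the cents conversion are all hypothetical assumptions chosen for the example, using only the Python standard library.

```python
import csv
import sqlite3
from io import StringIO

# Hypothetical raw source: sales records arriving as CSV text.
RAW_CSV = """order_id,amount,currency
1,10.50,usd
2,3.00,usd
3,7.25,usd
"""

def extract(raw_text):
    """Extract: read rows from the CSV source into dictionaries."""
    return list(csv.DictReader(StringIO(raw_text)))

def transform(rows):
    """Transform: cast amounts to integer cents and normalize currency codes."""
    return [
        (int(r["order_id"]), round(float(r["amount"]) * 100), r["currency"].upper())
        for r in rows
    ]

def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount_cents INTEGER, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

# Run the pipeline end to end against an in-memory database.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount_cents) FROM orders").fetchone()[0]
print(total)  # 2075
```

In the ELT variant, the `transform` step would instead run as SQL inside the target system after loading the raw rows, which is often preferable when the warehouse engine can scale the transformation better than the ingestion layer.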
What all of this tells us is that transformation is the central step that makes raw data ready for consumption. What...