What is ETL?
ETL stands for Extraction, Transformation, and Loading. The term has been around for decades and is an industry standard for the data movement and transformation process used to build data pipelines that deliver BI and analytics. ETL processes are widely used in data migration and master data management initiatives. Since the focus of our book is on Spark, we'll touch only lightly on the subject of ETL and will not go into more detail.
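To make the three stages concrete, here is a minimal sketch in plain Python (standard library only, not Spark). The sample data, the `orders` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all assumptions made for illustration; a real pipeline would read from an actual source system and write to a real target.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file or source system.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,carol,7.25
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and cast each field to a proper type."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_pipeline(text, conn):
    """Chain the three stages: extract, then transform, then load."""
    load(transform(extract(text)), conn)

conn = sqlite3.connect(":memory:")  # in-memory database standing in for a warehouse
run_pipeline(RAW_CSV, conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(total)  # (2, 17.75)
```

Keeping each stage in its own function means the source or the target can be swapped without touching the transformation logic.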
Extraction
Extraction is the first part of the ETL process representing the extraction of data from source systems. This is often one of the most important parts of the ETL process, and it sets the stage for further downstream processing. There are a few major things to consider during an extraction process:
- The source system type (RDBMS, NoSQL, flat files, Twitter/Facebook streams)
- The file formats (CSV, JSON, XML, Parquet, Sequence, Object files)
- The frequency of the extract (daily, hourly, every second)
- The size of the extract...
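
Two of these considerations, file format and extract size, can be sketched in a few lines of plain Python (standard library only, not Spark). This is an illustrative sketch under stated assumptions: the `EXTRACTORS` registry, the `extract` function, the `batch_size` parameter, and the choice of newline-delimited JSON as the JSON variant are all hypothetical. Scheduling (the frequency of the extract) is left out.

```python
import csv
import io
import json
from itertools import islice

def extract_csv(stream):
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(stream)

def extract_json_lines(stream):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical registry mapping a format name to its reader.
EXTRACTORS = {"csv": extract_csv, "jsonl": extract_json_lines}

def extract(stream, file_format, batch_size=2):
    """Yield lists of at most batch_size records, so a large extract
    does not have to fit in memory all at once."""
    try:
        reader = EXTRACTORS[file_format]
    except KeyError:
        raise ValueError(f"unsupported format: {file_format}")
    records = reader(stream)
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch

# In-memory streams standing in for real source files.
csv_batches = list(extract(io.StringIO("id,name\n1,a\n2,b\n3,c\n"), "csv"))
json_batches = list(extract(io.StringIO('{"id": 1}\n{"id": 2}\n'), "jsonl"))
print(len(csv_batches), len(json_batches))
```

Supporting another format then only requires adding a reader function to the registry, while the batching logic stays the same.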