Data processing in the cloud
The success of deep learning (DL) projects depends on the quality and the quantity of data. Therefore, the systems for data preparation must be stable and scalable enough to process terabytes or even petabytes of data efficiently. This often requires more than a single machine: a cluster of machines running a powerful extract, transform, and load (ETL) engine must be set up to store and process such large volumes of data.
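As a rough, single-machine analogue of what a cluster-based ETL engine does, the sketch below (plain Python, hypothetical data) partitions a dataset into shards, transforms the shards in parallel worker threads, and merges the results. A real engine such as Apache Spark applies this same scatter/gather pattern across many machines rather than threads.

```python
# Toy single-machine analogue of distributed data transformation.
# Shards, the transform, and the data are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def transform_shard(shard):
    # Per-worker transform step (here: square each value).
    return [x * x for x in shard]

data = list(range(8))
# Partition the dataset into four shards, one per worker.
shards = [data[i::4] for i in range(4)]

# Scatter: each worker transforms its own shard in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(transform_shard, shards))

# Gather: merge the transformed shards back into one dataset.
merged = sorted(x for shard in results for x in shard)
print(merged)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

In a cluster setting, the partitioning, scheduling, and fault tolerance are handled by the engine; the programmer mostly supplies the per-shard transform.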
First, we would like to introduce ETL, the core concept in data processing in the cloud. Next, we will provide an overview of a distributed system setup for data processing.
Introduction to ETL
During the ETL process, data is extracted from one or more sources, transformed into different forms as necessary, and loaded into data storage. In short, ETL covers the overall data processing pipeline. ETL interacts with three different types of data: structured, unstructured, and semi-structured. While structured...
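The extract, transform, and load steps can be sketched with plain Python as follows. This is a minimal illustration, not a production pipeline: the CSV content, column names, and the SQLite table are all hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read records from a source (an in-memory CSV stands in
# for a file, database, or API; values are made up for illustration).
raw = "user_id,country,amount\n1,us,10.5\n2,de,7.25\n3,us,3.0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types, normalize fields, and derive a new column.
transformed = [
    {
        "user_id": int(r["user_id"]),
        "country": r["country"].upper(),
        "amount_cents": round(float(r["amount"]) * 100),
    }
    for r in rows
]

# Load: persist the transformed records into a data store (SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (user_id INTEGER, country TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (:user_id, :country, :amount_cents)",
    transformed,
)
total = conn.execute("SELECT SUM(amount_cents) FROM purchases").fetchone()[0]
print(total)  # 2075
```

At terabyte scale the same three stages remain, but each is executed by a distributed engine over partitioned data rather than in a single process.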