Data lakes
A data lake can be defined as a centralized repository that allows you to store all structured and unstructured data at any scale. With today’s hyperscalers providing cheap, durable storage, it is now possible for organizations to store all of their data in the cloud without significant cost implications. Data lakes are broken down into layers, or zones.
In the first layer of the data lake, data is generally stored as-is. This lowers the barrier to entry and enables organizations to move all of their data into the “lake” without significantly increasing development or maintenance costs. Because the first layer is an as-is copy of the source data, organizations can use an automated, configuration-based pipeline to onboard new sources.
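As a minimal sketch of what such a configuration-based pipeline might look like, the snippet below copies each configured source into the raw zone as-is using boto3. The SOURCE_CONFIGS structure, bucket names, and prefixes are illustrative assumptions, not a specific product’s format.

```python
import boto3

# Each new source is just another config entry; no bespoke pipeline code.
# All names below are hypothetical placeholders.
SOURCE_CONFIGS = [
    {"source_name": "orders", "src_bucket": "erp-exports", "src_prefix": "orders/",
     "lake_bucket": "corp-data-lake", "lake_prefix": "raw/erp/orders/"},
    {"source_name": "customers", "src_bucket": "crm-exports", "src_prefix": "customers/",
     "lake_bucket": "corp-data-lake", "lake_prefix": "raw/crm/customers/"},
]

def ingest(config: dict) -> None:
    """Copy every object from the source location into the raw zone, unchanged."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=config["src_bucket"], Prefix=config["src_prefix"]):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            dest_key = config["lake_prefix"] + key[len(config["src_prefix"]):]
            s3.copy_object(
                Bucket=config["lake_bucket"],
                Key=dest_key,
                CopySource={"Bucket": config["src_bucket"], "Key": key},
            )

for cfg in SOURCE_CONFIGS:
    ingest(cfg)
```

Onboarding another source then amounts to adding one more entry to the configuration, which is what keeps development and maintenance costs flat as the lake grows.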
Organizations usually pick a replication tool such as AWS Database Migration Service (AWS DMS) to bring the data into the data lake. Although AWS DMS requires managing the replication infrastructure, it is otherwise a mostly hands-off mechanism for hydrating the lake. Organizations may also use a push mechanism such as SFTP, transferring files to an Amazon Simple Storage Service (Amazon S3)-based data lake through AWS Transfer Family.
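A hedged sketch of kicking off such a replication with boto3 follows. It assumes the source and target endpoints and the replication instance already exist; every ARN and identifier below is a placeholder.

```python
import json
import boto3

dms = boto3.client("dms")

# Replicate every table in a hypothetical "sales" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-lake",                 # placeholder name
    SourceEndpointArn="arn:aws:dms:region:acct:endpoint:SRC",  # placeholder ARN
    TargetEndpointArn="arn:aws:dms:region:acct:endpoint:TGT",  # placeholder ARN (S3 target)
    ReplicationInstanceArn="arn:aws:dms:region:acct:rep:RI",   # placeholder ARN
    MigrationType="full-load-and-cdc",  # initial bulk copy, then ongoing change capture
    TableMappings=json.dumps(table_mappings),
)
task_arn = task["ReplicationTask"]["ReplicationTaskArn"]

# The task must reach the "ready" state before it can be started.
dms.get_waiter("replication_task_ready").wait(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)
```

The `full-load-and-cdc` migration type is what makes this largely hands-off: after the initial copy, ongoing changes flow into the lake without further intervention.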
During data preparation, data from the first layer is compressed and partitioned, and audit columns are added so that downstream systems can consume it more effectively. Having all the data in the data lake enables data analysts to do initial discovery and find out the value of combining data from various sources. If value is discovered, the necessary transformations are applied in an ETL pipeline so that the target is hydrated with new data periodically or through a streaming arrangement. The output of these automated transformations is then loaded into the final layer of the data lake for user consumption.
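A minimal PySpark sketch of this preparation step is shown below: read raw files as-is, add audit columns, then write compressed, partitioned Parquet into the next zone. The paths, the orders dataset, and its `order_ts` column are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-prepared").getOrCreate()

# Read the as-is copy from the first layer (hypothetical path and schema).
raw = spark.read.option("header", "true").csv("s3://corp-data-lake/raw/erp/orders/")

prepared = (
    raw
    # Audit columns: when the row was ingested and which file it came from.
    .withColumn("ingest_ts", F.current_timestamp())
    .withColumn("source_file", F.input_file_name())
    # A partition key derived from the data keeps downstream scans cheap.
    .withColumn("order_date", F.to_date("order_ts"))
)

(prepared.write
    .mode("overwrite")
    .partitionBy("order_date")
    .option("compression", "snappy")  # compact, splittable columnar output
    .parquet("s3://corp-data-lake/prepared/erp/orders/"))
```

Partitioning by a date column and writing snappy-compressed Parquet is a common choice here because downstream queries can prune whole partitions and read only the columns they need.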