Setting up Lake Formation
Now, it’s time to take a closer look at setting up our serverless data lake on AWS! Before we begin, let’s define what a data lake is and what type of data is stored in it. A data lake is a centralized data store that holds structured, semi-structured, and unstructured data from different data sources. As shown in the following diagram, data can be stored in a data lake without us having to define its structure and format up front. We can store files in a variety of formats, such as JSON, CSV, and Apache Parquet. In addition, a data lake may contain both raw and processed (clean) data:
Figure 4.26 – Getting started with data lakes
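To make the distinction between raw and processed data more concrete, here is a minimal local sketch. It is not Lake Formation code: a temporary directory stands in for the data lake's storage, and the `raw/` and `processed/` folder names, the sample records, and the `write_lake` helper are all hypothetical names chosen for illustration. The same records are stored once as-is in JSON and once as cleaned CSV:

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "name": "alice", "score": "0.91"},
    {"id": 2, "name": "bob", "score": ""},  # missing value, kept as-is in the raw zone
]


def write_lake(root: Path) -> dict:
    """Write the same records as raw JSON and as cleaned CSV under `root`."""
    raw_path = root / "raw" / "records.json"
    clean_path = root / "processed" / "records.csv"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    clean_path.parent.mkdir(parents=True, exist_ok=True)

    # Raw zone: store the data exactly as it arrived, with no schema enforced.
    raw_path.write_text(json.dumps(RECORDS))

    # Processed zone: drop rows with a missing score and cast the score to float.
    clean = [
        {"id": r["id"], "name": r["name"], "score": float(r["score"])}
        for r in RECORDS
        if r["score"] != ""
    ]
    with clean_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name", "score"])
        writer.writeheader()
        writer.writerows(clean)

    return {"raw": raw_path, "processed": clean_path}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for zone, path in write_lake(Path(tmp)).items():
            print(zone, path.relative_to(tmp))
```

Both files describe the same source data, yet they differ in format and in how much cleaning has been applied. In a real data lake, the two folders would typically map to separate prefixes or buckets in the underlying storage.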
ML engineers and data scientists can use data lakes as the source of the data used for building and training ML models. Since the data stored in data lakes may be a mixture of both raw and clean data, additional data processing, data cleaning...