Ingesting the data
The first step in our pipeline is to ingest data from JSON files and establish a robust process that efficiently loads and stores this data within our data platform. The initial destination for the incoming data will be our Bronze layer.
Our data, which originates from our operations department, arrives in JSON format. As part of our data ingestion process, we will collect these JSON files and store their contents in their original, unaltered state within the Bronze layer. Retaining the raw data ensures that we maintain an immutable historical record of all incoming data, which can be invaluable for traceability, auditing, and data lineage purposes. Our operations department will land the data in a storage location for us once a day, in a folder specific to that day; for example, /<storage_location>/event_date=<yyyy-mm-dd>.
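The daily folder convention described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names daily_landing_path and read_daily_json and the /mnt/landing root are hypothetical and do not come from the pipeline itself, and the spark argument is assumed to be an existing PySpark SparkSession.

```python
from datetime import date


def daily_landing_path(storage_location: str, event_date: date) -> str:
    """Build the folder path for one day's drop: <root>/event_date=<yyyy-mm-dd>."""
    # rstrip avoids a double slash when the root is given with a trailing "/".
    return f"{storage_location.rstrip('/')}/event_date={event_date.isoformat()}"


def read_daily_json(spark, storage_location: str, event_date: date):
    """Read one day's raw JSON files into a DataFrame without transforming them.

    `spark` is assumed to be an existing pyspark.sql.SparkSession; it is passed
    in so that this sketch does not need to create a session itself.
    """
    return spark.read.json(daily_landing_path(storage_location, event_date))


if __name__ == "__main__":
    # Hypothetical landing root, used only for illustration.
    print(daily_landing_path("/mnt/landing", date(2024, 1, 15)))
    # prints /mnt/landing/event_date=2024-01-15
```

Keeping the path construction in one small function means every daily run resolves its input folder the same way, and the read step applies no transformation, so the data reaches the Bronze layer unaltered.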
To further enhance the capabilities and manageability of our data, we will leverage the Delta Lake (Delta) format...