Handling late-arriving data
We haven't covered late-arriving data yet, so let's look at how to handle it.
Late-arriving data can show up at three different stages of a data pipeline: the data ingestion phase, the transformation phase, and the serving phase.
Handling late-arriving data in the ingestion/transformation stage
During the ingestion and transformation phases, the activities usually include copying data into the data lake and performing data transformations using engines such as Spark, Hive, and so on. In such scenarios, the following two methods can be used:
- Drop the data if your application can tolerate some amount of data loss. This is the easiest option. Keep a record of the last timestamp that has been processed; if new data arrives with an older timestamp, ignore that message and move forward.
- Rerun the pipeline from the ADF Monitoring tab, if your application cannot handle...
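The drop option in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not an ADF or Spark API: the function name `drop_late_records`, the `(timestamp, payload)` record shape, and the choice to return the dropped records for auditing are all assumptions made for this example.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

# Hypothetical record shape: (event timestamp, payload).
Record = Tuple[datetime, str]


def drop_late_records(
    records: Iterable[Record],
    last_processed: Optional[datetime] = None,
) -> Tuple[List[Record], List[Record], Optional[datetime]]:
    """Split records into accepted and dropped, tracking the last processed timestamp."""
    accepted: List[Record] = []
    dropped: List[Record] = []
    for event_time, payload in records:
        # A record older than the last processed timestamp is late: ignore it.
        if last_processed is not None and event_time < last_processed:
            dropped.append((event_time, payload))
            continue
        accepted.append((event_time, payload))
        last_processed = event_time  # advance the stored timestamp
    return accepted, dropped, last_processed


def at_minute(minute: int) -> datetime:
    """Helper for the example: a UTC timestamp on an arbitrary fixed day."""
    return datetime(2024, 1, 1, 12, minute, tzinfo=timezone.utc)


batch = [(at_minute(0), "a"), (at_minute(5), "b"), (at_minute(3), "late"), (at_minute(6), "c")]
accepted, dropped, last_seen = drop_late_records(batch)
print([payload for _, payload in accepted])  # ['a', 'b', 'c']
print([payload for _, payload in dropped])   # ['late']
```

In a real pipeline, the last processed timestamp would have to be persisted between runs (for example, in a control table or a file in the data lake) so that the next run can pass it in as `last_processed`.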