Understanding the batch processing problem
In Chapter 1, Introduction to ML Engineering, we saw the scenario of a taxi firm that wanted to analyze anomalous rides at the end of every day. The customer had the following requirements:
- Rides should be clustered based on ride distance and time, and anomalies/outliers identified.
- Speed (distance/time) was not to be used, as analysts would like to understand long-distance rides or those with a long duration.
- The analysis should be carried out on a daily schedule.
- The data for inference should be consumed from the company’s data lake.
- The results should be made available for consumption by other company systems.
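To make the first two requirements concrete, the following is a minimal sketch of clustering rides on distance and time and flagging the outliers. The choice of a density-based algorithm (a small, pure-Python DBSCAN), the `eps` and `min_samples` values, and the sample rides are all assumptions made for illustration; they are not specified by the requirements above. Note that speed is deliberately not derived as a feature.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: one cluster label per point, -1 for noise (anomalies)."""
    n = len(points)
    # Each point counts as its own neighbor, as is conventional for DBSCAN.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps] for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


# Hypothetical rides as (distance_km, time_min); values are illustrative only.
rides = [
    (2.0, 8.0), (2.5, 10.0), (3.0, 11.0), (2.2, 9.0),
    (2.8, 10.5), (3.1, 12.0), (2.4, 9.5),
    (45.0, 20.0),  # unusually long distance
    (3.0, 95.0),   # unusually long duration
]

# eps and min_samples are assumed values and would need tuning on real data.
labels = dbscan(standardize(rides), eps=0.5, min_samples=3)
anomalies = [ride for ride, label in zip(rides, labels) if label == -1]
print(anomalies)
```

With these sample values, the long-distance ride and the long-duration ride are labeled as noise, while the remaining rides fall into a single cluster. On real data, the features would be read from the data lake rather than hard-coded.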
Based on the description in the introduction to this chapter, we can now add some extra requirements:
- The system’s results should contain information on each ride’s classification as well as a summary of relevant textual data.
- Only anomalous rides need to have...