Understanding the batch processing problem
In Chapter 1, Introduction to ML Engineering, we saw the scenario of a taxi firm that wanted to analyze anomalous rides at the end of every day. The customer had the following requirements:
- Rides should be clustered based on ride distance and ride time, and anomalies/outliers should be identified.
- Speed (distance/time) should not be used, as analysts would like to understand long-distance rides or those with a long duration.
- The analysis should be carried out on a daily schedule.
- The data for inference should be consumed from the company's data lake.
- The results should be made available for consumption by other company systems.
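To make the first two requirements concrete, the following is a minimal sketch of flagging outlying rides using only distance and time as features (not speed). It is a simplified density check in the spirit of density-based clustering, not the implementation used elsewhere in this book; in practice, a library clustering algorithm would typically take its place. The field names `ride_dist` and `ride_time`, the function names, and the `radius` and `min_neighbors` values are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (zeros if constant)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def flag_anomalous_rides(rides, radius=0.5, min_neighbors=3):
    """Return one label per ride: -1 for an outlier, 0 otherwise.

    A ride is labeled an outlier if fewer than `min_neighbors` other rides
    lie within `radius` of it in standardized (distance, time) space.
    Speed is deliberately not used as a feature.
    """
    # Hypothetical field names; adapt to the real data lake schema.
    dist = standardize([r["ride_dist"] for r in rides])
    time = standardize([r["ride_time"] for r in rides])
    points = list(zip(dist, time))

    labels = []
    for i, (xi, yi) in enumerate(points):
        # Count the other rides within Euclidean distance `radius`.
        neighbors = sum(
            1
            for j, (xj, yj) in enumerate(points)
            if j != i and ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5 <= radius
        )
        labels.append(-1 if neighbors < min_neighbors else 0)
    return labels


# Illustrative data only: five similar rides and one very long ride.
example_rides = [
    {"ride_dist": 5.0, "ride_time": 15.0},
    {"ride_dist": 5.1, "ride_time": 15.5},
    {"ride_dist": 4.9, "ride_time": 14.5},
    {"ride_dist": 5.2, "ride_time": 16.0},
    {"ride_dist": 4.8, "ride_time": 14.0},
    {"ride_dist": 60.0, "ride_time": 200.0},
]
example_labels = flag_anomalous_rides(example_rides)
```

Standardizing both features first keeps either distance or time from dominating the neighbor search purely because of its units. Note that this simple check treats any sparsely populated ride as an outlier, which is stricter than a full clustering algorithm that also considers whether a point borders a dense cluster.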
As we did in Chapter 2, The Machine Learning Development Process, and Chapter 7, Building an Example ML Microservice, we can now build out some user stories from these requirements, as follows:
- User story 1: As an operations analyst, I want to be given clear labels of rides that have anomalously...