Challenges of data preparation with streaming data
Before diving into specific algorithms and solutions, let's first discuss in general terms why data preparation can be different when data arrives in a streaming fashion. Several reasons can be identified, including the following:
- The first, obvious issue is data drift. As discussed in detail in the previous chapter, the trends and descriptive statistics of your data can change slowly over time. If your feature engineering or data preparation processes depend too heavily on your data following certain distributions, you may run into problems when drift occurs. Because many solutions for this were already proposed in the previous chapter, the topic is not covered further in this chapter.
- The second issue is that population parameters are unknown. When observing data in a streaming fashion, it is possible, and even likely, that your estimates of the population...
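To make the second point concrete, here is a minimal sketch of how estimates of the mean and variance shift as observations arrive one at a time. It uses Welford's online algorithm, which is one common choice and not necessarily the approach used later in this chapter. The class name `RunningStats` and the sample values are illustrative assumptions, not taken from the text.

```python
class RunningStats:
    """Tracks the mean and sample variance of a stream incrementally.

    Illustrative helper (hypothetical name) based on Welford's algorithm.
    """

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # current estimate of the mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running estimates."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until two observations are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Made-up stream: the later values are much larger than the early ones,
# so the estimates keep moving as more data is observed.
stats = RunningStats()
for value in [10.0, 12.0, 11.0, 30.0, 32.0]:
    stats.update(value)
    print(f"n={stats.n} mean={stats.mean:.2f} variance={stats.variance:.2f}")
```

Any preparation step that relies on such estimates, for example standardization, is working with values that may still be moving at the moment they are used.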