Ensuring that the data is fresh
Data freshness is another important aspect of data quality, and it affects the quality and robustness of machine learning applications. Let’s imagine that we have a machine learning application that was trained on 2019 and 2020 customer behavior and is used to predict hotel room bookings up to April 2021. Maybe the January and February predictions were quite accurate, but when March and April hit, accuracy dropped. This might have been due to COVID-19, whose effects were not captured in the training data. In machine learning, this is called data drift: the data distribution in March and April was quite different from the distribution in 2019 and 2020 that the model had learned from. By ensuring that the data is fresh and up to date, we can retrain the model more regularly, or as soon as data drift is detected.
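To make the idea of comparing distributions concrete, here is a minimal sketch of a drift check based on the two-sample Kolmogorov-Smirnov statistic, written with the Python standard library only. It does not use any drift detection package. The function names (`ks_statistic`, `drift_detected`), the 5% significance level, and the booking numbers are assumptions made up for illustration, and the threshold relies on the usual large-sample approximation.

```python
import random
from bisect import bisect_right
from math import log, sqrt


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / n  # share of reference values <= x
        cdf_cur = bisect_right(cur, x) / m  # share of current values <= x
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the approximate
    large-sample critical value for significance level alpha."""
    n, m = len(reference), len(current)
    c_alpha = sqrt(-0.5 * log(alpha / 2))  # about 1.358 for alpha = 0.05
    threshold = c_alpha * sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > threshold


# Synthetic daily booking counts (hypothetical, for illustration only).
random.seed(0)
training_period = [random.gauss(100, 10) for _ in range(500)]
similar_period = [random.gauss(100, 10) for _ in range(200)]
shifted_period = [random.gauss(60, 10) for _ in range(200)]

print("similar period drifted:", drift_detected(training_period, similar_period))
print("shifted period drifted:", drift_detected(training_period, shifted_period))
```

A check like this compares a single numeric feature at a time. With a 5% significance level, it will also occasionally flag drift on data that has not actually changed, so in practice the threshold and the retraining policy need to be tuned together.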
To measure data drift, we will use the alibi Python package. However, there are more extensive Python...