Even a strong offline testing pipeline cannot guarantee that a model will perform the same way in production. Several risks can always affect your model's performance, such as the following:
- Humans: We can make mistakes and leave bugs in the code.
- Data collection: Selection bias and incorrect data-collection procedures may distort your metrics, so the measured values no longer reflect true performance.
- Changes: Real-world data may change and deviate from your training dataset, leading to unexpected model behavior.
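As a rough illustration of the last risk, the sketch below compares one feature's values in the training data against values seen in production, using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). The function names `ks_statistic` and `has_drifted` and the `threshold` of 0.2 are assumptions made for this example, not recommended settings; a real check would tune the threshold per feature and would likely rely on a statistics library.

```python
from bisect import bisect_right


def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 when they coincide, 1.0 when they do not overlap at all.
    """
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref_sorted = sorted(reference)
    live_sorted = sorted(live)
    max_gap = 0.0
    # Both CDFs are step functions, so the gap can only change at observed values.
    for x in ref_sorted + live_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        live_cdf = bisect_right(live_sorted, x) / len(live_sorted)
        max_gap = max(max_gap, abs(ref_cdf - live_cdf))
    return max_gap


def has_drifted(reference, live, threshold=0.2):
    """Flag a feature whose live distribution has moved away from training.

    The default threshold is an illustrative assumption, not a standard value.
    """
    return ks_statistic(reference, live) > threshold
```

Calling `has_drifted(train_values, production_values)` on each numeric feature at regular intervals gives an early, if coarse, signal that production data no longer resembles the training set. It does not replace a live test, because a feature can keep its distribution while its relationship to the target changes.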
The only way to be certain about model performance in the near future is to perform a live test. Depending on the environment, such a test may introduce significant risks. For example, models that assess airplane engine quality or patient health should not be tested in the real world until we are confident in their performance.
When the time for a live test comes, you will want...