The domain gap problem in ML
In this section, we will look at what the domain gap is and why it is a problem in ML. The domain gap is one of the main issues that limit the usability of synthetic data in practice. It usually refers to the dissimilarity between the distributions and properties of data in two or more domains. The problem is not specific to synthetic data; it is common across ML in general. A model's performance often degrades when it is tested on a dataset that is similar to, but slightly different from, the one it was trained on. For more information, please refer to Who is closer: A computational method for domain gap evaluation (https://doi.org/10.1016/j.patcog.2021.108293).
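To make this degradation concrete, here is a minimal sketch that is not taken from the referenced paper. It assumes a toy two-class dataset drawn from Gaussians, a simple nearest-centroid classifier, and a constant feature offset that stands in for the difference between a source and a target domain. The function names (`sample_domain`, `fit_centroids`, `predict`, `accuracy`) and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n_per_class, offset):
    """Draw two Gaussian classes in 2D.

    `offset` shifts every feature by a constant; it is an assumed,
    simplified stand-in for a difference between domains.
    """
    x0 = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 2)) + offset
    x1 = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 2)) + offset
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y

def fit_centroids(X, y):
    """Compute one mean vector per class (nearest-centroid classifier)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, X):
    """Assign each sample to the class with the closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def accuracy(centroids, X, y):
    return float((predict(centroids, X) == y).mean())

# Train on the source domain only.
X_train, y_train = sample_domain(1000, offset=0.0)
centroids = fit_centroids(X_train, y_train)

# Evaluate on held-out source data and on shifted target data.
X_source_test, y_source_test = sample_domain(1000, offset=0.0)
X_target_test, y_target_test = sample_domain(1000, offset=2.0)

source_acc = accuracy(centroids, X_source_test, y_source_test)
target_acc = accuracy(centroids, X_target_test, y_target_test)

print(f"source accuracy: {source_acc:.3f}")
print(f"target accuracy: {target_acc:.3f}")
```

In this sketch, the classes in the target domain are just as separable as in the source domain, yet the classifier's accuracy drops because its decision boundary was learned from the source distribution. Real domain gaps are rarely a single constant offset, so this should be read as an illustration of the effect rather than a model of it.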
The main reasons for the domain gap between datasets can be linked to the following:
- Sensitivity to sensors’ variations
- Discrepancy in class and feature distributions
- Concept drift
Let’s discuss each of these points in more detail.