Best practices for the effective utilization of synthetic data
In this section, we will learn about some common good practices that can make your synthetic data-based ML solution more usable in practice:
- Understand the problem: Before you start deploying synthetic data, make sure you understand what problem your ML model and data have and why the available real datasets are not suitable. Do not jump straight to a synthetic data solution if you are not fully aware of the problem and of the limitations of the available solutions based on real data.
- Understand the synthetic data generation pipeline: We should not treat the synthetic data generation pipeline as a black box. Instead, we need a good understanding of the generation process to avoid biases and artifacts. For example, suppose we are generating synthetic data for an application to flag fraudulent transactions. If our synthetic data generator often generates the majority of fraudulent transactions with...
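
One way to look inside the generator rather than treating it as a black box is to compare how feature values are distributed across classes in its output. The following is a minimal sketch, not a definitive check: the field names `channel` and `is_fraud`, the toy rows, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def value_shares_by_label(rows, feature, label):
    """Return {label_value: {feature_value: share}} for a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[label]][row[feature]] += 1
    return {
        lab: {val: n / sum(ctr.values()) for val, n in ctr.items()}
        for lab, ctr in counts.items()
    }


def dominant_values(shares, threshold=0.8):
    """List (label, feature_value, share) where one value dominates a class."""
    return [
        (lab, val, share)
        for lab, dist in shares.items()
        for val, share in dist.items()
        if share >= threshold
    ]


# Toy synthetic sample; the field names and values are made up (assumption).
synthetic_rows = (
    [{"channel": "online", "is_fraud": 1}] * 9
    + [{"channel": "in_store", "is_fraud": 1}] * 1
    + [{"channel": "online", "is_fraud": 0}] * 5
    + [{"channel": "in_store", "is_fraud": 0}] * 5
)

shares = value_shares_by_label(synthetic_rows, "channel", "is_fraud")
flagged = dominant_values(shares, threshold=0.8)
print(flagged)
```

A flagged entry does not prove that the generator is biased, since the same skew may exist in real data. It only marks a pattern worth comparing against the real distribution before a model learns it as a shortcut.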