Using Synthetic Data in Data-Centric Machine Learning
In previous chapters, we discussed various approaches to improving data quality for machine learning through better collection and labeling.
Although human labelers, data ownership, and technical data quality improvement practices are critical to data centricity, there are limits to the kinds of labeling and data creation that individuals can perform or that empirical observation can provide.
Synthetic data has the potential to fill in these gaps and produce comprehensive training data at a fraction of the cost and time of other approaches.
This chapter provides an introduction to synthetic data generation. We will cover the following main topics:
- What synthetic data is and why it matters for data centricity
- How synthetic data is being used to generate better models
- Common techniques used to generate synthetic data
- The risks and challenges of using synthetic data
Let’s start by defining...