The importance of quality data in ML
So far, we have defined what data-centric ML is and how it compares to the conventional model-centric approach. In this section, we will examine what good data looks like in practice.
From a data-centric perspective, good data is as follows [5]:
- Captured consistently: Independent variables (x) and dependent variables (y) are labeled unambiguously
- Full of signal and free of noise: Input data covers a wide range of important cases and events in as few observations as possible
- Designed for the business problem: Data is designed and collected specifically for solving a business problem with ML, rather than the problem being solved with whatever data is already available
- Timely and relevant: Independent and dependent variables provide an accurate representation of current trends (no data or concept drift)
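Two of these properties can be checked mechanically. The following is a minimal sketch, not a definitive implementation. The function names `find_conflicting_labels` and `mean_shift` are hypothetical, and it assumes labeled examples arrive as `(x, y)` pairs with hashable inputs. It also uses a deliberately crude drift signal: the shift of a feature's mean measured in reference standard deviations. Production systems typically rely on more robust statistical tests.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_conflicting_labels(rows):
    """Return every input x that appears with more than one label y.

    A non-empty result suggests the data was not captured consistently.
    """
    labels_by_input = defaultdict(set)
    for x, y in rows:
        labels_by_input[x].add(y)
    return {x: sorted(ys) for x, ys in labels_by_input.items() if len(ys) > 1}


def mean_shift(reference, current):
    """Distance between the two means, in reference standard deviations.

    This is a simplified stand-in for a drift check (an assumption of this
    sketch): large values hint that current data no longer resembles the
    data the model was trained on.
    """
    spread = pstdev(reference)
    gap = abs(mean(current) - mean(reference))
    if spread == 0:
        # A constant reference feature: any change at all is a shift.
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


# Illustrative, made-up examples: the same input carries two labels.
rows = [
    (("red", "round"), "apple"),
    (("red", "round"), "tomato"),
    (("yellow", "long"), "banana"),
]
conflicts = find_conflicting_labels(rows)

# Illustrative, made-up feature values from two time periods.
shift = mean_shift([10, 12, 11, 13], [20, 22, 21, 23])
```

Here `conflicts` flags the one input that was labeled two different ways, and `shift` comes out far above zero, which signals that the newer feature values have moved away from the reference period. What threshold counts as drift is a judgment call that depends on the business problem.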
At first glance, this sort of systematic data collection seems both expensive and time-consuming. However, in...