Understanding data-centric ML
Data-centric ML is the discipline of systematically engineering the data used to build ML and artificial intelligence (AI) systems [1].
The data-centric AI and ML movement is grounded in the philosophy that data quality matters more than data volume when building highly informative models. Put another way, it is possible to achieve more with a small but high-quality dataset than with a large but noisy one. For most ML use cases, it is not feasible to build models on very large datasets, say millions of observations, simply because that volume of data doesn’t exist. As a result, the potential of ML as a tool for solving certain problems is often ignored on the basis that the available dataset is too small.
But what if we could use ML to solve problems based on much smaller datasets, even fewer than 100 observations? This is one challenge the data-centric movement is attempting to solve through systematic data...