Principle 3 – use ML to improve your data
Just as we can use a programmatic or algorithmic approach to label our data, we can also use ML to identify data points that may be wrong or ambiguous. By leveraging developments in explainability, error analysis, and semi-supervised approaches, we can create new labels and find data points to improve or discard.
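One simple way to picture this is to flag examples whose current label the model itself considers unlikely. The sketch below assumes we already have predicted class probabilities from some classifier, ideally produced out-of-sample (for example, through cross-validation). The function name `flag_suspect_labels`, the 0.2 threshold, and the toy data are illustrative assumptions rather than a prescribed method.

```python
def flag_suspect_labels(labels, pred_probs, threshold=0.2):
    """Return indices of examples whose given label the model finds unlikely.

    labels: list of int class ids (the labels currently in the dataset)
    pred_probs: list of per-class probability lists, one per example
    threshold: hypothetical cutoff; tune it for your own data
    """
    suspects = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Low probability for the assigned label suggests a possible label error
        if probs[label] < threshold:
            suspects.append(i)
    return suspects


# Toy data (made up for illustration): four examples, two classes
labels = [0, 1, 1, 0]
pred_probs = [
    [0.90, 0.10],
    [0.15, 0.85],
    [0.95, 0.05],  # labeled 1, but the model strongly prefers class 0
    [0.55, 0.45],
]

suspect_indices = flag_suspect_labels(labels, pred_probs)
print(suspect_indices)
```

Flagged examples are candidates for human review, not automatic proof of an error: the model may simply be wrong about a hard but correctly labeled example.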
Here are some practical steps to generate better input data with ML:
- Toss out noisy examples: More data is not always better. Noisy data can lead to inaccurate predictions, so removing noisy examples improves the quality of our input data. For instance, if you’re analyzing customer reviews and some are filled with random characters or irrelevant information, those reviews can be treated as “noisy” and removed.
- Focus improvement effort on a subset of the data: Not all data has the same value. We can focus on a subset of data to improve the quality of our input data...
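To make the customer review example concrete, here is a minimal sketch that drops reviews made up mostly of random characters. It uses a simple rule-based heuristic as a stand-in for a learned noise detector; the function names `noise_score` and `drop_noisy`, the 0.5 cutoff, and the sample reviews are all assumptions for illustration.

```python
import re

# A token counts as "word-like" if it is made of letters (with optional
# internal apostrophes or hyphens) followed by optional punctuation.
WORD_LIKE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*[.,!?;:]*$")
VOWELS = set("aeiouyAEIOUY")


def noise_score(text):
    """Fraction of whitespace-separated tokens that do not look like words."""
    tokens = text.split()
    if not tokens:
        return 1.0  # treat empty text as fully noisy
    noisy = 0
    for tok in tokens:
        # Not word-shaped, or word-shaped but with no vowel (e.g. "zzzq")
        if not WORD_LIKE.match(tok) or not any(c in VOWELS for c in tok):
            noisy += 1
    return noisy / len(tokens)


def drop_noisy(reviews, max_noise=0.5):
    """Keep only reviews whose noise score is at or below max_noise."""
    return [r for r in reviews if noise_score(r) <= max_noise]


# Made-up sample reviews
reviews = [
    "Great product, arrived on time!",
    "xkcd#$ 9931 @@@ zzzq",
    "",
    "Battery life is poor.",
]

clean = drop_noisy(reviews)
print(clean)
```

A heuristic like this only works for English-like text and will misjudge reviews containing numbers or product codes, so in practice the cutoff should be checked against a sample of the reviews it removes.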