Getting data under control
There’s a common saying in the AI community that an ML scientist’s job is only 10% ML and 90% data management. Like many such sayings, it is not far from the truth. Every ML task centers on training the model, but before training can start, you must get your data into a manageable form. Hours of training can be completely wasted if your data isn’t properly prepared.
Before you can start training a model, you have to decide what data you’re going to train it with. That data must be gathered, cleaned, converted into the right format, and generally made ready for training. Often, this involves a lot of manual processing and verification.
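The clean-and-convert steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed pipeline: the record layout (dictionaries with `text` and `label` keys), the `prepare` function, and the `PrepReport` class are all hypothetical names chosen for the example, and your own data will likely need different checks. The point is that each dropped record is counted by reason, so the manual verification pass has something concrete to review.

```python
from dataclasses import dataclass, field


@dataclass
class PrepReport:
    """Counts kept for manual verification after an automated pass."""
    kept: int = 0
    dropped: dict = field(default_factory=dict)  # reason -> count


def prepare(raw_records, allowed_labels):
    """Clean raw records and convert them to (text, label) training pairs.

    Assumes each record is a dict with hypothetical "text" and "label" keys.
    """
    report = PrepReport()
    seen = set()
    prepared = []

    def drop(reason):
        report.dropped[reason] = report.dropped.get(reason, 0) + 1

    for record in raw_records:
        # Clean: collapse whitespace and normalize label casing.
        text = " ".join(str(record.get("text", "")).split())
        label = str(record.get("label", "")).strip().lower()

        if not text:
            drop("empty text")
        elif label not in allowed_labels:
            drop("unknown label")
        elif text in seen:
            drop("duplicate")
        else:
            seen.add(text)
            # Convert: every kept record ends up in one fixed format.
            prepared.append((text, label))
            report.kept += 1

    return prepared, report


raw = [
    {"text": "  A  clear example ", "label": "Positive"},
    {"text": "A clear example", "label": "positive"},
    {"text": "", "label": "negative"},
    {"text": "Mislabeled row", "label": "maybe"},
]
data, report = prepare(raw, allowed_labels={"positive", "negative"})
print(data)
print(report)
```

Recording why each record was dropped, rather than silently discarding it, makes it easier to spot a cleaning step that is too aggressive before any training time is spent.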
Defining your rules
The most important part of this manual process is making sure that all your data meets your requirements and a consistent level of quality. To do this, you need to define exactly what “good” data means. Whether...