Data in Software Systems – Text, Images, Code, and Their Annotations
Machine learning (ML) systems are data-hungry applications, and they like their data well prepared for training and inference. Although it may sound obvious, scrutinizing the properties of the data is more important than selecting an algorithm to process it. The data, however, can come in many different formats and from many different sources. We can consider data in its raw format – for example, a text document or an image file. We can also consider data in a format that is specific to the task at hand – for example, tokenized text (where the text is divided into tokens, such as words) or an image with bounding boxes (where objects are identified and enclosed in rectangles).
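To make the difference between raw and task-specific data concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a prescribed implementation: the regular-expression tokenizer is a deliberately simple stand-in for a real tokenizer, and the names tokenize, BoundingBox, and the file name street.jpg are hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass


def tokenize(text: str) -> list[str]:
    """Split raw text into lowercase word and punctuation tokens.

    A simple regex-based stand-in for a real tokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


@dataclass
class BoundingBox:
    """A labeled rectangle enclosing one object in an image (pixel coordinates)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        """Return the area of the rectangle in pixels."""
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


# Raw text versus its task-specific (tokenized) form.
raw_text = "ML systems are data-hungry."
tokens = tokenize(raw_text)

# A raw image is just a file; the annotated form adds bounding boxes.
# The file name and coordinates are made up for illustration.
annotated_image = {
    "file": "street.jpg",
    "boxes": [BoundingBox("car", 10, 20, 110, 80)],
}

print(tokens)
print(annotated_image["boxes"][0].area())
```

The raw text and the image file stay unchanged in both cases; the tokens and the bounding boxes are additional, task-specific representations layered on top of them.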
When considering the end-user system, what we can do with the data and how we handle it become crucial. However, identifying important elements in the data and transforming it into a format that is useful for ML algorithms...