Quantifying and Improving Data Properties
Procuring data in machine learning systems is a long process. So far, we have focused on collecting data from source systems and cleaning noise from it. Noise, however, is not the only problem that we can encounter in data. Missing values and random attributes are examples of data properties that can cause problems in machine learning systems. Even the length of the input data can be problematic if it falls outside the expected range.
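To make these properties concrete, the following minimal sketch quantifies two of them: the share of missing values in a column and the inputs whose length falls outside an expected range. The helper names (missing_ratio, out_of_range_lengths), the sample rows, and the length limits are illustrative assumptions, not part of any specific library or of the datasets used in this book.

```python
import math


def missing_ratio(rows, column):
    """Return the fraction of rows where `column` is absent, None, or NaN."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    if not rows:
        return 0.0
    return sum(is_missing(row.get(column)) for row in rows) / len(rows)


def out_of_range_lengths(texts, min_len, max_len):
    """Return indices of texts whose character length is outside [min_len, max_len]."""
    return [i for i, text in enumerate(texts) if not (min_len <= len(text) <= max_len)]


# Hypothetical sample data: two labels are missing, one text is empty, one is too long.
rows = [
    {"text": "print('hello')", "label": 1.0},
    {"text": "", "label": None},
    {"text": "x = 1", "label": float("nan")},
    {"text": "y" * 50, "label": 0.0},
]

print(missing_ratio(rows, "label"))  # 0.5
print(out_of_range_lengths([r["text"] for r in rows], min_len=1, max_len=20))  # [1, 3]
```

Measuring such properties before training gives a simple way to decide whether rows should be dropped, imputed, or truncated; the thresholds themselves depend on the data and the model.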
In this chapter, we will dive deeper into the properties of data and how to improve them. In contrast to the previous chapter, we will work on feature vectors rather than raw data. Feature vectors are already a transformation of the data, so we can change properties such as noise, or even change how the data is perceived.
We’ll focus on the processing of text, which is an important part of many machine learning algorithms nowadays. We’ll start by understanding how to transform...