Flaws in data used for modeling
Data is one of the core components of machine learning modeling (Figure 1.1). Applications of machine learning across different industries such as healthcare, finance, automotive, retail, and marketing are made possible by getting access to the necessary data for training and testing machine learning models. As the data gets fed into machine learning models for training (that is, identifying optimal model parameters) and testing, flaws in data could result in problems in models, such as low performance in training (for example, high bias), low generalizability (for example high variance), or socioeconomic biases. Here, we will discuss examples of flaws and properties of data that need to be considered when designing a machine learning model.
Data format and structure
There could be issues with how data is stored, read, and moved through different functions and classes in your code or pipeline. You might need to work with structured or tabular data or unstructured data such as videos and text documents. This data could be stored in relational databases such as MySQL or NoSQL (that is, non-relational) databases, data warehouses, and data lakes, or even stored locally in different file formats, such as CSV. Either way, the expected and existing file data structure and formats need to match. For example, if your code is expecting a tab-separated file format but instead the input file of the corresponding function is comma-separated, then all the columns could be lumped together. Luckily, most of the time, these kinds of issues result in errors in the code.
There could also be mismatches in the provided and expected data that wouldn’t cause any errors if the code is not defended against them and not enough information is logged. For example, imagine a scikit-learn fit
function that expects training data with 100 features and at the same time, you have 100 data points. In this case, your code will not return any errors if features are in rows or columns of an input DataFrame. Then, your code needs to check if each row of an input DataFrame contains values of one feature across all data points or the feature values of one data point. The following figure shows how switching features with data points, such as transposing a DataFrame that switches rows with columns, could provide wrong input files but result in no error. In this figure, we have considered four columns and rows for simplicity. Here, F and D are used as abbreviations for feature and data point, respectively:
Figure 1.4 – Simplified example showcasing how the transpose of a DataFrame can be used by mistake in a scikit-learn fit function that expects four features
Data flaws are not restricted to structure and format issues. Some data characteristics need to be considered when you’re trying to build and improve a machine learning model.
Data quantity and quality
Despite machine learning being a more than half-century-old concept, the rise of excitement around machine learning started in 2012. Although there were algorithmic advancements for image classification between 2010 and 2015, it was the availability of 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest and the necessary computing power that played a crucial role in the development of the first high-performance image classification models, such as AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan and Zisserman, 2014).
In addition to data quantity, the quality of the data also plays a very important role. In some applications, such as clinical cancer settings, a high quantity of high-quality data is not accessible. Benefitting from both quantity and quality could also become a tradeoff as we could have access to more data but with lower quality. We can choose to stick to high-quality data or low-quality ones or try to benefit from both high-quality and low-quality data if possible. Selecting the right approach is domain-specific and depends on the data and algorithm used for modeling.
Data biases
Machine learning models can have different kinds of biases, depending on the data we feed them. Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a famous example of machine learning models with reported biases. COMPAS is designed to estimate the likelihood of a defendant to re-offend based on their response to more than 100 survey questions. A summary of the responses to the questions results in a risk score, which includes questions such as whether one of the prisoner’s parents was ever in prison. Although the tool has been successful in many examples, when it has been wrong in terms of prediction, the results for white and black offenders were not the same. The developer company of COMPAS presented data that supports its algorithm’s findings. You can find articles and blog posts to read more about its current status and whether it is still used or still has biases or not.
These were some examples of issues in data and their consequences in the resulting machine learning models. But there are other problems in models that do not originate from data.