The quality of data used in any machine learning problem, supervised or unsupervised, is critical to the performance of the final model, and should be at the forefront when planning any machine learning project. As a simple rule of thumb, if you have clean data, in sufficient quantity, with a good correlation between the input data type and the desired output, then the specifics regarding the type and details of the selected supervised learning model become significantly less important to achieve a good result.
In reality, however, this can rarely be the case. There are usually some issues regarding the quantity of available data, the quality or signal-to-noise ratio in the data, the correlation between the input and output, or some combination of all three factors. As such, we will use this last section of this chapter to consider some of the data quality problems that may occur and some mechanisms for addressing them. Previously, we mentioned that in any machine learning problem, having a thorough understanding of the dataset is critical if we to are construct a high-performing model. This is particularly the case when looking into data quality and attempting to address some of the issues present within the data. Without a comprehensive understanding of the dataset, additional noise or other unintended issues may be introduced during the data cleaning process leading to a further degradation of performance.
As we discussed earlier, the ability of pandas to read data with missing values is both a blessing and a curse and arguably is the most common issue that needs to be managed before we can continue with developing our supervised learning model. The simplest, but not necessarily the most effective, method is to just remove or ignore those entries that are missing data. We can easily do this in pandas using the dropna method of the DataFrame:
There is one very significant consequence of simply dropping rows with missing data and that is we may be throwing away a lot of important information. This is highlighted very clearly in the Titanic dataset as a lot of rows contain missing data. If we were to simply ignore these rows, we would start with a sample size of 1,309 and end with a sample of 183 entries. Developing a reasonable supervised learning model with a little over 10% of the data would be very difficult indeed:
So, with the exception of the early, explorative phase, it is rarely acceptable to simply discard all rows with invalid information. We can be a little more sophisticated about this though. Which rows are actually missing information? Is the missing information problem unique to certain columns or is it consistent throughout all columns of the dataset? We can use aggregate to help us here as well:
Now, this is useful! We can see that the vast majority of missing information is in the Cabin column, some in Age, and a little more in Survived. This is one of the first times in the data cleaning process that we may need to make an educated judgement call.
What do we want to do with the Cabin column? There is so much missing information here that it, in fact, may not be possible to use it in any reasonable way. We could attempt to recover the information by looking at the names, ages, and number of parents/siblings and see whether we can match some families together to provide information, but there would be a lot of uncertainty in this process. We could also simplify the column by using the level of the cabin on the ship rather than the exact cabin number, which may then correlate better with name, age, and social status. This is unfortunate as there could be a good correlation between Cabin and Survived, as perhaps those passengers in the lower decks of the ship may have had a harder time evacuating. We could examine only the rows with valid Cabin values to see whether there is any predictive power in the Cabin entry; but, for now, we will simply disregard Cabin as a reasonable input (or feature).
We can see that the Embarked and Fare columns only have three missing samples between them. If we decided that we needed the Embarked and Fare columns for our model, it would be a reasonable argument to simply drop these rows. We can do this using our indexing techniques, where ~ represents the not operation, or flipping the result (that is, where df.Embarked is not NaN and df.Fare is not NaN):
The missing age values are a little more interesting, as there are too many rows with missing age values to just discard them. But we also have a few more options here, as we can have a little more confidence in some plausible values to fill in. The simplest option would be to simply fill in the missing age values with the mean age for the dataset:
This is OK, but there are probably better ways of filling in the data rather than just giving all 263 people the same value. Remember, we are trying to clean up the data with the goal of maximizing the predictive power of the input features and the survival rate. Giving everyone the same value, while simple, doesn't seem too reasonable. What if we were to look at the average ages of the members of each of the classes (Pclass)? This may give a better estimate, as the average age reduces from class 1 through 3:
What if we consider sex as well as ticket class (social status)? Do the average ages differ here too? Let's find out:
We can see here that males in all ticket classes are typically older. This combination of sex and ticket class provides much more resolution than simply filling in all missing fields with the mean age. To do this, we will use the transform method, which applies a function to the contents of a series or DataFrame and returns another series or DataFrame with the transformed values. This is particularly powerful when combined with the groupby method:
There is a lot in these two lines of code, so let's break them down into components. Let's look at the first line:
We are already familiar with df_valid.groupby(['Pclass', 'Sex'])['Age'], which groups the data by ticket class and sex and returns only the Age column. The lambda x: x.fillna(x.mean()) Lambda function takes the input pandas series, and fills the NaN values with the mean value of the series.
The second line assigns the filled values within mean_ages to the Age column. Note the use of the loc[:, 'Age'] indexing method, which indicates that all rows within the Age column are to be assigned the values contained within mean_ages:
We have described a few different ways of filling in the missing values within the Age column, but by no means has this been an exhaustive discussion. There are many more methods that we could use to fill the missing data: we could apply random values within one standard deviation of the mean for the grouped data, we could also look at grouping the data by sex and the number of parents/children (Parch) or by the number of siblings, or by ticket class, sex, and the number of parents/children. What is most important about the decisions made during this process is the end result of the prediction accuracy. We may need to try different options, rerun our models and consider the effect on the accuracy of final predictions. This is an important aspect of the process of feature engineering, that is, selecting the features or components that provide the model with the most predictive power; you will find that, during this process, you will try a few different features, run the model, look at the end result and repeat, until you are happy with the performance.
The ultimate goal of this supervised learning problem is to predict the survival of passengers on the Titanic given the information we have available. So, that means that the Survived column provides our labels for training. What are we going to do if we are missing 418 of the labels? If this was a project where we had control over the collection of the data and access to its origins, we would obviously correct this by recollecting or asking for the labels to be clarified. With the Titanic dataset, we do not have this ability so we must make another educated judgement call. We could try some unsupervised learning techniques to see whether there are some patterns in the survival information that we could use. However, we may not have a choice of simply ignoring these rows. The task is to predict whether a person survived or perished, not whether they may have survived. By estimating the ground truth labels, we may introduce significant noise into the dataset, reducing our ability to accurately predict survival.
Missing data is not the only problem that may be present within a dataset. Class imbalance – that is, having more of one class or classes compared to another – can be a significant problem, particularly in the case of classification problems (we'll see more on classification in Chapter 4, Classification), where we are trying to predict which class (or classes) a sample is from. Looking at our Survived column, we can see that there are far more people who perished (Survived equals 0) than survived (Survived equals 1) in the dataset:
If we don't take this class imbalance into account, the predictive power of our model could be significantly reduced as, during training, the model would simply need to guess that the person did not survive to be correct 61% (549 / (549 + 342)) of the time. If, in reality, the actual survival rate was, say, 50%, then when being applied to unseen data, our model would predict not survived too often.
There are a few options available for managing class imbalance, one of which, similar to the missing data scenario, is to randomly remove samples from the over-represented class until balance has been achieved. Again, this option is not ideal, or perhaps even appropriate, as it involves ignoring available data. A more constructive example may be to oversample the under-represented class by randomly copying samples from the under-represented class in the dataset to boost the number of samples. While removing data can lead to accuracy issues due to discarding useful information, oversampling the under-represented class can lead to being unable to predict the label of unseen data, also known as overfitting (which we will cover in Chapter 5, Ensemble Modeling).
Adding some random noise to the input features for oversampled data may prevent some degree of overfitting, but this is highly dependent on the dataset itself. As with missing data, it is important to check the effect of any class imbalance corrections on the overall model performance. It is relatively straightforward to copy more data into a DataFrame using the append method, which works in a very similar fashion to lists. If we wanted to copy the first row to the end of the DataFrame, we would do this: