In the first part, we have only one step. In the preceding figure, this is referred to as step 1.1. In this first step, we will do basic data preprocessing. Once we are done with that, we will start with our next part.
Basic data analysis followed by data preprocessing
Let's perform some basic data analysis, which will help us find the statistical properties of the training dataset. This kind of analysis is also called exploratory data analysis (EDA), and it will help us understand how our dataset represents the facts. After deriving some facts, we can use them to guide our feature engineering. So let's explore some important facts!
From this section onward, all the code is part of one iPython notebook. You can refer to the code using this GitHub Link: https://github.com/jalajthanaki/credit-risk-modelling/blob/master/Credit%20Risk%20Analysis.ipynb.
The following are the steps we are going to perform:
Listing statistical properties
Finding the missing values
Replacing missing values
Correlation
Detecting outliers
Listing statistical properties
In this section, we will get an idea about the statistical properties of the training dataset. Using pandas' describe function, we can find out the following basic things:
count
: This will give us an idea about the number of records in our training dataset.
mean
: This value gives us an indication of the mean of each of the data attributes.
std
: This value indicates the standard deviation for each of the data attributes. You can refer to this example: http://www.mathsisfun.com/data/standard-deviation.html.
min
: This value gives us an idea of what the minimum value for each of the data attributes is.
25%
: This value indicates the 25th percentile: 25% of the values of the data attribute fall at or below this value.
50%
: This value indicates the 50th percentile (the median): half of the values of the data attribute fall at or below this value.
75%
: This value indicates the 75th percentile: 75% of the values of the data attribute fall at or below this value.
max
: This value gives us an idea of what the maximum value for each of the data attributes is.
Take a look at the code snippet in the following figure:
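As a rough sketch of what that snippet does (the DataFrame name training_data and the CSV path are assumptions for illustration, not necessarily what the notebook uses), the describe call looks like this:

```python
import pandas as pd

# Load the training dataset; the file path here is only a placeholder.
training_data = pd.read_csv('cs-training.csv')

# describe() lists count, mean, std, min, 25%, 50%, 75%, and max
# for every numeric data attribute.
print(training_data.describe())
```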
We need to find some other statistical properties for our dataset that will help us understand it. So, here, we are going to find the median and mean for each of the data attributes. You can see the code for finding the median in the following figure:
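A minimal sketch of those two calls, reusing the training_data DataFrame assumed above, might look like this:

```python
# Median of each numeric data attribute.
print(training_data.median())

# Mean of each numeric data attribute.
print(training_data.mean())
```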
Now let's check out what kind of data distribution is present in our dataset. We draw the frequency distribution for our target attribute, seriousdlqin2yrs, in order to understand the overall distribution of the target variable for the training dataset. Here, we will use the seaborn visualization library. You can refer to the following code snippet:
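A small sketch of such a plot is given below; the lowercased column name seriousdlqin2yrs follows this chapter's naming convention, and the exact plotting calls in the notebook may differ:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Frequency distribution of the target attribute.
sns.countplot(x='seriousdlqin2yrs', data=training_data)
plt.title('Distribution of the target variable')
plt.show()
```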
You can refer to the visualization chart in the following figure:
From this chart, you can see that there are many records with the target label 0 and far fewer with the target label 1: about 93.32% of the records are labeled 0, whereas only 6.68% are labeled 1. We will use these facts in the upcoming sections. For now, we can consider our target variable to be imbalanced.
Finding the missing values
In order to find the missing values in the dataset, we need to check each and every data attribute. First, we will try to identify which attributes have missing or null values. Once we have found the names of those data attributes, we will replace the missing values with more meaningful values. There are a couple of options available for replacing missing values, and we will explore all of them.
Let's code our first step. Here, we will see which data attributes have missing values, as well as count how many records with missing values there are for each such attribute. You can see the code snippet in the following figure:
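One way to do this with pandas is sketched below (again assuming the training_data DataFrame introduced earlier):

```python
# Count the missing (null) values per data attribute.
missing_counts = training_data.isnull().sum()

# Keep only the attributes that actually contain missing values.
print(missing_counts[missing_counts > 0])
```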
As displayed in the preceding figure, the following two data attributes have missing values:
monthlyincome
: This attribute contains 29,731 records with a missing value.
numberofdependents
: This attribute contains 3,924 records with a missing value.
You can also refer to the code snippet in the following figure for the graphical representation of the facts described so far:
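A minimal version of such a plot can be produced as follows; the chart type is an assumption, since only the figure is referenced here:

```python
import matplotlib.pyplot as plt

# Bar chart of the number of missing values per data attribute.
training_data.isnull().sum().plot(kind='bar')
plt.ylabel('Number of missing values')
plt.show()
```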
You can view the graph itself in the following figure:
In this case, we need to replace these missing values with more meaningful values. There are various standard techniques that we can use for that. Here, we have the following two options: replacing the missing values with the mean of the data attribute, or replacing them with its median.
In the previous section, we already derived the mean and median values for all of our data attributes, and we will use them. Here, our focus will be on the attributes titled monthlyincome and numberofdependents because they have missing values. We have found out which data attributes have missing values, so now it's time to perform the actual replacement operation. In the next section, you will see how we can replace the missing values with the mean or the median.
Replacing missing values
In the previous section, we figured out which data attributes in our training dataset contain missing values. We need to replace the missing values with either the mean or the median value of that particular data attribute. So, in this section, we will focus on how we can perform the actual replacement operation. This operation of replacing missing values is also called imputing the missing data.
Before moving on to the code section, I feel you guys might have questions such as these: should I replace missing values with the mean or the median? Are there any other options available? Let me answer these questions one by one.
The answer to the first question is, practically, trial and error. First, you replace the missing values with the mean and, while training the model, measure whether you get a good result on the training dataset or not. Then, in the second iteration, you replace the missing values with the median and again measure whether you get a good result on the training dataset or not.
In order to answer the second question, there are many different imputation techniques available, such as the deletion of records, replacing the values using the KNN method, replacing the values using the most frequent value, and so on. You can select any of these techniques, but you need to train the model and measure the result. Without implementing a technique, you can't really say with certainty that a particular imputation technique will work for the given training dataset. Here, we are talking in terms of the credit-risk domain, so I will not get into the theory much, but just to refresh your concepts, you can refer to introductory articles on missing-data imputation.
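Just to make those alternatives concrete, here is a small illustrative sketch using scikit-learn's imputers; this is not the approach we follow in this chapter, and the column names are the two attributes with missing values identified earlier:

```python
from sklearn.impute import KNNImputer, SimpleImputer

cols = ['monthlyincome', 'numberofdependents']

# Most-frequent-value imputation.
most_frequent = SimpleImputer(strategy='most_frequent')
filled_most_frequent = most_frequent.fit_transform(training_data[cols])

# KNN-based imputation: each missing value is estimated from the five nearest records.
knn = KNNImputer(n_neighbors=5)
filled_knn = knn.fit_transform(training_data[cols])
```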
We can see the code for replacing the missing values using the attribute's mean value and its median value in the following figure:
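A sketch of the mean-based replacement and its verification is shown below; here we work on a copy of training_data so that the mean- and median-based versions can be compared:

```python
# Replace the missing values of the two affected attributes with their mean.
train_mean_filled = training_data.copy()
for col in ['monthlyincome', 'numberofdependents']:
    train_mean_filled[col] = train_mean_filled[col].fillna(train_mean_filled[col].mean())

# Verify that all the missing values have been replaced.
print(train_mean_filled[['monthlyincome', 'numberofdependents']].isnull().sum())
```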
In the preceding code snippet, we replaced the missing value with the mean value, and in the second step, we verified that all the missing values have been replaced with the mean of that particular data attribute.
In the next code snippet, you can see the code that we have used for replacing the missing values with the median of those data attributes. Refer to the following figure:
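The median-based version is analogous; the sketch below again uses a copy of the DataFrame:

```python
# Replace the missing values of the two affected attributes with their median.
train_median_filled = training_data.copy()
for col in ['monthlyincome', 'numberofdependents']:
    train_median_filled[col] = train_median_filled[col].fillna(train_median_filled[col].median())

# Verify that all the missing values have been replaced.
print(train_median_filled[['monthlyincome', 'numberofdependents']].isnull().sum())
```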
In the preceding code snippet, we replaced the missing values with the median value, and in the second step, we verified that all the missing values have been replaced with the median of that particular data attribute.
In the first iteration, I would like to replace the missing value with the median.
In the next section, we will see one of the important aspects of basic data analysis: finding correlations between data attributes. So, let's get started with correlation.
Correlation
I hope you basically know what correlation indicates in machine learning. The term correlation refers to a mutual relationship or association between quantities. If you want to refresh the concept on this front, you can refer to https://www.investopedia.com/terms/c/correlation.asp.
So, here, we will find out what kind of association is present among the different data attributes. Some attributes are highly dependent on one or many other attributes. Sometimes, values of a particular attribute increase with respect to its dependent attribute, whereas sometimes values of a particular attribute decrease with respect to its dependent attribute. So, correlation indicates the positive as well as negative associations among data attributes. You can refer to the following code snippet for the correlation:
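In pandas, the correlation matrix itself can be obtained with a single call, roughly as follows:

```python
# Pairwise correlation between all numeric data attributes.
correlation_matrix = training_data.corr()
print(correlation_matrix)
```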
You can see the code snippet of the graphical representation of the correlation in the following figure:
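A minimal heatmap sketch with seaborn (the figure size and color map are arbitrary choices, not necessarily the notebook's) looks like this:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of the correlation matrix; annot=True prints the coefficient in each cell.
plt.figure(figsize=(10, 8))
sns.heatmap(training_data.corr(), annot=True, cmap='coolwarm')
plt.show()
```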
You can see the graph of the correlation in the following figure:
Let's look at the preceding graph because it will help you understand correlation in a great way. The following facts can be derived from the graph:
Cells with values close to 1.0 indicate pairs of attributes that are highly associated with each other.
Each attribute has a very high correlation with itself, so all the diagonal values are 1.0.
The data attribute numberoftime3059dayspastduenotworse (refer to the data attribute given on the vertical line or on the y axis) is highly associated with two attributes, numberoftimes90dayslate and numberoftime6089dayspastduenotworse. These two data attributes are given on the x axis (or on the horizontal line).
The data attribute numberoftimes90dayslate is highly associated with numberoftime3059dayspastduenotworse and numberoftime6089dayspastduenotworse. These two data attributes are given on the x axis (or on the horizontal line).
The data attribute numberoftime6089dayspastduenotworse is highly associated with numberoftime3059dayspastduenotworse and numberoftimes90dayslate. These two data attributes are given on the x axis (or on the horizontal line).
The data attribute numberofopencreditlinesandloans also has an association with numberrealestateloansorlines and vice versa. Here, the data attribute numberrealestateloansorlines is present on the x axis (or on the horizontal line).
Before moving ahead, we need to check whether these attributes contain any outliers or insignificant values. If they do, we need to handle these outliers, so our next section is about detecting outliers from our training dataset.
Detecting outliers
In this section, you will learn how to detect outliers as well as how to handle them. There are two steps involved here: first, detecting the outliers, and second, handling them.
First, let's begin with detecting outliers. Now, you guys might wonder why we should detect outliers at all. In order to answer this question, I would like to give you an example. Suppose you have the weights of five-year-old children. You measure the weight of five children and you want to find out their average weight. The children weigh 15, 12, 13, 10, and 35 kg. If you calculate the average of these values, you will see that the answer is 17 kg. If you look at the weight range carefully, you will realize that the last observation is out of the normal range compared to the other observations. Now let's remove the last observation (which has a value of 35) and recalculate the average of the remaining observations. The new average is 12.5 kg. This new value is much more meaningful than the previous average. So, outlier values greatly impact the accuracy of our statistics; hence, it is important to detect them. Once that is done, we will explore techniques to handle them in the upcoming section on handling outliers.
Outliers detection techniques
Here, we are using the following outlier detection techniques:
Percentile-based outlier detection
Median Absolute Deviation (MAD)-based outlier detection
Standard Deviation (STD)-based outlier detection
Majority-vote-based outlier detection
Visualization of outliers
Percentile-based outlier detection
Here, we use percentile-based outlier detection, which is based on simple statistics. We assume that all the data points that lie within the 2.5th to 97.5th percentile range are normal, and that anything outside this range is an outlier. We derived this percentile range by choosing a threshold of 95, which means we keep the central 95% of the data. You can refer to the following code snippet:
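A compact sketch of this technique is given below; the function name and the default threshold of 95 are illustrative, and the function returns a Boolean mask marking the outliers:

```python
import numpy as np

def percentile_based_outlier(data, threshold=95):
    # Keep the central `threshold` percent of the data; for threshold=95 this is the
    # 2.5th-97.5th percentile range, and everything outside it is flagged as an outlier.
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)
```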
We will use this method for each of the data attributes and detect the outliers.
Median Absolute Deviation (MAD)-based outlier detection
MAD is a really simple statistical concept, and there are four steps involved in computing it. The outlier score derived from it is also known as the modified Z-score. The steps are as follows:
Find the median of the particular data attribute.
For each value of the data attribute, subtract the previously found median and take the absolute value of the difference. So, for each data point, you get an absolute deviation from the median.
In the third step, take the median of the absolute deviations derived in the second step. This single value, computed per data attribute, is called the MAD value.
In the fourth step, derive the modified Z-score using the standard equation: divide each absolute deviation from step 2 by the MAD value and multiply the result by the constant 0.6745. Data points whose modified Z-score exceeds a chosen threshold are treated as outliers.
Now it's time to refer to the following code snippet:
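The four steps translate into a short function along these lines; the threshold of 3.5 is the commonly used cut-off for the modified Z-score and is an assumption here:

```python
import numpy as np

def mad_based_outlier(points, thresh=3.5):
    # Step 1: median of the data attribute.
    median = np.median(points)
    # Step 2: absolute deviation of every point from that median.
    diff = np.abs(points - median)
    # Step 3: median of those absolute deviations (the MAD value).
    mad = np.median(diff)
    # Step 4: modified Z-score; points whose score exceeds the threshold are outliers.
    modified_z_score = 0.6745 * diff / mad
    return modified_z_score > thresh
```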
Standard Deviation (STD)-based outlier detection
In this section, we will use the standard deviation and the mean value to find the outliers. Here, we select a threshold value of 3: any data point that lies more than three standard deviations away from the mean is flagged as an outlier. You can refer to the following code snippet:
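A minimal sketch of this detector is as follows:

```python
import numpy as np

def std_based_outlier(data, threshold=3):
    # Flag points that lie more than `threshold` standard deviations from the mean.
    mean, std = np.mean(data), np.std(data)
    return np.abs(data - mean) > threshold * std
```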
Majority-vote-based outlier detection
In this section, we will build the voting mechanism so that we can simultaneously run all the previously defined methods—such as percentile-based outlier detection, MAD-based outlier detection, and STD-based outlier detection—and get to know whether the data point should be considered an outlier or not. We have seen three techniques so far. So, if two techniques indicate that the data should be considered an outlier, then we consider that data point as an outlier; otherwise, we don't. So, the minimum number of votes we need here is two. Refer to the following figure for the code snippet:
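Reusing the three detectors sketched above, the voting mechanism can be written roughly like this:

```python
def outlier_vote(data, min_votes=2):
    # Run all three detectors and flag a point as an outlier
    # when at least `min_votes` of them agree.
    votes = (percentile_based_outlier(data).astype(int)
             + mad_based_outlier(data).astype(int)
             + std_based_outlier(data).astype(int))
    return votes >= min_votes
```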
Visualization of outliers
In this section, we will plot the data attributes to get to know about the outliers visually. Again, we are using the seaborn and matplotlib libraries to visualize the outliers. You can find the code snippet in the following figure:
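A simplified sketch of such a visualization, using plain matplotlib and the detector functions sketched earlier, is shown below; the choice of the monthlyincome attribute and the random seed are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly sample 5,000 values of one attribute and mark the points
# that each technique flags as an outlier.
sample = training_data['monthlyincome'].dropna().sample(5000, random_state=42).values

detectors = [percentile_based_outlier, mad_based_outlier, std_based_outlier, outlier_vote]
titles = ['Percentile-based', 'MAD-based', 'STD-based', 'Majority vote']

fig, axes = plt.subplots(nrows=len(detectors), figsize=(8, 10), sharex=True)
for ax, detect, title in zip(axes, detectors, titles):
    mask = detect(sample)
    ax.plot(np.arange(len(sample)), sample, 'b.', markersize=2, label='data')
    ax.plot(np.where(mask)[0], sample[mask], 'r.', markersize=5, label='outliers')
    ax.set_title(title)
    ax.legend(loc='upper right')
plt.tight_layout()
plt.show()
```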
Refer to the preceding figure for the graph, which shows how each of our defined methods detects the outliers. Here, we chose a sample size of 5,000; this sample was selected randomly.
Here, you can see how all the defined techniques will help us detect outlier data points from a particular data attribute. You can see all the attribute visualization graphs on this GitHub link at https://github.com/jalajthanaki/credit-risk-modelling/blob/master/Credit%20Risk%20Analysis.ipynb.
So far, you have learned how to detect outliers, but now it's time to handle these outlier points. In the next section, we will look at how we can handle outliers.
Handling outliers
In this section, you will learn how to remove or replace outlier data points. This particular step is important because if you just identify the outliers but aren't able to handle them properly, there is a high chance that the model will over-fit at training time. So, let's learn how to handle the outliers for this dataset. Here, I will explain the operation by looking at the data attributes one by one.
Revolving utilization of unsecured lines
In this data attribute, when you plot the outlier detection graph, you will see that values greater than 0.99999 are considered outliers. So, we replace all values greater than 0.99999 with 0.99999. For this data attribute, we therefore perform a replacement operation and generate new values for revolvingutilizationofunsecuredlines.
For the code, you can refer to the following figure:
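A one-line sketch of this replacement (assuming the lowercased column name used throughout this chapter) is:

```python
# Cap the attribute at 0.99999; anything above this value is treated as an outlier.
col = 'revolvingutilizationofunsecuredlines'
training_data[col] = training_data[col].apply(lambda x: 0.99999 if x > 0.99999 else x)
```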
Age
If you explore the data for this attribute and look at the percentile-based outliers, you will see that there is an outlier with a value of 0, while the youngest legitimate age present in the data attribute is 21. So, we replace the value of 0 with 21: we code the condition such that the age must be greater than 21, and if it is not, we replace it with 21. You can refer to the following code and graph.
The following figure shows the frequency distribution of age in the dataset. By looking at the data, we can see that 0 is an outlier value:
Refer to the following box plot, which gives us an indication of the distribution of age:
Before removing the outlier, we got the following outlier detection graph:
The code for replacing the outlier is as follows:
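A minimal sketch of that replacement logic is:

```python
# Keep ages above 21 unchanged; replace any lower value (such as the outlier 0) with 21.
training_data['age'] = training_data['age'].apply(lambda x: x if x > 21 else 21)
```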
In the code, you can see that we have checked each data point of the age column, and if the age is greater than 21, then we haven't applied any changes, but if the age is less than 21, then we have replaced the old value with 21. After that, we put all these revised values into our original dataframe.
Number of time 30-59 days past due not worse
For this data attribute, we explore the data as well as refer to the outlier detection graph. Having done that, we know that the values 96 and 98 are our outliers. We replace these values with the median value. You can refer to the following code and graph to understand this better.
Refer to the outlier detection graph given in the following figure:
Refer to the frequency analysis of the data in the following figure:
The code snippet for replacing the outlier values with the median is given in the following figure:
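A sketch of such a helper is given below; it mirrors the removeSpecificAndPutMedian method used in the notebook, but the exact name and signature here are assumptions:

```python
def remove_specific_and_put_median(series, bad_values=(96, 98)):
    # Compute the median of the attribute while ignoring the outlier values,
    # then replace every occurrence of those values with that median.
    median_value = series[~series.isin(bad_values)].median()
    return series.replace(list(bad_values), median_value)

col = 'numberoftime3059dayspastduenotworse'
training_data[col] = remove_specific_and_put_median(training_data[col])
```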
Debt ratio
If we look at the outlier detection graph for this attribute, it's kind of confusing. Refer to the following figure:
Why? It's confusing because we are not sure which outlier detection method we should consider. So, here, we do some comparative analysis just by counting the number of outliers derived from each of the methods. Refer to the following figure:
The maximum number of outliers was detected by the MAD-based method, so we will consider that method. Here, we will find the minimum upper bound value in order to replace the outlier values. The minimum upper bound is the smallest value among the data points flagged as outliers; every outlier is replaced with this value. Refer to the following code snippet:
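A sketch of this capping step, reusing the MAD-based detector from earlier and assuming the lowercased column name debtratio, is:

```python
# Flag outliers with the MAD-based detector, take the smallest flagged value as the
# "minimum upper bound", and cap every larger value at that bound.
debt = training_data['debtratio']
debt_outliers = mad_based_outlier(debt)
min_upper_bound = debt[debt_outliers].min()
training_data['debtratio'] = debt.apply(lambda x: min_upper_bound if x > min_upper_bound else x)
```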
Monthly income
For this data attribute, we will select the voting-based outlier detection method, as shown in the following figure:
In order to replace the outliers, we will use the same logic that we used for the debt ratio data attribute: we replace the outliers by generating a minimum upper bound value. You can refer to the code given in the following figure:
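A sketch of that logic, reusing the voting-based detector and assuming the attribute is monthlyincome with its missing values already imputed, is:

```python
# Apply the same minimum-upper-bound logic, this time with the voting-based detector.
income = training_data['monthlyincome'].fillna(training_data['monthlyincome'].median())
income_outliers = outlier_vote(income)
income_upper_bound = income[income_outliers].min()
training_data['monthlyincome'] = income.apply(
    lambda x: income_upper_bound if x > income_upper_bound else x)
```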
Number of open credit lines and loans
If you refer to the graph given in the following figure, you will see that there are no highly deviated outlier values present in this column:
So, we will not perform any kind of replacement operation for this data attribute.
Number of times 90 days late
For this attribute, when you analyze the data value frequency, you will immediately see that the values 96 and 98 are outliers. We will replace these values with the median value of the data attribute.
Refer to the frequency analysis code snippet in the following figure:
The outlier replacement code snippet is shown in the following figure:
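This can reuse the median-replacement helper sketched earlier:

```python
# Replace the values 96 and 98 with the median of the remaining data points.
col = 'numberoftimes90dayslate'
training_data[col] = remove_specific_and_put_median(training_data[col])
```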
Number of real estate loans or lines
When we look at the frequency of the values present in this data attribute, we come to know that values greater than 17 occur very rarely. So, here, we replace every value greater than 17 with 17.
You can refer to the code snippet in the following figure:
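A minimal sketch of this capping operation is:

```python
# Values greater than 17 occur very rarely, so cap this attribute at 17.
col = 'numberrealestateloansorlines'
training_data[col] = training_data[col].apply(lambda x: 17 if x > 17 else x)
```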
Number of times 60-89 days past due not worse
For this attribute, when you analyze the data value frequency, you will immediately see that the values 96 and 98 are outliers. We will replace these values with the median value of the data attribute.
Refer to the frequency analysis code snippet in the following figure:
The outlier replacement code snippet is shown in the following figure:
You can refer to the removeSpecificAndPutMedian method code from Figure 1.38.