Machine Learning Solutions: Expert techniques to tackle complex machine learning problems using Python

Chapter 1. Credit Risk Modeling

All the chapters in this book are practical applications. We will develop one application per chapter. First, we will understand the application and choose the proper dataset for developing it. After analyzing the dataset, we will build a baseline approach for the particular application. Later on, we will develop a revised approach that resolves the shortcomings of the baseline approach. Finally, we will see how we can develop the best possible solution using the appropriate optimization strategy for the given application. During this development process, we will learn the necessary key concepts about machine learning techniques. I recommend that you run the code given in this book; that will help you understand the concepts really well.

In this chapter, we will look at one of the many interesting applications of predictive analysis. I have selected the finance domain to begin with, and we are going to build an algorithm that can predict loan defaults. This is one of the most widely used predictive analysis applications in the finance domain. Here, we will look at how to develop an optimal solution for predicting loan defaults. We will cover all of the elements that will help us build this application.

We will cover the following topics in this chapter:

  • Introducing the problem statement

  • Understanding the dataset

    • Understanding attributes of the dataset

    • Data analysis

  • Features engineering for the baseline model

  • Selecting an ML algorithm

  • Training the baseline model

  • Understanding the testing matrix

  • Testing the baseline model

  • Problems with the existing approach

  • How to optimize the existing approach

    • Understanding key concepts to optimize the approach

    • Hyperparameter tuning

  • Implementing the revised approach

    • Testing the revised approach

    • Understanding the problem with the revised approach

  • The best approach

  • Implementing the best approach

  • Summary

Introducing the problem statement


First of all, let's try to understand the application that we want to develop, or the problem that we are trying to solve. Once we understand the problem statement and its use case, it will be much easier for us to develop the application. So let's begin!

Here, we want to help financial companies, such as banks, NBFS, lenders, and so on. We will build an algorithm that can predict to whom financial institutions should give loans or credit. Now, you may ask: what is the significance of this algorithm? Let me explain that in detail. When a financial institution lends money to a customer, it is taking on some kind of risk. So, before lending, financial institutions check whether or not the borrower will have enough money in the future to pay back their loan. Based on the customer's current income and expenditure, many financial institutions perform some kind of analysis that helps them decide whether the borrower will be a good customer for that bank or not. This kind of analysis is manual and time-consuming, so it needs some kind of automation. If we develop an algorithm, it will help financial institutions gauge their customers efficiently and effectively.

Your next question may be: what is the output of our algorithm? Our algorithm will generate a probability. This probability value indicates the chances of the borrower defaulting. Defaulting means the borrower cannot repay their loan within a certain amount of time. Here, the probability indicates the chances of a customer not paying their loan EMI on time, resulting in default. So, a higher probability value indicates that the customer would be a bad or inappropriate borrower (customer) for the financial institution, as they may default in the next 2 years. A lower probability value indicates that the customer will be a good or appropriate borrower (customer) for the financial institution and will not default in the next 2 years.

Here, I have given you information regarding the problem statement and its output, but there is an important aspect of this algorithm: its input. So, let's discuss what our input will be!

Understanding the dataset


Here, we are going to discuss our input dataset in order to develop the application. You can find the dataset at https://github.com/jalajthanaki/credit-risk-modelling/tree/master/data.

Let's discuss the dataset and its attributes in detail. Here, in the dataset, you can find the following files:

  • cs-training.csv

    • Records in this file are used for training, so this is our training dataset.

  • cs-test.csv

    • Records in this file are used for testing our machine learning models, so this is our testing dataset.

  • Data Dictionary.xls

    • This file contains information about each of the attributes of the dataset. So, this file is referred to as our data dictionary.

  • sampleEntry.csv

    • This file gives us an idea about the format in which we need to generate our end output for our testing dataset. If you open this file, then you will see that we need to generate the probability of each of the records present in the testing dataset. This probability value indicates the chances of borrowers defaulting.

Understanding attributes of the dataset

The dataset has 11 attributes, which are shown as follows:

Figure 1.1: Attributes (variables) of the dataset

We will look at each of the attributes one by one and understand their meaning in the context of the application:

  1. SeriousDlqin2yrs:

    • In the dataset, this particular attribute indicates whether the borrower has experienced a delinquency of more than 90 days past due in the previous 2 years.

    • The value of this attribute is Yes if the borrower has experienced past dues of more than 90 days in the previous 2 years. In other words, if an EMI was still unpaid 90 days after its due date, then this flag value is Yes.

    • The value of this attribute is No if the borrower has not experienced past dues of more than 90 days in the previous 2 years. In other words, if every EMI was paid within 90 days of its due date, then this flag value is No.

    • This attribute contains the target labels. In other words, this is the value that we are going to predict using our algorithm for the test dataset.

  2. RevolvingUtilizationOfUnsecuredLines:

    • This attribute indicates the ratio of the borrower's total balance on credit cards and personal credit lines to the sum of their credit limits, excluding real estate and any current installment loan debt.

    • Suppose I have a credit card and its credit limit is $1,000. In my personal bank account, I have $1,000. My credit card balance is $500 out of $1,000.

    • So, the total maximum balance I can have via my credit card and personal bank account is $1,000 + $1,000 = $2,000; I have used $500 from my credit card limit, so the total balance that I have is $500 (credit card balance) + $1,000 (personal bank account balance) = $1,500.

    • If the account holder has taken a home loan or other property loan and is paying EMIs for those loans, then we do not consider the EMI value for the property loan. Here, for this data attribute, we have considered the account holder's credit card balance and personal account balance.

    • So, the RevolvingUtilizationOfUnsecuredLines value is $1,500 / $2,000 = 0.7500.

  3. Age:

    • This attribute is self-explanatory. It indicates the borrower's age.

  4. NumberOfTime30-59DaysPastDueNotWorse:

    • This attribute indicates the number of times the borrower has paid their EMIs late, but within the range of 30 to 59 days after the due date (that is, no worse than 59 days late).

  5. DebtRatio:

    • This is also a self-explanatory attribute, but we will try and understand it better with an example.

    • If my monthly debt is $200 and my other expenditure is $500, then I spend $700 monthly. If my monthly income is $1,000, then the value of the DebtRatio is $700/$1,000 = 0.7000

  6. MonthlyIncome:

    • This attribute contains the value of the monthly income of borrowers.

  7. NumberOfOpenCreditLinesAndLoans:

    • This attribute indicates the number of open loans and/or the number of credit cards the borrower holds.

  8. NumberOfTimes90DaysLate:

    • This attribute indicates how many times a borrower has paid their dues 90 days or more after the due date of their EMIs.

  9. NumberRealEstateLoansOrLines:

    • This attribute indicates the number of loans the borrower holds for their real estate or the number of home loans a borrower has.

  10. NumberOfTime60-89DaysPastDueNotWorse:

    • This attribute indicates how many times the borrower has paid their EMIs late, but within the range of 60 to 89 days after the due date (that is, no worse than 89 days late).

  11. NumberOfDependents:

    • This attribute is self-explanatory as well. It indicates the number of dependent family members the borrower has. The dependent count excludes the borrower.

These are the basic attribute descriptions of the dataset, so you now have an idea of the kind of data we have. Now it's time to get hands-on, so from the next section onward, we will start coding. We will begin by exploring our dataset with basic data analysis so that we can find out the statistical properties of the dataset.

Data analysis

This section is divided into two major parts. You can refer to the following figure to see how we will approach this section:

Figure 1.2: Parts and steps of data analysis

In the first part, we have only one step. In the preceding figure, this is referred to as step 1.1. In this first step, we will do basic data preprocessing. Once we are done with that, we will start with our next part.

The second part has two steps. In the figure, this is referred to as step 2.1. In this step, we will perform basic data analysis using statistical and visualization techniques, which will help us understand the data. By doing this activity, we will get to know some statistical facts about our dataset. After this, we will jump to the next step, which is referred to as step 2.2 in Figure 1.2. In this step, we will once again perform data preprocessing, but, this time, our preprocessing will be heavily based on the findings that we have derived after doing basic data analysis on the given training dataset. You can find the code at this GitHub Link: https://github.com/jalajthanaki/credit-risk-modelling/blob/master/basic_data_analysis.ipynb.

So let's begin!

Data preprocessing

In this section, we will perform a minimal amount of basic preprocessing. We will look at the approaches as well as their implementation.

First change

If you open the cs-training.csv file, you will find that there is a column without a heading, so we will add a heading there. Our heading for that attribute is ID. If you want to drop this column, you can, because it just contains the serial number of the records.

Second change

This change is not a mandatory one. If you want to skip it, you can, but I personally like to perform this kind of preprocessing. The change is related to the headings of the attributes: we remove the "-" from the headers. Apart from this, we convert all the column headings into lowercase. For example, the attribute named NumberOfTime60-89DaysPastDueNotWorse will be converted into numberoftime6089dayspastduenotworse. These kinds of changes will help us when we perform in-depth data analysis, because we won't need to take care of hyphen symbols while processing.

Implementing the changes

Now, you may ask how to perform the changes described. Well, there are two ways. One is the manual approach, in which you open the cs-training.csv file and perform the changes by hand. This approach certainly isn't great, so we will take the second approach: performing the changes using Python code. You can find all the changes in the following code snippets.

Refer to the following screenshot for the code to perform the first change:

Figure 1.3: Code snippet for implementing the renaming or dropping of the index column
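
Since the snippets in this chapter are reproduced as figures, here is a minimal sketch of the first change in pandas. The dataframe name training_data, the file path, and the assumption that pandas labels the heading-less column Unnamed: 0 are mine, not the book's verbatim code:

```python
import pandas as pd

# Load the training data; the first column has no heading and only
# holds the serial number of each record.
training_data = pd.read_csv('./data/cs-training.csv')

# First change: give the heading-less column the name 'ID' ...
training_data = training_data.rename(columns={'Unnamed: 0': 'ID'})

# ... or simply drop it, since it carries no information:
# training_data = training_data.drop(columns=['ID'])
```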

For the second change, you can refer to Figure 1.4:

Figure 1.4: Code snippet for removing "-" from the column heading and converting all the column headings into lowercase
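
Again as a sketch rather than the book's exact code, the second change is one pass over the column headings of the same dataframe:

```python
# Second change: strip '-' from the headers and lowercase them, so that
# 'NumberOfTime60-89DaysPastDueNotWorse' becomes
# 'numberoftime6089dayspastduenotworse'.
training_data.columns = [col.replace('-', '').lower()
                         for col in training_data.columns]
```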

The same kind of preprocessing needs to be done on the cs-test.csv file. This is because the given changes are common for both the training and testing datasets.

You can find the entire code on GitHub by clicking on this link: https://github.com/jalajthanaki/credit-risk-modelling/blob/master/basic_data_analysis.ipynb.

You can also follow along hands-on as you read.

I'm using Python 2.7 as well as a bunch of different Python libraries for the implementation of this code. You can find information related to Python dependencies as well as installation in the README section. Now let's move on to the basic data analysis section.

Basic data analysis followed by data preprocessing

Let's perform some basic data analysis, which will help us find the statistical properties of the training dataset. This kind of analysis is also called exploratory data analysis (EDA), and it will help us understand how our dataset represents the facts. After deriving some facts, we can use them to guide our feature engineering. So let's explore some important facts!

From this section onward, all the code is part of one IPython notebook. You can refer to the code via this GitHub link: https://github.com/jalajthanaki/credit-risk-modelling/blob/master/Credit%20Risk%20Analysis.ipynb.

The following are the steps we are going to perform:

  1. Listing statistical properties

  2. Finding the missing values

  3. Replacing missing values

  4. Correlation

  • Detecting outliers

Listing statistical properties

In this section, we will get an idea about the statistical properties of the training dataset. Using pandas' describe function, we can find out the following basic things:

  • count: This will give us an idea about the number of records in our training dataset.

  • mean: This value gives us an indication of the mean of each of the data attributes.

  • std: This value indicates the standard deviation for each of the data attributes. You can refer to this example: http://www.mathsisfun.com/data/standard-deviation.html.

  • min: This value gives us an idea of what the minimum value for each of the data attributes is.

  • 25%: This value indicates the 25th percentile, that is, the value below which 25% of the records for that attribute fall.

  • 50%: This value indicates the 50th percentile, which is the median of the attribute.

  • 75%: This value indicates the 75th percentile, that is, the value below which 75% of the records for that attribute fall.

  • max: This value gives us an idea of what the maximum value for each of the data attributes is.

Take a look at the code snippet in the following figure:

Figure 1.5: Basic statistical properties using the describe function of pandas
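
The call behind that screenshot is pandas' describe; a sketch, continuing with the training_data dataframe from the preprocessing sketches:

```python
# count, mean, std, min, 25%, 50%, 75%, and max for every attribute.
print(training_data.describe())
```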

We need to find some other statistical properties for our dataset that will help us understand it. So, here, we are going to find the median and mean for each of the data attributes. You can see the code for finding the median in the following figure:

Figure 1.6: Code snippet for generating the median and the mean for each data attribute
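
As a sketch, both statistics are one-liners on the same dataframe:

```python
# Median and mean of each data attribute.
print(training_data.median())
print(training_data.mean())
```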

Now let's check out what kind of data distribution is present in our dataset. We draw the frequency distribution for our target attribute, seriousdlqin2yrs, in order to understand the overall distribution of the target variable for the training dataset. Here, we will use the seaborn visualization library. You can refer to the following code snippet:

Figure 1.7: Code snippet for understanding the target variable distribution as well as the code snippet for the visualization of the distribution
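
A sketch of that distribution check, assuming the lowercased headers from the preprocessing step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Relative frequency of the two target labels.
print(training_data['seriousdlqin2yrs'].value_counts(normalize=True))

# Visualize the (imbalanced) target distribution.
sns.countplot(x='seriousdlqin2yrs', data=training_data)
plt.show()
```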

You can refer to the visualization chart in the following figure:

Figure 1.8: Visualization of the variable distribution of the target data attribute

From this chart, you can see that there are many records with the target label 0 and fewer records with the target label 1. You can see that the data records with a 0 label are about 93.32%, whereas 6.68% of the data records are labeled 1. We will use all of these facts in the upcoming sections. For now, we can consider our outcome variable as imbalanced.

Finding missing values

In order to find the missing values in the dataset, we need to check each and every data attribute. First, we will try to identify which attribute has a missing or null value. Once we have found out the name of the data attribute, we will replace the missing value with a more meaningful value. There are a couple of options available for replacing the missing values. We will explore all of these possibilities.

Let's code our first step. Here, we will see which data attributes have missing values, as well as count how many records with a missing value there are for each of those attributes. You can see the code snippet in the following figure:

Figure 1.9: Code snippet for identifying which data attributes have missing values
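
A sketch of that check:

```python
# Count missing values per attribute; only monthlyincome and
# numberofdependents should report non-zero counts.
print(training_data.isnull().sum())
```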

As displayed in the preceding figure, the following two data attributes have missing values:

  • monthlyincome: This attribute contains 29,731 records with a missing value.

  • numberofdependents: This attribute contains 3,924 records with a missing value.

You can also refer to the code snippet in the following figure for the graphical representation of the facts described so far:

Figure 1.10: Code snippet for generating a graph of missing values

You can view the graph itself in the following figure:

Figure 1.11: A graphical representation of the missing values

In this case, we need to replace these missing values with more meaningful values. There are various standard techniques that we can use for that. We have the following two options:

  • Replace the missing value with the mean value of that particular data attribute

  • Replace the missing value with the median value of that particular data attribute

In the previous section, we already derived the mean and median values for all of our data attributes, and we will use them. Here, our focus will be on the attributes titled monthlyincome and numberofdependents because they have missing values. We have found out which data attributes have missing values, so now it's time to perform the actual replacement operation. In the next section, you will see how we can replace the missing values with the mean or the median.

Replacing missing values

In the previous section, we figured out which data attributes in our training dataset contain missing values. We need to replace the missing values with either the mean or the median value of that particular data attribute. So in this section, we will focus particularly on how we can perform the actual replacement operation. This operation of replacing the missing value is also called imputing the missing data.

Before moving on to the code section, I feel you might have questions such as these: should I replace missing values with the mean or the median? Are there any other options available? Let me answer these questions one by one.

The answer to the first question is, practically speaking, trial and error. You first replace the missing values with the mean value and, during the training of the model, measure whether you get a good result on the training dataset or not. Then, in the second iteration, you replace the values with the median and again measure whether you get a good result on the training dataset or not.

In order to answer the second question: there are many different imputation techniques available, such as the deletion of records, replacing the values using the KNN method, replacing the values using the most frequent value, and so on. You can select any of these techniques, but you need to train the model and measure the result; without implementing a technique, you can't really say with certainty that a particular imputation technique will work for the given training dataset. Here, we are talking in terms of the credit-risk domain, so I will not get into the theory much; if you want to refresh your concepts, any standard reference on missing-data imputation will cover these techniques.

We can see the code for replacing the missing values using the attribute's mean value and its median value in the following figure:

Figure 1.12: Code snippet for replacing the mean values
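
A sketch of the mean-based imputation and its verification; the column list follows from the missing-value counts above:

```python
# Replace missing values with each attribute's mean ...
for col in ['monthlyincome', 'numberofdependents']:
    training_data[col] = training_data[col].fillna(training_data[col].mean())

# ... and verify that no missing values remain (expected output: 0).
print(training_data.isnull().sum().sum())
```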

In the preceding code snippet, we replaced the missing value with the mean value, and in the second step, we verified that all the missing values have been replaced with the mean of that particular data attribute.

In the next code snippet, you can see the code that we have used for replacing the missing values with the median of those data attributes. Refer to the following figure:

Figure 1.13: Code snippet for replacing missing values with the median
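
The median-based variant only swaps the statistic; note that it must run on a freshly loaded dataframe, because the mean imputation above has already removed the missing values:

```python
# Replace missing values with each attribute's median, then verify.
for col in ['monthlyincome', 'numberofdependents']:
    training_data[col] = training_data[col].fillna(training_data[col].median())

print(training_data.isnull().sum().sum())  # expected output: 0
```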

In the preceding code snippet, we replaced the missing values with the median value, and in the second step, we verified that all the missing values have been replaced with the median of that particular data attribute.

In the first iteration, I would like to replace the missing value with the median.

In the next section, we will see one of the important aspects of basic data analysis: finding correlations between data attributes. So, let's get started with correlation.

Correlation

I hope you have a basic idea of what correlation indicates in machine learning. The term correlation refers to a mutual relationship or association between quantities. If you want to refresh the concept on this front, you can refer to https://www.investopedia.com/terms/c/correlation.asp.

So, here, we will find out what kind of association is present among the different data attributes. Some attributes are highly dependent on one or many other attributes. Sometimes, values of a particular attribute increase with respect to its dependent attribute, whereas sometimes values of a particular attribute decrease with respect to its dependent attribute. So, correlation indicates the positive as well as negative associations among data attributes. You can refer to the following code snippet for the correlation:

Figure 1.14: Code snippet for generating correlation

You can see the code snippet of the graphical representation of the correlation in the following figure:

Figure 1.15: Code snippet for generating a graphical representation of the correlation
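
A sketch of both steps together, computing the correlation matrix and rendering it as a heat map:

```python
# Pairwise correlation between all data attributes.
correlation = training_data.corr()

# Render the correlation matrix as a heat map.
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```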

You can see the graph of the correlation in the following figure:

Figure 1.16: Heat map for correlation

Let's look at the preceding graph because it will help you understand correlation in a great way. The following facts can be derived from the graph:

  • Cells with 1.0 values are highly associated with each other.

  • Each attribute has a very high correlation with itself, so all the diagonal values are 1.0.

  • The data attribute numberoftime3059dayspastduenotworse (refer to the data attribute given on the vertical line or on the y axis) is highly associated with two attributes, numberoftimes90dayslate and numberoftime6089dayspastduenotworse. These two data attributes are given on the x axis (or on the horizontal line).

  • The data attribute numberoftimes90dayslate is highly associated with numberoftime3059dayspastduenotworse and numberoftime6089dayspastduenotworse. These two data attributes are given on the x axis (or on the horizontal line).

  • The data attribute numberoftime6089dayspastduenotworse is highly associated with numberoftime3059dayspastduenotworse and numberoftimes90dayslate. These two data attributes are given on the x axis (or on the horizontal line).

  • The data attribute numberofopencreditlinesandloans also has an association with numberrealestateloansorlines and vice versa. Here, the data attribute numberrealestateloansorlines is present on the x axis (or on the horizontal line).

Before moving ahead, we need to check whether these attributes contain any outliers or insignificant values. If they do, we need to handle these outliers, so our next section is about detecting outliers from our training dataset.

Detecting outliers

In this section, you will learn how to detect outliers as well as how to handle them. There are two steps involved in this section:

  • Outlier detection techniques

  • Handling outliers

First, let's begin with detecting outliers. Now, you might wonder why we should detect outliers at all. In order to answer this question, I would like to give you an example. Suppose you have the weights of 5-year-old children. You measure the weight of five children and want to find out the average weight. The children weigh 15, 12, 13, 10, and 35 kg. If you calculate the average of these values, the answer is 17 kg. If you look at the weight range carefully, you will realize that the last observation is out of the normal range compared to the other observations. Now let's remove the last observation (which has a value of 35) and recalculate the average of the others: the new average is 12.5 kg, which is much more meaningful than the previous value. So, outlier values greatly impact accuracy; hence, it is important to detect them. Once that is done, we will explore techniques for handling them in the upcoming Handling outliers section.

Outlier detection techniques

Here, we are using the following outlier detection techniques:

  • Percentile-based outlier detection

  • Median Absolute Deviation (MAD)-based outlier detection

  • Standard Deviation (STD)-based outlier detection

  • Majority-vote-based outlier detection

  • Visualization of outliers

Percentile-based outlier detection

Here, we have used percentile-based outlier detection, which is derived from basic statistics. We assume that we should keep all the data points that lie within the percentile range from 2.5 to 97.5, and treat everything outside that range as an outlier. We derived this range by deciding on a threshold of 95: (100 - 95) / 2 = 2.5 on each tail. You can refer to the following code snippet:

Figure 1.17: Code snippet for percentile-based outlier detection
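
As the snippet is only reproduced as a figure, here is a sketch of the idea; the function name and exact form are assumptions of mine:

```python
import numpy as np

def percentile_based_outlier(data, threshold=95):
    # With threshold=95, everything outside the 2.5th to 97.5th
    # percentile range is flagged as an outlier.
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)
```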

We will use this method for each of the data attributes and detect the outliers.

Median Absolute Deviation (MAD)-based outlier detection

MAD is a really simple statistical concept. There are four steps involved in it, and the resulting score is also known as the modified Z-score. The steps are as follows:

  1. Find the median of the particular data attribute.

  2. For each of the given values of the data attribute, subtract the previously found median value and take the absolute value of the difference. So, for each data point, you will get an absolute deviation.

  3. In the third step, generate the median of the absolute deviations that we derived in the second step. This value is called the MAD value.

  4. In the fourth step, we will use the following equation to derive the modified Z-score (the constant 0.6745 makes the MAD comparable to a standard deviation):

    Modified Z-score = 0.6745 × (x - median) / MAD

Now it's time to refer to the following code snippet:

Figure 1.18: Code snippet for MAD-based outlier detection
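
A sketch of those four steps; the cut-off of 3.5 is the conventional choice for the modified Z-score and is my assumption here:

```python
def mad_based_outlier(points, threshold=3.5):
    median = np.median(points)                   # step 1: the median
    abs_diff = np.abs(points - median)           # step 2: absolute deviations
    mad = np.median(abs_diff)                    # step 3: the MAD value
    modified_z_score = 0.6745 * abs_diff / mad   # step 4: modified Z-score
    return modified_z_score > threshold
```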

Standard Deviation (STD)-based outlier detection

In this section, we will use the standard deviation and the mean value to find the outliers. Here, we select a threshold value of 3 (chosen somewhat arbitrarily). You can refer to the following code snippet:

Figure 1.19: Standard Deviation (STD) based outlier detection code
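
A sketch, with the same caveat that the name and form are assumptions:

```python
def std_based_outlier(data, threshold=3):
    # Flag every point further than `threshold` standard deviations
    # from the mean.
    return np.abs(data - np.mean(data)) > threshold * np.std(data)
```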

Majority-vote-based outlier detection

In this section, we will build a voting mechanism so that we can simultaneously run all the previously defined methods, such as percentile-based outlier detection, MAD-based outlier detection, and STD-based outlier detection, and decide whether a data point should be considered an outlier or not. We have seen three techniques so far, so if at least two of them indicate that a data point should be considered an outlier, then we consider it an outlier; otherwise, we don't. In other words, the minimum number of votes we need here is two. Refer to the following figure for the code snippet:

Figure 1.20: Code snippet for the voting mechanism for outlier detection
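
A sketch of the voting mechanism built on the three detectors above:

```python
def outlier_vote(data):
    # A data point is an outlier when at least two of the three
    # detectors agree on it.
    votes = (percentile_based_outlier(data).astype(int)
             + mad_based_outlier(data).astype(int)
             + std_based_outlier(data).astype(int))
    return votes >= 2
```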

Visualization of outliers

In this section, we will plot the data attributes to inspect the outliers visually. Again, we are using the seaborn and matplotlib libraries to visualize the outliers. You can find the code snippet in the following figure:

Figure 1.21: Code snippet for the visualization of the outliers
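
One way to sketch such a comparison plot, one panel per detection method, continuing with the imports from the earlier sketches; the attribute chosen in the final call is just an illustration:

```python
def plot_outliers(sample):
    detectors = [percentile_based_outlier, mad_based_outlier,
                 std_based_outlier, outlier_vote]
    fig, axes = plt.subplots(nrows=len(detectors), figsize=(8, 10))
    for ax, detect in zip(axes, detectors):
        sns.kdeplot(sample, ax=ax)  # distribution of the attribute
        outliers = sample[detect(sample)]
        # Mark the detected outliers along the x axis.
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)
        ax.set_title(detect.__name__)
    plt.tight_layout()
    plt.show()

# Randomly sample 5,000 values of one attribute and plot them.
plot_outliers(training_data['monthlyincome'].sample(5000).values)
```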

Refer to the following figure for the graph, and see how each of our defined methods detects the outliers. Here, we chose a sample size of 5,000; the sample was selected randomly.

Figure 1.22: Graph for outlier detection

Here, you can see how all the defined techniques will help us detect outlier data points from a particular data attribute. You can see all the attribute visualization graphs on this GitHub link at https://github.com/jalajthanaki/credit-risk-modelling/blob/master/Credit%20Risk%20Analysis.ipynb.

So far, you have learned how to detect outliers, but now it's time to handle these outlier points. In the next section, we will look at how we can handle outliers.

Handling outliers

In this section, you will learn how to remove or replace outlier data points. This particular step is important because if you just identify the outliers but aren't able to handle them properly, then there is a high chance that we will overfit the model at training time. So, let's learn how to handle the outliers for this dataset. Here, I will explain the operation by looking at the data attributes one by one.

Revolving utilization of unsecured lines

In this data attribute, when you plot the outlier detection graph, you will come to know that values of more than 0.99999 are considered outliers. So, values greater than 0.99999 are replaced with 0.99999. With this replacement operation, we have generated new values for the data attribute revolvingutilizationofunsecuredlines.

For the code, you can refer to the following figure:

Figure 1.23: Code snippet for replacing outlier values with 0.99999
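
A sketch of that replacement:

```python
# Cap the attribute at 0.99999; every value above it is replaced.
col = 'revolvingutilizationofunsecuredlines'
training_data[col] = training_data[col].clip(upper=0.99999)
```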

Age

In this attribute, if you explore the data and look at the percentile-based outliers, you will see that there is an outlier with a value of 0, while the youngest age present in the data attribute is 21. So, we replace the value of 0 with 21. We code the condition such that the age should be at least 21; if it is not, then we replace the age with 21. You can refer to the following code and graphs.

The following figure shows how the frequency distribution of age is given in the dataset. By looking at the data, we can derive the fact that 0 is the outlier value:

Figure 1.24: Frequency for each data value shows that 0 is an outlier

Refer to the following box graph, which shows the distribution of the age attribute:

Figure 1.25: Box graph for the age data attribute

Before removing the outlier, we got the following outlier detection graph:

Figure 1.26: Graphical representation of detecting outliers for data attribute age

The code for replacing the outlier is as follows:

Figure 1.27: Replace the outlier with the minimum age value 21

In the code, you can see that we check each data point of the age column: if the age is greater than 21, then we don't apply any changes, but if the age is less than 21, then we replace the old value with 21. After that, we put all these revised values back into our original dataframe.
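
A sketch of that logic:

```python
# Replace every age below 21 (including the outlier value 0) with 21,
# then put the revised values back into the original dataframe.
training_data['age'] = training_data['age'].apply(
    lambda age: age if age > 21 else 21)
```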

Number of time 30-59 days past due not worse

In this data attribute, we explore the data as well as refer to the outlier detection graph. Having done that, we know that the values 96 and 98 are our outliers. We replace these values with the median value. You can refer to the following code and graphs to understand this better.

Refer to the outlier detection graph given in the following figure:

Figure 1.28: Outlier detection graph

Refer to the frequency analysis of the data in the following figure:

Figure 1.29: Outlier values from the frequency calculation

The code snippet for replacing the outlier values with the median is given in the following figure:

Figure 1.30: Code snippet for replacing outliers
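
The book's helper for this kind of replacement is called removeSpecificAndPutMedian (see Figure 1.38); the following is an assumed reimplementation of that idea, not the verbatim code:

```python
def remove_specific_and_put_median(series, bad_values=(96, 98)):
    # Compute the median over the non-outlier values only, then use it
    # to replace the specific outlier codes.
    median = series[~series.isin(bad_values)].median()
    return series.replace(list(bad_values), median)

col = 'numberoftime3059dayspastduenotworse'
training_data[col] = remove_specific_and_put_median(training_data[col])
```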

Debt ratio

If we look at the graph of the outlier detection of this attribute, then it's kind of confusing. Refer to the following figure:

Figure 1.31: Graph of outlier detection for the debt ratio column

Why? It's confusing because we are not sure which outlier detection method we should consider. So, here, we do some comparative analysis just by counting the number of outliers derived from each of the methods. Refer to the following figure:

Figure 1.32: Comparison of various outlier detection techniques

The maximum number of outliers was detected by the MAD-based method, so we will consider that method. Here, we will find the minimum upper bound value in order to replace the outlier values. The minimum upper bound is the smallest value among the detected outliers; every value at or above it is replaced with this bound. Refer to the following code snippet:

Figure 1.33: The code for the minimum upper bound
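
A sketch of that replacement, assuming the MAD-based detector defined earlier; restricting to values above the median ensures that only the high-side outliers define the bound:

```python
def replace_with_min_upper_bound(series, detect=mad_based_outlier):
    values = series.values
    outliers = detect(values)
    # The smallest value among the detected high-side outliers.
    min_upper_bound = values[outliers & (values > np.median(values))].min()
    return series.clip(upper=min_upper_bound)

training_data['debtratio'] = replace_with_min_upper_bound(
    training_data['debtratio'])
```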

Monthly income

For this data attribute, we will select the voting-based outlier detection method, as shown in the following figure:

Figure 1.34: Outlier detection graph

In order to replace the outliers, we will use the same logic that we used for the debt ratio data attribute: we replace the outliers by generating a minimum upper bound value. You can refer to the code given in the following figure:

Figure 1.35: Replace the outlier value with the minimum upper bound value
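
Under the same assumptions, this is the earlier helper driven by the voting-based detector:

```python
# Replace monthly income outliers with the minimum upper bound value.
training_data['monthlyincome'] = replace_with_min_upper_bound(
    training_data['monthlyincome'], detect=outlier_vote)
```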

Number of open credit lines and loans

If you refer to the graph given in the following figure, you will see that there are no highly deviated outlier values present in this column:

Figure 1.36: Outlier detection graph

So, we will not perform any kind of replacement operation for this data attribute.

Number of times 90 days late

For this attribute, when you analyze the data value frequency, you will immediately see that the values 96 and 98 are outliers. We will replace these values with the median value of the data attribute.

Refer to the frequency analysis code snippet in the following figure:

Figure 1.37: Frequency analysis of the data points

The outlier replacement code snippet is shown in the following figure:

Figure 1.38: Outlier replacement using the median value
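
As a sketch, this reuses the remove-specific-and-put-median helper from earlier:

```python
col = 'numberoftimes90dayslate'
training_data[col] = remove_specific_and_put_median(training_data[col])
```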

Number of real estate loans or lines

When we look at the frequency of the values present in this data attribute, we can see that values beyond 17 occur very rarely. So, here, we replace every value greater than 17 with 17.

You can refer to the code snippet in the following figure:

Figure 1.39: Code snippet for replacing outliers
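
A sketch of that cap:

```python
# Cap the number of real estate loans or lines at 17.
col = 'numberrealestateloansorlines'
training_data[col] = training_data[col].clip(upper=17)
```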

Number of times 60-89 days past due not worse

For this attribute, when you analyze the data value frequency, you will immediately see that the values 96 and 98 are outliers. We will replace these values with the median value of the data attribute.

Refer to the frequency analysis code snippet in the following figure:

Figure 1.40: Frequency analysis of the data

The outlier replacement code snippet is shown in the following figure:

Figure 1.41: Code snippet for replacing outliers using the median value

You can refer to the removeSpecificAndPutMedian method code from Figure 1.38.

Number of dependents

For this attribute, when you see the frequency value of the data points, you will immediately see that data values greater than 10 are outliers. We replace values greater than 10 with 10.

Refer to the code snippet in the following figure:

Figure 1.42: Code snippet for replacing outlier values
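
The same clip pattern, as a sketch:

```python
# Cap the number of dependents at 10.
training_data['numberofdependents'] = (
    training_data['numberofdependents'].clip(upper=10))
```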

This is the end of the outlier section. Here, we've replaced the values of the outlier data points with more meaningful ones. We have also reached the end of our basic data analysis section. This analysis has given us a good understanding of the dataset and its values. The next section is all about feature engineering. So, we will start with the basics first, and later on in this chapter, you will see how feature engineering impacts the accuracy of the algorithm in a positive manner.

Key benefits

  • Master the advanced concepts, methodologies, and use cases of machine learning
  • Build ML applications for analytics, NLP, and computer vision domains
  • Solve the most common problems in building machine learning models

Description

Machine learning (ML) helps you find hidden insights from your data without the need for explicit programming. This book is your key to solving any kind of ML problem you might come across in your job. You'll encounter a set of simple to complex problems while building ML models, and you'll not only resolve these problems, but you'll also learn how to build projects based on each problem, with a practical approach and easy-to-follow examples. The book includes a wide range of applications: from analytics and NLP, to computer vision domains. Some of the applications you will be working on include stock price prediction, a recommendation engine, building a chatbot, a facial expression recognition system, and many more. The problem examples we cover include identifying the right algorithm for your dataset and use cases, creating and labeling datasets, getting enough clean data to carry out processing, identifying outliers, overfitting datasets, hyperparameter tuning, and more. Here, you'll also learn to make more timely and accurate predictions. In addition, you'll deal with more advanced use cases, such as building a gaming bot and building an extractive summarization tool for medical documents, and you'll tackle the problems faced while building an ML model. By the end of this book, you'll be able to fine-tune your models as per your needs to deliver maximum productivity.

Who is this book for?

This book is for intermediate users, such as machine learning engineers, data engineers, and data scientists, who want to solve simple to complex machine learning problems in their day-to-day work and build powerful and efficient machine learning models. A basic understanding of machine learning concepts and some experience with Python programming is all you need to get started with this book.

What you will learn

  • Select the right algorithm to derive the best solution in ML domains
  • Perform predictive analysis efficiently using ML algorithms
  • Predict stock prices using the stock index value
  • Perform customer analytics for an e-commerce platform
  • Build recommendation engines for various domains
  • Build NLP applications for the health domain
  • Build language generation applications using different NLP techniques
  • Build computer vision applications such as facial emotion recognition

Product Details

Publication date : Apr 27, 2018
Length : 566 pages
Edition : 1st
Language : English
ISBN-13 : 9781788390040

Table of Contents

11 Chapters

  1. Credit Risk Modeling
  2. Stock Market Price Prediction
  3. Customer Analytics
  4. Recommendation Systems for E-Commerce
  5. Sentiment Analysis
  6. Job Recommendation Engine
  7. Text Summarization
  8. Developing Chatbots
  9. Building a Real-Time Object Recognition App
  10. Face Recognition and Face Emotion Recognition
  11. Building Gaming Bot

