Data Cleaning
Data cleaning includes processes such as filling in missing values and handling inconsistencies. It also involves detecting corrupt data and replacing or modifying it.
Missing Values
Understanding missing values is essential if you want to successfully manage and make sense of data. Let's take a look at the following figure:
Figure 1.14: Bank customer credit data
As you can see, the data belongs to a bank; each row represents a separate customer and each column contains their details, such as age and credit amount. Some cells either contain NA or are simply empty. This is missing data. Each piece of information about a customer is crucial for the bank; if any of it is missing, the bank will find it difficult to predict the risk of providing a loan to that customer.
Handling Missing Data
Intelligent handling of missing data will result in building a robust model capable of handling complex tasks. There are many ways to handle missing data. Let's now look at some of those ways.
Removing the Data
Checking for missing values is the first and most important step in data pre-processing, because a model cannot accept data with missing values. Deletion is a very simple and commonly used method of handling them: we delete a row if it contains missing values, or we delete an entire column if more than 70%-75% of its data is missing. Again, this threshold value is not fixed and depends on how much data you can afford to lose.
The benefit of this approach is that it is quick and easy to do, and in many cases no data is better than bad data. The drawback is that you may end up losing important information, because you're deleting a whole feature based on a few missing values.
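To make the column-removal rule concrete, here is a minimal sketch using a hypothetical toy DataFrame (not one of this chapter's datasets): it drops any column with more than 70% missing values and then removes the remaining rows that contain missing values:
import pandas as pd
import numpy as np
#toy DataFrame; the income column is 75% missing
toy = pd.DataFrame({'age': [25, np.nan, 40, 31],
                    'income': [np.nan, np.nan, np.nan, 52000]})
#keep only columns whose fraction of missing values is at most 70%
toy = toy.loc[:, toy.isna().mean() <= 0.70]
#drop any remaining rows that still contain missing values
toy = toy.dropna()
print(toy)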
Exercise 2: Removing Missing Data
In this exercise, we will be loading the Banking_Marketing.csv dataset into the pandas DataFrame and handling the missing data. This dataset is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns involved phone calls to clients to try and get them to subscribe to a particular product. The dataset contains the details of each client contacted, and whether they subscribed to the product. Follow these steps to complete this exercise:
Note
The Banking_Marketing.csv dataset can be found at this location: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv.
- Open a Jupyter notebook. Insert a new cell and add the following code to import pandas and fetch the Banking_Marketing.csv dataset:
import pandas as pd
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Banking_Marketing.csv'
#reading the data into a DataFrame object named df
df = pd.read_csv(dataset, header=0)
- Once you have fetched the dataset, print the datatype of each column. To do so, use the dtypes attribute from the pandas DataFrame:
df.dtypes
The preceding code generates the following output:
Figure 1.15: Data types of each feature
- Now we need to find the missing values for each column. In order to do that, we use the isna() function provided by pandas:
df.isna().sum()
The preceding code generates the following output:
Figure 1.16: Missing values of each column in the dataset
In the preceding figure, we can see that there is data missing from three columns, namely age, contact, and duration. There are two NAs in the age column, six NAs in contact, and seven NAs in duration.
- Once you have identified all the missing values, remove the rows that contain them from the DataFrame. To do so, we use the dropna() function:
#removing rows with null values
df = df.dropna()
- To check whether any missing values are still present, use the isna() function:
df.isna().sum()
The preceding code generates the following output:
Figure 1.17: Each column of the dataset with zero missing values
You have successfully removed all missing data from the DataFrame. In the next section, we'll look at the second method of dealing with missing data, which uses imputation.
Mean/Median/Mode Imputation
In the case of numerical data, we can compute the mean or median and use the result to replace missing values. In the case of categorical (non-numerical) data, we can compute the mode and use it to replace missing values. This is known as imputation.
The benefit of using imputation, rather than just removing data, is that it prevents data loss. The drawback is that you don't know how accurate using the mean, median, or mode is going to be in a given situation.
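Besides the fillna() approach used in the next exercise, imputation is also available in scikit-learn. Here is a minimal sketch, assuming scikit-learn is installed and using a small made-up array rather than our dataset:
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[25.0], [np.nan], [40.0], [31.0]])
#strategy can also be 'median' or 'most_frequent' (mode)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed.ravel())  #[25. 32. 40. 31.]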
Let's look at an exercise in which we will use the imputation method to solve missing data problems.
Exercise 3: Imputing Missing Data
In this exercise, we will be loading the Banking_Marketing.csv dataset into the pandas DataFrame and handling the missing data using the imputation method. Follow these steps to complete this exercise:
Note
The Banking_Marketing.csv dataset can be found at this location: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv.
- Open a Jupyter notebook and add a new cell. Load the dataset into the pandas DataFrame. Add the following code to do this:
import pandas as pd
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Banking_Marketing.csv'
df = pd.read_csv(dataset, header=0)
- Impute the numerical data of the age column with its mean. To do so, first find the mean of the age column using the mean() function of pandas, and then print it:
mean_age = df.age.mean()
print(mean_age)
The preceding code generates the following output:
Figure 1.18: Mean of the age column
- Once this is done, impute the missing data with its mean using the fillna() function. This can be done with the following code:
df.age.fillna(mean_age, inplace=True)
- Now we impute the numerical data of the duration column with its median. To do so, first find the median of the duration column using the median() function of pandas. Add the following code to do so:
median_duration = df.duration.median()
print(median_duration)
Figure 1.19: Median of the duration
- Impute the missing data of the duration column with its median using the fillna() function. Add the following code to do this:
df.duration.fillna(median_duration, inplace=True)
- Impute the categorical data of the contact column with its mode. To do so, first, find the mode of the contact column using the mode() function of pandas. Add the following code to do this:
mode_contact = df.contact.mode()[0]
print(mode_contact)
Figure 1.20: Mode of the contact
- Impute the missing data of the contact column with its mode using the fillna() function. Add the following code to do this:
df.contact.fillna(mode_contact,inplace=True)
Unlike the mean and median, a column may have more than one mode. So, we just take the first mode, at index 0.
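To see this in action, here is a small illustration with a toy Series (not from our dataset) that has two modes:
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'b', 'c'])
print(s.mode())    #returns a Series with both modes: 'a' and 'b'
print(s.mode()[0]) #'a', the first mode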
You have successfully imputed the missing data in different ways and made the data complete and clean.
Another part of data cleaning is dealing with outliers, which will be discussed in the next section.
Outliers
Outliers are values that are very large or very small with respect to the distribution of the other data. We can only find outliers in numerical data. Box plots are one good way to find the outliers in a dataset, as you can see in the following figure:
Figure 1.21: Sample of outliers in a box plot
Note
An outlier is not always bad data! With the help of business understanding and client interaction, you can discern whether to remove or retain the outlier.
Let's learn how to find outliers using a simple example. Consider a sample dataset of temperatures from a place at different times:
71, 70, 90, 70, 70, 60, 70, 72, 72, 320, 71, 69
We can now do the following:
- First, we'll sort the data:
60, 69, 70, 70, 70, 70, 71, 71, 72, 72, 90, 320
- Next, we'll calculate the median (Q2). The median is the middle value of the data after sorting.
Here, the middle terms are 70 and 71 after sorting the list.
The median is (70 + 71) / 2 = 70.5
- Then we'll calculate the lower quartile (Q1). Q1 is the middle value (median) of the first half of the dataset.
First half of the data = 60, 69, 70, 70, 70, 70
Points 3 and 4 of the bottom 6 are both equal to 70.
The average is (70 + 70) / 2 = 70
Q1 = 70
- Then we calculate the upper quartile (Q3).
Q3 is the middle value (median) of the second half of the dataset.
Second half of the data = 71, 71, 72, 72, 90, 320
Points 3 and 4 of the upper 6 are 72 and 72.
The average is (72 + 72) / 2 = 72
Q3 = 72
- Then we find the interquartile range (IQR).
IQR = Q3 – Q1 = 72 – 70
IQR = 2
- Next, we find the upper and lower fences.
Lower fence = Q1 – 1.5 (IQR) = 70 – 1.5(2) = 67
Upper fence = Q3 + 1.5 (IQR) = 72 + 1.5(2) = 75
Boundaries of our fences = 67 and 75
Any data point lower than the lower fence or greater than the upper fence is an outlier. Thus, the outliers in our example are 60, 90, and 320.
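Here is a minimal sketch that reproduces this calculation with pandas; the quantile() function used here is the same one we will use in Exercise 4, and its default linear interpolation happens to agree with the median-of-halves method for this data:
import pandas as pd
temps = pd.Series([71, 70, 90, 70, 70, 60, 70, 72, 72, 320, 71, 69])
Q1 = temps.quantile(0.25)     #70.0
Q3 = temps.quantile(0.75)     #72.0
IQR = Q3 - Q1                 #2.0
lower_fence = Q1 - 1.5 * IQR  #67.0
upper_fence = Q3 + 1.5 * IQR  #75.0
outliers = temps[(temps < lower_fence) | (temps > upper_fence)]
print(outliers.tolist())      #[90, 60, 320]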
Exercise 4: Finding and Removing Outliers in Data
In this exercise, we will be loading the german_credit_data.csv dataset into the pandas DataFrame and removing the outliers. The dataset contains 1,000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who has taken credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes. Follow these steps to complete this exercise:
Note
The link to the german_credit_data.csv dataset can be found here: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/german_credit_data.csv.
- Open a Jupyter notebook and add a new cell. Write the following code to import the necessary libraries: pandas, NumPy, matplotlib, and seaborn. Fetch the dataset and load it into the pandas DataFrame. Add the following code to do this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sbn
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/german_credit_data.csv'
#reading the data into a DataFrame object named df
df = pd.read_csv(dataset, header=0)
In the preceding code, %matplotlib inline is a magic function that is essential if we want the plot to be visible in the notebook.
- This dataset contains an Age column. Let's plot a boxplot of the Age column. To do so, use the boxplot() function from the seaborn library:
sbn.boxplot(df['Age'])
The preceding code generates the following output:
Figure 1.22: A box plot of the Age column
We can see that some data points are outliers in the boxplot.
- The boxplot uses the IQR method to display the data and the outliers (the shape of the data). But in order to print the outliers, we use a mathematical formula to retrieve them. Add the following code to find the outliers of the Age column using the IQR method:
Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
>>> 15.0
In the preceding code, Q1 is the first quartile and Q3 is the third quartile.
- Now we find the lower fence and the upper fence, and print them both. Add the following code to do this:
Lower_Fence = Q1 - (1.5 * IQR)
Upper_Fence = Q3 + (1.5 * IQR)
print(Lower_Fence)
print(Upper_Fence)
>>> 4.5
>>> 64.5
- To print all the data above the upper fence and below the lower fence, add the following code:
df[((df["Age"] < Lower_Fence) |(df["Age"] > Upper_Fence))]
The preceding code generates the following output:
Figure 1.23: Outlier data based on the Age column
- Filter out the outlier data and print only the potential data. To do so, just negate the preceding result using the ~ operator:
df = df[~((df ["Age"] < Lower_Fence) |(df["Age"] > Upper_Fence))]
df
The preceding code generates the following output:
Figure 1.24: Potential data based on the Age column
You have successfully found the outliers using the IQR. In the next section, we will explore another method of pre-processing called data integration.