Data imputation
Missing data is ubiquitous, and data imputation techniques help us alleviate its influence.
In this section, we are going to use the heart disease dataset to examine the pros and cons of basic data imputation. I recommend reading the dataset description beforehand to understand the meaning of each column.
Preparing the dataset for imputation
The heart disease dataset is the same one we used earlier in the Collecting data from various data sources section. It should raise a red flag: don't take data integrity for granted. The following screenshot shows missing data denoted by question marks:
First, let's do an info() call that lists column data type information:
df.info()
Note
df.info() is a very helpful function that provides you with pointers for your next move. It should be the first function call when given an unknown dataset.
The following screenshot shows the output obtained from the preceding function:
If pandas can't infer the data type of a column, it will interpret it as object. For example, the chol (cholesterol) column contains missing data: the missing entries are question marks treated as strings, while the remaining entries are of the float type, so the column as a whole is typed as object.
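If you want to see exactly which raw Python types are mixed inside such an object column, you can map every entry to its type. This is a minimal sketch, not part of the original walkthrough:

df["chol"].map(type).value_counts()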
Python's type tolerance
As Python is quite tolerant of mixed types, it is good practice to introduce explicit type checks. For example, if a column mixes numerical values with strings, don't test values for truthiness; explicitly check the type and write two branches, as sketched below. It is also advisable to be careful with type conversion on columns whose data type is object. Remember to keep your code deterministic and future-proof.
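As a small illustration of this advice, the following hypothetical helper (not used elsewhere in this section) branches on the type explicitly instead of relying on truthiness, because 0 and 0.0 are falsy yet perfectly valid measurements:

def looks_like_missing(val):
    # Branch on the type explicitly rather than on truthiness.
    if isinstance(val, str):
        return val.strip() == "?"   # string branch: only "?" marks a missing record
    return False                    # numeric branch: 0 or 0.0 are valid values, not missing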
Now, let's replace the question marks with NaN values. The following code snippet declares a function that handles three different cases and treats each appropriately. The three cases are listed here:
- The record value is "?".
- The record value is of the integer type. This is treated independently because columns such as num should be binary; floating-point numbers would lose the essence of 0-1 encoding.
- The rest includes valid strings that can be converted to float numbers, as well as original float numbers.
The code snippet will be as follows:
import numpy as np

def replace_question_mark(val):
    if val == "?":              # the question mark marks a missing record
        return np.nan
    elif type(val) == int:      # keep integers intact for 0-1 encoded columns such as num
        return val
    else:                       # valid numeric strings and original floats both become floats
        return float(val)

df2 = df.copy()
for (columnName, _) in df2.items():
    df2[columnName] = df2[columnName].apply(replace_question_mark)
Now, we call the info() function and the head() function, as shown here:
df2.info()
You should expect that all fields are now either floats or integers, as shown in the following output:
Now you can check the number of non-null entries for each column; different columns have different levels of completeness. age and sex don't contain missing values, but ca contains almost no valid data. This should guide your choice of data imputation. For example, strictly dropping all rows with missing values, which is also considered a form of data imputation, would remove almost the entire dataset. Let's check the shape of the DataFrame after the default missing value drop. You will see that there is only one row left. We don't want that:
df2.dropna().shape
A screenshot of the output is as follows:
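Before choosing an imputation strategy, it is also worth quantifying how incomplete each column is. The following one-liner is a sketch of one way to do it (it is not part of the original code); a mostly empty column such as ca should appear near the top:

df2.isna().mean().sort_values(ascending=False)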
Before moving on to other, more mainstream imputation methods, let's perform a quick review of our processed DataFrame.
Check the head of the new DataFrame. You should see that all question marks are replaced by NaN values. NaN values are treated as legitimate numerical values, so native NumPy functions can be used on them:
df2.head()
The output should look as follows:
Now, let's call the describe() function, which generates a table of statistics. It is a very helpful and handy function for a quick peek at common statistics in our dataset:
df2.describe()
Here is a screenshot of the output:
Understanding the describe() limitation
Note that the describe() function only considers valid values. In this sample, the average age value is more trustworthy than the average thal value. Also pay attention to the metadata: a numerical value doesn't necessarily have a numerical meaning. For example, a thal value is encoded as an integer with a given meaning.
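One quick way to convince yourself that thal is a categorical code rather than a measurement is to list its distinct values; a minimal sketch, not shown in the original text:

df2["thal"].value_counts(dropna=False)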
Now, let's examine the two most common ways of imputation.
Imputation with mean or median values
Imputation with mean or median values only works on numerical data. Categorical variables don't have an ordinal structure, such as one label being larger than another, so the concepts of mean and median don't apply.
There are several advantages associated with mean/median imputation:
- It is easy to implement.
- Mean/median imputation doesn't introduce extreme values.
- It is computationally cheap, so it adds essentially no processing time.
However, mean/median imputation has statistical consequences: the summary statistics of the dataset will change. For example, the histogram for cholesterol prior to imputation is provided here:
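To reproduce that pre-imputation histogram yourself, a minimal sketch that simply drops the NaN values before plotting would be as follows (the bin range mirrors the imputation snippet that follows; matplotlib.pyplot is assumed to be imported as plt, as in the other snippets):

plt.hist(df2["chol"].dropna(), bins=range(0, 630, 30))
plt.xlabel("cholesterol")
plt.ylabel("count")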
The following code snippet does the imputation with the mean. Following imputation with the mean, the histogram shifts to the right a little bit:
chol = df2["chol"] plt.hist(chol.apply(lambda x: np.mean(chol) if np.isnan(x) else x), bins=range(0,630,30)) plt.xlabel("cholesterol imputation") plt.ylabel("count")
Imputation with the median will shift the peak to the left because the median is smaller than the mean. However, this won't be obvious if you enlarge the bin size, since the median and mean values will then likely fall into the same bin:
The good news is that the shape of the distribution looks rather similar. The bad news is that we probably increased the level of concentration a little bit. We will cover such statistics in Chapter 3, Visualization with Statistical Graphs.
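The median-imputed histogram is shown only as a screenshot in the text; if you want to produce it yourself, a sketch using fillna(), equivalent to the apply-based approach above, would be:

chol_median = df2["chol"].fillna(df2["chol"].median())
plt.hist(chol_median, bins=range(0, 630, 30))
plt.xlabel("cholesterol median imputation")
plt.ylabel("count")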
Note
In other cases where the distribution is not centered or contains a substantial ratio of missing data, such imputation can be disastrous. For example, if the waiting time in a restaurant follows an exponential distribution, imputation with mean values will probably break the characteristics of the distribution.
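As a quick illustration of that warning, the following sketch uses purely synthetic data (not the heart disease dataset): it simulates exponentially distributed waiting times, hides 40% of them, and imputes with the mean, which collapses the spread and piles mass onto a single value:

rng = np.random.default_rng(0)
wait = rng.exponential(scale=5.0, size=1000)     # synthetic waiting times, mean of about 5
observed = wait.copy()
observed[rng.random(1000) < 0.4] = np.nan        # hide 40% of the records
imputed = np.where(np.isnan(observed), np.nanmean(observed), observed)

# Mean imputation shrinks the spread and stacks a spike at the mean,
# breaking the exponential shape of the distribution.
print(np.nanvar(observed), np.var(imputed))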
Imputation with the mode/most frequent value
The advantage of using the most frequent value (the mode) is that it works with categorical features; the drawback is that it will, without a doubt, introduce bias as well. The slope field is categorical in nature, although it looks numerical: it represents three statuses of a slope value, namely positive, flat, or negative.
The following code snippet illustrates this observation:
plt.hist(df2["slope"],bins = 5) plt.xlabel("slope") plt.ylabel("count");
Here is the output:
Without a doubt, the mode is 2. Following imputation with the mode, we obtain the following new distribution:
plt.hist(df2["slope"].apply(lambda x: 2 if np.isnan(x) else x),bins=5) plt.xlabel("slope mode imputation") plt.ylabel("count");
In the following graph, pay attention to the scale on the y axis:
Replacing missing values with the mode is disastrous in this case. If the positive and negative values of slope have medical significance, a prediction model trained on the preprocessed dataset will underweight them and understate their importance.
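If you do apply mode imputation on a dataset where it is appropriate, compute the mode instead of hardcoding it; a minimal sketch using pandas built-ins:

slope_mode = df2["slope"].mode()[0]              # most frequent value, which is 2 here
slope_imputed = df2["slope"].fillna(slope_mode)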
Different imputation methods have their own pros and cons. The prerequisite is to fully understand your business goals and downstream tasks. If key statistics are important, you should try to avoid distorting them. Also, do remember that collecting more data is always an option.