You're reading from Essential Statistics for Non-STEM Data Analysts: Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

Product type Paperback

Published in Nov 2020

Publisher Packt

ISBN-13 9781838984847

Length 392 pages

Edition 1st Edition

Languages

Python

Concepts

Data Science

Author (1):

Rongpeng Li


Table of Contents (19) Chapters

Preface

1. Section 1: Getting Started with Statistics for Data Science

2. Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing

3. Chapter 2: Essential Statistics for Data Assessment

4. Chapter 3: Visualization with Statistical Graphs

5. Section 2: Essentials of Statistical Analysis

6. Chapter 4: Sampling and Inferential Statistics

7. Chapter 5: Common Probability Distributions

8. Chapter 6: Parametric Estimation

9. Chapter 7: Statistical Hypothesis Testing

10. Section 3: Statistics for Machine Learning

11. Chapter 8: Statistics for Regression

12. Chapter 9: Statistics for Classification

13. Chapter 10: Statistics for Tree-Based Methods

14. Chapter 11: Statistics for Ensemble Methods

15. Section 4: Appendix

16. Chapter 12: A Collection of Best Practices

17. Chapter 13: Exercises and Projects

18. Other Books You May Enjoy


Outlier removal

Outliers arise in one of two ways: they either come from mistakes, or they have a story behind them. In principle, outliers should be very rare; otherwise, the experiment or survey that generated the dataset is intrinsically flawed.

The definition of an outlier is tricky. Outliers can be legitimate because they fall into the long tail of the population. For example, suppose a team working on financial crisis prediction establishes that a financial crisis occurs in one out of 1,000 simulations. That result is rare, but it is not an outlier that should be discarded.

It is often best to keep mysterious outliers from the raw data if possible. In other words, the reason to remove an outlier should come from outside the dataset: remove it only when you already know its origin. For example, if the heart rate data is strangely fast and you know there is something wrong with the medical equipment, then you can remove the bad data. The fact that the sensor or equipment is faulty can't be deduced from the dataset itself.
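The heart rate example can be sketched in a few lines of Python. This is only an illustration with made-up readings: the record layout, the device IDs, and the `known_faulty_devices` set are all hypothetical. The point is that the removal rule is driven by outside knowledge (a list of devices known to be broken), not by how extreme a value looks.

```python
# Hypothetical heart rate readings in beats per minute; all values are made up.
readings = [
    {"patient": "A", "device": "dev-1", "bpm": 72},
    {"patient": "B", "device": "dev-2", "bpm": 240},  # strangely fast, faulty device
    {"patient": "C", "device": "dev-1", "bpm": 188},  # strangely fast, origin unknown
    {"patient": "D", "device": "dev-2", "bpm": 65},   # looks normal, faulty device
    {"patient": "E", "device": "dev-3", "bpm": 80},
]

# Outside knowledge (an assumption for this sketch): dev-2 is malfunctioning.
known_faulty_devices = {"dev-2"}


def drop_known_bad(rows, faulty_devices):
    """Keep every row except those recorded by a device known to be faulty."""
    return [row for row in rows if row["device"] not in faulty_devices]


cleaned = drop_known_bad(readings, known_faulty_devices)
print([row["bpm"] for row in cleaned])  # [72, 188, 80]
```

Note that the reading of 188 stays in the data even though it looks extreme, because nothing outside the dataset explains it. The reading of 65 is dropped even though it looks normal, because it came from the device known to be faulty.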

Perhaps the best example of keeping outliers in data is the discovery of Neptune. In 1821, Alexis Bouvard discovered substantial deviations in Uranus' orbit based on observations. This led him to hypothesize that another planet was affecting Uranus' orbit. That planet was later found to be Neptune.

Without such outside knowledge, discarding mysterious outliers is risky for downstream tasks. For example, some regression tasks are sensitive to extreme values. It takes further experiments to decide whether the outliers exist for a reason. In such cases, don't remove or correct outliers in the data preprocessing steps.
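To see how sensitive a regression can be to a single extreme value, here is a minimal sketch using ordinary least squares, which is one common regression method that behaves this way. The data is made up, and `ols_slope` is a helper written for this illustration, not a function from any library.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


# Made-up data: the first five points lie exactly on the line y = 2 * x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope_without = ols_slope(xs, ys)

# Add one extreme point at x = 6 (the line y = 2 * x would give 12, not 40).
slope_with = ols_slope(xs + [6], ys + [40])

print(round(slope_without, 2))  # 2.0
print(round(slope_with, 2))     # 6.0
```

One extra point triples the fitted slope. Whether that point is a mistake or a real effect cannot be read off the numbers themselves, which is why the decision to drop it should not be made silently during preprocessing.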

The following graph is a scatter plot of the trestbps and chol fields. The highlighted data points are possible outliers, but I will probably keep them for now:

Figure 1.13 – A scatter plot of two fields in the heart disease dataset

Like missing data imputation, outlier removal is tricky and depends on the quality of data and your understanding of the data.

It is hard to discuss systematic outlier removal without talking about concepts such as quartiles and box plots. In this section, we looked at the background information pertaining to outlier removal. We will talk about implementations based on statistical criteria in the corresponding sections of Chapter 2, Essential Statistics for Data Assessment, and Chapter 3, Visualization with Statistical Graphs.
