Essential Statistics for Non-STEM Data Analysts

Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

Product type: Paperback
Published in: Nov 2020
Publisher: Packt
ISBN-13: 9781838984847
Length: 392 pages
Edition: 1st Edition
Author: Rongpeng Li

Table of Contents

Preface
Section 1: Getting Started with Statistics for Data Science
  • Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing
  • Chapter 2: Essential Statistics for Data Assessment
  • Chapter 3: Visualization with Statistical Graphs
Section 2: Essentials of Statistical Analysis
  • Chapter 4: Sampling and Inferential Statistics
  • Chapter 5: Common Probability Distributions
  • Chapter 6: Parametric Estimation
  • Chapter 7: Statistical Hypothesis Testing
Section 3: Statistics for Machine Learning
  • Chapter 8: Statistics for Regression
  • Chapter 9: Statistics for Classification
  • Chapter 10: Statistics for Tree-Based Methods
  • Chapter 11: Statistics for Ensemble Methods
Section 4: Appendix
  • Chapter 12: A Collection of Best Practices
  • Chapter 13: Exercises and Projects
Other Books You May Enjoy

Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing

Thank you for purchasing this book, and welcome to a journey of exploration and excitement! Whether you are already a data scientist, preparing for an interview, or just starting to learn, this book will serve you well as a companion. You may already be familiar with common Python toolkits and have followed trending tutorials online. However, a systematic approach to the statistical side of data science is often missing. This book is designed and written to close this gap for you.

This first chapter starts with the very first step of a data science project: collecting and cleaning data, and performing some initial preprocessing. It is like preparing fish for cooking: you get the fish from the water or from the fish market, examine it, and process it a little before bringing it to the chef.

You are going to learn five key topics in this chapter. They are connected to other topics, such as visualization and basic statistical concepts. For example, outlier removal is very hard to carry out without a scatter plot, and data standardization requires an understanding of statistics such as the standard deviation. We have prepared a GitHub repository that contains ready-to-run code for this chapter as well as the rest of the book.
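To make the link between standardization and the standard deviation concrete, here is a minimal sketch, not taken from the book's repository. It assumes NumPy is installed and uses a small made-up array; each value is rescaled by subtracting the mean and dividing by the standard deviation.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# Standardize: how many standard deviations each value lies from the mean.
z_scores = (values - mean) / std
print(z_scores)
```

After this transformation, the values have a mean of 0 and a standard deviation of 1, which is what most standardization routines aim for.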

Here are the topics that will be covered in this chapter:

  • Collecting data from various data sources with a focus on data quality
  • Data imputation with an assessment of downstream task requirements
  • Outlier removal
  • Data standardization – when and how
  • Examples involving the scikit-learn preprocessing module
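As a preview of how imputation and standardization might look with scikit-learn, consider the following sketch. It is an illustration under our own assumptions rather than the chapter's actual example: the tiny feature matrix is made up, and mean imputation is just one possible strategy whose suitability depends on the downstream task.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with one missing value (np.nan).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

# Fill each missing entry with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

print(X_imputed)
print(X_scaled)
```

Both classes follow the same fit/transform pattern, so the fitted imputer and scaler can later be applied to new data with `transform` alone.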

This chapter serves as a primer, so it is not possible to cover the topics in an entirely sequential fashion. For example, removing outliers relies on statistical plotting techniques, specifically the box plot and the scatter plot. We will come back to those techniques in detail in later chapters, so please bear with us for now. Sometimes, to learn a new topic, bootstrapping (using a tool before it has been fully explained) is one of the few ways to break the shell. You will enjoy it, because the more topics you learn along the way, the more confident you will become.
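To give a flavor of how a box plot relates to outlier removal, the sketch below applies the common 1.5 × IQR rule that box plot whiskers are usually based on. This is one conventional rule, not necessarily the method used later in the book, and the data is made up for illustration.

```python
import numpy as np

# Hypothetical measurements; the last value is suspiciously large.
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

# First and third quartiles, and the interquartile range (IQR).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
mask = (data >= lower) & (data <= upper)

filtered = data[mask]
outliers = data[~mask]
print(outliers)
```

Whether a flagged point should actually be removed is a judgment call that depends on how the data was collected and what it will be used for.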
