Essential Statistics for Non-STEM Data Analysts

Essential Statistics for Non-STEM Data Analysts: Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

eBook: ₹799.99 (was ₹2621.99)
Paperback: ₹3276.99
Subscription: Free Trial, renews at ₹800 p/m

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want
  • AI Assistant (beta) to help accelerate your learning

Essential Statistics for Non-STEM Data Analysts

Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing

Thank you for purchasing this book and welcome to a journey of exploration and excitement! Whether you are already a data scientist, preparing for an interview, or just starting to learn, this book will serve you well as a companion. You may already be familiar with common Python toolkits and have followed trending tutorials online. However, what is often missing is a systematic approach to the statistical side of data science. This book is designed and written to close that gap for you.

As the first chapter in the book, we start with the very first step of a data science project: collecting and cleaning data, and performing some initial preprocessing. It is like preparing fish for cooking: you get the fish from the water or from the fish market, examine it, and process it a little before bringing it to the chef.

You are going to learn five key topics in this chapter. They are intertwined with other topics, such as visualization and basic statistical concepts. For example, outlier removal is very hard to conduct without a scatter plot, and data standardization clearly requires an understanding of statistics such as the standard deviation. We have prepared a GitHub repository that contains ready-to-run code for this chapter as well as the rest of the book.

Here are the topics that will be covered in this chapter:

  • Collecting data from various data sources with a focus on data quality
  • Data imputation with an assessment of downstream task requirements
  • Outlier removal
  • Data standardization – when and how
  • Examples involving the scikit-learn preprocessing module

This chapter serves as a primer. It is not possible to cover the topics in an entirely sequential fashion. For example, removing outliers relies on statistical plotting techniques, specifically the box plot and the scatter plot. We will come back to those techniques in detail in future chapters, of course, but please bear with them for now. Sometimes, when learning new topics, bootstrapping yourself like this is one of the few ways to break the shell. You will enjoy it: the more topics you learn along the way, the higher your confidence will be.

Technical requirements

The best environment for running the Python code in this book is Google Colaboratory (https://colab.research.google.com). Google Colaboratory is a product that runs Jupyter Notebook in the cloud. It comes with common Python packages pre-installed and runs in a browser. It can also connect to Google Drive and accept local file uploads, so you can work with your own data files. The recommended browsers are the latest versions of Chrome and Firefox.
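
For example, here is a minimal sketch (an illustration, not code from the book) of uploading a local data file such as processed.hungarian.data into the notebook's working directory when running inside Colaboratory; it uses the google.colab.files helper that ships with Colab:

# Runs only inside Google Colaboratory
from google.colab import files

# Opens a file picker in the browser; selected files are saved
# to the notebook's current working directory
uploaded = files.upload()
print(list(uploaded.keys()))  # names of the files that were uploaded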

For more information about Colaboratory, check out their official notebooks: https://colab.research.google.com .

You can find the code for this chapter in the following GitHub repository: https://github.com/PacktPublishing/Essential-Statistics-for-Non-STEM-Data-Analysts

Collecting data from various data sources

There are three major ways to collect and gather data. It is crucial to keep in mind that data doesn't have to be well-formatted tables:

  • Obtaining structured tabulated data directly: For example, the Federal Reserve (https://www.federalreserve.gov/data.htm) releases well-structured and well-documented data in various formats, including CSV, which pandas can read directly into a DataFrame.
  • Requesting data from an API: For example, the Google Map API (https://developers.google.com/maps/documentation) allows developers to request data from the Google API at a capped rate depending on the pricing plan. The returned format is usually JSON or XML.
  • Building a dataset from scratch: For example, social scientists often perform surveys and collect participants' answers to build proprietary data.

Let's look at some examples involving these three approaches. You will use the UCI Machine Learning Repository, the Google Maps Places API, and the USC President's Office website as data sources, respectively.

Reading data directly from files

Reading data from local files, or from remote files through a URL, usually requires a good source of publicly accessible data archives. For example, the University of California, Irvine maintains a data repository for machine learning. We will be reading the Hungarian heart disease dataset with pandas. The latest URL will be updated in the book's official GitHub repository in case the following code fails. You may obtain the file from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. From the datasets there, we are using the processed.hungarian.data file. You need to upload the file to the same folder where the notebook resides.

The following code snippet reads the data and displays the first several rows of the datasets:

import pandas as pd

# Column names follow the UCI heart disease dataset documentation
df = pd.read_csv("processed.hungarian.data",
                 sep=",",
                 names=["age","sex","cp","trestbps",
                        "chol","fbs","restecg","thalach",
                        "exang","oldpeak","slope","ca",
                        "thal","num"])
df.head()

This produces the following output:

Figure 1.1 – Head of the Hungarian heart disease dataset
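
If you prefer not to upload the file manually, pandas can also read the same file directly over HTTPS. The following is a sketch that assumes the UCI URL shown earlier is still live:

import pandas as pd

# Read the Hungarian heart disease data straight from the UCI repository
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.hungarian.data")
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv(url, sep=",", names=columns)
df.head()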

In the following section, you will learn how to obtain data from an API.

Obtaining data from an API

In plain English, an Application Programming Interface (API) defines protocols, agreements, or treaties between applications or parts of applications. You need to pass requests to an API and obtain returned data in JSON or other formats specified in the API documentation. Then you can extract the data you want.

Note

When working with an API, you need to follow its guidelines and restrictions regarding usage. Improper usage of an API can result in the suspension of your account or even legal issues.

Let's take the Google Maps Places API as an example. The Places API (https://developers.google.com/places/web-service/intro) is one of the many Google Maps APIs that Google offers. Developers can use HTTP requests to obtain information about geographic locations, the opening hours of establishments, and the types of establishment, such as schools, government offices, and police stations.

In terms of using external APIs

Like many APIs, the Places API requires you to create an account on its platform, Google Cloud Platform. Creating an account is free, but some of the services it provides still require a credit card on file, so pay attention to avoid being charged by mistake.

After obtaining and activating the API credentials, the developer can build standard HTTP requests to query the endpoints. For example, the textsearch endpoint is used to query places based on text. Here, you will use the API to query information about libraries in Culver City, Los Angeles:

  1. First, let's import the necessary libraries:
    import requests
    import json
  2. Initialize the API key and endpoint. Replace API_KEY with a real API key to make the code work:
    API_KEY = "Your API key goes here"  # placeholder; use your own key
    TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json?"
    query = "Culver City Library"
  3. Obtain the response and parse the returned data into JSON format. Let's examine it:
    response = requests.get(TEXT_SEARCH_URL + 'query=' + query + '&key=' + API_KEY)
    json_object = response.json()
    print(json_object)

This particular query returns a single result; otherwise, the results field will contain multiple entries. You can index a multi-entry results field as you would a normal Python list:

{'html_attributions': [],
 'results': [{'formatted_address': '4975 Overland Ave, Culver City, CA 90230, United States',
   'geometry': {'location': {'lat': 34.0075635, 'lng': -118.3969651},
    'viewport': {'northeast': {'lat': 34.00909257989272,
      'lng': -118.3955611701073},
     'southwest': {'lat': 34.00639292010727, 'lng': -118.3982608298927}}},
   'icon': 'https://maps.gstatic.com/mapfiles/place_api/icons/civic_building-71.png',
   'id': 'ccdd10b4f04fb117909897264c78ace0fa45c771',
   'name': 'Culver City Julian Dixon Library',
   'opening_hours': {'open_now': True},
   'photos': [{'height': 3024,
     'html_attributions': ['<a href="https://maps.google.com/maps/contrib/102344423129359752463">Khaled Alabed</a>'],
     'photo_reference': 'CmRaAAAANT4Td01h1tkI7dTn35vAkZhx_-mg3PjgKvjHiyh80M5UlI3wVw1cer4vkOksYR68NM9aw33ZPYGQzzXTE8bkOwQYuSChXAWlJUtz8atPhmRht4hP4dwFgqfbJULmG5f1EhAfWlF_cpLz76sD_81fns1OGhT4KU-zWTbuNY54_4_XozE02pLNWw',
     'width': 4032}],
   'place_id': 'ChIJrUqREx-6woARFrQdyscOZ-8',
   'plus_code': {'compound_code': '2J53+26 Culver City, California',
    'global_code': '85632J53+26'},
   'rating': 4.2,
   'reference': 'ChIJrUqREx-6woARFrQdyscOZ-8',
   'types': ['library', 'point_of_interest', 'establishment'],
   'user_ratings_total': 49}],
 'status': 'OK'}

The address and name of the library can be obtained as follows:

print(json_object["results"][0]["formatted_address"])
print(json_object["results"][0]["name"])

The result reads as follows:

4975 Overland Ave, Culver City, CA 90230, United States
Culver City Julian Dixon Library
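
As a side note, a slightly more robust way to build the same request is to let requests handle the URL encoding of the query string through its params argument. This is a sketch (not the book's code) that assumes the API_KEY and query variables defined earlier:

# requests percent-encodes spaces and special characters in the query for us
response = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params={"query": query, "key": API_KEY},
)
json_object = response.json()
print(json_object["results"][0]["name"])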

Information

An API can be especially helpful for data augmentation. For example, if you have a list of addresses that are corrupted or mislabeled, the Google Maps API may help you correct the wrong entries.
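
As an illustration of that idea (a sketch, not the book's code), the hypothetical helper below reuses the textsearch endpoint from the previous section to replace a rough or misspelled address string with the API's formatted address; it assumes the same API_KEY as before:

import requests

def clean_address(raw_address, api_key=API_KEY):
    # Ask the Places text search for its best match of a rough address string
    response = requests.get(
        "https://maps.googleapis.com/maps/api/place/textsearch/json",
        params={"query": raw_address, "key": api_key},
    )
    results = response.json().get("results", [])
    # Fall back to the original string if the API returns nothing
    return results[0]["formatted_address"] if results else raw_address

# A partially misspelled entry is normalized to a full postal address
print(clean_address("Culver City Julian Dixon Libary, Overland Ave"))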

Obtaining data from scratch

There are instances where you would need to build your own dataset from scratch.

One way of building a dataset is to crawl and parse the internet. A lot of resources on the internet are publicly accessible and free to use. Google's spiders crawl the web relentlessly, 24/7, to keep its search results up to date. You can write your own code to gather information online instead of opening a web browser and doing it manually.

Conducting a survey and obtaining feedback, whether explicitly or implicitly, is another way to obtain proprietary data. Companies such as Google and Amazon gather tons of data from user profiling. Such data forms the core of their dominance in ads and e-commerce. We won't be covering this method, however.

Legal issues of crawling

Notice that in some cases, web crawling is highly controversial. Before crawling a website, do check its user agreement. Some websites explicitly forbid web crawling. Even if a website is open to web crawling, intensive requests may dramatically slow it down and prevent it from serving other users normally. Respecting a site's policy is not only a courtesy; it may also be required by law.

Here is a simple example that uses a regular expression to obtain all the phone numbers from the web page of the president's office at the University of Southern California (http://departmentsdirectory.usc.edu/pres_off.html):

  1. First, let's import the necessary libraries. re is Python's built-in regular expression library, and requests is an HTTP client library that lets us communicate with web servers over HTTP:
    import re
    import requests
  2. If you look at the web page, you will notice that there is a pattern within the phone numbers: they all consist of three digits, followed by a hyphen and then four digits. Our objective now is to compile such a pattern:
    pattern = re.compile(r"\d{3}-\d{4}")
  3. The next step is to make a GET request and obtain the response:
    response = requests.get("http://departmentsdirectory.usc.edu/pres_off.html")
  4. The text attribute of the response holds the page content as one long string, which we feed to the findall method:
    pattern.findall(response.text)

The results contain all the phone numbers on the web page:

 ['740-2111',
 '821-1342',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-9749',
 '740-2505',
 '740-6942',
 '821-1340',
 '821-6292']
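
Notice that the main office number appears many times. If you only care about distinct numbers, a small extension (a sketch combining the steps above) deduplicates the matches with a set:

import re
import requests

url = "http://departmentsdirectory.usc.edu/pres_off.html"
pattern = re.compile(r"\d{3}-\d{4}")

response = requests.get(url)
numbers = pattern.findall(response.text)

# A set removes repeated entries; sorted() gives a stable order for printing
print(sorted(set(numbers)))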

In this section, we introduced three different ways of collecting data: reading tabulated data from files provided by others, obtaining data from APIs, and building data from scratch. In the rest of the book, we will focus on the first option and mainly use data collected from the UCI Machine Learning Repository. In most cases, API data and scraped data are eventually integrated into tabulated datasets for production usage.


Key benefits

  • Work your way through the entire data analysis pipeline with statistical concerns in mind to make reasonable decisions
  • Understand how various data science algorithms function
  • Build a solid foundation in statistics for data science and machine learning using Python-based examples

Description

Statistics remains the backbone of modern analysis tasks, helping you to interpret the results produced by data science pipelines. This book is a detailed guide covering the math and various statistical methods required for undertaking data science tasks. The book starts by showing you how to preprocess data and inspect distributions and correlations from a statistical perspective. You’ll then get to grips with the fundamentals of statistical analysis and apply its concepts to real-world datasets. As you advance, you’ll find out how statistical concepts emerge from different stages of data science pipelines, understand the summary of datasets in the language of statistics, and use it to build a solid foundation for robust data products such as explanatory models and predictive models. Once you’ve uncovered the working mechanism of data science algorithms, you’ll cover essential concepts for efficient data collection, cleaning, mining, visualization, and analysis. Finally, you’ll implement statistical methods in key machine learning tasks such as classification, regression, tree-based methods, and ensemble learning. By the end of this Essential Statistics for Non-STEM Data Analysts book, you’ll have learned how to build and present a self-contained, statistics-backed data product to meet your business goals.

Who is this book for?

This book is an entry-level guide for data science enthusiasts, data analysts, and anyone starting out in the field of data science and looking to learn the essential statistical concepts with the help of simple explanations and examples. If you’re a developer or student with a non-mathematical background, you’ll find this book useful. Working knowledge of the Python programming language is required.

What you will learn

  • Find out how to grab and load data into an analysis environment
  • Perform descriptive analysis to extract meaningful summaries from data
  • Discover probability, parameter estimation, hypothesis tests, and experiment design best practices
  • Get to grips with resampling and bootstrapping in Python
  • Delve into statistical tests with variance analysis, time series analysis, and A/B test examples
  • Understand the statistics behind popular machine learning algorithms
  • Answer questions on statistics for data scientist interviews

Product Details

Publication date: Nov 12, 2020
Length: 392 pages
Edition: 1st
Language: English
ISBN-13: 9781838987565


Packt Subscriptions

See our plans and pricing

₹800 billed monthly
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Simple pricing, no contract

₹4500 billed annually
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just ₹400 each
  • Exclusive print discounts

₹5000 billed in 18 months
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just ₹400 each
  • Exclusive print discounts

Frequently bought together

  • Practical Discrete Mathematics: ₹4915.99
  • Hands-On Mathematics for Deep Learning: ₹3276.99
  • Essential Statistics for Non-STEM Data Analysts: ₹3276.99

Total: ₹11,469.97

Table of Contents

18 Chapters

Section 1: Getting Started with Statistics for Data Science
Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing
Chapter 2: Essential Statistics for Data Assessment
Chapter 3: Visualization with Statistical Graphs
Section 2: Essentials of Statistical Analysis
Chapter 4: Sampling and Inferential Statistics
Chapter 5: Common Probability Distributions
Chapter 6: Parametric Estimation
Chapter 7: Statistical Hypothesis Testing
Section 3: Statistics for Machine Learning
Chapter 8: Statistics for Regression
Chapter 9: Statistics for Classification
Chapter 10: Statistics for Tree-Based Methods
Chapter 11: Statistics for Ensemble Methods
Section 4: Appendix
Chapter 12: A Collection of Best Practices
Chapter 13: Exercises and Projects
Other Books You May Enjoy

Customer reviews

Rating distribution: 4.6 out of 5 (10 Ratings)
5 star: 70%
4 star: 20%
3 star: 10%
2 star: 0%
1 star: 0%

Top Reviews

Amazon Customer, Jan 05, 2021 (5 stars)
I came to this book with the perspective of a non-STEM professional who is transitioning into a Data Science career - so it seemed like it could be an ideal resource for someone with my profile, and that is indeed the case! Previously, I had worked through online courses to learn some Python and some statistics, and had also used the books ‘Python the Hard Way’ and ‘Statistics for Dummies’ quite extensively. The challenge with both of those books is that, in order to become a working data analyst or data scientist, you really need to learn both topics hand-in-hand: it's best to learn statistics while also learning to code in Python. That's exactly what you get in THIS book - working through the essential mathematics/statistics concepts, and learning to code them as you go. These are tough subjects to learn and it's not easy for anyone who doesn't have previous experience in coding - as the author says early on in the book, for a few chapters it's best to just "code along" with the assignments even if you don't fully understand. As you become more familiar with Python, the syntax will start to make sense.
Additional notes:
- If you’re apprehensive about starting your data analysis/data science journey, this book includes a very clear explanation of how beginners can start coding (using Google Colab or Jupyter Notebook).
- There’s a full section on machine learning - this is what many people consider the most fun part of data science. Spend a lot of time and effort on the earlier sections of the book, and you’ll be fully ready to jump into the machine learning chapters.
- Don’t miss the great chapter towards the end of the book, titled “A Collection of Best Practices” - it covers some of the ethical challenges of working with data, ranging from the publication of misleading graphs to the controversial use of facial recognition technology.
Amazon Verified review
Rongyu Lin, May 22, 2021 (5 stars)
It contains the most important knowledge that data analysts use in the daily work. Well organized and easy to understand. I would recommend this book to all my friends.
Amazon Verified review
data analyst, Feb 23, 2021 (5 stars)
This book is great for the beginner data analyst wanting to learn statistics and Python. Unlike some other statistics texts I have read, the examples are laid out with very simple Python code without skipping the beginning steps. The preface even shows you how to install, import and read in an Excel file with Pandas. All the example plots have their source code next to them, which is extremely helpful to the beginner as well. Another important concept covered in this book is data preprocessing. It’s not all you will need to know but it is a great start. As an analyst, 80% of my time is spent on data preprocessing.
Amazon Verified review
Elsa V., Jan 27, 2021 (5 stars)
I wish I'd had this book from the beginning of my journey. The graphics make it very easy to understand what we are talking about and the code is broken down and functional enough to make it easy and clear to understand the point. Li has the insight and clarity to teach data science in a way that even a 5 year old could understand. Absolutely recommend his book, for both stem and non-stem majors. (I have taken Data Sci at a prestigious school via the comp sci/engineering dept, completed a Data Sci bootcamp and work as a software engineer supporting a Data Sci team, and am a former kindergarten teacher and can professionally attest to the quality of digestibility of the info and the breadth and depth of topics.)
Amazon Verified review
Imran Sarker, Feb 15, 2021 (5 stars)
This book provides a lot of knowledge related to data analysis to its readers. It doesn’t matter what your background is; when you read this book you start walking on the path to becoming a data analyst. Being an analyst, it’s important to be familiar with statistics in order to come up with results and turn your idea into reality. Although new ways are being introduced over time, getting familiar with the background of everything is important. The author has distributed all the topics well so people from any background, non-STEM or STEM, can go through each chapter and learn as much as they can. I would definitely recommend this book to all who want to manage data and find ways to deal with it.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.