Essential Statistics for Non-STEM Data Analysts

Essential Statistics for Non-STEM Data Analysts: Get to grips with the statistics and math knowledge needed to enter the world of data science with Python

eBook: ₹799.99 (was ₹2621.99)
Paperback: ₹3276.99
Subscription: Free Trial, renews at ₹800 p/m

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want
  • AI Assistant (beta) to help accelerate your learning

Essential Statistics for Non-STEM Data Analysts

Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing

Thank you for purchasing this book and welcome to a journey of exploration and excitement! Whether you are already a data scientist, preparing for an interview, or just starting to learn, this book will serve you well as a companion. You may already be familiar with common Python toolkits and have followed trending tutorials online. However, what is often missing is a systematic approach to the statistical side of data science. This book is designed and written to close that gap for you.

As the first chapter in the book, we start with the very first step of a data science project: collecting and cleaning data, and performing some initial preprocessing. It is like preparing fish for cooking: you get the fish from the water or from the fish market, examine it, and process it a little before bringing it to the chef.

You are going to learn five key topics in this chapter. They are intertwined with other topics, such as visualization and basic statistical concepts. For example, outlier removal is very hard to conduct without a scatter plot, and data standardization clearly requires an understanding of statistics such as the standard deviation. We have prepared a GitHub repository that contains ready-to-run code for this chapter as well as the rest of the book.

Here are the topics that will be covered in this chapter:

  • Collecting data from various data sources with a focus on data quality
  • Data imputation with an assessment of downstream task requirements
  • Outlier removal
  • Data standardization – when and how
  • Examples involving the scikit-learn preprocessing module

This chapter serves as a primer. It is not possible to cover the topics in an entirely sequential fashion. For example, removing outliers relies on statistical plotting techniques, specifically the box plot and the scatter plot. We will come back to those techniques in detail in future chapters, of course, but please bear with them for now. Sometimes, when learning new topics, bootstrapping yourself like this is one of the few ways to break the shell. You will enjoy it: the more topics you learn along the way, the higher your confidence will be.

Technical requirements

The best environment for running the Python code in this book is Google Colaboratory (https://colab.research.google.com). Google Colaboratory is a product that runs Jupyter Notebook in the cloud. It comes with common Python packages pre-installed and runs in a browser. It can also connect to Google Drive and accept local file uploads, so you can work with your own data files. The recommended browsers are the latest versions of Chrome and Firefox.
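
For example, here is a minimal sketch (an illustration, not code from the book) of uploading a local data file such as processed.hungarian.data into the notebook's working directory when running inside Colaboratory; it uses the google.colab.files helper that ships with Colab:

# Runs only inside Google Colaboratory
from google.colab import files

# Opens a file picker in the browser; selected files are saved
# to the notebook's current working directory
uploaded = files.upload()
print(list(uploaded.keys()))  # names of the files that were uploaded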

For more information about Colaboratory, check out their official notebooks: https://colab.research.google.com .

You can find the code for this chapter in the following GitHub repository: https://github.com/PacktPublishing/Essential-Statistics-for-Non-STEM-Data-Analysts

Collecting data from various data sources

There are three major ways to collect and gather data. It is crucial to keep in mind that data doesn't have to be well-formatted tables:

  • Obtaining structured tabulated data directly: For example, the Federal Reserve (https://www.federalreserve.gov/data.htm) releases well-structured and well-documented data in various formats, including CSV, which pandas can read directly into a DataFrame.
  • Requesting data from an API: For example, the Google Map API (https://developers.google.com/maps/documentation) allows developers to request data from the Google API at a capped rate depending on the pricing plan. The returned format is usually JSON or XML.
  • Building a dataset from scratch: For example, social scientists often perform surveys and collect participants' answers to build proprietary data.

Let's look at some examples involving these three approaches. You will use the UCI Machine Learning Repository, the Google Maps Places API, and the USC President's Office website as data sources, respectively.

Reading data directly from files

Reading data from local files, or from remote files through a URL, usually requires a good source of publicly accessible data archives. For example, the University of California, Irvine maintains a data repository for machine learning. We will be reading the Hungarian heart disease dataset with pandas. The latest URL will be updated in the book's official GitHub repository in case the following code fails. You may obtain the file from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. From the datasets there, we are using the processed.hungarian.data file. You need to upload the file to the same folder where the notebook resides.

The following code snippet reads the data and displays the first several rows of the datasets:

import pandas as pd

# Column names follow the UCI heart disease dataset documentation
df = pd.read_csv("processed.hungarian.data",
                 sep=",",
                 names=["age","sex","cp","trestbps",
                        "chol","fbs","restecg","thalach",
                        "exang","oldpeak","slope","ca",
                        "thal","num"])
df.head()

This produces the following output:

Figure 1.1 – Head of the Hungarian heart disease dataset
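
If you prefer not to upload the file manually, pandas can also read the same file directly over HTTPS. The following is a sketch that assumes the UCI URL shown earlier is still live:

import pandas as pd

# Read the Hungarian heart disease data straight from the UCI repository
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.hungarian.data")
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv(url, sep=",", names=columns)
df.head()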

In the following section, you will learn how to obtain data from an API.

Obtaining data from an API

In plain English, an Application Programming Interface (API) defines protocols, agreements, or treaties between applications or parts of applications. You need to pass requests to an API and obtain returned data in JSON or other formats specified in the API documentation. Then you can extract the data you want.

Note

When working with an API, you need to follow its guidelines and restrictions regarding usage. Improper usage of an API can result in the suspension of your account or even legal issues.

Let's take the Google Maps Places API as an example. The Places API (https://developers.google.com/places/web-service/intro) is one of the many Google Maps APIs that Google offers. Developers can use HTTP requests to obtain information about geographic locations, the opening hours of establishments, and the types of establishment, such as schools, government offices, and police stations.

In terms of using external APIs

Like many APIs, the Places API requires you to create an account on its platform, Google Cloud Platform. Creating an account is free, but some of the services it provides still require a credit card on file, so pay attention to avoid being charged by mistake.

After obtaining and activating the API credentials, the developer can build standard HTTP requests to query the endpoints. For example, the textsearch endpoint is used to query places based on text. Here, you will use the API to query information about libraries in Culver City, Los Angeles:

  1. First, let's import the necessary libraries:
    import requests
    import json
  2. Initialize the API key and endpoint. Replace API_KEY with a real API key to make the code work:
    API_KEY = "Your API key goes here"  # placeholder; use your own key
    TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json?"
    query = "Culver City Library"
  3. Obtain the response and parse the returned data into JSON format. Let's examine it:
    response = requests.get(TEXT_SEARCH_URL + 'query=' + query + '&key=' + API_KEY)
    json_object = response.json()
    print(json_object)

This particular query returns a single result; otherwise, the results field will contain multiple entries. You can index a multi-entry results field as you would a normal Python list:

{'html_attributions': [],
 'results': [{'formatted_address': '4975 Overland Ave, Culver City, CA 90230, United States',
   'geometry': {'location': {'lat': 34.0075635, 'lng': -118.3969651},
    'viewport': {'northeast': {'lat': 34.00909257989272,
      'lng': -118.3955611701073},
     'southwest': {'lat': 34.00639292010727, 'lng': -118.3982608298927}}},
   'icon': 'https://maps.gstatic.com/mapfiles/place_api/icons/civic_building-71.png',
   'id': 'ccdd10b4f04fb117909897264c78ace0fa45c771',
   'name': 'Culver City Julian Dixon Library',
   'opening_hours': {'open_now': True},
   'photos': [{'height': 3024,
     'html_attributions': ['<a href="https://maps.google.com/maps/contrib/102344423129359752463">Khaled Alabed</a>'],
     'photo_reference': 'CmRaAAAANT4Td01h1tkI7dTn35vAkZhx_-mg3PjgKvjHiyh80M5UlI3wVw1cer4vkOksYR68NM9aw33ZPYGQzzXTE8bkOwQYuSChXAWlJUtz8atPhmRht4hP4dwFgqfbJULmG5f1EhAfWlF_cpLz76sD_81fns1OGhT4KU-zWTbuNY54_4_XozE02pLNWw',
     'width': 4032}],
   'place_id': 'ChIJrUqREx-6woARFrQdyscOZ-8',
   'plus_code': {'compound_code': '2J53+26 Culver City, California',
    'global_code': '85632J53+26'},
   'rating': 4.2,
   'reference': 'ChIJrUqREx-6woARFrQdyscOZ-8',
   'types': ['library', 'point_of_interest', 'establishment'],
   'user_ratings_total': 49}],
 'status': 'OK'}

The address and name of the library can be obtained as follows:

print(json_object["results"][0]["formatted_address"])
print(json_object["results"][0]["name"])

The result reads as follows:

4975 Overland Ave, Culver City, CA 90230, United States
Culver City Julian Dixon Library
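
As a side note, a slightly more robust way to build the same request is to let requests handle the URL encoding of the query string through its params argument. This is a sketch (not the book's code) that assumes the API_KEY and query variables defined earlier:

# requests percent-encodes spaces and special characters in the query for us
response = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params={"query": query, "key": API_KEY},
)
json_object = response.json()
print(json_object["results"][0]["name"])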

Information

An API can be especially helpful for data augmentation. For example, if you have a list of addresses that are corrupted or mislabeled, the Google Maps API may help you correct the wrong entries.
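
As an illustration of that idea (a sketch, not the book's code), the hypothetical helper below reuses the textsearch endpoint from the previous section to replace a rough or misspelled address string with the API's formatted address; it assumes the same API_KEY as before:

import requests

def clean_address(raw_address, api_key=API_KEY):
    # Ask the Places text search for its best match of a rough address string
    response = requests.get(
        "https://maps.googleapis.com/maps/api/place/textsearch/json",
        params={"query": raw_address, "key": api_key},
    )
    results = response.json().get("results", [])
    # Fall back to the original string if the API returns nothing
    return results[0]["formatted_address"] if results else raw_address

# A partially misspelled entry is normalized to a full postal address
print(clean_address("Culver City Julian Dixon Libary, Overland Ave"))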

Obtaining data from scratch

There are instances where you would need to build your own dataset from scratch.

One way of building a dataset is to crawl and parse the internet. A lot of resources on the internet are publicly accessible and free to use. Google's spiders crawl the web relentlessly, 24/7, to keep its search results up to date. You can write your own code to gather information online instead of opening a web browser and doing it manually.

Conducting a survey and obtaining feedback, whether explicitly or implicitly, is another way to obtain proprietary data. Companies such as Google and Amazon gather tons of data from user profiling. Such data forms the core of their dominance in ads and e-commerce. We won't be covering this method, however.

Legal issues of crawling

Notice that in some cases, web crawling is highly controversial. Before crawling a website, do check its user agreement. Some websites explicitly forbid web crawling. Even if a website is open to web crawling, intensive requests may dramatically slow it down and prevent it from serving other users normally. Respecting a site's policy is not only a courtesy; it may also be required by law.

Here is a simple example that uses a regular expression to obtain all the phone numbers from the web page of the president's office at the University of Southern California (http://departmentsdirectory.usc.edu/pres_off.html):

  1. First, let's import the necessary libraries. re is Python's built-in regular expression library, and requests is an HTTP client library that lets us communicate with web servers over HTTP:
    import re
    import requests
  2. If you look at the web page, you will notice that there is a pattern within the phone numbers: they all consist of three digits, followed by a hyphen and then four digits. Our objective now is to compile such a pattern:
    pattern = re.compile(r"\d{3}-\d{4}")
  3. The next step is to make a GET request and obtain the response:
    response = requests.get("http://departmentsdirectory.usc.edu/pres_off.html")
  4. The text attribute of the response holds the page content as one long string, which we feed to the findall method:
    pattern.findall(response.text)

The results contain all the phone numbers on the web page:

 ['740-2111',
 '821-1342',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-2111',
 '740-9749',
 '740-2505',
 '740-6942',
 '821-1340',
 '821-6292']
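
Notice that the main office number appears many times. If you only care about distinct numbers, a small extension (a sketch combining the steps above) deduplicates the matches with a set:

import re
import requests

url = "http://departmentsdirectory.usc.edu/pres_off.html"
pattern = re.compile(r"\d{3}-\d{4}")

response = requests.get(url)
numbers = pattern.findall(response.text)

# A set removes repeated entries; sorted() gives a stable order for printing
print(sorted(set(numbers)))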

In this section, we introduced three different ways of collecting data: reading tabulated data from files provided by others, obtaining data from APIs, and building data from scratch. In the rest of the book, we will focus on the first option and mainly use data collected from the UCI Machine Learning Repository. In most cases, API data and scraped data are eventually integrated into tabulated datasets for production usage.


Key benefits

  • Work your way through the entire data analysis pipeline with statistical concerns in mind to make reasonable decisions
  • Understand how various data science algorithms function
  • Build a solid foundation in statistics for data science and machine learning using Python-based examples

Description

Statistics remains the backbone of modern analysis tasks, helping you to interpret the results produced by data science pipelines. This book is a detailed guide covering the math and various statistical methods required for undertaking data science tasks. The book starts by showing you how to preprocess data and inspect distributions and correlations from a statistical perspective. You’ll then get to grips with the fundamentals of statistical analysis and apply its concepts to real-world datasets. As you advance, you’ll find out how statistical concepts emerge from different stages of data science pipelines, understand the summary of datasets in the language of statistics, and use it to build a solid foundation for robust data products such as explanatory models and predictive models. Once you’ve uncovered the working mechanism of data science algorithms, you’ll cover essential concepts for efficient data collection, cleaning, mining, visualization, and analysis. Finally, you’ll implement statistical methods in key machine learning tasks such as classification, regression, tree-based methods, and ensemble learning. By the end of this Essential Statistics for Non-STEM Data Analysts book, you’ll have learned how to build and present a self-contained, statistics-backed data product to meet your business goals.

Who is this book for?

This book is an entry-level guide for data science enthusiasts, data analysts, and anyone starting out in the field of data science and looking to learn the essential statistical concepts with the help of simple explanations and examples. If you’re a developer or student with a non-mathematical background, you’ll find this book useful. Working knowledge of the Python programming language is required.

What you will learn

  • Find out how to grab and load data into an analysis environment
  • Perform descriptive analysis to extract meaningful summaries from data
  • Discover probability, parameter estimation, hypothesis tests, and experiment design best practices
  • Get to grips with resampling and bootstrapping in Python
  • Delve into statistical tests with variance analysis, time series analysis, and A/B test examples
  • Understand the statistics behind popular machine learning algorithms
  • Answer questions on statistics for data scientist interviews

Product Details

Publication date: Nov 12, 2020
Length: 392 pages
Edition: 1st
Language: English
ISBN-13: 9781838987565


Packt Subscriptions

See our plans and pricing

₹800 billed monthly
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Simple pricing, no contract

₹4500 billed annually
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just ₹400 each
  • Exclusive print discounts

₹5000 billed in 18 months
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just ₹400 each
  • Exclusive print discounts

Frequently bought together

  • Practical Discrete Mathematics: ₹4915.99
  • Hands-On Mathematics for Deep Learning: ₹3276.99
  • Essential Statistics for Non-STEM Data Analysts: ₹3276.99

Total: ₹11,469.97

Table of Contents

18 Chapters

Section 1: Getting Started with Statistics for Data Science
Chapter 1: Fundamentals of Data Collection, Cleaning, and Preprocessing
Chapter 2: Essential Statistics for Data Assessment
Chapter 3: Visualization with Statistical Graphs
Section 2: Essentials of Statistical Analysis
Chapter 4: Sampling and Inferential Statistics
Chapter 5: Common Probability Distributions
Chapter 6: Parametric Estimation
Chapter 7: Statistical Hypothesis Testing
Section 3: Statistics for Machine Learning
Chapter 8: Statistics for Regression
Chapter 9: Statistics for Classification
Chapter 10: Statistics for Tree-Based Methods
Chapter 11: Statistics for Ensemble Methods
Section 4: Appendix
Chapter 12: A Collection of Best Practices
Chapter 13: Exercises and Projects
Other Books You May Enjoy

Customer reviews

Rating distribution: 4.6 out of 5 (10 Ratings)
5 star: 70%
4 star: 20%
3 star: 10%
2 star: 0%
1 star: 0%

Top Reviews

Amazon Customer, Jan 05, 2021 (5 stars)
I came to this book with the perspective of a non-STEM professional who is transitioning into a Data Science career - so it seemed like it could be an ideal resource for someone with my profile, and that is indeed the case! Previously, I had worked through online courses to learn some Python and some statistics, and had also used the books ‘Python the Hard Way’ and ‘Statistics for Dummies’ quite extensively. The challenge with both of those books is that, in order to become a working data analyst or data scientist, you really need to learn both topics hand-in-hand: it's best to learn statistics while also learning to code in Python. That's exactly what you get in THIS book - working through the essential mathematics/statistics concepts, and learning to code them as you go. These are tough subjects to learn and it's not easy for anyone who doesn't have previous experience in coding - as the author says early on in the book, for a few chapters it's best to just "code along" with the assignments even if you don't fully understand. As you become more familiar with Python, the syntax will start to make sense.
Additional notes:
- If you’re apprehensive about starting your data analysis/data science journey, this book includes a very clear explanation of how beginners can start coding (using Google Colab or Jupyter Notebook).
- There’s a full section on machine learning - this is what many people consider the most fun part of data science. Spend a lot of time and effort on the earlier sections of the book, and you’ll be fully ready to jump into the machine learning chapters.
- Don’t miss the great chapter towards the end of the book, titled “A Collection of Best Practices” - it covers some of the ethical challenges of working with data, ranging from the publication of misleading graphs to the controversial use of facial recognition technology.
Amazon Verified review
Rongyu Lin, May 22, 2021 (5 stars)
It contains the most important knowledge that data analysts use in the daily work. Well organized and easy to understand. I would recommend this book to all my friends.
Amazon Verified review
data analyst, Feb 23, 2021 (5 stars)
This book is great for the beginner data analyst wanting to learn statistics and Python. Unlike some other statistics texts I have read, the examples are laid out with very simple Python code without skipping the beginning steps. The preface even shows you how to install, import and read in an Excel file with Pandas. All the example plots have their source code next to them, which is extremely helpful to the beginner as well. Another important concept covered in this book is data preprocessing. It’s not all you will need to know but it is a great start. As an analyst, 80% of my time is spent on data preprocessing.
Amazon Verified review
Elsa V., Jan 27, 2021 (5 stars)
I wish I'd had this book from the beginning of my journey. The graphics make it very easy to understand what we are talking about and the code is broken down and functional enough to make it easy and clear to understand the point. Li has the insight and clarity to teach data science in a way that even a 5 year old could understand. Absolutely recommend his book, for both stem and non-stem majors. (I have taken Data Sci at a prestigious school via the comp sci/engineering dept, completed a Data Sci bootcamp and work as a software engineer supporting a Data Sci team, and am a former kindergarten teacher and can professionally attest to the quality of digestibility of the info and the breadth and depth of topics.)
Amazon Verified review
Imran Sarker, Feb 15, 2021 (5 stars)
This book provides a lot of knowledge related to data analysis to its readers. It doesn’t matter what your background is; when you read this book you start walking on the path to becoming a data analyst. Being an analyst, it’s important to be familiar with statistics in order to come up with results and turn your idea into reality. Although new ways are being introduced over time, getting familiar with the background of everything is important. The author has distributed all the topics well so people from any background, non-STEM or STEM, can go through each chapter and learn as much as they can. I would definitely recommend this book to all who want to manage data and find ways to deal with it.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.