Collecting data from various data sources
There are three major ways to collect data. It is crucial to keep in mind that data doesn't always come as well-formatted tables:
- Obtaining structured tabulated data directly: For example, the Federal Reserve (https://www.federalreserve.gov/data.htm) releases well-structured and well-documented data in various formats, including CSV, so that pandas can read the file into a DataFrame format.
- Requesting data from an API: For example, the Google Maps API (https://developers.google.com/maps/documentation) allows developers to request data at a rate capped by their pricing plan. The returned format is usually JSON or XML.
- Building a dataset from scratch: For example, social scientists often perform surveys and collect participants' answers to build proprietary data.
Let's look at some examples involving these three approaches. You will use the UCI Machine Learning Repository, the Google Maps Places API, and the USC President's Office web page as data sources, respectively.
Reading data directly from files
Reading data from local files, or from remote files through a URL, usually requires a good source of publicly accessible data archives. For example, the University of California, Irvine maintains the UCI Machine Learning Repository. We will be reading the heart disease dataset with pandas. The latest URL will be updated in the book's official GitHub repository in case the following code fails. You may obtain the file from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. From the datasets there, we are using the processed.hungarian.data file. You need to upload the file to the same folder where the notebook resides.
The following code snippet reads the data and displays the first several rows of the dataset:
import pandas as pd

df = pd.read_csv("processed.hungarian.data",
                 sep=",",
                 names=["age", "sex", "cp", "trestbps",
                        "chol", "fbs", "restecg", "thalach",
                        "exang", "oldpeak", "slope", "ca",
                        "thal", "num"])
df.head()
Calling df.head() displays the first few rows of the DataFrame.
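If you prefer not to download the file manually, pandas can also read it straight from the remote URL. Here is a minimal sketch, assuming the UCI URL above is still live:
import pandas as pd

# Read the same file directly from the UCI repository URL
# (assumption: the URL has not changed since the time of writing)
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.hungarian.data")
df = pd.read_csv(url,
                 sep=",",
                 names=["age", "sex", "cp", "trestbps",
                        "chol", "fbs", "restecg", "thalach",
                        "exang", "oldpeak", "slope", "ca",
                        "thal", "num"])
df.head()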
In the following section, you will learn how to obtain data from an API.
Obtaining data from an API
In plain English, an Application Programming Interface (API) defines the protocols, agreements, or contracts between applications or parts of applications. You send requests to an API and receive data in JSON or another format specified in the API documentation. You can then extract the data you want.
Note
When working with an API, you need to follow the guidelines and restrictions regarding its usage. Improper usage of an API may result in the suspension of your account or even legal issues.
Let's take the Google Maps Places API as an example. The Places API (https://developers.google.com/places/web-service/intro) is one of the many Google Maps APIs that Google offers. Developers can use HTTP requests to obtain information about geographic locations, the opening hours of establishments, and the types of establishments, such as schools, government offices, and police stations.
Using external APIs
Like many APIs, the Google Maps Places API requires you to create an account on its platform, Google Cloud Platform. Creating an account is free, but some of its services require a credit card on file, so keep an eye on your usage to avoid being charged unexpectedly.
After obtaining and activating the API credentials, the developer can build standard HTTP requests to query the endpoints. For example, the textsearch endpoint is used to query places based on text. Here, you will use the API to query information about libraries in Culver City, Los Angeles:
- First, let's import the necessary libraries:
import requests
import json
- Initialize the API key and endpoint. You need to replace API_KEY with a real API key to make the code work:
API_KEY = "Your API key goes here"
TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json?"
query = "Culver City Library"
- Obtain the response and parse the returned data into JSON format. Let's examine it:
response = requests.get(TEXT_SEARCH_URL + 'query=' + query + '&key=' + API_KEY)
json_object = response.json()
print(json_object)
The following is a one-result response. If the query matched more than one place, the results field would contain multiple entries, which you can index like a normal Python list object:
{'html_attributions': [],
 'results': [{'formatted_address': '4975 Overland Ave, Culver City, CA 90230, United States',
   'geometry': {'location': {'lat': 34.0075635, 'lng': -118.3969651},
    'viewport': {'northeast': {'lat': 34.00909257989272,
      'lng': -118.3955611701073},
     'southwest': {'lat': 34.00639292010727, 'lng': -118.3982608298927}}},
   'icon': 'https://maps.gstatic.com/mapfiles/place_api/icons/civic_building-71.png',
   'id': 'ccdd10b4f04fb117909897264c78ace0fa45c771',
   'name': 'Culver City Julian Dixon Library',
   'opening_hours': {'open_now': True},
   'photos': [{'height': 3024,
     'html_attributions': ['<a href="https://maps.google.com/maps/contrib/102344423129359752463">Khaled Alabed</a>'],
     'photo_reference': 'CmRaAAAANT4Td01h1tkI7dTn35vAkZhx_-mg3PjgKvjHiyh80M5UlI3wVw1cer4vkOksYR68NM9aw33ZPYGQzzXTE8bkOwQYuSChXAWlJUtz8atPhmRht4hP4dwFgqfbJULmG5f1EhAfWlF_cpLz76sD_81fns1OGhT4KU-zWTbuNY54_4_XozE02pLNWw',
     'width': 4032}],
   'place_id': 'ChIJrUqREx-6woARFrQdyscOZ-8',
   'plus_code': {'compound_code': '2J53+26 Culver City, California',
    'global_code': '85632J53+26'},
   'rating': 4.2,
   'reference': 'ChIJrUqREx-6woARFrQdyscOZ-8',
   'types': ['library', 'point_of_interest', 'establishment'],
   'user_ratings_total': 49}],
 'status': 'OK'}
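Before extracting specific fields, it can help to confirm that the request succeeded. Here is a minimal check based on the status field shown in the response above:
# Minimal sanity check on the response returned by the Places API
if json_object.get("status") == "OK":
    print(len(json_object["results"]), "result(s) returned")
else:
    print("Request failed with status:", json_object.get("status"))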
The address and name of the library can be obtained as follows:
print(json_object["results"][0]["formatted_address"])
print(json_object["results"][0]["name"])
The result reads as follows:
4975 Overland Ave, Culver City, CA 90230, United States
Culver City Julian Dixon Library
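When a query matches more than one place, the results list can be looped over like any other Python list. Here is a short sketch, reusing json_object from the query above:
# Print the name and address of every place in the results list
for place in json_object["results"]:
    print(place["name"], "-", place["formatted_address"])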
Information
An API can be especially helpful for data augmentation. For example, if you have a list of addresses that are corrupted or mislabeled, the Google Maps API may help you correct the wrong entries.
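As a rough illustration of this idea, the following sketch sends each entry of a hypothetical list of corrupted addresses through the same textsearch endpoint and keeps the formatted_address field that Google returns. The dirty_addresses list is made up for illustration, and the request reuses TEXT_SEARCH_URL and API_KEY from the earlier steps:
# Hypothetical example: normalize corrupted addresses via the textsearch endpoint
dirty_addresses = ["4975 overland av culver city",
                   "9770 culver blvd culvercity ca"]

cleaned = []
for address in dirty_addresses:
    resp = requests.get(TEXT_SEARCH_URL + 'query=' + address + '&key=' + API_KEY)
    results = resp.json().get("results", [])
    # Keep Google's formatted address if a match was found; otherwise keep the original
    cleaned.append(results[0]["formatted_address"] if results else address)

print(cleaned)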
Obtaining data from scratch
There are instances where you would need to build your own dataset from scratch.
One way of building a dataset is to crawl and parse the web. Many resources on the internet are public and free to use. Google's spiders crawl the web relentlessly, 24/7, to keep its search results up to date. You can write your own code to gather information online instead of opening a web browser and doing it manually.
Conducting surveys and collecting feedback, whether explicit or implicit, is another way to obtain proprietary data. Companies such as Google and Amazon gather enormous amounts of data through user profiling. Such data forms the core of their dominance in advertising and e-commerce. We won't be covering this method, however.
Legal issues of crawling
Note that in some cases, web crawling is highly controversial. Before crawling a website, check its user agreement; some websites explicitly forbid web crawling. Even when a website allows crawling, intensive requests may dramatically slow it down and prevent it from serving other users normally. Respecting a site's policy is not only a courtesy; in some jurisdictions it is also a legal requirement.
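If you do crawl several pages, one simple courtesy is to pause between requests so that you don't flood the server. Here is a minimal sketch; the URLs are placeholders:
import time
import requests

# Placeholder list of pages to crawl politely
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url)
    # ... parse response.text here ...
    time.sleep(2)  # wait a couple of seconds between requests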
Here is a simple example that uses a regular expression to obtain all the phone numbers from the web page of the president's office at the University of Southern California, http://departmentsdirectory.usc.edu/pres_off.html:
- First, let's import the necessary libraries. re is Python's built-in regular expression library, and requests is an HTTP client that lets us communicate with web servers over HTTP:
import re
import requests
- If you look at the web page, you will notice a pattern in the phone numbers: each starts with three digits, followed by a hyphen and then four digits. Our objective now is to compile such a pattern:
pattern = re.compile(r"\d{3}-\d{4}")
- The next step is to send the GET request and obtain the response:
response = requests.get("http://departmentsdirectory.usc.edu/pres_off.html")
- The text attribute of response holds the page content as a string and can be fed to the findall method:
pattern.findall(response.text)
The results contain all the phone numbers on the web page:
['740-2111', '821-1342', '740-2111', '740-2111', '740-2111', '740-2111', '740-2111', '740-2111', '740-9749', '740-2505', '740-6942', '821-1340', '821-6292']
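If you also wanted the area codes, a slightly broader pattern could be tried. This is only a sketch; it assumes that some numbers on the page are written in a form such as 213-740-2111 or (213) 740-2111, which the output above does not guarantee:
# Assumption: full numbers may appear as 213-740-2111 or (213) 740-2111
full_pattern = re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")
full_pattern.findall(response.text)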
In this section, we introduced three different ways of collecting data: reading tabulated data from files provided by others, obtaining data from APIs, and building data from scratch. In the rest of the book, we will focus on the first option and mainly use data collected from the UCI Machine Learning Repository. In most cases, API data and scraped data are eventually integrated into tabulated datasets for production usage.