Importing data from web pages
We use Beautiful Soup in this recipe to scrape data from a web page and load that data into pandas. Web scraping is very useful when there is data on a website that is updated regularly but there is no API. We can rerun our code to generate new data whenever the page is updated.
Unfortunately, the web scrapers we build can break when the structure of the targeted page changes. That is less likely to happen with APIs because they are designed for data exchange and carefully curated with that end in mind. The priority for most web designers is the quality of the display of information, not the reliability and ease of data exchange. This causes data cleaning challenges unique to web scraping, including HTML elements that house the data sitting in surprising and changing locations, formatting tags that obfuscate the underlying data, and explanatory text that aids data interpretation being difficult to retrieve. In addition to these challenges, scraping presents familiar data cleaning issues, such as changing data types in columns, less-than-ideal headings, and missing values. In this recipe, we deal with the data issues that occur most frequently.
Getting ready
You will need Beautiful Soup installed to run the code in this recipe. You can install it with pip by entering pip install beautifulsoup4 in a terminal window or Windows PowerShell.
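To confirm the installation, you can check the package version from a Python session (the version number you see will depend on your installation):

>>> import bs4
>>> print(bs4.__version__)   # prints the installed Beautiful Soup version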
We will scrape data from a web page, find a table on that page, and load it into a pandas DataFrame.
Note
I created this web page, http://www.alrb.org/datacleaning/covidcaseoutliers.html, based on COVID-19 data for public use from Our World in Data, available at https://ourworldindata.org/coronavirus-source-data.
How to do it…
We scrape the COVID data from the website and do some routine data checks:
- Import the pandas, numpy, json, pprint, requests, and BeautifulSoup libraries:

>>> import pandas as pd
>>> import numpy as np
>>> import json
>>> import pprint
>>> import requests
>>> from bs4 import BeautifulSoup
- Parse the web page and get the header row of the table.

Use Beautiful Soup's find method to get the table we want, and then use find_all to retrieve the th elements nested within that table's thead. Create a list of column labels based on the text of the th elements:

>>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
>>> bs = BeautifulSoup(webpage.text, 'html.parser')
>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> type(theadrows)
<class 'bs4.element.ResultSet'>
>>> labelcols = [j.get_text() for j in theadrows]
>>> labelcols[0] = "rowheadings"
>>> labelcols
['rowheadings', 'Cases', 'Deaths', 'Cases per Million', 'Deaths per Million', 'population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_100k']
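Note

The code above assumes the request succeeded. If you want the scraper to fail loudly rather than parse an error page, one approach (a minimal sketch, using requests' built-in status checking) is:

>>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
>>> webpage.raise_for_status()   # raises requests.exceptions.HTTPError for 4xx/5xx responses
>>> bs = BeautifulSoup(webpage.text, 'html.parser')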
- Get the data from the table cells.

Find all of the table rows for the table we want. For each table row, find the th element and retrieve its text. We will use that text for our row labels. Also, for each row, find all the td elements (the table cells with the data) and save the text from all of them in a list. This gives us datarows, which has all the numeric data in the table. (You can confirm that it matches the table from the web page.) We then insert the corresponding row heading from labelrows at the beginning of each list in datarows:

>>> rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')
>>> datarows = []
>>> labelrows = []
>>> for row in rows:
...   rowlabels = row.find('th').get_text()
...   cells = row.find_all('td', {'class':'data'})
...   if (len(rowlabels)>3):
...     labelrows.append(rowlabels)
...   if (len(cells)>0):
...     cellvalues = [j.get_text() for j in cells]
...     datarows.append(cellvalues)
...
>>> pprint.pprint(datarows[0:2])
[['9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9'],
 ['16,642', '668', '1848', '74', '9,006,400', '107', '44', '45,437', '7.4']]
>>> pprint.pprint(labelrows[0:2])
['Algeria', 'Austria']
>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])
...
>>> pprint.pprint(datarows[0:1])
[['Algeria', '9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9']]
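Note

The insert loop modifies datarows in place. If you prefer not to mutate the lists, an equivalent approach (a sketch that produces the same structure, used in place of the loop) builds new lists with zip:

>>> datarows = [[label] + values for label, values in zip(labelrows, datarows)]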
- Load the data into pandas.

Pass the datarows list to the DataFrame constructor of pandas. Notice that all data is read into pandas with the object data type, and that some values cannot be converted into numeric values in their current form (due to the commas):

>>> totaldeaths = pd.DataFrame(datarows, columns=labelcols)
>>> totaldeaths.head()
  rowheadings    Cases Deaths  ... median_age gdp_per_capita
0     Algeria    9,394    653  ...         29         13,914
1     Austria   16,642    668  ...         44         45,437
2  Bangladesh   47,153    650  ...         28          3,524
3     Belgium   58,381   9467  ...         42         42,659
4      Brazil  514,849  29314  ...         34         14,103
>>> totaldeaths.dtypes
rowheadings               object
Cases                     object
Deaths                    object
Cases per Million         object
Deaths per Million        object
population                object
population_density        object
median_age                object
gdp_per_capita            object
hospital_beds_per_100k    object
dtype: object
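Note

pandas can also parse HTML tables directly with read_html, which requires an additional parser library such as lxml or html5lib. A minimal sketch (it will not reproduce every cleaning step in this recipe, and the resulting column names and types may differ):

>>> dfs = pd.read_html("http://www.alrb.org/datacleaning/covidcaseoutliers.html",
...   attrs={'id':'tblDeaths'}, thousands=',')
>>> totaldeaths = dfs[0]   # read_html returns a list of DataFrames, one per matching table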
- Fix the column names and convert the data to numeric values.

Remove spaces from column names. Remove all non-numeric characters, including the commas, from the columns with whole-number data (str.replace("[^0-9]","",regex=True)) and convert those columns to numeric values, leaving out the rowheadings column; hospital_beds_per_100k, which contains decimal points, is converted to float directly:

>>> totaldeaths.columns = totaldeaths.columns.str.replace(" ", "_").str.lower()
>>> for col in totaldeaths.columns[1:-1]:
...   totaldeaths[col] = totaldeaths[col].\
...     str.replace("[^0-9]","",regex=True).astype('int64')
...
>>> totaldeaths['hospital_beds_per_100k'] = totaldeaths['hospital_beds_per_100k'].astype('float')
>>> totaldeaths.head()
  rowheadings   cases  deaths  ...  median_age  gdp_per_capita
0     Algeria    9394     653  ...          29           13914
1     Austria   16642     668  ...          44           45437
2  Bangladesh   47153     650  ...          28            3524
3     Belgium   58381    9467  ...          42           42659
4      Brazil  514849   29314  ...          34           14103
>>> totaldeaths.dtypes
rowheadings                object
cases                       int64
deaths                      int64
cases_per_million           int64
deaths_per_million          int64
population                  int64
population_density          int64
median_age                  int64
gdp_per_capita              int64
hospital_beds_per_100k    float64
dtype: object
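Note

str.replace followed by astype('int64') will raise an exception if any value still contains stray characters. A more forgiving alternative, used in place of the conversion loop above (a sketch; it sets unparseable values to NaN rather than failing, which also changes the column type to float), is pd.to_numeric with errors='coerce':

>>> for col in totaldeaths.columns[1:-1]:
...   totaldeaths[col] = pd.to_numeric(
...     totaldeaths[col].str.replace("[^0-9]","",regex=True), errors='coerce')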
We have now created a pandas DataFrame from an HTML table.
How it works…
Beautiful Soup is a very useful tool for finding specific HTML elements in a web page and retrieving text from them. You can get one HTML element with find and get one or more with find_all. The first argument for both find and find_all is the HTML element to get. The second argument takes a Python dictionary of attributes. You can retrieve text from all of the HTML elements you find with get_text.
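These methods are easy to try on a small HTML fragment. For example (a toy snippet, not from the recipe's web page):

>>> snippet = BeautifulSoup("<table><tr><td class='data'>5</td></tr></table>", 'html.parser')
>>> snippet.find('td', {'class':'data'}).get_text()
'5'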
Some amount of looping is usually necessary to process the elements and text, as in steps 2 and 3. These two statements in step 2 are fairly typical:
>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> labelcols = [j.get_text() for j in theadrows]
The first statement finds all the th elements we want and creates a Beautiful Soup result set called theadrows from the elements it found. The second statement iterates over the theadrows result set, using the get_text method to get the text from each element, and stores it in the labelcols list.
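A result set behaves much like a Python list, so you can also index into it directly. Given the output we saw in step 2:

>>> len(theadrows)
10
>>> theadrows[1].get_text()
'Cases'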
Step 3 is a little more involved, but makes use of the same Beautiful Soup methods. We find all of the table rows (tr) in the target table (rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')). We then iterate over each of those rows, finding the th element and getting the text in that element (rowlabels = row.find('th').get_text()). We also find all of the table cells (td) for each row (cells = row.find_all('td', {'class':'data'})) and get the text from all table cells (cellvalues = [j.get_text() for j in cells]). Note that this code is dependent on the class of the td elements being data. Finally, we insert the row labels we get from the th elements at the beginning of each list in datarows:
>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])
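If the page structure changed, for instance, if a row lacked a th element, row.find('th') would return None and get_text would raise an AttributeError. A slightly more defensive version of the row loop (a sketch, assuming you want to skip malformed rows) is:

>>> for row in rows:
...   th = row.find('th')
...   if th is None:          # skip rows without a row heading
...     continue
...   rowlabels = th.get_text()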
In step 4, we use the DataFrame constructor to load the list we created in steps 2 and 3 into pandas. We then do some cleaning similar to what we have done in previous recipes in this chapter. We use string replace to remove spaces from column names and to remove all non-numeric characters, including commas, from what are otherwise valid numeric values. We convert all columns, except for the rowheadings column, to numeric.
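The effect of the replace-and-convert step is easy to see on a small example:

>>> pd.Series(['9,394', '514,849']).str.replace("[^0-9]", "", regex=True).astype('int64')
0      9394
1    514849
dtype: int64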
There's more…
Our scraping code is dependent on several aspects of the web page's structure not changing: the ID of the main table, the presence of th tags with column and row labels, and the td elements continuing to have their class equal to data. The good news is that if the structure of the web page does change, this will likely only affect the find and find_all calls. The rest of the code would not need to change.
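One way to make such breakage easier to diagnose is to check the result of find before using it, since find returns None when no match is found. A minimal sketch:

>>> table = bs.find('table', {'id':'tblDeaths'})
>>> if table is None:
...   raise ValueError("table with id 'tblDeaths' not found; the page structure may have changed")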