Importing data from web pages

We use Beautiful Soup in this recipe to scrape data from a web page and load that data into pandas. Web scraping is very useful when a website's data is updated regularly but there is no API. We can rerun our code to generate new data whenever the page is updated.

Unfortunately, the web scrapers we build can break when the structure of the targeted page changes. That is less likely to happen with APIs, because they are designed for data exchange and carefully curated with that end in mind. The priority for most web designers is the quality of the display of information, not the reliability and ease of data exchange. This creates data cleaning challenges unique to web scraping: the HTML elements that house the data can be in surprising and changing locations, formatting tags can obfuscate the underlying data, and explanatory text that aids data interpretation can be difficult to retrieve. In addition to these challenges, scraping presents familiar data cleaning issues, such as changing data types in columns, less-than-ideal headings, and missing values. In this recipe, we deal with the data issues that occur most frequently.

Getting ready

You will need Beautiful Soup installed to run the code in this recipe. You can install it with pip by entering pip install beautifulsoup4 in a terminal window or Windows PowerShell.
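
To confirm that the installation succeeded (and that the requests library, which this recipe also uses, is available), you can try the imports at a Python prompt; if neither import raises an error, you are ready to go. This quick check is just a convenience and not part of the original recipe:

>>> import bs4        # Beautiful Soup 4
>>> import requests   # used in this recipe to download the web page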

We will scrape data from a web page, find the following table in that page, and load it into a pandas DataFrame:

Figure 2.1 – COVID-19 data from six countries

Note

I created this web page, http://www.alrb.org/datacleaning/covidcaseoutliers.html, based on COVID-19 data for public use from Our World in Data, available at https://ourworldindata.org/coronavirus-source-data.

How to do it…

We scrape the COVID data from the website and do some routine data checks:

  1. Import the pandas, numpy, json, pprint, and requests libraries, along with BeautifulSoup:
    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> import requests
    >>> from bs4 import BeautifulSoup
  2. Parse the web page and get the header row of the table.

    Use Beautiful Soup's find method to get the table we want, and then use find_all to retrieve the th elements nested within that table's thead. Create a list of column labels from the text of those th elements:

    >>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
    >>> bs = BeautifulSoup(webpage.text, 'html.parser')
    >>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
    >>> type(theadrows)
    <class 'bs4.element.ResultSet'>
    >>> labelcols = [j.get_text() for j in theadrows]
    >>> labelcols[0] = "rowheadings"
    >>> labelcols
    ['rowheadings', 'Cases', 'Deaths', 'Cases per Million', 'Deaths per Million', 'population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_100k']
  3. Get the data from the table cells.

    Find all of the table rows for the table we want. For each table row, find the th element and retrieve its text; we will use that text for our row labels. Also, for each row, find all the td elements (the table cells with the data) and save the text from all of them in a list. This gives us datarows, which has all the numeric data in the table. (You can confirm that it matches the table from the web page.) We then insert each row heading from the labelrows list at the beginning of the corresponding list in datarows:

    >>> rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')
    >>> datarows = []
    >>> labelrows = []
    >>> for row in rows:
    ...   rowlabels = row.find('th').get_text()
    ...   cells = row.find_all('td', {'class':'data'})
    ...   if (len(rowlabels)>3):
    ...     labelrows.append(rowlabels)
    ...   if (len(cells)>0):
    ...     cellvalues = [j.get_text() for j in cells]
    ...     datarows.append(cellvalues)
    ... 
    >>> pprint.pprint(datarows[0:2])
    [['9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9'],
     ['16,642', '668', '1848', '74', '9,006,400', '107', '44', '45,437', '7.4']]
    >>> pprint.pprint(labelrows[0:2])
    ['Algeria', 'Austria']
    >>> 
    >>> for i in range(len(datarows)):
    ...   datarows[i].insert(0, labelrows[i])
    ... 
    >>> pprint.pprint(datarows[0:1])
    [['Algeria','9,394','653','214','15','43,851,043','17','29','13,914','1.9']]
  4. Load the data into pandas.

    Pass the datarows list to the DataFrame method of pandas. Notice that all data is read into pandas with the object data type, and that some data has values that cannot be converted into numeric values in their current form (due to the commas):

    >>> totaldeaths = pd.DataFrame(datarows, columns=labelcols)
    >>> totaldeaths.head()
      rowheadings    Cases Deaths  ... median_age gdp_per_capita  \
    0     Algeria    9,394    653  ...         29         13,914   
    1     Austria   16,642    668  ...         44         45,437   
    2  Bangladesh   47,153    650  ...         28          3,524   
    3     Belgium   58,381   9467  ...         42         42,659   
    4      Brazil  514,849  29314  ...         34         14,103   
    >>> totaldeaths.dtypes
    rowheadings               object
    Cases                     object
    Deaths                    object
    Cases per Million         object
    Deaths per Million        object
    population                object
    population_density        object
    median_age                object
    gdp_per_capita            object
    hospital_beds_per_100k    object
    dtype: object
  5. Fix the column names and convert the data to numeric values.

    Replace the spaces in the column names with underscores and convert the names to lowercase. For each column of numeric data, remove all non-numeric characters, including the commas (str.replace("[^0-9]","")), and convert the values to integers; convert the hospital_beds_per_100k column to float separately, since it contains decimal values. Leave the rowheadings column as text:

    >>> totaldeaths.columns = totaldeaths.columns.str.replace(" ", "_").str.lower()
    >>> for col in totaldeaths.columns[1:-1]:
    ...   totaldeaths[col] = totaldeaths[col].\
    ...     str.replace("[^0-9]","").astype('int64')
    ... 
    >>> totaldeaths['hospital_beds_per_100k'] = totaldeaths['hospital_beds_per_100k'].astype('float')
    >>> totaldeaths.head()
      rowheadings   cases  deaths  ...  median_age  gdp_per_capita  \
    0     Algeria    9394     653  ...          29           13914   
    1     Austria   16642     668  ...          44           45437   
    2  Bangladesh   47153     650  ...          28            3524   
    3     Belgium   58381    9467  ...          42           42659   
    4      Brazil  514849   29314  ...          34           14103   
    >>> totaldeaths.dtypes
    rowheadings                object
    cases                       int64
    deaths                      int64
    cases_per_million           int64
    deaths_per_million          int64
    population                  int64
    population_density          int64
    median_age                  int64
    gdp_per_capita              int64
    hospital_beds_per_100k    float64
    dtype: object

We have now created a pandas DataFrame from an HTML table.

How it works…

Beautiful Soup is a very useful tool for finding specific HTML elements in a web page and retrieving text from them. You can get one HTML element with find and get one or more with find_all. The first argument for both find and find_all is the name of the HTML tag to look for. The second argument takes a Python dictionary of attributes. You can retrieve text from all of the HTML elements you find with get_text.
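
As a minimal illustration (using a made-up HTML snippet rather than the recipe's web page), the following shows find, find_all, an attribute dictionary, and get_text working together:

>>> from bs4 import BeautifulSoup
>>> html = "<table id='t1'><tr><th>country</th><td class='data'>10</td></tr></table>"
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('table', {'id':'t1'}).find('th').get_text()
'country'
>>> [cell.get_text() for cell in soup.find_all('td', {'class':'data'})]
['10']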

Some amount of looping is usually necessary to process the elements and text, as in steps 2 and 3. These two statements in step 2 are fairly typical:

>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> labelcols = [j.get_text() for j in theadrows]

The first statement finds all the th elements we want and creates a Beautiful Soup result set called theadrows from the elements it found. The second statement iterates over the theadrows Beautiful Soup result set using the get_text method to get the text from each element, and stores it in the labelcols list.

Step 3 is a little more involved, but makes use of the same Beautiful Soup methods. We find all of the table rows (tr) in the target table (rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')). We then iterate over each of those rows, finding the th element and getting the text in that element (rowlabels = row.find('th').get_text()). We also find all of the table cells (td) for each row (cells = row.find_all('td', {'class':'data'})) and get the text from all of them (cellvalues = [j.get_text() for j in cells]). Note that this code is dependent on the class of the td elements being data. Finally, we insert the row labels we get from the th elements at the beginning of each list in datarows:

>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])

In step 4, we use the DataFrame method to load the list we created in steps 2 and 3 into pandas. We then do some cleaning similar to what we have done in previous recipes in this chapter. We use string replace to swap the spaces in column names for underscores and to remove all non-numeric characters, including commas, from what are otherwise valid numeric values. We convert all columns, except for the rowheadings column, to numeric.
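
To see the conversion pattern in isolation, here is the same str.replace logic applied to a small, made-up Series; note that recent versions of pandas require regex=True for pattern-based replacement:

>>> import pandas as pd
>>> cases = pd.Series(['9,394', '16,642'])
>>> cases.str.replace("[^0-9]", "", regex=True).astype('int64')
0     9394
1    16642
dtype: int64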

There's more…

Our scraping code is dependent on several aspects of the web page's structure not changing: the ID of the main table, the presence of th tags with column and row labels, and the td elements continuing to have their class equal to data. The good news is that if the structure of the web page does change, this will likely only affect the find and find_all calls. The rest of the code would not need to change.
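
One simple safeguard, shown here only as a sketch and not part of the original recipe, is to fail fast with a clear message when the table cannot be found, so that a structural change on the page surfaces as an obvious error rather than a confusing AttributeError further down in the code:

>>> table = bs.find('table', {'id':'tblDeaths'})
>>> if table is None:
...   raise ValueError("table with id 'tblDeaths' not found -- the page structure may have changed")
... 
>>> theadrows = table.thead.find_all('th')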

You have been reading a chapter from Python Data Cleaning Cookbook (Packt, December 2020, ISBN-13 9781800565661).