Importing data from web pages
We use Beautiful Soup in this recipe to scrape data from a web page and load that data into pandas. Web scraping is very useful when there is data on a website that is updated regularly but there is no API. We can rerun our code to generate new data whenever the page is updated.
Unfortunately, the web scrapers we build can break when the structure of the targeted page changes. That is less likely to happen with APIs because they are designed for data exchange and carefully curated with that end in mind. The priority for most web designers is the quality of the display of information, not the reliability and ease of data exchange. This causes data cleaning challenges unique to web scraping, including HTML elements that house the data sitting in surprising and changing locations, formatting tags that obfuscate the underlying data, and explanatory text that aids data interpretation being difficult to retrieve. In addition to these challenges, scraping presents familiar data cleaning issues, such as changing data types in columns, less-than-ideal headings, and missing values. In this recipe, we deal with the data issues that occur most frequently.
Getting ready
You will need Beautiful Soup installed to run the code in this recipe. You can install it with pip by entering pip install beautifulsoup4 in a terminal window or Windows PowerShell.
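To confirm the installation, you can check the package version from a Python session (the version number you see will depend on your installation):

>>> import bs4
>>> print(bs4.__version__)   # prints the installed Beautiful Soup version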
We will scrape data from a web page, find a table on that page, and load it into a pandas DataFrame.
Note
I created this web page, http://www.alrb.org/datacleaning/covidcaseoutliers.html, based on COVID-19 data for public use from Our World in Data, available at https://ourworldindata.org/coronavirus-source-data.
How to do it…
We scrape the COVID data from the website and do some routine data checks:
- Import the pandas, numpy, json, pprint, requests, and BeautifulSoup libraries:

>>> import pandas as pd
>>> import numpy as np
>>> import json
>>> import pprint
>>> import requests
>>> from bs4 import BeautifulSoup
- Parse the web page and get the header row of the table.

Use Beautiful Soup's find method to get the table we want, and then use find_all to retrieve the th elements nested within that table's thead. Create a list of column labels based on the text of the th elements:

>>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
>>> bs = BeautifulSoup(webpage.text, 'html.parser')
>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> type(theadrows)
<class 'bs4.element.ResultSet'>
>>> labelcols = [j.get_text() for j in theadrows]
>>> labelcols[0] = "rowheadings"
>>> labelcols
['rowheadings', 'Cases', 'Deaths', 'Cases per Million', 'Deaths per Million', 'population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_100k']
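Note

The code above assumes the request succeeded. If you want the scraper to fail loudly rather than parse an error page, one approach (a minimal sketch, using requests' built-in status checking) is:

>>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
>>> webpage.raise_for_status()   # raises requests.exceptions.HTTPError for 4xx/5xx responses
>>> bs = BeautifulSoup(webpage.text, 'html.parser')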
- Get the data from the table cells.

Find all of the table rows for the table we want. For each table row, find the th element and retrieve its text. We will use that text for our row labels. Also, for each row, find all the td elements (the table cells with the data) and save the text from all of them in a list. This gives us datarows, which has all the numeric data in the table. (You can confirm that it matches the table from the web page.) We then insert the corresponding row heading from labelrows at the beginning of each list in datarows:

>>> rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')
>>> datarows = []
>>> labelrows = []
>>> for row in rows:
...   rowlabels = row.find('th').get_text()
...   cells = row.find_all('td', {'class':'data'})
...   if (len(rowlabels)>3):
...     labelrows.append(rowlabels)
...   if (len(cells)>0):
...     cellvalues = [j.get_text() for j in cells]
...     datarows.append(cellvalues)
...
>>> pprint.pprint(datarows[0:2])
[['9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9'],
 ['16,642', '668', '1848', '74', '9,006,400', '107', '44', '45,437', '7.4']]
>>> pprint.pprint(labelrows[0:2])
['Algeria', 'Austria']
>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])
...
>>> pprint.pprint(datarows[0:1])
[['Algeria', '9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9']]
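Note

The insert loop modifies datarows in place. If you prefer not to mutate the lists, an equivalent approach (a sketch that produces the same structure, used in place of the loop) builds new lists with zip:

>>> datarows = [[label] + values for label, values in zip(labelrows, datarows)]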
- Load the data into pandas.

Pass the datarows list to the DataFrame constructor of pandas. Notice that all data is read into pandas with the object data type, and that some values cannot be converted into numeric values in their current form (due to the commas):

>>> totaldeaths = pd.DataFrame(datarows, columns=labelcols)
>>> totaldeaths.head()
  rowheadings    Cases Deaths  ... median_age gdp_per_capita
0     Algeria    9,394    653  ...         29         13,914
1     Austria   16,642    668  ...         44         45,437
2  Bangladesh   47,153    650  ...         28          3,524
3     Belgium   58,381   9467  ...         42         42,659
4      Brazil  514,849  29314  ...         34         14,103
>>> totaldeaths.dtypes
rowheadings               object
Cases                     object
Deaths                    object
Cases per Million         object
Deaths per Million        object
population                object
population_density        object
median_age                object
gdp_per_capita            object
hospital_beds_per_100k    object
dtype: object
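Note

pandas can also parse HTML tables directly with read_html, which requires an additional parser library such as lxml or html5lib. A minimal sketch (it will not reproduce every cleaning step in this recipe, and the resulting column names and types may differ):

>>> dfs = pd.read_html("http://www.alrb.org/datacleaning/covidcaseoutliers.html",
...   attrs={'id':'tblDeaths'}, thousands=',')
>>> totaldeaths = dfs[0]   # read_html returns a list of DataFrames, one per matching table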
- Fix the column names and convert the data to numeric values.

Remove spaces from column names. Remove all non-numeric characters, including the commas, from the columns with whole-number data (str.replace("[^0-9]","",regex=True)) and convert those columns to numeric values, leaving out the rowheadings column; hospital_beds_per_100k, which contains decimal points, is converted to float directly:

>>> totaldeaths.columns = totaldeaths.columns.str.replace(" ", "_").str.lower()
>>> for col in totaldeaths.columns[1:-1]:
...   totaldeaths[col] = totaldeaths[col].\
...     str.replace("[^0-9]","",regex=True).astype('int64')
...
>>> totaldeaths['hospital_beds_per_100k'] = totaldeaths['hospital_beds_per_100k'].astype('float')
>>> totaldeaths.head()
  rowheadings   cases  deaths  ...  median_age  gdp_per_capita
0     Algeria    9394     653  ...          29           13914
1     Austria   16642     668  ...          44           45437
2  Bangladesh   47153     650  ...          28            3524
3     Belgium   58381    9467  ...          42           42659
4      Brazil  514849   29314  ...          34           14103
>>> totaldeaths.dtypes
rowheadings                object
cases                       int64
deaths                      int64
cases_per_million           int64
deaths_per_million          int64
population                  int64
population_density          int64
median_age                  int64
gdp_per_capita              int64
hospital_beds_per_100k    float64
dtype: object
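Note

str.replace followed by astype('int64') will raise an exception if any value still contains stray characters. A more forgiving alternative, used in place of the conversion loop above (a sketch; it sets unparseable values to NaN rather than failing, which also changes the column type to float), is pd.to_numeric with errors='coerce':

>>> for col in totaldeaths.columns[1:-1]:
...   totaldeaths[col] = pd.to_numeric(
...     totaldeaths[col].str.replace("[^0-9]","",regex=True), errors='coerce')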
We have now created a pandas DataFrame from an HTML table.
How it works…
Beautiful Soup is a very useful tool for finding specific HTML elements in a web page and retrieving text from them. You can get one HTML element with find and get one or more with find_all. The first argument for both find and find_all is the HTML element to get. The second argument takes a Python dictionary of attributes. You can retrieve text from all of the HTML elements you find with get_text.
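These methods are easy to try on a small HTML fragment. For example (a toy snippet, not from the recipe's web page):

>>> snippet = BeautifulSoup("<table><tr><td class='data'>5</td></tr></table>", 'html.parser')
>>> snippet.find('td', {'class':'data'}).get_text()
'5'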
Some amount of looping is usually necessary to process the elements and text, as in steps 2 and 3. These two statements in step 2 are fairly typical:
>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> labelcols = [j.get_text() for j in theadrows]
The first statement finds all the th elements we want and creates a Beautiful Soup result set called theadrows from the elements it found. The second statement iterates over the theadrows result set, using the get_text method to get the text from each element, and stores it in the labelcols list.
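A result set behaves much like a Python list, so you can also index into it directly. Given the output we saw in step 2:

>>> len(theadrows)
10
>>> theadrows[1].get_text()
'Cases'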
Step 3 is a little more involved, but makes use of the same Beautiful Soup methods. We find all of the table rows (tr) in the target table (rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')). We then iterate over each of those rows, finding the th element and getting the text in that element (rowlabels = row.find('th').get_text()). We also find all of the table cells (td) for each row (cells = row.find_all('td', {'class':'data'})) and get the text from all table cells (cellvalues = [j.get_text() for j in cells]). Note that this code is dependent on the class of the td elements being data. Finally, we insert the row labels we get from the th elements at the beginning of each list in datarows:
>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])
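If the page structure changed, for instance, if a row lacked a th element, row.find('th') would return None and get_text would raise an AttributeError. A slightly more defensive version of the row loop (a sketch, assuming you want to skip malformed rows) is:

>>> for row in rows:
...   th = row.find('th')
...   if th is None:          # skip rows without a row heading
...     continue
...   rowlabels = th.get_text()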
In step 4, we use the DataFrame constructor to load the list we created in steps 2 and 3 into pandas. We then do some cleaning similar to what we have done in previous recipes in this chapter. We use string replace to remove spaces from column names and to remove all non-numeric characters, including commas, from what are otherwise valid numeric values. We convert all columns, except for the rowheadings column, to numeric.
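The effect of the replace-and-convert step is easy to see on a small example:

>>> pd.Series(['9,394', '514,849']).str.replace("[^0-9]", "", regex=True).astype('int64')
0      9394
1    514849
dtype: int64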
There's more…
Our scraping code is dependent on several aspects of the web page's structure not changing: the ID of the main table, the presence of th tags with column and row labels, and the td elements continuing to have their class equal to data. The good news is that if the structure of the web page does change, this will likely only affect the find and find_all calls. The rest of the code would not need to change.
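One way to make such breakage easier to diagnose is to check the result of find before using it, since find returns None when no match is found. A minimal sketch:

>>> table = bs.find('table', {'id':'tblDeaths'})
>>> if table is None:
...   raise ValueError("table with id 'tblDeaths' not found; the page structure may have changed")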