HTML
You can use pandas to read HTML tables from websites. This makes it easy to ingest tables such as those found on Wikipedia.
In this recipe, we will scrape tables from the Wikipedia entry for The Beatles Discography (https://en.wikipedia.org/wiki/The_Beatles_discography). In particular, we want to scrape the table shown in the following image, which reflects the page as it appeared on Wikipedia in 2024:
Figure 4.4: Wikipedia page for The Beatles Discography
Before attempting to read HTML, users will need to install a third-party library. For the examples in this section, we will use lxml:
python -m pip install lxml
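If you are unsure whether the installation succeeded, a quick check along the following lines can help. This is only a sketch using the standard library; the variable name lxml_available is our own and not part of pandas or lxml:

```python
import importlib.util

# find_spec returns None when the package cannot be located in the
# current environment, without actually importing it.
lxml_available = importlib.util.find_spec("lxml") is not None

if lxml_available:
    print("lxml is installed")
else:
    print("lxml is missing; run: python -m pip install lxml")
```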
How to do it
pd.read_html allows you to read a table from a website:
url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
dfs = pd.read_html(url, dtype_backend="numpy_nullable")
len(dfs)
60
Contrary to the other I/O methods we have seen so far, pd.read_html doesn’t return a pd.DataFrame but, instead, returns a list of pd.DataFrame objects...
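To see the list-returning behavior without depending on the live Wikipedia page, the following sketch parses a small, made-up HTML snippet containing two tables. The markup, the album and single rows, and the variable names are illustrative assumptions, not content taken from the page above. It also assumes pandas 2.0 or later (for dtype_backend) and an installed parser such as lxml:

```python
import io

import pandas as pd

# Hypothetical two-table document, inlined so that the example
# runs without network access.
html = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Please Please Me</td><td>1963</td></tr>
  <tr><td>Abbey Road</td><td>1969</td></tr>
</table>
<table>
  <tr><th>Single</th><th>Peak</th></tr>
  <tr><td>Hey Jude</td><td>1</td></tr>
</table>
"""

# Wrap the literal in StringIO so pandas treats it as a file-like
# object rather than as a URL or file path.
dfs = pd.read_html(io.StringIO(html), dtype_backend="numpy_nullable")

print(type(dfs))  # a plain Python list, one entry per <table>
print(len(dfs))

# Individual tables are selected by position in the list.
albums = dfs[0]
print(albums)
```

Because the result is an ordinary list, the usual tools apply: you can index it, slice it, or loop over it to inspect each table's shape before deciding which one you need.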