Interacting with Wikipedia articles
Since the source of our content for this example is going to be a Wikipedia article, it’s important to understand how to gather the data. Wikipedia has an API endpoint (https://en.wikipedia.org/w/api.php) that returns the page content as JSON, as shown in Figure 7.4:
Figure 7.4 – Viewing the JSON output of a Wikipedia article
The link is constructed using the following components:
- API URL: https://en.wikipedia.org/w/api.php
- Action parameter (action): query
- Format parameter (format): json
- Titles parameter (titles): The title of the Wikipedia article
- Properties parameter (prop): extracts (the full content of the article)
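As a rough illustration of how these components fit together, the following Python sketch assembles the link from the parameters listed above. The function name build_extract_url and the article title "Alan Turing" are hypothetical, chosen only for this example. It builds the URL without sending a request.

```python
from urllib.parse import urlencode

# API URL taken from the component list above.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build the query URL for one article (hypothetical helper name)."""
    params = {
        "action": "query",    # Action parameter
        "format": "json",     # Format parameter
        "titles": title,      # Titles parameter: the article title
        "prop": "extracts",   # Properties parameter: the article content
    }
    # urlencode escapes spaces and special characters in the title.
    return f"{API_URL}?{urlencode(params)}"


# "Alan Turing" is only an example title.
print(build_extract_url("Alan Turing"))
# https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Alan+Turing&prop=extracts
```

Pasting the printed link into a browser should return JSON similar to what Figure 7.4 shows.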
The data that we’ll be using is in the extract node, nested inside the query JSON object (under the page’s entry in its pages node).
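To make the nesting concrete, here is a minimal Python sketch that pulls the extract out of a response. The sample payload is made up for illustration (the page ID, title, and text are not real data), and get_extract is a hypothetical helper name. It assumes the default response layout, in which the query object holds a pages object keyed by page ID.

```python
import json

# Made-up sample shaped like the API's default JSON output.
sample_response = json.loads("""
{
  "query": {
    "pages": {
      "12345": {
        "pageid": 12345,
        "title": "Example article",
        "extract": "<h2>History</h2><p>Example text.</p>"
      }
    }
  }
}
""")


def get_extract(response: dict) -> str:
    """Return the extract of the first page in the response, or ''."""
    pages = response.get("query", {}).get("pages", {})
    # The key is a page ID we don't know in advance, so take the first value.
    page = next(iter(pages.values()), {})
    return page.get("extract", "")


print(get_extract(sample_response))
```

Taking the first value avoids hard-coding the page ID, which differs for every article.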
When reviewing Wikipedia articles, you might notice that the headings are a mix of heading level 2 (<H2>) tags, which are used as main topic headings, and heading level 3 ...