Three approaches to scraping a web page
Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.
Regular expressions
If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html.
To scrape the area using regular expressions, we will first try matching the contents of the <td> element, as follows:
>>> import re
>>> url = 'http://example.webscraping.com/view/UnitedKingdom-239'
>>> html = download(url)
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />',
 '244,820 square kilometres',
 '62,348,447',
 'GB',
 'United Kingdom&apos...
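Since only the area value is needed, one option is to select that match by its position in the returned list. The following is a minimal sketch, assuming the area remains the second row of the table, as in the output above:

>>> area = re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
>>> area
'244,820 square kilometres'

Relying on the index is fragile, however: if the table rows are reordered or a new row is inserted, the scraper will silently return the wrong field. A regular expression anchored on a more distinctive attribute of the row containing the area would be more robust.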