7. Advanced Web Scraping and Data Gathering
Activity 7.01: Extracting the Top 100 e-books from Gutenberg
Solution:
These are the steps to complete this activity:
- Import the necessary libraries, including regex and BeautifulSoup:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re
- Read the HTML from the URL:
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)
- Write a small function to check the status of the web request:
def status_check(r):
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1
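The status check above treats only code 200 as success and prints a fixed message. As a possible variation, the sketch below also reports the standard reason phrase for the status code, which can help when a request fails. The name describe_status is hypothetical and is not part of the activity. The stand-in objects are an assumption made so that the example runs without a network call; they only mimic the status_code attribute of a real response.

```python
from http import HTTPStatus
from types import SimpleNamespace


def describe_status(r):
    """Return (ok, message) for any object that has a numeric status_code.

    Hypothetical helper, not part of the original activity.
    """
    code = r.status_code
    try:
        # Look up the standard reason phrase, e.g. 404 -> "Not Found"
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes outside the standard registry have no phrase
        phrase = "Unknown status"
    # Treat every 2xx code as success, not only 200
    ok = 200 <= code < 300
    return ok, f"{code} {phrase}"


# Stand-in objects (assumption): they expose only status_code,
# the same attribute read from the requests response above.
fake_ok = SimpleNamespace(status_code=200)
fake_missing = SimpleNamespace(status_code=404)

print(describe_status(fake_ok))       # (True, '200 OK')
print(describe_status(fake_missing))  # (False, '404 Not Found')
```

Because the helper only reads status_code, it should accept the response object created earlier in the same way as status_check does.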
- Check the status of response:
status_check(response)
The output is as follows...