Adding cache support to the link crawler
To support caching, the download
function developed in Chapter 1, Introduction to Web Scraping, needs to be modified to check the cache before downloading a URL. We also need to move throttling inside this function and only throttle when a download is made, and not when loading from a cache. To avoid the need to pass various parameters for every download, we will take this opportunity to refactor the download
function into a class, so that parameters can be set once in the constructor and reused multiple times. Here is the updated implementation to support this:
class Downloader: def __init__(self, delay=5, user_agent='wswp', proxies=None, num_retries=1, cache=None): self.throttle = Throttle(delay) self.user_agent = user_agent self.proxies = proxies self.num_retries = num_retries self.cache = cache def __call__(self, url): result = None if self.cache: try: ...