To support caching, the download function developed in Chapter 1, Introduction to Web Scraping, needs to be modified to check the cache before downloading a URL. We also need to move throttling inside this function and only throttle when a download is made, and not when loading from a cache. To avoid the need to pass various parameters for every download, we will take this opportunity to refactor the download function into a class so parameters can be set in the constructor and reused numerous times. Here is the updated implementation to support this:
from chp1.throttle import Throttle
from random import choice
import requests
class Downloader:
def __init__(self, delay=5, user_agent='wswp', proxies=None, cache={}):
self.throttle = Throttle(delay)
self.user_agent = user_agent
self.proxies = proxies
self.num_retries = None # we will set...