In the previous chapter, we learned how to scrape data from crawled web pages and save the results to a CSV file. What if we now want to scrape an additional field, such as the flag URL? To scrape additional fields, we would need to download the entire website again. This is not a significant obstacle for our small example website; however, other websites can have millions of web pages, which could take weeks to recrawl. One way scrapers avoid these problems is by caching crawled web pages from the outset, so each page needs to be downloaded only once.
In this chapter, we will cover a few ways to add caching to our web crawler. Specifically, we will cover the following topics:
- When to use caching
- Adding cache support to the link crawler
- Testing the cache
- Using requests-cache
- Redis cache implementation
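To give a taste of the approach before we dive in, the core caching idea described above can be sketched in a few lines: wrap the download function so that each URL's result is stored on disk the first time it is fetched, and served from disk thereafter. This is a minimal illustrative sketch, not the implementation developed in this chapter; the `DiskCache` class and `cached_download` helper are hypothetical names chosen for the example.

```python
import hashlib
import os
import pickle


class DiskCache:
    """Minimal URL-keyed cache that pickles results to disk (illustrative sketch)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url):
        # Hash the URL to produce a safe, fixed-length filename
        name = hashlib.sha256(url.encode('utf-8')).hexdigest()
        return os.path.join(self.cache_dir, name + '.pkl')

    def __contains__(self, url):
        return os.path.exists(self._path(url))

    def __getitem__(self, url):
        with open(self._path(url), 'rb') as f:
            return pickle.load(f)

    def __setitem__(self, url, result):
        with open(self._path(url), 'wb') as f:
            pickle.dump(result, f)


def cached_download(url, cache, downloader):
    """Return the cached result for url if present; otherwise download and store it."""
    if url in cache:
        return cache[url]
    result = downloader(url)
    cache[url] = result
    return result
```

With a cache like this in place, re-running the crawler to extract a new field (such as the flag URL) re-parses the stored pages instead of hitting the network again. The chapter refines this idea and then shows library-backed alternatives with requests-cache and Redis.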