Scrapy comes with the ability to cache HTTP requests and responses. This can greatly reduce crawling time when pages have already been visited: with the cache enabled, Scrapy stores every request and response on disk and serves repeat requests from the cache instead of re-downloading them.
Caching responses
How to do it
There is a working example in the 06/10_file_cache.py script. In Scrapy, the caching middleware is disabled by default. To enable it, set HTTPCACHE_ENABLED to True and HTTPCACHE_DIR to a directory on the file system (a relative path creates the directory under the project's data folder). To demonstrate, this script runs a crawl of the NASA site and caches the content. It is configured as follows:
if __name__ == "__main__":
    process = CrawlerProcess({
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_DIR': "."  # relative path: created under the project's data folder
    })
    process.crawl(Spider)  # the spider class defined earlier in the script
    process.start()
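Conceptually, Scrapy's default file-system cache storage works by writing each response to disk under a key derived from a fingerprint of the request, so a repeat request can be answered from the file instead of the network. The following is a rough, stdlib-only sketch of that idea; the `FileCache` class and `cache_key` function are invented here for illustration and are not Scrapy's actual API (Scrapy fingerprints the whole request, while this sketch hashes only the URL):

```python
import hashlib
import os
import tempfile


def cache_key(url):
    # Simplification for illustration: key the cache entry by a hash of
    # the URL alone (Scrapy's real fingerprint covers method, body, etc.).
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class FileCache:
    """Store downloaded bodies as files keyed by a hash of the URL."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def fetch(self, url, download):
        path = os.path.join(self.directory, cache_key(url))
        if os.path.exists(path):
            # Cache hit: read the stored body, skip the download entirely.
            with open(path, "rb") as f:
                return f.read()
        # Cache miss: download, store for next time, then return.
        body = download(url)
        with open(path, "wb") as f:
            f.write(body)
        return body
```

A second `fetch` of the same URL never calls the downloader again, which is exactly why a cached re-crawl of an already-visited site is so much faster.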