Using an HTTP cache for development

The development of a web crawler is a process of exploration, one that iterates through various refinements to retrieve the requested information. During development, you will often hit the same remote servers, and the same URLs on those servers, over and over. This is not polite. Fortunately, Scrapy also comes to the rescue by providing caching middleware designed specifically for this situation.

How to do it
Scrapy caches requests using a downloader middleware class named HttpCacheMiddleware. Enabling it is as simple as setting HTTPCACHE_ENABLED to True:
process = CrawlerProcess({
    'HTTPCACHE_ENABLED': True
})
process.crawl...
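For reference, here is a minimal, self-contained sketch of the full flow, completing the truncated crawl call above. The spider class, its name, and the target URL are illustrative placeholders; the HTTPCACHE_* settings are standard Scrapy options, shown here with their default values:

from scrapy import Spider
from scrapy.crawler import CrawlerProcess

# Placeholder spider; swap in your own spider class and start URL.
class CachedSpider(Spider):
    name = 'cached'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # On a second run, this response is served from the local cache,
        # so the remote server is not contacted again.
        self.logger.info('Fetched %s (%d bytes)', response.url, len(response.body))

process = CrawlerProcess({
    'HTTPCACHE_ENABLED': True,
    # 0 means cached responses never expire; set a TTL in seconds to refetch.
    'HTTPCACHE_EXPIRATION_SECS': 0,
    # Cache location, relative to the project's .scrapy data directory.
    'HTTPCACHE_DIR': 'httpcache',
    # Default backend: one file tree per request fingerprint on disk.
    'HTTPCACHE_STORAGE': 'scrapy.extensions.httpcache.FilesystemCacheStorage',
})
process.crawl(CachedSpider)
process.start()

Run the script twice: the first run downloads the pages, while the second is served from .scrapy/httpcache. Deleting that directory forces fresh downloads.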