Controlling the length of a crawl

The length of a crawl, in terms of the number of pages that can be parsed, can be controlled with the CLOSESPIDER_PAGECOUNT setting.

How to do it
We will be using the script in 06/07_limit_length.py. The script and scraper are the same as the NASA sitemap crawler, with the addition of the following configuration to limit the number of pages parsed to 5:
if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        # stop the crawl after 5 pages have been processed
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()
When this is run, output like the following will be generated (interspersed with the logging output):
<200 https://www.nasa.gov/exploration...
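Behind the scenes, this setting is handled by Scrapy's CloseSpider extension, which counts responses as they are processed and signals the spider to close once the cap is reached. The following is a minimal, plain-Python sketch of that counting idea; the `PageCountLimiter` class is hypothetical and illustrative only, not Scrapy's actual implementation:

```python
# Illustrative sketch (not Scrapy source code) of a page-count limit
# like CLOSESPIDER_PAGECOUNT: count responses and stop crawling once
# the configured maximum is reached.
class PageCountLimiter:
    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.count = 0

    def on_response(self):
        """Record one processed page; return True while crawling may continue."""
        self.count += 1
        return self.count < self.max_pages

limiter = PageCountLimiter(5)
results = [limiter.on_response() for _ in range(6)]
print(results)  # → [True, True, True, True, False, False]
```

Note that, as in Scrapy, the limit is not an exact cutoff: requests already in flight when the cap is hit may still be processed, so the final page count can slightly exceed the configured value.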