As you add more target websites to your scraping requirements, you will eventually reach a point where you wish you could make more calls, faster. In a single program, the crawl delay for one site forces the scraper to wait, which needlessly delays the processing of the other sites. Do you see the problem in the following diagram?
[Diagram: image not available]
If these two sites could be scraped in parallel, they would not interfere with each other. Alternatively, the time to access and parse a page may be longer than the crawl delay for a website, in which case launching a second request before the first response has finished processing could save you time as well. Look at how the situation improves in the following diagram:
[Diagram: image not available]
In any of these cases, you will need to introduce concurrency into your web scraper.
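To make the difference concrete, here is a minimal sketch in Python, offered as an illustration and not as a definitive implementation. The site names, the crawl delay values, and the helper functions (`scrape_site`, `scrape_sequentially`, `scrape_concurrently`, `timed`) are all hypothetical, and `time.sleep` stands in for both the crawl delay and the real network requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sites and crawl delays in seconds (kept short so the demo runs quickly).
CRAWL_DELAYS = {"site-a.example": 0.2, "site-b.example": 0.1}
PAGES_PER_SITE = 3


def scrape_site(site, delay, pages=PAGES_PER_SITE):
    """Pretend to fetch `pages` pages from `site`, waiting `delay` between requests."""
    fetched = []
    for i in range(pages):
        if i > 0:
            time.sleep(delay)  # honor the crawl delay between requests to the same site
        fetched.append(f"{site}/page{i}")  # stand-in for a real HTTP request and parse
    return fetched


def scrape_sequentially(delays):
    """Scrape one site after another; each site's delays hold up the next site."""
    return {site: scrape_site(site, delay) for site, delay in delays.items()}


def scrape_concurrently(delays):
    """Scrape each site in its own thread; each site's delay only slows that site."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = {
            site: pool.submit(scrape_site, site, delay)
            for site, delay in delays.items()
        }
        return {site: future.result() for site, future in futures.items()}


def timed(fn, *args):
    """Run `fn(*args)` and return its result with the elapsed wall-clock time."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, sequential_time = timed(scrape_sequentially, CRAWL_DELAYS)
    _, concurrent_time = timed(scrape_concurrently, CRAWL_DELAYS)
    print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

In this sketch, the sequential run waits for the sum of every site's crawl delays, while the concurrent run is bounded by the slowest single site. Each site still observes its own delay in both versions, so the politeness toward each server is unchanged.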
In this chapter, we will cover the following topics:
- What is concurrency
- Concurrency pitfalls...