Running scrapers in parallel
I'm not saying this just because I coded it, but our scraper has a solid structure: every piece is separated into its own function, which makes it easy to identify which parts can run in parallel.
At the risk of sounding repetitive, remember that the site being scraped, in this case Packt, is our friend (and my publisher). We don't want to affect the site; we want to look like normal users. There is no need to run 1,000 calls in parallel, so we will parallelize our scraper with caution.
The good news is that we don't have to build a parallel architecture ourselves. We will use a package called puppeteer-cluster (https://www.npmjs.com/package/puppeteer-cluster). According to its description on npm, this library:
- Handles crawling errors
- Auto restarts the browser in case of a crash
- Can automatically retry if a job fails
- Offers different concurrency...