Improving our crawler
Now that we've had an in-depth look at both ThreadPoolExecutors and ProcessPoolExecutors, it's time to actually put these newly learned concepts into practice. In Chapter 5, Communication between Threads, we started developing a multithreaded web crawler that was able to crawl every available link on a given website.
Note
The full source code for this Python web crawler can be found at this link: https://github.com/elliotforbes/python-crawler.
It didn't, however, output the results in the most readable format, and the code could be improved using ThreadPoolExecutors. So, let's have a look at implementing both more readable code and more readable results.
The plan
Before we get started, we need a general plan for how we are going to improve our crawler.
New improvements
A few examples of the improvements we might wish to make are as follows:
- We want to refactor our code to use ThreadPoolExecutors
- We want to output the results of a crawl in a more readable format such...
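To give a feel for the first improvement, here is a minimal sketch of how a crawl loop might be driven by a ThreadPoolExecutor. This is not the code from the repository linked above. The crawl function, its fetch_links parameter, and the in-memory site dictionary are hypothetical names chosen for illustration; fetch_links stands in for whatever downloads a page and extracts its links.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl(start_url, fetch_links, max_workers=4):
    """Crawl level by level from start_url and return the set of URLs seen.

    fetch_links is a hypothetical callable taking a URL and returning an
    iterable of the URLs that page links to.
    """
    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while frontier:
            # Submit one task per URL in the current level of the crawl.
            futures = {executor.submit(fetch_links, url): url for url in frontier}
            frontier = []
            for future in as_completed(futures):
                try:
                    links = future.result()
                except Exception as exc:
                    # A failed page should not stop the rest of the crawl.
                    print(f"{futures[future]} failed: {exc}")
                    continue
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return seen


if __name__ == "__main__":
    # A made-up, in-memory "website" so the sketch runs without a network.
    site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}
    for url in sorted(crawl("/", lambda url: site[url])):
        print(url)
```

In this sketch, only the main thread touches seen and frontier, so no lock is needed; the worker threads do nothing but run fetch_links. A real crawler would replace the lambda with a function that performs the HTTP request and parses the page.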