When you are crawling a website, you may not always know where you will end up. Many links in web pages lead to external sites that you may not trust as much as your target sites. These linked pages could contain irrelevant information or could be used for malicious purposes. It is important to define boundaries for your web scraper so that it can safely navigate unknown sources.
Boundaries
Whitelists
Whitelisting domains is the practice of explicitly listing the websites your scraper is allowed to access. Any site on the whitelist is fair game for the web scraper, whereas any site not on the list is automatically skipped. This is a simple way to ensure that your scraper only accesses pages from a small set of trusted sites.
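As a minimal sketch of this idea in Python, the snippet below filters candidate links against a whitelist of domains before the scraper follows them. The domain names, the ALLOWED_DOMAINS set, and the is_whitelisted helper are all illustrative, not part of any particular library:

```python
from urllib.parse import urlparse

# Hypothetical whitelist: only these domains may be crawled.
ALLOWED_DOMAINS = {"example.com", "docs.example.com"}

def is_whitelisted(url):
    """Return True if the URL's domain appears on the whitelist."""
    domain = urlparse(url).netloc.lower()
    # Strip an optional port and a leading "www." before comparing.
    domain = domain.split(":")[0]
    if domain.startswith("www."):
        domain = domain[4:]
    return domain in ALLOWED_DOMAINS

links = [
    "https://www.example.com/about",
    "https://untrusted.example.net/page",
]
safe_links = [url for url in links if is_whitelisted(url)]
# safe_links now contains only the example.com URL.
```

Because the check happens before any request is made, links to sites outside the whitelist are simply dropped rather than fetched and discarded later.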