Before jumping into too much code, there are a few points to keep in mind as you begin running a web scraper. It is important to remember that we must all be good citizens of the internet in order for everyone to get along. With that in mind, there are many tools and best practices you can follow to ensure you are being fair and respectful when placing load on someone else's web server. Stepping outside of these guidelines could put your scraper at risk of being blocked by the server or, in extreme cases, land you in legal trouble.
In this chapter, we will cover the following topics:
- What is a robots.txt file?
- What is a User-Agent string?
- How can you throttle your web scraper?
- How do you use caching?