Creating scrapers
Let's try to scrape book prices from the Packt site. The terms and conditions said nothing about scrapers (https://www.hardkoded.com/ui-testing-with-puppeteer/packtpub-terms). But the robots.txt file has some clear rules:
User-agent: *
Disallow: /index.php/
Disallow: /*?
Disallow: /checkout/
Disallow: /app/
Disallow: /lib/
Disallow: /*.php$
Disallow: /pkginfo/
Disallow: /report/
Disallow: /var/
Disallow: /catalog/
Disallow: /customer/
Disallow: /sendfriend/
Disallow: /review/
Disallow: /*SID=
They don't want us to go to those pages.
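Before navigating anywhere, a scraper can test each path against those rules. Here is a minimal sketch in plain Node.js, with the Disallow patterns above hardcoded in an array; the patternToRegExp and isAllowed helpers are hypothetical names for this example, not part of any library. It treats * as "match anything" and a trailing $ as an end-of-path anchor, which is how robots.txt wildcards work:

const disallowed = [
  '/index.php/', '/*?', '/checkout/', '/app/', '/lib/', '/*.php$',
  '/pkginfo/', '/report/', '/var/', '/catalog/', '/customer/',
  '/sendfriend/', '/review/', '/*SID=',
];

// Turn a robots.txt pattern into a regular expression:
// escape regex metacharacters, then restore the two robots.txt
// wildcards: '*' (any run of characters) and a trailing '$' (end anchor).
function patternToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape metacharacters
    .replace(/\\\$$/, '$')                 // a trailing '$' anchors the end
    .replace(/\*/g, '.*');                 // '*' matches anything
  return new RegExp('^' + escaped);
}

function isAllowed(path) {
  return !disallowed.some((pattern) => patternToRegExp(pattern).test(path));
}

console.log(isAllowed('/catalog/books'));  // false, /catalog/ is disallowed
console.log(isAllowed('/free-learning')); // true, no rule matches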
But the site has a pretty massive sitemap.xml, with over 9,000 lines. If robots.txt is the "don't go here" sign for scrapers, sitemap.xml is the "please, check this out" sign. These are the first items in the sitemap.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns...
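Each <url> entry in that file carries a <loc> tag with a page URL. As a rough sketch, assuming Node.js 18+ (for the global fetch API) and that the sitemap lives at the conventional /sitemap.xml path, we can pull every <loc> value out with a regular expression; getSitemapUrls is a hypothetical helper, and a production scraper would likely use a proper XML parser instead:

// A sketch only: fetch the sitemap and extract every <loc> URL.
async function getSitemapUrls(siteUrl) {
  const response = await fetch(new URL('/sitemap.xml', siteUrl));
  const xml = await response.text();
  // Capture the text between each <loc> and </loc> pair.
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((match) => match[1]);
}

getSitemapUrls('https://www.packtpub.com')
  .then((urls) => console.log(`Found ${urls.length} URLs`, urls.slice(0, 5)));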