Gap
Gap has a well-structured website with a Sitemap to help web crawlers locate their updated content. If we use the techniques from Chapter 1, Introduction to Web Scraping, to investigate the website, we find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:
Sitemap: http://www.gap.com/products/sitemap_index.xml
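As a minimal sketch of how such a link could be pulled out of a robots.txt body programmatically, the following function scans for Sitemap directives. The function name find_sitemaps and the sample robots.txt text are assumptions made for illustration; only the Sitemap line itself comes from the text above.

```python
def find_sitemaps(robots_txt):
    """Return the URLs given in the Sitemap: lines of a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, because the URL contains colons too.
        name, sep, value = line.partition(':')
        # Directive names are matched case-insensitively.
        if sep and name.strip().lower() == 'sitemap':
            sitemaps.append(value.strip())
    return sitemaps


# Hypothetical robots.txt body; the User-agent and Disallow lines are made up.
sample = """User-agent: *
Disallow: /checkout/
Sitemap: http://www.gap.com/products/sitemap_index.xml
"""

print(find_sitemaps(sample))
```

In Python 3.8 and later, the standard library's urllib.robotparser.RobotFileParser also exposes a site_maps() method, which may be preferable when the robots.txt file is being fetched and parsed anyway.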
Here are the contents of the linked Sitemap file:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>
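The loc entries in a file like this can be read with the standard library's XML parser. The following is a sketch under stated assumptions rather than a definitive implementation: the function name sitemap_locations is hypothetical, and the XML is embedded as a literal so the example runs without a network request.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the root element; the 'sm' prefix is our own choice.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def sitemap_locations(index_xml):
    """Return the <loc> URL of every <sitemap> entry in a Sitemap index."""
    root = ET.fromstring(index_xml)
    # Elements live in the Sitemap namespace, so the path must be qualified.
    return [loc.text.strip()
            for loc in root.findall('sm:sitemap/sm:loc', SITEMAP_NS)]


# Same content as the Sitemap index shown above, passed as bytes so the
# encoding declaration in the XML prolog is honoured by the parser.
index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        <lastmod>2015-03-03</lastmod>
    </sitemap>
</sitemapindex>
"""

for url in sitemap_locations(index_xml):
    print(url)
```

Qualifying the path with the namespace matters here: an unqualified search for sitemap/loc would match nothing, because every element inherits the default namespace from the root.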
As shown here, this Sitemap link is just an index and contains links to other Sitemap files. These other Sitemap files then...