Introducing web scraping
Throughout the book, we repeatedly see data’s value in creating intelligent systems. None of the discussions presented so far would make any sense without it. For instance, we incorporated publicly available corpora and built-in datasets from Python libraries in various case studies. In reality, however, suitable corpora are rarely available for free, and it’s the data scientist’s responsibility to harvest them. The World Wide Web (WWW) is a goldmine where we can find or augment our datasets using web scraping, the process of collecting and parsing raw data from the web. Afterward, the data is converted into an appropriate format for the subsequent analysis.
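To make the two steps concrete, here is a minimal sketch of the parsing half of web scraping, using only Python’s standard library `html.parser` (real projects typically rely on third-party libraries such as Beautiful Soup). The sample HTML string stands in for a page already fetched over HTTP and is invented for illustration.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Collect the text of every <h2> heading in a page."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Keep only text that appears inside an <h2> element.
        if self.in_h2:
            self.titles.append(data.strip())


# Hypothetical raw HTML, as it might be returned by an HTTP request.
raw = (
    "<html><body>"
    "<h2>First article</h2><p>Body text</p>"
    "<h2>Second article</h2>"
    "</body></html>"
)

scraper = TitleScraper()
scraper.feed(raw)
print(scraper.titles)  # ['First article', 'Second article']
```

The raw markup is thus converted into a structured Python list, ready for whatever analysis follows.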
For this task to succeed, web crawlers are used to retrieve the requested content. These are also known as spiders because they crawl all over the web, just as real spiders crawl over their webs. The specific processing is performed...