Chapter 6. Scraping Link-Based External Data
This chapter explains a common pattern for enhancing local data with external content found at URLs or retrieved over APIs. Typical examples are URLs received from GDELT or Twitter. We offer readers a tutorial that uses the GDELT news index service as a source of news URLs, demonstrating how to build a web-scale news scanner that scrapes global breaking news of interest from the Internet. We explain how to build this specialist web scraping component in a way that overcomes the challenges of scale. In many use cases, accessing the raw HTML content is not sufficient to gain deeper insight into emerging global events. An expert data scientist must be able to extract entities from that raw text content to help build the context needed to track broader trends.
In this chapter, we will cover the following topics:
- Create a scalable web content fetcher using the Goose library
- Leverage the Spark framework for Natural Language Processing...