Challenges to news data analysis
The analysis of news data was probably one of the most challenging tasks in this book. Here is a summary of the toughest problems that we encountered while developing this chapter:
Lack of API sources: News data providers are not always very API-friendly. We were lucky to have a prestigious source such as The Guardian, which believes in open access to its data and goes to great lengths to ensure it. But, apart from a couple of big names such as The New York Times and The Guardian, few data providers go down the API route.
Web scraping: Extracting text from HTML is quite a complex process. Once again, we were lucky that the HTML structure of our data sources was quite simple. A more involved structure would have required a larger and more elaborate scraping process. (We encourage the reader to take a look at the HTML structure of any New York Times article to realize the complexity that...