Summary
In this chapter, we introduced NLP, a complex field of study with many challenges and opportunities.
The first part of the chapter focused on how to get textual data from the Web. Blogs were a natural candidate for text mining, given the abundance of textual data they offer. After dealing with two of the most popular free blogging platforms, WordPress.com and Blogger, we generalized the problem by introducing the XML standards for web feeds, specifically RSS and Atom. Given its strong presence on the Web, and probably in the everyday life of many Internet users, Wikipedia also deserved to be mentioned in a discussion about textual content. We saw how easy it is to interact with all of these services in Python, either by using available libraries or by quickly implementing our own functions.
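As a brief reminder of what reading a web feed involves, the following is a minimal sketch that uses only the standard library rather than any of the dedicated libraries. The sample feed, its example.com URLs, and the parse_rss_items() function name are hypothetical and exist only for illustration. A real feed would be downloaded first and may need extra handling, such as namespaces in the case of Atom.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, inlined so the sketch runs offline
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Some text about NLP.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>More text to mine.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of dicts (title, link, description), one per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter('item'):
        items.append({
            # findtext() falls back to the default if the child is missing
            'title': item.findtext('title', default=''),
            'link': item.findtext('link', default=''),
            'description': item.findtext('description', default=''),
        })
    return items


if __name__ == '__main__':
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry['title'], entry['link'])
```

Each returned dictionary holds the plain text of one post, ready to be passed on to whatever text processing follows.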
The second part of the chapter was about NLP. We had already introduced some NLP concepts earlier in the book, but this was the first time we took the time to provide a more formal introduction. We...