Introduction
We've been talking about all of the data that's out there in the world. However, structured or semistructured data (the kind you'd find in spreadsheets or in tables on web pages) is vastly overshadowed by the unstructured data that's being produced. This includes news articles, blog posts, tweets, Hacker News discussions, StackOverflow questions and responses, and any other natural-language text, which seems to be generated by the petabyte every day.
This unstructured content is rich in subtle, nuanced information, but extracting it is difficult. In this chapter, we'll explore some ways to get some of that information out of unstructured data. The results won't be fully nuanced and they will be very rough, but it's a start. We've already looked at how to acquire textual data: in Chapter 1, Importing Data for Analysis, we covered this in the Scraping textual data from web pages recipe. Still, the Web is going to be your...