Summary
In this chapter, you learned how to collect data by scraping web pages. You were also introduced to semi-structured data formats, namely JSON and XML. Different ways of retrieving data in real time from a website such as Twitter were explained with examples. Finally, you saw various methods for handling different kinds of local files, such as PDFs, Word documents, text files, and Excel files.
In the next chapter, you will learn about topic modeling, an unsupervised natural language processing technique that groups documents according to the topics detected in them.