In the previous two chapters, we covered the basics of machine learning, discussing both supervised and unsupervised problems.
In this chapter, we will look at how to apply these methods to textual information, illustrating most of the ideas with our running example: building a search engine. Here, we will finally use the text from the HTML and incorporate it into the machine learning models.
We will start with the basics of natural language processing and implement some of the core ideas ourselves, and then look at the efficient implementations available in NLP libraries.
This chapter covers the following topics:
- Basics of information retrieval
- Indexing and searching with Apache Lucene
- Basics of natural language processing
- Unsupervised models for texts, such as dimensionality reduction and clustering