The key functionality of search engines is indexing. The following diagram shows how documents downloaded by the crawler are processed to build the index file:
The preceding diagram shows the index as an inverted index, and user queries are directed to it. Although we use the terms index and inverted index interchangeably in this chapter, inverted index is the more accurate name, because it maps each word to the documents that contain it rather than each document to its words. First, let's see what the index of a search engine is. The whole reason for indexing documents is to make searching fast. The idea is simple: each time the crawler downloads a document, the search engine processes its contents and splits them into words, each of which refers back to that document. This process is called tokenization. Let's say we have a document downloaded from Wikipedia containing the following text (for brevity...