Preparing the index
A search engine does not store your web pages; it stores an index of them. Before your page can appear in that index, the search engine sends a spider to visit your site and read the content of your web pages. The spider returns that information to a document processor, which converts your pages into a format that the query processor understands. The document processor performs several formatting tasks. It might remove stop words: lower-value terms that bear little relation to the page's topic, such as the, and, it, and many more. It also performs term stemming, in which suffixes such as -ing, -er, -es, and -ed are stripped from words to reduce them to a common root. In essence, the document processor trims the content to reveal the contextual elements of a web page and prepares the entry for indexing.
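The two formatting tasks described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the stop-word list is a tiny hypothetical sample, the function names process_document and strip_suffix are invented for this example, and the suffix stripping is deliberately naive (production systems use more careful stemmers).

```python
import re

# Hypothetical, very small stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "it", "a", "an", "of", "to", "in", "is", "on"}

# The suffixes named in the text.
SUFFIXES = ("ing", "er", "es", "ed")


def strip_suffix(word):
    """Naive stemming: remove the first matching suffix from the word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_document(text):
    """Lowercase the text, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(token) for token in tokens if token not in STOP_WORDS]


print(process_document("Reading and indexing the linked boxes"))
# prints ['read', 'index', 'link', 'box']
```

Note that a stemmer this crude also produces stems that are not real words (for example, it would turn "spider" into "spid"). That is acceptable for matching purposes as long as queries are stemmed the same way.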
The index contains much of the information from your pages, along with other data that the search engine uses to evaluate and categorize your pages. As a highly simplified example, Google's index of your page will contain the text of your page as it stood when the spider last visited, along with other data such as the following:
- A table of the terms that appear on your page, with the frequency of each (the inverted file, which lets the engine look up pages by term)
- The page's PageRank
- A term weight assignment: numerical values that reflect how often particular terms appear on the page
- The page's meta tags
- The page's destination URL
This description is grossly simplified, but it illustrates that what the search engine attempts to match is not your page itself but a processed and analyzed version of it.
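To make the idea of a processed version more concrete, here is a minimal sketch of an inverted file and a term weight table. It assumes the usual information-retrieval meaning of an inverted file (a mapping from each term to the pages that contain it) and uses plain relative frequency as the term weight. Both choices are simplifying assumptions for illustration; the function name build_index and the example.com URLs are hypothetical, and real engines use far more elaborate weighting.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Build an inverted file and term weights.

    pages maps a URL to the list of terms left after document processing.
    """
    inverted = defaultdict(dict)  # term -> {url: number of occurrences}
    weights = {}                  # url -> {term: relative frequency}
    for url, terms in pages.items():
        counts = Counter(terms)
        total = len(terms)
        # Assumed weighting scheme: share of the page's terms.
        weights[url] = {term: count / total for term, count in counts.items()}
        for term, count in counts.items():
            inverted[term][url] = count
    return dict(inverted), weights


# Hypothetical pages, already reduced to processed terms.
pages = {
    "https://example.com/a": ["search", "index", "search", "spider"],
    "https://example.com/b": ["index", "page"],
}
inverted, weights = build_index(pages)

print(inverted["index"])
# prints {'https://example.com/a': 1, 'https://example.com/b': 1}
print(weights["https://example.com/a"]["search"])
# prints 0.5
```

A query for "index" can then be answered by a single lookup in the inverted file, without rereading either page, which is the practical reason the engine matches against the index rather than the pages themselves.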