Understanding a search engine's inner workings
One of my favorite academic questions to ask people about search technology is, "When do you think Internet searching was invented?" While the exact date is elusive, the answer is nearly always wrong—by several decades. Routinely, people reflect the common understanding that search technology was invented in the 1990s.
Actually, a search engine merely employs search query and indexing principles that were conceived and implemented decades before in a mainframe environment. Indexing, coupled with search queries, allowed early computer operators to quickly select relevant information from large databases in the infancy of the computer age. The Internet is simply a much larger database and a modern search engine is simply a much more robust and sophisticated search query tool.
Preparing the index
A search engine does not store your web pages; it stores an index of your web pages. For your page to appear in a search engine's index, that search engine first sends a search spider to visit your site and read your web pages' content. The spider returns the information to a document processor, which converts your web pages into a format that the query processor understands. The document processor performs several formatting tasks. It might remove stop words: low-value terms that bear little relation to the page's topic, such as "the," "and," "it," and the like. It will also perform term stemming, where suffixes such as -ing, -er, -es, and -ed are stripped from terms so that variants of a word reduce to a common root. In essence, a document processor trims content to reveal the contextual elements of a web page and prepares the entry for indexing.
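The two formatting tasks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word set and suffix list are taken from the examples in the text, and the names process_document and stem are hypothetical. Real document processors use far larger stop-word lists and more careful stemming rules.

```python
import re

# Assumed lists, limited to the examples given in the text.
STOP_WORDS = {"the", "and", "it"}
SUFFIXES = ("ing", "er", "es", "ed")


def stem(term):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def process_document(text):
    """Lowercase the text, split it into words, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


terms = process_document("The crawler indexed the boxes and it is searching")
print(terms)
```

Because stemming reduces "crawler," "indexed," and "searching" to their roots, a later query for "search" or "index" can match this page even though those exact forms never appear on it.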
The index contains much of the information from your pages, along with the other data that the search engine uses to evaluate and categorize your pages. As a highly simplified example, Google's index of your page will contain the text of your page as it stood when Google's spider last visited, along with other data such as:
A table of the terms that appear on your page, ordered by how frequently they appear (this feeds what is called the inverted file, which maps each term to the pages that contain it)
The page's PageRank
A term weight assignment, a numerical value that reflects the frequency of appearance of particular terms on a page
The page's meta tags
The page's destination URL
That description is grossly simplified, but it makes the point that what the search engine attempts to match is not your page itself, but a processed and analyzed version of your page.
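To make the list above concrete, here is a minimal sketch of what one index entry might look like as a data structure. Everything in it is an assumption made for illustration: the IndexEntry class, its field names, and the choice of relative frequency as the term weight are hypothetical, and the page_rank field is only a placeholder value, not a computation. No search engine's real index format is being described here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """Hypothetical index entry for one page; field names are illustrative."""
    url: str                                       # the page's destination URL
    terms: list                                    # processed terms from the page
    meta_tags: dict = field(default_factory=dict)  # the page's meta tags
    page_rank: float = 0.0                         # placeholder, not computed here

    def term_frequencies(self):
        """Return (term, count) pairs, most frequent term first."""
        return Counter(self.terms).most_common()

    def term_weights(self):
        """Assumed weighting: each term's share of all terms on the page."""
        total = len(self.terms)
        if total == 0:
            return {}
        return {term: count / total for term, count in Counter(self.terms).items()}


entry = IndexEntry(
    url="https://example.com/",
    terms=["search", "index", "search", "engine"],
    meta_tags={"description": "An example page"},
)
print(entry.term_frequencies())
print(entry.term_weights())
```

The entry holds only processed terms and derived numbers, which is the sense in which the engine matches an analyzed version of the page and not the page itself.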
Querying the index
Once the index is prepared, the page is available for querying. The query processor, along with a search and matching engine, performs the nuts and bolts of the search function: it matches a user's query to stored entries in the search engine's index. The final element is a sound methodology for ranking query results. If everything works as planned, the search engine returns a sensibly ordered set of results to each user's query.
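The matching and ranking steps can be sketched as follows. This is a toy model, not how any production engine works: the function names build_index and search are hypothetical, the page names are made up, and ranking by raw term counts is an assumption chosen for simplicity. The point it illustrates is that the query is answered entirely from the index, never from the pages themselves.

```python
from collections import defaultdict


def build_index(pages):
    """Map each term to {url: number of times the term occurs on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index


def search(index, query):
    """Return URLs ranked by how often the query terms occur on each page."""
    scores = defaultdict(int)
    for term in query.lower().split():
        # Look the term up in the index; the pages are never read again.
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


pages = {
    "a.html": "search engine index search",
    "b.html": "engine repair manual",
    "c.html": "garden tools",
}
index = build_index(pages)
print(search(index, "search engine"))
```

A page that was never passed to build_index cannot appear in any result, however relevant its content, which mirrors the fact that pages missing from the index cannot be found.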
Peeking into the mechanics of search gives us a few guidelines to follow. One core principle that emerges is that keywords are the signposts that search engines use to determine the subject and value of web pages—without relevant and contextual words on your pages, the search engines cannot accurately index your pages. The other important idea is that a search engine searches an index—it doesn't search your pages directly. Therefore, if your pages aren't in the index, they aren't going to be found. These concepts will re-emerge as we work through the chapters in this book.