How search engines assess sites
Search engines all function in approximately the same fashion: a software agent, known as a bot, a spider, or a crawler, visits a page, gathers the content, and stores it in the search engine's data repository. Once the information is in the repository, it is indexed. The crawling and indexing processes are continuous. Each of the major search engines maintains multiple crawlers that work constantly to refresh its index. The spiders find new pages by a variety of methods, typically including XML sitemaps, URLs already in the index, links discovered on pages while indexing, and URLs submitted for inclusion by users. How frequently they visit a specific site, and how deeply they spider the site on each visit, vary from site to site.
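As a rough illustration of the crawl loop described above, the following Python sketch walks a tiny in-memory "web" instead of fetching real pages. The names FAKE_WEB, crawl, and max_depth are hypothetical and exist only for this example; it is a simplified model of how a crawler might follow links from a set of seed URLs, not a reproduction of any search engine's actual crawler.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
# A real crawler would fetch each URL over HTTP instead.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/about", "http://example.com/blog"],
    "http://example.com/about": ["http://example.com/"],
    "http://example.com/blog": ["http://example.com/blog/post-1"],
    "http://example.com/blog/post-1": [],
}

def crawl(seeds, max_depth):
    """Breadth-first crawl from the seed URLs, following links up to max_depth."""
    repository = {}                            # URL -> links (stands in for stored content)
    queue = deque((url, 0) for url in seeds)   # seeds could come from a sitemap or submissions
    seen = set(seeds)
    while queue:
        url, depth = queue.popleft()
        links = FAKE_WEB.get(url)
        if links is None:                      # page does not exist in this toy web
            continue
        repository[url] = links                # "gather the content and store it"
        if depth == max_depth:                 # do not spider any deeper on this visit
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return repository
```

Calling crawl(["http://example.com/"], 1) stores the home page and the two pages it links to, but not the blog post one level further down, which mirrors the point that crawl depth differs from visit to visit.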
When a user visits the search engine and runs a search, the engine extracts from its index a list of pages that are relevant to the query, and then displays that list to the user. The output on the search results page is defined according to each search engine's own criteria. The ranking methodology used by each engine is the product of that engine's secret algorithm.
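The lookup step can be pictured with a toy inverted index, sketched below in Python. The function names build_index and search are hypothetical, and because real ranking algorithms are secret, this example simply returns matching URLs in alphabetical order rather than attempting any ranking.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word, sorted alphabetically.

    Alphabetical order is a placeholder: real engines rank results
    with their own undisclosed criteria.
    """
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)
```

The point of the sketch is that the query is answered from the index built earlier, not by visiting the pages again at search time.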
The search engine's crawler is primarily interested in certain types of information on the page, particularly the URL, the text, and the links on the page. Formatting is not indexed. Images and other media are indexed by most search engines, but to varying degrees of depth. Some types of media, such as Flash or attached files, are rarely indexed, though there are exceptions.
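To make this concrete, here is a minimal Python sketch that reduces a page to roughly what the paragraph above describes: its text and its links, with formatting discarded. It uses the standard library's html.parser module; the class name CrawlerView and the helper crawler_view are hypothetical, and real crawlers extract considerably more than this.

```python
from html.parser import HTMLParser

class CrawlerView(HTMLParser):
    """Collect visible text and link targets; ignore formatting markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or stylesheet content.
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

def crawler_view(html):
    """Return (text, links) for an HTML string."""
    parser = CrawlerView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

Tags such as bold or heading markup contribute their text but nothing else, and the stylesheet is dropped entirely, which is the sense in which formatting is not indexed.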
If you have a Google Webmaster Tools account, you can see a web page exactly as Googlebot (the name of the Google crawler) sees it. To do this, log in to Google Webmaster Tools and click on a site profile. In the navigation menu on the left-hand side, select the Diagnostics menu and then select the Fetch as Googlebot option. Type the URL of the page you want to see, and the system will produce the results. The following screenshots show a web page, followed by Googlebot's view of the same page:
Here's the spider's view of the same page: