The major search engines that account for most market share today are the automated type: Google, Yahoo!, and Microsoft Bing (although Yahoo! and Microsoft have reached an agreement under which Yahoo! will replace its search engine with Bing, this hasn’t happened as of late 2009). All three, as well as the smaller search engines that operate their own technology, have a similar overall infrastructure, as follows:
- Web crawlers (also known as “spiders” or “robots”) that crawl the Web. These crawlers follow links to discover the pages on the Web.
- Extraction processes that gather information from those pages (such as textual content, metadata, and links).
- Index storage that stores the content from Web pages. Content is generally stored using word-based keys, similar to the index in a book. When you look up a word in the index of a book, you learn the page number that word is on. Similarly, with a search engine index, the search engine can look up a word that someone is searching for and find all the Web pages associated with that word.
- Results scoring that determines which pages are the most relevant for each search. When someone performs a search (called a “query”), the search engine checks its index for all the Web pages associated with that search, and it needs a way to rank those pages in an order that is useful for the searcher. Search engines use a number of factors in scoring, and these factors are adjusted all the time based on new algorithms, tests, and other criteria. Search engines keep the details of these scoring factors secret. Once the search engine compiles and ranks the pages that are relevant for the query, it displays them in a list called “organic results.” The ranking process happens at the time of the query.
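The crawling step described above can be sketched in miniature. This is a hypothetical illustration only: the link graph below stands in for the Web (a real crawler fetches pages over HTTP and handles billions of URLs), and the `example.com` URLs are invented for the example.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs it links to.
# In reality a crawler discovers these links by fetching and parsing pages.
links = {
    "example.com/": ["example.com/a", "example.com/b"],
    "example.com/a": ["example.com/b"],
    "example.com/b": ["example.com/c"],
    "example.com/c": [],
}

def crawl(seed):
    """Breadth-first crawl: follow links outward from the seed page,
    visiting each discovered page exactly once."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # a real crawler would fetch and parse here
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# crawl("example.com/") discovers all four pages by following links.
```

The key idea is the `seen` set: a page is queued only the first time a link to it is found, so the crawler discovers every reachable page without revisiting any.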
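The indexing and scoring steps can likewise be sketched with a toy example. Everything here is an assumption made for illustration: the three pages are invented, and the scoring rule (count how many query words a page contains) is a deliberately simple stand-in for the secret, far more elaborate factors real engines use.

```python
from collections import defaultdict

# A handful of hypothetical pages stand in for crawled Web content.
pages = {
    "example.com/a": "search engines crawl the web",
    "example.com/b": "an index maps words to pages",
    "example.com/c": "engines rank pages for each search query",
}

# Build an inverted index: each word maps to the set of pages containing
# it, just as a book index maps a word to page numbers.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    """Look up each query word in the index, then rank the matching
    pages by a toy relevance score: how many query words each contains."""
    scores = defaultdict(int)
    for word in query.split():
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest-scoring pages first -- the "organic results" list.
    return sorted(scores, key=scores.get, reverse=True)
```

Note that the index is built ahead of time, while `search` runs at query time, mirroring the split the list above describes: storage happens during indexing, ranking happens when the query arrives.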