When we open a web browser and go to our favorite search engine, we rarely stop to think about how it finds relevant content for us. We understand that keywords are needed to retrieve relevant results from our search. But do we understand how the search engine discovers that content in the first place, so it can present us a front page of fresh results?
Enter the search engine crawler, or search engine spiders as some call them.
Crawlers may be more easily recognizable by their search engine names: Google, Bing, Yahoo!. Each crawler announces itself through a user-agent string, which identifies the bot and the company that set it to work: Google's crawler uses the user-agent Googlebot, Bing's uses bingbot, and so on. For a list of the top ten crawlers, click on the link below:
This still leaves the question: what does a crawler actually do? Crawler bots attempt to index the billions of interconnected public pages of the internet! Rather than waiting for a user to type a query, they start from a seed set of known and trusted websites, then try to find every other page reachable from that first set by following its links. Crawlers open a page, analyze its contents, see where else it links to, and repeat the process on each newly discovered page, all the while recording the address of what they find.
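The open-a-page, analyze, follow-the-links loop described above is essentially a breadth-first traversal of the link graph. Here is a minimal sketch of that idea. To keep it self-contained, it crawls a tiny made-up in-memory "web" instead of making real HTTP requests; all URLs and page text are invented for illustration:

```python
from collections import deque

# A tiny stand-in for the public web: URL -> (page text, outgoing links).
# Every URL and snippet of text here is fabricated for the example.
FAKE_WEB = {
    "https://seeds.example/home": ("cooking recipes and guides",
                                   ["https://seeds.example/recipes"]),
    "https://seeds.example/recipes": ("bibimbap recipe with gochujang",
                                      ["https://seeds.example/home",
                                       "https://other.example/tips"]),
    "https://other.example/tips": ("kitchen tips", []),
}

def crawl(seed_urls):
    """Breadth-first crawl: open a page, record its content, queue its links."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    pages = {}  # URL -> page text: the crawler's raw haul
    while queue:
        url = queue.popleft()
        text, links = FAKE_WEB.get(url, ("", []))
        pages[url] = text
        for link in links:
            if link not in seen:  # never revisit a page in this pass
                seen.add(link)
                queue.append(link)
    return pages

pages = crawl(["https://seeds.example/home"])
```

Starting from the single seed page, the loop reaches all three pages, including one on a different site, purely by following links.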
The indexing requires that select pieces of information from the analyzed pages be stored in a database. This information includes all the significant, or key, terms in the page, along with a map of where the crawler has been during its analysis, and any links found that lead to further relevant information, among other things. Once a search is performed for these keywords, the search engine doesn't have to scour billions of pages across the Internet to return results; the crawler already did that. At this point, the search engine performs some information retrieval from the databases storing the indexed page information. This can include passage retrieval from the page itself so that a user can determine more closely if clicking on a link is worthwhile.
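The data structure behind this key-term lookup is commonly called an inverted index: instead of mapping pages to their words, it maps each word to the pages that contain it, so answering a query is a fast lookup rather than a fresh crawl. The sketch below shows the idea on a couple of invented pages (the URLs and text are made up, and real indexes store far more, such as term positions for passage retrieval):

```python
from collections import defaultdict

def build_index(pages):
    """Inverted index: term -> set of URLs whose text contains that term."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return only the pages containing every query term -- no re-crawl needed."""
    results = None
    for term in query.lower().split():
        matches = index.get(term, set())
        results = matches if results is None else results & matches
    return results or set()

# Pages the crawler would have gathered ahead of time (fabricated examples).
pages = {
    "https://a.example/bibimbap": "easy bibimbap recipe with rice",
    "https://a.example/kimchi": "kimchi recipe",
}
index = build_index(pages)
hits = search(index, "bibimbap recipe")
```

The query "bibimbap recipe" intersects the page sets for both terms and returns only the bibimbap page, while "recipe" alone would match both.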
If you were searching for a bibimbap recipe, and wanted to specify the ingredients, a crawler would have already indexed the public webpages of the Internet to find the recipes and index the ingredients ahead of your search! The next step in serving a search is determining the ranking and relevance of each result.
Ranking and relevance determine the order in which a search engine presents its results to the user. This is done by determining the perceived importance of the contents of the site, the authority of the site, citations, how fresh the contents are, and how well they match the intention of the original query. The crawler has made this process quick and easy, because all of this information is indexed, ready to be compared to the user’s query.
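As a toy illustration of combining those signals, the sketch below scores each indexed page by how often it matches the query terms, weighted by made-up authority and freshness numbers. All the field names, weights, and pages here are invented; real engines combine hundreds of signals in far more sophisticated ways:

```python
def score(page, query_terms):
    """Toy relevance score: term matches weighted by authority, plus freshness.
    The formula and fields are invented for illustration only."""
    text_terms = page["text"].lower().split()
    matches = sum(text_terms.count(term) for term in query_terms)
    return matches * page["authority"] + page["freshness"]

# Fabricated indexed pages with pre-computed signal values.
pages = [
    {"url": "https://a.example/1", "text": "bibimbap recipe bibimbap",
     "authority": 2.0, "freshness": 0.5},
    {"url": "https://a.example/2", "text": "bibimbap history",
     "authority": 1.0, "freshness": 0.9},
]
query = ["bibimbap", "recipe"]

# Present results best-first, exactly because the signals were indexed ahead of time.
ranked = sorted(pages, key=lambda p: score(p, query), reverse=True)
```

The page with more query matches and higher authority sorts to the top, even though the other page is fresher, which is the kind of trade-off a ranking function has to balance.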
For a more in-depth and technical explanation of the algorithmic processing performed by crawler bots, see Google’s own article on their procedure below:
Web crawlers, spiders, bots: whatever you choose to call them, they are doing the significant and important work of keeping the contents of the internet indexed, all to provide us with fresh, relevant search results in under a second.