The normal behavior of an agent that crawls (fetches) web pages is to get a page, extract the links on that page to other pages, then fetch those linked pages, extract their links, and repeat. The goal of Google, Yahoo, et al. is to have crawled and indexed every page that exists. Google probably has close to 10 billion pages indexed. What we want is only the minimum number of pages necessary to provide the best results for the searcher, always with the understanding that if they are looking for something really specific then Google is more likely to have that page indexed. In that case, hopefully the search wording can be defined well enough that the result will not be #800, behind a seemingly infinite number of hotel reservation and Viagra sales sites.
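
As a rough illustration of the "extract the links" step in that loop, here is a minimal sketch using only Python's standard library. The class and function names (LinkExtractor, extract_links) are made up for this example and are not taken from any actual crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/events" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from one fetched page (hypothetical helper)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A general-purpose crawler would queue every URL this returns; the rest of this piece is about deciding which of them are worth following.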

We are starting from the assumption (based on experience) that an index of somewhere under 100,000 pages is enough to make the overall search experience better than Google’s, at least for a medium-sized city. The question then becomes “Which 100,000 pages?” The responsibility for that decision lies primarily with the crawler and partially with the human manager of the search engine. The crawler makes the immediate decision whether to crawl and index a page; the person managing the site reviews the results and adjusts the behavior of the crawler by using the appropriate “tool” at their disposal.

“Intelligence” when applied to the crawler means that it has a number of rules that it follows in order to decide if it wants to index a page, or even to contemplate indexing a page. Suppose that it has gone and fetched a page. The next thing it does is examine that page for a number of factors. First it must decide if it even wants to know about the page. Perhaps the page can be assumed to be pornographic, based on the text. In that case it discards the page and goes no further, except to decide whether to ban all content from the entire offending site.

Assuming that the initial evaluation finds no reason to abort the examination of the page, the next step is to look for elements that might indicate some degree of relevance to the subject matter of the vertical search engine. A search engine for “Nanaimo” might well look for “BC”, “Canada”, “west coast”, “pacific”, “Vancouver Island”, “harbor city” as relevant to Nanaimo even if the actual word “Nanaimo” isn’t on the page. It would be the responsibility of the manager to make a list of such relevant words, which means that the manager ought to be familiar with the various subtleties of the area that the vertical search engine intends to cover. A search engine for “Vancouver” might want to look for occurrences of the term “olympics”, for example.
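
One simple way to turn such a word list into a number is to give each term a weight and add up the weights of the terms found on the page. The sketch below assumes exactly that; the weights, the RELEVANT_TERMS name and the relevance_value function are illustrative guesses, not a description of how any real crawler scores pages.

```python
import re

# Hypothetical weights a manager might maintain for a "Nanaimo" engine.
RELEVANT_TERMS = {
    "nanaimo": 3.0,
    "vancouver island": 2.0,
    "harbor city": 2.0,
    "bc": 1.0,
    "canada": 1.0,
    "west coast": 1.0,
    "pacific": 1.0,
}


def relevance_value(text, terms=RELEVANT_TERMS):
    """Sum the weights of the terms that occur in the text as whole words or phrases."""
    lowered = text.lower()
    score = 0.0
    for term, weight in terms.items():
        # \b stops "bc" from matching inside a longer word such as "abc".
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            score += weight
    return score
```

Under these assumed weights, a page mentioning “Vancouver Island” and “BC” scores 3.0 without ever containing the word “Nanaimo”, while a generic hotel booking page scores 0.0.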

The crawler assigns a “relevance value” to the page based on this and other factors. Then, assuming that there is cause to be optimistic, it extracts the links and examines them to see if there is reason to crawl the linked pages. If it finds that one or more links go to a site previously labelled as a “spam” site, the crawler will make a decision and potentially test that decision before making it final.

The crawler can be set to follow links some number of levels past a page, depending on that page’s relevance: for example, it may go out two more depths of crawl from a page with a certain degree of relevance and only one more level from another. The idea is to try and follow a “chain” but to know when to lop it off and which chains to follow. The crawler is helped by the manager, who will “inject” URLs that they deem helpful and cut off chains that the crawler should have cut but didn’t. The manager can ban pages and sites and disallow the crawler from following links from spam sites, i.e. cut off chains. This requires the manager to periodically do searches, assess the results and reward or penalize according to their assessment.
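
The chain-following idea can be sketched as a breadth-first crawl in which every queued URL carries a budget of remaining hops. Everything here is an assumption made for illustration: the thresholds in extra_depth, the crawl signature, and the fetch and score callables, which the caller is expected to supply.

```python
from collections import deque
from urllib.parse import urlsplit


def extra_depth(relevance, high=5.0, low=2.0):
    """Hypothetical rule: how many more levels to follow past a page."""
    if relevance >= high:
        return 2
    if relevance >= low:
        return 1
    return 0


def crawl(seeds, fetch, score, banned_hosts=frozenset()):
    """Return the URLs worth indexing, in the order they were visited.

    seeds        -- URLs "injected" by the manager; these are always fetched
    fetch(url)   -- caller-supplied, returns (text, links)
    score(text)  -- caller-supplied relevance value
    banned_hosts -- hosts the manager has banned; links to them are dropped
    """
    indexed = []
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, hops_left = queue.popleft()
        text, links = fetch(url)
        earned = extra_depth(score(text))
        if earned > 0:
            indexed.append(url)
        # A relevant page renews the budget; an irrelevant one spends it.
        budget = max(earned, hops_left)
        if budget == 0:
            continue  # lop off the chain here
        for link in links:
            if link in seen or urlsplit(link).hostname in banned_hosts:
                continue
            seen.add(link)
            queue.append((link, budget - 1))
    return indexed
```

With this rule a highly relevant page lets the crawler look two levels further even if the pages in between are irrelevant, while a banned host cuts the chain no matter how promising the linking page was.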

It is also possible and potentially desirable to allow some degree of management by the end user. Should the user do a search and find an offensive page amongst the results, they might have the means to report the page. The page or site might be flagged for managerial review, or it might be affected immediately, or only once it passes a certain threshold of user dissatisfaction.
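
A threshold scheme like this could be as simple as counting distinct reporters per page. The sketch below is only one possible shape for it; the ReportTracker class, the threshold defaults and the three status strings are invented for the example.

```python
from collections import defaultdict


class ReportTracker:
    """Tracks which users have reported which pages (hypothetical design)."""

    def __init__(self, review_threshold=1, hide_threshold=5):
        self.review_threshold = review_threshold
        self.hide_threshold = hide_threshold
        self.reporters = defaultdict(set)

    def report(self, url, user_id):
        """Record one report and return the page's resulting status."""
        # A set means repeated reports from the same user count only once.
        self.reporters[url].add(user_id)
        count = len(self.reporters[url])
        if count >= self.hide_threshold:
            return "hidden"   # acted on immediately
        if count >= self.review_threshold:
            return "review"   # flagged for the manager
        return "ok"
```

Counting distinct users rather than raw reports is a design choice made here so that one annoyed searcher cannot push a page over the threshold alone.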

The rules that the crawler follows will be constantly refined to deliver the best results, and those rules can be shared by all of the geo-based search engines that we support. Rules that apply to blog search and to news search must be developed as well. Recrawling intervals must be set, and there must be differing intervals based on a number of factors. The more intelligent the crawler, the less menial work the manager has to do, and therefore the interventions that the manager does make will incrementally improve the results.

The major search engines cannot feasibly have human intervention for every subject and geo location, so the results of the vertical search engines will be better than Google’s except, as noted many times, in the case of a very specific and unique result that the searcher is attempting to find.