Search


In the previous post we outlined how we are forming the commands and where we are creating some directories. We will be building three search indexes and merging them - the main subject is “nanaimo”, with tastes of “diving” and “music”.

It has become apparent that a sort of template is needed - an explanation of what is happening plus an example of the commands in a generic form. First, an outline of the project: we want to keep in mind the eventual process to follow once we progress past the initial stage of making sure that the program is working, etc. We’ll discuss what we are doing, and a complete list of commands that is generally copy/paste ready will be found at the end of this post.

Nutch (the program we use to create our search sites) is capable of acting in a “distributed” fashion. Nutch is integrated with a program called “Hadoop” that provides the “distributed” functionality. In the case of the geocentric search engines there exists the possibility that Nanaimo, along with 10 other cities on Vancouver Island, might be approached to contribute their resources to an overall “Vancouver Island” vertical search engine, or that 200 towns and cities in BC might all want to contribute to an overall “BC” search engine. Hadoop makes that possible.

After this post we won’t explain the following terms and general concepts again. Nutch is an open-source search engine. It is written in the Java language and makes use of a program called “Tomcat” (“Apache Tomcat”, actually).

Nutch consists of a crawler, a parser, a database to store webpages, an indexer and a user interface. The crawler “crawls” webpages, or “fetches” them. It knows which pages to crawl based on a “fetchlist”. The URLs of the webpages to be crawled are “injected” into the fetchlist. The parser(s) examine the webpages and store them in the database in the proper form so that they can be “indexed” by the indexer.
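To make the flow of inject, fetch, parse and index a little more concrete, here is a toy sketch in Python. It is not Nutch code and does not use Nutch's API; the function names, the `FAKE_WEB` dictionary and the example URLs are all made up for illustration, and the "web" is just an in-memory dictionary so the sketch runs without a network.

```python
from collections import defaultdict
import re

# Hypothetical in-memory "web": URL -> page text (stands in for real HTTP fetches).
FAKE_WEB = {
    "http://example.org/a": "Diving spots near Nanaimo",
    "http://example.org/b": "Live music in Nanaimo",
}

def inject(fetchlist, urls):
    """Add seed URLs to the fetchlist, skipping duplicates."""
    for url in urls:
        if url not in fetchlist:
            fetchlist.append(url)

def fetch(url):
    """Return the raw page for a URL (empty string if the URL is unknown)."""
    return FAKE_WEB.get(url, "")

def parse(raw):
    """Turn raw page text into a list of lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", raw.lower())

def index(database):
    """Build an inverted index: word -> sorted list of URLs containing it."""
    inverted = defaultdict(set)
    for url, tokens in database.items():
        for token in tokens:
            inverted[token].add(url)
    return {word: sorted(urls) for word, urls in inverted.items()}

def search(inverted, word):
    """Look up a single word; this plays the role of the user interface."""
    return inverted.get(word.lower(), [])

def run_pipeline(seeds):
    """Inject seeds, fetch and parse each page into a 'database', then index it."""
    fetchlist = []
    inject(fetchlist, seeds)
    database = {url: parse(fetch(url)) for url in fetchlist}
    return index(database)
```

Calling `run_pipeline(list(FAKE_WEB))` and then `search(idx, "diving")` on the result returns only the page that mentions diving. The real program does each of these steps at a much larger scale and stores far more than a word list, but the order of the stages is the same as described above.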

The normal behavior of an agent that crawls (fetches) web pages is to get the webpage, extract the links on that page to other webpages, then go and get those linked pages, extract the links from them, and repeat. The goal of Google, Yahoo, et al. is to have crawled and indexed every page that exists. Google probably has close to 10 billion pages indexed. What we want is only the minimum number of pages necessary to provide the best results for the searcher, always with the understanding that if they are looking for something really specific then Google is more likely to have that page indexed - hopefully the search wording can be defined well enough that the result will not be #800, behind a seemingly infinite number of hotel reservation and Viagra sales sites.
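The fetch, extract links, repeat loop can be sketched in a few lines of Python. This is only an illustration, not how Nutch does it: the `LINKS` graph, the `.example` hostnames and the `allowed_hosts` and `max_pages` parameters are assumptions made up here to show one simple way of keeping a crawl down to a small, relevant set of pages.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph: URL -> URLs linked from that page.
LINKS = {
    "http://nanaimo.example/": [
        "http://nanaimo.example/diving",
        "http://hotels.example/book",
    ],
    "http://nanaimo.example/diving": [
        "http://nanaimo.example/",
        "http://nanaimo.example/music",
    ],
    "http://nanaimo.example/music": [],
    "http://hotels.example/book": ["http://hotels.example/more"],
}

def crawl(seeds, allowed_hosts, max_pages):
    """Breadth-first crawl that only fetches pages on allowed hosts."""
    queue = deque(seeds)
    seen = set(seeds)
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        # Skip anything outside the hosts we care about (e.g. hotel booking sites).
        if urlparse(url).hostname not in allowed_hosts:
            continue
        fetched.append(url)
        # "Extract" the links on this page and queue the ones we have not seen yet.
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched
```

With the seed `http://nanaimo.example/` and `allowed_hosts={"nanaimo.example"}`, the crawl picks up the three Nanaimo pages and never follows the link out to the hotel site. Without the host check and the page limit, the same loop would keep following links for as long as it found new ones.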

When we say that the search results of our vertical search engine are better than those of Google et al., what we mean is that there is a “flavor” to the results. Our results will never be more comprehensive than theirs because we don’t index anywhere near the number of webpages that they do. A vertical search engine for a city of 100,000 people might well have fewer than 100,000 pages indexed, as opposed to Google’s 5 billion. Of course, most people rarely delve down to the #818 result using Google, the point being that all those pages that Google has tend to obscure a lot of the pages that have real relevance.

We are in the process of introducing some technologies for geogeneric sites. For the most part, right now those technologies are WordPress plugins for general use across all geo-generics, but there are also a few generic-specific ones, and specifically one for the “search” generic. We will eventually be demonstrating it at “NanaimoSearch(dot)com”. We are developing a vertical search engine process that geos such as “NanaimoSearch(dot)com”, “ChicagoSearch(dot)com”, “LosAngelesSearch(dot)com”, etc. can use to provide better, more relevant results than the major search engines such as Google, Yahoo and MSN.
