This post explains the following terms and general concepts once; later posts will not explain them again. Nutch is an open-source search engine. It is written in the Java language and makes use of a program called Apache Tomcat (usually just “Tomcat”).

Nutch consists of a crawler, a parser, a database to store webpages, an indexer and a user interface. The crawler “crawls” webpages, or “fetches” them. It knows which pages to crawl based on a “fetchlist”. The URLs of the webpages to be crawled are “injected” into the fetchlist. The parser(s) examine the webpages and store them in the database in the proper form so that they can be “indexed” by the indexer.
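To make the inject, fetch and parse steps concrete, here is a minimal sketch in Java. It is a toy model and not Nutch's real API: the class ToyCrawler and its methods inject, fetch, parse, crawl and storedText are names I made up for illustration, the "fetch" fakes a download instead of touching the network, and the "database" is just an in-memory map.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of inject -> fetch -> parse -> store.
// All names here are hypothetical; this is not how Nutch itself is implemented.
class ToyCrawler {
    // The fetchlist: URLs waiting to be crawled, without duplicates.
    private final Set<String> fetchlist = new LinkedHashSet<>();
    // Stand-in for the webpage database: URL -> parsed text.
    private final Map<String, String> database = new LinkedHashMap<>();

    // "Inject" a URL into the fetchlist.
    void inject(String url) {
        fetchlist.add(url);
    }

    // Stand-in for a network fetch: returns fake HTML for the URL.
    static String fetch(String url) {
        return "<html><body>Page at " + url + "</body></html>";
    }

    // Stand-in for a parser: strips tags so the text is ready for indexing.
    static String parse(String rawHtml) {
        return rawHtml.replaceAll("<[^>]*>", "").trim();
    }

    // Fetch and parse every URL on the fetchlist, then empty the list.
    // Returns the number of pages fetched.
    int crawl() {
        int fetched = 0;
        for (String url : fetchlist) {
            database.put(url, parse(fetch(url)));
            fetched++;
        }
        fetchlist.clear();
        return fetched;
    }

    // Look up the stored, parsed text for a URL (null if never crawled).
    String storedText(String url) {
        return database.get(url);
    }

    public static void main(String[] args) {
        ToyCrawler crawler = new ToyCrawler();
        crawler.inject("http://example.com/a");
        crawler.inject("http://example.com/b");
        System.out.println("Fetched " + crawler.crawl() + " pages");
        System.out.println(crawler.storedText("http://example.com/a"));
    }
}
```

The fetchlist is a set here so that injecting the same URL twice only fetches it once; whether a real crawler handles duplicates at this stage is a design choice I am assuming for the sketch.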

A “crawl” could get 10,000 webpages in one session. Each of these sessions is stored in its own database called a “segment”, so our example segment might have 10,000 webpages stored in it. Each segment can be indexed separately, or several segments can be indexed together at once.
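The idea that a segment can be indexed alone or together with other segments can be sketched like this. Again this is a toy model under my own assumptions, not Nutch's storage or index format: a "segment" is simply a map from URL to page text, the "index" is a small inverted index (term to the set of URLs containing it), and ToySegments and index are made-up names.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model: one segment per crawl session, indexed alone or in a batch.
// Hypothetical names; not Nutch's actual segment or index format.
class ToySegments {
    // Build an inverted index (term -> URLs) over one or more segments.
    // Each segment maps a URL to the parsed text of that page.
    static Map<String, Set<String>> index(List<Map<String, String>> segments) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map<String, String> segment : segments) {
            for (Map.Entry<String, String> page : segment.entrySet()) {
                // Split the page text into lowercase words.
                for (String term : page.getValue().toLowerCase().split("\\W+")) {
                    if (term.isEmpty()) {
                        continue; // skip the empty token a leading separator produces
                    }
                    inverted.computeIfAbsent(term, t -> new TreeSet<>())
                            .add(page.getKey());
                }
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, String> firstCrawl = Map.of("http://example.com/a", "Nanaimo harbour");
        Map<String, String> secondCrawl = Map.of("http://example.com/b", "Nanaimo university");

        // Index one segment on its own...
        System.out.println(index(List.of(firstCrawl)));
        // ...or index both segments at once.
        System.out.println(index(List.of(firstCrawl, secondCrawl)));
    }
}
```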

The user interface is the webpage where a searcher enters their search query; the indexes created by Nutch are then used to provide search results. One important thing that Nutch can do is to “merge” indexes, i.e. the indexes of any number of segments (crawls) can be combined to produce a separate index.

Here is an example of when merging indexes might be useful. Suppose you were building a vertical search engine focused on the city of Nanaimo. You might get 30,000 webpages that are tightly connected to Nanaimo, but tests show that results people would want for British Columbia, the province of Canada where Nanaimo is located, are missing. There might already exist an index for a vertical search based on “BC” which, at its most tightly focused, has 3,000 pages. The manager might also realize that Nanaimo is a university town and want to add some “music/arts” flavor, so they might get an index from a vertical search engine focused on “alternative music”. The “BC”, “music” and “Nanaimo” indexes could then be merged to produce an index with a definitely unique flavor.
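Merging can be pictured as taking the union of what each index knows about every term. The sketch below assumes each index is a simple map from a term to the set of URLs containing it; that representation, the class name ToyIndexMerge and the example URLs are all my own inventions for illustration, and a real merge of on-disk indexes is considerably more involved.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of merging several indexes into one separate index.
// Hypothetical names and data; not Nutch's actual merge implementation.
class ToyIndexMerge {
    // Combine any number of indexes (term -> URLs) into a new index.
    // The input indexes are left unchanged.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> indexes) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> index : indexes) {
            for (Map.Entry<String, Set<String>> entry : index.entrySet()) {
                // Union the URLs for each term across all indexes.
                merged.computeIfAbsent(entry.getKey(), t -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up miniature versions of the three indexes in the example.
        Map<String, Set<String>> nanaimo = Map.of("harbour", Set.of("nanaimo.example/harbour"));
        Map<String, Set<String>> bc = Map.of("harbour", Set.of("bc.example/ferries"));
        Map<String, Set<String>> music = Map.of("band", Set.of("music.example/bands"));

        System.out.println(merge(List.of(nanaimo, bc, music)));
    }
}
```

A query for “harbour” against the merged index would now see pages from both the Nanaimo and the BC index, which is the effect the example above is after.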

To sum up, a crawl is the process of the crawler fetching all the pages in a given fetchlist. The results of the crawl are parsed by the correct parser(s) and stored in segments in a database. Segments are then indexed, and the resulting indexes, each of which can be generated from one or more segments, can be merged to create one overall index. The user queries that index to get search results.