Posted by admin under Search
This post explains a few terms and general concepts that later posts will use without further explanation. Nutch is an open-source search engine. It is written in Java and uses Apache Tomcat (usually just called “Tomcat”), a Java servlet container, to serve its search interface.
Nutch consists of a crawler, a parser, a database to store webpages, an indexer and a user interface. The crawler “crawls” webpages, that is, it “fetches” them. It knows which pages to crawl from a “fetchlist”. The URLs of the webpages to be crawled are first “injected” into Nutch’s database of known URLs, and the fetchlist is generated from that database. The parser(s) examine the fetched webpages and store them in the database in a form that the indexer can “index”.
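To make the flow between these components concrete, here is a minimal in-memory sketch in Java. It is a toy model and does not use Nutch’s real classes or API: the class name `MiniCrawlPipeline`, all of its methods, and the `fakeWeb` map standing in for the network are hypothetical names invented for illustration. In the sketch, injected URLs are kept in a set of known URLs, a fetchlist is generated from that set, and the “parser” simply strips HTML tags.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of inject -> fetchlist -> fetch -> parse -> index. Not Nutch's API. */
class MiniCrawlPipeline {
    // URLs the system knows about (filled by "inject").
    private final Set<String> knownUrls = new LinkedHashSet<>();
    // Fetchlist: URLs queued for the crawler.
    private final Deque<String> fetchlist = new ArrayDeque<>();
    // "Database" of parsed pages: URL -> plain text.
    private final Map<String, String> pageDb = new HashMap<>();
    // Inverted index: term -> URLs whose text contains the term.
    private final Map<String, Set<String>> index = new HashMap<>();
    // Stand-in for the web: URL -> raw HTML (hypothetical sample data).
    private final Map<String, String> fakeWeb;

    MiniCrawlPipeline(Map<String, String> fakeWeb) {
        this.fakeWeb = fakeWeb;
    }

    /** Inject a seed URL into the set of known URLs. */
    void inject(String url) {
        knownUrls.add(url);
    }

    /** Build the fetchlist from known URLs that have not been stored yet. */
    void generateFetchlist() {
        for (String url : knownUrls) {
            if (!pageDb.containsKey(url) && !fetchlist.contains(url)) {
                fetchlist.add(url);
            }
        }
    }

    /** Crawl: fetch every URL on the fetchlist, parse it, store the result. */
    void crawl() {
        while (!fetchlist.isEmpty()) {
            String url = fetchlist.poll();
            String html = fakeWeb.get(url);          // "fetch"
            if (html == null) {
                continue;                            // page not reachable
            }
            // "parse": crude tag stripping, enough for this sketch only
            String text = html.replaceAll("<[^>]*>", " ").trim();
            pageDb.put(url, text);
        }
    }

    /** Index every stored page by its lower-cased words. */
    void indexAll() {
        for (Map.Entry<String, String> page : pageDb.entrySet()) {
            String[] terms = page.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(page.getKey());
            }
        }
    }

    /** Look up the URLs of pages containing a single term. */
    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, String> web = new HashMap<>();
        web.put("http://example.com/a", "<html><body>Open source search</body></html>");
        MiniCrawlPipeline pipeline = new MiniCrawlPipeline(web);
        pipeline.inject("http://example.com/a");
        pipeline.generateFetchlist();
        pipeline.crawl();
        pipeline.indexAll();
        System.out.println(pipeline.search("search"));
    }
}
```

The user interface is reduced here to the `search` method. A real search engine would also follow links found while parsing, store much more per page, and rank its results, none of which this sketch attempts.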