In the previous post we outlined how we form the commands and where we create our directories. We will build three search indexes and merge them - the main subject is “nanaimo”, with tastes of “diving” and “music”.

The first task is to assemble a set of urls to begin the crawl - in this first case we will keep the list in a file. We will create a directory called “search_urls” in our home directory - the absolute address will be “/home/greg/search_urls/”. This directory will hold our files with urls in them. Within it we’ll need three directories - “nanaimo”, “music” and “diving”. In each we’ll place an appropriately named file.

In the “/home/greg/search_urls/nanaimo/” directory we’ll create a file called “initial.txt”. That file contains a number of urls, one per line.

http://www.nanaimo.ca/
http://www2.canada.com/nanaimodailynews
http://www.rdn.bc.ca/
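
The directory and file setup described above could be scripted roughly as follows. This is only a sketch: the function name “make_seed_dirs” is made up for illustration, and the base directory is passed in as an argument (in this post it would be /home/greg). Only the “nanaimo” seed file is written, since the file names for “diving” and “music” have not been chosen yet.

```shell
#!/bin/sh
# Sketch: create the seed-url directories and the nanaimo seed file.
# make_seed_dirs is a hypothetical helper name, not part of Nutch.
# $1 = base directory (this post uses /home/greg).
make_seed_dirs() {
    base="$1"

    # One directory per topic under search_urls.
    for topic in nanaimo diving music; do
        mkdir -p "$base/search_urls/$topic"
    done

    # The nanaimo seed list, one url per line.
    cat > "$base/search_urls/nanaimo/initial.txt" <<'EOF'
http://www.nanaimo.ca/
http://www2.canada.com/nanaimodailynews
http://www.rdn.bc.ca/
EOF
}

# Usage (not run automatically):
#   make_seed_dirs /home/greg
```

Running “mkdir -p” is harmless if the directories already exist, so the function can be called more than once.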

Note that before we start we have a directory “/home/greg/” with three directories in it - “nutch”, “nutchcrawls”, “search_urls”. This can be seen with the command

ls /home/greg/

We have three directories within the “nutchcrawls” directory - “nanaimo”, “diving” and “music”. This can be seen with the command

ls /home/greg/nutchcrawls/

At this point those directories have nothing in them. Our job is to “inject” some urls into the crawl database (the “crawldb”) and then to crawl those urls. We start with the command -

/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt

The first part “/home/greg/nutch/bin/nutch inject …” tells Nutch to run the “inject” command. The next part “… /home/greg/nutchcrawls/nanaimo/crawl/crawldb …” tells Nutch which crawldb directory to add the urls to. The last part “/home/greg/search_urls/nanaimo/initial.txt” tells Nutch where to find the list of urls to inject.
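
Since the same three-part command will be needed for each of the three topics, here is a small sketch that assembles it from its parts and prints it without running anything. The variable names NUTCH_HOME, CRAWL_ROOT and URL_ROOT and the function name “inject_cmd” are made up for illustration; their default values are the paths used in this post. The seed file name “initial.txt” is only known for “nanaimo”; for “diving” and “music” it is an assumption.

```shell
#!/bin/sh
# Hypothetical variable names; the defaults are the paths used in this post.
NUTCH_HOME="${NUTCH_HOME:-/home/greg/nutch}"
CRAWL_ROOT="${CRAWL_ROOT:-/home/greg/nutchcrawls}"
URL_ROOT="${URL_ROOT:-/home/greg/search_urls}"

# Print the inject command for one topic (does not run Nutch).
# $1 = topic, $2 = seed file name (assumed to be initial.txt if omitted).
inject_cmd() {
    topic="$1"
    seed="${2:-initial.txt}"
    printf '%s inject %s %s\n' \
        "$NUTCH_HOME/bin/nutch" \
        "$CRAWL_ROOT/$topic/crawl/crawldb" \
        "$URL_ROOT/$topic/$seed"
}

# Dry run: show the command for each of the three topics.
for topic in nanaimo diving music; do
    inject_cmd "$topic"
done
```

Printing the commands first makes it easy to check the paths by eye before actually running them.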

Previously the “/home/greg/nutchcrawls/nanaimo/” directory was empty, but after running the above command there is now a “crawl” directory within it, and a “crawldb” directory within that. Here is the output of the command -

Injector: starting
Injector: crawlDb: /home/greg/nutchcrawls/nanaimo/crawl/crawldb
Injector: urlDir: /home/greg/search_urls/nanaimo/initial.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done

The command -

ls /home/greg/nutchcrawls/nanaimo/

shows a “crawl” directory has been created and the command -

ls /home/greg/nutchcrawls/nanaimo/crawl

shows a “crawldb” directory has been created - just as specified in the “inject” command.
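
The two “ls” checks above can also be rolled into one small test. This is a sketch only: “check_crawldb” is a made-up helper name, and it does nothing more than test that the “crawl/crawldb” directory exists for a given topic.

```shell
#!/bin/sh
# Sketch: report whether inject has created crawl/crawldb for a topic.
# check_crawldb is a hypothetical helper name, not part of Nutch.
# $1 = crawl root (this post uses /home/greg/nutchcrawls), $2 = topic.
check_crawldb() {
    crawl_root="$1"
    topic="$2"
    if [ -d "$crawl_root/$topic/crawl/crawldb" ]; then
        echo "$topic: crawldb present"
    else
        echo "$topic: crawldb missing"
        return 1
    fi
}

# Usage (not run automatically):
#   check_crawldb /home/greg/nutchcrawls nanaimo
```

The function returns a non-zero status when the directory is missing, so it can be used in a script to stop before the next step.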