It has become apparent that a sort of template is needed: an explanation of what is happening plus an example of the commands in a generic form. First, an outline of the project. We want to keep in mind the eventual process to follow once we progress past the initial stage of making sure that the program is working. We’ll discuss what we are doing here; a complete list of commands that is generally copy/paste ready will be found at the end of this post.

One of the things we want to do is to merge indexes; the reasons for doing so can be found elsewhere on this site. To do that we need some way of dealing with the naming of the various components that will be created.

Another thing to keep in mind is that we want to do everything from one spot, with no changing directories to do the various tasks. For our purposes this means that we’ll give the commands as if one had just logged in to their home directory. If your home directory is named for you, then when you log in as the user “greg” you start off in the absolute directory “/home/greg/”. This is the directory that Nutch is in, so its absolute address is “/home/greg/nutch/”.

A typical Nutch command is

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

What this command does is index pages that have been fetched previously by the crawler. The command is given from inside the nutch directory and references a file named “nutch” in the “bin” folder. The nutch file then accepts the command “index”, followed by a list of directories that we’ll get into in a bit. Rather than working in the “nutch” directory, we want to work from our home directory. This means that the command we will use is “/home/greg/nutch/bin/nutch index …..”.

If your home directory and username are “dave” then you would replace “greg” with “dave”. The command has now become -

/home/greg/nutch/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Of course, “index” is not the first command that we’ll use, but the example illustrates how we will be using absolute addresses. In “crawl/indexes”, “crawl/crawldb”, “crawl/linkdb” and “crawl/segments/*” above, “crawl” refers to a directory created to hold the various directories that a crawl will require. The “indexes”, “crawldb”, “linkdb” and “segments” directories need to be specified for the index command.
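As a minimal sketch of the “replace greg with your own name” step, the absolute path to the nutch file can be built from the shell’s own $HOME variable instead of being typed by hand. The function name nutch_bin is made up for this illustration, and the sketch assumes Nutch sits in a “nutch” directory directly inside the home directory, as described above. It only prints the command; it does not run Nutch.

```shell
# Hypothetical helper (not part of Nutch): print the absolute path of the
# nutch script for a given home directory. Assumes Nutch lives in
# <home>/nutch. A trailing slash on the argument is dropped.
nutch_bin() {
    printf '%s/nutch/bin/nutch\n' "${1%/}"
}

# Print (do not run) the index command with an absolute path to the script.
# The single quotes stop the shell from expanding the * at this point.
echo "$(nutch_bin "$HOME")" 'index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*'
```

Logged in as “greg”, with a home directory of /home/greg, this should print the same command line shown above.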

In this example, we are imagining that we’ll be building a search engine for the city of Nanaimo and that we will be merging indexes to achieve our desired result. We want to “flavor” our “Nanaimo” search index, which means that we will add a “taste” of another subject, another vertical. For the example’s sake we will imagine that people interested in Nanaimo are also interested in the recreational pursuit of “diving”. We will want to add a “taste” of “diving” to the overall “flavor” of the “Nanaimo” search index, and we will eventually achieve this by merging indexes.

For a good example case we will also add a “taste” of “music”. This means that we will be building three separate vertical search engines, and we will need to name our directories to match. To keep things neat we will first create a directory called “nutchcrawls” within our home directory. The absolute address will be “/home/greg/nutchcrawls/”.

Within the “nutchcrawls” directory we will have other directories named for the scope of the crawl. In our example case that means creating three directories: “nanaimo”, “diving” and “music”. It is in each of those directories, inside a “crawl” directory, that the “indexes”, “crawldb”, “linkdb” and “segments” directories referenced in the example command will reside.
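The directory layout just described could be set up with something like the following. This is only a sketch: make_crawl_dirs is a name invented for this illustration, and it creates just the per-scope directories named above, on the assumption that the directories beneath them are produced later by the crawl itself.

```shell
# Hypothetical helper (not part of Nutch): create one directory per crawl
# scope under a root directory.
# Usage: make_crawl_dirs ROOT NAME...
make_crawl_dirs() {
    root="${1%/}"    # drop a trailing slash, if any
    shift
    for name in "$@"; do
        # -p also creates ROOT itself and does not complain if it exists
        mkdir -p "$root/$name" || return 1
    done
}

# Example call for the layout described above:
#   make_crawl_dirs "$HOME/nutchcrawls" nanaimo diving music
```

Passing the root as an argument keeps the helper usable for “dave” as well as “greg”, and for trying things out in a scratch directory first.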

The command now becomes -

/home/greg/nutch/bin/nutch index /home/greg/nutchcrawls/nanaimo/crawl/indexes /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/*
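Since the same long command will be needed for each of the three crawls, it may help to generate it rather than type it. The sketch below prints the index command for each scope without running anything. The function name index_command is invented for this illustration, and it assumes that “diving” and “music” follow exactly the same layout as the “nanaimo” command above.

```shell
# Hypothetical helper (not part of Nutch): print (do not run) the index
# command for one crawl scope, using absolute paths throughout.
# Usage: index_command HOME_DIR SCOPE
index_command() {
    nutch="$1/nutch/bin/nutch"
    crawl="$1/nutchcrawls/$2/crawl"
    # The * is printed literally; the shell expands it only when the
    # printed command is actually run.
    printf '%s index %s/indexes %s/crawldb %s/linkdb %s/segments/*\n' \
        "$nutch" "$crawl" "$crawl" "$crawl" "$crawl"
}

# One command line per vertical from the example.
for scope in nanaimo diving music; do
    index_command "$HOME" "$scope"
done
```

Each printed line can be checked by eye against the command above before being pasted into the shell.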

Now that we know where the results of the crawling and indexing are going to be, and we are familiar with absolute file and directory addressing, we can look at the commands that lead up to the indexing command.