In the last post we created a file with some urls in it and injected them into a “crawldb” directory within a “crawl” directory within a “nanaimo” directory within a “nutchcrawls” directory within our home directory. The “crawldb” directory was created by the inject command that we used:

/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt

The next task is to generate the other resources that Nutch will need. This is accomplished with the “generate” command. Here is the one we’ll be using:

/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments

The first part, “/home/greg/nutch/bin/nutch generate …”, tells Nutch what to do. The second part is the absolute path of the crawldb directory, which holds the urls to fetch. The last part defines the directory where the results of a crawl (fetch) will be kept. The command creates some directories within the “/home/greg/nutchcrawls/nanaimo/crawl/” directory.
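To make the three parts easier to see, here is a small sketch that builds the same command from two variables. The names NUTCH_HOME, CRAWL_DIR and generate_cmd are just ones picked for this example, not anything Nutch requires, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Assumed locations, copied from the paths used in this post.
NUTCH_HOME=/home/greg/nutch
CRAWL_DIR=/home/greg/nutchcrawls/nanaimo/crawl

# Print the generate command as a dry run:
#   part 1: the nutch script and the "generate" action
#   part 2: the crawldb directory holding the urls to fetch
#   part 3: the directory where segments will be created
generate_cmd() {
    echo "$NUTCH_HOME/bin/nutch" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"
}

generate_cmd
```

Once the printed line looks right, taking the “echo” out of the function should make it run the command itself.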

The command above produces this result:

Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317091415
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.

The command

ls /home/greg/nutchcrawls/nanaimo/crawl/

shows that a “segments” directory has been created within the “crawl” directory. Within it there is one directory, named for the time it was created: “20090317091415” (09:14:15 on 17 March 2009).
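Since the segment name is a timestamp, it will be different on every run, so a script can’t hard-code it. Here is a minimal sketch of one way to pick up the newest segment. The function name latest_segment is made up for this example, and it assumes that everything inside “segments” is a timestamp-named segment directory, so the last name in sorted order is the newest.

```shell
#!/bin/sh
# Print the full path of the newest segment under the given segments directory.
# Segment names are timestamps (yyyyMMddHHmmss), so sorting the names
# puts the most recent one last.
latest_segment() {
    segments_dir=$1
    newest=$(ls "$segments_dir" 2>/dev/null | sort | tail -n 1)
    if [ -n "$newest" ]; then
        echo "$segments_dir/$newest"
    else
        return 1
    fi
}

# Path taken from this post; prints a message if no segment exists yet.
latest_segment /home/greg/nutchcrawls/nanaimo/crawl/segments || echo "no segments found"
```

With only the one segment from above, this prints the path ending in “20090317091415”.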

The next command is the “fetch” command.