Mon 2 Mar 2009
Vertical Search Engine Template - Inject URLs
Posted by admin under Search
In the previous post we outlined how we form the commands and where we create the directories. We will be building three search indexes and merging them - the main subject is “nanaimo”, with tastes of “diving” and “music”.
The first task is to have a set of urls to begin the crawl - in this first case we will keep the list in a file. We will create a directory called “search_urls” in our home directory - the absolute path will be “/home/greg/search_urls/”. This directory will hold our files with urls in them. Inside it we’ll need three directories: “nanaimo”, “music” and “diving”. In each we’ll place a seed file with an appropriate name.
In the “/home/greg/search_urls/nanaimo/” directory we’ll create a file called “initial.txt”. In that file are a number of urls, one per line.
http://www.nanaimo.ca/
http://www2.canada.com/nanaimodailynews
http://www.rdn.bc.ca/
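The seed directories and the file above could be set up with a few shell commands. This is only a minimal sketch: it assumes your own home directory plays the role of “/home/greg”, and the BASE and SEED_DIR variables are names introduced here for illustration.

```shell
#!/bin/sh
# Assumption: BASE stands in for /home/greg; override it if your layout differs.
BASE="${BASE:-$HOME}"
SEED_DIR="$BASE/search_urls"

# One directory per subject, as described in the post.
mkdir -p "$SEED_DIR/nanaimo" "$SEED_DIR/music" "$SEED_DIR/diving"

# Seed urls for the main subject, one per line.
cat > "$SEED_DIR/nanaimo/initial.txt" <<'EOF'
http://www.nanaimo.ca/
http://www2.canada.com/nanaimodailynews
http://www.rdn.bc.ca/
EOF

# Show what was written.
cat "$SEED_DIR/nanaimo/initial.txt"
```

Note that the heredoc overwrites any existing “initial.txt”, so run it only when starting a fresh seed list.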
Note that before we start we have a directory “/home/greg/” with three directories in it - “nutch”, “nutchcrawls”, “search_urls”. This can be seen with the command
ls /home/greg/
We have three directories within the “nutchcrawls” directory - “nanaimo”, “diving” and “music”. This can be seen with the command
ls /home/greg/nutchcrawls/
At this point those directories have nothing in them. Our job is to “inject” some urls into the crawl database (the “crawldb”, from which Nutch later generates its fetchlists) and then to crawl those urls. We start with the command -
/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt
The first part, “/home/greg/nutch/bin/nutch inject”, tells Nutch to run the “inject” command. The next part, “/home/greg/nutchcrawls/nanaimo/crawl/crawldb”, tells Nutch which crawldb directory to add the urls to. The last part, “/home/greg/search_urls/nanaimo/initial.txt”, tells Nutch where to find the list of urls to inject.
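Since the same three-part command will be needed for each subject, it can be wrapped in a small loop. The sketch below only prints the commands by default. DRY_RUN and inject_cmd are hypothetical names added for illustration, and it assumes the “music” and “diving” seed files are also called “initial.txt”, which this post does not specify.

```shell
#!/bin/sh
# Assumptions: BASE stands in for /home/greg, and every subject's seed
# file is named initial.txt (the post only names the nanaimo one).
BASE="${BASE:-/home/greg}"
NUTCH="$BASE/nutch/bin/nutch"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = run

inject_cmd() {
    # $1 is the subject name, e.g. nanaimo
    echo "$NUTCH inject $BASE/nutchcrawls/$1/crawl/crawldb $BASE/search_urls/$1/initial.txt"
}

for subject in nanaimo diving music; do
    cmd=$(inject_cmd "$subject")
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        # Unquoted on purpose so the string splits into command and
        # arguments; this relies on the paths containing no spaces.
        $cmd
    fi
done
```

Setting DRY_RUN=0 would run the three inject commands for real, one subject after another.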
Previously the “/home/greg/nutchcrawls/nanaimo/” directory was empty. After running the above command there is now a “crawl” directory within it, and a “crawldb” directory within that. Here is the output of the command -
Injector: starting
Injector: crawlDb: /home/greg/nutchcrawls/nanaimo/crawl/crawldb
Injector: urlDir: /home/greg/search_urls/nanaimo/initial.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
The command -
ls /home/greg/nutchcrawls/nanaimo/
shows a “crawl” directory has been created and the command -
ls /home/greg/nutchcrawls/nanaimo/crawl
shows a “crawldb” directory has been created - just as specified in the “inject” command.
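The same check can be scripted so that it is easy to repeat for each subject. This is a rough sketch: BASE again stands in for “/home/greg”, and check_crawldb is a hypothetical helper that only tests whether the directory exists.

```shell
#!/bin/sh
# Assumption: BASE stands in for /home/greg.
BASE="${BASE:-/home/greg}"

check_crawldb() {
    # $1 is the subject name; succeeds only if crawl/crawldb exists under it.
    if [ -d "$BASE/nutchcrawls/$1/crawl/crawldb" ]; then
        echo "$1: crawldb present"
    else
        echo "$1: crawldb missing"
        return 1
    fi
}

check_crawldb nanaimo || echo "run the inject command first"
```

To look inside the crawldb rather than just confirm that it exists, Nutch's “readdb” command with the “-stats” option should report how many urls it holds.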