Sun 22 Mar 2009
Vertical Search Engine Template - Command Summary
Posted by admin under Search
To sum up, these are the commands that we’ve used to go through three iterations of crawling -
/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt
/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments
/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch updatedb /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch invertlinks /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments -topN 1000
/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019
/home/greg/nutch/bin/nutch updatedb /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019
/home/greg/nutch/bin/nutch invertlinks /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019
/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments -topN 1000
/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317175829
/home/greg/nutch/bin/nutch updatedb /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317175829
/home/greg/nutch/bin/nutch invertlinks /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317175829
/home/greg/nutch/bin/nutch index /home/greg/nutchcrawls/nanaimo/crawl/indexes /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601 /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019 /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317175829
Notice that we skipped the indexing between fetches. Indexing can be done at any point, because the indexer operates on segments. The last command is where we create the index for all three segments, and we name the three segments explicitly. Again, in this case we could have used a wildcard, like this -
/home/greg/nutch/bin/nutch index /home/greg/nutchcrawls/nanaimo/crawl/indexes /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/*
That would have accomplished the same thing. Later, when we mix segments from different base search engines (for example, one segment from the music crawls, one from the diving crawls and two from the nanaimo crawls), we will have to name the segments explicitly, so we have made certain to demonstrate that form here and get used to the technique.
The commands listed will create an index built from three rounds of crawling. Fetching more rounds should by now be easy enough: just repeat the generate, fetch, updatedb and invertlinks steps for each new segment. Remember, it is better to go slowly and make sure that you understand the process.
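One more round can be sketched as a small shell function. This is only a sketch under a few assumptions: the function names newest_segment and crawl_round are made up for illustration, the NUTCH and CRAWL variables default to the paths used in this post, and the newest segment is assumed to be the one that generate just created (segment names are timestamps, so it sorts last).

```shell
#!/bin/sh
# Defaults are the paths used in this post; override them for your own setup.
NUTCH="${NUTCH:-/home/greg/nutch/bin/nutch}"
CRAWL="${CRAWL:-/home/greg/nutchcrawls/nanaimo/crawl}"

# Hypothetical helper: print the newest segment directory under $1/segments.
# Segment names are timestamps, so the last one in sorted order is the newest.
newest_segment() {
    ls -d "$1"/segments/* | sort | tail -n 1
}

# Hypothetical helper: run one generate/fetch/updatedb/invertlinks round,
# stopping at the first command that fails.
crawl_round() {
    "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 1000 || return 1
    segment=$(newest_segment "$CRAWL")
    [ -n "$segment" ] || return 1
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$CRAWL/crawldb" "$segment" || return 1
    "$NUTCH" invertlinks "$CRAWL/linkdb" "$segment" || return 1
}
```

Calling crawl_round once runs the same four commands as each round above. While you are still learning the process it is probably worth running the commands by hand first and only then wrapping them up like this.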
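The commands ought to be ready to cut and paste, substituting your own user and directories as needed. The segment names (such as 20090317080601) are timestamps created by the generate step, so yours will be different and need to be substituted as well. Copy the whole thing, do a search and replace, and then enter the commands one at a time. Later we’ll work with cron to automate it.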
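If you would rather do the search and replace from the command line than in an editor, something like the following would work. It is a sketch only: the function name retarget is made up for illustration, and it assumes the new values contain no # or & characters, which would confuse sed.

```shell
# Hypothetical helper: rewrite the home directory and crawl name in a
# command list read from standard input.
#   $1 = new home directory (replaces /home/greg)
#   $2 = new crawl name     (replaces nanaimo)
retarget() {
    sed -e "s#/home/greg#$1#g" -e "s#nanaimo#$2#g"
}
```

For example, with the commands saved in a file (commands.txt here is just a placeholder name), retarget /home/alice victoria < commands.txt prints the list rewritten for user alice and a crawl called victoria. The segment timestamps still have to be filled in by hand as you go.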