Fri 13 Mar 2009
Vertical Search Engine Template - Extract Links
Posted by admin under Search
So far we have a working index of a database of three pages. The next step is to extract the links from those pages and fetch the pages they point to. We start with the “generate” command that we used before, which selects the links to fetch. This time we will add a “limiter” to cap the number of links that we get. With only three pages so far the limit won’t come into play, but it is in our example for illustrative purposes.
So far we have run the following commands -
/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt
/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments
/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch updatedb /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch invertlinks /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch index /home/greg/nutchcrawls/nanaimo/crawl/indexes /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
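Since every command repeats the same long paths, one way to keep them readable is to hold the paths in shell variables. The following is only a sketch, not part of the steps above: the variable names (NUTCH, CRAWL, SEGMENT) and the "run" helper are our own inventions, and the helper prints each command instead of executing it so the script can be tried without a Nutch install.

```shell
#!/bin/sh
# Hypothetical variable names; the paths are copied from the commands above.
NUTCH=/home/greg/nutch/bin/nutch
CRAWL=/home/greg/nutchcrawls/nanaimo/crawl
SEGMENT=$CRAWL/segments/20090317080601

# Dry-run helper: prints the command line it is given.
# Replace the body with "$@" to actually execute the command.
run() {
    printf '%s\n' "$*"
}

run "$NUTCH" fetch "$SEGMENT"
run "$NUTCH" updatedb "$CRAWL/crawldb" "$SEGMENT"
run "$NUTCH" invertlinks "$CRAWL/linkdb" "$SEGMENT"
```

Only SEGMENT changes from one crawl round to the next, so the rest of the script can be reused as is.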
Now we will run the generate command again -
/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments -topN 1000
The last part of the command, “-topN 1000”, tells Nutch that we only want the 1000 top-scoring links. Again, we won’t get that many at this point, but the concept of “-topN” is important to grasp. If we had 100,000 pages in our database there might be 30,000 new pages to fetch, and that could “dilute” the “flavor” of our search engine too much. In that case we might tell it, for example, to fetch only the top 2000 links.
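To make the idea of “top scoring” concrete, here is a small sketch that has nothing to do with how Nutch implements the selection internally. It simply sorts a made-up list of “score url” pairs and keeps the N highest. The scores, the example.com URLs and the "top_n" function are all invented for illustration.

```shell
#!/bin/sh
# Made-up "score url" pairs standing in for candidate links.
candidates='0.2 http://example.com/a
1.5 http://example.com/b
0.9 http://example.com/c
0.1 http://example.com/d'

# Keep the N highest-scoring lines: sort numerically on the first
# field, highest first, then cut the list off after N entries.
top_n() {
    LC_ALL=C sort -k1,1 -rn | head -n "$1"
}

top2=$(printf '%s\n' "$candidates" | top_n 2)
echo "$top2"
```

With a limit of 2 only the two best-scoring links survive; the other two are left for a later round.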
Before we run the command we’ll examine it a little more thoroughly and discuss what we are trying to do and what the best way to accomplish it would be.
The command examines the links in “/home/greg/nutchcrawls/nanaimo/crawl/crawldb” and creates a new segment in the “/home/greg/nutchcrawls/nanaimo/crawl/segments” directory to hold the soon-to-be-fetched pages. We could quite easily specify a different directory if we wanted to. For the time being it is simpler to use the existing “segments” directory, but the fact that it holds actual Nutch “segments” doesn’t mean that it has to be named “segments”.
It could just as easily be named “news_segments”, “test_segments” or any other label that better describes the contents. Our “nanaimo” search engine might contain many “segments” directories if it suits our purposes. The same applies to a lot of the other directories that we have created. For now we’ll keep it simple, but we want to introduce some of the variations that can occur so that they won’t be completely unexpected when it comes time to use different naming schemes.
Running the generate command gives an output of -
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019
Generator: filtering: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
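The new segment is named after the time it was created, as the “segment:” line in the output shows. Because such names sort in chronological order, a script can find the newest segment without copying the name by hand. The sketch below assumes that naming pattern holds for every entry in the directory; it builds a throwaway directory with two empty stand-in segments, and the "latest_segment" function and SEGMENTS_DIR variable are our own names.

```shell
#!/bin/sh
# Throwaway stand-in for the real "segments" directory.
SEGMENTS_DIR=$(mktemp -d)
mkdir "$SEGMENTS_DIR/20090317080601" "$SEGMENTS_DIR/20090317171019"

# Timestamp names sort chronologically, so the last entry in
# sorted order is the most recently generated segment.
latest_segment() {
    ls -d "$1"/* | sort | tail -n 1
}

NEWEST=$(latest_segment "$SEGMENTS_DIR")
echo "$NEWEST"
```

Pointing SEGMENTS_DIR at the real segments directory would give the path to pass to the next fetch.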
Next we will fetch the second batch of urls.