Fri 3 Apr 2009
Intelligent Link Generation - Initial Thoughts
Posted by admin under Search
After a fetch or an index (I’m not sure exactly where or when as I write this, but at some point or points) we’ll want to rank the links to follow, keep the ones we deem acceptable, and either discard the rest or store them somewhere with a “do not index or follow” instruction.
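To make the ranking step a bit more concrete, here is a rough Python sketch of scoring outlinks and splitting them into a "follow" list and a "hold back" list. Everything in it is an assumption for illustration: the `Link` class, the `TOPIC_TERMS` set, the `THRESHOLD` value and the naive term-matching score are made up and are not part of Nutch, which would need this logic wired in through its own plugin mechanisms.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical topic vocabulary and cutoff; neither comes from Nutch.
TOPIC_TERMS = {"search", "crawler", "index"}
THRESHOLD = 0.3


@dataclass
class Link:
    url: str
    anchor_text: str


def score_link(link):
    """Fraction of topic terms found in the anchor text or URL path."""
    text = (link.anchor_text + " " + urlparse(link.url).path).lower()
    hits = sum(1 for term in TOPIC_TERMS if term in text)
    return hits / len(TOPIC_TERMS)


def select_links(links, threshold=THRESHOLD):
    """Return (follow, hold_back), each ordered from best to worst score."""
    scored = sorted(
        ((score_link(link), link) for link in links),
        key=lambda pair: pair[0],
        reverse=True,
    )
    follow = [link for score, link in scored if score >= threshold]
    # These would be discarded or stored with a "do not index or follow" flag.
    hold_back = [link for score, link in scored if score < threshold]
    return follow, hold_back
```

A real scoring function would probably weigh more signals than anchor text and URL path, but the shape stays the same: score, sort, cut at a threshold.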
Then, after Nutch has crawled the selected links, we examine each retrieved page to see what is on it and decide whether to keep the page at all and whether to follow any of the links on it.
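The per-page decision could look something like the following Python sketch. It assumes we already have the page's plain text, and it uses a deliberately crude relevance measure (the share of words that are topic terms). The names `TOPIC_TERMS`, `KEEP_THRESHOLD` and `FOLLOW_THRESHOLD`, and the numbers behind them, are hypothetical placeholders rather than anything Nutch provides.

```python
import re

# Hypothetical topic vocabulary and cutoffs, chosen only for illustration.
TOPIC_TERMS = {"search", "crawler", "index"}
KEEP_THRESHOLD = 0.01    # below this the page is not worth keeping
FOLLOW_THRESHOLD = 0.03  # at or above this we also follow its links


def relevance(text):
    """Share of the words in the page text that are topic terms."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(1 for word in words if word in TOPIC_TERMS) / len(words)


def decide(text):
    """Return (keep_page, follow_links) for one retrieved page."""
    score = relevance(text)
    keep_page = score >= KEEP_THRESHOLD
    follow_links = score >= FOLLOW_THRESHOLD
    return keep_page, follow_links
```

Using two thresholds means a marginal page can stay in the index without its links being followed, which is one way to stop the crawl from drifting off topic.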
The idea is that we want to confine our index to the smallest number of pages necessary to provide the most relevant results for a given search in the area that the search engine specializes in. This article is just some initial thoughts on how to make sure that the capabilities we need are included in our Nutch-based search engine, even if it means having to code them ourselves.