Fri 3 Apr 2009
Intelligent Link Generation - Initial Thoughts
Posted by admin under Search
After a fetch or an index (I’m not sure exactly where or when as I write this, but at some point or points) we’ll want to rank the links to follow, keep the ones that we deem acceptable, and either discard the rest or store them somewhere with a “do not index or follow” instruction.
Then after we have Nutch crawl those selected links we examine each retrieved page to see what is on it and decide whether to follow any links on that page and whether to even keep it.
The idea is that we want to confine our index to the smallest number of pages necessary to provide the most relevant results for a given search in the area that the search engine is specializing in. This article is just thoughts on how to make sure that the capabilities we need are included in our Nutch-based search engine, even if it means having to code them ourselves.
The first action of our vertical search engine would be to give it the url(s) you want it to get on the first crawl. There are other basic situations that could exist, but the most common would be to start with one or more authority sites that are known to link to most of the main part of the chosen specialty. The hope is that those links, when fetched, will provide a good overall view of the general field. The next fetch then would be to get the pages that those links go to. Each fetch will provide mostly links to other pages on its own site, as well as a smaller number that lead to other sites.
The general theme, then, seems to be: get a link to another site, crawl that site, and each outgoing link will represent another site to add to our index. Nutch will end up crawling each site more or less completely: since most sites have a site map, once that is fetched every other uncrawled page on the site will be crawled on the very next crawl. Unless we interfere, which is our job.
So, we should be able to somehow rank our links and then have some sort of “likely relevant” cutoff point. Perhaps the links could be sorted depending on whether the page that they are gleaned from has the right content, i.e. contains certain specified words or phrases.
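As a rough sketch of that ranking idea (not how Nutch does it, just plain Python to pin the thought down), each link could inherit a score from the page it was found on and be kept or held back against a cutoff. The names `page_score` and `rank_links`, the shape of the `pages` input and the 0.5 cutoff are all assumptions made for illustration.

```python
import re


def page_score(page_text, terms):
    """Fraction of the specified words or phrases found in the page text."""
    if not terms:
        return 0.0
    text = page_text.lower()
    hits = sum(
        1 for term in terms
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    )
    return hits / len(terms)


def rank_links(pages, terms, cutoff=0.5):
    """Rank links by the score of their source page.

    pages is a list of (page_text, [links]) tuples (an assumed shape).
    Returns (keep, hold): links at or above the cutoff, best first,
    and links below it, which could be stored as "do not follow".
    """
    scored = []
    for page_text, links in pages:
        score = page_score(page_text, terms)
        scored.extend((score, link) for link in links)
    # sorted() is stable, so links with equal scores keep their original order
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    keep = [link for score, link in scored if score >= cutoff]
    hold = [link for score, link in scored if score < cutoff]
    return keep, hold
```

The held links are returned rather than thrown away, so the choice between discarding them and storing them with a “do not index or follow” instruction stays open.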
Each link might also be flagged or tagged in a number of ways. For example, the page it came from might have the word “Viagra” on it and we might want to tag the link as “suspicious”. In the opposite manner, a link may fail the overall litmus test yet have enough proper words present that we might want to give it a “promising” tag. Perhaps both.
We might at some point want to manually review pages that were tagged as “promising”, “suspicious” or both, so during the indexing of the links we might want to flag the pages that those links lead to as “review”. Occasional reviews might weed out a lot of the spam and otherwise non-relevant pages before they get a chance to spread. Of course the search engine administrator will be able to review and penalize during a search, but even so you can’t get everything.
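The tagging could be sketched along these lines. Everything here is assumed for illustration: the function names, the lowercase word sets, and the two thresholds (`cutoff` for passing the litmus test outright, `promising_min` for “enough proper words”).

```python
import re


def tag_link(page_text, topic_words, spam_words, cutoff=3, promising_min=1):
    """Return the set of tags for a link, based on its source page.

    topic_words and spam_words are assumed to be lowercase single words.
    A link can end up "suspicious", "promising", both, or neither.
    """
    words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    tags = set()
    if words & set(spam_words):
        tags.add("suspicious")
    hits = len(words & set(topic_words))
    # Fails the litmus test (fewer than cutoff hits) but still has some signal
    if promising_min <= hits < cutoff:
        tags.add("promising")
    return tags


def needs_review(tags):
    """Flag the target page for manual review if it carries either tag."""
    return bool(tags & {"suspicious", "promising"})
```

A page that mentions both “Nanaimo” and “Viagra” would come back tagged both ways under these settings, which is exactly the sort of page worth a human glance.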
At this point I’m introducing another potential use of these vertical search engines, and that is as a tool for a personal robot, one that goes out and familiarizes itself with your interests as they occur. This would include temporary or fleeting interests, so that these personal robots could have meaningful communications concerning any mutual interests that might exist. Simple examples are looking for a restaurant, looking for a dining partner, or an interest in buying something.
Your personal robot would know what you want and how important it is so as you travel it would be “conversing” with other people’s robots, or a business’ robot. Part of the information that it shares might be a list of links or of pages but more importantly it might get a “sense” of some detail by examining its own collection of data - its search engine.
This would seem to dictate a merging of indexes to create a search engine rather than the expansion of an existing index. This is really how Nutch works anyway. What is really called for is a naming convention to give some clue as to how and why the particular index came to be created. At the moment we’ve only been taking care of the when, i.e. we name our indexes after their timestamp, which at first thought seems redundant unless we would otherwise overwrite them and needed the old index.
Regardless, for our purposes we’ll start naming them according to what they are, what they represent. What are some examples? Perhaps “First_Crawl”, “Second_Crawl”, “Second_Crawl_Filtered”, “Specialty_Generated_First_Crawl”, “Specialty_Generated_Second_Crawl”, “Specialty2_Generated_First_Crawl”, etc. If more come to mind I’ll add them. Now that we know that we will abandon the standard Nutch examples of timestamp-based names for the crawls and/or indexes, we can address when and where we would use the naming convention.
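Since the idea is to automate as much as possible, the names could be built by a small helper instead of typed by hand. This is only a sketch of the convention listed above; the helper `crawl_name`, its parameters and the numeric fallback past the fifth crawl are my own assumptions.

```python
ORDINALS = ["First", "Second", "Third", "Fourth", "Fifth"]


def crawl_name(number, specialty=None, filtered=False):
    """Build a descriptive crawl/index name.

    number is 1-based. Past the fifth crawl the name falls back to
    "Crawl_<n>" (an assumed convention, not one from the list above).
    """
    if number < 1:
        raise ValueError("crawl number must be 1 or greater")
    if number <= len(ORDINALS):
        name = ORDINALS[number - 1] + "_Crawl"
    else:
        name = "Crawl_" + str(number)
    if specialty:
        name = specialty + "_Generated_" + name
    if filtered:
        name += "_Filtered"
    return name
```

For instance, `crawl_name(2, filtered=True)` gives “Second_Crawl_Filtered” and `crawl_name(1, specialty="Specialty2")` gives “Specialty2_Generated_First_Crawl”.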
A lot of the time we will be getting a good base in the first four or five crawls, and other than recently published materials our index will remain stable. A vertical search engine centered on Nanaimo isn’t likely to change its base all that much over time. The first four or five crawls will pick up most of the historical stuff that won’t be added to anymore; in general, there is unlikely to be a different history of Nanaimo.
This means that most of the time we will be following leads to incrementally improve our search engine index. The base will almost always be covered via the first four or five crawls. After that there may start to be other threads that are followed rather than wholesale crawling of sites, which is where the link generating process needs to have some intelligence. Remember, we want to keep our index as “tight” as possible; any “loosening” or “relaxing” of focus would come about by adding other indexes that are themselves “tight”. That is more of an expansion of interest than a “relaxing” of standards, though referring to the process as “relaxing” might be appropriate on occasion.
The point is that the first few crawls could be named just that - “First_Crawl”, “Second_Crawl”, etc - it wouldn’t be until the base had been established that the nuances of naming would require any worrying about. We have to remember though that the idea is to automate the process as much as possible - only interfering where we must. Ideally these interferences can take place on occasion rather than on a schedule.
Given that we’ve created the base, the next thing is to get some new pages to fetch, but we want to start to intelligently predict which pages are most worth fetching. We have to do this by examining, sorting and filtering the links that we already have. How can we decide which links are best for the advancement of our particular interest? We don’t want to continually expand our index until we compete with Google. We want to use the advantages of small data pools to get better results, to achieve a tighter focus than Google can possibly match - because they can’t have a knowledgeable “interferer” for every subject, and even if they could, there is just too much garbage when you index 10 billion pages; you can’t escape it.
What criteria should we use? Words. If we can define our specialty in one word then the obvious thing is to look and see how the link relates to our word. The next step is synonyms where applicable. Following that would be related words. The only resources available are the page the link was extracted from and the url of the link itself. It is probably reasonable to infer that the presence of the word “nanaimo” in the url “http://somesite.com/xxxxxnanaimoxxxx.html” would indicate some degree of relatedness to the topic of “Nanaimo”.
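A sketch of that url check, in the order just described: the specialty word first, then synonyms, then related words. The function name, the 3/2/1/0 scale and the example synonym and related words in the usage note are assumptions; plain substring matching is used because a url like the one above has no word boundaries to rely on.

```python
from urllib.parse import unquote, urlparse


def url_relatedness(url, primary, synonyms=(), related=()):
    """Score a url by the strongest kind of term found anywhere in it.

    Returns 3 for the primary word, 2 for a synonym, 1 for a related
    word and 0 for no match (an assumed scale). Matching is by
    case-insensitive substring, so very short terms will over-match.
    """
    parts = urlparse(url)
    haystack = unquote(parts.netloc + parts.path + "?" + parts.query).lower()
    if primary.lower() in haystack:
        return 3
    if any(term.lower() in haystack for term in synonyms):
        return 2
    if any(term.lower() in haystack for term in related):
        return 1
    return 0
```

With “nanaimo” as the primary word, “http://somesite.com/xxxxxnanaimoxxxx.html” scores 3, while a url containing only a hypothetical related word such as “harbour” would score 1.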
Looking at the page that the link was extracted from can yield some clues as well. Every link (anchor tag) has the potential to have a “title” attribute. Often it is absent, but often enough it is present. Examination of these title attributes can provide some indication of the relevance of a particular link. At times it may be a positive thing, other times it could be a negative indicator - the lure of a spammer intent on getting better results in Google by “spamming” a given term, including using the term within the title attribute of the anchor tag (the link).
Some positive feelings toward an unknown link may come from the actual page content of the page that it was extracted from, apart from the aforementioned “title” attribute. The page may mention words relating to the subject. Especially useful might be words that have been judged to be in the same family of words that are commonly used when discussing the vertical that the search engine is indexing, the specialty, the subject.
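To get a feel for what is available, here is a small sketch that pulls each link and its title attribute out of a page using Python’s standard `html.parser` module. It is not how Nutch parses pages; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorTitleParser(HTMLParser):
    """Collect (href, title) pairs for every anchor tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            # title is None when the attribute is missing
            self.links.append((href, attributes.get("title")))


def anchor_titles(html):
    """Return a list of (href, title_or_None) tuples found in the html."""
    parser = AnchorTitleParser()
    parser.feed(html)
    return parser.links
```

The titles that come back could then be run through the same word checks as the page text, keeping in mind that a title stuffed with the specialty word may count against the link rather than for it.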
These “words” could be names, i.e. there might be a few famous people from Nanaimo, and so you might want to give some credit to a page or site that is at least tangentially interesting if not in fact relevant. There might be other words - beach, pulp mill, harbour, coast, bc, canada - and the presence of one or more of these words might make for a link worth trusting, a page worth fetching? Maybe if a page has six or more out of ten “subsidiary” words as well as at least one “secondary” word. If it has a “primary” word then get it? Give the words a point score?
At any rate, it would seem as if the content of the page must be examined at the same time as the generation of the links, and that there must be a way to score or otherwise determine an interesting link without having to get every page that is possible to get. This will become even more important when the processes and techniques of adding daily news results to the indexes are addressed.
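Both of those ideas, the tier rule and the point score, are easy to sketch side by side. The tier assignments used in the usage note, the weights of 10, 3 and 1, and the function names are all assumptions for illustration; nothing has been tuned.

```python
import re

# Assumed weights per tier; only the relative order matters here.
WEIGHTS = {"primary": 10, "secondary": 3, "subsidiary": 1}


def count_present(page_text, terms):
    """Count how many of the words or phrases occur in the page text."""
    text = page_text.lower()
    return sum(
        1 for term in terms
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    )


def should_fetch(page_text, tiers, min_subsidiary=6):
    """Tier rule: any primary word, or one secondary plus enough subsidiary."""
    if count_present(page_text, tiers["primary"]) >= 1:
        return True
    return (
        count_present(page_text, tiers["secondary"]) >= 1
        and count_present(page_text, tiers["subsidiary"]) >= min_subsidiary
    )


def point_score(page_text, tiers):
    """Point score: each distinct term found adds the weight of its tier."""
    return sum(
        WEIGHTS[tier] * count_present(page_text, terms)
        for tier, terms in tiers.items()
    )
```

As a usage example, with “nanaimo” as the primary word, a hypothetical secondary phrase such as “vancouver island”, and beach, pulp mill, harbour, coast, bc and canada as subsidiary words, a page mentioning all six subsidiary words plus the secondary phrase would pass the tier rule without ever naming Nanaimo. The point score is the more flexible of the two, since a single cutoff on it can be moved up or down as the index tightens or relaxes.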
The rules for ranking links are not necessarily clearly defined then. There may be positive and negative indicators, phrases may become significant.
The next step is to examine how Nutch takes care of link generation and where the plugin point(s) are for the processes that we need. The rules that we use to “filter” the links will likely dictate the naming of the fetches/indexes/etc.