Search » Associated Geogenerics Dot Com

Search


So far we have a working index built from a database of three pages. We now need to extract the links from those pages and then fetch the pages they point to. We do that with the “generate” command that we used before. This time, though, we will add a “limiter” to cap the number of links that we get. Of course, with only 3 pages so far we won’t get that many links, but the limiter is in our example for illustrative purposes.

So far we have run the following commands -
(more…)
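As a rough sketch of the “limiter” idea: the Nutch generate command accepts a -topN option that caps how many urls go into the fetchlist. The paths below are the ones used in this series; the value 1000, the DRY_RUN switch and the helper names run and generate_limited are assumptions made for illustration only.

```shell
#!/bin/sh
# Paths taken from the commands used in this series.
NUTCH=/home/greg/nutch/bin/nutch
CRAWL=/home/greg/nutchcrawls/nanaimo/crawl

# Assumed, illustrative limit; not a value from the original post.
TOPN=1000

# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Generate a fetchlist limited to at most TOPN urls.
generate_limited() {
    run "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN "$TOPN"
}

generate_limited
```

Running the script as it stands only prints the command, so it can be checked before anything touches the crawldb.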

We have used these commands to “inject” and “generate” -

/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt

/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments

Next we will do the fetch, i.e. run a crawl. This is the command that we will use -

/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601

(more…)
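The segment name in the fetch command above is a timestamp created by the generate step, so it changes on every run. A minimal sketch of picking up the newest segment instead of typing the name by hand follows. The function name latest_segment is an assumption, and the script only prints the fetch command rather than running it.

```shell
#!/bin/sh
NUTCH=/home/greg/nutch/bin/nutch
SEGMENTS=${SEGMENTS:-/home/greg/nutchcrawls/nanaimo/crawl/segments}

# Hypothetical helper: print the newest segment directory under $1.
# Segment names are timestamps, so the last name in sorted order is the newest.
latest_segment() {
    ls -d "$1"/* 2>/dev/null | sort | tail -n 1
}

SEGMENT=$(latest_segment "$SEGMENTS")

if [ -n "$SEGMENT" ]; then
    # Print the fetch command for review instead of executing it.
    echo "$NUTCH fetch $SEGMENT"
else
    echo "no segments found in $SEGMENTS" >&2
fi
```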

In the last post we created a file with some urls in it and injected them into a “crawldb” directory. That directory sits inside “crawl”, which is inside “nanaimo”, which is inside the “nutchcrawls” directory in our home directory. The “crawldb” directory was created by the inject command that we used.

/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt

The next task is to generate the other resources that Nutch will need. This is accomplished with the “generate” command. Here is the one we’ll be using -
(more…)

At this point we have injected the initial urls to crawl, generated the fetchlist, run the crawl and updated the database. The four commands that were used are -
(more…)

At some point (or points) after a fetch or an index, and as I write this I’m not sure exactly where or when, we’ll want to rank the links to follow, keep the ones that we deem acceptable and either discard the rest or store them somewhere with a “do not index or follow” instruction.

Then, after Nutch has crawled those selected links, we examine each retrieved page to see what is on it and decide whether to follow any links on that page and whether to even keep the page.

The idea is that we want to confine our index to the smallest number of pages necessary to provide the most relevant results for a given search in the area that the search engine specializes in. This article is just some thoughts on how to make sure that the capabilities we need are included in our Nutch-based search engine, even if it means having to code them ourselves.
(more…)
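To make the “keep the acceptable links, set the rest aside” idea a little more concrete, here is a minimal sketch that is not tied to any Nutch feature. It assumes a hypothetical text file with one “score url” pair per line, produced by whatever ranking step we end up writing. The function name split_links, the file layout and the threshold are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: split_links THRESHOLD INPUT KEEP_FILE REJECT_FILE
# INPUT holds lines of the form "<score> <url>" (an assumed format).
# Urls scoring at or above THRESHOLD go to KEEP_FILE, the rest to REJECT_FILE.
split_links() {
    awk -v min="$1" -v keep="$3" -v reject="$4" '
        NF >= 2 {
            if ($1 + 0 >= min + 0) {
                print $2 > keep      # acceptable: candidate for the next crawl
            } else {
                print $2 > reject    # set aside: do not index or follow
            }
        }
    ' "$2"
}
```

A call such as split_links 0.5 scored.txt keep.txt reject.txt would leave the urls worth following in keep.txt, which could then be injected like the initial url file. The rejected urls are stored rather than thrown away, so the decision can be revisited later.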

We are using Linux, specifically an Ubuntu distribution. We intend to use a program called “cron” to execute commands that are in a script file. A file known as a “crontab” holds commands to execute at certain times. Every minute the “crontab” is checked to see if anything is supposed to happen.
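For illustration, a crontab entry is a single line with five time fields followed by the command to run. The sketch below builds and prints one such line. The script path, log path and the 02:30 schedule are assumptions, not values from this series.

```shell
#!/bin/sh
# Hypothetical script and log locations.
JOB_SCRIPT=/home/greg/bin/nanaimo_crawl.sh
JOB_LOG=/home/greg/nanaimo_crawl.log

# Crontab fields: minute hour day-of-month month day-of-week command
# This entry would run the script at 02:30 every day and append its output to the log.
CRON_LINE="30 2 * * * $JOB_SCRIPT >> $JOB_LOG 2>&1"

# Print the entry; to install it, paste the line into the editor opened by "crontab -e".
echo "$CRON_LINE"
```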

Since we started working with Nutch we have had to execute about 7 commands to get a “3 degree” search database created and indexed. Each of these commands could be considered a cron “job”. To better suit later purposes we’re going to adopt a sort of object oriented approach to things in general and these cron jobs in particular. In this case it means encapsulating our commands into blocks of jobs, so that we can refer to sets of commands rather than list the commands every time.
(more…)
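One way to picture “blocks of jobs” is a shell function that wraps one round of generate, fetch and updatedb, so a script can ask for a number of rounds instead of listing every command. This is only a sketch. The function names, the DRY_RUN switch and the placeholder segment name are assumptions; the paths are the ones used earlier in this series.

```shell
#!/bin/sh
NUTCH=/home/greg/nutch/bin/nutch
CRAWL=/home/greg/nutchcrawls/nanaimo/crawl

# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# One block of jobs: a single round of generate, fetch and updatedb.
crawl_round() {
    run "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" || return 1

    # The newest segment is the one that generate has just created.
    segment=$(ls -d "$CRAWL"/segments/* 2>/dev/null | sort | tail -n 1)
    if [ -z "$segment" ]; then
        if [ "$DRY_RUN" != "1" ]; then
            echo "no segment found under $CRAWL/segments" >&2
            return 1
        fi
        segment="$CRAWL/segments/SEGMENT"   # placeholder name, dry run only
    fi

    run "$NUTCH" fetch "$segment" || return 1
    run "$NUTCH" updatedb "$CRAWL/crawldb" "$segment"
}

# Repeat the block $1 times, stopping at the first failure.
crawl_rounds() {
    i=0
    while [ "$i" -lt "$1" ]; do
        crawl_round || return 1
        i=$((i + 1))
    done
}

crawl_rounds 3
```

A cron job would then only need to call this one script, and changing the depth of the crawl means changing a single number.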

To sum up the commands that we’ve used to go three iterations of crawling -
(more…)

The second crawl has been done and indexed. For this example we are going to do a final crawl. There are no real differences that weren’t noted previously, so we are just going to do this third crawl and index all in this post. We will now run the following commands -
(more…)

Having completed the second crawl, the task now is to updatedb, invertlinks and index the new pages, which are stored in the new segment “20090317171019”. We will run these commands, one after the other -
(more…)
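The exact commands are cut off in this excerpt, so what follows is only a sketch of how the three steps might be chained for that segment. The crawldb and segments paths come from earlier in this series. The “linkdb” and “indexes” directory names, the DRY_RUN switch and the helper names are assumptions, and the argument order should be checked against the usage text that bin/nutch prints for each command in your version.

```shell
#!/bin/sh
NUTCH=/home/greg/nutch/bin/nutch
CRAWL=/home/greg/nutchcrawls/nanaimo/crawl
SEGMENT=$CRAWL/segments/20090317171019

# Assumed locations; these names do not appear in the excerpts.
LINKDB=$CRAWL/linkdb
INDEXES=$CRAWL/indexes

# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Run the three post-fetch steps in order, stopping at the first failure.
post_fetch() {
    run "$NUTCH" updatedb "$CRAWL/crawldb" "$SEGMENT" &&
    run "$NUTCH" invertlinks "$LINKDB" "$SEGMENT" &&
    run "$NUTCH" index "$INDEXES" "$CRAWL/crawldb" "$LINKDB" "$SEGMENT"
}

post_fetch
```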

We have extracted the links from the pages fetched in the first crawl and are now ready to do the second crawl. So far we have completed these commands -
(more…)

The next step is to make an index. We have already done 5 steps -
(more…)

We have completed the first crawl by using the following three commands -
(more…)
