We are using Linux, specifically an Ubuntu distribution. We intend to use a program called “cron” to execute commands that are in a script file. Cron reads a file known as a “crontab”, which lists commands to execute at certain times. Every minute cron checks the crontab to see if anything is supposed to happen.

Since we have been working with Nutch we have to execute about 7 commands to get a “3 degree” search database created and indexed. Each of these commands could be considered a cron “job”. To better suit later purposes we’re going to adopt a sort of object-oriented approach to things in general and these cron jobs in particular. In this case that means encapsulating our commands into blocks of jobs, so that we can refer to sets of commands rather than having to list the commands every time.

The script files that cron runs are simply files that contain commands to be executed. We can put our 7 commands into a file and then just tell cron to run the commands in that file at a specified time. What we’ll be doing is grouping the commands that accomplish a given task into a file of their own. The benefit will be more obvious when we get into the processes that maintain the search engine databases and indexes, especially for things that are frequently updated and changing. News, for instance, may need 100 rounds of fetching new pages and updating the index every day.

We will discuss how to set the crontab entries at some future time, but initially we’re going to concentrate on defining the overall tasks (script files) that we will want to have as groups of commands. An example would be going out and looking for recent news every 15 minutes. We might call that file “update-news.txt”, and the file would contain the commands to generate a fetch list, crawl the pages in the fetch list, generate a fetch list based on links from the new pages, get those pages, and then add all the newly fetched pages to the overall database and do an indexing.
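To make the idea of a grouped job concrete, here is a minimal sketch of what a file like “update-news.txt” might look like as a shell script. The function names are hypothetical and each step only prints what it would do; the real Nutch commands are not shown here and would take the place of the echo lines.

```shell
#!/bin/sh
# Sketch of a grouped job such as "update-news.txt".
# ASSUMPTION: every function below is a hypothetical placeholder that only
# echoes its step; the actual Nutch commands would replace the echo lines.
set -e  # stop at the first step that fails

generate_fetch_list() { echo "generate fetch list"; }
fetch_pages()         { echo "fetch pages in the fetch list"; }
update_database()     { echo "add newly fetched pages to the database"; }
build_index()         { echo "index the database"; }

# The whole task, in the order described above.
update_news() {
    generate_fetch_list   # round 1: pages we already know about
    fetch_pages
    generate_fetch_list   # round 2: links found on the new pages
    fetch_pages
    update_database
    build_index
}

update_news
```

Because the steps live in one file, the crontab only ever needs to name that file, and changing a step means editing one place.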

The crontab file will be much easier to deal with if we can just say “Do update-news.txt every 15 minutes” rather than specify the complete command set. We might want to update our blog database every hour and do a more general crawl once a day. Our crontab file would look something like -

15min update-news.txt
60min update-blogs.txt
1day update-general.txt

Script files can take arguments, so we can write script files for the general case and use the arguments to define the differences that each use requires. For example, user “dave” may have his indexes in “/home/dave/……” and user “susan” in “/home/susan/….”. A particular crawl may need to be done for the “nanaimo” search engine, so the reference would be to “/home/greg/…/nanaimo/…”, while another might require “/home/greg/…/music”.
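As a rough illustration of how a script picks up an argument, the sketch below reads the crawl directory from the first argument, which the shell exposes as $1. The function name run_update is hypothetical, and the body only reports which directory it would work on rather than running any Nutch command.

```shell
#!/bin/sh
# Sketch of a general-purpose job that takes the crawl directory as an argument.
# ASSUMPTION: run_update is a hypothetical name and does no real crawling.

run_update() {
    crawl_dir="$1"   # first argument passed to the function
    if [ -z "$crawl_dir" ]; then
        # No directory given: complain on stderr and report failure.
        echo "usage: run_update CRAWL_DIR" >&2
        return 1
    fi
    echo "updating crawl data in $crawl_dir"
}

# The same code serves different crawls; only the argument changes.
run_update /home/greg/nutchcrawls/nanaimo
run_update /home/greg/nutchcrawls/nanaimo/news
```

In a real script file the last line would more likely be run_update "$@", so that whatever follows the script name on the command line (or in the crontab) is passed straight through.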

Each chunk of information that is needed by Nutch but that differs between users or tasks is an “argument”. The crontab file might now look something like the following -

1day update-general.txt /home/greg/nutchcrawls/nanaimo
15min update-news.txt /home/greg/nutchcrawls/nanaimo/news
60min update-blogs.txt /home/greg/nutchcrawls/nanaimo/blog
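The listings so far are shorthand rather than literal crontab syntax. A real crontab line starts with five time fields (minute, hour, day of month, month, day of week) followed by the command. A sketch of how the three jobs might be written is below; the /home/greg/scripts/ location and the 3 a.m. time for the daily crawl are assumptions made up for illustration, and the scripts would need to be executable.

```crontab
# minute  hour  day-of-month  month  day-of-week  command
# ASSUMPTION: /home/greg/scripts/ is a hypothetical location for the script files.
0     3  *  *  *  /home/greg/scripts/update-general.txt /home/greg/nutchcrawls/nanaimo
*/15  *  *  *  *  /home/greg/scripts/update-news.txt /home/greg/nutchcrawls/nanaimo/news
0     *  *  *  *  /home/greg/scripts/update-blogs.txt /home/greg/nutchcrawls/nanaimo/blog
```

Here “*/15” in the minute field means every 15 minutes, “0 * * * *” means on the hour, and “0 3 * * *” means once a day at 03:00. Entries like these are normally edited with the “crontab -e” command.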

Next we’ll look at separation of tasks beginning with the creation and building of the simple example that we have so far.