We have extracted the links from the pages fetched in the first crawl and are now ready for the second crawl. So far we have run these commands:

/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt

/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments

/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601

/home/greg/nutch/bin/nutch updatedb /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601

/home/greg/nutch/bin/nutch invertlinks /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601

/home/greg/nutch/bin/nutch index /home/greg/nutchcrawls/nanaimo/crawl/indexes /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601

/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments -topN 1000

Now we will run the fetch command again. The newly created segment directory is “20090317171019”.

/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019
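Copying the timestamp by hand gets tedious after a few rounds. Since segment directories are named by timestamp, the newest one sorts last, so it can be picked up automatically. The following is only a rough sketch: the `newest_segment` helper and the `SEGMENTS` and `SEGMENT` variables are names made up for this example, and the fetch command is printed rather than run so you can check it first.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently created segment directory.
# Segment names are timestamps (e.g. 20090317171019), so a sorted listing
# puts the newest one last.
newest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Path taken from the commands above; adjust it for your own crawl.
SEGMENTS=/home/greg/nutchcrawls/nanaimo/crawl/segments
SEGMENT=$(newest_segment "$SEGMENTS")

if [ -n "$SEGMENT" ]; then
    # Dry run: show the fetch command instead of executing it.
    echo "/home/greg/nutch/bin/nutch fetch $SEGMENT"
else
    echo "no segment found under $SEGMENTS"
fi
```

Once the printed command looks right, it can be pasted in, or the `echo` can be removed so the script runs the fetch directly.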

Running that command gives the lengthy output reproduced below. Next we’ll run updatedb, invertlinks and index. The output is worth examining: notice the “Error parsing” entries, each reporting “parser not found” for a content type such as application/pdf, application/rss+xml, text/javascript or application/x-shockwave-flash. They won’t hurt us for now and we’ll discuss them later.

Fetcher: starting
Fetcher: segment: /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019
Fetcher: threads: 10
fetching http://www.rdn.bc.ca/js/init_handler.js
fetching http://www.nanaimo.ca/EN/topnav/news/rss/whats-new.rss
fetching http://www.rdn.bc.ca/highlights/include_notices_index.asp
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-residents.jpg)
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-visitors.jpg)
fetching http://www.nanaimo.ca/photo_upload/gallery_viewer.aspx
fetching http://www.nanaimo.ca/EN/main/want/view/privacy-policy.html
fetching http://www.nanaimo.ca/EN/main/municipal/notices/rss/public-notices.rss
fetching http://www.nanaimo.ca/js/randomPhoto.js
fetching http://www.nanaimo.ca/EN/main/municipal/public-notices.html
Error parsing: http://www.nanaimo.ca/EN/main/municipal/notices/rss/public-notices.rss: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rss+xml url=http://www.nanaimo.ca/EN/main/municipal/notices/rss/public-notices.rss
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

fetching http://www.rdn.bc.ca/swf/features.swf
fetching http://www.nanaimo.ca/js/highlight.js
Error parsing: http://www.rdn.bc.ca/swf/features.swf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-shockwave-flash url=http://www.rdn.bc.ca/swf/features.swf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

fetching http://www.nanaimo.ca/EN/main/departments/parks-recreation-culture/leisure-guide.html
fetching http://www.nanaimo.ca/EN/main/municipal/online-services-payments.html
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-municipal-hall.jpg)
fetching http://www.rdn.bc.ca/js/ajax_controller.js
fetching http://www.rdn.bc.ca/members/memberlogin.asp
fetching http://www.nanaimo.ca/EN/main/residents.html
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-business.jpg)
fetching http://www.nanaimo.ca/CouncilMeetings/CouncilMeetings.aspx
fetching http://www.nanaimo.ca/assets/Departments/Parks~Rec~Culture/Publications~and~Forms/Portals.pdf
fetching http://www.google-analytics.com/urchin.js
Error parsing: http://www.google-analytics.com/urchin.js: org.apache.nutch.parse.ParseException: parser not found for contentType=text/javascript url=http://www.google-analytics.com/urchin.js
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

fetching http://www.nanaimo.ca/assets/Departments/Community~Planning/Environmental~Planning/Green~Nanaimo~Awards/GreenNanaimoAwardLaunch.pdf
fetching http://www.nanaimo.ca/js/external.js
fetching http://www.nanaimo.ca/js/common.js
Error parsing: http://www.nanaimo.ca/EN/topnav/news/rss/whats-new.rss: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rss+xml url=http://www.nanaimo.ca/EN/topnav/news/rss/whats-new.rss
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

fetching http://www.nanaimo.ca/EN/main/want/view/legal-disclaimer.html
fetching http://www.nanaimo.ca/EN/main/departments/FireRescue.html
fetching http://www.rdn.bc.ca/js/cms_index.js
fetching http://www.nanaimo.ca/js/jquery.cookie.js
fetching http://www.nanaimo.ca/EN/main/visitors.html
Error parsing: http://www.nanaimo.ca/assets/Departments/Parks~Rec~Culture/Publications~and~Forms/Portals.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.nanaimo.ca/assets/Departments/Parks~Rec~Culture/Publications~and~Forms/Portals.pdf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

fetching http://www.nanaimo.ca/CityMap/Default.aspx
fetching http://www.nanaimo.ca/js/jquery.site-enhancement.js
fetching http://www.nanaimo.ca/js/custom.js
fetching http://www.nanaimo.ca/js/jquery.headers.js
fetching http://www2.canada.com/nanaimodailynews/index.html
fetching http://www.nanaimo.ca/EN/main/municipal.html
fetching http://www.nanaimo.ca/EN/main/visitors/NanaimoBars.html
fetching http://www.nanaimo.ca/EN/topnav/news/2009WhatsNew.html
fetching http://www.nanaimo.ca/EN/main/business.html
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-departments.jpg)
fetching http://www.nanaimo.ca/js/menu_public.js
fetching http://www.busonline.ca/regions/nan/schedules/map.cfm
fetching http://www.nanaimo.ca/assets/Departments/Fire~Rescue/Fire_Chiefs_Conference.pdf
Error parsing: http://www.nanaimo.ca/assets/Departments/Fire~Rescue/Fire_Chiefs_Conference.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.nanaimo.ca/assets/Departments/Fire~Rescue/Fire_Chiefs_Conference.pdf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

fetching http://www.nanaimo.ca/EN/main/departments.html
Error parsing: http://www.nanaimo.ca/assets/Departments/Community~Planning/Environmental~Planning/Green~Nanaimo~Awards/GreenNanaimoAwardLaunch.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.nanaimo.ca/assets/Departments/Community~Planning/Environmental~Planning/Green~Nanaimo~Awards/GreenNanaimoAwardLaunch.pdf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

fetching http://www.rdn.bc.ca/highlights/include_highlights_index.asp
fetching http://www.nanaimo.ca/EN/index.html
fetching http://www.rdn.bc.ca/cms/js/menu.js
fetching http://www.nanaimo.ca/job_postings/job_postings.aspx
fetching http://www.shapingourfuture.ca/
fetching http://www.nanaimo.ca/js/quickmenus.js
fetching http://www.nanaimo.ca/EN/main/want.html
fetching http://www.nanaimo.ca/EN/topnav/sitemap.html
fetching http://www.rdn.bc.ca/stats/include_stats_index.asp
fetching http://www.nanaimo.ca/CityServices/CityServices.aspx
fetching http://www.nanaimo.ca/js/jquery-1.2.6.pack.js
fetching http://www.nanaimo.ca/photo_upload/Default.aspx
fetching http://www.nanaimo.ca/weather/weather.aspx
Fetcher: done
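
To get an overview of the parse errors in output like the above, they can be tallied by content type. This is a minimal sketch that assumes the fetcher output was saved to a file; the file name `fetch.log` and the `count_parse_errors` function are made up for this example, and how you capture the output (for instance by redirecting the fetch command's output to a file) may vary with your setup.

```shell
#!/bin/sh
# Count "parser not found" errors per content type in saved fetcher output.
# Reads the log from the file(s) given as arguments, or from standard input.
count_parse_errors() {
    grep 'parser not found for contentType=' "$@" |
        sed 's/.*parser not found for contentType=\([^ ]*\).*/\1/' |
        sort | uniq -c | sort -rn
}

# If a log file name is passed to the script, summarize it.
if [ -n "${1:-}" ]; then
    count_parse_errors "$1"
fi
```

Saved as a script, it would be run with the log file as its argument, for example `sh count_errors.sh fetch.log` (both names hypothetical). For the output above the tally would come to three application/pdf errors, two application/rss+xml, one text/javascript and one application/x-shockwave-flash.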