Sun 15 Mar 2009
Vertical Search Engine Template - Second Fetch
Posted by admin under Search
We have extracted the links from the pages fetched in the first crawl and are now ready to run the second crawl. So far we have run these commands:
/home/greg/nutch/bin/nutch inject /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/search_urls/nanaimo/initial.txt
/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments
/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch updatedb /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch invertlinks /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch index /home/greg/nutchcrawls/nanaimo/crawl/indexes /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/linkdb /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317080601
/home/greg/nutch/bin/nutch generate /home/greg/nutchcrawls/nanaimo/crawl/crawldb /home/greg/nutchcrawls/nanaimo/crawl/segments -topN 1000
Now we will run the fetch command again. The newly created segment directory is “20090317171019”.
/home/greg/nutch/bin/nutch fetch /home/greg/nutchcrawls/nanaimo/crawl/segments/20090317171019
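Since each generate run creates a new timestamped segment directory, the name can be looked up instead of typed by hand. The following is only a sketch under that assumption: the `latest_segment` helper is hypothetical (it is not part of Nutch), and it relies on segment names sorting chronologically because they are timestamps. The fetch command is echoed rather than executed so the sketch is safe to try.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the newest segment
# directory under the given segments directory. Segment names are
# timestamps, so a plain lexical sort puts the newest one last.
latest_segment() {
    ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}

# Path taken from the commands in this post.
SEGMENTS=/home/greg/nutchcrawls/nanaimo/crawl/segments
SEGMENT=$(latest_segment "$SEGMENTS")

# Only print the fetch command; remove the echo to actually run it.
echo /home/greg/nutch/bin/nutch fetch "$SEGMENT"
```

If the segments directory is empty or missing, `SEGMENT` is empty, so it is worth checking the printed command before running it for real.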
Running that command gives the lengthy output below. Next we’ll run updatedb, invertlinks and index. The output is worth examining: notice the “Error parsing” messages, where Nutch found no parser for a content type (RSS, Flash, JavaScript and PDF). They won’t hurt us for now and we’ll discuss them later. Except for the output, this is the end of this post.
Fetcher: starting
Fetcher: segment: /home/ronpaul/nutchcrawls/nanaimo/crawl/segments/20090317171019
Fetcher: threads: 10
fetching http://www.rdn.bc.ca/js/init_handler.js
fetching http://www.nanaimo.ca/EN/topnav/news/rss/whats-new.rss
fetching http://www.rdn.bc.ca/highlights/include_notices_index.asp
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-residents.jpg)
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-visitors.jpg)
fetching http://www.nanaimo.ca/photo_upload/gallery_viewer.aspx
fetching http://www.nanaimo.ca/EN/main/want/view/privacy-policy.html
fetching http://www.nanaimo.ca/EN/main/municipal/notices/rss/public-notices.rss
fetching http://www.nanaimo.ca/js/randomPhoto.js
fetching http://www.nanaimo.ca/EN/main/municipal/public-notices.html
Error parsing: http://www.nanaimo.ca/EN/main/municipal/notices/rss/public-notices.rss: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rss+xml url=http://www.nanaimo.ca/EN/main/municipal/notices/rss/public-notices.rss
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
fetching http://www.rdn.bc.ca/swf/features.swf
fetching http://www.nanaimo.ca/js/highlight.js
Error parsing: http://www.rdn.bc.ca/swf/features.swf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-shockwave-flash url=http://www.rdn.bc.ca/swf/features.swf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
fetching http://www.nanaimo.ca/EN/main/departments/parks-recreation-culture/leisure-guide.html
fetching http://www.nanaimo.ca/EN/main/municipal/online-services-payments.html
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-municipal-hall.jpg)
fetching http://www.rdn.bc.ca/js/ajax_controller.js
fetching http://www.rdn.bc.ca/members/memberlogin.asp
fetching http://www.nanaimo.ca/EN/main/residents.html
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-business.jpg)
fetching http://www.nanaimo.ca/CouncilMeetings/CouncilMeetings.aspx
fetching http://www.nanaimo.ca/assets/Departments/Parks~Rec~Culture/Publications~and~Forms/Portals.pdf
fetching http://www.google-analytics.com/urchin.js
Error parsing: http://www.google-analytics.com/urchin.js: org.apache.nutch.parse.ParseException: parser not found for contentType=text/javascript url=http://www.google-analytics.com/urchin.js
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
fetching http://www.nanaimo.ca/assets/Departments/Community~Planning/Environmental~Planning/Green~Nanaimo~Awards/GreenNanaimoAwardLaunch.pdf
fetching http://www.nanaimo.ca/js/external.js
fetching http://www.nanaimo.ca/js/common.js
Error parsing: http://www.nanaimo.ca/EN/topnav/news/rss/whats-new.rss: org.apache.nutch.parse.ParseException: parser not found for contentType=application/rss+xml url=http://www.nanaimo.ca/EN/topnav/news/rss/whats-new.rss
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
fetching http://www.nanaimo.ca/EN/main/want/view/legal-disclaimer.html
fetching http://www.nanaimo.ca/EN/main/departments/FireRescue.html
fetching http://www.rdn.bc.ca/js/cms_index.js
fetching http://www.nanaimo.ca/js/jquery.cookie.js
fetching http://www.nanaimo.ca/EN/main/visitors.html
Error parsing: http://www.nanaimo.ca/assets/Departments/Parks~Rec~Culture/Publications~and~Forms/Portals.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.nanaimo.ca/assets/Departments/Parks~Rec~Culture/Publications~and~Forms/Portals.pdf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
fetching http://www.nanaimo.ca/CityMap/Default.aspx
fetching http://www.nanaimo.ca/js/jquery.site-enhancement.js
fetching http://www.nanaimo.ca/js/custom.js
fetching http://www.nanaimo.ca/js/jquery.headers.js
fetching http://www2.canada.com/nanaimodailynews/index.html
fetching http://www.nanaimo.ca/EN/main/municipal.html
fetching http://www.nanaimo.ca/EN/main/visitors/NanaimoBars.html
fetching http://www.nanaimo.ca/EN/topnav/news/2009WhatsNew.html
fetching http://www.nanaimo.ca/EN/main/business.html
fetching http://www.nanaimo.ca/url(http:/www.nanaimo.ca/images/header-photo-home-departments.jpg)
fetching http://www.nanaimo.ca/js/menu_public.js
fetching http://www.busonline.ca/regions/nan/schedules/map.cfm
fetching http://www.nanaimo.ca/assets/Departments/Fire~Rescue/Fire_Chiefs_Conference.pdf
Error parsing: http://www.nanaimo.ca/assets/Departments/Fire~Rescue/Fire_Chiefs_Conference.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.nanaimo.ca/assets/Departments/Fire~Rescue/Fire_Chiefs_Conference.pdf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
fetching http://www.nanaimo.ca/EN/main/departments.html
Error parsing: http://www.nanaimo.ca/assets/Departments/Community~Planning/Environmental~Planning/Green~Nanaimo~Awards/GreenNanaimoAwardLaunch.pdf: org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.nanaimo.ca/assets/Departments/Community~Planning/Environmental~Planning/Green~Nanaimo~Awards/GreenNanaimoAwardLaunch.pdf
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
fetching http://www.rdn.bc.ca/highlights/include_highlights_index.asp
fetching http://www.nanaimo.ca/EN/index.html
fetching http://www.rdn.bc.ca/cms/js/menu.js
fetching http://www.nanaimo.ca/job_postings/job_postings.aspx
fetching http://www.shapingourfuture.ca/
fetching http://www.nanaimo.ca/js/quickmenus.js
fetching http://www.nanaimo.ca/EN/main/want.html
fetching http://www.nanaimo.ca/EN/topnav/sitemap.html
fetching http://www.rdn.bc.ca/stats/include_stats_index.asp
fetching http://www.nanaimo.ca/CityServices/CityServices.aspx
fetching http://www.nanaimo.ca/js/jquery-1.2.6.pack.js
fetching http://www.nanaimo.ca/photo_upload/Default.aspx
fetching http://www.nanaimo.ca/weather/weather.aspx
Fetcher: done
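To see at a glance which content types had no parser, the “parser not found” lines can be tallied. This is a rough sketch, assuming the fetcher output was saved to a file; the file name `fetch.log` and the `count_parse_errors` helper are made up for illustration and are not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper: count "parser not found" errors per content type
# in a saved copy of the fetcher output, most frequent type first.
count_parse_errors() {
    grep 'parser not found for contentType=' "$1" \
        | sed 's/.*contentType=\([^ ]*\) url=.*/\1/' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1 " " $2}'
}

# fetch.log is an assumed file name for the saved output.
if [ -f fetch.log ]; then
    count_parse_errors fetch.log
fi
```

For the output in this post, that would report three application/pdf errors, two application/rss+xml errors, and one each for application/x-shockwave-flash and text/javascript.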