
List:       nutch-general
Subject:    Re: [Nutch-general] not crawling relative URLs
From:       Kai_testing Middleton <kai_testing () yahoo ! com>
Date:       2007-06-28 18:30:01
Message-ID: 598180.31713.qm () web59305 ! mail ! re1 ! yahoo ! com

Ok, I guess I lied.

Nutch IS capable of crawling relative URLs.  

What happened is that the page I was attempting to crawl,
http://www.sf911truth.org, had more than 100 outlinks on it, and the relative URL
for about.html that I expected to see in my crawl.log was outlink #105. Nutch
processes at most db.max.outlinks.per.page outlinks per page (100 by default), so
that link was dropped. I fixed this by setting db.max.outlinks.per.page to -1
(unlimited number of outlinks) in nutch-site.xml.
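
For anyone hitting the same thing, the override looks roughly like the fragment below. This is a minimal sketch that assumes the usual Hadoop-style property layout of nutch-site.xml; the description text is my own wording, not copied from nutch-default.xml, and any other properties already in your file stay as they are.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the value in nutch-default.xml.
       A negative value means all outlinks on a page are processed. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>Process every outlink found on a page (no per-page cap).</description>
  </property>
</configuration>
```

Pages fetched and parsed before the change keep their truncated outlink lists, so a re-crawl is probably needed before the missing links show up.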

For a detailed discussion see "Re: [Nutch-dev] NUTCH-119 :: how hard to fix":
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12592.html

Now it works.

--Kai Middleton
