List:       nutch-general
Subject:    [Nutch-general] how can I index only a portion of html content?
From:       Brent Verner <brent () rcfile ! org>
Date:       2006-07-02 21:33:58
Message-ID: 20060702213358.GD51458 () rcfile ! org

Hi,

  I'd like to use Nutch to index intranet/site content.  The content is all 
template-based, and I'd like to index only a portion of each HTML page. 
Specifically, I'd like to index only the content/words between a pair of comments
in the HTML page (though I could just as easily wrap the content in 
another document node that would be easier to match).  Is this possible 
without writing a new HTML parser plugin?  If so, how?
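
To make the goal concrete, here is a minimal sketch of the filtering being asked about, written as a standalone Python script. It is not Nutch code and says nothing about how Nutch itself would do this; the marker comments `<!-- index-start -->` and `<!-- index-end -->` and the function name `extract_indexable_text` are hypothetical names chosen for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical marker comments; real templates may use different names.
START = "index-start"
END = "index-end"


class MarkedTextExtractor(HTMLParser):
    """Collect only the text that appears between the START and END comments."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are between the two markers
        self.chunks = []     # text fragments seen while inside

    def handle_comment(self, data):
        marker = data.strip()
        if marker == START:
            self.inside = True
        elif marker == END:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def extract_indexable_text(html):
    """Return the whitespace-normalized text found between the markers."""
    parser = MarkedTextExtractor()
    parser.feed(html)
    parser.close()
    # Join the fragments as they appeared, then collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())


page = """<html><body>
<div id="nav">Home | About</div>
<!-- index-start -->
<h1>Release notes</h1>
<p>Version 2 adds <b>search</b>.</p>
<!-- index-end -->
<div id="footer">Copyright</div>
</body></html>"""

print(extract_indexable_text(page))
```

In this sketch the navigation and footer text from the template are dropped, and only the words between the two comments would be handed to an indexer. Wrapping the content in a dedicated element instead of comments would work the same way, with the flag toggled in the start-tag and end-tag handlers.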

Thanks!
  Brent

