From nutch-general Sun Jul 02 21:33:58 2006
From: Brent Verner
Date: Sun, 02 Jul 2006 21:33:58 +0000
To: nutch-general
Subject: [Nutch-general] how can I index only a portion of html content?
Message-Id: <20060702213358.GD51458 () rcfile ! org>
X-MARC-Message: https://marc.info/?l=nutch-general&m=115192201024431

Hi,

I'd like to use Nutch to index intranet/site content. The content is all template-based, and I'd like to index only a portion of each HTML page. Specifically, I'd like to index only the content/words between a pair of comments in the HTML page (though I could just as easily wrap the content in a dedicated element that would be easier to match).

Is this possible without writing a new HTML parser plugin? If so, how?

Thanks!
Brent
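
To make the request concrete, here is a minimal standalone sketch of the extraction described above: keep only the text between a pair of comment markers and discard the surrounding template. The marker names `index-begin` and `index-end`, the class name `MarkerExtractor`, and the regex-based tag stripping are all assumptions for illustration; the message does not name the actual comments, and this code does not use any Nutch API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerExtractor {
    // Hypothetical marker comments; the original message does not say what they are called.
    private static final Pattern REGION = Pattern.compile(
        "<!--\\s*index-begin\\s*-->(.*?)<!--\\s*index-end\\s*-->", Pattern.DOTALL);

    // Crude tag matcher; a real HTML parser would also handle entities, scripts, and styles.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    /** Returns the plain text found between every begin/end marker pair, space-separated. */
    public static String extract(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = REGION.matcher(html);
        while (m.find()) {
            // Replace tags with spaces, then collapse runs of whitespace.
            String text = TAGS.matcher(m.group(1)).replaceAll(" ");
            text = text.replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"nav\">Home | About</div>"
            + "<!-- index-begin --><h1>Release notes</h1><p>Fixed the parser.</p><!-- index-end -->"
            + "<div id=\"footer\">Copyright</div></body></html>";
        // Only the text between the two markers is printed; navigation and footer are dropped.
        System.out.println(extract(page));
    }
}
```

This only shows the filtering step in isolation. Where such a step could be hooked into Nutch's parsing or indexing, and whether that can be done without a new parser plugin, is exactly the open question in the message.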