List:       nutch-developers
Subject:    [Nutch-dev] nutch - functionality..
From:       "bruce" <bedouglas () earthlink ! net>
Date:       2006-06-23 19:38:00
Message-ID: 144301c696e8$ecda5af0$0301a8c0 () Mesa ! com

hi...

i might be a little out of my league.. but here goes...

i'm in need of an app to crawl through sections of sites, and to return
pieces of information. i'm not looking to do any indexing, just returning
raw html/text...

however, i need the ability to set certain criteria to help define the
actual pages that get returned...

a given crawling process would normally start at some URL and iteratively
fetch the files underneath that URL. nutch does this, and provides some
additional functionality as well.
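
the basic crawl described above (start at a URL, fetch what sits underneath
it) could be sketched roughly as follows. this is python for illustration
only, not nutch code: crawl, LinkCollector and the fetch callable are
hypothetical names, and "underneath" is assumed to mean "shares the start
URL as a prefix".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl that stays underneath start_url.

    fetch is a hypothetical callable: url -> html text, or None on failure.
    Returns a dict of url -> raw html (no indexing is done).
    """
    pages = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # resolve relative links and drop any #fragment
            link = urldefrag(urljoin(url, href)).url
            # assumption: "underneath" means the link shares the start prefix
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

passing fetch in as a parameter keeps the sketch independent of any
particular http library; a real fetcher would also need politeness delays
and robots.txt handling.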

i need more functionality....

in particular, i'd like to be able to modify the way nutch handles forms,
and links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (implement/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl
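
a rough sketch of the form handling listed above, again in python purely
for illustration. FormCollector, build_submission and track_forms are
hypothetical names, not nutch APIs, and the sketch only records simple
name/value attributes (it does not model select options or textarea bodies).

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Records each <form> on a page with its method, action and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").upper(),
                "action": attrs.get("action") or "",
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_submission(form, overrides=None, ignore=()):
    """Pick which form elements to send: drop ignored ones, override others."""
    data = {k: v for k, v in form["fields"].items() if k not in ignore}
    data.update(overrides or {})
    return form["method"], form["action"], data


def track_forms(pages):
    """Map each crawled url to the forms found on it (pages: url -> html)."""
    tracked = {}
    for url, html in pages.items():
        collector = FormCollector()
        collector.feed(html)
        tracked[url] = collector.forms
    return tracked
```

the method returned by build_submission tells the caller whether to send
the data as a querystring (GET) or a request body (POST).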

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, ie
  to include/exclude the URL+Query based on regex
  parsing or simple text comparison
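
the link rules above might look something like this as code. it is only a
sketch in python: accept_link, the rule dictionaries and the four rule kinds
are made up for illustration, and "first matching rule wins" is an assumed
policy.

```python
import re
from urllib.parse import urlsplit


def matches(rule, url):
    """Return True if a single rule applies to the url."""
    kind, value = rule["kind"], rule["value"]
    if kind == "url_regex":      # regex against the whole URL
        return re.search(value, url) is not None
    if kind == "url_list":       # exact membership in a list/set of URLs
        return url in value
    if kind == "query_regex":    # regex against the querystring only
        return re.search(value, urlsplit(url).query) is not None
    if kind == "query_text":     # simple text comparison on the querystring
        return value in urlsplit(url).query
    raise ValueError("unknown rule kind: %r" % kind)


def accept_link(url, rules, default=True):
    """Apply ordered include/exclude rules; the first rule that matches decides."""
    for rule in rules:
        if matches(rule, url):
            return rule["allow"]
    return default


# example rule set (hypothetical site)
RULES = [
    {"allow": False, "kind": "query_text", "value": "sessionid="},
    {"allow": True, "kind": "query_regex", "value": r"(^|&)page=\d+"},
    {"allow": False, "kind": "url_list", "value": {"http://example.com/logout"}},
    {"allow": True, "kind": "url_regex", "value": r"^http://example\.com/catalog/"},
]
```

keeping one rule list per page or per crawl level (for example in a dict
keyed by URL) would give the per-page control asked for above.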

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page
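
running several xpath/regex extractors over one page could be sketched as
below. this is python for illustration only: extract and the
(name, kind, expression) tuples are hypothetical, the page is assumed to be
well-formed markup (real-world html would need a tolerant parser first), and
ElementTree supports only a limited subset of xpath.

```python
import re
import xml.etree.ElementTree as ET


def extract(page, extractors):
    """Run every extractor against one page; return name -> list of strings.

    extractors is a list of (name, kind, expression) tuples, where kind is
    "xpath" (run against the parsed tree) or "regex" (run against raw text).
    """
    root = ET.fromstring(page)  # assumption: the page parses as XML
    results = {}
    for name, kind, expr in extractors:
        if kind == "xpath":
            # collect the text content of every matching element
            results[name] = ["".join(el.itertext()).strip() for el in root.findall(expr)]
        elif kind == "regex":
            # collect every full match in the raw page text
            results[name] = [m.group(0) for m in re.finditer(expr, page)]
        else:
            raise ValueError("unknown extractor kind: %r" % kind)
    return results


page = (
    "<html><body><h1>Widget</h1><ul>"
    '<li class="price">$10.00</li><li class="price">$12.50</li>'
    "</ul></body></html>"
)
found = extract(page, [
    ("title", "xpath", ".//h1"),
    ("prices", "xpath", ".//li[@class='price']"),
    ("dollars", "regex", r"\$\d+\.\d{2}"),
])
```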


this kind of functionality would let nutch be relatively selective about
which parts of a site it crawls and what information it extracts from
them....

any thoughts/comments/ideas/etc. regarding this process?

if i shouldn't use nutch, are there any suggestions as to what app i should
use instead?

thanks

-bruce



