
List:       nutch-cvs
Subject:    [Nutch-cvs] [Nutch Wiki] Update of "DissectingTheNutchCrawler" by MattKangas
From:       Apache Wiki <wikidiffs () apache ! org>
Date:       2005-04-20 5:45:28
Message-ID: 20050420054528.25631.23180 () ajax ! apache ! org

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by MattKangas:
http://wiki.apache.org/nutch/DissectingTheNutchCrawler

------------------------------------------------------------------------------
  
  The main ways to configure the Nutch crawler are as follows:
  
-  1. Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml. [[BR]][[BR]]
+  1. Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml.
   1. URLFilter interface. By default, the class {{{net.nutch.net.RegexURLFilter}}} is used, which reads regular expression patterns from regex-urlfilter.txt. So, you can:
     *  Edit that file to tune its behavior
-    *  Or, write a new class that implements {{{net.nutch.net.URLFilter}}}, and change nutch-site.xml to use it. [[BR]][[BR]]
+    *  Or, write a new class that implements {{{net.nutch.net.URLFilter}}}, and change nutch-site.xml to use it.
-  1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the appropriate plugin. [[BR]][[BR]]
+  1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the appropriate plugin.
-  1. Parser interface. As for Protocol, you should add/create a plugin for any new content-types. Otherwise, you will need to replace the appropriate plugin if you want to modify its behavior. [[BR]][[BR]]
+  1. Parser interface. As for Protocol, you should add/create a plugin for any new content-types. Otherwise, you will need to replace the appropriate plugin if you want to modify its behavior.
   1. If you need to make other changes, refer to our discussion of '''Fetcher''' and '''FetchListTool'''. Consider subclassing these classes, overriding the appropriate method, then calling your class from the "nutch" script using the full class path.
  

