List: nutch-cvs
Subject: [Nutch-cvs] [Nutch Wiki] Update of "DissectingTheNutchCrawler" by MattKangas
From: Apache Wiki <wikidiffs () apache ! org>
Date: 2005-04-20 5:45:28
Message-ID: 20050420054528.25631.23180 () ajax ! apache ! org
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MattKangas:
http://wiki.apache.org/nutch/DissectingTheNutchCrawler
------------------------------------------------------------------------------
The main ways to configure the Nutch crawler are as follows:
- 1. Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml. [[BR]][[BR]]
+ 1. Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml.
1. URLFilter interface. By default, the class {{{net.nutch.net.RegexURLFilter}}} is used, which reads regular expression patterns from regex-urlfilter.txt. So, you can:
* Edit that file to tune its behavior
- * Or, write a new class that implements {{{net.nutch.net.URLFilter}}}, and change nutch-site.xml to use it. [[BR]][[BR]]
+ * Or, write a new class that implements {{{net.nutch.net.URLFilter}}}, and change nutch-site.xml to use it.
- 1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the appropriate plugin. [[BR]][[BR]]
+ 1. Protocol interface. To add support for a new protocol, write or add a plugin to the "plugins" directory. To change protocol behavior, modify the appropriate plugin.
- 1. Parser interface. As for Protocol, you should add/create a plugin for any new content-types. Otherwise, you will need to replace the appropriate plugin if you want to modify its behavior. [[BR]][[BR]]
+ 1. Parser interface. As for Protocol, you should add/create a plugin for any new content-types. Otherwise, you will need to replace the appropriate plugin if you want to modify its behavior.
1. If you need to make other changes, refer to our discussion of '''Fetcher''' and '''FetchListTool'''. Consider subclassing these classes, overriding the appropriate method, then calling your class from the "nutch" script using the full class path.
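
As an illustration of the URLFilter option in the list above, here is a minimal sketch of a custom filter. It is not the actual Nutch code. The local URLFilter interface is a stand-in for net.nutch.net.URLFilter so that the example compiles on its own, and it assumes the real interface has a single method that returns the URL when it is accepted and null when it is rejected. The class name PrefixRuleURLFilter, the "+regex" / "-regex" rule syntax, and the first-match-wins policy are likewise assumptions chosen for illustration; check them against regex-urlfilter.txt and the Nutch sources before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in for net.nutch.net.URLFilter (assumption: one method that
// returns the URL if accepted, or null if rejected).
interface URLFilter {
    String filter(String url);
}

// Hypothetical filter: each rule is "+regex" (accept) or "-regex" (reject).
// Rules are tried in order and the first one that matches decides.
class PrefixRuleURLFilter implements URLFilter {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    PrefixRuleURLFilter(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            // Skip blank lines and comments.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    @Override
    public String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        // No rule matched: this sketch rejects the URL.
        return null;
    }
}

class URLFilterDemo {
    public static void main(String[] args) {
        URLFilter filter = new PrefixRuleURLFilter(Arrays.asList(
                "# skip common image suffixes",
                "-\\.(gif|jpg|png)$",
                "+^http://([a-z0-9]*\\.)*example\\.org/"));

        System.out.println(filter.filter("http://www.example.org/index.html"));
        System.out.println(filter.filter("http://www.example.org/logo.png"));
        System.out.println(filter.filter("http://other.test/"));
    }
}
```

Running URLFilterDemo prints the first URL unchanged and null for the other two. In a real deployment the class would implement the actual Nutch interface and be selected through nutch-site.xml, as the list above describes; the property used to select it is not shown here because it depends on the Nutch version.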