List: nutch-general
Subject: Re: [Nutch-general] Interrupting a nutch crawl -- or use topN?
From: Ian Holsman <lists () holsman ! net>
Date: 2007-06-30 23:37:48
Message-ID: 4686E94C.9040005 () holsman ! net
Kai_testing Middleton wrote:
> I am running a nutch crawl of 19 sites. I wish to let this crawl go for about two
> days then gracefully stop it (I don't expect it to complete by then). Is there a
> way to do this? I want it to stop crawling then build the lucene index. Note that
> I used a simple nutch crawl command, rather than the "whole web" crawling
> methodology:
> nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10
>
I use an iterative approach, with a script similar to the one Sami blogs
about here:
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
I then issue a crawl of 10,000 URLs at a time and just repeat the
process for as long as the window is available. Because I use Solr to store
the crawl results, the index is available during the crawl window.
But I'm a relative newbie as well, so I look forward to what the experts say.
regards
Ian
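
For illustration, the batch-and-repeat approach described above could be sketched roughly as follows. This is not Ian's or Sami's actual script. It assumes the step-by-step `generate`, `fetch`, and `updatedb` subcommands from the "whole web" methodology, and a crawl directory that already holds a `crawldb` and a `segments` directory. The names `NUTCH`, `CRAWL_DIR`, `TOPN`, `WINDOW_SECS`, and `run_rounds` are made up for this sketch.

```shell
#!/bin/sh
# Time-boxed iterative crawl: a sketch under stated assumptions, not a
# tested recipe. All variable and function names here are hypothetical.
NUTCH=${NUTCH:-bin/nutch}                 # path to the nutch launcher (assumed)
CRAWL_DIR=${CRAWL_DIR:-/usr/tmp/19sites}  # must already contain crawldb/ and segments/
TOPN=${TOPN:-10000}                       # URLs per round, as in the message above
WINDOW_SECS=${WINDOW_SECS:-172800}        # crawl window: about two days

run_rounds() {
    deadline=$(( $(date +%s) + WINDOW_SECS ))
    rounds=0
    # A new round starts only while the window is open; a round that is
    # already running is allowed to finish, so the stop is graceful.
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # Select up to TOPN top-scoring URLs into a new segment.
        $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOPN" || break
        # Segment directories are timestamped, so the last one is the newest.
        segment=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)
        $NUTCH fetch "$segment" || break
        # Fold the fetch results back into the crawl database.
        $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" || break
        rounds=$((rounds + 1))
    done
    echo "completed $rounds round(s)"
}
```

Calling `run_rounds` and then running the indexing step once it returns would give the "crawl for two days, then index" behaviour asked about. Indexing is left out here because it depends on whether the results go into a Lucene index or into Solr.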