
List:       nutch-general
Subject:    Re: [Nutch-general] Nutch and distributed searching (w/ apologies)
From:       Dennis Kubes <kubes@apache.org>
Date:       2007-07-31 23:52:21
Message-ID: 46AFCB35.7040205@apache.org

It is not a problem to contact me directly if you have questions. I am 
going to include this post on the mailing list as well in case other 
people have similar questions.

When we originally started (and back when I wrote the tutorial), I 
thought the best approach would be to have a single massive set of 
segments, crawldb, linkdb, and indexes on the dfs.  And if we had 
this then we would need an index splitter so we could split those 
massive databases to have x number of urls on each search server.  
The problem with this approach, though, is that it doesn't scale very 
well (beyond about 50M pages).  You have to keep merging whatever you 
are crawling into your master databases, and after a while it takes a 
good deal of time to continually sort, merge, and re-index.

The approach we are using these days is focused on smaller distributed 
segments and hence indexes.  Here is how it works:

1) Inject your database with a beginning url list and fetch those pages.
2) Update a single master crawl db (at this point you only have one).
3) Do a generate with the -topN option to get the best urls to fetch.  
Do this for the number of urls you want on each search server.  A good 
rule of thumb is no more than 2-3 million pages per disk for searching 
(this is for web search engines).  So let's say your crawldb, once 
updated from the first run, has > 2 million urls; you would do a 
generate with -topN 2000000 (see the command sketch after this list).
4) Fetch this new segment through the fetch command.
5) Update the single master crawldb with this new segment.
6) Create a single master linkdb (at this point you will only have one) 
through the invertlinks command.
7) Index that single fetched segment.
8) Use a script, etc. to push the single index, segments, and linkdb to 
a search server directory from the dfs.
9) Do steps 3-8 for as many search servers as you have.  Once you 
reach the number of search servers you have, you can replace the 
indexes, etc. on the first, second, etc. search servers with new fetch 
cycles.  This way your index always has the best pages for the number 
of servers and amount of space you have.
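
Roughly, one of those fetch/index cycles (steps 3-8) comes down to the 
commands below.  This is only a sketch: the crawl/ paths and the 
segment name are examples, and the options are from the 0.8/0.9 
command line, so check bin/nutch on your version.

  # step 1 (first run only): seed the crawldb with your url list
  bin/nutch inject crawl/crawldb urls

  # steps 3-7: generate, fetch, update, invert, index one segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 2000000
  s=crawl/segments/20070731235959      # the segment generate created
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
  bin/nutch invertlinks crawl/linkdb $s
  bin/nutch index crawl/indexes/index-1 crawl/crawldb crawl/linkdb $s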

Once you have a linkdb created, meaning on the second or later fetch, 
you would create a linkdb for just the single segment and then use the 
mergelinkdb command to merge that single linkdb into the master linkdb.
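
A hedged sketch of that merge step ($s is the segment from the sketch 
above; the linkdb-seg and linkdb-merged paths are made up; mergelinkdb 
writes to a new output directory, which you then swap in as the master):

  bin/nutch invertlinks crawl/linkdb-seg $s   # linkdb for this segment only
  bin/nutch mergelinkdb crawl/linkdb-merged crawl/linkdb crawl/linkdb-seg
  bin/hadoop dfs -rmr crawl/linkdb            # swap the merged result in
  bin/hadoop dfs -mv crawl/linkdb-merged crawl/linkdb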

When pushing the pieces to search servers you can move the entire 
linkdb, but after a while that is going to get big.  A better way is 
to write a map reduce job that splits the linkdb to include only the 
urls from the single segment that you have fetched.  Then you would 
only move that single linkdb piece out, not the entire master linkdb.  
If you want to get started quickly, though, just copy the entire 
linkdb to each search server.
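
For the quick start (whole linkdb), pulling the pieces out of the dfs 
into a local crawl directory on a search server might look roughly 
like this; /search/crawl is just an example of the local layout the 
searcher expects (segments, indexes, linkdb):

  bin/hadoop dfs -get crawl/indexes/index-1 /search/crawl/indexes/index-1
  bin/hadoop dfs -get crawl/segments/20070731235959 /search/crawl/segments/20070731235959
  bin/hadoop dfs -get crawl/linkdb /search/crawl/linkdb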

This approach assumes that you have a search website fronting multiple 
search servers (search-servers.txt) and that you can bring down a single 
search server, update the index and pieces, and then bring the single 
search server back up.  This way the entire index is never down.
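
On the front end, search-servers.txt just lists one host and port per 
search server, and each search server serves its local crawl directory 
with the distributed search server.  The hostnames and port below are 
only examples:

  # search-servers.txt on the web search front end
  search1.example.com 9999
  search2.example.com 9999
  search3.example.com 9999

  # on each search server
  bin/nutch server 9999 /search/crawl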

Hope this helps and let me know if you have any questions.

Dennis Kubes
