
List:       nutch-general
Subject:    Re: [Nutch-general] deleting URL duplicates - never actually deleted?
From:       "Honda-Search Administrator" <admin () honda-search ! com>
Date:       2006-06-30 17:15:32
Message-ID: 003201c69c68$cdce0280$0300a8c0 () MATT

Marko,

Currently the shell command is as follows:

---
# index new segment
bin/nutch index $s1

# update the database
bin/nutch updatedb crawl/db $s1

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus

# Merge indexes
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
---

Should I actually switch the last two commands around?
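If so, the swapped tail of the script would look something like the dry-run sketch below. The `nutch` function here only echoes each command so the sequence can be inspected safely; remove it to execute for real. The segment name is a made-up example.

```shell
# Dry-run stand-in for bin/nutch; remove this function to run for real.
nutch() { echo "bin/nutch $*"; }

# index the new segment and update the db, as before
s1="crawl/segments/seg1"   # hypothetical segment name
nutch index "$s1"
nutch updatedb crawl/db "$s1"

# swapped order: merge the per-segment indexes first ...
nutch merge crawl/index crawl/segments/seg1 crawl/segments/seg2

# ... then de-duplicate the merged result
# ("bogus" works around the argument-count bug noted above)
nutch dedup crawl/index bogus
```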

Matt 

----- Original Message ----- 
From: "Marko Bauhardt" <mb@media-style.com>
To: <nutch-user@lucene.apache.org>
Sent: Friday, June 30, 2006 2:57 AM
Subject: Re: deleting URL duplicates - never actually deleted?


> 
> Do you delete the duplicates before you merge the index? Run the  
> merge command first and then the dedup command.
> 
> But a better way is to create one index of all segments with the  
> index command and then run the dedup command on that one index.
> 
> Hope this helps,
> Marko
> 
> 
> Am 29.06.2006 um 23:07 schrieb Honda-Search Administrator:
> 
>> Maybe someone can explain to me how this works.
>>
>> First, my setup.
>>
>> I create a fetchlist each night with FreeFetchlistTool and fetch  
>> those pages.  It often contains the same URLS that are already in  
>> the database, but this tool gets the newest copies of those URLs.
>>
>> I also run nutch dedup after everything is fetched, indexed, etc.   
>> I then merge the segments using the following command:
>>
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> Every night the number of "duplicates" increases.  This is because  
>> the duplicates from the day before are not actually deleted  
>> (I assume).
>>
>> Is dedup removing them from some sort of master index and the  
>> segments retain their original information?
>>
>> If so, is there a way to merge the segments into one (or whatever)  
>> so that duplicate URLs do not exist?  Would mergesegs do this?
>>
>> Thanks for any help, and I hope my question is clear.
>>
>> Matt
>>
>>
> 
> 
>

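Marko's single-index alternative might look like the following dry-run sketch. As above, the `nutch` stub only echoes the commands; whether the index command accepts multiple segment directories in one pass depends on the Nutch version, so treat the argument shapes and paths as assumptions to check against your install.

```shell
# Dry-run stand-in for bin/nutch; remove this function to run for real.
nutch() { echo "bin/nutch $*"; }

# build one index covering all segments in a single pass
# (argument shape is an assumption; check your Nutch version)
nutch index crawl/segments/seg1 crawl/segments/seg2

# then de-duplicate that single index; no merge step needed
# ("bogus" works around the argument-count bug noted earlier)
nutch dedup crawl/index bogus
```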
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
