List: nutch-general
Subject: Re: [Nutch-general] deleting URL duplicates - never actually
From: "Honda-Search Administrator" <admin () honda-search ! com>
Date: 2006-06-30 17:15:32
Message-ID: 003201c69c68$cdce0280$0300a8c0 () MATT
Marko,
Currently the shell command is as follows:
---
# index new segment
bin/nutch index $s1
# update the database
bin/nutch updatedb crawl/db $s1
# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus
# Merge indexes
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
---
Should I actually switch the last two commands around?
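For reference, the swapped order would look roughly like the sketch below. It only reorders the two commands already shown above; the NUTCH and CRAWL_DIR variables and the merge_then_dedup function name are placeholders added for illustration, and it has not been run against a real Nutch install, so whether it removes the duplicates is still an open question.

```shell
#!/bin/sh
# Sketch only: the last two steps of the script above, in swapped order.
# NUTCH and CRAWL_DIR are hypothetical variables, not part of the original script.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

merge_then_dedup() {
    # Merge indexes first (same command as in the original script).
    # $NUTCH is left unquoted on purpose so it can hold a command plus arguments.
    ls -d "$CRAWL_DIR"/segments/* | xargs $NUTCH merge "$CRAWL_DIR/index" || return 1

    # Then de-duplicate. "bogus" is the ignored placeholder argument
    # described in the original script's comments.
    $NUTCH dedup "$CRAWL_DIR/segments" bogus
}
```

Calling merge_then_dedup with NUTCH="echo bin/nutch" prints the two commands instead of running them, which is a cheap way to check the order and the segment list first.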
Matt
----- Original Message -----
From: "Marko Bauhardt" <mb@media-style.com>
To: <nutch-user@lucene.apache.org>
Sent: Friday, June 30, 2006 2:57 AM
Subject: Re: deleting URL duplicates - never actually deleted?
>
> Do you delete the duplicates before you merge the index? Run the
> merge command first and then the dedup command.
>
> A better way, though, is to create one index of all segments with the
> index command and then run the dedup command on that one index.
>
> Hope this helps,
> Marko
>
>
> On 29.06.2006 at 23:07, Honda-Search Administrator wrote:
>
>> Maybe someone can explain to me how this works.
>>
>> First, my setup.
>>
>> I create a fetchlist each night with FreeFetchlistTool and fetch
>> those pages. It often contains the same URLs that are already in
>> the database, but this tool gets the newest copies of those URLs.
>>
>> I also run nutch dedup after everything is fetched, indexed, etc.
>> I then merge the segments using the following command:
>>
>> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
>>
>> Every night the number of "duplicates" increases. This is because
>> the duplicates from the day before are not actually deleted
>> (I assume).
>>
>> Is dedup removing them from some sort of master index and the
>> segments retain their original information?
>>
>> If so, is there a way to merge the segments into one (or whatever)
>> so that duplicate URLs do not exist? Would mergesegs do this?
>>
>> Thanks for any help, and I hope my question is clear.
>>
>> Matt
>>
>>
>
>
>