List: lucene-user
Subject: Re: deleting duplicate documents from my index
From: "gekkokid" <me () gekkokid ! org ! uk>
Date: 2006-01-30 12:01:09
Message-ID: 006c01c62594$dca61fe0$b900a8c0 () jhi
Hi, that's exactly what I did :) Works perfectly.
Thanks
_gk
----- Original Message -----
From: "Chris Hostetter" <hossman_lucene@fucit.org>
To: <java-user@lucene.apache.org>
Sent: Monday, January 30, 2006 5:56 AM
Subject: Re: deleting duplicate documents from my index
>
> : Hi, I'm trying to delete duplicate documents from my index; the unique
> : identifier is the document's URL (aka field "url").
> :
> : My initial thought on how to accomplish this is to open the index via a
> : reader, sort the documents by URL, and then iterate through them,
> : comparing each document with the previous one. If the URLs match, I
> : would delete the current document, and so on.
>
> Assuming your "url" field is a keyword field (indexed, not tokenized),
> take a look at IndexReader.terms(Term), which returns a TermEnum. If you
> start with new Term("url","") and iterate until the field is no longer
> "url", you'll be iterating over every url Term in your index. For each
> one, check docFreq, and if it's more than 1 you've got a dup.
>
> Then look at IndexReader.termDocs for an easy way to find out which docs
> share that url.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
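The approach Hoss describes can be sketched in plain Java without any Lucene classes. This is only a model of the idea, not Lucene code: the sorted map stands in for the term enumeration over the "url" field, each list's size stands in for docFreq, and the list itself stands in for what termDocs would return. The class name DuplicateUrlFinder, the method findDuplicates, and the rule "keep the lowest document id, delete the rest" are all assumptions made for illustration; the thread does not say which copy should survive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical plain-Java model of the approach; no Lucene classes are used.
final class DuplicateUrlFinder {

    // termToDocs stands in for the index: each url term, visited in sorted
    // order as a term enumeration would visit it, maps to the ids of the
    // documents that contain it, in ascending order.
    static List<Integer> findDuplicates(SortedMap<String, List<Integer>> termToDocs) {
        List<Integer> toDelete = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : termToDocs.entrySet()) {
            List<Integer> docs = entry.getValue();
            // docs.size() plays the role of docFreq for this term.
            if (docs.size() > 1) {
                // Assumption: keep the first document and mark the rest.
                toDelete.addAll(docs.subList(1, docs.size()));
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> index = new TreeMap<>();
        index.put("http://a.example/", Arrays.asList(0, 3, 5));
        index.put("http://b.example/", Arrays.asList(1));
        index.put("http://c.example/", Arrays.asList(2, 4));
        // Document ids that would be passed to the reader's delete call.
        System.out.println(findDuplicates(index)); // prints [3, 5, 4]
    }
}
```

Against a real index, the loop would walk the term enumeration instead of a map and stop as soon as the term's field is no longer "url", then delete the collected ids through the same reader.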