'Re: deleting duplicate documents from my index'


List:       lucene-user
Subject:    Re: deleting duplicate documents from my index
From:       "gekkokid" <me () gekkokid ! org ! uk>
Date:       2006-01-30 12:01:09
Message-ID: 006c01c62594$dca61fe0$b900a8c0 () jhi

hi, that's exactly what i did :) works perfectly

thanks

_gk
----- Original Message ----- 
From: "Chris Hostetter" <hossman_lucene@fucit.org>
To: <java-user@lucene.apache.org>
Sent: Monday, January 30, 2006 5:56 AM
Subject: Re: deleting duplicate documents from my index


>
> : Hi, I'm trying to delete duplicate documents from my index; the unique
> : identifier is the document's URL (the "url" field).
> :
> : My initial thought on how to accomplish this is to open the index via a
> : reader, sort the documents by URL, and then iterate through them,
> : comparing each document with the previous one; if the URLs match, I
> : would delete the current document.
>
> Assuming your "url" field is a keyword field (indexed, not tokenized), then
> take a look at IndexReader.terms, which returns a TermEnum ... if you start
> with new Term("url","") and iterate until the field is no longer "url",
> you'll be iterating over every url Term in your index.  For each one, check
> docFreq, and if it's more than 1 you've got a dup.
>
> Then look at IndexReader.termDocs for an easy way to find out which docs
> share that url.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> 
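
The approach Hoss describes can be sketched without Lucene on the classpath.
This is only a model of the idea, not Lucene code: a sorted TreeMap stands in
for the "url" term dictionary, the list size stands in for docFreq, and the
list itself stands in for what IndexReader.termDocs would return. The class
and method names (UrlDedupSketch, buildUrlTerms, findDuplicateDocIds) are
hypothetical, and keeping the first document for each url is an assumption;
the thread does not say which copy to keep.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UrlDedupSketch {

    // Stand-in for the term dictionary of the "url" field: each url term
    // maps to the ids of the documents that contain it. In Lucene, the
    // terms would be walked in sorted order via IndexReader.terms(...).
    static TreeMap<String, List<Integer>> buildUrlTerms(String[] urlsByDocId) {
        TreeMap<String, List<Integer>> terms = new TreeMap<>();
        for (int docId = 0; docId < urlsByDocId.length; docId++) {
            terms.computeIfAbsent(urlsByDocId[docId], k -> new ArrayList<>()).add(docId);
        }
        return terms;
    }

    // Collects the doc ids to delete. For every url whose document
    // frequency is greater than 1, the first document is kept (assumption)
    // and the remaining ones are marked as duplicates.
    static List<Integer> findDuplicateDocIds(TreeMap<String, List<Integer>> urlTerms) {
        List<Integer> toDelete = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : urlTerms.entrySet()) {
            List<Integer> docs = entry.getValue(); // plays the role of termDocs
            int docFreq = docs.size();             // plays the role of docFreq
            if (docFreq > 1) {
                toDelete.addAll(docs.subList(1, docFreq));
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://a.example/", "http://b.example/",
            "http://a.example/", "http://a.example/"
        };
        System.out.println(findDuplicateDocIds(buildUrlTerms(urls))); // prints [2, 3]
    }
}
```

Against a real index, the collected ids would then be passed to the reader's
delete call; that step is left out here because it depends on the Lucene
version in use.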


