'Re: force deletes - terms enum still has deleted terms?'

List:       lucene-user
Subject:    Re: force deletes - terms enum still has deleted terms?
From:       Erick Erickson <erickerickson () gmail ! com>
Date:       2018-09-28 14:48:11
Message-ID: CAN4YXveC_LApQArLryrR41yVC8aXmze+i6-Aob8wpNL0uzmUOg () mail ! gmail ! com

You might be hitting a rounding error. When this happens, how many
deleted documents are there in the remaining segments? 1?

The calculation for whether to merge the segment is:

double pctDeletes = 100. * ((double) deleted_docs_in_segment /
    (double) doc_count_in_segment_including_deleted_docs);
if (pctDeletes > forceMergeDeletesPctAllowed) { merge the segment }
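
As a rough, standalone illustration of that check (not the actual Lucene source), the comparison can be sketched in plain Java. The class name PctDeletesSketch, the method names, and the parameter names are made up for this example; only the formula and the strict "greater than" comparison come from the pseudocode above.

```java
public class PctDeletesSketch {

    // Percentage of deleted docs in a segment.
    // maxDoc is assumed to count live AND deleted docs, as in the pseudocode above.
    public static double pctDeletes(int deletedDocs, int maxDoc) {
        return 100.0 * ((double) deletedDocs / (double) maxDoc);
    }

    // A segment qualifies only if its percentage is strictly greater than the threshold.
    public static boolean eligibleForMerge(int deletedDocs, int maxDoc,
                                           double forceMergeDeletesPctAllowed) {
        return pctDeletes(deletedDocs, maxDoc) > forceMergeDeletesPctAllowed;
    }

    public static void main(String[] args) {
        // No deletes: 0.0 > 0.0 is false, so the segment is left alone.
        System.out.println(eligibleForMerge(0, 1000, 0.0));   // false
        // One delete in 1000 docs with a threshold of 0: selected.
        System.out.println(eligibleForMerge(1, 1000, 0.0));   // true
        // About 5% deleted against a 10% threshold: not selected.
        System.out.println(eligibleForMerge(50, 1000, 10.0)); // false
    }
}
```

Because the comparison is strict, a segment whose percentage comes out equal to the threshold is not selected.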

At any rate, calling forceMerge instead (which goes through the merge
policy's findForcedMerges) will purge all deleted docs no matter what.

NOTE: as of 7.5, the behavior has changed in that both of these
methods will respect the maximum segment size by default. Prior to
7.5, either of these could produce a single segment for all the
segments that were merged (all of them in forceMerge, all with > n%
deleted docs in forceMergeDeletes). If you require a single segment to
result, you can specify the maxSegmentCount as 1.

See LUCENE-7976 for all the gory details of this change if you're curious.

Best,
Erick
On Fri, Sep 28, 2018 at 5:41 AM Rob Audenaerde <rob.audenaerde@gmail.com> wrote:
>
> Hi all,
>
> We build a FST on the terms of our index by iterating the terms of the
> readers for our fields, like this:
>
>     for (final LeafReaderContext ctx : leaves) {
>         final LeafReader leafReader = ctx.reader();
>
>         for (final String indexField : indexFields) {
>             final Terms terms = leafReader.terms(indexField);
>             // If the field does not exist in this reader, then we get null, so check for that.
>             if (terms != null) {
>                 final TermsEnum termsEnum = terms.iterator();
>
> However, the building of the FST sometimes seems to find terms that are
> from documents that are deleted. This is what we expect, checking the
> javadocs.
>
> So, now we switched the IndexWriter to a config with a TieredMergePolicy
> with: setForceMergeDeletesPctAllowed(0).
>
> When calling indexWriter.forceMergeDeletes(true) we expect that there will
> be no more deletes. However, the deleted terms still sometimes appear. We
> use the DirectoryReader.openIfChanged() to refresh the reader before
> iterating the terms.
>
> Are we forgetting something?
>
> Thanks in advance.
> Rob Audenaerde

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
