List: lucene-user
Subject: Re: Querying into a Collector visits documents multiple times
From: Michael Sokolov <msokolov () gmail ! com>
Date: 2021-09-24 10:52:24
Message-ID: CAGUSZHCBRweMOAB7z9CS=RHziY-nZACiaE=6JNkMf0NNEfcoOA () mail ! gmail ! com
Ah, sorry, never mind. I confused collector and collector manager.
On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov <msokolov@gmail.com> wrote:
> Separate issue, but this collector is not going to work with concurrent
> search since the sum is not updated in a thread-safe manner. Maybe you
> don't care, since you don't use a thread pool to execute your queries, but
> you probably should!
>
> On Wed, Sep 22, 2021, 8:38 AM Adrien Grand <jpountz@gmail.com> wrote:
>
>> Hi Steven,
>>
>> This collector looks correct to me. Resetting the counter to 0 on the first
>> segment is indeed not necessary.
>>
>> We have plenty of collectors that are very similar to this one and we never
>> observed any double-counting issue. I would suspect an issue in the code
>> that calls this collector. Maybe try to print the stack trace under the
>> `if (context.docBase == 0) {` check to see why your collector is being
>> called twice?
>>
>> On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
>> stevenschlansker@gmail.com> wrote:
>>
>> > Hi Lucene users,
>> >
>> > I am developing a search application that needs to do some basic
>> > summary statistics. We use Lucene 8.9.0.
>> > To improve performance for e.g. summing a value across 10,000
>> > documents, we are using DocValues as columnar storage.
>> >
>> > In order to retrieve the DocValues without collecting all hits into a
>> > TopDocs, which we determined to cause a lot of memory pressure and
>> > consume much time, we are using the expert Collector query interface.
>> >
>> > Here's the code, simplified a bit for the list:
>> >
>> > final var collector = new Collector() {
>> >     long sum = 0;
>> >
>> >     @Override
>> >     public ScoreMode scoreMode() {
>> >         return ScoreMode.COMPLETE_NO_SCORES;
>> >     }
>> >
>> >     @Override
>> >     public LeafCollector getLeafCollector(final LeafReaderContext context)
>> >             throws IOException {
>> >         if (context.docBase == 0) {
>> >             sum = 0; // XXX: this should not be necessary?
>> >         }
>> >         final var subtotalValue =
>> >                 context.reader().getNumericDocValues("subtotal");
>> >         return new LeafCollector() {
>> >             @Override
>> >             public void setScorer(final Scorable scorer) throws IOException {
>> >             }
>> >
>> >             @Override
>> >             public void collect(final int doc) throws IOException {
>> >                 if (subtotalValue.docID() > doc
>> >                         || !subtotalValue.advanceExact(doc)
>> >                         || subtotalValue.longValue() == 0) {
>> >                     return;
>> >                 }
>> >                 sum += subtotalValue.longValue();
>> >             }
>> >         };
>> >     }
>> > };
>> > searcher.search(myQuery, collector);
>> > return collector.sum;
>> >
>> > The query is a moderately complicated Boolean query with some
>> > TermQuery and MultiTermQuery instances combined together.
>> > While first testing, I observed that seemingly the collector is called
>> > twice for each document, and the sum is exactly double what you would
>> > expect.
>> >
>> > It seems that the Collector is observing every matched document twice,
>> > and by printing out the Scorer, I see that it's done with two
>> > different BooleanScorer instances.
>> > You can see my hack that resets the collector every time it starts at
>> > docBase 0, which I am sure is not the right approach, but seems to
>> > work.
>> > What is the right pattern to ensure my Collector only observes result
>> > documents once, no matter the input query? I see a note in the
>> > documentation that state is supposed to be stored on the Scorer
>> > implementation, but I am not providing a custom Scorer, nor do I
>> > actually want any scoring at all.
>> >
>> > Thank you for any guidance!
>> > Steven
>> >
>>
>> --
>> Adrien
>>
>
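
For readers following the thread-safety point in this thread: as far as I know, in Lucene 8.x `IndexSearcher.search(Query, Collector)` visits segments sequentially on the calling thread, so a plain `long sum` field in a single Collector is not shared between threads. Concurrent search goes through `search(Query, CollectorManager)`, where each slice gets its own collector and the partial results are merged at the end. What follows is a plain-Java model of that accumulate-then-reduce pattern, not Lucene code: `SliceSumSketch`, `PartialSum`, and `sumConcurrently` are hypothetical names, and the `long[]` slices merely stand in for per-segment doc values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SliceSumSketch {

    /** Hypothetical stand-in for one collector: owned by a single task, so a plain long is safe. */
    static final class PartialSum {
        long sum = 0;

        void collect(long value) {
            sum += value;
        }
    }

    /** Sums each slice in its own task with its own PartialSum, then merges the partial results. */
    static long sumConcurrently(List<long[]> slices, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<PartialSum>> futures = new ArrayList<>();
            for (long[] slice : slices) {
                futures.add(pool.submit(() -> {
                    // One fresh accumulator per task (the role newCollector() plays).
                    PartialSum partial = new PartialSum();
                    for (long v : slice) {
                        partial.collect(v);
                    }
                    return partial;
                }));
            }
            // Merge step (the role reduce() plays); runs on the calling thread.
            long total = 0;
            for (Future<PartialSum> f : futures) {
                total += f.get().sum;
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<long[]> slices =
                List.of(new long[] {1, 2, 3}, new long[] {10, 20}, new long[] {});
        System.out.println(sumConcurrently(slices, 2)); // prints 36
    }
}
```

Because each `PartialSum` is written by exactly one task and read only after `Future.get()` returns, the sketch needs no locking or atomic counter. A single accumulator shared by all tasks would need one.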