'Re: Use DirectMonotonicWriter store sorted NumericDocValues'

List:       lucene-dev
Subject:    Re: Use DirectMonotonicWriter store sorted NumericDocValues
From:       LuXugang <xuganglu () icloud ! com ! INVALID>
Date:       2021-06-17 1:48:23
Message-ID: DB372904-A9AA-4744-A2EB-28EB1DC7752E () icloud ! com

Thanks, Robert and Adrien. Your replies are very helpful to me.

> On Jun 15, 2021, at 10:19 PM, Robert Muir <rcmuir@gmail.com> wrote:
> 
> Well it definitely wouldn't be as useful as changing to a
> postings-style approach. That would bring a lot more benefits to
> general cases, e.g. use of PFOR and so on.
> 
> But it is also easier to implement right now, to accelerate cases
> where fields are sorted, without hurting other things.
> 
> On Tue, Jun 15, 2021 at 9:53 AM Adrien Grand <jpountz@gmail.com> wrote:
> > 
> > SegmentWriteState has a reference to SegmentInfo, which itself has the
> > index sort, so I believe that it would be possible.
> >
> > I wonder how useful it would be in practice. E.g. in the Elasticsearch
> > case, even though we store lots of time-based data and have been looking
> > into index sorting for storage/query efficiency reasons, the index sorts
> > that we are interested in look more like `host.name ASC, @timestamp DESC`
> > than just `@timestamp DESC`. The reason for sorting by `host` first is
> > that it helps a lot with storage/query efficiency of metadata that is tied
> > to the host (e.g. IP addresses, operating system, etc.), and then because
> > `host.name` is usually a low-cardinality field, queries by descending
> > timestamp remain super efficient thanks to LUCENE-9280. So we'd be more
> > interested in an optimization that would support piecewise monotonic
> > fields.
> >
> > On Tue, Jun 15, 2021 at 3:33 PM Robert Muir <rcmuir@gmail.com> wrote:
> > > 
> > > +1 to that idea. Maybe a shorter-term possibility would be to only do
> > > this compression on a field when the user has explicitly configured
> > > index sorting on the field (can we hackishly peek at it and tell?)
> > > 
> > > On Tue, Jun 15, 2021 at 9:04 AM Adrien Grand <jpountz@gmail.com> wrote:
> > > > 
> > > > I believe that this sort of optimization would be more effective and
> > > > robust if we made doc values look more like postings, with relatively
> > > > small blocks of values that would get compressed independently and
> > > > decompressed in bulk. This way, we wouldn't require data to be sorted
> > > > across entire segments for this optimization to kick in, and we would
> > > > be less likely to slow down the normal case.
> > > >
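
To make the block-wise idea quoted above concrete, here is a minimal sketch of how values could be classified per block rather than per segment. It is only an illustration: the class name BlockMonotonicity, the method classify, and the block size are hypothetical and not part of Lucene, and it only checks for non-decreasing runs (a descending sort would need the mirrored check).

```java
/** Hypothetical illustration, not Lucene code: per-block sortedness flags. */
final class BlockMonotonicity {

  /**
   * Splits values into consecutive blocks of blockSize entries (the last
   * block may be shorter) and returns one flag per block that is true when
   * that block is non-decreasing.
   */
  static boolean[] classify(long[] values, int blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    int numBlocks = (values.length + blockSize - 1) / blockSize;
    boolean[] sorted = new boolean[numBlocks];
    for (int b = 0; b < numBlocks; b++) {
      int start = b * blockSize;
      int end = Math.min(start + blockSize, values.length);
      boolean ok = true;
      // Compare each value with its predecessor inside the same block only.
      for (int i = start + 1; i < end; i++) {
        if (values[i] < values[i - 1]) {
          ok = false;
          break;
        }
      }
      sorted[b] = ok;
    }
    return sorted;
  }
}
```

With a field that is sorted only within groups, e.g. the values 1, 2, 3, 4, 1, 2, 3, 4, 9, 3 and a block size of 4, the first two blocks are flagged as sorted and the last one is not, even though the sequence as a whole is not monotonic.
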
> > > > On Tue, Jun 15, 2021 at 12:06 PM Robert Muir <rcmuir@gmail.com> wrote:
> > > > > 
> > > > > We did this monotonic detection/compression before in older times, but
> > > > > had to remove it because it caused too many slowdowns.
> > > > > 
> > > > > > I think it easily causes too much type pollution. For example, for a
> > > > > > typical large index with an unsorted docvalues field, big segments
> > > > > > won't be sorted, tiny segments with a few values might happen to be
> > > > > > sorted (depending on chance/luck), and tiny tiny ones with e.g. a
> > > > > > single document are sorted. Now we have a mix of monotonic and
> > > > > > non-monotonic over the same field.
> > > > > 
> > > > > > On the other hand, the optimization is very fragile and rarely
> > > > > > applies: even for these log users actually sorting on that field at
> > > > > > index time, it will just apply to one field out of the typical
> > > > > > dozens/hundreds that they like to have. But it may destroy
> > > > > > performance of all the other fields, and overall it causes more harm
> > > > > > than good.
> > > > > 
> > > > > > On Tue, Jun 15, 2021 at 5:49 AM LuXugang <xuganglu@icloud.com.invalid> wrote:
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > > In Lucene80DocValuesConsumer#writeValues(FieldInfo field,
> > > > > > > DocValuesProducer valuesProducer), all NumericDocValues are visited
> > > > > > > to calculate the GCD. In the meantime, we can check whether all
> > > > > > > values are sorted. If so, maybe we could use DirectMonotonicWriter
> > > > > > > to store them. DirectMonotonicWriter can achieve impressive
> > > > > > > compression.
> > > > > > >
> > > > > > > In addition, when I use Elasticsearch to store numeric field types,
> > > > > > > at the Lucene level the data is always stored at least as
> > > > > > > NumericDocValues/SortedNumericDocValues. So when indexing sorted
> > > > > > > values like ID or TIMESTAMP, maybe the above optimization is
> > > > > > > applicable.
> > > > > > >
> > > > > > > Could I have some suggestions?
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: dev-help@lucene.apache.org
> > > > > > 
> > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: dev-help@lucene.apache.org
> > > > > 
> > > > 
> > > > 
> > > > --
> > > > Adrien
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: dev-help@lucene.apache.org
> > > 
> > 
> > 
> > --
> > Adrien
> 


