'Re: Needs help reviewing on Lucene PostingsFormat memory improvement'


List:       lucene-dev
Subject:    Re: Needs help reviewing on Lucene PostingsFormat memory improvement
From:       Anh Dũng Bùi <dungba.sg () gmail ! com>
Date:       2024-02-08 11:40:19
Message-ID: CABEzzOHzY-O_dNfKn61AtV=3Qb-=FGxrMMi9Pf4spomE6prL8A () mail ! gmail ! com

Thanks Mike for the reply!

> Read-time for Lucene90BlockTreePostingsFormat was already off-heap?  And
your PR changes write-time to do so as well?

Yeah that's the idea. I changed just the Terms Writer to be off-heap.
Thanks, let's monitor it after the merge.

> Maybe building the synonyms FST (SynonymMap.Builder) would be a good
place for off-heap writing too?

This is a good idea. I see there's already an ongoing PR that tackles
this: https://github.com/apache/lucene/pull/13054. I'm excited to see
the feature rolling out to different parts of Lucene.

> And this exciting PR <https://github.com/apache/lucene/pull/12688> (still
a work in progress) would likely strongly benefit from streaming FST
building, since its FSTs will be much larger than the Lucene90BlockTree
ones: it stores all terms (not just the sampled prefix/index) in a single
FST for the segment.

I can try to fork this PR and convert to off-heap writing as well.

Regards,
Anh Dung Bui

On Thu, Feb 8, 2024 at 7:43 AM Michael McCandless <lucene@mikemccandless.com>
wrote:

> Hi Anh Dũng Bùi,
>
> Thank you for tackling these and for being so patient and persistent!
> Sorry for the delay.  I will try to review them soon.  The off-heap
> (streaming?) building of FSTs is really a massive improvement to Lucene,
> inspired by Tantivy's FST implementation:
> https://blog.burntsushi.net/transducers/
>
> Read-time for Lucene90BlockTreePostingsFormat was already off-heap?  And
> your PR changes write-time to do so as well?  This will reduce RAM pressure
> during indexing which is great.  And some Lucene usages generate incredibly
> large FSTs (I'm looking at you HathiTrust!). I don't think we need to
> explicitly measure any performance impact before merging, but let's watch
> the nightly benchmarks to see if there is any measurable impact.
>
> And, yes, Lucene90BlockTreePostingsFormat is the default.  You can find
> the default codec via Codec.getDefault() and then trace downwards through
> its sources.
>
> Maybe building the synonyms FST (SynonymMap.Builder) would be a good place
> for off-heap writing too?
>
> And this exciting PR <https://github.com/apache/lucene/pull/12688> (still
> a work in progress) would likely strongly benefit from streaming FST
> building, since its FSTs will be much larger than the Lucene90BlockTree
> ones: it stores all terms (not just the sampled prefix/index) in a single
> FST for the segment.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 1, 2024 at 10:40 PM Anh Dũng Bùi <dungba.sg@gmail.com> wrote:
>
>> Hi Lucene devs!
>>
>> I have 2 PRs to optimize Lucene PostingsFormat
>> (Lucene90BlockTreePostingsFormat and FSTPostingsFormat) by utilizing a new
>> feature to stream the FST to IndexOutput directly, bypassing the on-heap
>> writing:
>> - https://github.com/apache/lucene/pull/12980
>> - https://github.com/apache/lucene/pull/12985
>>
>> It would be great if someone could help review them. I also have some
>> general questions:
>> - How do I measure the memory improvement impact in Lucene?
>> - Is Lucene90BlockTreePostingsFormat the main index format used in
>> Lucene? If not, what is the main format?
>> - Are there other places worth using the new streaming FST feature?
>>
>> Thank you!
>> Anh Dung Bui
>>
>
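
On the open question of how to measure the memory impact: one rough way to
see the difference between on-heap and streaming writes, outside of Lucene
itself, is to compare how many bytes a thread allocates in each style. The
sketch below is only an illustration and uses no Lucene classes. The class
AllocationProbe and its methods are hypothetical names, the payload is a
stand-in for a serialized FST, and the allocation counter comes from the
HotSpot-specific com.sun.management.ThreadMXBean, so it is assumed that the
JVM supports it.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Lucene: compares allocation of two write styles.
public class AllocationProbe {

  /** A task that may throw IOException. */
  public interface IoTask {
    void run() throws IOException;
  }

  /** Bytes allocated by the current thread while the task runs (HotSpot-specific counter). */
  public static long allocatedBytes(IoTask task) throws IOException {
    com.sun.management.ThreadMXBean threads =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
    long id = Thread.currentThread().getId();
    long before = threads.getThreadAllocatedBytes(id);
    task.run();
    return threads.getThreadAllocatedBytes(id) - before;
  }

  /** Stand-in for serializing a large structure: writes n bytes one at a time. */
  public static void writePayload(OutputStream out, int n) throws IOException {
    for (int i = 0; i < n; i++) {
      out.write(i & 0xFF);
    }
  }

  /** On-heap style: build the whole payload in memory, then copy it to the file. */
  public static void writeBuffered(Path file, int n) throws IOException {
    ByteArrayOutputStream heap = new ByteArrayOutputStream();
    writePayload(heap, n);
    try (OutputStream out = Files.newOutputStream(file)) {
      heap.writeTo(out);
    }
  }

  /** Streaming style: write through a small fixed-size buffer straight to the file. */
  public static void writeStreaming(Path file, int n) throws IOException {
    try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
      writePayload(out, n);
    }
  }

  public static void main(String[] args) throws IOException {
    int n = 4 * 1024 * 1024; // payload size in bytes
    Path a = Files.createTempFile("buffered", ".bin");
    Path b = Files.createTempFile("streaming", ".bin");
    try {
      long buffered = allocatedBytes(() -> writeBuffered(a, n));
      long streaming = allocatedBytes(() -> writeStreaming(b, n));
      System.out.println("buffered allocated bytes:  " + buffered);
      System.out.println("streaming allocated bytes: " + streaming);
    } finally {
      Files.deleteIfExists(a);
      Files.deleteIfExists(b);
    }
  }
}
```

The exact numbers vary by JVM and run, so they are best read as a relative
comparison. For the real PRs, the nightly benchmarks mentioned above remain
the reference point.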



