[prev in list] [next in list] [prev in thread] [next in thread] 

List:       lucene-dev
Subject:    Re: Boolean Scorer
From:       Arihant Samar <arisamjay () gmail ! com>
Date:       2021-06-24 11:31:58
Message-ID: CAAd1=mXpp7v5+=6LBm0g7RKJ4S6CMKMRcTgvU13W8ZvhcrjHBQ () mail ! gmail ! com
[Download RAW message or body]

I managed to correct some mistakes and now the tests which checks scores
are passing. Obviously the tests which check about the same thread
generating and collecting fail , but just out of interest I removed those
asserts. Are there any tests or benchmarks which I can compare how these
changes perform.

Thanking you in advance,
Arihant.

On Tue, 22 Jun 2021 at 11:37, Arihant Samar <arisamjay@gmail.com> wrote:

> There was a Jira relating to GPU acceleration where it was mentioned that
> Boolean Scorer has possibilities of GPU usage.
>  So I was just checking first with multithreading in Java itself and
> thought that this function may be amenable to parallelization.
> Hence I was just giving it a try.
> Will this not be useful if there are very long Boolean queries with a lot
> of SHOULD clauses although I have no clue if this is a common situation.
>
> I just need one more little help. Although some of the tests do give the
> error Adrien mentioned that docs should be collected in the same thread
> they were generated, but some tests also give wrong scores itself. Do you
> see anything wrong in the synchronization I have done?
> The synchronization I have done is basically creating an array of
> matching.length size of Reentrant locks and just running the function
> "ScoreWindowIntoBitSetAndReplay " with numScorer threads instead of the for
> loop.
> /// in BooleanScorer.java -> OrCollector -> collect function
> Lock[idx].lock();
> matching[idx] |= 1L << i;
> final Bucket bucket = buckets[i];
> bucket.freq++;
> bucket.score += scorer.score();
> Lock[idx].unlock();
>
>
>
> On Mon, 21 Jun 2021 at 19:04, Adrien Grand <jpountz@gmail.com> wrote:
>
>> It should be possible to make something like this work. The main issue is
>> that Lucene has the expectation that a (Bulk)Scorer is consumed in the
>> thread where it was pulled, so this would require substantial changes to
>> how BooleanScorer currently operates I believe.
>>
>> I'd be curious to know why you are looking into this rather than passing
>> an Executor to IndexSearcher so that it can search segments concurrently.
>> Is it not providing enough concurrency for you?
>>
>> On Mon, Jun 21, 2021 at 9:46 AM Arihant Samar <arisamjay@gmail.com>
>> wrote:
>>
>>> Hi,
>>> There is a function "ScoreWindowIntoBitSetAndReplay" in
>>> "BooleanScorer.java" which runs over all the scorers.
>>> I was wondering if we can use multi-threading here with numScorers
>>> threads. Anyways we are using a special OrCollector here which updates the
>>> matching array and the score in the buckets of 2048 docs. So we can use a
>>> Reentrant lock for synchronization in the collector.
>>>
>>> I just wanted reviews on this since I tried this and some tests were not
>>> passing. So if you could tell what is wrong in this approach, I
>>> would appreciate it.
>>>
>>> Thanking You in advance,
>>> Arihant.
>>>
>>> On Tue, 15 Jun 2021, 19:05 Adrien Grand, <jpountz@gmail.com> wrote:
>>>
>>>> Glad it helped. :)
>>>>
>>>> On Tue, Jun 15, 2021 at 3:28 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>>
>>>>> Thanks for this explanation Adrien! I'd been wondering about this a
>>>>> bit myself since seeing that DrillSideways also implements a TAAT approach
>>>>> (in addition to a doc-at-a-time approach). This really helps clear that up.
>>>>> Appreciate you taking the time to explain!
>>>>>
>>>>> Cheers,
>>>>> -Greg
>>>>>
>>>>> On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Arihant,
>>>>>>
>>>>>> The Scorer for disjunctions uses a heap data structure that needs to
>>>>>> be reordered upon every hit. While reordering heaps is efficient as it runs
>>>>>> in logarithmic time, the fact that it needs to run on every document might
>>>>>> add non-negligible overhead. BooleanScorer tries to work around this
>>>>>> overhead by scoring large windows of documents in a more TAAT
>>>>>> (term-at-a-time) fashion so that Lucene only needs to reorder the heap
>>>>>> every 2048 doc IDs (the hardcoded window size).
>>>>>>
>>>>>> This paper gives a bit more context:
>>>>>> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
>>>>>> see section 4 in particular.
>>>>>>
>>>>>> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi ,
>>>>>>>
>>>>>>> I am new here . I would like to know what is the exact optimisation
>>>>>>> carried out in "Boolean Scorer.java" code which led to a separate class for
>>>>>>> resolving Boolean Queries in bulk documents. I could not find any material
>>>>>>> in the documentation for this as well, hence I decided to ask here.
>>>>>>>
>>>>>>>
>>>>>>> Thanking you in advance,
>>>>>>>
>>>>>>> Arihant.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>> Windows 10
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Adrien
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Adrien
>>>>
>>>
>>
>> --
>> Adrien
>>
>

[Attachment #3 (text/html)]

<div dir="ltr"><div>I managed to correct some mistakes and now the tests which checks \
scores are passing. Obviously the tests which check about the same thread generating \
and collecting fail , but just out of interest I removed those asserts. Are there any \
tests or benchmarks which I can compare how these changes \
perform.</div><div><br></div><div>Thanking you in \
advance,</div><div>Arihant.<br></div></div><br><div class="gmail_quote"><div \
dir="ltr" class="gmail_attr">On Tue, 22 Jun 2021 at 11:37, Arihant Samar &lt;<a \
href="mailto:arisamjay@gmail.com">arisamjay@gmail.com</a>&gt; \
wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div \
dir="ltr"><div>There was a Jira relating to GPU acceleration where it was mentioned \
that Boolean Scorer has possibilities of GPU usage.</div><div>  So I was just \
checking first with multithreading in Java itself and thought that this function may \
be amenable to parallelization. <br></div><div>Hence I was just giving it a try. \
<br></div><div>Will this not be useful if there are very long Boolean queries with a \
lot of SHOULD clauses although I have no clue if this is a common \
situation.</div><div><br></div><div>I just need one more little help. Although some \
of the tests do give the error Adrien mentioned that docs should be collected in the \
same thread they were generated, but some tests also give wrong scores itself. Do you \
see anything wrong in the synchronization I have done?<br></div><div>The \
synchronization I have done is basically creating an array of matching.length size of \
Reentrant locks and just running the  function &quot;ScoreWindowIntoBitSetAndReplay

&quot; with numScorer threads instead of the for loop.</div><div>/// in \
BooleanScorer.java -&gt; OrCollector -&gt; collect \
function<br></div><div>Lock[idx].lock();<br></div><div> <table><tbody><tr><td \
id="gmail-m_3415230028955905295gmail-LC135">      matching[idx] <span>|=</span> \
<span>1L</span> <span>&lt;&lt;</span> i;</td>  </tr>
      <tr>
        </tr></tbody></table><table><tbody><tr><td \
id="gmail-m_3415230028955905295gmail-LC136">      <span>final</span> \
<span><span>Bucket</span></span> bucket <span>=</span> buckets[i];</td>  </tr>
      <tr>
        </tr></tbody></table><table><tbody><tr><td \
id="gmail-m_3415230028955905295gmail-LC137">      \
bucket<span>.</span><span>freq</span><span>++</span>;</td>  </tr>
      <tr>
        </tr></tbody></table>      bucket<span>.</span>score <span>+=</span> \
<span>scorer</span><span>.</span><span>score</span>();

</div><div>Lock[idx].unlock();</div><div><br></div><div><br></div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 21 Jun 2021 at 19:04, \
Adrien Grand &lt;<a href="mailto:jpountz@gmail.com" \
target="_blank">jpountz@gmail.com</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr">It should be possible to make \
something like this work. The main issue is that Lucene has the expectation that a \
(Bulk)Scorer is consumed in the thread where it was pulled, so this would require \
substantial changes to how BooleanScorer currently operates I \
believe.<div><br></div><div>I&#39;d be curious to know why you are looking into this \
rather than passing an Executor to IndexSearcher so that it can search segments \
concurrently. Is it not providing enough concurrency for \
you?</div><div></div></div><br><div class="gmail_quote"><div dir="ltr" \
class="gmail_attr">On Mon, Jun 21, 2021 at 9:46 AM Arihant Samar &lt;<a \
href="mailto:arisamjay@gmail.com" target="_blank">arisamjay@gmail.com</a>&gt; \
wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div \
dir="auto">Hi,<div dir="auto">There is a function \
&quot;ScoreWindowIntoBitSetAndReplay&quot; in &quot;BooleanScorer.java&quot; which \
runs over all the scorers.</div><div dir="auto">I was wondering if we can use \
multi-threading here with numScorers threads. Anyways we are using a special \
OrCollector here which updates the matching array and the score in the buckets of \
2048 docs. So we can use a Reentrant lock for synchronization in the \
collector.</div><div dir="auto"><br></div><div dir="auto">I just wanted reviews on \
this since I tried this and some tests were not passing. So if you could tell what is \
wrong in this approach, I would  appreciate it.</div><div dir="auto"><br></div><div \
dir="auto">Thanking You in advance,</div><div dir="auto">Arihant.</div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 15 Jun 2021, 19:05 \
Adrien Grand, &lt;<a href="mailto:jpountz@gmail.com" \
target="_blank">jpountz@gmail.com</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr">Glad it helped. :)</div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 15, 2021 at 3:28 PM \
Greg Miller &lt;<a href="mailto:gsmiller@gmail.com" rel="noreferrer" \
target="_blank">gsmiller@gmail.com</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr">Thanks for this explanation  \
Adrien! I&#39;d been wondering about this a bit myself  since seeing that \
DrillSideways also implements a TAAT approach (in addition to a doc-at-a-time \
approach). This really helps clear that up. Appreciate you taking the time to \
explain!<div><br></div><div>Cheers,</div><div>-Greg</div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jun 14, 2021 at 2:35 AM \
Adrien Grand &lt;<a href="mailto:jpountz@gmail.com" rel="noreferrer" \
target="_blank">jpountz@gmail.com</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello \
Arihant,<div><br></div><div>The Scorer for disjunctions uses a heap data structure \
that needs to be reordered upon every hit. While reordering heaps is efficient as it \
runs in logarithmic time, the fact that it needs to run on every document  might add \
non-negligible overhead. BooleanScorer tries to work around this overhead by scoring \
large windows of documents in a more TAAT (term-at-a-time) fashion so that Lucene \
only needs to reorder the heap every 2048 doc IDs (the hardcoded window \
size).</div><div><br></div><div>This paper gives a bit more context:  <a \
href="http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf" \
rel="noreferrer" target="_blank">http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf</a>, \
see section 4 in particular.</div><div><br><div class="gmail_quote"><div dir="ltr" \
class="gmail_attr">On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar &lt;<a \
href="mailto:arisamjay@gmail.com" rel="noreferrer" \
target="_blank">arisamjay@gmail.com</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div lang="EN-IN"><div><p \
class="MsoNormal">Hi , </p><p class="MsoNormal">I am new here . I would like to know \
what is the exact optimisation carried out in "Boolean Scorer.java" code which led to \
a separate class for resolving Boolean Queries in bulk documents. I could not find \
any material in the documentation for this as well, hence I decided to ask \
here.</p><p class="MsoNormal"><br></p><p class="MsoNormal">Thanking you in \
advance,</p><p class="MsoNormal">Arihant.<br></p><p class="MsoNormal"><u></u>  \
<u></u></p><p class="MsoNormal"><a \
name="m_3415230028955905295_m_-355651446825561719_m_4390598802589095752_m_663796961616 \
6651828_m_-2098735838736900253_m_1348256171958901660_m_-2529958919021170347_m_-3757628723102844553_m_9013538365410428761__olk_signature" \
rel="noreferrer">Sent from </a><a \
href="https://go.microsoft.com/fwlink/?LinkId=550986" rel="noreferrer" \
target="_blank"><span>Mail</span><span></span></a><span> for Windows 10</span></p><p \
class="MsoNormal"><span><span><u></u>  <u></u></span></span></p></div></div></div> \
</blockquote></div><br clear="all"><div><br></div>-- <br><div \
dir="ltr">Adrien</div></div></div> </blockquote></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr">Adrien</div>
</blockquote></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr">Adrien</div>
</blockquote></div>
</blockquote></div>



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic