
List:       solr-dev
Subject:    Re: Vector Search with OpenAI Embeddings: Lucene Is All You Need
From:       Kent Fitch <kent.fitch () gmail ! com>
Date:       2023-09-05 23:39:16
Message-ID: CA+WtSxxVORL9WEUJ7iNAxEsS+X5+jP4FUyzr0LcifaFjFAnQ-Q () mail ! gmail ! com

Thanks for your kind comments, Uwe - the Code4Lib editors and reviewers had
a lot to do with making it readable.  I asked them whether the Code4Lib
journal uses DOIs and, perhaps surprisingly for a journal run by
library/info-systems people, they told me they don't, but that the URL
should be "permanent". (Somehow, I'd bank on the DOI infrastructure lasting
longer, but never mind.)  Also, apologies if you were trying to access some
examples at the
https://nla-overproof.projectcomputing.com/knnBlend site over the weekend:
a combination of aggressive crawling and pro-bono penetration testing
knocked it offline for several hours, which was my fault for not
sufficiently checking parameters on an experimental site before publishing
the URL!

best regards

Kent Fitch

On Sat, Sep 2, 2023 at 8:37 PM Uwe Schindler <uwe@thetaphi.de> wrote:

> Hey,
>
> Very nice article! Looks like lots of manual work to look at search
> results in those examples. Great work!
>
> Do you have a DOI name for the article?
>
> Uwe
>
>
> On 1 September 2023 07:22:09 CEST, Kent Fitch <
> kent.fitch@gmail.com> wrote:
>
>> My testing shows Lucene's HNSW in a very positive light.  The ability to
>> perform blended searches (vector/semantic and text) is valuable, even with
>> high-quality embeddings, and helps when the searcher's intent is to search
>> for specific words or phrases (such as a name, or exact concepts) which get
>> blurred out by semantics.  I discussed blended searching using Lucene in
>> this Code4Lib article: https://journal.code4lib.org/articles/17443
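
A minimal sketch of what blending a text score with a vector score could
look like. The message does not say how the two are combined, so the min-max
normalisation, the weighted sum, and the names blend_scores and text_weight
are all assumptions made for illustration, not the method from the article.

```python
def blend_scores(text_scores, vector_scores, text_weight=0.5):
    """Rank documents by a weighted sum of normalised text and vector scores.

    text_scores and vector_scores map document id -> raw score.
    text_weight is a hypothetical tuning knob in [0, 1]; 1.0 means text only.
    """
    def normalise(scores):
        # Min-max scale to [0, 1] so the two score types are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {doc: ((s - lo) / span if span else 1.0)
                for doc, s in scores.items()}

    text = normalise(text_scores)
    vector = normalise(vector_scores)
    blended = {
        # A document missing from one result list contributes 0 for that side.
        doc: text_weight * text.get(doc, 0.0)
             + (1.0 - text_weight) * vector.get(doc, 0.0)
        for doc in set(text) | set(vector)
    }
    # Highest blended score first; ties broken by document id.
    return sorted(blended.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    text_hits = {"doc-a": 10.0, "doc-b": 2.0}      # e.g. keyword scores
    vector_hits = {"doc-b": 0.9, "doc-c": 0.5}     # e.g. similarity scores
    for doc, score in blend_scores(text_hits, vector_hits, text_weight=0.7):
        print(doc, round(score, 3))
```

Raising text_weight favours exact word or phrase matches (a name, say), while
lowering it favours semantic neighbours.
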
>>
>> And regarding performance, I have benchmarked Lucene's HNSW (circa
>> Jan 2023 snapshot) on a test index of 192 million vectors of 1536
>> dimensions, reduced by PQ coding to 512 bytes and stored in HNSW.  Building
>> this index was slow (lots of time merging...) but once it was built, it did
>> fit entirely in memory (Core i7-9800X (8 cores) with 128 GB DDR4 memory
>> running at 2400 MT/s) so no IO was required at search time.  (I modified
>> the Lucene similarity code to support expansion of each of the 512 PQ byte
>> codes back to 3 floats for the distance calculation.)  I haven't updated
>> this to take advantage of the latest SIMD capability, but even so, once the
>> HNSW structure is in memory, a single topK=10 search thread
>> achieves 2.4 queries/second.  Two threads: 4.9 q/s, 4 threads: 7.2 q/s,
>> maxing out at 8 threads: 9.4 q/s.  I guess the non-linear scaling with
>> threads is due to competition for memory bandwidth and cache.  Curiously,
>> I'm not getting nearly as good performance out of the box using Milvus
>> 2.3's DiskANN, but I need to find out why before condemning it.
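
To make the arithmetic above concrete, here is a rough sketch of expanding a
512-byte PQ code back to 1536 floats (3 per byte) and of the resulting index
footprint. It is not the modified Lucene code: the codebook layout (one table
of 256 three-float centroids per byte position), the dot-product scoring, and
the function names are assumptions made for illustration.

```python
import random

DIMS = 1536                      # embedding dimensions (from the message)
CODE_BYTES = 512                 # PQ code length in bytes (from the message)
SUB_DIMS = DIMS // CODE_BYTES    # floats recovered per byte: 3
NUM_VECTORS = 192_000_000        # vectors in the test index (from the message)


def decode(code, codebooks):
    """Expand a PQ code (one byte per sub-space) into a flat list of floats."""
    out = []
    for subspace, centroid_id in enumerate(code):
        out.extend(codebooks[subspace][centroid_id])
    return out


def dot_product(query, code, codebooks):
    """Score a full-precision query against a PQ-coded vector (assumed metric)."""
    return sum(q * x for q, x in zip(query, decode(code, codebooks)))


def index_bytes(num_vectors=NUM_VECTORS, code_bytes=CODE_BYTES):
    """Bytes for the PQ codes alone; HNSW links and codebooks are excluded."""
    return num_vectors * code_bytes


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical random codebooks: 512 sub-spaces x 256 centroids x 3 floats.
    codebooks = [[[rng.uniform(-1, 1) for _ in range(SUB_DIMS)]
                  for _ in range(256)]
                 for _ in range(CODE_BYTES)]
    code = bytes(rng.randrange(256) for _ in range(CODE_BYTES))
    query = [rng.uniform(-1, 1) for _ in range(DIMS)]

    print(len(decode(code, codebooks)))           # number of floats recovered
    print(dot_product(query, code, codebooks))
    print(index_bytes() / 1e9, "GB of PQ codes")  # roughly 98.3 GB
    print(NUM_VECTORS * DIMS * 4 / 1e9, "GB as raw float32")
```

At 512 bytes per vector the codes alone come to about 98 GB, which is why the
index fits in 128 GB of memory, whereas the raw float32 vectors would need
roughly 1.18 TB.
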
>>
>> Kent Fitch
>>
>> On Thu, Aug 31, 2023 at 7:53 PM Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>>> Thanks Michael, very interesting!  I of course agree that Lucene is all
>>> you need, heh ;)
>>>
>>> Jimmy Lin also tweeted about the strength of Lucene's HNSW:
>>> https://twitter.com/lintool/status/1681333664431460353?s=20
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Thu, Aug 31, 2023 at 3:31 AM Michael Wechner <
>>> michael.wechner@wyona.com> wrote:
>>>
>>>> Hi all
>>>>
>>>> You might be interested in this paper / article
>>>>
>>>> https://arxiv.org/abs/2308.14963
>>>>
>>>> Thanks
>>>>
>>>> Michael
>>>>
>>>> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de
>
