List:       solr-dev
Subject:    Re: Payloads for each term
From:       Michael Sokolov <msokolov@gmail.com>
Date:       2022-01-13 14:41:57
Message-ID: CAGUSZHDw5cvVLrdWEwXttExV2ivyiUS7ZV2sLUBnOFB7_agDzA@mail.gmail.com

Oh interesting! I did not know about this FeatureField (the link was to
the old repo, now gone;
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/FeatureField.java
worked for me).

On Wed, Nov 11, 2020 at 4:37 PM Mayya Sharipova
<mayya.sharipova@elastic.co.invalid> wrote:
>
> For sparse vectors, we found that Lucene's FeatureField could also be
> useful. It stores features as terms and feature values as term
> frequencies, and provides several convenient functions to calculate
> scores based on feature values.
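As a minimal sketch of the FeatureField usage described above (FeatureField and its query factories are real Lucene APIs, but the field name "features" and feature names like "pagerank" are made up for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.Query;

public class FeatureFieldSketch {
  public static void main(String[] args) {
    // Each feature is indexed as a term in the "features" field, with its
    // value encoded in the term frequency.
    Document doc = new Document();
    doc.add(new FeatureField("features", "pagerank", 0.8f));
    doc.add(new FeatureField("features", "freshness", 0.3f));

    // At query time, helper factories turn the stored value into a score
    // contribution, e.g. a saturation function of the "pagerank" value:
    Query q = FeatureField.newSaturationQuery("features", "pagerank");
  }
}
```

The document would be added through a normal IndexWriter, and the query combined (e.g. via BooleanQuery) with the main relevance query.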
>
> On Fri, Nov 6, 2020 at 11:16 AM Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>
>> Also, be aware that recent Lucene versions enabled compression for
>> BinaryDocValues fields, which might hurt performance of your second
>> solution.
>>
>> This compression is not yet something you can easily turn off, but
>> there are ongoing discussions/PRs about how to make it more easily
>> configurable for applications that care more about search CPU cost
>> than index size for BinaryDocValues fields:
>> https://issues.apache.org/jira/browse/LUCENE-9378
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Nov 6, 2020 at 10:21 AM Michael McCandless
>> <lucene@mikemccandless.com> wrote:
>>>
>>> In addition to payloads having kind of high-ish overhead (they slow
>>> down indexing, do not compress very well I think, and slow down
>>> search since you must pull positions), they are also sort of a
>>> forced fit for your use case, right?  Because a payload in Lucene is
>>> per-term-position, whereas you really need this vector per-term
>>> (irrespective of the positions where that term occurs in each
>>> document)?
>>>
>>> Your second solution is an intriguing one.  So you would use
>>> Lucene's custom term frequencies to store indices into that
>>> per-document map encoded into a BinaryDocValues field?  During
>>> indexing I guess you would need a TokenFilter that hands out these
>>> indices in order (0, 1, 2, ...) based on the unique terms it sees,
>>> and after all tokens are done, it exports a byte[] serialized map?
>>> Hmm, except term frequency 0 is not allowed, so you'd need to add 1
>>> to all indices.
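The TokenFilter described above might be sketched as follows. CharTermAttribute and TermFrequencyAttribute are real Lucene APIs (the latter requires the field to be indexed with frequencies but without positions); the class itself and the duplicate-skipping behavior are illustrative assumptions:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

/**
 * Hypothetical sketch: assigns each unique term an index (0, 1, 2, ...)
 * and stores index + 1 as the custom term frequency, since a term
 * frequency of 0 is not allowed. The same term-to-index map would then
 * be used to serialize the per-document vectors into a byte[] for a
 * BinaryDocValues field.
 */
final class VectorPointerFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute tfAtt = addAttribute(TermFrequencyAttribute.class);
  private final Map<String, Integer> termToIndex = new HashMap<>();

  VectorPointerFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String term = termAtt.toString();
      Integer existing = termToIndex.putIfAbsent(term, termToIndex.size());
      if (existing == null) {
        // First occurrence: emit it with freq = index + 1 (freq 0 is illegal).
        tfAtt.setTermFrequency(termToIndex.get(term) + 1);
        return true;
      }
      // Repeated occurrence: skip it, so repeated custom frequencies for
      // the same term do not accumulate.
    }
    return false;
  }

  /** Exposes the term-to-index map for serializing the per-doc vectors. */
  Map<String, Integer> termToIndex() {
    return termToIndex;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    termToIndex.clear();
  }
}
```

After the stream is consumed for a document, the caller would read termToIndex(), lay the feature vectors out in index order, and store the serialized bytes in the BinaryDocValues field.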
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Oct 26, 2020 at 6:16 AM Bruno Roustant
>>> <bruno.roustant@gmail.com> wrote:
>>>>
>>>> Hi Ankur,
>>>> Indeed, payloads are the standard way to solve this problem. For
>>>> light queries with a few top-N results, that should be efficient.
>>>> For multi-term queries, it could become costly if you need to
>>>> access the payloads of too many terms.
>>>> Also, there is an experimental PostingsFormat called
>>>> SharedTermsUniformSplit (class named STUniformSplitPostingsFormat)
>>>> that would allow you to effectively share the overlapping terms in
>>>> the index while having 50 fields. This would solve the index-bloat
>>>> issue, but would not fully solve the seeks issue. You might want to
>>>> benchmark this approach too.
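Wiring an experimental PostingsFormat in per field might look roughly like the following sketch. This rests on several assumptions flagged here: a Lucene 8.7-era codec class, the SPI name "SharedTermsUniformSplit" (as given above), the lucene-codecs module being on the classpath, and a hypothetical "dim_" naming convention for the ~50 dimension fields:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;
import org.apache.lucene.index.IndexWriterConfig;

public class SharedTermsConfigSketch {
  public static void main(String[] args) {
    // Look up the experimental shared-terms format by its SPI name.
    PostingsFormat shared = PostingsFormat.forName("SharedTermsUniformSplit");

    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setCodec(new Lucene87Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        // Hypothetical convention: the dimension fields share one terms
        // dictionary; every other field keeps the default format.
        return field.startsWith("dim_") ? shared : super.getPostingsFormatForField(field);
      }
    });
    // ... open an IndexWriter with this config and index as usual.
  }
}
```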
>>>>
>>>> Bruno
>>>>
>>>> On Fri, Oct 23, 2020 at 02:48, Ankur Goel <ankur.goel79@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi Lucene Devs,
>>>>>            I have a need to store a sparse feature vector on a
>>>>> per-term basis. The total number of possible dimensions is small
>>>>> (~50) and known at indexing time. The feature values will be used
>>>>> in scoring along with corpus statistics. It looks like payloads
>>>>> were created for this exact purpose, but some workaround is needed
>>>>> to minimize the performance penalty, as mentioned on the wiki.
>>>>>
>>>>> An alternative is to override the term frequency to be a pointer
>>>>> into a Map<pointer, Feature_Vector> serialized and stored in
>>>>> BinaryDocValues. At query time, the matching docId will be used to
>>>>> advance to the starting offset of this map. The term frequency
>>>>> will then be used to look up the Feature_Vector in the serialized
>>>>> map. That's my current plan, but I haven't benchmarked it.
>>>>>
>>>>> The problem I am trying to solve is to reduce index bloat and
>>>>> eliminate unnecessary seeks: currently these ~50 dimensions are
>>>>> stored as separate fields in the index with very high term
>>>>> overlap, and Lucene does not share the terms dictionary across
>>>>> different fields. Sharing it could itself be a new feature for
>>>>> Lucene, but I imagine it would require a lot of work.
>>>>>
>>>>> Any ideas are welcome :-)
>>>>>
>>>>> Thanks
>>>>> -Ankur

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
