

List:       lucene-dev
Subject:    Re: Are the new index consistency checks too strict?
From:       Adrien Grand <jpountz () gmail ! com>
Date:       2021-09-02 12:07:36
Message-ID: CAPsWd+OKVKuhuUHThsKcgr_mU94GzpQs=Nz-voz2X1DkB5QqgQ () mail ! gmail ! com

Yes. The idea behind these new enforcements is that all documents must have
a consistent schema, but we still support the case when some documents are
missing values for a field. Whenever a field gets added for the first time
to an index, we generate a FieldInfo for it. And further documents that
have this field must use exactly the same features on this field as the
ones that are configured on this initial FieldInfo.

For instance, if you index a document with both terms and doc values on a
given field, then further documents must have both terms and doc values on
this field too, or nothing. They cannot have only terms or only doc
values; this is illegal.

Likewise, if you index a document with only terms, then further documents
must have either terms or nothing. They cannot have terms and doc values,
or even doc values only; this is illegal.
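
To illustrate the rule above, here is a minimal sketch in plain Java. It is a
simplified model and not Lucene's actual implementation: the class
FieldSchemaSketch, the Feature enum and the methods addField/accepts are
hypothetical names made up for this example, and only two features (terms and
doc values) are modeled. A document that omits a field simply never calls
addField for it, which is why "nothing" is always allowed.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Simplified model of the per-field consistency rule; NOT Lucene code. */
final class FieldSchemaSketch {

    /** Hypothetical stand-in for the features recorded on a FieldInfo. */
    enum Feature { TERMS, DOC_VALUES }

    /** Features recorded the first time each field name was seen. */
    private final Map<String, Set<Feature>> schema = new HashMap<>();

    /**
     * Records the features on first use of a field; on later uses, requires
     * exactly the same feature set and throws otherwise.
     */
    void addField(String name, EnumSet<Feature> features) {
        Set<Feature> first = schema.putIfAbsent(name, features.clone());
        if (first != null && !first.equals(features)) {
            throw new IllegalArgumentException(
                "Inconsistent schema for field \"" + name + "\": first seen with "
                    + first + ", now " + features);
        }
    }

    /** Convenience wrapper: true if the field was accepted, false if rejected. */
    boolean accepts(String name, EnumSet<Feature> features) {
        try {
            addField(name, features);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        FieldSchemaSketch index = new FieldSchemaSketch();

        // First document: field "f" has both terms and doc values.
        index.addField("f", EnumSet.of(Feature.TERMS, Feature.DOC_VALUES));
        System.out.println("both again:      "
            + index.accepts("f", EnumSet.of(Feature.TERMS, Feature.DOC_VALUES)));
        System.out.println("terms only:      "
            + index.accepts("f", EnumSet.of(Feature.TERMS)));
        System.out.println("doc values only: "
            + index.accepts("f", EnumSet.of(Feature.DOC_VALUES)));

        // First document: field "g" has only terms.
        index.addField("g", EnumSet.of(Feature.TERMS));
        System.out.println("terms again:     "
            + index.accepts("g", EnumSet.of(Feature.TERMS)));
        System.out.println("terms + dv:      "
            + index.accepts("g", EnumSet.of(Feature.TERMS, Feature.DOC_VALUES)));
    }
}
```

The exact-match comparison is the whole point of the sketch: the first
feature set seen for a field name wins, and any later document using that
field name with a different set is rejected.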


On Thu, Sep 2, 2021 at 2:03 PM Michael Sokolov <msokolov@gmail.com> wrote:

> Hmm .. I guess I missed the implication of your comment about
> requiring both points and docvalues for some cases, which I guess
> could be violated if we relaxed this NONE != not NONE enforcement for
> docvalues (or points)...
>
> On Thu, Sep 2, 2021 at 7:46 AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > Oh, and also, I like the idea of making index sorting parent/child aware!
> >
> > On Thu, Sep 2, 2021 at 7:45 AM Michael Sokolov <msokolov@gmail.com> wrote:
> > >
> > > Yes, I am also supportive of the idea of having a schema that is
> > > enforced, and I like what it enables us to do. I just wonder if we
> > > could relax the enforcement around IndexOptions.NONE (and
> > > DocValuesType.NONE). Would it make sense to enable NONE to be "equal
> > > to" any other IndexOptions, so that, e.g., if you index a field with
> > > IndexOptions.DOCS_AND_TERMS then every document must have either
> > > DOCS_AND_TERMS or NONE?  In the case where a field is *only* indexed
> > > as terms, and has no docvalues, this is already allowed. But if you
> > > index a field as both docvalue and terms, then it is not (currently),
> > > which seems weird. I guess the same is true of a field that has no
> > > docvalues on some docs, and has them on others, but is also indexed as
> > > terms everywhere. I think that ought to be allowed (since you can have
> > > a sparse docvalues field that is not indexed with terms).
> > >
> > > On Wed, Sep 1, 2021 at 12:24 PM Adrien Grand <jpountz@gmail.com> wrote:
> > > >
> > > > This additional validation that we introduced in Lucene 9 feels like
> > > > a natural extension of the validation that we already had before, such
> > > > as the fact that you cannot have some docs that use SORTED doc values
> > > > and other docs that use NUMERIC doc values on the same field. Actually
> > > > I would have liked to go further by enforcing that all data structures
> > > > record the exact same information but this is challenging due to the
> > > > fact that IndexingChain only has access to the encoded data, e.g. with
> > > > IntPoint it only sees a byte[] rather than the original integer, so
> > > > we'd have to make assumptions about how the data is encoded, which
> > > > doesn't feel right.
> > > >
> > > > I do like this additional validation very much because I suspect
> > > > that in most cases where users get this error, it is because they made
> > > > a mistake in their indexing code. And this also helps make Lucene work
> > > > better out-of-the-box. For instance, thanks to this additional
> > > > validation we enabled dynamic pruning when sorting on numeric fields by
> > > > default. This is opt-in on 8.x since this optimization needs to look at
> > > > both points and doc values, so it's broken if not all documents have
> > > > the same schema. And there are other things we could do in the near
> > > > future like rewriting DocValuesFieldExistsQuery to a MatchAllDocsQuery
> > > > when points/terms report that docCount == maxDoc.
> > > >
> > > > In my opinion the correct solution for the problem you are facing
> > > > would be to have a way to make index sorting aware of the parent/child
> > > > relationship so that index sorting would read the sort key of the
> > > > parent document whenever it is on a child document, e.g. as done on
> > > > LUCENE-5312. This way you wouldn't have to duplicate this sort key from
> > > > your parent documents to your child documents, so you wouldn't have any
> > > > schema issues.
> > > >
> > > > On Wed, Sep 1, 2021 at 4:44 PM Michael Sokolov <msokolov@gmail.com> wrote:
> > > >>
> > > >> While upgrading I ran afoul of some inconsistencies in our schema
> > > >> usage, and to fix them I've ended up having to add data to our index
> > > >> that I'd rather not. Let me give a little context: We have a
> > > >> parent/child document structure. Some fields are shared across parent
> > > >> and child docs, others are not. Our index has a sort key, and in order
> > > >> for all the parent/child docs to sort together correctly, we add the
> > > >> same (docvalues) fields that are part of the sortkey to both parent
> > > >> and child docs. Some of these fields are *also* indexed as postings
> > > >> (StringField) of the same name, but we only index the postings field
> > > >> on the parent document, since child documents are never searched for
> > > >> on their own - always in conjunction with a parent.
> > > >>
> > > >> The schema-checking code we added in Lucene 9 does not allow this: it
> > > >> enforces that all documents having a field should have the same "index
> > > >> options", and failing to index the postings gets interpreted as having
> > > >> index options = NONE (because of the presence of the doc values field
> > > >> of the same name, I think?)
> > > >>
> > > >> Our current solution is to also index the postings for the child
> > > >> document (but just with an empty string value). This seems gross, and
> > > >> creates postings in the index that we will never use.
> > > >>
> > > >> Another possibility would be to rename the fields so that the postings
> > > >> and docvalues fields have different names. But in this case our
> > > >> application-level schema diverges from our Lucene schema, adding a
> > > >> layer of complexity we'd rather not introduce.
> > > >>
> > > >> Finally, could we relax this constraint, always allowing index
> > > >> options=NONE regardless of how other docs are indexed? Would it cause
> > > >> problems?
> > > >>
> > > >> -Mike
> > > >>
> > > >>
> > > >> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > > >> For additional commands, e-mail: dev-help@lucene.apache.org
> > > >>
> > > >
> > > >
> > > > --
> > > > Adrien
>

-- 
Adrien



