[prev in list] [next in list] [prev in thread] [next in thread] 

List:       cassandra-dev
Subject:    Re: [DISCUSS] NULL handling and the unfrozen collection issue
From:       Benjamin Lerer <b.lerer () gmail ! com>
Date:       2024-05-16 14:33:54
Message-ID: CAAy+TbQ0wQRjPC2MAs5jsjoymJk0egtMp_aBEoA+wFBqbAdYqg () mail ! gmail ! com
[Download RAW message or body]

I found some other confusing behavior in LWT around null value and empty
multicell collection. I opened CASSANDRA-19637
<https://issues.apache.org/jira/browse/CASSANDRA-19637>for those interested.

Le jeu. 4 avr. 2024 Ã  18:34, Caleb Rackliffe <calebrackliffe@gmail.com> a
écrit :

> The easiest way to check out how Accord uses IS NULL and IS NOT NULL is to
> look at the examples in the cep-15-accord branch:
> 
> 
> https://github.com/apache/cassandra/blob/cep-15-accord/test/distributed/org/apache/cassandra/distributed/test/accord/AccordCQLTestBase.java
>  
> tl;dr We did indeed try to go with an approach that more closely matches
> SQL, although there may still be some edges we didn't test.
> 
> I'd have no problem w/ moving to 3-value logic everywhere, I guess, but
> "everywhere" could just mean filtering queries and Accord. (i.e. If we want
> to deprecate LWT col = NULL syntax, do we really want people rewriting
> those LWTs...or just moving to the new Accord syntax, which obviously
> supports it? We should "spend" our user query rewrite budget wisely.)
> 
> On Thu, Apr 4, 2024 at 4:53 AM Benjamin Lerer <blerer@apache.org> wrote:
> 
> > Now, in Cassandra setting a column to null means deleting it and if *all*
> > > columns in a row are null the row is deleted. This might be another edge
> > > case...
> > 
> > 
> > It is slightly more complicated than that as the primary key columns
> > count in (*all* columns)
> > 
> > For example if you have the following table: CREATE TABLE tlb (pk int, c
> > int, v int, PRIMARY KEY (pk, c)) and the following row: INSERT INTO tlb
> > (pk, c, v) VALUES (1, 1, 1)
> > deleting the column v (DELETE v FROM %s WHERE pk = 1 AND c = 1) will not
> > delete the row as the primary key columns have a timestamp and therefore do
> > exist. So the row will still exist with a null value for column v.
> > 
> > If the row was created through an UPDATE (UPDATE tlb SET v = 1 WHERE pk =
> > 1 AND c = 1) things will be different as an UPDATE statement do not create
> > a timestamp for the clustering columns. By consequence, if V is deleted
> > (set to null) the row will not have any columns anymore and will be deleted.
> > 
> > The issue here is that in practice we never consider partition keys or
> > clustering columns as null if the database returns a row for it. Whether
> > the primary key columns have a timestamp or not.
> > I believe that we should ignore that fact too as far as IS NULL/IS NOT
> > NULL are concerned. If a row exists, its primary columns should be
> > considered as not null. Otherwise we are getting on a really slippery
> > slope. The INSERT/UPDATE logic is confusing enough in my opinion without
> > adding another layer to it.
> > 
> > One other issue that we have though is that the code for != LWT does not
> > work with the Three-Valued logic. If you have: [...] IF col != 1  and col
> > is null then in the TVL the value should be UNKNOWN therefore the condition
> > should not match.
> > It feels to me that we should probably keep that behavior for backward
> > compatibility reasons but probably change the behavior in Accord if it is
> > not already done.
> > 
> > 
> > 
> > Le jeu. 21 mars 2024 Ã  01:10, German Eichberger via dev <
> > dev@cassandra.apache.org> a écrit :
> > 
> > > Hi,
> > > 
> > > +1 I like doing it the SQL way. This makes sense to me.
> > > 
> > > Now, in Cassandra setting a column to null means deleting it and if
> > > *all*​ columns in a row are null the row is deleted. This might be
> > > another edge case...
> > > 
> > > German
> > > ------------------------------
> > > *From:* Benjamin Lerer <b.lerer@gmail.com>
> > > *Sent:* Wednesday, March 20, 2024 9:15 AM
> > > *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
> > > *Subject:* [EXTERNAL] [DISCUSS] NULL handling and the unfrozen
> > > collection issue
> > > 
> > > You don't often get email from b.lerer@gmail.com. Learn why this is
> > > important <https://aka.ms/LearnAboutSenderIdentification>
> > > Hi everybody,
> > > 
> > > CEP-29 (CQL NOT Operator) is hitting the grey area of how we want as a
> > > community to handle NULL including for things like unfrozen (multi-cell)
> > > collections and I would like to make a proposal for moving forward with
> > > NULL related issues.
> > > 
> > > We have currently 2 tickets open about NULL handling (I might have
> > > missed others):
> > > 
> > > 1. CASSANDRA-10715
> > > <https://issues.apache.org/jira/browse/CASSANDRA-10715>: Allowing
> > > Filtering on NULL
> > > 2. CASSANDRA-17762
> > > <https://issues.apache.org/jira/browse/CASSANDRA-17762>: LWT IF col
> > > = NULL is inconsistent with SQL NULL
> > > 
> > > We also had previously some discussion on which we touched the subject:
> > > 
> > > - [DISCUSS] LWT UPDATE semantics with + and - when null
> > > - CEP-15 multi key transaction syntax
> > > 
> > > In all those tickets and discussions the consensus was to have a
> > > behavior similar to SQL.
> > > 
> > > For null comparisons, SQL uses the three-value logic (
> > > https://modern-sql.com/concept/three-valued-logic) introducing the need
> > > for IS NULL and IS NOT NULL operators. Those conflict with the col = NULL
> > > predicate supported in LWT conditions (CASSANDRA-17762
> > > <https://issues.apache.org/jira/browse/CASSANDRA-17762>).
> > > 
> > > So far, as Cassandra was only using inclusive operators, comparisons
> > > were behaving in an expected way. According to three-valued logic NULL
> > > CONTAINS 'foo' should return UNKNOWN and the filtering behavior should
> > > exclude everything which is not true.Therefore the row should not be
> > > returned as expected. With exclusive operators things are more tricky.
> > > NULL NOT CONTAINS 'foo' will also return UNKNOWN causing the row to not be
> > > returned which might not match people's expectations.
> > > This behavior can be even more confusing once you take into account
> > > empty and null collections. NOT CONTAINS on an empty collection will return
> > > true while it will return UNKNOWN on a NULL collection. Unfortunately, for
> > > unfrozen (multicell) collections we are unable to differentiate between an
> > > empty and null collection and therefore always treat empty collections as
> > > NULL.
> > > For predicates such as map[myKey] != 'foo' when myKey is not present the
> > > result can also be surprising as it will end up comparing NULL to 'foo'
> > > returning once more UNKNOWN and ignoring the row.
> > > In order to respect the SQL three-valued logic and be able to allow the
> > > user to fetch all the rows which do not contains a specific value we would
> > > need support IS NULL, IS NOT NULL and OR to allow query like:
> > > WHERE c IS NULL OR c NOT CONTAINS 'foo' / WHERE m IS NULL OR m[myKey] !=
> > > foo
> > > 
> > > Supporting the three-valued logic makes sense to me even if some
> > > behavior might end up being confusing. In which case we can easily fix
> > > CASSANDRA-10715 and deprectate support for col = NULL/col != NULL in LWT.
> > > 
> > > What is people's opinion? Should we go for the three-valued logic
> > > everywhere? Should we try something else?
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 


[Attachment #3 (text/html)]

<div dir="ltr">I found some other confusing behavior in LWT around null value and \
empty multicell collection. I opened  <a class="gmail-issue-link" \
href="https://issues.apache.org/jira/browse/CASSANDRA-19637" id="gmail-key-val" \
rel="13579539">CASSANDRA-19637 </a>for those interested.<br></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">Le  jeu. 4 avr. 2024 Ã   18:34, \
Caleb Rackliffe &lt;<a \
href="mailto:calebrackliffe@gmail.com">calebrackliffe@gmail.com</a>&gt; a écrit  \
:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">The \
easiest way to check out how Accord uses IS NULL and IS NOT NULL is to look at the \
examples in the cep-15-accord branch:<div><br></div><div><a \
href="https://github.com/apache/cassandra/blob/cep-15-accord/test/distributed/org/apache/cassandra/distributed/test/accord/AccordCQLTestBase.java" \
target="_blank">https://github.com/apache/cassandra/blob/cep-15-accord/test/distribute \
d/org/apache/cassandra/distributed/test/accord/AccordCQLTestBase.java</a><br></div><div><br></div><div>tl;dr \
We did indeed try to go with an approach that more closely matches SQL, although \
there may still be some edges we didn&#39;t test.</div><div><br></div><div>I&#39;d \
have no problem w/ moving to 3-value logic everywhere, I guess, but \
&quot;everywhere&quot; could just mean filtering queries and Accord. (i.e. If we want \
to deprecate LWT <font face="monospace">col = NULL</font> syntax, do we really want \
people rewriting those LWTs...or just moving to the new Accord syntax, which \
obviously supports it? We should &quot;spend&quot; our user query rewrite budget \
wisely.)</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On \
Thu, Apr 4, 2024 at 4:53 AM Benjamin Lerer &lt;<a href="mailto:blerer@apache.org" \
target="_blank">blerer@apache.org</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr"><blockquote class="gmail_quote" \
style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"> Now, in Cassandra setting a column to null means \
deleting it and if <b>all</b> columns in a row are null the row is deleted. This \
might be another edge case...

</blockquote><div><br></div><div>It is slightly more complicated than that as the \
primary key columns count in (<b>all</b> columns)<br></div><div><br></div><div>For \
example if you have the following table: CREATE TABLE tlb (pk int, c int, v int, \
PRIMARY KEY (pk, c)) and the following row:  INSERT INTO tlb (pk, c, v) VALUES (1, 1, \
1)</div><div>deleting the column v (DELETE v FROM %s WHERE pk = 1 AND c = 1) will not \
delete the row as the primary key columns have a timestamp and therefore do exist. So \
the row will still exist with a null value for column v.</div><div><br></div><div>If \
the row was created through an UPDATE (UPDATE tlb SET v = 1 WHERE pk = 1 AND c = 1) \
things will be different as an UPDATE statement do not create a timestamp for the \
clustering columns. By consequence, if V is deleted (set to null) the row will not \
have any columns anymore and will be deleted.</div><div><br></div><div>The issue here \
is that in practice we never consider partition keys or clustering columns as null if \
the database returns a row for it. Whether the primary key columns have a timestamp \
or not.</div><div>I believe that we should ignore that fact too as far as IS NULL/IS \
NOT NULL are concerned. If a row exists, its primary columns should be considered as \
not null. Otherwise we are getting on a really slippery slope. The INSERT/UPDATE \
logic is confusing enough in my opinion without adding another layer to \
it.</div><div><br></div><div>One other issue that we have though is that the code for \
!= LWT does not work with the Three-Valued logic. If you have: [...] IF col != 1   \
and col is null then in the TVL the value should be UNKNOWN therefore the condition \
should not match.</div><div>It feels to me that we should probably keep that behavior \
for backward compatibility reasons but probably change the behavior in Accord if it \
is not already done.       <br></div><div><br></div><div>        </div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">Le  jeu. 21 mars 2024 Ã   \
01:10, German Eichberger via dev &lt;<a href="mailto:dev@cassandra.apache.org" \
target="_blank">dev@cassandra.apache.org</a>&gt; a écrit  :<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div>




<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
 Hi,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
 <br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
 +1 I like doing it the SQL way. This makes sense to me.  </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
 <br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
 Now, in Cassandra setting a column to null means deleting it and if <b>all</b>​ \
columns in a row are null the row is deleted. This might be another edge \
case...</div> <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
 <br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:11pt;color:rgb(0,0,0)">
 German</div>
<div id="m_-8189663952534080359m_-8200592184892021744m_-5346449913160367711appendonsend"></div>
 <hr style="display:inline-block;width:98%">
<div id="m_-8189663952534080359m_-8200592184892021744m_-5346449913160367711divRplyFwdMsg" \
dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" \
color="#000000"><b>From:</b> Benjamin Lerer &lt;<a href="mailto:b.lerer@gmail.com" \
target="_blank">b.lerer@gmail.com</a>&gt;<br> <b>Sent:</b> Wednesday, March 20, 2024 \
9:15 AM<br> <b>To:</b> <a href="mailto:dev@cassandra.apache.org" \
target="_blank">dev@cassandra.apache.org</a> &lt;<a \
href="mailto:dev@cassandra.apache.org" \
target="_blank">dev@cassandra.apache.org</a>&gt;<br> <b>Subject:</b> [EXTERNAL] \
[DISCUSS] NULL handling and the unfrozen collection issue</font> <div>  </div>
</div>
<div>
<table border="0" cellspacing="0" cellpadding="0" width="100%" align="left" \
style="border:0px;display:table;width:100%;table-layout:fixed;float:none"> <tbody \
style="display:block"> <tr>
<td valign="middle" width="1px" bgcolor="#A6A6A6" cellpadding="7px 2px 7px 2px" \
style="padding:7px 2px;background-color:rgb(166,166,166)"> </td>
<td valign="middle" width="100%" bgcolor="#EAEAEA" cellpadding="7px 5px 7px 15px" \
color="#212121" style="width:100%;background-color:rgb(234,234,234);padding:7px 5px \
7px 15px;font-family:wf_segoe-ui_normal,&quot;Segoe UI&quot;,&quot;Segoe \
WP&quot;,Tahoma,Arial,sans-serif;font-size:12px;font-weight:normal;color:rgb(33,33,33);text-align:left">
 <div>You don&#39;t often get email from <a href="mailto:b.lerer@gmail.com" \
target="_blank">b.lerer@gmail.com</a>. <a \
href="https://aka.ms/LearnAboutSenderIdentification" target="_blank"> Learn why this \
is important</a></div> </td>
<td valign="middle" align="left" width="75px" bgcolor="#EAEAEA" cellpadding="7px 5px \
7px 5px" color="#212121" \
style="width:75px;background-color:rgb(234,234,234);padding:7px \
5px;font-family:wf_segoe-ui_normal,&quot;Segoe UI&quot;,&quot;Segoe \
WP&quot;,Tahoma,Arial,sans-serif;font-size:12px;font-weight:normal;color:rgb(33,33,33);text-align:left">
 </td>
</tr>
</tbody>
</table>
<div>
<div dir="ltr">
<div>Hi everybody,</div>
<div><br>
</div>
<div>CEP-29 (CQL NOT Operator) is hitting the grey area of how we want as a community \
to handle NULL including for things like unfrozen (multi-cell) collections and I \
would like to make a proposal for moving forward with NULL related issues.<br> </div>
<div><br>
</div>
<div>We have currently 2 tickets open about NULL handling (I might have missed \
others):</div> <div>
<ol>
<li><a href="https://issues.apache.org/jira/browse/CASSANDRA-10715" \
id="m_-8189663952534080359m_-8200592184892021744m_-5346449913160367711x_gmail-key-val" \
rel="12913462" target="_blank">CASSANDRA-10715</a>:  Allowing Filtering on \
NULL</li><li><a href="https://issues.apache.org/jira/browse/CASSANDRA-17762" \
id="m_-8189663952534080359m_-8200592184892021744m_-5346449913160367711x_gmail-key-val" \
rel="13472469" target="_blank">CASSANDRA-17762</a>:  LWT IF col = NULL is \
inconsistent with SQL NULL</li></ol> <div>We also had previously some discussion on \
which we touched the subject:<br> </div>
<ul>
<li>[DISCUSS] LWT UPDATE semantics with + and - when \
<span><span>null</span></span></li><li><font size="2">CEP-15 multi key transaction \
syntax</font> </li></ul> <div>In all those tickets and discussions the consensus was \
to have a behavior similar to SQL.</div> <div><br>
</div>
<div>For null comparisons, SQL uses the three-value logic (<a \
href="https://modern-sql.com/concept/three-valued-logic" \
target="_blank">https://modern-sql.com/concept/three-valued-logic</a>)  introducing \
the need for IS NULL and IS NOT NULL operators. Those conflict with the col = NULL \
predicate supported in LWT conditions (<a \
href="https://issues.apache.org/jira/browse/CASSANDRA-17762" \
id="m_-8189663952534080359m_-8200592184892021744m_-5346449913160367711x_gmail-key-val" \
rel="13472469" target="_blank">CASSANDRA-17762</a>).</div> <div><br>
</div>
<div>So far, as Cassandra was only using inclusive operators, comparisons were \
behaving in an expected way. According to three-valued logic NULL CONTAINS \
&#39;foo&#39; should return UNKNOWN and the <font size="2"><span \
style="color:rgb(29,28,29);font-style:normal;font-variant-ligatures:common-ligatures;f \
ont-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-ind \
ent:0px;text-transform:none;word-spacing:0px;background-color:rgb(255,255,255);text-de \
coration-style:initial;text-decoration-color:initial;display:inline;float:none;font-family:arial,sans-serif"><span></span>filtering
  behavior should exclude everything which is not<span>  </span></span><code \
style="box-sizing:inherit;font-variant-ligatures:none;word-break:normal;border-radius:3px;padding:2px \
3px 1px;line-height:1.50001;font-style:normal;font-variant-caps:normal;font-weight:400 \
;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacin \
g:0px;text-decoration-style:initial;text-decoration-color:initial;font-family:arial,sans-serif">true.</code><span \
style="color:rgb(29,28,29);font-style:normal;font-variant-ligatures:common-ligatures;f \
ont-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-ind \
ent:0px;text-transform:none;word-spacing:0px;background-color:rgb(255,255,255);text-de \
coration-style:initial;text-decoration-color:initial;display:inline;float:none;font-family:arial,sans-serif">Therefore
  the row should not be returned</span><font face="arial,sans-serif"> as expected. \
With exclusive operators things are more tricky. NULL NOT CONTAINS &#39;foo&#39; will \
also return UNKNOWN causing the row to not be returned which might not match \
people&#39;s expectations.</font></font></div> <div><font size="2"><font \
face="arial,sans-serif">This behavior can be even more confusing once you take into \
account empty and null collections. NOT CONTAINS on an empty collection will return \
true while it will return UNKNOWN on a NULL collection. Unfortunately,  for unfrozen \
(multicell) collections we are unable to differentiate between an empty and null \
collection and therefore always treat empty collections as NULL.</font></font></div> \
<div><font size="2"><font face="arial,sans-serif">For predicates such as map[myKey] \
!= &#39;foo&#39; when myKey is not present the result can also be surprising as it \
will end up comparing NULL to &#39;foo&#39; returning once more UNKNOWN and ignoring \
the row.</font></font></div> <div><font size="2"><font face="arial,sans-serif">In \
order to respect the SQL three-valued logic and be able to allow the user to fetch \
all the rows which do not contains a specific value we would need support IS NULL, IS \
NOT NULL and OR to allow query like:<br> </font></font></div>
<div><font size="2"><font face="arial,sans-serif">WHERE c IS NULL OR c NOT CONTAINS \
&#39;foo&#39; / WHERE m IS NULL OR m[myKey] != foo</font></font></div> <div><font \
size="2"><font face="arial,sans-serif"><br> </font></font></div>
<div><font size="2"><font face="arial,sans-serif">Supporting the three-valued logic \
makes sense to me even if some behavior might end up being confusing. In which case \
we can easily fix CASSANDRA-10715 and deprectate support for col = NULL/col != NULL \
in LWT.</font></font></div> <div><font size="2"><font face="arial,sans-serif"><br>
</font></font></div>
<div><font size="2"><font face="arial,sans-serif">What is people&#39;s opinion? \
Should we go for the three-valued logic everywhere? Should we try something else?   \
<br> </font></font></div>
<div><font size="2"><font face="arial,sans-serif">  <br>
</font></font></div>
<br>
<div><font size="2"><font face="arial,sans-serif">     <br>
</font></font></div>
<div><font size="2"><font face="arial,sans-serif">         </font><span \
style="color:rgb(29,28,29);font-style:normal;font-variant-ligatures:common-ligatures;f \
ont-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-ind \
ent:0px;text-transform:none;word-spacing:0px;background-color:rgb(255,255,255);text-de \
coration-style:initial;text-decoration-color:initial;display:inline;float:none;font-family:arial,sans-serif"><code> \
</code></span><span style="font-family:arial,sans-serif">  </span></font></div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>

</div></blockquote></div>
</blockquote></div>
</blockquote></div>



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic