
List:       cassandra-commits
Subject:    [jira] [Comment Edited] (CASSANDRA-13004) Corruption while adding a column to a table
From:       "Alex Petrov (JIRA)" <jira@apache.org>
Date:       2017-05-31 19:39:04
Message-ID: JIRA.13025830.1480991031000.335055.1496259544300@Atlassian.JIRA


    [ https://issues.apache.org/jira/browse/CASSANDRA-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031813#comment-16031813 ]

Alex Petrov edited comment on CASSANDRA-13004 at 5/31/17 7:38 PM:
------------------------------------------------------------------

I've been able to narrow it down a little bit and wanted to elaborate on why I've
decided to take some extra time and caution on this issue to avoid protocol changes.
Another concern is that if we fix the problem in {{SerializationHeader}} and not in
{{ColumnFilter}}, we may well end up with other issues in {{ColumnFilter}}, since it
would still have an inconsistent view of the columns on the replica side. To sum it
up, I do agree that the issue is related to the initial exchange of columns, but I
think the problem is hiding within {{ColumnFilter}}. Two problems, in fact.

It looks like {{ColumnFilter}} can be created with one set of metadata and afterwards
used with another. Since {{ColumnFilter}} is used for both serialisation and
deserialisation, the column set may change between creating the filter and
deserialising responses from the replica, so {{partitionColumns}} of {{CFMetaData}}
will be a different collection. This has a negative impact on deserialisation: the
binary data will be validated and processed against a different set of columns. In
some cases this results in corrupt flags, in others in an EOF, and the worst-case
scenario is when the data actually passes all validations and the mutation lands in
the sstable. When trying to read it, one column will contain data from another, and
that other column will in turn just contain less data. To solve this problem, it is
sufficient to cache the columns when the column filter is created.
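
To make the race concrete, here is a minimal sketch (with hypothetical names, {{TableMetadataSketch}} and friends, not the actual Cassandra classes) of a filter that holds a live metadata reference versus one that snapshots its columns at creation time:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for table metadata; a concurrent ALTER TABLE
// swaps in a new column collection.
class TableMetadataSketch
{
    private volatile List<String> partitionColumns;

    TableMetadataSketch(List<String> columns)
    {
        this.partitionColumns = columns;
    }

    List<String> partitionColumns()
    {
        return partitionColumns;
    }

    void alterTable(List<String> newColumns)
    {
        partitionColumns = newColumns;
    }
}

// A filter that snapshots its columns at creation time instead of
// re-reading the live metadata reference when it is used.
class ColumnFilterSketch
{
    private final TableMetadataSketch metadata;  // live reference, may change underneath us
    private final List<String> cachedColumns;    // snapshot taken at creation time (the fix)

    ColumnFilterSketch(TableMetadataSketch metadata)
    {
        this.metadata = metadata;
        this.cachedColumns = new ArrayList<>(metadata.partitionColumns());
    }

    List<String> columnsForDeserialisation()
    {
        // Buggy variant: return metadata.partitionColumns();
        // that re-read can see columns added after the read was issued.
        return cachedColumns;
    }
}
{code}

With the cached snapshot, a concurrent {{alterTable}} can no longer change how a response that was serialised against the old column set gets deserialised.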

The second problem is, as Jeff has noted, hiding in the fact that we cannot rely on
the output of subset deserialisation in the header
[here|https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/apache/cassandra/db/SerializationHeader.java#L401],
because it happens on the replica and, in the case of {{fetchesAll}}, relies on the
local replica metadata. I would suggest we resolve it a bit differently, though: by
changing the way we serialise the column filter, namely by not relying on the
synchronised metadata in the fetch-all case and encoding all the columns instead. If
that is considered too big an overhead (I can imagine column families with many
columns could generate quite some traffic), we could at least use a hash of the
concatenated column names.
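
For the cheaper variant, something along these lines would do (a sketch; the digest algorithm and class names are illustrative, not a committed design):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Sketch: instead of shipping every column name in the fetch-all case,
// the coordinator sends a digest of the concatenated names; the replica
// compares it against a digest computed from its own local metadata.
final class ColumnSetDigest
{
    static byte[] digest(List<String> columnNames) throws NoSuchAlgorithmException
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (String name : columnNames)
        {
            md.update(name.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator, so ("ab", "c") hashes differently from ("a", "bc")
        }
        return md.digest();
    }
}
{code}

A replica whose local digest does not match the one in the request would then know that its view of the columns has diverged from the coordinator's, instead of silently misinterpreting the payload.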

On trunk, problem (1) can be fixed more simply, by passing the command's cf reference
instead of grabbing the base cf reference, which is fetched later.
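
Reusing the sketch classes from the first example, the shape of that change would be roughly this (again hypothetical, simplified names):

{code:java}
// Sketch: the command captures the metadata reference it was created with
// and hands that same reference to the filter, instead of letting the
// filter re-fetch the (possibly newer) base reference later.
class ReadCommandSketch
{
    private final TableMetadataSketch metadata; // captured at command creation

    ReadCommandSketch(TableMetadataSketch metadata)
    {
        this.metadata = metadata;
    }

    ColumnFilterSketch makeFilter()
    {
        return new ColumnFilterSketch(metadata); // pass the command's reference onwards
    }
}
{code}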

I have a preliminary patch that addresses (1) and (2) for 3.0 and 3.11, and a
separate patch for trunk, but since the logic of {{ColumnFilter}} kind of gets out of
hand with all the excluding cases, I would like to take a bit more time to simplify
it. I hope I can submit it tomorrow.





> Corruption while adding a column to a table
> -------------------------------------------
> 
> Key: CASSANDRA-13004
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13004
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata
> Reporter: Stanislav Vishnevskiy
> Priority: Blocker
> Fix For: 3.0.x, 3.11.x, 4.x
> 
> 
> We had the following schema in production. 
> {code:none}
> CREATE TYPE IF NOT EXISTS discord_channels.channel_recipient (
>     nick text
> );
>
> CREATE TYPE IF NOT EXISTS discord_channels.channel_permission_overwrite (
>     id bigint,
>     type int,
>     allow_ int,
>     deny int
> );
>
> CREATE TABLE IF NOT EXISTS discord_channels.channels (
>     id bigint,
>     guild_id bigint,
>     type tinyint,
>     name text,
>     topic text,
>     position int,
>     owner_id bigint,
>     icon_hash text,
>     recipients map<bigint, frozen<channel_recipient>>,
>     permission_overwrites map<bigint, frozen<channel_permission_overwrite>>,
>     bitrate int,
>     user_limit int,
>     last_pin_timestamp timestamp,
>     last_message_id bigint,
>     PRIMARY KEY (id)
> );
> {code}
> And then we executed the following alter.
> {code:none}
> ALTER TABLE discord_channels.channels ADD application_id bigint;
> {code}
> And one row (that we can tell) got corrupted at the same time and could no longer be read from the Python driver.
> {code:none}
> [E 161206 01:56:58 geventreactor:141] Error decoding response from Cassandra. ver(4); flags(0000); stream(27); op(8); offset(9); len(887); buffer: '\x84\x00\x00\x1b\x08\x00\x00\x03w\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x0f\x00\x10discord_channels\x00\x08channels\x00\x02id\x00\x02\x00\x0eapplication_id\x00\x02\x00\x07bitrate\x00\t\x00\x08guild_id\x00\x02\x00\ticon_hash\x00\r\x00\x0flast_message_id\x00\x02\x00\x12last_pin_timestamp\x00\x0b\x00\x04name\x00\r\x00\x08owner_id\x00\x02\x00\x15permission_overwrites\x00!\x00\x02\x000\x00\x10discord_channels\x00\x1cchannel_permission_overwrite\x00\x04\x00\x02id\x00\x02\x00\x04type\x00\t\x00\x06allow_\x00\t\x00\x04deny\x00\t\x00\x08position\x00\t\x00\nrecipients\x00!\x00\x02\x000\x00\x10discord_channels\x00\x11channel_recipient\x00\x01\x00\x04nick\x00\r\x00\x05topic\x00\r\x00\x04type\x00\x14\x00\nuser_limit\x00\t\x00\x00\x00\x01\x00\x00\x00\x08\x03\x8a\x19\x8e\xf8\x82\x00\x01\xff\xff\xff\xff\x00\x00\x00\x04\x00\x00\xfa\x00\x00\x00\x00\x08\x00\x00\xfa\x00\x00\xf8G\xc5\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8b\xc0\xb5nB\x00\x02\x00\x00\x00\x08G\xc5\xffI\x98\xc4\xb4(\x00\x00\x00\x03\x8b\xc0\xa8\xff\xff\xff\xff\x00\x00\x01<\x00\x00\x00\x06\x00\x00\x00\x08\x03\x81L\xea\xfc\x82\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x81L\xea\xfc\x82\x00\n\x00\x00\x00\x04\x00\x00\x00\x01\x00\x00\x00\x04\x00\x00\x08\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x1e\xe6\x8b\x80\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x1e\xe6\x8b\x80\x00\n\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x040\x07\xf8Q\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x1f\x1b{\x82\x00\x00\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x1f\x1b{\x82\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x07\xf8Q\x00\x00\x00\x04\x10\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x1fH6\x82\x00\x01\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x1fH6\x82\x00\x01\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x05\xe8A\x00\x00\x00\x04\x10\x02\x00\x00\x00\x00\x00\x08\x03\x8a+=\xca\xc0\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x8a+=\xca\xc0\x00\n\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x08\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x03\x8a\x8f\x979\x80\x00\n\x00\x00\x00$\x00\x00\x00\x08\x03\x8a\x8f\x979\x80\x00\n\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x04\x00\x08\x01\x00\x00\x00\x04\xc4\xb4(\x00\xff\xff\xff\xff\x00\x00\x00O[f\x80Q\x07general\x05\xf8G\xc5\xffI\x98\xc4\xb4(\x00\xf8O[f\x80Q\x00\x00\x00\x02\x04\xf8O[f\x80Q\x00\xf8G\xc5\xffI\x98\x01\x00\x00\xf8O[f\x80Q\x00\x00\x00\x00\xf8G\xc5\xffI\x97\xc4\xb4(\x06\x00\xf8O\x7fe\x1fm\x08\x03\x00\x00\x00\x01\x00\x00\x00\x00\x04\x00\x00\x00\x00'
> {code}
> And then in cqlsh, when trying to read the row, we got this.
> {code:none}
> /usr/bin/cqlsh.py:632: DateOverFlowWarning: Some timestamps are larger than Python datetime can represent. Timestamps are displayed in milliseconds from epoch.
> Traceback (most recent call last):
>   File "/usr/bin/cqlsh.py", line 1301, in perform_simple_statement
>     result = future.result()
>   File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.5.0.post0-d8d0456.zip/cassandra-driver-3.5.0.post0-d8d0456/cassandra/cluster.py", line 3650, in result
>     raise self._final_exception
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 2: invalid start byte
> {code}
> We tried to read the data and it would refuse to read the name column (the UTF8
> error), and the last_pin_timestamp column had an absurdly large value. We ended up
> rewriting the whole row, as we had the data in another place, and that fixed the
> problem. However, there is clearly a race condition in the schema change sub-system.
> Any ideas?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


