
List:       cassandra-commits
Subject:    [jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replica
From:       "DOAN DuyHai (JIRA)" <jira () apache ! org>
Date:       2017-09-30 9:17:04
Message-ID: JIRA.13063582.1492021729000.244617.1506763024787 () Atlassian ! JIRA


    [ https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186997#comment-16186997 ]

DOAN DuyHai commented on CASSANDRA-13442:
-----------------------------------------

So by flagging 2 replicas as fully repaired and usable for consistent reads, and one as transient, we are indeed introducing *asymmetry* between the replicas themselves.

For the following reasoning, suppose RF=3 with 2 _consistent_ replicas and 1 _transient_ replica.
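
Just to fix ideas, using the {{DC="3-1"}} notation sketched in the ticket description (the exact syntax is an assumption at this point, and {{ks}} / {{DC1}} are placeholder names), such a topology could look like:

{code}
-- Hypothetical syntax following the ticket's "3-1" notation:
-- 3 replicas in total in DC1, of which 1 is transient,
-- i.e. 2 consistent replicas + 1 transient replica.
CREATE KEYSPACE ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3-1'};
{code}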

*I  Consistency*

Reasoning about CL from the client's point of view is no longer straightforward: the client would need to know which replica(s) are fully repaired and which are not.

Furthermore, we now need to keep metadata about which ranges of data are repaired and which are unrepaired:

- suppose a {{SELECT * FROM table WHERE partition = xxx AND clustering >= yyyy AND clustering <= zzz}}. What if some clustering ranges have been repaired and some others haven't? Do we query 3 replicas at QUORUM, or does the coordinator issue 2 queries: one against a single _consistent_ replica for the repaired range, and one at QUORUM of the 3 replicas for the unrepaired range (see the sketch after this list)?

- how do we reason about CLs like a global multi-DC QUORUM or EACH_QUORUM for reads?
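
To make the first point concrete, here is a sketch of the 2 sub-queries such a repair-aware coordinator (or client) would effectively have to issue. This is purely illustrative: the boundary {{wwww}} and the per-range consistency levels are assumptions, not something proposed in the ticket.

{code}
-- Assume the clustering range [yyyy, wwww] is known to be repaired
-- and (wwww, zzz] is still unrepaired (wwww is a made-up boundary).

-- Repaired sub-range: reading a single consistent replica is enough.
CONSISTENCY ONE;
SELECT * FROM table WHERE partition = xxx AND clustering >= yyyy AND clustering <= wwww;

-- Unrepaired sub-range: still needs a quorum of the 3 replicas,
-- transient replica included.
CONSISTENCY QUORUM;
SELECT * FROM table WHERE partition = xxx AND clustering > wwww AND clustering <= zzz;
{code}

Even in this simple single-partition case, somebody (the coordinator or the client) has to know where the repaired/unrepaired boundary sits, which is exactly the metadata burden mentioned above.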

*II Failure handling*

Replica failure is now *asymmetric* too for reads. A failed _transient_ replica is a no-brainer. A failed _consistent_ replica is a bigger issue. With classic replication, CL=ONE guarantees you remain highly available in the face of 2 failed replicas. Now, with 2 failed _consistent_ replicas, since the repaired data has been deleted on the _transient_ replica, the CL=ONE contract is broken for that data ...

*III Operations*

- how do we manage bootstrap of new node(s)? Are the new node(s) considered transient for all ranges until at least one round of repair has finished? Obviously, bootstrapping new node(s) by streaming repaired data from _consistent_ replicas makes more sense, but in that case the bootstrap process itself needs to be repair-aware

- how do we manage decommission? How do we redistribute the repaired and unrepaired ranges of data so that the remaining nodes in the cluster stay balanced (with regard to the _consistent_ and _transient_ replica states)?

- what about MV and secondary indexes? I know that MV is likely to be flagged as experimental, but unless we remove it completely from the code base, we should also think about how _transient_ replicas apply to MV at least.

- what about backup & restore? Any backup process would now also need to be repair-aware and know the location of the _consistent_ replicas

- for *existing* clusters, how do we transition from normal replication to this new replication scheme? And vice versa





> Support a means of strongly consistent highly available replication with tunable storage requirements
> -----------------------------------------------------------------------------------------------------
>  
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
> Issue Type: Improvement
> Components: Compaction, Coordination, Distributed Metadata, Local Write-Read Paths
> Reporter: Ariel Weisberg
> 
> Replication factors like RF=2 can't provide strong consistency and availability
> because if a single node is lost it's impossible to reach a quorum of replicas.
> Stepping up to RF=3 will allow you to lose a node and still achieve quorum for
> reads and writes, but requires committing additional storage. The requirement of a
> quorum for writes/reads doesn't seem to be something that can be relaxed without
> additional constraints on queries, but it seems like it should be possible to relax
> the requirement that 3 full copies of the entire data set are kept. What is
> actually required is a covering data set for the range and we should be able to
> achieve a covering data set and high availability without having three full copies.
> After a repair we know that some subset of the data set is fully replicated. At
> that point we don't have to read from a quorum of nodes for the repaired data. It
> is sufficient to read from a single node for the repaired data and a quorum of
> nodes for the unrepaired data. One way to exploit this would be to have N replicas,
> say the last N replicas (where N varies with RF) in the preference list, delete all
> repaired data after a repair completes. Subsequent quorum reads will be able to
> retrieve the repaired data from any of the two full replicas and the unrepaired
> data from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to
> { DC1="3-1", DC2="3-2" } where the first value is the replication factor used for
> consistency and the second value is the number of transient replicas. If you
> specify { DC1=3, DC2=3 } then the number of transient replicas defaults to 0 and
> you get the same behavior you have today.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


