
List:       cassandra-dev
Subject:    Re: [DISCUSS] Repair Improvement Proposal
From:       David Capwell <dcapwell () apple ! com ! INVALID>
Date:       2021-09-01 22:51:50
Message-ID: F6A4671E-4B3B-4548-BC7B-02A8D9721783 () apple ! com

Cool, moving this from the dev list to JIRA; I will start breaking down tasks and documenting my progress there

https://issues.apache.org/jira/browse/CASSANDRA-16909

> On Aug 27, 2021, at 1:21 PM, David Capwell <dcapwell@apple.com.INVALID> wrote:
> 
> Push vs. pull isn't too critical, but there is one edge case to consider: if we don't
> realize the participant got restarted, triggering validation again (the restart may
> have caused the validation process to end) could be a problem.
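One way to make that restart visible is sketched below; the class and the "generation" field are purely illustrative assumptions, not Cassandra internals. The participant reports a process generation (e.g. its start time) in every state reply, and the coordinator compares it against the generation it recorded at prepare time:

```java
// Sketch: detecting a restarted participant during a coordinator state poll.
// A mismatch between the generation reported now and the generation seen at
// prepare time means the participant restarted, so any in-flight validation
// on it must be treated as lost. Names here are hypothetical.
final class RestartDetector
{
    private final long generationAtPrepare;

    RestartDetector(long generationAtPrepare)
    {
        this.generationAtPrepare = generationAtPrepare;
    }

    boolean restarted(long reportedGeneration)
    {
        return reportedGeneration != generationAtPrepare;
    }
}
```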
> > On Aug 26, 2021, at 9:50 AM, Yifan Cai <yc25code@gmail.com> wrote:
> > 
> > > 2. Add retries to specific stages of coordination, such as prepare and
> > > validate. In order to do these retries we first need to know what the
> > > state is for the participant which has yet to reply...
> > 
> > 
> > If I understand it correctly, does it mean retries only happen in the
> > coordinator and the coordinator pulls the states of the participants
> > periodically?
> > If the handling of the requests in the participant is made to be idempotent
> > (which I think is required for retry anyway), pulling the state is
> > unnecessary. For example, the coordinator can just send the PrepareRequest
> > at regular intervals until it receives the PrepareResponse.
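That idempotent-retry idea can be sketched as follows; `PrepareHandler`, `handlePrepare`, and keying on a session id are illustrative assumptions for the sketch, not Cassandra's actual repair API. The participant remembers each session's response, so a retried PrepareRequest gets the same answer as the first delivery:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of idempotent prepare handling on the participant: the response is
// computed once per session id, so duplicate deliveries of the same
// PrepareRequest (coordinator retries) always return the same result.
final class PrepareHandler
{
    private final Map<UUID, String> sessions = new ConcurrentHashMap<>();

    // Returns the same response whether this is the first delivery or a retry.
    String handlePrepare(UUID sessionId)
    {
        return sessions.computeIfAbsent(sessionId, id -> "PREPARED:" + id);
    }
}
```

With handling made idempotent like this, the coordinator can simply resend on a fixed interval until a PrepareResponse arrives, and no state pulling is needed for this particular stage.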
> > 
> > - Yifan
> > 
> > On Thu, Aug 26, 2021 at 8:56 AM Blake Eggleston
> > <beggleston@apple.com.invalid> wrote:
> > 
> > > +1 from me, any improvement in this area would be great.
> > > 
> > > It would be nice if this could include visibility into repair streams, but
> > > just exposing the repair state will be a big improvement.
> > > 
> > > > On Aug 25, 2021, at 5:46 PM, David Capwell <dcapwell@gmail.com> wrote:
> > > > 
> > > > Now that 4.0 is out, I want to bring up improving repair again (earlier
> > > > thread:
> > > > http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3CJIRA.13266448.1572997299000.99567.1572997440168@Atlassian.JIRA%3E ),
> > > > specifically the following two JIRAs:
> > > > 
> > > > 
> > > > CASSANDRA-15566 - Repair coordinator can hang under some cases
> > > > 
> > > > CASSANDRA-15399 - Add ability to track state in repair
> > > > 
> > > > 
> > > > Right now repair has an issue if any message is lost, which leads to hung
> > > > or timed-out repairs; in addition there is a large lack of visibility into
> > > > what is going on, which can be even harder if you wish to join coordinator
> > > > state with participant state.
> > > > 
> > > > 
> > > > I propose the following changes to improve our current repair subsystem:
> > > > 
> > > > 
> > > > 
> > > > 1. New tracking system for coordinator and participants (covered by
> > > > CASSANDRA-15399). This system will expose progress on each instance and
> > > > expose this information for internal access as well as external users
> > > > 2. Add retries to specific stages of coordination, such as prepare and
> > > > validate. In order to do these retries we first need to know what the
> > > > state is for the participant which has yet to reply; this will leverage
> > > > CASSANDRA-15399 to see what's going on (has the prepare been seen? Is
> > > > validation running? Did it complete?). In addition to checking the
> > > > state, we will need to store the validation MerkleTree; this allows the
> > > > coordinator to fetch it if it goes missing (it can be dropped en route to
> > > > the coordinator, or even on the coordinator).
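The tracking proposed in item 1 could look roughly like the state machine below; the class name, state names, and transition rules are a hedged sketch to illustrate the idea, not the design for CASSANDRA-15399:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative participant-side state machine for repair progress tracking.
// The coordinator can query the current state to decide whether a retry is
// needed, and only legal transitions are permitted so the recorded state
// stays trustworthy. All names here are assumptions for the sketch.
final class RepairParticipantState
{
    enum State { INIT, PREPARED, VALIDATING, VALIDATION_COMPLETE, FAILED }

    private static final Map<State, Set<State>> LEGAL = new EnumMap<>(State.class);
    static
    {
        LEGAL.put(State.INIT, EnumSet.of(State.PREPARED, State.FAILED));
        LEGAL.put(State.PREPARED, EnumSet.of(State.VALIDATING, State.FAILED));
        LEGAL.put(State.VALIDATING, EnumSet.of(State.VALIDATION_COMPLETE, State.FAILED));
        LEGAL.put(State.VALIDATION_COMPLETE, EnumSet.noneOf(State.class));
        LEGAL.put(State.FAILED, EnumSet.noneOf(State.class));
    }

    private State current = State.INIT;

    synchronized State current() { return current; }

    synchronized void transition(State next)
    {
        if (!LEGAL.get(current).contains(next))
            throw new IllegalStateException(current + " -> " + next);
        current = next;
    }
}
```

A coordinator polling this state can answer exactly the questions above (has the prepare been seen? is validation running? did it complete?) before deciding whether to retry a stage or re-fetch a stored MerkleTree.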
> > > > 
> > > > 
> > > > What is not in scope?
> > > > 
> > > > - Rewriting all of Repair; the idea is that specific "small" changes can
> > > > fix 80% of the issues
> > > > - Handling coordinator node failure. Being able to recover from a failed
> > > > coordinator should be possible after the above work is done, so this is
> > > > seen as tangential and can be done later
> > > > - Recovery from a downed participant. Similar to the previous bullet:
> > > > with the state being tracked, this acts as a kind of checkpoint, so future
> > > > work can come in to handle recovery
> > > > - Handling "too large" ranges. Ideally we should add an ability to split
> > > > the coordination into sub-repairs, but this is not the goal of this work.
> > > > - Overstreaming. This is a byproduct of the previous "not in scope"
> > > > bullet and/or large partitions, so is tangential to this work
> > > > 
> > > > 
> > > > Wanted to share here before starting this work again; let me know if there
> > > > are any concerns or feedback!
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > 
> > > 
> 
> 
> 




