
List:       drbd-user
Subject:    Re: [DRBD-user] Pacemaker cluster with DRBD on ESXI - Fencing on snapshot
From:       jota@disroot.org
Date:       2018-11-20 10:05:21
Message-ID: fb4a6d5d324bfab955f214c4d6c12b09@disroot.org

Hello Lars,

Many thanks for your help.
I have removed the servers from the RMC backup policy.
In the past we used DataProtector on an older vSphere 6.0, and the issue did not happen.
Since RMC backs up both the storage volume and the VM, I think the freeze time has increased.
I am building a TEST environment to reproduce the issue and solve it.

Regards,

Jota.

On 14 November 2018 at 13:33, "Lars Ellenberg" <lars.ellenberg@linbit.com> wrote:

> On Tue, Nov 13, 2018 at 12:28:54PM +0000, jota@disroot.org wrote:
> 
>> Hello all,
>> 
>> I am experiencing issues with a pacemaker cluster with drbd.
>> 
>> This is the environment:
>> 2 nodes (CentOS 7.5) - VMs on ESXI 6.5
>> pcs version 0.9.158
>> drbd 8.4
>> 
>> Every night, I have a backup scheduled with HPE RMC within vSphere. This
>> job takes a snapshot of the datastore volume containing the VMs and, at
>> the same time, through the vmware-tools, takes a snapshot of each VM.
>> This results in the master node being fenced every night. The master
>> node has its backup scheduled at 21:00. When it is fenced, the resources
>> move to the secondary node, which becomes primary. As the second node
>> has its backup scheduled at 22:00, it is fenced at that time too. Is it
>> possible (and safe) to increase some timeouts in order to avoid this?
> 
> This has nothing to do with DRBD.
> 
> But with your cluster manager.
> 
> Cluster membership has short timeouts on the "responsiveness"
> of the nodes. If it declares a node unresponsive,
> it has to kick that node out of the membership.
> 
> That is done by fencing.
> 
> Snapshots (and snapshot removals, rotating out old ones)
> tend to freeze IO, or even the whole VM.
> 
> If you freeze something that has a "real time"-dependent component,
> bad things will happen.
> 
> Yes, in a virtualized environment you should increase the "deadtime" or
> "token timeout" (or whatever your cluster manager of choice calls the
> concept) anyway; a few seconds should be ok, unless your
> infrastructure is heavily oversubscribed.
> (So, as an example for pacemaker / corosync in your case:
> corosync.conf, totem { token: 3000 }.)
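> 
> For reference, a minimal sketch of what the totem section could look
> like in /etc/corosync/corosync.conf (assuming corosync 2.x "key: value"
> syntax; the value here is only an example for this two-node setup):
> 
>     totem {
>         version: 2
>         # time in ms to wait for the token before declaring a node dead;
>         # the default is 1000, raising it tolerates short VM stalls
>         token: 3000
>     }
> 
> After changing it on both nodes and restarting corosync, the active
> value can be checked with:
> 
>     corosync-cmapctl | grep totem.token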
> 
> But I've seen these "stalls" take tens of seconds, sometimes up to minutes.
> You don't want that latency on your cluster membership.
> 
> So you want to tell your backups to *NOT* freeze the VMs.
> 
> If the whole thing is "crash safe", that is, it can recover from a
> hard crash in a single-VM, single-hypervisor setup, all is good:
> it can recover from such a non-frozen, snapshot-based backup as well.
> 
> If it is not "crash safe" in the above sense, then you cannot do
> failovers either, and need to go back to the drawing board anyway.
> 
> Alternatively, put your cluster in maintenance-mode,
> do what you think you have to do,
> and put it live again after that.
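> 
> With pcs, a sketch of that sequence could be (maintenance-mode is a
> standard pacemaker cluster property; the backup step itself is just a
> placeholder here):
> 
>     # tell pacemaker to stop managing resources and reacting to failures
>     pcs property set maintenance-mode=true
> 
>     # ... run the snapshot / backup job ...
> 
>     # resume normal cluster management
>     pcs property set maintenance-mode=false
> 
> The currently set properties can be listed with "pcs property show".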
> 
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> 
> DRBD ® and LINBIT ® are registered trademarks of LINBIT
> __
> please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
