List:       ceph-users
Subject:    [ceph-users] Re: Temporary shutdown of subcluster and cephfs
From:       Frank Schilder <frans () dtu ! dk>
Date:       2022-10-25 13:21:28
Message-ID: c42a60cb2e9e42589349c2af74d0cb4c () dtu ! dk

Hi Patrick.

> To be clear in case there is any confusion: once you do `fs fail`, the
> MDS are removed from the cluster and they will respawn. They are not
> given any time to flush remaining I/O.

This is fine; there is not enough time to flush anything anyway. As long as they leave the
metadata and data pools in a consistent state, that is, after an "fs set <fs_name>
joinable true" the MDSes start replaying the journal and the FS comes up healthy,
everything is fine. If user IO in flight gets lost in this process, that is not a problem.
A problem would be corruption of the file system itself.
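
For the record, the sequence I have in mind is roughly the following (just a sketch,
with <fs_name> standing in for the actual file system name):

    # Before the power cut: cut off all client IO and take the FS offline immediately.
    # "fs fail" fails all MDS ranks and marks the FS as not joinable.
    ceph fs fail <fs_name>

    # After power-up, once MONs, OSDs and MDS daemons are running again:
    ceph fs set <fs_name> joinable true

    # The standby MDSes should pick up the ranks, replay the journal and
    # eventually report the FS as healthy again.
    ceph fs status <fs_name>
    ceph status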

In my experience, an mds fail is a clean (non-destructive) operation; I have never had an
FS corruption due to an mds fail. As long as an "fs fail" is also non-destructive, it is
the best way I can see to cut off all user IO as fast as possible and bring all hardware
to rest. What I would like to avoid is a power loss on a busy cluster, where I would have
to rely on too many things being implemented correctly. With >800 disks you start seeing
unusual firmware failures, and disk failures after power-up are not uncommon either. I
just want to take as much as possible out of the "does this really work in all corner
cases" equation and rather rely on "I did this 100 times in the past without a problem"
situations.
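
To bring the OSD side to rest as well, I would probably also set the usual maintenance
flags before powering down and clear them again afterwards, along these lines (again
just a sketch of what I have in mind, not a tested procedure):

    # Keep recovery/rebalancing from kicking in while OSDs go down.
    ceph osd set noout
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set norebalance

    # Optionally stop all remaining client IO to RADOS entirely.
    ceph osd set pause

    # ... clean shutdown of OSD/MDS/MON hosts, power cut, power-up ...

    # Once the cluster is back and healthy, clear the flags in reverse order.
    ceph osd unset pause
    ceph osd unset norebalance
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset noout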

That users may have to repeat a task is not a problem. Damaging the file system itself,
on the other hand, is.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@redhat.com>
Sent: 25 October 2022 14:51:33
To: Frank Schilder
Cc: Dan van der Ster; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Temporary shutdown of subcluster and cephfs

On Tue, Oct 25, 2022 at 3:48 AM Frank Schilder <frans@dtu.dk> wrote:
> 
> Hi Patrick,
> 
> thanks for your answer. This is exactly the behaviour we need.
> 
> For future reference some more background:
> 
> We need to prepare a quite large installation for planned power outages. Even though
> they are called planned, we will not be able to handle these manually in good time for
> reasons irrelevant here. Our installation is protected by a UPS, but the guaranteed
> uptime on outage is only 6 minutes. So, we are talking more about transient protection
> than uninterrupted power supply. Although we have survived power outages of more than
> 20 minutes without loss of power to the DC, we need to plan with these 6 minutes.
> In these 6 minutes, we need to wait for at least 1-2 minutes to avoid unintended
> shut-downs. In the remaining 4 minutes, we need to take down a 500-node HPC cluster and
> a 1000 OSD + 12 MDS + 2 MON ceph sub-cluster. Part of this ceph cluster will continue
> running on another site with higher power redundancy. This gives maybe 1-2 minutes
> response time for the ceph cluster, and the best we can do is to try to achieve a
> "consistent at rest" state and hope we can cleanly power down the system before the
> power is cut.
> Why am I so concerned about a "consistent at rest" state?
> 
> It's because, while not all instances of a power loss lead to data loss, all instances
> of data loss I know of that were not caused by admin errors were caused by a power loss
> (see https://tracker.ceph.com/issues/46847). We were asked to prepare for a worst case
> of weekly power cuts, so there is no room for taking too many chances here. Our
> approach is: unmount as much as possible, quickly fail the FS to stop all remaining IO,
> give OSDs and MDSes a chance to flush pending operations to disk or journal, and then
> try a clean shutdown.

To be clear in case there is any confusion: once you do `fs fail`, the
MDS are removed from the cluster and they will respawn. They are not
given any time to flush remaining I/O.

FYI as this may interest you: we have a ticket to set a flag on the
file system to prevent new client mounts:
https://tracker.ceph.com/issues/57090

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io

