List: linux-ha-dev
Subject: Re: [Linux-ha-dev] R: R: [PATCH] Filesystem RA:
From: "Darren Thompson (AkurIT)" <darrent@akurit.com.au>
Date: 2013-04-10 13:56:08
Message-ID: 3BEB2F53-80A2-4848-905E-6D9F518D30F4@akurit.com.au
Hi G.
I personally recommend, as a minimum, that you set up an SBD partition and use SBD
STONITH. It protects against file/database corruption in the event of an issue on
the underlying storage.
Hardware (power) STONITH is considered the "best" protection, but I have had clusters
running for years using just SBD STONITH, and I would not deploy a cluster-managed
file system without it.
You should also strongly consider setting "fence on stop failure" for the same
reason. The worst possible corruption can be caused by the cluster suffering a "split
brain" due to a partially dismounted file system while another node mounts and
writes to it at the same time.
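A minimal crm-shell sketch of the SBD part of that advice; the shared LUN path below is a placeholder, not anything from this thread:

```shell
# A sketch, not a drop-in config: the SBD device path is a placeholder.
sbd -d /dev/disk/by-id/shared-lun-part1 create        # initialise the SBD slot table
crm configure primitive stonith-sbd stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/shared-lun-part1"
crm configure property stonith-enabled=true
```

The sbd daemon also has to be running on every node before the STONITH resource is usable.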
Regards
D.
On 10/04/2013, at 5:30 PM, "Guglielmo Abbruzzese" <g.abbruzzese@resi.it> wrote:
> Hi Darren,
> I am aware STONITH could help, but unfortunately I cannot add such a device to the
> architecture at the moment. Furthermore, Sybase seems to be stopped already (the
> start/stop order should already be guaranteed by the Resource Group structure):
> Resource Group: grp-sdg
> resource_vrt_ip (ocf::heartbeat:IPaddr2): Started NODE_A
> resource_lvm (ocf::heartbeat:LVM): Started NODE_A
> resource_lvmdir (ocf::heartbeat:Filesystem): failed (and so unmanaged)
> resource_sybase (lsb:sybase): stopped
> resource_httpd (lsb:httpd): stopped
> resource_tomcatd (lsb:tomcatd): stopped
> resource_sdgd (lsb:sdgd): stopped
> resource_statd (lsb:statistiched): stopped
>
> I'm just guessing: why did the same configuration fail over fine with the previous
> storage? The only difference could be the changed multipath configuration.
> Thanks a lot,
> G.
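A generic diagnostic sketch (not from the thread) for comparing the multipath behaviour against the old storage; indefinite I/O queueing when all paths fail would explain a stop that never completes:

```shell
# Hypothetical diagnostic: inspect the active multipath topology and
# queueing policy. "queue_if_no_path" (or "no_path_retry queue") makes
# I/O block forever when every path is lost, hanging any umount attempt.
multipath -ll                                    # current path groups and features
multipathd show config | grep -E 'no_path_retry|queue'
```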
>
>
> -----Original Message-----
> From: linux-ha-dev-bounces@lists.linux-ha.org
> [mailto:linux-ha-dev-bounces@lists.linux-ha.org] On behalf of Darren Thompson
> (AkurIT)
> Sent: Tuesday, 9 April 2013 23:35
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] R: [PATCH] Filesystem RA:
>
> Hi
>
> The correct way for that to have been handled, given your additional detail, would
> have been for the node to have received a STONITH.
> Things that you should check:
> 1. The STONITH device is configured correctly and operational.
> 2. The "on-fail" for any filesystem cluster resource's stop operation should be "fence".
> 3. Review your constraints to ensure that the order and relationship
> between Sybase and the filesystem resource are correct, so that Sybase is
> stopped first.
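The ordering point could be expressed as an explicit constraint like the following (crm shell syntax, resource names taken from the configuration quoted earlier in the thread; a sketch, not the poster's actual CIB):

```shell
# Hypothetical sketch: start the filesystem before Sybase, which also forces
# Sybase to stop before the filesystem is unmounted. Note that resources in
# a group are already ordered this way, so an explicit constraint is only
# needed if the resources are taken out of the group.
crm configure order ord-fs-before-sybase inf: resource_lvmdir resource_sybase
```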
> Hope this helps
>
> Darren
>
>
> Sent from my iPhone
>
> On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese" <g.abbruzzese@resi.it> wrote:
>
> > Hi everybody,
> > In my case (very similar to Junko's), when I disconnect the Fibre
> > Channel, the "try_umount" procedure in the Filesystem RA script doesn't work.
> >
> > After the programmed attempts, the active/passive cluster doesn't fail
> > over, and the lvmdir resource is flagged as "failed" rather than "stopped".
> >
> > I must say, even if I try to umount the /storage filesystem manually, it
> > doesn't work because Sybase is using some files stored on it
> > (device busy); this is why the RA cannot complete the operation
> > cleanly. Is there a way to force the failover anyway?
> >
> > Some things I have already tried:
> > 1) This very test with a different optical SAN/storage in the past,
> > where the RA could always umount the storage correctly;
> > 2) I modified the RA to force the "umount -l" option even though I have
> > an ext4 filesystem rather than NFS;
> > 3) I killed the hung processes with the command "fuser -km /storage",
> > but the umount always failed, and after a while I got a
> > kernel panic.
> >
> > Is there a way to force the failover anyway, even if the umount is not clean?
> > Any suggestion?
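For a busy mount like this, it may help to see exactly which processes pin the filesystem before forcing anything (a generic diagnostic sketch, not from the thread):

```shell
# Hypothetical diagnostic: list the processes holding /storage open before
# attempting a forced unmount. Note that a lazy unmount (-l) only detaches
# the mount point while I/O stays in flight, which can hide the problem
# rather than fix it.
fuser -vm /storage       # verbose: PIDs, users and access type per process
lsof +f -- /storage      # open files backed by this mount
```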
> >
> > Thanks for your time,
> > Regards
> > Guglielmo
> >
> > P.S. lvmdir resource configuration
> >
> > <primitive class="ocf" id="resource_lvmdir" provider="heartbeat"
> > type="Filesystem">
> > <instance_attributes id="resource_lvmdir-instance_attributes">
> > <nvpair id="resource_lvmdir-instance_attributes-device"
> > name="device" value="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM"/>
> > <nvpair id="resource_lvmdir-instance_attributes-directory"
> > name="directory" value="/storage"/>
> > <nvpair id="resource_lvmdir-instance_attributes-fstype"
> > name="fstype" value="ext4"/>
> > </instance_attributes>
> > <meta_attributes id="resource_lvmdir-meta_attributes">
> > <nvpair id="resource_lvmdir-meta_attributes-multiple-active"
> > name="multiple-active" value="stop_start"/>
> > <nvpair id="resource_lvmdir-meta_attributes-migration-threshold"
> > name="migration-threshold" value="1"/>
> > <nvpair id="resource_lvmdir-meta_attributes-failure-timeout"
> > name="failure-timeout" value="0"/>
> > </meta_attributes>
> > <operations>
> > <op enabled="true" id="resource_lvmdir-startup" interval="60s"
> > name="monitor" on-fail="restart" requires="nothing" timeout="40s"/>
> > <op id="resource_lvmdir-start-0" interval="0" name="start"
> > on-fail="restart" requires="nothing" timeout="180s"/>
> > <op id="resource_lvmdir-stop-0" interval="0" name="stop"
> > on-fail="restart" requires="nothing" timeout="180s"/>
> > </operations>
> > </primitive>
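Combined with the advice upthread, the same primitive with a stop failure escalated to fencing would look roughly like this in crm shell syntax (a sketch; it presumes a working STONITH device):

```shell
# Same resource as the XML above, but with on-fail="fence" on the stop op,
# so a node that cannot unmount /storage is fenced instead of leaving the
# resource in a failed/unmanaged state.
crm configure primitive resource_lvmdir ocf:heartbeat:Filesystem \
    params device="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM" \
           directory="/storage" fstype="ext4" \
    op monitor interval=60s timeout=40s \
    op start timeout=180s interval=0 \
    op stop timeout=180s interval=0 on-fail=fence
```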
> >
> > 2012/5/9 Junko IKEDA <tsukishima.ha@gmail.com>:
> > > Hi,
> > >
> > > In my case, the umount succeeds when the Fibre Channel is
> > > disconnected, so it seemed that the handling of the status file caused a
> > > longer failover, as Dejan said.
> > > If the umount fails, it will run into a timeout and might trigger a stonith
> > > action; that case also makes sense (though I couldn't see it here).
> > >
> > > I tried the following setup;
> > >
> > > (1) timeout : multipath > RA
> > > multipath timeout = 120s
> > > Filesystem RA stop timeout = 60s
> > >
> > > (2) timeout : multipath < RA
> > > multipath timeout = 60s
> > > Filesystem RA stop timeout = 120s
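The relationship between the two timeouts can be checked mechanically; a minimal sketch using the values from the two cases above:

```shell
# Sketch of the timeout relationship tested here: if the multipath layer
# gives up queueing I/O before the Filesystem RA's stop timeout expires
# (case 2), the stop operation has a chance to finish; otherwise (case 1)
# the stop times out while I/O is still hanging.
multipath_timeout=60   # seconds until multipath stops queueing I/O (case 2)
ra_stop_timeout=120    # Filesystem RA "stop" op timeout in seconds (case 2)

if [ "$ra_stop_timeout" -gt "$multipath_timeout" ]; then
    echo "stop timeout exceeds multipath timeout: stop can complete"
else
    echo "stop may time out first: expect Filesystem_stop() to fail"
fi
```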
> > >
> > > In case (1), Filesystem_stop() fails: the hanging FC causes the stop timeout.
> > >
> > > In case (2), Filesystem_stop() succeeds.
> > > The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
> > > The status file is in fact not removed, though; it remains on the
> > > filesystem.
> > >
> > > > > 758 if [ -f "$STATUSFILE" ]; then
> > > > > 759 rm -f ${STATUSFILE}
> > > > > 760 if [ $? -ne 0 ]; then
> > >
> > > So line 761 might not be called as expected.
> > >
> > > > > 761 ocf_log warn "Failed to remove status file ${STATUSFILE}."
> > >
> > >
> > > By the way, my concern is the unexpected stop timeout and the longer
> > > failover time. If OCF_CHECK_LEVEL is set to 20, it would be better
> > > to try to remove the status file just in case.
> > > That would handle case (2) if the user wants to recover from this case
> > > with STONITH.
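That suggestion amounts to guarding the status-file removal on the check level; a sketch of the logic in the RA's shell style (where exactly the guard would go is an assumption, not an actual patch from this thread):

```shell
# Sketch: only touch the status file when a depth-20 monitor created one.
# On a hung filesystem even a successful-looking "rm -f" (rc=0) may leave
# the file behind, so the result is logged but never treated as fatal.
if [ "${OCF_CHECK_LEVEL:-0}" -eq 20 ] && [ -f "$STATUSFILE" ]; then
    if ! rm -f "$STATUSFILE"; then
        ocf_log warn "Failed to remove status file ${STATUSFILE}."
    fi
fi
```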
> > >
> > >
> > > Thanks,
> > > Junko
> > >
> > > 2012/5/8 Dejan Muhamedagic <dejan@suse.de>:
> > > > Hi Lars,
> > > >
> > > > On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
> > > > > On 2012-05-08T12:08:27, Dejan Muhamedagic <dejan@suse.de> wrote:
> > > > >
> > > > > > > In the default case (without OCF_CHECK_LEVEL), it's enough to try to
> > > > > > > unmount the file system, isn't it?
> > > > > > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
> > > > > >
> > > > > > I don't see a need to remove the STATUSFILE at all, as that may
> > > > > > (as you observed) prevent the filesystem from stopping.
> > > > > > Perhaps we should skip it altogether? If nobody objects, let's just remove
> > > > > > this code:
> > > > > >
> > > > > > 758 if [ -f "$STATUSFILE" ]; then
> > > > > > 759 rm -f ${STATUSFILE}
> > > > > > 760 if [ $? -ne 0 ]; then
> > > > > > 761 ocf_log warn "Failed to remove status file ${STATUSFILE}."
> > > > > > 762 fi
> > > > > > 763 fi
> > > > >
> > > > > That would mean you can no longer differentiate between a "crash"
> > > > > and a clean unmount.
> > > >
> > > > One could take a look at the logs. I guess that a crash would
> > > > otherwise be noticeable as well :)
> > > >
> > > > > A hanging FC/SAN is likely to be unable to flush any other dirty
> > > > > buffers as well, so the umount may not necessarily succeed w/o
> > > > > errors. I think it's unreasonable to expect that the node will
> > > > > survive such a scenario w/o recovery.
> > > >
> > > > True. However, in case of network attached storage or other
> > > > transient errors it may lead to an unnecessary timeout followed by
> > > > fencing, i.e. the chance for a longer failover time is higher.
> > > > Just leaving a file around may not justify the risk.
> > > >
> > > > Junko-san, what was your experience?
> > > >
> > > > Cheers,
> > > >
> > > > Dejan
> > > >
> > > > > Regards,
> > > > > Lars
> > > > >
> > > > > --
> > > > > Architect Storage/HA
> > > > > SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> > > > > Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name
> > > > > everyone gives to their mistakes." -- Oscar Wilde
> > > > >
> > > > > _______________________________________________________
> > > > > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > > > > Home Page: http://linux-ha.org/
> >