
List:       linux-ha-dev
Subject:    Re: [Linux-ha-dev] R:  [PATCH] Filesystem RA:
From:       "Darren Thompson (AkurIT)" <darrent () akurit ! com ! au>
Date:       2013-04-09 21:47:28
Message-ID: 626D4773-F87A-459F-8672-5FADCBC06D78 () akurit ! com ! au

Hi

The correct way for this to have been handled, given your additional detail, would have \
been for the node to receive a STONITH.

Things that you should check:
1. The STONITH device is configured correctly and operational.
2. The "on-fail" for the stop operation of any filesystem cluster resource should be "fence".
3. Review your constraints so that the ordering and colocation between SYBASE and the \
filesystem resource ensure that SYBASE is stopped first (see the sketch below).
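
For example, a rough sketch only, using the crm shell ("resource_sybase" is just a
placeholder for your actual Sybase resource ID):

  # fencing must be enabled and a working STONITH resource defined
  crm configure property stonith-enabled=true

  # in resource_lvmdir, the stop operation should escalate to fencing,
  # i.e. on-fail="fence" instead of on-fail="restart":
  #   op stop interval="0" timeout="180s" on-fail="fence"

  # order and colocate so that Sybase starts after the filesystem and,
  # with the default symmetrical ordering, is stopped before it is unmounted
  crm configure order ord_lvmdir_before_sybase inf: resource_lvmdir resource_sybase
  crm configure colocation col_sybase_with_lvmdir inf: resource_sybase resource_lvmdir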

Hope this helps

Darren 


Sent from my iPhone

On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese" <g.abbruzzese@resi.it> wrote:

> Hi everybody,
> In my case (very similar to Junko's), when I disconnect the Fibre Channels
> the "try_umount" procedure in the Filesystem RA script doesn't work.
> 
> After the programmed attempts the active/passive cluster doesn't swap, and
> the lvmdir resource is flagged as "failed" rather than "stopped".
> 
> I must say, even if I try to umount the /storage resource manually it
> doesn't work, because Sybase is using some files stored on it (busy); this
> is why the RA cannot complete the operation cleanly. Is there a way
> to force the swap anyway?
> 
> Some things I already tried:
> 1) This very test with a different optical SAN/storage in the past, and the
> RA could always unmount the storage correctly;
> 2) I modified the RA to force the "umount -l" option even though I have an
> ext4 FS rather than NFS;
> 3) I killed the hung processes with the command "fuser -km /storage", but
> the umount always failed, and after a while I got a kernel panic (see the
> sketch below).
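> 
> The manual escalation above, as a sketch (assuming /storage is the
> mountpoint and it is acceptable to kill whatever still holds it busy):
> 
>   fuser -m /storage                        # list PIDs keeping the mount busy
>   fuser -km /storage                       # SIGKILL those processes
>   umount /storage || umount -l /storage    # fall back to a lazy unmount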
> 
> Is there a way to force the swap anyway, even if the umount is not clean?
> Any suggestion?
> 
> Thanks for your time,
> Regards
> Guglielmo
> 
> P.S. lvmdir resource configuration
> 
> <primitive class="ocf" id="resource_lvmdir" provider="heartbeat" type="Filesystem">
>   <instance_attributes id="resource_lvmdir-instance_attributes">
>     <nvpair id="resource_lvmdir-instance_attributes-device" name="device" value="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM"/>
>     <nvpair id="resource_lvmdir-instance_attributes-directory" name="directory" value="/storage"/>
>     <nvpair id="resource_lvmdir-instance_attributes-fstype" name="fstype" value="ext4"/>
>   </instance_attributes>
>   <meta_attributes id="resource_lvmdir-meta_attributes">
>     <nvpair id="resource_lvmdir-meta_attributes-multiple-active" name="multiple-active" value="stop_start"/>
>     <nvpair id="resource_lvmdir-meta_attributes-migration-threshold" name="migration-threshold" value="1"/>
>     <nvpair id="resource_lvmdir-meta_attributes-failure-timeout" name="failure-timeout" value="0"/>
>   </meta_attributes>
>   <operations>
>     <op enabled="true" id="resource_lvmdir-startup" interval="60s" name="monitor" on-fail="restart" requires="nothing" timeout="40s"/>
>     <op id="resource_lvmdir-start-0" interval="0" name="start" on-fail="restart" requires="nothing" timeout="180s"/>
>     <op id="resource_lvmdir-stop-0" interval="0" name="stop" on-fail="restart" requires="nothing" timeout="180s"/>
>   </operations>
> </primitive>
> 
> 2012/5/9 Junko IKEDA <tsukishima.ha@gmail.com>:
> > Hi,
> > 
> > In my case, the umount succeeds when the Fibre Channel is 
> > disconnected, so it seemed that handling the status file caused a 
> > longer failover, as Dejan said.
> > If the umount fails, it will run into a timeout and might trigger a 
> > STONITH action; that case also makes sense (though I couldn't see this).
> > 
> > I tried the following setups:
> > 
> > (1) timeout : multipath > RA
> > multipath timeout = 120s
> > Filesystem RA stop timeout = 60s
> > 
> > (2) timeout : multipath < RA
> > multipath timeout = 60s
> > Filesystem RA stop timeout = 120s
> > 
> > In case (1), Filesystem_stop() fails: the hanging FC causes the stop timeout.
> > 
> > In case (2), Filesystem_stop() succeeds.
> > The filesystem is still hanging, but lines 758 and 759 succeed (rc=0).
> > The status file is no longer accessible, so in fact it remains on the 
> > filesystem.
> > 
> > > > 758         if [ -f "$STATUSFILE" ]; then
> > > > 759             rm -f ${STATUSFILE}
> > > > 760             if [ $? -ne 0 ]; then
> > 
> > So line 761 might not be called as expected.
> > 
> > > > 761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
> > 
> > 
> > By the way, my concern is the unexpected stop timeout and the longer 
> > failover time. If OCF_CHECK_LEVEL is set to 20, it would be better to 
> > try to remove the status file just in case.
> > That can handle case (2) if the user wants to recover this case with
> > STONITH (see the sketch below).
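> > 
> > As a sketch only (assuming the check level is visible to the stop
> > action; variable names as in heartbeat/Filesystem):
> > 
> >     if [ "${OCF_CHECK_LEVEL:-0}" -ge 20 -a -f "$STATUSFILE" ]; then
> >         rm -f ${STATUSFILE} ||
> >             ocf_log warn "Failed to remove status file ${STATUSFILE}."
> >     fi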
> > 
> > 
> > Thanks,
> > Junko
> > 
> > 2012/5/8 Dejan Muhamedagic <dejan@suse.de>:
> > > Hi Lars,
> > > 
> > > On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
> > > > On 2012-05-08T12:08:27, Dejan Muhamedagic <dejan@suse.de> wrote:
> > > > 
> > > > > > In the default (without OCF_CHECK_LEVEL), it's enough to try to 
> > > > > > unmount the file system, isn't it?
> > > > > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
> > > > > 
> > > > > I don't see a need to remove the STATUSFILE at all, as that may 
> > > > > (as you observed) prevent the filesystem from stopping.
> > > > > Perhaps we should skip it altogether? If nobody objects, let's just remove 
> > > > > this code:
> > > > > 
> > > > > 758         if [ -f "$STATUSFILE" ]; then
> > > > > 759             rm -f ${STATUSFILE}
> > > > > 760             if [ $? -ne 0 ]; then
> > > > > 761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
> > > > > 762             fi
> > > > > 763         fi
> > > > 
> > > > That would mean you can no longer differentiate between a "crash" 
> > > > and a clean unmount.
> > > 
> > > One could take a look at the logs. I guess that a crash would 
> > > otherwise be noticeable as well :)
> > > 
> > > > A hanging FC/SAN is likely to be unable to flush any other dirty 
> > > > buffers as well, so the umount may not necessarily succeed w/o 
> > > > errors. I think it's unreasonable to expect that the node will 
> > > > survive such a scenario w/o recovery.
> > > 
> > > True. However, in the case of network-attached storage or other transient 
> > > errors it may lead to an unnecessary timeout followed by fencing, 
> > > i.e. the chance of a longer failover time is higher.
> > > Avoiding a leftover file may not justify that risk.
> > > 
> > > Junko-san, what was your experience?
> > > 
> > > Cheers,
> > > 
> > > Dejan
> > > 
> > > > Regards,
> > > > Lars
> > > > 
> > > > --
> > > > Architect Storage/HA
> > > > SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix 
> > > > Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name 
> > > > everyone gives to their mistakes." -- Oscar Wilde
> > > > 
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

