
List:       aix-l
Subject:    Re: HACMP: umount fails and fuser finds nothing - Fix and Work-around
From:       Christopher Baker <cbaker () GOODYEAR ! COM>
Date:       2007-01-23 22:22:57
Message-ID: OF66CCF956.AB7EEE3A-ON8525726C.0079665B-8525726C.007AF306 () goodyear ! com


Folks,

This is a BUG in AIX.  There is an APAR for it:

APAR IY84689. NFS SERVER DOES NOT RELEASE LOCKS AFTER CLIENT REBOOT

Below is a link to APAR IY84689
http://www-1.ibm.com/support/docview.wss?uid=isg1IY84689


We had a limited maintenance window this past weekend before we got this 
info on the APAR.  Taking a clue from the HACMP script 
"/usr/es/sbin/cluster/events/utils/cl_deactivate_fs", we saw that HACMP 
tries three times to umount a local filesystem, running "fuser" between 
attempts.  If the filesystem is still not unmounted, it does a "stopsrc 
-s rpc.lockd", then tries up to 57 more times to umount.

Later, it restarts rpc.lockd.
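In shell terms, the event script's behavior boils down to a retry loop like 
this (our paraphrase of what we read, not the shipped script; the retry 
helper and the commented flow below are ours, with the counts and names 
taken from our hacmp.out):

```shell
# Sketch of the cl_deactivate_fs retry pattern described above -- a
# paraphrase, not the shipped HACMP script.  retry runs a command up to
# N times and reports whether it ever succeeded.
retry() {
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" && return 0      # command succeeded; stop retrying
        i=$((i + 1))
    done
    return 1                  # all attempts failed
}

# The event script's flow, roughly ("/s" and the LV are from our logs):
#   retry 3 umount /s          # three tries, with fuser -k in between
#   stopsrc -s rpc.lockd       # only if HACMP exports the filesystem
#   retry 57 umount /s         # then up to 57 more tries
#   startsrc -s rpc.lockd      # lock daemon restarted later
```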

Because we need to handle quota'ed filesystems and control the order in 
which filesystems are exported, we do not have HACMP manage the exports. 
As a result, this "stopsrc -s rpc.lockd" step is skipped in the 
"cl_deactivate_fs" script: it first checks whether HACMP controls the 
exporting of the filesystem by doing an "odmget" for it. 

odmget -q "name=EXPORT_FILESYSTEM AND group=$GROUPNAME" HACMPresource | \
    grep value | awk '{print $3}' | sed 's/"//g' | grep -w $FS

Since our filesystems are not there, HACMP never runs the stopsrc 
command; it just tries the umount 60 times and fails.
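In other words, the script's guard amounts to something like this (our 
paraphrase, not the literal HACMP code; the function takes the odmget 
output on stdin so the match logic is easy to see, and on AIX you would 
pipe the odmget command above into it):

```shell
# Paraphrase of the guard in cl_deactivate_fs: bounce rpc.lockd only when
# the filesystem shows up in HACMP's EXPORT_FILESYSTEM list.  Reads the
# odmget output on stdin; on AIX, feed it
#   odmget -q "name=EXPORT_FILESYSTEM AND group=$GROUPNAME" HACMPresource
is_hacmp_export() {
    # keep the "value = ..." lines, strip quotes, whole-word match on $1
    grep value | awk '{print $3}' | sed 's/"//g' | grep -w "$1" >/dev/null
}
```

Since our exports are not registered in HACMP, this check comes up empty 
for us and the stopsrc step never runs.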

We first did a "stopsrc -g nfs" in our HACMP stop script.  That worked, 
but it was too radical, especially when failing back from the 2nd NFS 
server to your own server.  We then found that just refreshing rpc.lockd 
was all that was needed.  That is also when we noticed that HACMP was 
already trying to do the same thing.
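For anyone building a similar stop script, the shape of the change is 
just this (a hypothetical sketch, not our production script; stopsrc and 
startsrc are the standard AIX SRC commands):

```shell
# Hypothetical pre-umount hook for a custom HACMP stop script -- a sketch
# of the workaround above, not our production code.
pre_umount_hook() {
    # Bounce only the lock daemon to release stale client locks.
    # "stopsrc -g nfs" also works, but it takes down nfsd/biod/mountd
    # too, which is too disruptive when failing back from the other
    # NFS server.
    stopsrc -s rpc.lockd || return 1
    startsrc -s rpc.lockd
}
```

Such a hook would run after the exports are removed and right before the 
filesystems are unmounted.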

We will try this APAR fix in the future, when we have another 
maintenance day.

If someone else tries it first, please let us know whether there are any 
side effects and whether it works.

Thanks,

  Chris Baker

IBM AIX Discussion List <aix-l@Princeton.EDU> wrote on 01/10/2007 04:56:03 
PM:

> 
> No, all VG's (except rootvg) are set to "No" for Auto Vary On. 
> 
> So, if I do an lsvg, I see all the VG's from both servers.  But, if I
> do an lsvg -o, I only see the local ones that HACMP varied on.  Of 
> course, if NFS2 is failed over to NFS1, then an lsvg -o is the same 
> as lsvg on NFS1, because NFS2's VG's have been taken over by NFS1. 
> 
> But, again, I am having the problem before the VG's are varied off. 
> 
> Thanks
> 
> 
> IBM AIX Discussion List <aix-l@Princeton.EDU> wrote on 01/10/2007 
01:33:13 PM:
> 
> > Is the volume group set to auto varyon?  Run lsvg
> > on the VG to check; if it is set to auto varyon,
> > turn that off with chvg -an "vgname". 
> > 
> > --- Christopher Baker <cbaker@GOODYEAR.COM> wrote:
> > 
> > > Robert,
> > > 
> > > Thank you for your reply, but I am not having
> > > trouble shutting down the 
> > > AIX system.  I am having the trouble just shutting
> > > down gracefully the 
> > > HACMP environment or failing over.
> > > 
> > > Looking through the hacmp.out, it is quite clear
> > > that HACMP is unable to 
> > > unmount local filesystems that need to be unmounted
> > > so that the other 
> > > system can varyonvg and mount the same filesystems
> > > and export them as the 
> > > first box.
> > > 
> > > If I open a shell window at this hung point, I am
> > > not able to umount the 
> > > filesystem manually.  If I do an lsof or a fuser,
> > > there does not appear to 
> > > be anything holding the mount.
> > > 
> > > Thanks,
> > > 
> > > Christopher M. Baker
> > > Senior Technical Support Analyst
> > > HPCE - Linux Cluster Development and Support
> > > Goodyear Tire and Rubber Company
> > > 330.796.1725
> > > 
> > > =================================================
> > > Contains Confidential and/or Proprietary
> > > Information.
> > > May not be copied or disseminated without the
> > > expressed
> > > written consent of The Goodyear Tire & Rubber
> > > Company.
> > > =================================================
> > > 
> > > 
> > > 
> > > 
> > > Robert Binkley <leebinkley@YAHOO.COM> 
> > > Sent by: IBM AIX Discussion List
> > > <aix-l@Princeton.EDU>
> > > 01/09/2007 09:41 AM
> > > Please respond to
> > > IBM AIX Discussion List <aix-l@Princeton.EDU>
> > > 
> > > 
> > > To
> > > aix-l@Princeton.EDU
> > > cc
> > > 
> > > Subject
> > > Re: HACMP: umount fails and fuser finds nothing
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 1: Running /usr/es/sbin/cluster/cllsserv
> > > will show what the cluster knows about. 
> > > 2: Do you use an rc.shutdown script to restart your
> > > systems?
> > > 3: If there are entries in /etc/inittab that the
> > > cluster controls, it may fail to shut down because
> > > it is not able to completely exit 0.
> > > 4: Have you run "snap -r; snap -e" at the time of
> > > the error and sent it to the IBM HACMP engineers?
> > > 5: Try looking through your hacmp.out file and
> > > grepping for "exit 1". 
> > > 
> > > --- Christopher Baker <cbaker@GOODYEAR.COM> wrote:
> > > 
> > > > Folks,
> > > > 
> > > > Has anyone had this problem?  We have two servers
> > > > (LPARS) on two different 
> > > > P570's that are the two systems in a HACMP
> > > cluster. 
> > > > We cannot 
> > > > successfully fail over or gracefully stop HACMP
> > > > because the system is 
> > > > unable to umount one or more local filesystems. 
> > > In
> > > > the hacmp.out file we 
> > > > see that the umount gets a "Device busy".  It then
> > > > does a 
> > > > 
> > > >         fuser -k -u -x /dev/nfs1lv11
> > > > 
> > > > That returns nothing
> > > > 
> > > >         /dev/nfs1lv11:
> > > > 
> > > > It then tries to umount the filesystem "/s" that
> > > is
> > > > on LV /dev/nfs1lv11. 
> > > > This continues forever.
> > > > 
> > > > At the command line, an lsof and a "fuser -d
> > > > /dev/nfs1lv11" come back 
> > > > blank.
> > > > 
> > > > INFO:  These are NFS servers, and the filesystems
> > > > we have trouble with are exported to many IBM
> > > > workstations as well as non-IBM systems, and are
> > > > also shared via SAMBA.
> > > > We un-export all filesystems and stop the SAMBA
> > > > processes before HACMP 
> > > > tries to umount them.
> > > > 
> > > > 
> > > > We have a similar, but much less frequent problem
> > > on
> > > > other (non-NFS 
> > > > server) clusters where a zombied process will not
> > > > "kill -9" and it is 
> > > > using a local or NFS mounted filesystem.  If we
> > > need
> > > > to fail that over or 
> > > > just stop HACMP so we can reboot, we have to do a
> > > > "shutdown".
> > > > 
> > > > Any assistance would be of help.
> > > > 
> > > > 
> > > > Christopher M. Baker
> > > > Senior Technical Support Analyst
> > > > HPCE - Linux Cluster Development and Support
> > > > Goodyear Tire and Rubber Company
> > > > 330.796.1725
> > > > 
> > > > 
> > > 
> > > 