List:       amd-dev
Subject:    Badness when setting ping interval
From:       Nick Williams <Nick.Williams () morganstanley ! com>
Date:       2003-11-03 14:23:24

A while ago we discussed setting ping to -1 for TCP mounts. I've 
recently tried setting this within some of our mounts for specific 
fileservers (specifically, for fileservers which have their own failover 
capability which could take longer than a minute...). It looked like it 
worked, but then we later found that 'some' machines could no longer 
access the filesystems, getting an Input/Output error! This happened on 
both sol8 (amd v6.0.3) and on linux (2.4.9-e12enterprise from redhat, 
with amd v6.0.8). However, other machines (same version) had no problems 
to the same fileservers.

Here's a more detailed timeline of things happening on a linux machine. 
The key in question is 'govts':
govts   -sublink:=govts;rhost:=ln0fnf02;rfs:=/d/ln0fnf02/d25 
host!=ln0fnf02;type:=nfs;opts:=ping=-1 || type:=link;fs:=${rfs}

Nov  2 15:30:33 haifd51 amd[1739]: get_nfs_version: returning (3,tcp) on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: get_nfs_version: returning (3,udp) on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: Using NFS version 3, protocol tcp on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: file server ln0fnf02, type nfs, state 
wired up
Nov  2 15:30:33 haifd51 amd[1739]: Trying mount of 
ln0fnf02:/d/ln0fnf02/d25 on /tmp_amd/govts fstype nfs
Nov  2 15:30:33 haifd51 amd[1739]: recompute_portmap: NFS version 3
Nov  2 15:30:33 haifd51 amd[1739]: Using MOUNT version: 3
Nov  2 15:30:33 haifd51 amd[1739]: get_nfs_version: returning (3,tcp) on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: get_nfs_version: returning (3,udp) on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: Using NFS version 3, protocol tcp on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: Trying mount of 
ln0fnf02:/d/ln0fnf02/d25 on /tmp_amd/govts fstype nfs
Nov  2 15:30:33 haifd51 amd[1739]: call_mountd: NFS version 3, mount 
version 3
Nov  2 15:30:33 haifd51 amd[1739]: get_nfs_version: returning (3,tcp) on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: get_nfs_version: returning (3,udp) on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: Using NFS version 3, protocol tcp on 
host ln0fnf02
Nov  2 15:30:33 haifd51 amd[1739]: Trying mount of 
ln0fnf02:/d/ln0fnf02/d25 on /tmp_amd/govts fstype nfs
Nov  2 15:30:33 haifd51 amd[1739]: prime_nfs_fhandle_cache: NFS version 3
Nov  2 15:30:33 haifd51 amd[27648]: Using remopts="ping=-1"
Nov  2 15:30:33 haifd51 amd[27648]: mount_nfs_fh: NFS version 3
Nov  2 15:30:33 haifd51 amd[27648]: mount_nfs_fh: using NFS transport tcp
Nov  2 15:30:33 haifd51 amd[1739]: ln0fnf02:/d/ln0fnf02/d25 mounted 
fstype nfs on /a/ln0fnf02/d/ln0fnf02/d25

And since the process using this directory ties it up forever, we see 
continual messages relating to amd trying to unmount it:
Nov  3 00:55:16 haifd51 amd[1739]: "/tmp_amd/govts" on 
/a/ln0fnf02/d/ln0fnf02/d25 still active
Nov  3 00:57:16 haifd51 amd[1739]: "/tmp_amd/govts" on 
/a/ln0fnf02/d/ln0fnf02/d25 still active

But these messages stopped early this morning and amd has never looked 
at it again. The last message was:
Nov  3 04:03:16 haifd51 amd[1739]: "/tmp_amd/govts" on 
/a/ln0fnf02/d/ln0fnf02/d25 still active
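
To check other machines for the same symptom, the last "still active" retry per mount point can be pulled out of the syslog. The following is only a minimal Python sketch and not part of am-utils: the function name last_still_active is hypothetical, the regular expression is inferred from the log lines quoted in this message, and it assumes each log entry is on a single line in the actual syslog file (the entries above are wrapped by the mailer). Syslog timestamps carry no year, so the year is passed in as an assumption, and month names are parsed with the default C/English locale.

```python
import re
from datetime import datetime

# Matches an amd syslog line of the form quoted above, e.g.
#   Nov  3 04:03:16 haifd51 amd[1739]: "/tmp_amd/govts" on /a/... still active
# The pattern is inferred from this message, not from the am-utils source.
STILL_ACTIVE = re.compile(
    r'^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+amd\[\d+\]:\s+'
    r'"(?P<mount>[^"]+)"\s+on\s+\S+\s+still active'
)


def last_still_active(lines, year=2003):
    """Return {mount point: datetime of the latest 'still active' message}.

    `year` is an assumption supplied by the caller, because syslog
    timestamps do not include one.
    """
    last = {}
    for line in lines:
        m = STILL_ACTIVE.match(line)
        if m is None:
            continue
        # Collapse the padding in "Nov  3" before parsing.
        stamp = " ".join(m.group("ts").split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        mount = m.group("mount")
        if mount not in last or when > last[mount]:
            last[mount] = when
    return last


if __name__ == "__main__":
    import sys

    # Usage: python last_active.py < /var/log/messages
    for mount, when in sorted(last_still_active(sys.stdin).items()):
        print(f"{mount}: last unmount retry logged at {when}")
```

A mount point whose latest timestamp is hours old while the mount is still in use would be a candidate for the behaviour described here.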

At this point, any access to /tmp_amd/govts hangs for a few 
minutes and then gives the 'Input/Output' error. Nothing we do fixes the 
problem at this point, so we are looking at reverting the ping value back 
to its default (or at least, to something other than -1).

I forcibly tried to remove it (I didn't expect it to work, because production 
processes are holding it open, but at least I can see what amd thinks 
about it):
Nov  3 11:24:43 haifd51 amd[1739]: "/tmp_amd/govts" forcibly timed out
Nov  3 11:24:43 haifd51 amd[1739]: "/tmp_amd/govts" on 
/a/ln0fnf02/d/ln0fnf02/d25 still active

But I'm still not seeing any other attempted unmounts on a timer basis. 
I changed the key so that it now says ping=60...
Nov  3 12:40:42 haifd51 amd[1739]: reload #5 of map /etc/amd.map succeeded
and now users can 'cd' and 'ls' to their hearts' content - things just 
work, although I notice that I'm still not seeing any unmount attempts.

I'm trying to replicate this problem at the moment on non-production 
machines, so maybe I'll get further, but I'm curious to see if anyone 
else has seen anything like this behaviour before.

Nick

_______________________________________________
amd-dev mailing list: amd-dev@cs.columbia.edu
Am-utils: http://www.am-utils.org