
List:       opensolaris-nfs-discuss
Subject:    Re: [nfs-discuss] NFS hanging with RPC timeout.
From:       Jorgen Lundman <lundman () gmo ! jp>
Date:       2009-02-25 9:06:48
Message-ID: 49A50A28.9000703 () gmo ! jp


OK, it still happens even when not using aliases; it just took longer to 
turn up.

Attempting to mount (snoop running on NFS client)

bash-3.00# mount 172.20.12.228:/export/mail /mnt
nfs mount: 172.20.12.228: : RPC: Program not registered
nfs mount: retrying: /mnt
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1021 Syn Seq=2365435228 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1021 S=2049 Rst Ack=2365435229 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1020 Syn Seq=2242896555 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1020 S=2049 Rst Ack=2242896556 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0

172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1019 Syn Seq=1448368793 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1019 S=2049 Rst Ack=1448368794 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1018 Syn Seq=883538524 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1018 S=2049 Rst Ack=883538525 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 NFS C FSSTAT3 FH=D702
172.20.12.228 -> 172.20.12.21 ICMP Destination unreachable (UDP port 
2049 unreachable)
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1017 Syn Seq=3028937941 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>


172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1016 Syn Seq=3821439944 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1016 S=2049 Rst Ack=3821439945 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1015 Syn Seq=1966482573 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1015 S=2049 Rst Ack=1966482574 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0


^C
bash-3.00# 172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1014 Syn 
Seq=2696600158 Len=0 Win=49640 Options=<mss 1460,nop,wscale 
0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1014 S=2049 Rst Ack=2696600159 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) 
vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0



# rpcinfo 172.20.12.228
rpcinfo: can't contact rpcbind: : RPC: Unable to receive; errno = 
Connection refused; System error
bash-3.00# 172.20.12.21 -> 172.20.12.228 TCP D=111 S=50279 Syn 
Seq=3313033773 Len=0 Win=49640 Options=<mss 1460,nop,wscale 
0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=50279 S=111 Rst Ack=3313033774 Win=0
172.20.12.21 -> 172.20.12.228 TCP D=111 S=54373 Syn Seq=3383588494 Len=0 
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=54373 S=111 Rst Ack=3383588495 Win=0
172.20.12.21 -> 172.20.12.228 RPCBIND C DUMP
172.20.12.228 -> 172.20.12.21 ICMP Destination unreachable (UDP port 111 
unreachable)
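
The traces above follow one pattern: every TCP SYN to port 2049 or 111 is answered with an RST whose Ack is the SYN's sequence number plus one, which is what a host sends when nothing is listening on that port. The hung mount quoted further down looks different: there the SYN is answered by a bare ACK carrying an unrelated acknowledgement number. The following is a minimal Python sketch for telling these cases apart in snoop summary output. The names parse_tcp and classify_reply are made up for illustration, and the sketch assumes each packet has been re-joined onto a single line (the mail wrapping above splits some of them).

```python
import re

# Matches the one-line TCP summaries that snoop prints, for example:
#   172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1021 Syn Seq=2365435228 Len=0
TCP_RE = re.compile(
    r"(?P<src>\d+\.\d+\.\d+\.\d+) -> (?P<dst>\d+\.\d+\.\d+\.\d+) TCP "
    r"D=(?P<dport>\d+) S=(?P<sport>\d+)(?P<rest>[^\n]*)"
)


def parse_tcp(line):
    """Return a dict describing a snoop TCP summary line, or None."""
    m = TCP_RE.search(line)
    if m is None:
        return None
    rest = m.group("rest")
    flags = {f for f in ("Syn", "Fin", "Rst") if re.search(r"\b%s\b" % f, rest)}
    ack = re.search(r"\bAck=(\d+)", rest)
    seq = re.search(r"\bSeq=(\d+)", rest)
    return {
        "src": m.group("src"),
        "dst": m.group("dst"),
        "sport": int(m.group("sport")),
        "dport": int(m.group("dport")),
        "flags": flags,
        "ack": int(ack.group(1)) if ack else None,
        "seq": int(seq.group(1)) if seq else None,
    }


def classify_reply(syn, reply):
    """Classify the server's answer to a client SYN.

    "refused"   - RST: no listener on that port at that moment
    "accepted"  - SYN+ACK acknowledging the client's sequence number + 1
    "stale-ack" - an ACK for some other sequence number, as if the server
                  still held state for an older connection on this port pair
    """
    if syn["seq"] is None:
        raise ValueError("SYN line has no Seq= field (wrapped line?)")
    expected_ack = (syn["seq"] + 1) % 2**32
    if "Rst" in reply["flags"]:
        return "refused"
    if "Syn" in reply["flags"] and reply["ack"] == expected_ack:
        return "accepted"
    if reply["ack"] is not None and reply["ack"] != expected_ack:
        return "stale-ack"
    return "unknown"
```

Fed the first SYN/RST pair of the mount attempt above, classify_reply returns "refused". Fed the repeated port 664 exchange from the original report quoted below, it returns "stale-ack". The labels are only a reading aid and do not say why the server behaves that way.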

From a different NFS client:

  rpcinfo 172.20.12.228
    program version netid     address             service    owner
     100000    4    ticots    x4500-05.unix.rpc   rpcbind    superuser
     100000    3    ticots    x4500-05.unix.rpc   rpcbind    superuser
     100000    4    ticotsord x4500-05.unix.rpc   rpcbind    superuser
     100000    3    ticotsord x4500-05.unix.rpc   rpcbind    superuser
     100000    4    ticlts    x4500-05.unix.rpc   rpcbind    superuser
     100000    3    ticlts    x4500-05.unix.rpc   rpcbind    superuser
     100000    4    tcp       0.0.0.0.0.111       rpcbind    superuser
     100000    3    tcp       0.0.0.0.0.111       rpcbind    superuser
     100000    2    tcp       0.0.0.0.0.111       rpcbind    superuser
     100000    4    udp       0.0.0.0.0.111       rpcbind    superuser
     100000    3    udp       0.0.0.0.0.111       rpcbind    superuser
     100000    2    udp       0.0.0.0.0.111       rpcbind    superuser
     100024    1    udp       0.0.0.0.128.10      status     superuser
     100024    1    tcp       0.0.0.0.128.3       status     superuser
     100024    1    ticlts    \021\000\000\000    status     superuser
     100024    1    ticotsord \024\000\000\000    status     superuser
     100024    1    ticots    \027\000\000\000    status     superuser
     100133    1    udp       0.0.0.0.128.10      -          superuser
     100133    1    tcp       0.0.0.0.128.3       -          superuser
     100133    1    ticlts    \021\000\000\000    -          superuser
     100133    1    ticotsord \024\000\000\000    -          superuser
     100133    1    ticots    \027\000\000\000    -          superuser
     100021    1    udp       0.0.0.0.15.205      nlockmgr   1
1073741824    1    tcp       0.0.0.0.128.4       -          1
     100021    2    udp       0.0.0.0.15.205      nlockmgr   1
     100021    3    udp       0.0.0.0.15.205      nlockmgr   1
     100021    4    udp       0.0.0.0.15.205      nlockmgr   1
     100021    1    tcp       0.0.0.0.15.205      nlockmgr   1
     100021    2    tcp       0.0.0.0.15.205      nlockmgr   1
     100021    3    tcp       0.0.0.0.15.205      nlockmgr   1
     100021    4    tcp       0.0.0.0.15.205      nlockmgr   1
     100155    1    ticotsord l\000\000\000       smserverd  superuser
     100011    1    ticlts    o\000\000\000       rquotad    superuser
     100011    1    udp       0.0.0.0.128.18      rquotad    superuser
     100231    1    ticlts    x4500-05.unix.nfsauth -          superuser
     100231    1    ticotsord x4500-05.unix.nfsauth -          superuser
     100231    1    ticots    x4500-05.unix.nfsauth -          superuser
     100005    1    udp       0.0.0.0.128.19      mountd     superuser
     100005    1    ticlts    \203\000\000\000    mountd     superuser
     100005    1    tcp       0.0.0.0.128.13      mountd     superuser
     100005    1    ticotsord \210\000\000\000    mountd     superuser
     100005    1    ticots    \213\000\000\000    mountd     superuser
     100005    2    udp       0.0.0.0.128.19      mountd     superuser
     100005    2    ticlts    \203\000\000\000    mountd     superuser
     100005    2    tcp       0.0.0.0.128.13      mountd     superuser
     100005    2    ticotsord \210\000\000\000    mountd     superuser
     100005    2    ticots    \213\000\000\000    mountd     superuser
     100005    3    udp       0.0.0.0.128.19      mountd     superuser
     100005    3    ticlts    \203\000\000\000    mountd     superuser
     100005    3    tcp       0.0.0.0.128.13      mountd     superuser
     100005    3    ticotsord \210\000\000\000    mountd     superuser
     100005    3    ticots    \213\000\000\000    mountd     superuser
     100003    2    udp       0.0.0.0.8.1         nfs        1
     100003    3    udp       0.0.0.0.8.1         nfs        1
     100227    2    udp       0.0.0.0.8.1         nfs_acl    1
     100227    3    udp       0.0.0.0.8.1         nfs_acl    1
     100003    2    tcp       0.0.0.0.8.1         nfs        1
     100003    3    tcp       0.0.0.0.8.1         nfs        1
     100003    4    tcp       0.0.0.0.8.1         nfs        1
     100227    2    tcp       0.0.0.0.8.1         nfs_acl    1
     100227    3    tcp       0.0.0.0.8.1         nfs_acl    1
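
For the tcp and udp netids, the address column is in RPC "universal address" form: the four octets of the IPv4 address followed by two octets of port number, so the port is the fifth field times 256 plus the sixth. A small Python sketch of that decoding, with uaddr_to_endpoint being a made-up name for illustration:

```python
def uaddr_to_endpoint(uaddr):
    """Split an IPv4 universal address such as "0.0.0.0.8.1" into (host, port).

    The first four dotted fields are the IPv4 address; the last two are the
    high and low bytes of the port number.
    """
    parts = uaddr.split(".")
    if len(parts) != 6:
        raise ValueError("expected six dotted fields, got %r" % uaddr)
    octets = [int(p) for p in parts]
    if any(o < 0 or o > 255 for o in octets):
        raise ValueError("field out of range in %r" % uaddr)
    host = ".".join(str(o) for o in octets[:4])
    port = octets[4] * 256 + octets[5]
    return host, port
```

Applied to the listing above, 0.0.0.0.0.111 is port 111 (rpcbind), 0.0.0.0.8.1 is port 2049 (nfs) and 0.0.0.0.15.205 is port 4045 (nlockmgr). So from the second client's point of view the server has listeners registered on exactly the ports that answer the first client with resets.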


Why would the NFS client not be able to talk to the server?


3 minutes later, rpcinfo got unstuck and NFS was back again, without me 
doing anything but running snoop.
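
To pin down when such a window starts and ends, one option would be to poll the rpcbind and NFS TCP ports from the affected client and log the result with a timestamp. The sketch below is only an assumption about how that could be done in Python; probe_tcp is a hypothetical helper, not an existing tool. It separates an active refusal (RST, as in the snoop output above) from a silent timeout.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connect and report how it ended.

    Returns "open" if the handshake completed, "refused" if the peer
    answered with a reset, "timeout" if nothing came back in time, or
    "error: ..." for any other socket error.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc
```

Calling probe_tcp("172.20.12.228", 111) and probe_tcp("172.20.12.228", 2049) every few seconds from both a failing and a working client, and printing the time next to each result, would show whether the refusals start and stop at the same moment on both ports.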

Lund





Robert van Veelen wrote:
> I will try this out on my test hosts. Are you using NFSv3 exclusively? There are no
> v4 clients in your env? If the clients are exclusively ro, have you tried mounting
> with the ro flag? 
> -rob
> 
> 
> -----Original Message-----
> From: 	Jorgen Lundman [mailto:lundman@gmo.jp]
> Sent:	Monday, February 16, 2009 12:11 AM Eastern Standard Time
> To:	Robert van Veelen
> Subject:	Re: [nfs-discuss] NFS hanging with RPC timeout.
> 
> 
> I have not had time to prove this, but by asking the other admins which 
> NFS server mounts hung, nobody could remember x4500-01 ever hanging. The 
> reason I asked them was because x4500-01 is the only one where the 
> "alias" is lower IP than the real IP.
> 
> 01-alias: .220
> 01-real:  .221
> 02-real:  .222
> 02-alias: .223
> 03-real:  .224
> 03-alias: .225
> 04-real:  .226
> 04-alias: .227
> 
> Now, IP "value" should not matter, I know, but it just "felt" like it 
> was related. :)
> 
> Lund
> 
> 
> 
> Robert van Veelen wrote:
> > I have a handful of x4150s to play with this week. I'll drop some ipmi addresses
> > on them and see if I can reproduce the symptoms that you are describing. Would be
> > interesting to see. Always good to know where one's bugs lie. 
> > -rob
> > 
> > 
> > -----Original Message-----
> > From: 	Jorgen Lundman [mailto:lundman@gmo.jp]
> > Sent:	Sunday, February 15, 2009 11:47 PM Eastern Standard Time
> > To:	Robert van Veelen
> > Subject:	Re: [nfs-discuss] NFS hanging with RPC timeout.
> > 
> > 
> > The servers that hang the most are www and navi (apache), which are 
> > nearly exclusively read-only.  Servers like FTP and vmx have yet to 
> > hang at all. It sure does not make much sense here.
> > 
> > Strangely enough, navi servers (3 reboots a day before) "appear" to do 
> > a lot better (only one reboot in 2 days), but now we see a lot of:
> > 
> > Feb 16 13:40:28 navi01.unix nfs: [ID 333984 kern.notice] NFS server 
> > 172.20.12.224 not responding still trying
> > Feb 16 13:40:41 navi01.unix nfs: [ID 563706 kern.notice] NFS server 
> > 172.20.12.224 ok
> > Feb 16 13:41:47 navi01.unix nfs: [ID 333984 kern.notice] NFS server 
> > 172.20.12.224 not responding still trying
> > Feb 16 13:42:03 navi01.unix nfs: [ID 563706 kern.notice] NFS server 
> > 172.20.12.224 ok
> > Feb 16 13:42:28 navi01.unix nfs: [ID 333984 kern.notice] NFS server 
> > 172.20.12.224 not responding still trying
> > 
> > 
> > This is even though the x4500 is just fine, and talks to navi01 just fine even 
> > during one of these "stalls". Services appear unaffected.
> > 
> > Oh bugger, I just noticed navi servers are not 5/08. That would possibly 
> > explain why they are the worst of all. I will upgrade these servers 
> > asap. (SunOS navi01.unix 5.11 snv_40 i86pc i386 i86pc)
> > 
> > Lund
> > 
> > Robert van Veelen wrote:
> > > Jorgen,
> > > I have been following the back and forth on the list with the ip alias info. It
> > > does seem like a strange case. It would be interesting if you found some
> > > connection to the hangs we are seeing but it also appears unlikely now.  For
> > > what it's worth, the patch I specified was rolled into the 10/08 release and
> > > cannot be removed trivially. This is how I was burned while deploying to our
> > > first qa hosts for 10/08 then through the backported patch to 5/08. Are you
> > > writing/reading to or from the shared nfs space directly on the server side?
> > > This seems to be a key factor in my steps to recreate our hang.  Good luck,
> > > 
> > > -rob
> > > 
> > > 
> > > -----Original Message-----
> > > From: 	Jorgen Lundman [mailto:lundman@gmo.jp]
> > > Sent:	Sunday, February 15, 2009 07:55 PM Eastern Standard Time
> > > To:	Robert van Veelen
> > > Subject:	Re: [nfs-discuss] NFS hanging with RPC timeout.
> > > 
> > > 
> > > 
> > > Hello,
> > > 
> > > Sorry, I only just saw your mail now, my mail filters were not smart 
> > > enough to move it to the right place :) I do not think I have 137138-09 
> > > installed on the Solaris 10 5/08 servers, but it appears installed on the 
> > > 10/08. But most recent findings seem to indicate that the problem we are 
> > > having is with IP aliases. Currently testing this hypothesis.
> > > 
> > > Lund
> > > 
> > > 
> > > 
> > > Robert van Veelen wrote:
> > > > Do you have Solaris patch 137138-09 installed? You may need to back that out
> > > > until a permanent fix is posted. The issue that you are describing sounds
> > > > exactly like a problem that I have seen on similar machines here. In my
> > > > testing the only workaround was to back out the kernel patch 137138-09 on the
> > > > clients (server can remain as is). If you have a support contract then I
> > > > would also open a case with Sun as there appears to be a regression in the
> > > > kernel code. At this point I can reproduce the deadlock within 30 seconds.
> > > > You might be able to reference my open case for this issue. I will forward
> > > > more info if you find that this is the same issue. Good luck. Regards,
> > > > 
> > > > -rob
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: 	Jorgen Lundman [mailto:lundman@gmo.jp]
> > > > Sent:	Monday, February 09, 2009 09:21 PM Eastern Standard Time
> > > > To:	nfs-discuss@opensolaris.org
> > > > Subject:	[nfs-discuss] NFS hanging with RPC timeout.
> > > > 
> > > > (Resent due to wrong sender, sorry)
> > > > 
> > > > 
> > > > Hello list!
> > > > 
> > > > *** NFS Servers:
> > > > 
> > > > x4500-01 to x4500-05
> > > > > Solaris 10 5/08, ZFS and "UFS on ZVOL" exported.
> > > > > NFSD_SERVER=1024, LOCKD_SERVER=128 average use about 900 / 20 threads.
> > > > > "bufhwm_pct,maxusers,ndquot,ncsize,ufs_ninode,clnt_max_conns,
> > > > > rpcmod:cotsmaxdupreqs,rpcmod:maxdupreqs" tweaked in /etc/system.
> > > > 
> > > > *** NFS Clients:
> > > > 
> > > > Supermicro 1U * 40
> > > > > Solaris 10 5/08
> > > > > No tweaks, Mounted as
> > > > > x4500-03:/export/mail - /export/mail nfs - yes vers=3,hard,intr,quota
> > > > > x4500-02:/export/preview - /export/preview nfs - yes vers=3,hard,intr
> > > > 
> > > > 
> > > > *** Background
> > > > 
> > > > Using vers=3 to have uid mapping, without the need for UID lookups. UFS
> > > > on ZVOL are mounted with "quota". ZFS exported filesystems are mounted
> > > > without. The system is live and generally works very well.
> > > > 
> > > > However, NFS will periodically hang, usually to just one of the x4500
> > > > servers at a time. The solution currently is just to reboot the client.
> > > > I have attempted to fully umount all filesystems, and terminate the NFS
> > > > and RPC processes, in an attempt to remount. This will not fix it. I can
> > > > not really restart the NFSD/RPC processes on the x4500s.
> > > > 
> > > > Usually looks like:
> > > > 
> > > > # df -h
> > > > [snip]
> > > > x4500-03:/export/preview
> > > > 23T   3.9M    23T     1%    /export/preview
> > > > NFS server x4500-01 not responding still trying
> > > > ^C
> > > > 
> > > > Note that during this time, x4500-01 is still functioning correctly to
> > > > the other 39 servers, and x4500-02,03,04,05 are still mounted correctly
> > > > on this NFS client.
> > > > 
> > > > # umount /export/www
> > > > # mount /export/www
> > > > NFS server x4500-01-vip not responding still trying
> > > > 
> > > > Truss of the mount says:
> > > > 23102:   0.0000 getpid()                                        = 23102
> > > > [23101]
> > > > 23102:   0.0000 door_call(5, 0x080475A0)                        = 0
> > > > 23102:   0.0001 close(5)                                        = 0
> > > > NFS server x4500-01-vip not responding still trying
> > > > ^C23102:        69.0780 mount("x4500-01-vip:/export/www", "/export/www",
> > > > MS_DATA|MS_OPTIONSTR, "nfs3", 0x0806D400, 76, 0x0804777C, 1024) Err#4 EINTR
> > > > 
> > > > Snoop says (x4500-01 is 172.20.12.220, NFS Client is 172.20.12.16)
> > > > 
> > > > 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
> > > > vers=3 proto=UDP
> > > > 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
> > > > 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
> > > > 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
> > > > 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
> > > > 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
> > > > 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> > > > proto=TCP
> > > > 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579
> > > > Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580
> > > > Seq=611591914 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591915
> > > > Seq=2255048580 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048700
> > > > Seq=611591915 Len=0 Win=49520
> > > > 172.20.12.220 -> 172.20.12.16 NFS R NULL3
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591943
> > > > Seq=2255048700 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Fin Ack=611591943
> > > > Seq=2255048700 Len=0 Win=49640
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048701
> > > > Seq=611591943 Len=0 Win=49640
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Fin Ack=2255048701
> > > > Seq=611591943 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591944
> > > > Seq=2255048701 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> > > > Seq=4284552307 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> > > > Seq=4284552307 Len=0 Win=49640
> > > > [delay]
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> > > > Seq=4284552307 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> > > > Seq=4284552307 Len=0 Win=49640
> > > > [repeat, delay]
> > > > 
> > > > 
> > > > *** truss of mountd on x4500-01 while attempting mount:
> > > > 
> > > > # truss -Dfip 28717
> > > > 28717:   6.8156 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000)  = 1
> > > > 28717:   0.0002 lwp_kill(788, SIG#0)                            Err#3 ESRCH
> > > > 28717:   0.0001 lwp_create(0x08047B90, LWP_DETACHED|LWP_SUSPENDED,
> > > > 0x08047DB0) = 791
> > > > 28717/1:         0.0002 lwp_continue(791)                               = 0
> > > > 28717/791:       6.8159 lwp_create()    (returning as new lwp ...)      = 0
> > > > 28717/1:         0.0001 fxstat(2, 7, 0x08047CB0)                        = 0
> > > > 28717/791:       0.0003 setustack(0xFECD1A60)
> > > > 28717/1:         0.0000 getmsg(7, 0x08047D8C, 0x080CC018, 0x08047DAC)   = 0
> > > > 28717/791:       0.0001 schedctl()
> > > > = 0xFEFB2010
> > > > 28717/1:         0.0001 open("/dev/udp", O_RDONLY)                      = 16
> > > > 28717/1:         0.0001 ioctl(16, SIOCTMYADDR, 0x08047CA8)              = 0
> > > > 28717/1:         0.0001 close(16)                                       = 0
> > > > 28717/1:         0.0000 fxstat(2, 7, 0x08047C40)                        = 0
> > > > 28717/1:         0.0000 putmsg(7, 0x08047D18, 0x080CC018, 0)            = 0
> > > > 28717/1:         0.0001 write(14, "F0", 1)                              = 1
> > > > 28717/791:       0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000)  = 1
> > > > 28717/791:       0.0000 read(13, "F0", 16)                              = 1
> > > > 28717/791:       0.0001 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000)  = 1
> > > > 28717/791:       0.0001 lwp_unpark(1)                                   = 0
> > > > 28717/1:         0.0002 lwp_park(0x00000000, 0)                         = 0
> > > > 28717/791:       0.0000 fxstat(2, 7, 0xFEA3FE40)                        = 0
> > > > 28717/791:       0.0001 getmsg(7, 0xFEA3FF20, 0x080CC018, 0xFEA3FF40)   = 0
> > > > 28717/791:       0.0001 open("/dev/udp", O_RDONLY)                      = 16
> > > > 28717/791:       0.0000 ioctl(16, SIOCTMYADDR, 0xFEA3FE38)              = 0
> > > > 28717/791:       0.0001 close(16)                                       = 0
> > > > 28717/791:       0.0000 write(14, " E", 1)                              = 1
> > > > 28717/1:         0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000)  = 1
> > > > 28717/791:       0.0001 getuid()
> > > > = 0 [0]
> > > > 28717/1:         0.0001 read(13, " E", 16)                              = 1
> > > > 28717/791:       0.0000 getuid()
> > > > = 0 [0]
> > > > 28717/791:       0.0001 door_info(15, 0xFEA3F360)                       = 0
> > > > 28717/791:       0.0001 door_call(15, 0xFEA3F3B8)                       = 0
> > > > 28717/791:       0.0000 resolvepath("/export/www", "/export/www", 1024) = 18
> > > > 28717/791:       0.0001 xstat(2, "/etc/dfs/sharetab", 0xFEA3F6B8)       = 0
> > > > 28717/791:       0.0001 nfssys(20, 0xFEA3F860)                          = 0
> > > > 28717/791:       0.0000 fxstat(2, 7, 0xFEA3F6F0)                        = 0
> > > > 28717/791:       0.0000 putmsg(7, 0xFEA3F7C8, 0x080CC018, 0)            = 0
> > > > 28717/791:       0.0001 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7)
> > > > = 0xFFBFFEFF [0x0000FFFF]
> > > > 28717/791:       0.0000 lwp_exit()
> > > > 
> > > > [pause]
> > > > 
> > > > 
> > > > 
> > > > What IS somewhat amusing, though, is that even though I can not mount it again
> > > > using TCP, if I change to using UDP it will mount just fine. We
> > > > changed most servers to using UDP and it seems to hang less, but it will
> > > > still eventually hang.
> > > > 
> > > > # mount -o proto=udp /export/www
> > > > # df -h
> > > > x4500-01-vip:/export/www
> > > > 984G    73G   901G     8%    /export/www
> > > > 
> > > > 
> > > > Successful mount proto=udp snoop:
> > > > 
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> > > > Seq=4284552307 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> > > > Seq=4284552307 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1161480443
> > > > Len=0 Win=49640
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Syn Ack=1118215538
> > > > Seq=4284552306 Len=0 Win=49640 Options=<mss 1460,nop,wscale
> > > > 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Ack=1118215538
> > > > Seq=4284552307 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
> > > > vers=3 proto=UDP
> > > > 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
> > > > 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
> > > > 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
> > > > 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
> > > > 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
> > > > 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> > > > proto=UDP
> > > > 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> > > > 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> > > > 172.20.12.221 -> 172.20.12.16 NFS R NULL3
> > > > 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> > > > proto=UDP
> > > > 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> > > > 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> > > > 172.20.12.221 -> 172.20.12.16 NFS R NULL3
> > > > 172.20.12.16 -> 172.20.12.220 NFS C FSINFO3 FH=D502
> > > > 172.20.12.221 -> 172.20.12.16 NFS R FSINFO3 OK
> > > > 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
> > > > 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
> > > > 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
> > > > 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
> > > > 
> > > > 
> > > > Attempt to re-mount using TCP again, for fun
> > > > 
> > > > # umount /export/www
> > > > # mount /export/www
> > > > NFS server x4500-01-vip not responding still trying
> > > > 
> > > > 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
> > > > vers=3 proto=UDP
> > > > 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
> > > > 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
> > > > 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
> > > > 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
> > > > 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
> > > > 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> > > > proto=TCP
> > > > 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Syn Seq=2389376336
> > > > Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Syn Ack=2389376337
> > > > Seq=997480070 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480071
> > > > Seq=2389376337 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376457
> > > > Seq=997480071 Len=0 Win=49520
> > > > 172.20.12.220 -> 172.20.12.16 NFS R NULL3
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480099
> > > > Seq=2389376457 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Fin Ack=997480099
> > > > Seq=2389376457 Len=0 Win=49640
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376458
> > > > Seq=997480099 Len=0 Win=49640
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Fin Ack=2389376458
> > > > Seq=997480099 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480100
> > > > Seq=2389376458 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1240043383
> > > > Len=0 Win=49640
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
> > > > Seq=99287825 Len=0 Win=49640
> > > > 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
> > > > Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> > > > 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
> > > > Seq=99287825 Len=0 Win=49640
> > > > 
> > > > 
> > > > So, TCP is hung until reboot. If I reboot the NFS client it will mount
> > > > TCP just fine again. When both UDP and TCP have hung there is nothing I
> > > > can do to make it mount. We never reboot the x4500's.
> > > > 
> > > > So, of the 40-odd NFS clients, we have to reboot about 6 every day,
> > > > which is getting tedious, and worse than that, we do not always notice
> > > > immediately that one is stuck.
> > > > 
> > > > We have put Solaris 10 10/08 on some NFS clients as well, but it is too
> > > > early to know if it fixes anything. We will most likely also try 10/08
> > > > on the x4500, but that is a much larger task.
> > > > 
> > > > Are there any NFS related patches we should explore?
> > > > 
> > > > Sorry for the length of this email; I wanted to include as much detail
> > > > as possible and show I have tried most things in an attempt to discover
> > > > where the trouble lies.
> > > > 
> > > > Other Google results hinted at running out of secure ports, but netstat
> > > > shows no indication of that as far as I can tell. No entries for the
> > > > hung NFS client on the x4500. The NFS client has a relatively small
> > > > netstat -na, with the exception of 47 entries for "stream-ord".
> > > > 
> > > > 
> > > > We would appreciate any feedback on this issue, thank you.
> > > > 
> > > > 
> > > > Lund
> > > > 
> > > > 
> > > > *** Random commands while mount is hung:
> > > > 
> > > > # showmount -e x4500-01-vip
> > > > export list for x4500-01-vip:
> > > > /export/mail    @172.20.12,@172.20.15
> > > > /export/www     @172.20.12,@172.20.15
> > > > /export/dovecot @172.20.12,@172.20.15
> > > > 
> > > > 
> > > > # rpcinfo -m x4500-01-vip
> > > > PORTMAP (version 2) statistics
> > > > NULL    SET     UNSET   GETPORT         DUMP    CALLIT
> > > > 0       0/0     0/0     1503694/1503838 0       0/0
> > > > 
> > > > PMAP_GETPORT call statistics
> > > > prog            vers    netid     success       failure
> > > > nlockmgr        4       udp       4342          0
> > > > status          1       tcp       2             0
> > > > nlockmgr        2       udp       42            0
> > > > nlockmgr        4       tcp       1433764       0
> > > > nfs             3       udp       346           0
> > > > nfs             3       tcp       400           0
> > > > status          1       udp       49            0
> > > > mountd          1       udp       79            2
> > > > mountd          1       tcp       11            2
> > > > mountd          3       udp       654           113
> > > > rquotad         1       udp       64001         23
> > > > metad           2       tcp       3             0
> > > > smserverd       1       tcp       0             1
> > > > smserverd       1       udp       0             1
> > > > 300598          1       udp       1             1
> > > > 300598          1       tcp       0             1
> > > > 
> > > > RPCBIND (version 3) statistics
> > > > NULL    SET     UNSET   GETADDR DUMP    CALLIT  TIME    U2T     T2U
> > > > 0       0/0     0/0     2/2     0       0/0     0       0       0
> > > > 
> > > > RPCB_GETADDR (version 3) call statistics
> > > > prog            vers    netid     success       failure
> > > > status          1       ticotsord 1             0
> > > > 100133          1       ticotsord 1             0
> > > > 
> > > > RPCBIND (version 4) statistics
> > > > NULL    SET     UNSET   GETADDR DUMP    CALLIT  TIME    U2T     T2U
> > > > 0       99/99   115/115 1/2     0       0/0     0       0       0
> > > > VERADDR INDRECT GETLIST GETSTAT
> > > > 0       0       1       1
> > > > 
> > > > RPCB_GETADDR (version 4) call statistics
> > > > prog            vers    netid     success       failure
> > > > smserverd       1       ticlts    1             1
> > > > 
> > > > 
> > > > # rpcinfo -T tcp x4500-01-vip 100005 3
> > > > program 100005 version 3 ready and waiting
> > > > 
> > > > 
> > > > 
> 

-- 
Jorgen Lundman       | <lundman@lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
_______________________________________________
nfs-discuss mailing list
nfs-discuss@opensolaris.org

