'SUMMARY: Solaris 8 and automount problems'


List:       sun-managers
Subject:    SUMMARY: Solaris 8 and automount problems
From:       Frederick Hall <hall () egr ! msu ! edu>
Date:       2001-02-27 21:49:27

In response to Jacob Charly's post, I'll give you all what I have been
able to gather.  Part of the reason for the long delay is that this is
an intermittent problem.

Thanks to Bryan Hodgson, Thomas Anders, Tuc and Mark Day.  The original
post and suggestions I received follow, with our current status at the
end.

Original post:
> 
> We have just recently upgraded a number of nfs servers and clients from
> Solaris 2.6 to 8.  Now, several times a day, both 2.6 and 8
> clients report "automountd[163]: xxxxx server not responding" or "unix:
> NFS server mailhost not responding"; the mailhost is running Solaris 8.
> If we get on the nfs server and stop and restart nfs.server, it clears
> the problem.  Has anybody else seen this problem, or have a solution?
> 
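Since the messages show up only a few times a day, it may help to tally
them per server from the client logs first.  A minimal sketch, assuming
the messages look like the two forms quoted above; count_nfs_timeouts is
a made-up helper name, and /var/adm/messages is the usual Solaris syslog
file (adjust if yours differs):

```shell
# Count "server ... not responding" messages per server name.
# Reads log lines on stdin and prints "<count> <server>", busiest first.
# Handles both forms quoted in the post:
#   "NFS server mailhost not responding"  (name follows "server")
#   "xxxxx server not responding"         (name precedes "server")
count_nfs_timeouts() {
    awk '
        /server.* not responding/ {
            for (i = 2; i < NF; i++)
                if ($i == "server") {
                    name = ($(i + 1) == "not") ? $(i - 1) : $(i + 1)
                    hits[name]++
                    break
                }
        }
        END { for (s in hits) print hits[s], s }
    ' | sort -rn
}

# Usage (on a client):
#   count_nfs_timeouts < /var/adm/messages
```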

Although the real problem might well be network issues, one thing that
you might check ...

Presumably the problem is intermittent.

Run 'vmstat' on both the problematic servers and clients, and look for
non-zero 'sr' numbers.

I've noted in the past that 2.8 uses considerably more RAM than 2.6; in
particular, in a 64M machine which ran fine under 2.6, simply
installing 2.8 caused enough swapping to bring the machine to its
knees, even without it running any apps.  Excessive swapping could be
slowing responsiveness enough to cause timeouts to occur.
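
The vmstat check suggested above can be left running unattended with a
small filter.  This is only a sketch: check_sr is a made-up helper name,
and it assumes the vmstat header line contains a column labelled 'sr',
as it does on Solaris.

```shell
# Print vmstat sample lines whose 'sr' (scan rate) column is non-zero.
# Reads vmstat output on stdin and finds the column by its header name
# rather than by a fixed position, since the disk columns vary by host.
check_sr() {
    awk '
        # Header line: remember which field is labelled "sr", then skip it.
        { for (i = 1; i <= NF; i++) if ($i == "sr") { col = i; next } }
        # Data line: report it when the sr value is a non-zero number.
        col && $col ~ /^[0-9]+$/ && $col + 0 > 0 { print "sr=" $col ": " $0 }
    '
}

# Usage: sample every 5 seconds and show only the samples with page scanning.
#   vmstat 5 | check_sr
```

A sustained run of non-zero values is the thing to look for; an isolated
one is not necessarily a sign of memory shortage.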

On Jan 26,  8:07, Frederick Hall wrote:
> No, there is no indication of anything wrong on the server.

Have you tried raising the number of nfsd threads in /etc/init.d/nfs.server
on the server(s)?

Do you also get one of these warnings on the NFS server?:

Jan 16 12:58:54 dsapp3 rpcmod: [ID 959652 kern.notice] xdrmblk_getmblk failed
Jan 16 12:58:54 dsapp3 nfssrv: [ID 444088 kern.notice] NOTICE: nfs_server: bad getargs for 3/7

The symptom on the client is an NFS timeout. We're still investigating
whether it's a client or server bug and how to fix it. Patches are
up-to-date, of course.

> > 	1) How much bandwidth is it pumping?
> > 	2) Did you check duplex/speed along EVERY route? (THE BIGGEST CULPRIT
> > 		I EVER SAW)
> > 	3) How's the load average on the mailhost?
> > 
> The servers are all on 100 Mbit, as are most of the clients.  Ports are
> fixed at 100 Mbit/full duplex in /etc/system.  I usually don't have time to
> check the load on the mail server before resetting it (too many users
> screaming to get it up), but it is an Ultra60 with 2x450 CPUs.  That
> could be part of it.
> 
	What are they all doing? 

	Check the "vmstat". You might be swapping too.  Watch the machine,
check sar... Also, with a mailhost, there is a lot of locking.  What
protocol/size are you running? We found that we had a shiteload of problems
with V3/UDP/32K, and a lot of luck with V2/UDP/8K, surprisingly.
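
	For anyone wanting to try the V2/UDP/8K combination on a test mount,
the corresponding mount_nfs options are vers, proto, rsize and wsize.  A
small sketch; nfs_mount_opts is a made-up helper name, and the server
path and mount point in the usage line are hypothetical:

```shell
# Build an NFS mount option string from version, transport and transfer size.
# The same size is used for both rsize and wsize.
nfs_mount_opts() {
    vers=$1
    proto=$2
    size=$3
    echo "vers=${vers},proto=${proto},rsize=${size},wsize=${size}"
}

# Usage on a Solaris client (path and mount point are placeholders):
#   mount -F nfs -o "$(nfs_mount_opts 2 udp 8192)" mailhost:/export/test /mnt
```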

	Didn't think of this before, but mail has a lot of file locking
going on. Wonder if that's killing you.

I wish I could offer you a solution, but all I can say is that I feel
your pain (you actually saved me the trouble of a submission, since
that was going to be my next step).  We went from an Ultra 5/360 running
2.7 to a 220R server (dual 450 MHz processors) running 2.8, and started
experiencing this same problem using the exact same client base.  As
you've noted, restarting nfs.server temporarily relieves the problem. I've
tried a manual mount from a client while the server is in this state, and
it just hangs; restarting nfs.server results in an immediate mount.
Our server has the latest nfsd patch (109783-01) and the latest patch with a
reference to rpcbind (109322-02). I couldn't find any 2.8 patch
that mentioned mountd, which is the other daemon I would suspect might
cause a problem on mount.

My only other comment is that I don't believe that this has anything to
do with automountd.  I think systems that are dependent on mounting the
NFS filesystem are vulnerable to the problems on the server, and therefore
they are just the messenger of this problem.

Please summarize to the list, and let me know if there is anything I
can do to help diagnose the problem.

	1) How much bandwidth is it pumping?
	2) Did you check duplex/speed along EVERY route? (THE BIGGEST CULPRIT
		I EVER SAW)
	3) How's the load average on the mailhost?

Here is what we have done.  I am not sure the problem is 100% fixed, but
it is better.

Added a replica NIS+ server.  The mailhost and NIS+ server were going crazy
authenticating messages for 5000 users.

Increased the nfsd threads from the default of 16 to 32.  This really
helped one of the servers.
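
For reference, the thread count change can be previewed before touching
the start script.  This is a sketch only: set_nfsd_threads is a made-up
helper name, and it assumes /etc/init.d/nfs.server starts the daemon
with a line of the form "/usr/lib/nfs/nfsd -a 16", so check your own
copy first.

```shell
# Print FILE with the nfsd thread count replaced by COUNT.
# Output goes to stdout only; the file itself is not modified.
# Assumption: nfsd is started by a line of the form ".../nfsd -a <number>".
set_nfsd_threads() {
    file=$1
    count=$2
    sed "s|\(/nfsd -a \)[0-9][0-9]*|\1${count}|" "$file"
}

# Usage: review the change as a diff before editing the real script.
#   set_nfsd_threads /etc/init.d/nfs.server 32 | diff /etc/init.d/nfs.server -
```

The new count takes effect only after nfs.server is stopped and started
again.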

Applied patches 109783-01, 108528-06 and 108727-05 on Solaris 8.

One suggestion that we are not going to pursue is switching to static mounts.

Hope this can help someone.

	Fred Hall
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
