'Freezing Alphas'


List:       tru64-unix-managers
Subject:    Freezing Alphas
From:       Dan Jacobson <danj () gdb ! org>
Date:       1995-02-28 1:59:05




We have a number of Alpha 3000s running OSF/1 v2.0.  One of them has started to
lock up and freeze several times a day, and the only recovery is a power-cycle :-(.

One of these machines is running a Web server (NCSA httpd 1.3), and is serving a
number of databases via a Web -> freewais-sf-1.0 interface.  It ran fine from
the time it arrived about 10 months ago until the past week.

Early on it used to run out of swap from time to time but switching to lazy
swap seemed to fix that.  I've also implemented the -terminate patches to stop
the X memory leak. 

Over the past week, and especially in the last two days, it has started to lock up
and freeze; it now does so several times a day.  The freezes seem to be triggered
more readily by anything that's CPU- and I/O-intensive, like a waisindex.

I checked the archives and the only thing similar that I could find was the following:

> We've been having quite a time with zombies on our DEC 3000/400 running
> OSF/1 2.0. We didn't seem to have this bad of a problem under 1.3. We've
> been running 2.0 for several months now, and with the increased user load
> it's causing some problems. First of all, they build up, you can't kill 
> them...which is really annoying. Second of all, they semi-frequently freeze
> the machine. It just totally freezes up. I've been told by a colleague that
> this is because the system tries to kill them and it hits the wall. I was
> also told that we shouldn't "upgrade to 2.1 or 3.0 because it was even worse
> on those versions"... Any opinions, comments, etc?


I could not find a follow-up or summary to this query.  Has anybody else seen this
sort of problem?  If so, what's the solution?  Is this an OSF/1 problem or is
there a potential (known) hardware problem?


ps ax yields the following:


  PID TT  S           TIME COMMAND
    0 ??  R <     35:14.38 [kernel idle]
    1 ??  I        0:00.22 /sbin/init -a
    2 ??  I        0:00.00 [exception hdlr]
   14 ??  I        0:00.88 /sbin/update
  114 ??  S        0:00.21 /usr/sbin/syslogd
  116 ??  I        0:00.04 /usr/sbin/binlogd
  179 ??  I        0:00.07 /usr/sbin/portmap
  181 ??  U        0:00.02 /usr/sbin/nfsiod 7
  182 ??  U        0:00.00 /usr/sbin/nfsiod 7
  183 ??  U        0:00.00 /usr/sbin/nfsiod 7
  184 ??  U        0:00.00 /usr/sbin/nfsiod 7
  185 ??  U        0:00.00 /usr/sbin/nfsiod 7
  186 ??  U        0:00.00 /usr/sbin/nfsiod 7
  188 ??  U        0:00.00 /usr/sbin/nfsiod 7
  229 ??  I        0:00.10 -accepting connections (sendmail)
  274 ??  I        0:00.06 /usr/sbin/mold
  277 ??  I        0:00.24 /usr/sbin/internet_mom
  286 ??  I        0:00.07 /usr/sbin/snmp_pe
  294 ??  I        0:00.73 /usr/sbin/inetd
  299 ??  I        0:00.08 /usr/sbin/cron
  316 ??  I        0:00.02 /usr/lbin/lpd
  335 ??  I        0:00.05 /usr/bin/X11/xdm -config /usr/lib/X11/xdm/xdm-config
  339 ??  S        0:12.95 /usr/bin/X11/X -terminate -auth /usr/lib/X11/xdm/A:0-aaakpa
  340 ??  S        0:04.95 -:0 (xdm)
  351 ??  I        0:00.22 /usr/bin/X11/dxconsole -geometry 480x150-0-0 -daemon -nobuttons -v
  352 ??  S        0:01.12 telnetd
  365 ??  S        0:03.50 ./httpd -d .
  636 ??  I        0:00.15 telnetd
  672 ??  I        0:00.13 rshd
  673 ??  I        0:01.84 /usr/sbin/rdump 9ubdsf 64 45434 13000 backer.gdb.org /dev/rmt/0un
  675 ??  S        0:04.62 /usr/sbin/rdump 9ubdsf 64 45434 13000 backer.gdb.org /dev/rmt/0un
  676 ??  U        0:28.44 /usr/sbin/rdump 9ubdsf 64 45434 13000 backer.gdb.org /dev/rmt/0un
  677 ??  S        0:27.22 /usr/sbin/rdump 9ubdsf 64 45434 13000 backer.gdb.org /dev/rmt/0un
  678 ??  S        0:26.24 /usr/sbin/rdump 9ubdsf 64 45434 13000 backer.gdb.org /dev/rmt/0un

  819 ??  I        0:00.09 ./httpd -d .
  838 ??  I        0:00.09 ./httpd -d .
  338 co  I  +     0:00.07 /usr/sbin/getty console console vt100
  353 p1  S        0:00.89 -csh (csh)
  859 p1  R  +     0:00.20 ps ax
  637 p2  I  +     0:00.30 -csh (csh)



Now as I look at this I see rdump and get a bit suspicious, as I've seen lots of
horror stories on this list about rdump, and this rdump is going over to a Sun
.....  As I sit here doing swapon -s I watch swap usage grow and grow and grow:


swapon -s
Total swap allocation:
    Allocated space:        16384 pages (128MB)
    Reserved space:         13920 pages ( 84%)
    Available space:         2464 pages ( 15%)

Swap partition /dev/rz3b:
    Allocated space:        16384 pages (128MB)
    In-use space:           13920 pages ( 84%)
    Free space:              2464 pages ( 15%)


In-use space was down at 7% a few minutes ago - after the last power-cycle reboot.



and now a few minutes later swap is back down:


Total swap allocation:
    Allocated space:        16384 pages (128MB)
    Reserved space:          1659 pages ( 10%)
    Available space:        14725 pages ( 89%)

Swap partition /dev/rz3b:
    Allocated space:        16384 pages (128MB)
    In-use space:            1659 pages ( 10%)
    Free space:             14725 pages ( 89%)




All of this about rdump and swap may be grasping at straws but I've run out of
ideas and have a hunch that swap is involved.  Any ideas??????
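
One way to test the swap hunch would be to log swap usage at a fixed interval
and compare the log against the times rdump or waisindex are running.  The
following is only a rough sketch: it assumes a POSIX-style sh and awk, and that
swapon -s prints the "Total swap allocation" layout shown above.  The function
names swap_pages and log_swap are made up for illustration.

```shell
#!/bin/sh
# Read `swapon -s`-style text on stdin and print "reserved total" in pages.
# Only the first "Allocated space:" and "Reserved space:" lines are used,
# i.e. the totals block, not the per-partition blocks that follow.
swap_pages() {
    awk '
        /Allocated space:/ && total == "" { total = $3 }
        /Reserved space:/  && used  == "" { used  = $3 }
        END { if (total != "" && used != "") print used, total }
    '
}

# Hypothetical polling loop: print one timestamped sample every N seconds
# (default 30).  Runs until interrupted.
log_swap() {
    interval=${1:-30}
    while :; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$(swapon -s | swap_pages)"
        sleep "$interval"
    done
}
```

Something like "log_swap 30 >> /tmp/swap.log &" started just before the dump
would leave a record that survives up to the last sample before a freeze.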



Regards,


Dan Jacobson

danj@gdb.org

Johns Hopkins University
