
List:       log
Subject:    Re: What makes svstat think a service is down?
From:       prj () po ! cwru ! edu (Paul Jarc)
Date:       2002-05-31 23:36:06

"Hubbard, David" <dhubbard@dino.hostasaurus.com> wrote:
> I linked from /service and everything started back up but
> readproctitle complained about the lock files.

There's normally no way of knowing how old readproctitle's error
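messages are.  You can set up a service with a run script like this to
clear the messages:

#!/bin/sh
# Fill readproctitle's display with dots to push out old messages.
echo ...............................................
# Then print a timestamp so that later messages can be dated.
date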

Give it a "down" file and use svc -o to run it once.
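
As a rough sketch of that setup (not a tested recipe), the service
directory could be built like this.  The function name
make_clear_service and any directory you pass to it are hypothetical;
only the run script itself, the "down" file, and svc -o come from the
advice above.

```shell
#!/bin/sh
# Sketch only: create a run-once service directory at the given path.
# make_clear_service is a hypothetical helper name, not part of daemontools.
make_clear_service() {
    dir=$1
    mkdir -p "$dir" || return 1
    # The run script is the one suggested in this message.
    cat > "$dir/run" <<'EOF'
#!/bin/sh
echo ...............................................
date
EOF
    chmod 755 "$dir/run" || return 1
    # The "down" file keeps supervise from starting the service on its own.
    touch "$dir/down"
}
```

After calling it on a directory of your choice and linking that
directory into /service (say as /service/clear, a made-up name), each
"svc -o /service/clear" should run the script once.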

Alternatively, you can use svclean so that service errors will go to a
supervised multilog: <URL:http://multivac.cwru.edu./svclean/>.

> I then went to /var/qmail/supervise and just did a
>
> rm -r */supervise */log/supervise

Bad idea.  That made it impossible for the new supervises to tell
whether there was an old supervise already running.  You only silenced
the error message; you didn't fix the problem reported in the error
message (if there was one).

> The problem is that somewhere between 354 seconds and maybe
> 480 seconds (8 minutes) at most, the next time I ran
> svstat /service/qmail-send, svscan goes from reporting the
> processes as being up to being down, even though that is not
> the case.

That means that the currently running supervise did not spawn the
currently running qmail-send.  Maybe supervise was killed.  In that
case, svscan would restart supervise, and the new supervise would try
to start qmail-send, but qmail-send wouldn't be able to lock
/var/qmail/queue/sendmutex, so it would exit immediately.
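
One way to check this, sketched under the assumption that your ps
accepts the POSIX -o and -p options: look up the parent of the running
qmail-send and see which supervise it really belongs to.  The helper
name parent_of is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the parent PID of the given process.
# Assumes a ps that supports the POSIX "-o ppid=" and "-p" options.
parent_of() {
    ps -o ppid= -p "$1" | tr -d ' '
}
```

For example, "parent_of 23483" would print the pid of the process that
spawned the qmail-send from your svstat output.  If that parent is not
the supervise that svscan is currently running for the service, you
are in the situation described above.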

> Additionally, it reports the downtime as much longer than possible,
> for example:
>
> root@/# svstat /service/qmail-send
> /service/qmail-send: up (pid 23483) 354 seconds
> root@/# svstat /service/qmail-send
> /service/qmail-send: down 34047 seconds, normally up

It appears that there are two supervises running.  You need to improve
your process hunting skills or reboot.  I suspect this was caused by
removing the supervise directories.
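
As a starting point for the hunt, here is a small sketch that reads ps
output and reports any service name with more than one supervise.  It
assumes each supervise shows up in ps as "supervise NAME"; log
supervises all appear as "supervise log", so they are skipped and
would need to be checked by hand.  The function name dup_supervise is
hypothetical.

```shell
#!/bin/sh
# Sketch only: read "pid ppid command args..." lines on stdin and print
# each service name that has more than one supervise, with the count.
# "supervise log" is ignored because every logged service has one.
dup_supervise() {
    awk '$3 == "supervise" && $4 != "log" { n[$4]++ }
         END { for (s in n) if (n[s] > 1) print s, n[s] }'
}
```

Feed it with something like "ps -e -o pid= -o ppid= -o args= |
dup_supervise" (adjust the ps options for your system).  Any name it
prints has an extra supervise that needs to be found and stopped.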


paul