List:       linux-ha
Subject:    Re: [Linux-HA] Spurious serial write timeouts
From:       Mike Dahmus <mdahmus () netbotz ! com>
Date:       2004-03-30 21:52:41
Message-ID: 1080683560.4945.16.camel () mdahmus ! netbotz ! com

On Wed, 2004-03-24 at 10:38, Mike Dahmus wrote:
> On Wed, 2004-03-24 at 10:18, Alan Robertson wrote:
> > Mike Dahmus wrote:
> > > We're prototyping a two-machine cluster whose sole purpose is to serve
> > > one IP address. They're connected right now over a serial cable at 9600+
> > > baud (I've tried various speeds) with all of the expected signals on the
> > > cable (output below). I'm getting intermittent spurious serial write
> > > timeouts and am curious if anybody has any additional debugging
> > > suggestions (archives indicated to run heartbeat with various debugging
> > > flags which I've been unable to locate as of yet).
> > > 
> > > heartbeat: 2004/03/24_08:59:05 WARN: TTY write timeout on [/dev/ttyS0]
> > > (no connection or bad cable? [see documentation])
> > > 
> > > Both machines are running variants of 2.4.20 kernel.
> > 
> > What version of heartbeat are you running?  Red Hat put out some buggy 
> > 2.4.20 kernels...   And, when you say "intermittent spurious", how often 
> > does this occur?
> 
> Hi Alan,
> 
> heartbeat is heartbeat-1.2.0-1.rh.9
> 
> And "intermittent" means it usually happens relatively soon after
> startup but not always. 
> 
> > > And both machines respond with the following signals when I cat
> > > /proc/tty/driver/serial:
> > > 
> > >  RTS|CTS|DTR|DSR|CD
> > 
> > 
> > Now that's interesting...
> > 
> > Why it's getting this result, I can't say.
> > 
> > I can tell you what's causing the timeout from the low level...
> > 
> > Heartbeat is getting blocked for too long while doing the write.
> > 
> > This could also be caused by too high a message rate (like 100ms), driver 
> > bugs, or broken or buggy hardware - or maybe (but less likely) a broken 
> > scheduler.
> > 
> > 9600bps should be able to handle up to ~5 messages/second - assuming things 
> > are working correctly.
> 
> Well, like I said, I've tried higher values (19200 at first).
> 
> > To be conservative, make sure you're not sending more than 2 a second.  Or 
> > set keepalive to 1 or greater to make really sure.
> > 
> > You can start heartbeat with -d, or better yet, send 4 or 5 SIGUSR1 signals 
> > to the write process that's writing to your tty port.  This will be much 
> > more selective in its debug - and high-level debug flags are very verbose 
> > when applied to all of heartbeat.
> 
> Will do.
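
For reference, the "~5 messages/second at 9600bps" estimate quoted above can be sanity-checked with a little arithmetic. This is only a rough sketch: it assumes 8N1 framing (10 bits on the wire per byte), which nobody in this thread has confirmed, and it uses the 129-byte packet size reported in the debug output later in this message. The function names are made up for illustration.

```python
# Back-of-the-envelope serial throughput check.
# Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop bit per byte.
BITS_PER_BYTE_ON_WIRE = 10


def wire_time_seconds(packet_bytes: int, baud: int) -> float:
    """Time needed to clock one packet out of the UART at the given baud rate."""
    return packet_bytes * BITS_PER_BYTE_ON_WIRE / baud


def max_packets_per_second(packet_bytes: int, baud: int) -> float:
    """Theoretical ceiling with the line fully busy and no other traffic."""
    return 1.0 / wire_time_seconds(packet_bytes, baud)


if __name__ == "__main__":
    for baud in (9600, 19200):
        t = wire_time_seconds(129, baud)
        ceiling = max_packets_per_second(129, baud)
        print(f"{baud} baud: {t * 1000:.0f} ms per packet, "
              f"ceiling {ceiling:.1f} packets/s")
```

Under those assumptions a 129-byte packet needs roughly 134 ms at 9600 baud, so the theoretical ceiling is a bit over 7 packets per second, which is in line with the ~5/second figure once some headroom is allowed. With one packet every 2 seconds the line should be idle most of the time.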

Alan,

Here's some additional data after a week of more testing:

The serial timeout happens even when we use a USB-Serial-Serial-USB
path. The warning is printed once every hour, and if serial is the only
remaining path available, it results in an unnecessary failover
immediately (well, the next time the heartbeat is tested after the
alternate communications path is removed; we're testing with a crossover
cable as well).

And as shown in my last correspondence, the debug output (from a normal
serial device path; not the USB one) looks like this:

heartbeat: 2004/03/24_10:53:32 debug: >>>
heartbeat: 2004/03/24_10:53:32 debug: serial write returned 129
heartbeat: 2004/03/24_10:53:34 debug: Sending pkt to /dev/ttyS0 [129 bytes]
heartbeat: 2004/03/24_10:53:34 debug: >>>
heartbeat: 2004/03/24_10:53:34 debug: serial write returned 129
heartbeat: 2004/03/24_10:53:36 debug: Sending pkt to /dev/ttyS0 [128 bytes]
heartbeat: 2004/03/24_10:53:36 debug: >>>
heartbeat: 2004/03/24_10:53:36 debug: serial write returned 120
heartbeat: 2004/03/24_10:53:38 debug: Sending pkt to /dev/ttyS0 [129 bytes]
heartbeat: 2004/03/24_10:53:38 debug: >>>
heartbeat: 2004/03/24_10:53:38 debug: serial write returned 129
heartbeat: 2004/03/24_10:53:40 debug: Sending pkt to /dev/ttyS0 [129 bytes]
heartbeat: 2004/03/24_10:53:40 debug: >>>
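
In case it helps anyone sifting through a longer capture, here is a small sketch of how the debug output above could be scanned for writes that returned fewer bytes than were requested. It relies only on the two message shapes visible in the excerpt ("Sending pkt to ... [N bytes]" and "serial write returned N"); whether a short count actually relates to the timeout is not established here. The function name and regular expressions are my own, not part of heartbeat.

```python
import re

# Patterns based solely on the two debug message shapes shown in the excerpt.
SEND_RE = re.compile(r"Sending pkt to (\S+) \[(\d+) bytes\]")
RET_RE = re.compile(r"serial write returned (-?\d+)")


def find_short_writes(log_text):
    """Return (device, requested, written) for each write whose count differs."""
    # Tolerate the line wrap a mail client may insert before "bytes]".
    text = re.sub(r"\[(\d+)\s*\n\s*bytes\]", r"[\1 bytes]", log_text)
    pending = None  # (device, requested) from the most recent "Sending pkt"
    mismatches = []
    for line in text.splitlines():
        sent = SEND_RE.search(line)
        if sent:
            pending = (sent.group(1), int(sent.group(2)))
            continue
        ret = RET_RE.search(line)
        if ret and pending is not None:
            device, requested = pending
            written = int(ret.group(1))
            if written != requested:
                mismatches.append((device, requested, written))
            pending = None
    return mismatches


if __name__ == "__main__":
    sample = (
        "heartbeat: 2004/03/24_10:53:34 debug: Sending pkt to /dev/ttyS0 [129\n"
        "bytes]\n"
        "heartbeat: 2004/03/24_10:53:34 debug: >>>\n"
        "heartbeat: 2004/03/24_10:53:34 debug: serial write returned 129\n"
        "heartbeat: 2004/03/24_10:53:36 debug: Sending pkt to /dev/ttyS0 [128\n"
        "bytes]\n"
        "heartbeat: 2004/03/24_10:53:36 debug: >>>\n"
        "heartbeat: 2004/03/24_10:53:36 debug: serial write returned 120\n"
    )
    print(find_short_writes(sample))
```

Run against the excerpt, this flags only the 10:53:36 entry, where 128 bytes were requested and 120 were reported written.
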

You mentioned earlier that you believe this may be a result of bugs in
certain Red Hat kernels. Could you be more specific? My colleague here is
willing to try building newer kernels if we know of one that might work.

Regards,
Mike Dahmus

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha