List:       haproxy
Subject:    Re: where to start with 503 errors
From:       Willy Tarreau <w () 1wt ! eu>
Date:       2010-07-30 15:42:23
Message-ID: 20100730154223.GD13907 () 1wt ! eu

Hi Matt,

On Wed, Jul 28, 2010 at 12:29:15PM -0600, Matt Banks wrote:
> OK, this is somewhat funny, but I'm mostly done with this email and a VERY
> similar sounding problem was just asked a few minutes ago...
> 
> All,
> 
> Long story short(ish):
> 
> We put haproxy in front of a few servers that generate dynamic pages from a
> database.  Here's a crude description of the setup:
> 
> HAProxy -> 2 to 10 Apache servers -> Gateway (connection to db) ->
> Local caching database server ---(LAN or WAN)-> Database
> 
> The point is that if the page is cached, the local caching db server will
> reply very fast.  If not, it may take a few seconds to respond.

Those are precisely multiples of 3 seconds, I guess?

> We've also found that we basically HAVE to use keep alive (eg loading an
> image takes well under a second to load without HAProxy and perhaps .5 to
> 1.5 seconds with keepalive on whereas with keepalive off, the same image on
> the same page takes 12-18 seconds) if that makes a difference.

Yes, with keep-alive you have one session; without it, you have many.
Losing a SYN or a SYN/ACK when establishing a connection implies a
3-second retransmit delay. So with keep-alive disabled, each object
comes in a separate session, causing more connection establishments
and thus amplifying the retransmission delay.
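To put rough numbers on that amplification, here is a back-of-envelope sketch (the 3-second figure is the classic initial SYN retransmission timeout; the 5% loss rate and the 20-object page are made-up illustration values, not measurements from your site):

```shell
# Expected extra handshake delay caused by losing a SYN or SYN/ACK.
# $1 = per-packet loss rate, $2 = number of TCP connections for the page.
handshake_penalty() {
  awk -v loss="$1" -v conns="$2" 'BEGIN {
    rto = 3.0                      # classic initial SYN retransmit timeout (s)
    p   = 1 - (1 - loss) ^ 2       # either the SYN or the SYN/ACK is lost
    printf "%.2f\n", conns * p * rto   # counting the first retransmit only
  }'
}

handshake_penalty 0.05 20   # keep-alive off: 20 handshakes -> 5.85 s expected
handshake_penalty 0.05 1    # keep-alive on: 1 handshake -> 0.29 s expected
```

The model only counts the first retransmit per connection; real worst cases (several lost retries at 3, 6, 12 s...) explain the 12-18 second loads you saw.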

> Here's where things get a bit... tricky?
> 
> We have httpcheck disabled.  This is essentially because it's not working
> for us - at least how we'd like it to be.  In a nutshell, we're getting a
> LOT of false positives where a server is listed as "up going down" or down
> when in reality, a non-cached page was simply taking a couple seconds
> (probably 3-5 but definitely less than 10) to load.

This is also typical of high packet loss rate.

> The point is, we get several 503 errors throughout the day.  And they
> appear to be random.  Apache never goes down nor reports an error.
> Frankly, I think what's happening is that haproxy is hitting a server which
> takes too long to respond, so it tries another server (which also doesn't
> have the page cached) and goes through the list until it gives up and
> reports a 503.

In my opinion, what is happening is that something is causing connections
to fail between haproxy and the servers (since health checks fail too).
There are two common causes for this:

  - a network card connected to a forced 100-full switch. Almost all
    gigabit cards will negotiate 100-half if the switch does not advertise
    anything, causing a huge packet loss rate. You can easily check on
    your server using ethtool:

       ethtool eth0

  - a mis-configured netfilter which remains enabled on the haproxy
    machine (the default settings of the conntrack table are too small
    to support a moderate load). You can see messages like "conntrack
    table is full" in "dmesg". Just in case, you should completely
    unload the nf_conntrack / ip_conntrack modules from the machine.
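To check both points quickly, something like this can help (the ethtool output below is a captured sample so the parsing is visible without root; on a real machine pipe `ethtool eth0` into it, and `eth0` is just an example interface name):

```shell
# Duplex check: a NIC stuck at half duplex shows up immediately.
check_duplex() {
  awk -F': ' '/Duplex/ { print ($2 == "Full" ? "duplex ok" : "DUPLEX MISMATCH") }'
}

# Captured sample of a mis-negotiated port (no root needed to demo the parsing):
printf '\tSpeed: 100Mb/s\n\tDuplex: Half\n' | check_duplex   # -> DUPLEX MISMATCH

# On the real machine:
#   ethtool eth0 | check_duplex
#   dmesg | grep -i conntrack         # look for "table full, dropping packet"
#   lsmod | grep -E 'nf_conntrack|ip_conntrack'
```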

You could also try to run an FTP test from/to the haproxy machine. You
should easily be able to saturate the port when transferring large files
(approx 11800 kB/s on 100 Mbps, 118 MB/s on 1 Gbps). Any significantly
lower value indicates a communication problem. This will show you where
the network runs well and where it runs poorly. Sometimes this is as
simple as a broken NIC, wire or switch port (the latter has happened to
me several times).
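Those expected figures fall straight out of the Ethernet framing overhead; a rough sketch (assuming a standard 1460-byte MSS over untagged Ethernet):

```shell
# Expected TCP goodput on a saturated link: payload bytes over wire bytes.
# Per-frame overhead: preamble+SFD 8 + Ethernet header 14 + FCS 4 +
# interframe gap 12 + IP 20 + TCP 20 = 78 bytes around each 1460-byte MSS.
goodput_mb() {  # $1 = link rate in bits/s; prints MB/s
  awk -v bps="$1" 'BEGIN { printf "%.1f\n", bps / 8 * 1460 / (1460 + 78) / 1e6 }'
}

goodput_mb 100e6   # -> 11.9  (MB/s on 100 Mbps)
goodput_mb 1e9     # -> 118.7 (MB/s on 1 Gbps)
```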

> Meanwhile, if you go directly to the page on the Apache server, it loads
> fine.  Or if you re-load using HAProxy, it works fine as well.
> 
> I'm just wondering where to start with this.  We have several sites
> experiencing the same problem, but since we're using roughly the same setup
> for each one, I'm not opposed to saying it could be how we have HAProxy set
> up.

There is no particular reason your config could cause such things to
happen, and it definitely could not cause the checks to fail randomly.
That's why I'm suggesting environment issues, which are a very recurring
concern.

Regards,
Willy

