List:       linux-virtual-server
Subject:    Re: Major issue with LVS-DR when a server gets overloaded
From:       Roberto Nibali <ratz () drugphish ! ch>
Date:       2007-02-16 10:38:10
Message-ID: 45D58992.6040700 () drugphish ! ch

Hello,

>> Either a massive bug in the ServerIron Firmware or a configuration 
>> glitch on your side. Care to post the relevant part of the configuration?
> 
> In the ServerIron, each of the 6 real servers looks like this :
> 
> server real server01.domain x.x.x.41
>  port default disable
>  weight 10 0
>  port http
>  port http keepalive
>  port http url "GET /alarm/"

And this automatically gets /alarm/index.php as per configuration on 
your lighttpd server?

>  port http status_code  200 299
> !
> And the virtual server :
> 
> server virtual virtual.domain x.x.x.225
>  port default dsr
>  port http sticky
>  port http dsr
>  bind http server01.domain http server02.domain http server03.domain
> http server04.domain http
>  bind http server05.domain http server06.domain http
> !

I don't remember the FSM on the ServerIron hardware exactly, and
unfortunately these days one does not get access to their documentation
anymore without a KP id :(. However, your configuration looks pretty
straightforward and should definitely work. I'm just not sure whether
the ServerIron OS distinguishes between no HTTP response at all and an
HTTP response with an unexpected status code.

What happens if you modify your PHP health check status script to
actually return code 500 for all HTTP requests? Do any of the RS still
get set to up, either with the ServerIron or the LVS?
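
To make the distinction I am asking about concrete, here is a minimal
Python sketch that separates "no HTTP response" from "HTTP response with
an unexpected status". The function name classify_check and its three
return labels are made up for illustration; this is not how the
ServerIron or keepalived implement their checks, only a way to test the
real servers by hand.

```python
import http.client


def classify_check(host, port, path="/alarm/", expected=(200,), timeout=5.0):
    """Run one HTTP GET health check and classify the outcome.

    Returns "up" for an expected status code, "bad_status" for any other
    HTTP answer, and "no_response" if no valid HTTP answer arrived.
    All names here are hypothetical and chosen for this example only.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Connection refused, connect or read timeout, or a malformed reply.
        return "no_response"
    finally:
        conn.close()
    return "up" if status in expected else "bad_status"
```

Running it against each RS while the PHP script is forced to return 500
should give "bad_status" rather than "no_response". If a balancer treats
the two cases differently, this is where it would show.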

> The similar configuration with LVS (using keepalived) :

I'm not too familiar with the inner workings of keepalived, so maybe 
Alexandre should have a look at this as well.

> virtual_server x.x.x.229 80 {
>     delay_loop 6

This seems pretty short, considering you've 6 RS to check.

>     lb_algo rr
>     lb_kind DR
>     persistence_timeout 30
>     protocol TCP
> 
>     real_server x.x.x.41 80 {
>         weight 10
>         HTTP_GET {
>             url {
>                 path /alarm/
>                 status_code 200
>             }
>             connect_timeout 5
>             nb_get_retry 2
>             delay_before_retry 5
>         }
>     }
> 
> ! etc. for all other 5
> 
> }
> 
>> How exactly do you get your RS to dynamically switch from HTTP response 
>> code 200 to 500? Have you checked the HTTP response header using a CLI 
>> tool like curl, lynx or wget?
> 
> Various ways. I'm using lighttpd with PHP as FastCGI, so by checking
> a /alarm/index.php script :
> - I get a 500 from lighttpd if the PHP backend is overloaded or dead
> And right now I've extended this PHP script to keep sending 500s in
> more situations, in order to avoid "flip-flopping" :
> - I get a 500 from the script if the main db connection is down
> - I get a 500 from the script if the server's 1min avg load is > 20

So what happens if you shut down all your DBs and restart keepalived? 
What does the ipvsadm -Ln output look like?

> I've checked with "curl -I" and get the status I expect in every case.

Ok.

>>> I would like to have tried some kind of "keep the real server disabled
>>> for n seconds when it's detected as down" in order to keep the check
>>> from flip-flopping like this, but there is no such setting in
>>> keepalived AFAICS.
>> Would it be possible and good enough for you to use the threshold 
>> limitation feature by setting an upper and lower threshold for the 
>> amount of active + inactive connections?
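
The idea behind an upper and a lower threshold is plain hysteresis: a
server stops receiving new connections once its count exceeds the upper
bound and only resumes once the count has dropped below the lower bound,
which dampens flip-flopping. Below is a small Python model of that
behaviour. The class ThresholdGate and its method names are hypothetical,
and the exact comparisons used by the IPVS code may differ; treat it as a
sketch of the principle, not of the kernel implementation.

```python
class ThresholdGate:
    """Hysteresis on a connection count with an upper and a lower bound.

    Hypothetical model for illustration; not taken from IPVS itself.
    """

    def __init__(self, upper, lower):
        if not 0 <= lower < upper:
            raise ValueError("need 0 <= lower < upper")
        self.upper = upper
        self.lower = lower
        self.accepting = True

    def update(self, connections):
        """Feed the current active + inactive connection count.

        Returns True while new connections may still be scheduled to
        this server, False while it is considered overloaded.
        """
        if self.accepting and connections > self.upper:
            self.accepting = False  # above the upper bound: stop scheduling
        elif not self.accepting and connections < self.lower:
            self.accepting = True   # below the lower bound: resume
        return self.accepting


gate = ThresholdGate(upper=100, lower=60)
states = [gate.update(n) for n in (50, 101, 90, 70, 59, 90)]
# states == [True, False, False, False, True, True]
```

Note how 90 connections leave the server disabled on the way down but
enabled on the way up; that gap between the two bounds is what keeps the
state from toggling on every small change.
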
> 
> I've got a bit more information after running LVS for the past weeks
> (without sending any real traffic to the virtual server IP address,
> though, I use the ServerIron's virtual IP address currently). I keep
> getting read timeouts from keepalived, so at a higher level it seems
> that there already is an issue. The ServerIron reports no similar
> timeouts against the same servers, which are running fine.

Health check read timeouts?

> Anyhow, this is something I definitely need to fix before digging any
> more about the LVS issue I reported initially.

Fair enough. Good luck,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
_______________________________________________
LinuxVirtualServer.org mailing list - lvs-users@LinuxVirtualServer.org
Send requests to lvs-users-request@LinuxVirtualServer.org
or go to http://www.in-addr.de/mailman/listinfo/lvs-users