'[ovs-discuss] ovsdb-server unkillable, need some help'

List:       openvswitch-discuss
Subject:    [ovs-discuss] ovsdb-server unkillable, need some help
From:       jbachtel () bericotechnologies ! com (Jeff Bachtel)
Date:       2014-02-28 20:00:16
Message-ID: 5310EAD0.5050705 () bericotechnologies ! com

Does anyone have any insight into this? As further data points, I built 
the 2.0 release and much more recent openvswitch snapshots (most 
recently commit bdeadfdd), all of which exhibited the same problems. The 
CentOS 6 kernel is 2.6.32. Because of the presumed incompatibility with 
the Linux bridge module, I made sure bridge.o wasn't being loaded. On a 
host where ovsdb-server had not yet become unresponsive, ovs-vswitchd 
was unkillable, in state R<L. Could my problem be related to vswitchd 
becoming unresponsive under load and taking ovsdb-server with it?
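
For recording what the two daemons look like at the moment they hang, 
here is a minimal sketch that reads the process state and the kernel's 
view of each task from /proc. It assumes a Linux /proc layout, that 
/proc/<pid>/stack is available (it needs root and a kernel built with 
stack trace support), and that the helper names are purely 
illustrative. For a task in state R the stack file may be empty or 
stale, so treat the output as a hint rather than a reliable backtrace.

```python
import os


def parse_stat_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    left, right = stat_line.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def read_optional(path):
    """Read a /proc file, returning a placeholder if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip() or "<empty>"
    except (IOError, OSError) as exc:
        return "<unavailable: %s>" % exc


def find_procs(name):
    """Return [(pid, state)] for every process whose comm equals name."""
    found = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return found  # no /proc on this system
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                comm, state = parse_stat_state(f.read())
        except (IOError, OSError, ValueError, IndexError):
            continue  # process exited while we were scanning
        if comm == name:
            found.append((int(entry), state))
    return found


def report(names=("ovsdb-server", "ovs-vswitchd")):
    """Build a text report with state, wchan and kernel stack per process."""
    lines = []
    for name in names:
        for pid, state in find_procs(name):
            lines.append("%s pid=%d state=%s" % (name, pid, state))
            lines.append("  wchan: " + read_optional("/proc/%d/wchan" % pid))
            lines.append("  stack:")
            stack = read_optional("/proc/%d/stack" % pid)
            lines.extend("    " + row for row in stack.splitlines())
    return "\n".join(lines)
```

Running `print(report())` as root while a daemon is stuck, and again on 
a healthy host, gives two snapshots that can be compared or attached to 
a report.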

I've received further confirmation that load is involved in some way: a 
node inadvertently disconnected from the rest of the Ceph cluster had a 
record uptime with openvswitch. If anyone can give me pointers on 
getting a backtrace, I'm happy to run things until failure and get 
better data. I've had trouble with this, at least as far as using 
strace is concerned. As it is, I've cron'd a restart of openvswitch 
every minute, which is obviously far from ideal.
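
As an alternative to restarting blindly every minute, the cron job 
could first check whether ovsdb-server still answers on its unix socket 
and restart only when it does not, which also leaves a window to 
collect data from a hung process. The sketch below sends the JSON-RPC 
"echo" request defined by the OVSDB protocol and waits for the reply. 
The function name and timeout are hypothetical, and it assumes the 
first message on a fresh connection is the reply to our request.

```python
import json
import socket


def ovsdb_echo(sock_path, timeout=5.0):
    """Return True if an OVSDB server answers a JSON-RPC echo in time."""
    request = {"method": "echo", "params": [], "id": "probe"}
    decoder = json.JSONDecoder()
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(sock_path)
        s.sendall(json.dumps(request).encode("utf-8"))
        buf = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                return False  # server closed the connection without replying
            buf += chunk
            try:
                # The stream has no framing, so parse the first JSON object
                # once enough bytes have arrived.
                reply, _ = decoder.raw_decode(buf.decode("utf-8", "replace"))
            except ValueError:
                continue  # reply not complete yet, keep reading
            return (isinstance(reply, dict)
                    and reply.get("id") == "probe"
                    and reply.get("error") is None)
    except socket.error:
        # Covers connection failures and socket.timeout (a hung server).
        return False
    finally:
        s.close()
```

A cron wrapper could call `ovsdb_echo("/var/run/openvswitch/db.sock")` 
(the socket path is an assumption and depends on how the packages were 
built) and trigger the restart only when it returns False.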

Thanks for any help,

Jeff

On 02/20/2014 12:54 AM, Jeff Bachtel wrote:
> I'm running Open vSwitch 1.11 from the RDO Havana repository. In 
> addition, I'm running OpenStack Havana, Neutron, and Ceph Emperor, all 
> on some CentOS 6.5 machines.
>
> After installing Bacula on the previous OpenStack release (Grizzly), I 
> noticed the networking had become somewhat load sensitive. 
> ovsdb-server was freezing: not responding to queries on its unix 
> socket and becoming unkillable in process state R<. Believing that it 
> was probably due to running an outdated ovs version, I pushed ahead 
> with an upgrade, only to find my stability problems became much worse. 
> Every 20-30 minutes I can count on an ovsdb-server process freezing.
>
> At 
> https://drive.google.com/folderview?id=0B-wx2_T_hW-_OXZJWGJNc0l0MzQ&usp=sharing 
> please find a folder with shared copies of diagnostic files from a 
> machine with a hung ovsdb-server. There is a process list (.ps; 
> apologies, I forgot that is the PostScript extension until the upload 
> was done), strace, dmesg, and /var/log/messages.
>
> The strace didn't reveal anything suspicious to me. To mitigate, I 
> tried lowering log verbosity, completely recreating conf.db, 
> compacting frequently (every minute), and putting the db on a ramdisk. 
> None of these worked.
>
> The ovsdb-server processes most likely to lock up run on Ceph hosts 
> running OSDs, meaning they can see a lot of network traffic as well 
> as disk I/O.
>
> I don't understand what a simple database RPC server could be doing 
> that would cause it to become unkillable, especially with the attempt 
> at minimizing disk i/o by putting the db file on a ramdisk.
>
> I hope someone has some ideas of what I might do to test or mitigate 
> the situation. Not running ceph osd on the hosts is, unfortunately, 
> not a solution I can use.
>
> Thanks,
> Jeff

