List:       ganglia-general
Subject:    [Ganglia-general] gmetad segfaulting frequently with graphite integration enabled
From:       Aaron Nichols <anichols () trumped ! org>
Date:       2012-03-08 3:37:26
Message-ID: CAARfF8MJeBVmZNrUuGFru9kcZYsOM_uOCFx=T7XVnn8xe2fXnA () mail ! gmail ! com

All,
  We have set up ganglia 3.3.1 (RPMs built from the 3.3.1 tarball) and have
it configured to push metrics into graphite. All of this is a new setup:
we're testing it in our dev environment and have about 80 machines pushing
metrics in via gmond. We are using rrdcached with gmetad as well. When
gmetad is configured to send metrics to a carbon server (graphite), we see
two interesting behaviors (the relevant gmetad.conf snippet is pasted after
the list):

1) The ganglia web UI is very slow (10-20 seconds) to load the main grid
view (we only have one grid). However, it is pretty quick if you are
viewing any of the individual clusters or hosts. If we disable sending of
metrics to graphite, OR we shut down carbon-cache.py on the graphite
server, this latency goes away.

2) The gmetad daemon is segfaulting multiple times per day (sometimes
multiple times per hour). Again, when we stop sending metrics to graphite,
this behavior stops. I haven't yet tested whether it also stops when
carbon-cache.py is down.
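
For reference, the graphite-related part of our gmetad.conf looks roughly
like this (the carbon host and the prefix below are sanitized placeholders,
not our real values):

    # forward every metric gmetad aggregates to carbon's line receiver
    carbon_server "x.x.x.x"
    carbon_port 2003

    # string prepended to every metric path sent to carbon
    graphite_prefix "blah.gmetad"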

I ran gmetad in debug mode until the segfault occurred, and the last few
lines before it died were these, which look like it was sending metrics to
graphite:

carbon proxy:: x.x.x.x 46 is ready to receive
Updating host blah.example.com, metric swap_total
Carbon Proxy:: sending 'blah.gmetad.blah.example.com.swap_total 16779884
1331070149
' to x.x.x.x
carbon proxy:: x.x.x.x is ready to receivecarbon proxy:: x.x.x.x is ready
to receive
Updating host blah.example.com, metric diskstat_sdcc_reads
Carbon Proxy:: sending 'blah.gmetad.blah.example.com.diskstat_sdcc_reads
0.000000 1331070138
' to x.x.x.x
carbon proxy:: x.x.x.x is ready to receive
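
As I understand it, each of those "Carbon Proxy:: sending ..." lines is just
carbon's plaintext line protocol: one "<path> <value> <timestamp>\n" record
per metric, written over a TCP connection to carbon-cache's line receiver
(port 2003 by default). A minimal Python sketch of what gmetad is doing on
the wire, with the carbon host as a placeholder, would be something like:

    import socket
    import time

    CARBON_HOST = "x.x.x.x"  # placeholder for our carbon-cache.py box
    CARBON_PORT = 2003       # carbon's default plaintext line-receiver port

    def send_metric(path, value, timestamp=None):
        """Send one metric using carbon's plaintext protocol:
        '<path> <value> <timestamp>\n' over TCP."""
        if timestamp is None:
            timestamp = int(time.time())
        line = "%s %s %d\n" % (path, value, timestamp)
        sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    # e.g. the swap_total sample from the debug output above
    send_metric("blah.gmetad.blah.example.com.swap_total", 16779884, 1331070149)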

The segfault looks like this in /var/log/messages:

Mar  7 20:11:26 blah kernel: gmetad[18445]: segfault at 0 ip
00000000004071ea sp 00007f354f021410 error 4 in gmetad[400000+d000]

Also interesting is that we get this message in our logs about every minute:

Mar  7 20:13:06 blah /usr/sbin/gmetad[19098]: server_thread() -1326024960
unable to write root epilog

I'm about to start hooking gdb up to this, but wanted to ask: is anyone
aware of this issue, or does anyone have suggestions about what might be
causing it?
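
For what it's worth, my rough plan is the following (assuming gmetad is a
non-PIE binary loaded at 0x400000, as the kernel message above suggests, and
that debug symbols are installed):

    # attach to the running daemon and wait for the next crash
    gdb -p $(pidof gmetad)
    (gdb) continue
    # ... after the SIGSEGV fires:
    (gdb) thread apply all bt full

    # or map the kernel's "ip 00000000004071ea ... in gmetad[400000+d000]"
    # offset back to a function/source line:
    addr2line -f -e /usr/sbin/gmetad 0x4071ea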

Thanks,
Aaron

[Attachment #5 (text/html)]

All,<div>  We have setup ganglia 3.3.1, built RPM&#39;s from the 3.3.1 tarball, and \
have it configured to push metrics into graphite. All of this setup is new, we&#39;re \
testing this in our dev environment and have about 80 machines pushing metrics in via \
gmond. We are using rrdcached with gmetad as well. When we have the gmetad configured \
to send metrics to a carbon server (graphite) we experience two interesting \
behaviors:</div> <div><br></div><div>1) The ganglia web UI is very slow (10-20 \
seconds) to load the main grid view (we only have one grid). However, it is pretty \
quick if you are viewing any of the individual clusters or hosts. If we disable \
sending of metrics to graphite OR we shutdown carbon-cache.py on the graphite server, \
this latency goes away. </div> <div><br></div><div>2) The gmetad daemon is \
segfaulting multiple times per day (sometimes multiple times per hour). Again, when \
we stop sending metrics to graphite this behavior stops. I haven&#39;t tested to see \
if this stops if carbon-cache.py is down. </div> <div><br></div><div>I have run \
gmetad in debug mode until the segfault occurred and the last few lines before it \
died were this, which look like it&#39;s sending metrics to \
graphite:</div><div><br></div><div><div>carbon proxy:: x.x.x.x 46 is ready to \
receive</div> <div>Updating host <a \
href="http://blah.example.com">blah.example.com</a>, metric \
swap_total</div><div>Carbon Proxy:: sending \
&#39;blah.gmetad.blah.example.com.swap_total 16779884 1331070149</div><div>&#39; to \
x.x.x.x</div> <div>carbon proxy:: x.x.x.x is ready to receivecarbon proxy:: x.x.x.x \
is ready to receive</div><div>Updating host <a \
href="http://blah.example.com">blah.example.com</a>, metric \
diskstat_sdcc_reads</div><div>Carbon Proxy:: sending \
&#39;blah.gmetad.blah.example.com.diskstat_sdcc_reads 0.000000 1331070138</div> \
<div>&#39; to x.x.x.x</div><div>carbon proxy:: x.x.x.x is ready to \
receive</div></div><div><br></div><div>The segfault looks like this in \
/var/log/messages:</div><div><br></div><div>Mar  7 20:11:26 blah kernel: \
gmetad[18445]: segfault at 0 ip 00000000004071ea sp 00007f354f021410 error 4 in \
gmetad[400000+d000]</div> <div><br></div><div>Also interesting is that we get this \
message in our logs about every minute:</div><div><br></div><div>Mar  7 20:13:06 blah \
/usr/sbin/gmetad[19098]: server_thread() -1326024960 unable to write root \
epilog</div> <div><br></div><div>I&#39;m about to start hooking up gdb to this guy \
but was wondering if someone is aware of this issue or has suggestions for what might \
be causing this? </div><div><br></div><div>Thanks,<br>Aaron</div>


