
List:       ganglia-general
Subject:    [Ganglia-general] ganglia with nvidia cuda monitoring
From:       Yann Sagon <yann.sagon () unige ! ch>
Date:       2017-03-13 15:42:55
Message-ID: CAPoHtHZfYJ1k9-i5GGknyewSoSP7cBZjWEG1NGFXY3t6-+4Kdw () mail ! gmail ! com



Hello,

I'm managing a cluster and we use Ganglia for monitoring.

As we now have some GPUs (CUDA), I have tried to install this module:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia

In the web interface, I can see the GPU-related graphs, but they contain
no data.

I'm using Ganglia 3.7.2.

I have installed the nvidia-ml-py package:

https://pypi.python.org/pypi/nvidia-ml-py/7.352.0

We run gmond on each node (deaf, not mute), plus one gmond on the monitor
node that aggregates the data from all the nodes' gmond instances. This
master gmond is polled by gmetad.

gmond config on monitor:
[...]
mute = no
deaf = no
allow_extra_data = yes
[...]

gmond config on nodes:
[...]
mute = no
deaf = yes
allow_extra_data = yes
[...]


On the GPU node, I have the following files:

/opt/ganglia/ganglia-3.7.2/lib64/ganglia/python_modules/
├── nvidia.py
├── nvidia_smi.py
├── pynvml.py

/opt/ganglia/ganglia-3.7.2/etc/conf.d/
├── modpython.conf
└── nvidia.pyconf

Content of modpython.conf:

modules {
  module {
    name = "python_module"
    path = "modpython.so"
    params = "/opt/ganglia/ganglia-3.7.2/lib64/ganglia/python_modules"
  }
}

include ("/opt/ganglia/ganglia-3.7.2/etc/conf.d/*.pyconf")


The file nvidia.pyconf is the original, unmodified version.

If I start gmond on this node in the foreground with debugging enabled, it
seems to work fine:

[...]
metric 'gpu2_ecc_sb_error' being collected now
metric 'gpu2_ecc_sb_error' has value_threshold 1.000000
sent message 'gpu1_graphics_clock_report' of length 72 with 0 errors
sent message 'gpu3_graphics_clock_report' of length 72 with 0 errors
[...]
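
To make a longer debug capture easier to read, a small script can tally the "sent message" lines per metric. This is only a minimal sketch: it assumes Python 3 and the exact line format shown in the excerpt above, and the function name `summarize_sent` is made up for illustration.

```python
import re

# Matches the "sent message" lines gmond prints in debug mode,
# in the format shown in the excerpt above.
SENT_RE = re.compile(r"sent message '([^']+)' of length (\d+) with (\d+) errors")


def summarize_sent(lines):
    """Return {metric_name: (times_sent, total_errors)} for the given lines."""
    summary = {}
    for line in lines:
        match = SENT_RE.search(line)
        if match is None:
            continue  # not a "sent message" line
        name = match.group(1)
        errors = int(match.group(3))
        count, error_total = summary.get(name, (0, 0))
        summary[name] = (count + 1, error_total + errors)
    return summary


# The four debug lines quoted above, used as sample input.
sample = [
    "metric 'gpu2_ecc_sb_error' being collected now",
    "metric 'gpu2_ecc_sb_error' has value_threshold 1.000000",
    "sent message 'gpu1_graphics_clock_report' of length 72 with 0 errors",
    "sent message 'gpu3_graphics_clock_report' of length 72 with 0 errors",
]
for name, (count, errors) in sorted(summarize_sent(sample).items()):
    print(f"{name}: sent {count} time(s), {errors} error(s)")
```

A metric that is collected but never appears in the tally, or that appears with a non-zero error count, would point at the sending side rather than at the monitor.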

On the monitor server, I have these GPU-related files:

/var/www/html/ganglia/graph.d/
├── gpu_common.php
├── gpu_graphics_clock_report.php
├── gpu_mem_clock_report.php
├── gpu_power_usage_report.php
├── gpu_power_violation_report.php
├── gpu_sm_clock_report.php

As it's not working, how can I be sure that the gmond on the GPU node is
actually sending data?

Do I have to install anything GPU-related on the master gmond or on gmetad?

Any clue how to troubleshoot this?
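
One check that might narrow this down is to read the XML dump that a gmond writes to any client connecting to its TCP accept channel, and list which hosts report metrics whose names start with "gpu". The sketch below is hedged: it assumes Python 3, that the monitor's gmond has a tcp_accept_channel on the default port 8649, and that the host name "monitor" is a placeholder. The function names are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the whole XML dump gmond sends before it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closed the connection: dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_by_host(xml_bytes, prefix="gpu"):
    """Map each host name to the sorted metric names starting with prefix."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for host in root.iter("HOST"):
        names = sorted(
            metric.get("NAME")
            for metric in host.iter("METRIC")
            if metric.get("NAME", "").startswith(prefix)
        )
        if names:
            result[host.get("NAME")] = names
    return result


# Example use (hypothetical host name, not run here):
#   for host, names in metrics_by_host(fetch_gmond_xml("monitor")).items():
#       print(host, names)
```

If the GPU node shows up in the dump without any gpu* metrics, the data is not reaching the master gmond. If the metrics are present there, the problem is more likely further along, in gmetad or the web front end.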

Many thanks

-- 
Yann SAGON
Ingénieur système HPC
24 Rue du Général-Dufour
1211 Genève 4 - Suisse
Tél. : +41 (0)22 379 7737
yann.sagon@unige.ch - www.unige.ch






