
List:       lustre-discuss
Subject:    Re: [lustre-discuss] LNet nid down after some thing changed the NICs
From:       CJ Yin via lustre-discuss <lustre-discuss () lists ! lustre ! org>
Date:       2023-03-09 9:04:00
Message-ID: CAM=5jC8j5ph8hu-nJvJDwJt0jaujap7wWjjmerJBodf2=NM5Zg () mail ! gmail ! com

Hi Horn,

Thanks for your help. I checked the description of the ticket you
mentioned. The results are similar, but the root cause seems different. While
I was waiting for the account, I debugged this issue with my
colleague. We found the root cause and are already working on a fix.
Hopefully we can provide a patch for this issue. In short, it's a
network namespace problem. A NIC named eth0 with index 2 can be created
in another network namespace. LNet then thinks that the eth0 in the default
namespace has changed, and its state for that interface becomes corrupted.
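
To illustrate the collision, here is a minimal Python sketch. It is a
simplified model under stated assumptions, not the actual LNet code: the
names NetDevEvent, LocalNI, handle_event_naive and handle_event_ns_aware
are hypothetical, as are the "default" and "pod-1" namespace labels. It
assumes the handler identifies a device by name and index only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetDevEvent:
    """Hypothetical model of a network device notification."""
    netns: str            # namespace the device lives in
    name: str             # interface name, e.g. "eth0"
    ifindex: int          # index, unique only within one namespace
    addr: Optional[str]   # IPv4 address on the device, if any
    link_up: bool


@dataclass
class LocalNI:
    """Hypothetical model of the state kept for one local NI."""
    netns: str
    name: str
    ifindex: int
    addr: str
    status: str = "up"


def handle_event_naive(ni: LocalNI, ev: NetDevEvent) -> None:
    # Assumption: the device is matched by (name, index) only, so an
    # event from any namespace with the same pair is accepted.
    if (ev.name, ev.ifindex) != (ni.name, ni.ifindex):
        return
    ni.status = "up" if ev.link_up and ev.addr == ni.addr else "down"


def handle_event_ns_aware(ni: LocalNI, ev: NetDevEvent) -> None:
    # Ignore events from devices that live in a different namespace.
    if ev.netns != ni.netns:
        return
    handle_event_naive(ni, ev)


# A pod's eth0 (index 2, no matching address) appears in its own namespace.
pod_event = NetDevEvent(netns="pod-1", name="eth0", ifindex=2,
                        addr=None, link_up=False)

ni = LocalNI(netns="default", name="eth0", ifindex=2, addr="10.224.0.5")
handle_event_naive(ni, pod_event)
print(ni.status)  # down

ni = LocalNI(netns="default", name="eth0", ifindex=2, addr="10.224.0.5")
handle_event_ns_aware(ni, pod_event)
print(ni.status)  # up
```

In this model the only difference between the two handlers is the
namespace comparison, which is the kind of check we think is missing.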

Regards,
Chuanjun

On Thu, Mar 2, 2023 at 01:16, Horn, Chris <chris.horn@hpe.com> wrote:

> Hi CJ,
>
>
>
> I don't know if you ever got an account and ticket opened, but I stumbled
> upon this change which sounds like it could be your issue -
> https://jira.whamcloud.com/browse/LU-16378
>
> commit 3c9282a67d73799a03cb1d254275685c1c1e4df2
> Author: Cyril Bordage <cbordage@whamcloud.com>
> Date:   Sat Dec 10 01:51:16 2022 +0100
>
>     LU-16378 lnet: handles unregister/register events
>
>     When network is restarted, devices are unregistered and then
>     registered again. When a device registers using an index that is
>     different from the previous one (before network was restarted), LNet
>     ignores it. Consequently, this device stays with link in fatal state.
>
>     To fix that, we catch unregistering events to clear the saved index
>     value, and when a registering event comes, we save the new value.
>
>
>
> Chris Horn
>
>
>
> *From: *CJ Yin <woshifuxiuyin@gmail.com>
> *Date: *Sunday, February 19, 2023 at 12:23 AM
> *To: *Horn, Chris <chris.horn@hpe.com>
> *Cc: *lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
> *Subject: *Re: [lustre-discuss] LNet nid down after some thing changed
> the NICs
>
> Hi Chris,
>
>
>
> Thanks for your help. I have collected the relevant logs according to your
> hints. But I need an account to open a ticket on Jira. I have sent an
> email to the administrator at info@whamcloud.com. I was wondering if this
> is the correct way to apply for an account. I only found this email on the
> site.
>
>
>
> Regards,
>
> Chuanjun
>
>
>
> On Sat, Feb 18, 2023 at 00:52, Horn, Chris <chris.horn@hpe.com> wrote:
>
> If deleting and re-adding it restores the status to up then this sounds
> like a bug to me.
>
>
>
> Can you enable debug tracing, reproduce the issue, and add this
> information to a ticket?
>
> To enable/gather debug:
>
> # lctl set_param debug=+net
> <reproduce issue>
> # lctl dk > /tmp/dk.log
>
> You can create a ticket at https://jira.whamcloud.com/
>
> Please provide the dk.log with the ticket.
>
>
>
> Thanks,
>
> Chris Horn
>
>
>
> *From: *lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on
> behalf of 腐朽银 via lustre-discuss <lustre-discuss@lists.lustre.org>
> *Date: *Friday, February 17, 2023 at 2:53 AM
> *To: *lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
> *Subject: *[lustre-discuss] LNet nid down after some thing changed the
> NICs
>
> Hi,
>
>
>
> I encountered a problem when using the Lustre client on k8s with kubenet. I
> would be very grateful for any help.
>
>
>
> My LNet configuration is:
>
>
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>     - net type: tcp
>       local NI(s):
>         - nid: 10.224.0.5@tcp
>           status: up
>           interfaces:
>               0: eth0
>
>
>
> It works, but after I deploy or delete a pod on the node, the NID goes
> down:
>
>
>
>         - nid: 10.224.0.5@tcp
>           status: down
>           interfaces:
>               0: eth0
>
>
>
> k8s uses veth pairs, so it adds or deletes network interfaces when
> deploying or deleting pods, but it doesn't touch the eth0 NIC. I can fix it
> by deleting the tcp net with `lnetctl net del` and re-adding it with
> `lnetctl net add`, but I need to do this every time a pod is scheduled to
> this node.
>
>
>
> My node OS is Ubuntu 18.04 with kernel 5.4.0-1101-azure. I built the Lustre
> client myself from 2.15.1. Is this expected LNet behavior, or did I get
> something wrong? I rebuilt and tested it several times and got the same problem.
>
>
>
> Regards,
>
> Chuanjun
>
>


_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

