[prev in list] [next in list] [prev in thread] [next in thread]
List: freebsd-hackers
Subject: lockmgr() in network driver context
From: Stephan de Wit <stephan.de.wit () deciso ! com>
Date: 2023-08-01 14:47:55
Message-ID: AM6P193MB0311A026447E2B8523F4AE89BC0AA () AM6P193MB0311 ! EURP193 ! PROD ! OUTLOOK ! COM
[Download RAW message or body]
[Attachment #2 (text/plain)]
Hi all,
I'm encountering i2c bus failures that need more graceful error handling.
The driver operates its management threads in polling mode within callout context.
As a hardware constraint, communication via said i2c bus can only
happen for a single network port at a time and thus needs to be mtx_lock()'d.
With this lock held, i2c transfers are initiated by the driver and the done
event is signaled via an ISR. Since we cannot sleep both in callout (not
allowed according to [1]) and mutex context (will panic), the second thread
waiting on the bus will spin if it does not live on the same CPU[1], and the
i2c transfer itself is handled via a DELAY() busy loop. If the bus fails,
we have 2 cores spinning at 100%, causing connections to drop and all
sorts of other carnage to happen. Reducing the timeout is not an option, as
the transfers might take a while.
Sleeping under a MTX_DEF is not possible, instant panic. However, if I replace
the first mtx_lock() with a lockmgr() call that was initialized with LK_TIMELOCK |
LK_CANRECURSE and a given timeout, I'm able to put the thread waiting on
the comms bus to be available to sleep. If I replace the hardcoded DELAY()s
further down the line during the i2c transfers with mtx_sleep() (and wakeup_one() in the ISR),
no CPU cycles are wasted, this is also the case for normal operation,
improving the current situation by a mile.
As to the nature of these bus failures, they seem to happen very infrequently and nigh
on impossible to reproduce on a production system, but they always seem
to recover after a certain amount of time. To reproduce the issue, I have
wired some physical switches to the i2c bus to simulate the outages.
From a design perspective, the i2c bus should be a singleton and should be treated
as such via the queueing of transfer tasks including metadata to link
such transfers back to specific port ids, however, a change like this would
require significant effort with a large risk of regressions, and I'm not even
sure if sleeping in such a scenario would be considered a valid thing to do
in FreeBSD context, so I would prefer sticking to the battle-hardened
code.
The documentation for lockmgr() tells me to stay away from it, but I
seem to have little options left. Also, the solution seems to violate
the FreeBSD guidelines for sleeping in callout context. Any thoughts?
Thanks in advance.
[1] https://man.freebsd.org/cgi/man.cgi?locking(9)
[Attachment #3 (text/html)]
<html xmlns:v="urn:schemas-microsoft-com:vml" \
xmlns:o="urn:schemas-microsoft-com:office:office" \
xmlns:w="urn:schemas-microsoft-com:office:word" \
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" \
xmlns="http://www.w3.org/TR/REC-html40"> <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-ligatures:standardcontextual;
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{mso-style-priority:99;
mso-style-link:"Plain Text Char";
margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-ligatures:standardcontextual;
mso-fareast-language:EN-US;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.PlainTextChar
{mso-style-name:"Plain Text Char";
mso-style-priority:99;
mso-style-link:"Plain Text";
font-family:"Calibri",sans-serif;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="en-NL" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoPlainText"><span lang="en-NL">Hi all,<o:p></o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span lang="EN-US">I'm encountering i2c bus failures that \
need more graceful error handling. <o:p></o:p></span></p>
<p class="MsoPlainText"><span lang="EN-US">The driver operates its management threads \
in polling mode within callout context.<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">As a hardware constraint, communication via \
said i2c bus can only<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">happen for a single network port at a time and thus needs to be \
mtx_lock()'d.<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">With \
this lock held, i2c transfers are initiated by the driver and the \
done<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">event is \
signaled via an ISR. Since we cannot sleep both in callout (not<o:p></o:p></span></p> \
<p class="MsoPlainText"><span lang="en-NL">allowed according to [1]) and mutex \
context (will panic), the second thread<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">waiting on the bus will spin if it does not \
live on the same CPU[1], and the<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">i2c transfer itself is handled via a DELAY() busy loop. If the bus \
fails,<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">we have 2 \
cores spinning at 100%, causing connections to drop and all<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">sorts of other carnage to happen. Reducing \
the timeout is not an option, as<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">the transfers might take a while. <o:p> </o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL">Sleeping under a MTX_DEF is \
not</span><span lang="en-NL"> </span><span lang="en-NL">possible, instant panic. \
However, if I replace <o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">the first mtx_lock</span><span lang="EN-US">()</span><span lang="en-NL"> \
with a</span><span lang="en-NL"> </span><span lang="en-NL">lockmgr() call that was \
initialized with LK_TIMELOCK | <o:p> </o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL">LK_CANRECURSE and a given</span><span \
lang="en-NL"> </span><span lang="en-NL">timeout, I'm able to put the thread waiting \
on<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">the comms bus to \
be available to</span><span lang="en-NL"> </span><span lang="en-NL">sleep. If I \
replace the hardcoded DELAY()s</span><span lang="en-NL"> </span><span \
lang="EN-US"><o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">further down the line during the</span><span lang="en-NL"> </span><span \
lang="en-NL">i2c transfers with mtx_sleep() (and wakeup_one() in the ISR), \
<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">no CPU</span><span \
lang="en-NL"> </span> <span lang="en-NL">cycles are wasted, this is also the case for \
normal operation</span><span lang="EN-US">,<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="EN-US">improving the current situation by a \
mile.<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">As to the</span><span lang="en-NL"> </span> <span lang="en-NL">nature of \
these bus failures, they seem to happen very infrequently and \
nigh<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">on impossible \
to reproduce on a production system, but they always seem<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">to recover after a certain amount of time. To \
reproduce the issue, I have<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">wired some physical switches to the i2c bus to simulate the \
outages. <o:p></o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL">From a</span><span lang="en-NL"> </span>
<span lang="en-NL">design perspective, the i2c bus should be a singleton and should \
be treated<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">as such \
via the queueing of transfer tasks including metadata to link<o:p></o:p></span></p> \
<p class="MsoPlainText"><span lang="en-NL">such transfers back to specific port ids, \
however, a change like this would<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">require significant effort with a large risk of regressions, and I'm not \
even<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">sure if \
sleeping in such a scenario would be considered a valid thing to \
do<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">in FreeBSD \
context, so I would prefer sticking to the battle-hardened<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">code.<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL"><o:p> </o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">The documentation for lockmgr() tells me to \
stay away from it, but I<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">seem to have little options left. </span> <span lang="EN-US">Also, the \
solution seems to violate<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="EN-US">the FreeBSD guidelines for sleeping in callout context. </span><span \
lang="en-NL">Any thoughts? <o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span \
lang="EN-US">Thanks in advance.<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL"><o:p> </o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">[1]</span><span lang="en-NL"> </span><span lang="en-NL"><a \
href="https://man.freebsd.org/cgi/man.cgi?locking(9)">https://man.freebsd.org/cgi/man.cgi?locking(9)</a><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="en-NL"><o:p> </o:p></span></p>
</div>
</body>
</html>
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic