[prev in list] [next in list] [prev in thread] [next in thread] 

List:       freebsd-hackers
Subject:    lockmgr() in network driver context
From:       Stephan de Wit <stephan.de.wit () deciso ! com>
Date:       2023-08-01 14:47:55
Message-ID: AM6P193MB0311A026447E2B8523F4AE89BC0AA () AM6P193MB0311 ! EURP193 ! PROD ! OUTLOOK ! COM
[Download RAW message or body]

[Attachment #2 (text/plain)]

Hi all,



I'm encountering i2c bus failures that need more graceful error handling.

The driver operates its management threads in polling mode within callout context.

As a hardware constraint, communication via said i2c bus can only

happen for a single network port at a time and thus needs to be mtx_lock()'d.

With this lock held, i2c transfers are initiated by the driver and the done

event is signaled via an ISR. Since we cannot sleep both in callout (not

allowed according to [1]) and mutex context (will panic), the second thread

waiting on the bus will spin if it does not live on the same CPU[1], and the

i2c transfer itself is handled via a DELAY() busy loop. If the bus fails,

we have 2 cores spinning at 100%, causing connections to drop and all

sorts of other carnage to happen. Reducing the timeout is not an option, as

the transfers might take a while.



Sleeping under a MTX_DEF is not possible, instant panic. However, if I replace

the first mtx_lock() with a lockmgr() call that was initialized with LK_TIMELOCK |

LK_CANRECURSE and a given timeout, I'm able to put the thread waiting on

the comms bus to be available to sleep. If I replace the hardcoded DELAY()s

further down the line during the i2c transfers with mtx_sleep() (and wakeup_one() in the ISR),

no CPU cycles are wasted, this is also the case for normal operation,

improving the current situation by a mile.



As to the nature of these bus failures, they seem to happen very infrequently and nigh

on impossible to reproduce on a production system, but they always seem

to recover after a certain amount of time. To reproduce the issue, I have

wired some physical switches to the i2c bus to simulate the outages.



From a design perspective, the i2c bus should be a singleton and should be treated

as such via the queueing of transfer tasks including metadata to link

such transfers back to specific port ids, however, a change like this would

require significant effort with a large risk of regressions, and I'm not even

sure if sleeping in such a scenario would be considered a valid thing to do

in FreeBSD context, so I would prefer sticking to the battle-hardened

code.



The documentation for lockmgr() tells me to stay away from it, but I

seem to have little options left. Also, the solution seems to violate

the FreeBSD guidelines for sleeping in callout context. Any thoughts?



Thanks in advance.



[1] https://man.freebsd.org/cgi/man.cgi?locking(9)


[Attachment #3 (text/html)]

<html xmlns:v="urn:schemas-microsoft-com:vml" \
xmlns:o="urn:schemas-microsoft-com:office:office" \
xmlns:w="urn:schemas-microsoft-com:office:word" \
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" \
xmlns="http://www.w3.org/TR/REC-html40"> <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;
	mso-ligatures:standardcontextual;
	mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:#0563C1;
	text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
	{mso-style-priority:99;
	mso-style-link:"Plain Text Char";
	margin:0cm;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;
	mso-ligatures:standardcontextual;
	mso-fareast-language:EN-US;}
span.EmailStyle17
	{mso-style-type:personal-compose;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
span.PlainTextChar
	{mso-style-name:"Plain Text Char";
	mso-style-priority:99;
	mso-style-link:"Plain Text";
	font-family:"Calibri",sans-serif;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-family:"Calibri",sans-serif;
	mso-fareast-language:EN-US;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="en-NL" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoPlainText"><span lang="en-NL">Hi all,<o:p></o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL"><o:p>&nbsp;</o:p></span></p>
<p class="MsoPlainText"><span lang="EN-US">I'm encountering i2c bus failures that \
need more graceful error handling. <o:p></o:p></span></p>
<p class="MsoPlainText"><span lang="EN-US">The driver operates its management threads \
in polling mode within callout context.<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">As a hardware constraint, communication via \
said i2c bus can only<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">happen for a single network port at a time and thus needs to be \
mtx_lock()'d.<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">With \
this lock held, i2c transfers are initiated by the driver and the \
done<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">event is \
signaled via an ISR. Since we cannot sleep both in callout (not<o:p></o:p></span></p> \
<p class="MsoPlainText"><span lang="en-NL">allowed according to [1]) and mutex \
context (will panic), the second thread<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">waiting on the bus will spin if it does not \
live on the same CPU[1], and the<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">i2c transfer itself is handled via a DELAY() busy loop. If the bus \
fails,<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">we have 2 \
cores spinning at 100%, causing connections to drop and all<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">sorts of other carnage to happen. Reducing \
the timeout is not an option, as<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">the transfers might take a while. <o:p> </o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL"><o:p>&nbsp;</o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL">Sleeping under a MTX_DEF is \
not</span><span lang="en-NL"> </span><span lang="en-NL">possible, instant panic. \
However, if I replace <o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">the first mtx_lock</span><span lang="EN-US">()</span><span lang="en-NL"> \
with a</span><span lang="en-NL"> </span><span lang="en-NL">lockmgr() call that was \
initialized with LK_TIMELOCK | <o:p> </o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL">LK_CANRECURSE and a given</span><span \
lang="en-NL"> </span><span lang="en-NL">timeout, I'm able to put the thread waiting \
on<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">the comms bus to \
be available to</span><span lang="en-NL"> </span><span lang="en-NL">sleep. If I \
replace the hardcoded DELAY()s</span><span lang="en-NL"> </span><span \
lang="EN-US"><o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">further down the line during the</span><span lang="en-NL"> </span><span \
lang="en-NL">i2c transfers with mtx_sleep() (and wakeup_one() in the ISR), \
<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">no CPU</span><span \
lang="en-NL"> </span> <span lang="en-NL">cycles are wasted, this is also the case for \
normal operation</span><span lang="EN-US">,<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="EN-US">improving the current situation by a \
mile.<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL"><o:p>&nbsp;</o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">As to the</span><span lang="en-NL"> </span> <span lang="en-NL">nature of \
these bus failures, they seem to happen very infrequently and \
nigh<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">on impossible \
to reproduce on a production system, but they always seem<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">to recover after a certain amount of time. To \
reproduce the issue, I have<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">wired some physical switches to the i2c bus to simulate the \
outages.&nbsp; <o:p></o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL"><o:p>&nbsp;</o:p></span></p>
<p class="MsoPlainText"><span lang="en-NL">From a</span><span lang="en-NL"> </span>
<span lang="en-NL">design perspective, the i2c bus should be a singleton and should \
be treated<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">as such \
via the queueing of transfer tasks including metadata to link<o:p></o:p></span></p> \
<p class="MsoPlainText"><span lang="en-NL">such transfers back to specific port ids, \
however, a change like this would<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">require significant effort with a large risk of regressions, and I'm not \
even<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">sure if \
sleeping in such a scenario would be considered a valid thing to \
do<o:p></o:p></span></p> <p class="MsoPlainText"><span lang="en-NL">in FreeBSD \
context, so I would prefer sticking to the battle-hardened<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">code.<o:p></o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL"><o:p>&nbsp;</o:p></span></p> <p \
class="MsoPlainText"><span lang="en-NL">The documentation for lockmgr() tells me to \
stay away from it, but I<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">seem to have little options left. </span> <span lang="EN-US">Also, the \
solution seems to violate<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="EN-US">the FreeBSD guidelines for sleeping in callout context. </span><span \
lang="en-NL">Any thoughts? <o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL"><o:p>&nbsp;</o:p></span></p> <p class="MsoPlainText"><span \
lang="EN-US">Thanks in advance.<o:p></o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL"><o:p>&nbsp;</o:p></span></p> <p class="MsoPlainText"><span \
lang="en-NL">[1]</span><span lang="en-NL"> </span><span lang="en-NL"><a \
href="https://man.freebsd.org/cgi/man.cgi?locking(9)">https://man.freebsd.org/cgi/man.cgi?locking(9)</a><o:p></o:p></span></p>
 <p class="MsoNormal"><span lang="en-NL"><o:p>&nbsp;</o:p></span></p>
</div>
</body>
</html>



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic