
List:       freebsd-fs
Subject:    Re: FYI: an example Optane Read failure on a HoneyComb (aarch64), main at e78dc78e517a
From:       Warner Losh <imp () bsdimp ! com>
Date:       2023-03-03 4:50:50
Message-ID: CANCZdfoFo5m-J=C53HfESrOObvSG7QhtMyd67S9vYzynC-kgiQ () mail ! gmail ! com

On Thu, Mar 2, 2023 at 9:39 PM Mark Millard <marklmi@yahoo.com> wrote:

> FYI: I got the following error:
> 
> nvme0: RECOVERY_START 411856860627824 vs 411855426224359
> 

Translation: We submitted transactions to the card and got a timeout waiting
for them to finish.
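
As a rough illustration of what those two numbers mean, here is a minimal sketch. It assumes (this is not stated in the log itself) that both values are FreeBSD sbintime_t timestamps in 32.32 fixed point, with the first being the current uptime and the second the deadline of the timed-out transaction. Under that assumption the request was roughly a third of a second past its deadline when the timeout handler looked at it.

```python
# Assumption: sbintime_t is a 64-bit value with 32 integer bits of seconds
# and 32 fractional bits, so dividing by 2**32 yields seconds.
SBT_FRAC_BITS = 32


def sbt_to_seconds(sbt: int) -> float:
    """Convert an sbintime_t-style fixed-point value to seconds."""
    return sbt / (1 << SBT_FRAC_BITS)


# Values copied from the RECOVERY_START line; which one is "now" and which
# is the deadline is an assumption based on their ordering.
now = 411856860627824
deadline = 411855426224359

uptime_s = sbt_to_seconds(now)
overdue_s = sbt_to_seconds(now - deadline)

print(f"uptime at timeout check: {uptime_s:.1f} s")
print(f"past deadline by:        {overdue_s:.3f} s")
```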


> nvme0: Controller in fatal status, resetting
> 

The controller status bit indicating a fatal failure was set.


> nvme0: Resetting controller due to a timeout and possible hot unplug.
> 

We read 0xffffffff from the card. That often indicates a power glitch reset
the card, but sometimes a bridge gets messed up so transactions no longer
reach the address we think the card is at, and sometimes the firmware in the
card has crashed. It's hard to diagnose for sure, but one thing is for sure:
the card is AFU and we can't fix it.
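
The two checks described so far (fatal status bit set, or an all-ones read) can be sketched as below. This is an illustration, not the driver's code: the bit positions follow the NVMe CSTS register layout (RDY is bit 0, CFS is bit 1), while the constant and function names are hypothetical. An all-ones 32-bit read (0xffffffff) is what a PCIe read typically returns when nothing answers at that address.

```python
# Hypothetical names; bit positions are from the NVMe CSTS register layout.
DEVICE_GONE = 0xFFFFFFFF  # all-ones read: nothing answered on the bus
CSTS_RDY = 1 << 0         # controller ready
CSTS_CFS = 1 << 1         # controller fatal status


def classify_csts(csts: int) -> str:
    """Classify a 32-bit CSTS register read into a coarse health state."""
    if csts == DEVICE_GONE:
        # Must be checked first: all-ones also has the CFS bit set.
        return "gone"
    if csts & CSTS_CFS:
        return "fatal"
    if csts & CSTS_RDY:
        return "ready"
    return "not-ready"


for value in (0xFFFFFFFF, 0x00000002, 0x00000001, 0x00000000):
    print(f"{value:#010x} -> {classify_csts(value)}")
```

Checking for the all-ones value before testing individual bits matters, because every bit (including the fatal-status bit) reads as set when the device has dropped off the bus.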


> nvme0: RECOVERY_WAITING
> nvme0: resetting controller
> 
> nvme0: failing outstanding i/o
> 

We reset the controller.


> nvme0: READ sqid:2 cid:5 nsid:1 lba:537405568 len:16
> nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:2 cid:5 cdw0:0
> (nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=20082880 0 f 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
> g_vfs_done():gpt/CA72optM2ufs[READ(offset=65536, length=8192)]enda0 at
> nvme0 bus 0 scbus4 target 0 lun 1
> nda0: rror = 5
> 

We failed enough times that we gave up.
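
For anyone wanting to map the failed command back to the filesystem request, the numbers in the log are self-consistent. The sketch below assumes 512-byte sectors (which matches 1875384928 sectors being reported as 894G in the gpart output quoted later in this message) and assumes the values printed after "cdw=" are command dwords 10 through 15 in hex. The partition start is the nda0p3 (CA72optM2ufs) line from that same gpart output.

```python
SECTOR_SIZE = 512            # assumption, consistent with the gpart sizes
PART_START_LBA = 537405440   # start of nda0p3 (CA72optM2ufs) per gpart

# From the g_vfs_done() line: READ(offset=65536, length=8192)
fs_offset = 65536
fs_length = 8192

# Translate the partition-relative byte range into a device LBA range.
lba = PART_START_LBA + fs_offset // SECTOR_SIZE
nblocks = fs_length // SECTOR_SIZE

# From the CAM line "cdw=20082880 0 f 0 0 0" (assumed to be cdw10..cdw15).
cdw10 = 0x20082880           # low 32 bits of the starting LBA
cdw12 = 0xF                  # low 16 bits hold block count minus one

print(f"computed lba={lba} len={nblocks}")
print(f"from cdw:  lba={cdw10} len={(cdw12 & 0xFFFF) + 1}")
```

Both routes give the lba and len values printed on the "nvme0: READ" line, so the aborted command was a read 64 KiB into the UFS partition. The "(00/07)" on the ABORTED line is status code type 0 (generic) with status code 0x07, Command Abort Requested, and "Error 5" is EIO.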


> <INTEL SSDPE21D960GA E2010480 REDACTED>
> s/n REDACTED detached
> (nda0:nvme0:0:0:1): Periph destroyed
> nvme0: waiting
> 

This might indicate there's still a reference here, but maybe there's not.

The one thing I don't do in recovery is try to power cycle the card. That's
now possible with decent APIs in the kernel, but I haven't had the need to
do it for our deployment.

Warner


> (After rebooting . . .)
> 
> # gpart show -pl
> =>        40  1875384928    nda0  GPT  (894G)
> 40      532480  nda0p1  CA72optM2efi  (260M)
> 532520        2008          - free -  (1.0M)
> 534528    20971520  nda0p2  CA72optM2swp10  (10G)
> 21506048    29360128  nda0p4  CA72optM2swp14  (14G)
> 50866176    33554432  nda0p5  CA72optM2swp16  (16G)
> 84420608    67108864  nda0p6  CA72optM2swp32  (32G)
> 151529472   364904448  nda0p7  CA72optM2swp174  (174G)
> 516433920     7340032  nda0p8  RPi3swp3p5  (3.5G)
> 523773952    13631488          - free -  (6.5G)
> 537405440  1337979528  nda0p3  CA72optM2ufs  (638G)
> 
> =>        40  1875384928    nda1  GPT  (894G)
> 40      532480  nda1p1  CA72opt0EFI  (260M)
> 532520        2008          - free -  (1.0M)
> 534528   515899392  nda1p2  CA72opt0SWP  (246G)
> 516433920    20971520          - free -  (10G)
> 537405440  1337979528  nda1p3  CA72opt0ZFS  (638G)
> 
> nda0 is a U2 Optane used via a M2 adapter. The error is the
> first that I've seen for it. (nda1 is in the PCIe slot.)
> 
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #88
> main-n261230-e78dc78e517a-dirty: Wed Mar  1 16:17:45 PST 2023
> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
>  arm64 aarch64 1400081 1400081
> 
> ===
> Mark Millard
> marklmi at yahoo.com
> 
> 
> 




