'Re: [lustre-discuss] Accessing files with bad PFL causing MDS kernel panics'


List:       lustre-discuss
Subject:    Re: [lustre-discuss] Accessing files with bad PFL causing MDS kernel panics
From:       Colin Faber via lustre-discuss <lustre-discuss () lists ! lustre ! org>
Date:       2022-10-25 21:14:37
Message-ID: CAJcXmBkyc4stfg45SuDwFugr7XHpFrFAsKwPMy6mAqaAUj39Nw () mail ! gmail ! com



Hi Nathan, looks like you're hitting
https://jira.whamcloud.com/browse/LU-16152

-cf


On Tue, Oct 25, 2022 at 2:43 PM Nathan Crawford <nrcrawfo@uci.edu> wrote:

> Hi All,
>
>   I'm looking for possible work-arounds to recover data from some
> mis-migrated files (as seen in LU-16152). Basically, there's a bug in "lfs
> setstripe --yaml" where extent start/end values >= 2 GiB in the YAML file
> overflow to 16 EiB - 2 GiB.
>
>   Using lfs_migrate, I re-striped many files in directories with a default
> striping pattern containing these values.  I'm pretty sure that the data
> exists (was trying to purge an older OST, and disk usage on the other OSTs
> increased as the purged OST decreased), and an lfsck procedure happily
> returns after a day or so. Unfortunately, attempts to access or re-migrate
> the files trigger a kernel panic on the MDS with:
>
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) ASSERTION( !((unsigned
> long)addr & ~(~(((1UL) << 12)-1))) ) failed:
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) LBUG
> Kernel panic - not syncing: LBUG
>
>   The servers are Lustre 2.12.8 on OpenZFS 0.8.5 on CentOS 7.9. The output
> from "lfs getstripe -v badfile" is attached.
>
>   I can use lfs find to search for files with these bad extent endpoint
> values, then move them to a quarantine area on the same FS. This will allow
> the rest of the system to stay up (hopefully) but recovering the data is
> still needed.
>
> Thanks!
> Nate
>
> --
>
> Dr. Nathan Crawford              nathan.crawford@uci.edu
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall                 Office: 152 Rowland Hall
> University of California, Irvine  Phone: 949-824-1380
> Irvine, CA 92697-2025, USA
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
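
For readers wondering where the "16 EiB - 2 GiB" figure in the quoted message comes from: it is the value you get if 2 GiB is squeezed through a signed 32-bit integer and then widened back to an unsigned 64-bit offset. The sketch below only illustrates that arithmetic. The helper name is made up for this example, and the assumption that the parser behaves this way is mine, not something stated in this thread. It reproduces the reported figure for an input of exactly 2 GiB; how larger values behave depends on the actual parsing code, which is not shown here.

```python
GIB = 1 << 30
EIB = 1 << 60


def sign_extend_32_to_u64(value: int) -> int:
    """Hypothetical model: keep the low 32 bits, read them as a signed
    32-bit integer, then store the result as an unsigned 64-bit value."""
    low = value & 0xFFFFFFFF
    if low & 0x80000000:      # sign bit set, so the 32-bit value is negative
        low -= 1 << 32
    return low & 0xFFFFFFFFFFFFFFFF


bad = sign_extend_32_to_u64(2 * GIB)
print(hex(bad))                      # extent value after the round trip
print(bad == 16 * EIB - 2 * GIB)     # compare with the figure in the report
```

Values below 2 GiB pass through this model unchanged, which matches the observation that only extents at or above 2 GiB were affected.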
