List: lustre-discuss
Subject: Re: [lustre-discuss] Accessing files with bad PFL causing MDS kernel panics
From: Colin Faber via lustre-discuss <lustre-discuss () lists ! lustre ! org>
Date: 2022-10-25 21:14:37
Message-ID: CAJcXmBkyc4stfg45SuDwFugr7XHpFrFAsKwPMy6mAqaAUj39Nw () mail ! gmail ! com
Hi Nathan, looks like you're hitting
https://jira.whamcloud.com/browse/LU-16152
-cf
On Tue, Oct 25, 2022 at 2:43 PM Nathan Crawford <nrcrawfo@uci.edu> wrote:
> Hi All,
>
> I'm looking for possible work-arounds to recover data from some
> mis-migrated files (as seen in LU-16152). Basically, there's a bug in "lfs
> setstripe --yaml" where extent start/end values in the YAML file >= 2 GiB
> overflow to 16 EiB - 2 GiB.
>
> Using lfs_migrate, I re-striped many files in directories with a default
> striping pattern containing these values. I'm pretty sure that the data
> exists (was trying to purge an older OST, and disk usage on the other OSTs
> increased as the purged OST decreased), and an lfsck procedure happily
> returns after a day or so. Unfortunately, attempts to access or re-migrate
> the files trigger a kernel panic on the MDS with:
>
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) ASSERTION( !((unsigned
> long)addr & ~(~(((1UL) << 12)-1))) ) failed:
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) LBUG
> Kernel panic - not syncing: LBUG
>
> The servers are lustre 2.12.8 on OpenZFS 0.8.5 on CentOS 7.9. The output
> from "lfs getstripe -v badfile" is attached.
>
> I can use lfs find to search for files with these bad extent endpoint
> values, then move them to a quarantine area on the same FS. This will allow
> the rest of the system to stay up (hopefully) but recovering the data is
> still needed.
>
> Thanks!
> Nate
>
> --
>
> Dr. Nathan Crawford nathan.crawford@uci.edu
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall Office: 152 Rowland Hall
> University of California, Irvine Phone: 949-824-1380
> Irvine, CA 92697-2025, USA
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
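
The "16 EiB - 2 GiB" figure quoted above can be checked with a little arithmetic. The sketch below is only an illustration of one plausible mechanism (a value passing through a signed 32-bit integer before being widened to an unsigned 64-bit extent); the thread does not say this is the actual cause in LU-16152, and the helper name sign_extend_32_to_u64 is hypothetical. It only demonstrates the case of an input of exactly 2 GiB.

```python
GIB = 1 << 30   # 2**30 bytes
EIB = 1 << 60   # 2**60 bytes, so 16 EiB == 2**64


def sign_extend_32_to_u64(value: int) -> int:
    """Keep the low 32 bits, read them as a signed 32-bit integer,
    then reinterpret the result as an unsigned 64-bit integer.

    Hypothetical helper: an assumed model of the overflow, not code
    taken from Lustre.
    """
    low = value & 0xFFFFFFFF
    if low & 0x80000000:          # sign bit set: value is negative as int32
        low -= 1 << 32
    return low & 0xFFFFFFFFFFFFFFFF


bad_extent = sign_extend_32_to_u64(2 * GIB)

# 2 GiB is 0x80000000, which becomes 2**64 - 2**31 under this model.
print(hex(bad_extent))
print(bad_extent == 16 * EIB - 2 * GIB)

# Values below 2 GiB pass through unchanged under the same model.
print(sign_extend_32_to_u64(1 * GIB) == 1 * GIB)
```

Under this assumed model, an extent boundary of exactly 2 GiB lands on 2^64 - 2^31 bytes, which is the "16 EiB - 2 GiB" endpoint described in the message.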