List: linux-edac
Subject: RE: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine checks in kernel context
From: "Luck, Tony" <tony.luck () intel ! com>
Date: 2022-10-31 19:20:38
Message-ID: SJ1PR11MB6083564E9626FFB4681CA3B7FC379 () SJ1PR11MB6083 ! namprd11 ! prod ! outlook ! com
> Well, if one were sane, one would assume that one would expect to see a
> stack dump when the machine panics, right? I mean, it is only fair...
A stack dump from a machine check wasn't at all useful until h/w and Linux started
supporting recoverable machine checks. The stack dump is there to help diagnose
and fix s/w problems. But a machine check isn't a software problem.
So I was pretty happy with the status quo of not getting a stack dump from
a machine check panic.
With recoverable machine checks there are some cases where there might
be an opportunity to change the kernel to avoid a crash. See my patches that
akpm just took into the "mm" tree to recover when the kernel hits poison during
a copy-on-write:
https://lore.kernel.org/all/20221021200120.175753-1-tony.luck@intel.com/
or the patches from Google to recover when khugepaged hits poison:
https://lore.kernel.org/linux-mm/20221010160142.1087120-1-jiaqiyan@google.com/
To identify additional opportunities to make the kernel more resilient, it would be useful
to get a kernel stack trace in the specific case of a recoverable data consumption
machine check while executing in the kernel.
> And there's an attempt:
>
> #ifdef CONFIG_DEBUG_BUGVERBOSE
> 	/*
> 	 * Avoid nested stack-dumping if a panic occurs during oops processing
> 	 */
> 	if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
> 		dump_stack();
> #endif
>
> but that oops_in_progress thing is stopping us:
...
> it hints that panic() might've been called twice for oops_in_progress to
> be already 1 on entry.
>
> I guess we need to figure out why that is...
It might be interesting, but it is a distraction from the goal of my patch, which is
to dump the stack only for recoverable machine checks in kernel code.
-Tony
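
For readers following the quoted guard, here is a minimal userspace sketch of why a non-zero oops_in_progress on entry would suppress the dump. It assumes that panic() calls bust_spinlocks(1) before reaching the guard and that this call increments oops_in_progress; neither detail is shown in the quoted snippet. The function panic_would_dump_stack() and its parameters are hypothetical stand-ins for illustration, not kernel interfaces.

```c
#include <stdbool.h>

/*
 * Userspace model of the quoted guard in panic(). Nothing here is kernel
 * code; the names are illustrative only.
 *
 * oops_on_entry: value of oops_in_progress when panic() is entered
 * tainted_die:   models test_taint(TAINT_DIE)
 */
static bool panic_would_dump_stack(int oops_on_entry, bool tainted_die)
{
	int oops_in_progress = oops_on_entry;

	/* Assumption: bust_spinlocks(1) runs first and bumps the counter. */
	oops_in_progress++;

	/* Same condition as the quoted CONFIG_DEBUG_BUGVERBOSE block. */
	return !tainted_die && oops_in_progress <= 1;
}
```

Under these assumptions, panic_would_dump_stack(0, false) is true, while panic_would_dump_stack(1, false) is false, which matches the hint in the quoted text that oops_in_progress was already 1 on entry. A set TAINT_DIE flag blocks the dump regardless of the counter.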