List:       linux-edac
Subject:    RE: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine checks in kernel context
From:       "Luck, Tony" <tony.luck () intel ! com>
Date:       2022-10-31 19:20:38
Message-ID: SJ1PR11MB6083564E9626FFB4681CA3B7FC379 () SJ1PR11MB6083 ! namprd11 ! prod ! outlook ! com

> Well, if one were sane, one would assume that one would expect to see a
> stack dump when the machine panics, right? I mean, it is only fair...

A stack dump from a machine check wasn't at all useful until h/w and Linux started
supporting recoverable machine checks. The stack dump is there to help diagnose
and fix s/w problems. But a machine check isn't a software problem.

So I was pretty happy with the status quo of not getting a stack dump from
a machine check panic.

With recoverable machine checks there are some cases where there might
be an opportunity to change the kernel to avoid a crash. See my patches that
akpm just took into the "mm" tree to recover when the kernel hits poison during
a copy-on-write:

https://lore.kernel.org/all/20221021200120.175753-1-tony.luck@intel.com/

or the patches from Google to recover when khugepaged hits poison:

https://lore.kernel.org/linux-mm/20221010160142.1087120-1-jiaqiyan@google.com/


To identify additional opportunities to make the kernel more resilient, it
would be useful to get a kernel stack trace in the specific case of a
recoverable data consumption machine check while executing in the kernel.
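
As a rough illustration of that condition only (this is a standalone model,
not the code from the patch; the struct, enum, and function names below are
hypothetical and do not exist in the kernel), the decision could be sketched as:

```c
#include <stdbool.h>

/* Hypothetical model of the decision; none of these names are kernel APIs. */
enum mce_context {
	CTX_USER,	/* machine check hit while running user code */
	CTX_KERNEL,	/* machine check hit while running kernel code */
};

struct mce_event {
	bool recoverable;	/* h/w reports the error as recoverable */
	bool data_consumption;	/* poison was consumed, not merely detected */
	enum mce_context ctx;	/* where the CPU was executing */
};

/*
 * Dump the stack only in the one case where the trace can point at kernel
 * code that might be changed to recover instead of crashing.
 */
static bool should_dump_stack(const struct mce_event *e)
{
	return e->recoverable && e->data_consumption && e->ctx == CTX_KERNEL;
}
```

In this model a fatal machine check, or a recoverable one taken in user
context, still gets no stack dump, which matches the status quo described above.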

> And there's an attempt:
>
> #ifdef CONFIG_DEBUG_BUGVERBOSE
>         /*
>          * Avoid nested stack-dumping if a panic occurs during oops processing
>          */
>         if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
>                 dump_stack();
> #endif
>
> but that oops_in_progress thing is stopping us: 

...

> it hints that panic() might've been called twice for oops_in_progress to
> be already 1 on entry.
>
> I guess we need to figure out why that is...

It might be interesting, but it is a distraction from the goal of my patch,
which is to dump the stack only for recoverable machine checks in kernel code.

-Tony
