'Re: [Fastboot] Re: [FYI] kexec: design point and implementation for'

List:       osdl-fastboot
Subject:    Re: [Fastboot] Re: [FYI] kexec: design point and implementation for
From:       ebiederm () xmission ! com (Eric W. Biederman)
Date:       2005-12-23 15:35:32
Message-ID: m1u0cz6d2z.fsf () ebiederm ! dsl ! xmission ! com

Milton Miller <miltonm@bga.com> writes:

> On Dec 22, 2005, at 9:38 AM, Eric W. Biederman wrote:
>
> We do nothing to address the ongoing dma so that is not a valid reason.  And we
> could do the checksum in arch code after stopping the other cpus if we can (not
> all architectures have a non-maskable interrupt controllable by the kernel).

Loading and running the new kernel at a reserved address addresses
ongoing DMA.  It moves the odds of having rogue DMA hit us to almost
0.  That was considered more reliable than just about everything
else short of a reset.

Probably the biggest reason for not doing it in the kernel is
that we don't need to.  If the kernel has crashed you are in trouble.
If you can get a crash dump that is great.  If things get corrupted
to the point you can't safely get a crashdump that is a problem.

Reporting that the world is so totally broken that we can't even take
a crashdump would be nice, but it adds very little information to
the picture.

If we are so totally messed up that we write (DMA or cpu) into a
completely reserved area of memory then I doubt we can do anything
useful.

Then there is another element.  Having user space do it is much closer
to end-to-end than having the kernel do it.  It allows bugs in the
kernel loader to be trapped and caught.
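
To make the end-to-end point concrete, here is a minimal sketch in
Python.  It is not the kexec-tools or kernel code; every name in it
(expected_digest, load_image, verify_before_jump) is hypothetical and
SHA-256 is just an assumed choice of checksum.  The point it shows is
that the digest is computed from the source data before the loader
touches it, and checked against what actually sits in the reserved
memory just before the jump, so a loader bug fails the same check that
later corruption would.

```python
import hashlib


def expected_digest(blobs):
    """User-space side: hash the source data before the loader touches it."""
    h = hashlib.sha256()
    for blob in blobs:
        h.update(blob)
    return h.digest()


def load_image(memory, base, blobs):
    """Hypothetical stand-in for the kernel loader.

    Copies each blob back-to-back into `memory` (a bytearray) starting at
    `base` and returns the list of (start, length) segments it wrote.
    """
    segments = []
    offset = base
    for blob in blobs:
        memory[offset:offset + len(blob)] = blob
        segments.append((offset, len(blob)))
        offset += len(blob)
    return segments


def verify_before_jump(memory, segments, expected):
    """Run just before starting the new kernel: re-hash what is really in memory."""
    h = hashlib.sha256()
    for start, length in segments:
        h.update(memory[start:start + length])
    return h.digest() == expected
```

If verify_before_jump() returns False the only sane action is to stop,
whether the cause was the loader, stray DMA, or a wild CPU write.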

If we get lots of silent failures where we are wondering if the
kernel is not crash dumping because of the checksum code, that is
the time to address it in the kernel.  Then we can address it in
the kernel and in user space :)  Doing anything about it now, I think,
is optimizing the wrong case.

Removing code and getting the kernel data paths to the absolute
minimum for creating a reliable crashdump is the goal.  Right
now on x86 we are doing about twice as much as we should be doing
with the apic code in there.

This is hard code to get to a minimum and hard code to review.  Once
it is solid we don't want to be playing around with it.  Keeping the
code paths to a minimum is very important.

So far the failure mode of kexec on panic is the right one.  If you
can't reliably take a crash dump, stop.  Don't corrupt my disk or
anything else further.

I guess I'm at a loss for why the information that
the checksum failed would be interesting.

Thinking it through, we do come in after we print the panic message, so
the odds are good that if printk works at the time of the crash
it continues to work for displaying an error.  Although we can
get information almost as good, and with much less risk, by indicating
in the existing print statement that we are going to use kexec-on-panic.

Truthfully the less we can depend on a kernel that is broken by
definition the more I will appreciate it.

Eric

_______________________________________________
fastboot mailing list
fastboot@lists.osdl.org
https://lists.osdl.org/mailman/listinfo/fastboot
