'Re: Question about stack overflows in native code'

List:       openjdk-hotspot-runtime-dev
Subject:    Re: Question about stack overflows in native code
From:       Thomas Stüfe <thomas.stuefe () gmail ! com>
Date:       2017-04-05 8:36:33
Message-ID: CAA-vtUzdSZ=hXONm4XBySiGmZTAy5-eKc3joAbmyBBZLom+-Xw () mail ! gmail ! com

Hi Coleen,

On Tue, Apr 4, 2017 at 9:22 PM, <coleen.phillimore@oracle.com> wrote:

>
> I have a couple of comments below but don't have that much to add to
> Frederic's email.
>
>
> On 4/4/17 3:04 PM, Thomas Stüfe wrote:
>
>> Hi Fred,
>>
>> On Tue, 4 Apr 2017 at 20:56, Frederic Parain <frederic.parain@oracle.com>
>> wrote:
>>
>>
>>> On 04/04/2017 02:31 PM, Thomas Stüfe wrote:
>>>
>>>> Hi David,
>>>>
>>>> On Tue, Apr 4, 2017 at 12:11 PM, David Holmes
>>>> <david.holmes@oracle.com> wrote:
>>>>
>>>>      On 4/04/2017 6:30 PM, Thomas Stüfe wrote:
>>>>
>>>>          Hi David,
>>>>
>>>>          On Mon, Apr 3, 2017 at 11:02 PM, David Holmes
>>>>          <david.holmes@oracle.com> wrote:
>>>>
>>>>              Just to follow up on what Fred responded ...
>>>>
>>>>              On 4/04/2017 4:42 AM, Thomas Stüfe wrote:
>>>>
>>>>                  Hi Fred,
>>>>
>>>>                  thanks! Some more questions inline.
>>>>
>>>>                  On Mon, Apr 3, 2017 at 8:29 PM, Frederic Parain
>>>>                  <frederic.parain@oracle.com> wrote:
>>>>
>>>>                      When the yellow zone is hit and the thread state is
>>>>                      not in _thread_in_java (which means thread state is
>>>>                      _thread_in_native or _thread_in_vm), the yellow zone
>>>>                      is silently disabled and the thread is allowed to
>>>>                      resume its execution.
>>>>
>>>>
>>>>                  Disabled by whom exactly?
>>>>
>>>>                  Normally, this would be done in the signal handler, but
>>>>                  that requires enough stack space to run. AFAIK jitted or
>>>>                  interpreted code does stack banging in order to trigger
>>>>                  the yellow-page-segfault at a point where there are
>>>>                  enough pages left on the stack to invoke the signal
>>>>                  handler (n shadow pages before), but that is not
>>>>                  guaranteed to work with native C-compiled code, no?
>>>>
>>>>
>>>>              The stack banging is done to ensure the stack overflow is
>>>>              hit before we start doing the actual operation. The sizes
>>>>              of the yellow and red zones are supposed to be sufficient
>>>>              to allow the respective signal processing and response to
>>>>              be executed.
>>>>
>>>>
>>>>          And the size of the shadow pages should be sufficient to invoke
>>>>          the initial signal handler, which will unprotect the yellow or
>>>>          red zone, right? So, back to my original question, if native C
>>>>          code does not bang the stack but simply runs into the yellow
>>>>          zone, the process will simply die, or?
>>>>
>>>>
>>>>      I thought Fred already answered that. The signal handler simply
>>>>      disables the yellow zone and returns:
>>>>
>>>>                } else {
>>>>                  // Thread was in the vm or native code.  Return and try to finish.
>>>>                  thread->disable_stack_yellow_reserved_zone();
>>>>                  return 1;
>>>>                }
>>>>
>>>>
>>>> But in order to do this it needs at least enough stack space to invoke
>>>> the signal handler and call mprotect on the yellow page, right? So, for
>>>> native code compiled by a C-compiler, this may or may not work,
>>>> depending on whether and what form of stack-banging code the C-Compiler
>>>> does generate? (It may generate some sort of stack banging to trigger
>>>> the OS guard page and do OS stack overflow handling, or it may just
>>>> blindly run into the yellow page when pushing a new frame).
>>>>
>>> The yellow zone by itself doesn't provide protection against dying
>>> from a stack overflow. It has been designed to work in coordination
>>> with the stack banging. With stack banging, a thread will try to
>>> "touch" some pages down its stack *before* it really needs them.
>>> This way, if the yellow zone is hit during the stack banging, there's
>>> enough remaining free stack space before the yellow zone to execute
>>> the signal handler (which doesn't need a lot of stack space). And
>>> if the signal handler can disable the yellow zone, then the thread
>>> has enough stack space to perform more complex operations like
>>> generating and throwing a StackOverflowError.
>>>
>>> Without stack banging, the thread will use its stack space until
>>> the yellow zone is hit, and usually when it is hit, the process
>>> will die because there wasn't enough remaining space to execute
>>> the signal handler.
>>>
>>> In Java code, stack banging is performed each time a method is
>>> invoked, to ensure the thread has enough stack space to execute
>>> it (the class file provides information about the maximum number
>>> of local variables and the deepest execution stack the method
>>> will need). Of course, with JIT-compiled code, stack banging is
>>> performed differently because of inlining.
>>>
>>> For code which cannot perform stack banging on method boundaries,
>>> like the VM code, the approach is different. Each time a thread
>>> is about to call into the VM runtime, a stack banging is performed
>>> using the StackShadowPages sizing. The shadow pages are supposed to
>>> represent a stack space big enough to execute *any* call to the VM
>>> runtime. So, if this stack banging passes, all the runtime code
>>> is executed without any additional check, hoping that shadow
>>> pages have been sized correctly.
>>>
>>
> The StackShadowPages mechanism is what is supposed to protect you in this
> situation, but it's been known to be difficult to size correctly especially
> at a customer site, and you may not want to globally have a large number of
> shadow pages.
>
>>
>>> You can try to add, in your native code, some stack banging code,
>>> or a logic computing the remaining stack space before the
>>> guard pages. Not necessarily on every method call, but on well
>>> known points in your code. The hardest part is usually to know
>>> how much stack space your native code will need. It's possible
>>> to start with a big over-estimating value, and refine it later.
>>> The sizing of the different zones has been determined with a
>>> trial and error process which still continues today as the
>>> JVM code and native JDK code evolve.
>>>
>>> Fred
>>>
>>
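
The "logic computing the remaining stack space" that Fred describes could
look roughly like the sketch below. It is only an illustration and assumes
Linux with glibc, where pthread_getattr_np() (a non-portable extension)
reports the bounds of the calling thread's stack, and a stack that grows
downward. The names stack_bytes_remaining, has_stack_headroom and
reserve_bytes are made up for this example. reserve_bytes stands for the
space taken by the VM's guard zones at the low end of the stack; its value
depends on the VM's zone settings and is not derived here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* pthread_getattr_np is a glibc extension */
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Repeated here so the sketch still compiles if the feature macro above
 * was defined too late; identical to the glibc declaration. */
extern int pthread_getattr_np(pthread_t thread, pthread_attr_t *attr);

/* Bytes between the current stack position and the lowest address of the
 * calling thread's stack, or 0 if the bounds cannot be determined.
 * Assumes the stack grows towards lower addresses. */
static size_t stack_bytes_remaining(void) {
    pthread_attr_t attr;
    void *stack_low = NULL;
    size_t stack_size = 0;
    char marker;                /* its address approximates the stack pointer */
    int rc;

    if (pthread_getattr_np(pthread_self(), &attr) != 0) {
        return 0;
    }
    rc = pthread_attr_getstack(&attr, &stack_low, &stack_size);
    pthread_attr_destroy(&attr);
    if (rc != 0) {
        return 0;
    }

    uintptr_t here = (uintptr_t)&marker;
    uintptr_t low = (uintptr_t)stack_low;
    return here > low ? (size_t)(here - low) : 0;
}

/* Returns 1 if at least needed_bytes are free above a reserve that the
 * caller sets aside for guard zones (hypothetical parameter), else 0. */
static int has_stack_headroom(size_t needed_bytes, size_t reserve_bytes) {
    size_t remaining = stack_bytes_remaining();
    if (remaining <= reserve_bytes) {
        return 0;
    }
    return (remaining - reserve_bytes) >= needed_bytes;
}
```

Native code would call has_stack_headroom() at a few well-known points, for
example before entering a deep recursion, and bail out with an error when
it returns 0. As Fred says, the hard part is picking needed_bytes, so
starting with a generous over-estimate seems reasonable.
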
> Of course, there's always alternate signal stacks.  It's been a couple of
> years since they came up last.
>
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2011-August/002403.html
>
>
A very interesting read. We tried the same in early 2006 with the initial
(commercial) jvm port to AIX, but abandoned the attempt because we could
not get it to work in a stable fashion.

I really dig Yasumasa's idea of allocating the alternate stack with
alloca(), this is quite neat :) But apparently his patch never found its
way into the main code line, right?

Kind Regards, Thomas



> Coleen
>
>> Thanks a lot for this excellent and complete explanation!
>>
>> Kind regards, Thomas
>>
>>
>>>> If stack space is not sufficient to invoke the signal handler to
>>>> unprotect the yellow/red page, process would silently die, right?
>>>>
>>> Correct.
>>>
>>>>      If it keeps going and hits the red zone then the red zone will be
>>>>      disabled, we print some error messages, and then should call
>>>>      VMError::report_and_die(). But I admit the signal handler logic is
>>>>      quite complex so I may have missed something. :)
>>>>
>>>>
>>>>
>>>>              But that assumes you simply advance into the guard zones -
>>>>              if your native code suddenly jumped to the end of the yellow
>>>>              zone for example, then signal processing would hit the red
>>>>              zone; similarly if you jump to the end of the red zone then
>>>>              signal processing will hit the OS guard page. If you jump
>>>>              past all guard pages you simply die.
>>>>
>>>>
>>>>          Thank you!
>>>>
>>>>          See also my response to Fred. We wondered whether exporting a
>>>>          simple JNI helper function to check the stack size on behalf
>>>>          of the native code would be something helpful, for cooperative
>>>>          native code at least.
>>>>
>>>>
>>>>      Perhaps. Haven't really thought about it. :)
>>>>
>>>>
>>>> We may experiment a bit. The VM silently dying on native code stack
>>>> overflows is a huge annoyance, especially since it depends on the
>>>> user-adjustable stack size. Typically not even a hs_err file is
>>>> generated.
>>>>
>>>> Actually not a theoretical problem, I am currently running into this:
>>>> http://www-01.ibm.com/support/docview.wss?uid=swg1IV23033 for our
>>>> commercial code base at a customer (not j9 obviously), and while the
>>>> recursion in the vector calculations can be fixed, it would be nice to
>>>> at least have an hs_err file...
>>>>
>>>>      Cheers,
>>>>      David
>>>>
>>>>
>>>> Kind Regards, Thomas
>>>>
>>>>
>>>>          Kind Regards, Thomas
>>>>
>>>>
>>>>              David
>>>>
>>>>
>>>>                  (not just a theory, we have a test case here where a
>>>>                  stack overflow in native code just silently kills the
>>>>                  process.)
>>>>
>>>>                  I guess it may work accidentally if the C-compiled code
>>>>                  itself does some form of stack banging when establishing
>>>>                  frames, in order to detect OS stack overflows? Very fuzzy
>>>>                  here. But whatever the C-compiled code does, it has no
>>>>                  notion about how much space we need to invoke the signal
>>>>                  handler and handle stack overflows, no?
>>>>
>>>>                      When the red zone is hit, whatever the current thread
>>>>                      state is, the red zone is disabled and
>>>>                      VMError::report_and_die() is called, which should
>>>>                      generate a hs_err file unless the generation of the
>>>>                      error file requires more memory than the red zone
>>>>                      provides.
>>>>
>>>>                      Fred
>>>>
>>>>
>>>>                  Thanks, Thomas
>>>>
>>>>
>>>>
>>>>
>>>>                      On 04/03/2017 02:08 PM, Thomas Stüfe wrote:
>>>>
>>>>                          Hi,
>>>>
>>>>                          Today we wondered what would happen when a stack
>>>>                          overflow occurs in native code running in a java
>>>>                          thread (an attached thread or one created by the
>>>>                          VM).
>>>>
>>>>                          In that case yellow and red pages are in place,
>>>>                          but this would not help much, would it, because
>>>>                          the native code would not do any stack banging?
>>>>
>>>>                          So, native code would hit the yellow page, and
>>>>                          then there would probably not be enough space
>>>>                          left on the stack to invoke the signal handler.
>>>>                          The result would be immediate VM death - not even
>>>>                          an hs-err file - is that correct?
>>>>
>>>>                          Also, we would hit our own yellow page, not the
>>>>                          guard page the OS may or may not have established,
>>>>                          so - on UNIX - this would show up as
>>>>                          "Segmentation Fault", not "Stack Overflow", or?
>>>>
>>>>                          Thank you,
>>>>
>>>>                          Thomas
>>>>
>>>>
>>>>
>>>>
>>>>
>