[prev in list] [next in list] [prev in thread] [next in thread]
List: openjdk-serviceability-dev
Subject: Re: RFR 8240902: JDI shared memory connector can use already closed Handles
From: "serguei.spitsyn () oracle ! com" <serguei ! spitsyn () oracle ! com>
Date: 2020-03-23 18:22:38
Message-ID: 68e2e091-7bfd-8501-f8bf-60b55d64b9af () oracle ! com
[Download RAW message or body]
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Hi Patricio,<br>
<br>
The update looks good.<br>
<br>
Thanks,<br>
Serguei<br>
<br>
<br>
On 3/18/20 23:18, Patricio Chilano wrote:<br>
</div>
<blockquote type="cite"
cite="mid:e1a2d907-3907-0f0e-31f3-221338549a5e@oracle.com"> Hi
David,<br>
<br>
<div class="moz-cite-prefix">On 3/18/20 8:10 PM, David Holmes
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:8928590e-6516-051e-aa84-098a6fdc9d45@oracle.com">Hi
Patricio, <br>
<br>
On 19/03/2020 6:44 am, Patricio Chilano wrote: <br>
<blockquote type="cite">Hi David, <br>
<br>
On 3/18/20 4:27 AM, David Holmes wrote: <br>
<blockquote type="cite">Hi Patricio, <br>
<br>
On 18/03/2020 6:14 am, Patricio Chilano wrote: <br>
<blockquote type="cite">Hi all, <br>
<br>
Please review the following patch: <br>
<br>
Bug: <a class="moz-txt-link-freetext"
href="https://bugs.openjdk.java.net/browse/JDK-8240902"
moz-do-not-send="true">https://bugs.openjdk.java.net/browse/JDK-8240902</a>
<br>
Webrev: <a class="moz-txt-link-freetext"
href="http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/"
moz-do-not-send="true">http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/</a>
<br>
<br>
Calling closeConnection() on an already created/opened
connection includes calls to CloseHandle() on objects that
can still be used by other threads. This can lead to
either undefined behavior or, as detailed in the bug
comments, changes of state of unrelated objects. </blockquote>
<br>
This was a really great find! <br>
</blockquote>
Thanks! : ) <br>
<br>
<blockquote type="cite">
<blockquote type="cite">This issue was found while debugging
the reason behind some jshell test failures seen after
pushing 8230594. Not as important, but there are also
calls to closeStream() from createStream()/openStream()
when failing to create/open a stream that will return
after executing "CHECK_ERROR(enterMutex(stream, NULL));"
without closing the intended resources. Then, calling
closeConnection() could assert if the reason of the
previous failure was that the stream's mutex failed to be
created/opened. These patch aims to address these issues
too. <br>
</blockquote>
<br>
Patch looks good in general. The internal reference count
guards deletion of the internal resources, and is itself
safe because never actually delete the connection. Thanks
for adding the comment about this aspect. <br>
<br>
A few items: <br>
<br>
Please update copyright year before pushing. <br>
</blockquote>
Done. <br>
<br>
<blockquote type="cite">Please align
ENTER_CONNECTION/LEAVE_CONNECTION macros the same way as
STREAM_INVARIANT. <br>
</blockquote>
Done. <br>
<br>
<blockquote type="cite"> 170 unsigned int refcount; <br>
171 jint state; <br>
<br>
I'm unclear about the use of stream->state and
connection->state as guards - unless accessed under a
mutex these would seem to at least need acquire/release
semantics. <br>
<br>
Additionally the reads of refcount would also seem to need
to some form of memory synchronization - though the Windows
docs for the Interlocked* API does not show how to simply
read such a variable! Though I note that the
RtlFirstEntrySList method for the "Interlocked Singly Linked
Lists" API does state "Access to the list is synchronized on
a multiprocessor system." which suggests a read of such a
variable does require some form of memory synchronization! <br>
</blockquote>
In the case of the stream struct, the state field is protected
by the mutex field. It is set to STATE_CLOSED while holding
the mutex, and threads that read it must acquire the mutex
first through sysIPMutexEnter(). For the cases where
sysIPMutexEnter() didn't acquire the mutex, we will return
something different than SYS_OK and the call will exit
anyways. All this behaves as before, I didn't change it. <br>
</blockquote>
<br>
Thanks for clarifying. <br>
<br>
<blockquote type="cite">The refcount and state that I added to
the SharedMemoryConnection struct work together. For a thread
closing the connection, setting the connection state to
STATE_CLOSED has to happen before reading the refcount (more
on the atomicity of that read later). That's why I added the
MemoryBarrier() call; which I see it's better if I just move
it to after setting the connection state to closed. For the
threads accessing the connection, incrementing the refcount
has to happen before reading the connection state. That's
already provided by the InterlockedIncrement() which uses a
full memory barrier. In this way if the thread closing the
connection reads a refcount of 0, then we know it's safe to
release the resources, since other threads accessing the
connection will see that the state is closed after
incrementing the refcount. If the read of refcount is not 0,
then it could be that a thread is accessing the connection or
not (it could have read a state connection of STATE_CLOSED
after incrementing the refcount), we don't know, so we can't
release anything. Similarly if the thread accessing the
connection reads that the state is not closed, then we know
it's safe to access the stream since anybody closing the
connection will still have to read refcount which will be at
least 1. <br>
As for the atomicity of the read of refcount, from
<a class="moz-txt-link-freetext"
href="https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access"
moz-do-not-send="true">https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access</a>,
it states that "simple reads and writes to properly-aligned
32-bit variables are atomic operations". Maybe I should
declare refcount explicitly as DWORD32? <br>
</blockquote>
<br>
It isn't the atomicity in question with the naked read but the
visibility. Any latency in the visibility of the store done by
the InterLocked*() function should be handled by the retry loop,
but what is to stop the C++ compiler from hoisting the read of
refcount out of the loop? It isn't even volatile (which has a
stronger meaning in VS than regular C+++). <br>
</blockquote>
I see what you mean now, I was thinking on atomicity and order of
operations but didn't consider the visibility of that read. Yes,
if the compiler decides to be smart and hoist the read out of the
loop we might never notice that it is safe to release those
resources and we would leak them for no reason. I see from the
windows docs(<a
href="https://docs.microsoft.com/en-us/cpp/c-language/type-qualifiers"
moz-do-not-send="true">https://docs.microsoft.com/en-us/cpp/c-language/type-qualifiers</a>)
that declaring it volatile as you pointed out should be enough to
prevent that.<br>
<br>
<blockquote type="cite"
cite="mid:8928590e-6516-051e-aa84-098a6fdc9d45@oracle.com">
<blockquote type="cite">Instead of having a refcount we could
have done something similar to the stream struct and protect
access to the connection through a mutex. To avoid serializing
all threads we could have used SRW locks and only the one
closing the connection would do AcquireSRWLockExclusive(). It
would change the state of the connection to STATE_CLOSED,
close all handles, and then release the mutex.
ENTER_CONNECTION() and LEAVE_CONNECTION() would acquire and
release the mutex in shared mode. But other that maybe be more
easy to read I don't think the change will be smaller. <br>
<br>
<blockquote type="cite"> 413 while (attempts>0) { <br>
<br>
spaces around > <br>
</blockquote>
Done. <br>
<br>
<blockquote type="cite">If the loop at 413 never encounters a
zero reference_count then it doesn't close the events or the
mutex but still returns SYS_OK. That seems wrong but I'm not
sure what the right behaviour is here. <br>
</blockquote>
I can change the return value to be SYS_ERR, but I don't think
there is much we can do about it unless we want to wait
forever until we can release those resources. <br>
</blockquote>
<br>
SYS_ERR would look better, but I see now that the return value
is completely ignored anyway. So we're just going to leak
resources if the loop "times out". I guess this is the best we
can do. <br>
</blockquote>
Here is v2 with the corrections:<br>
<br>
Full: <a
href="http://cr.openjdk.java.net/~pchilanomate/8240902/v2/webrev/"
moz-do-not-send="true">http://cr.openjdk.java.net/~pchilanomate/8240902/v2/webrev/</a><br>
Inc: <a
href="http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/"
moz-do-not-send="true">http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/</a>
(not sure why the indent fixes are not highlighted as changes
but the Frames view does show they changed)<br>
<br>
I'll give it a run on mach5 adding tier5 as Serguei suggested.<br>
<br>
<br>
Thanks,<br>
Patricio<br>
<blockquote type="cite"
cite="mid:8928590e-6516-051e-aa84-098a6fdc9d45@oracle.com">Thanks,
<br>
David <br>
<br>
<blockquote type="cite"> <br>
<blockquote type="cite">And please wait for serviceability
folk to review this. <br>
</blockquote>
Sounds good. <br>
<br>
<br>
Thanks for looking at this David! I will move the
MemoryBarrier() and change the refcount to be DWORD32 if you
are okay with that. <br>
<br>
<br>
Thanks, <br>
Patricio <br>
<blockquote type="cite">Thanks, <br>
David <br>
----- <br>
<br>
<blockquote type="cite">Tested in mach5 with the current
baseline, tiers1-3 and several runs of
open/test/langtools/:tier1 which includes the jshell tests
where this connector is used. I also applied patch <a
class="moz-txt-link-freetext"
href="http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev"
moz-do-not-send="true">http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev</a>
mentioned in the comments of the bug, on top of the
baseline and run the langtool tests with and without this
fix. Without the fix running around 30 repetitions already
shows failures in tests
jdk/jshell/FailOverExecutionControlTest.java and
jdk/jshell/FailOverExecutionControlHangingLaunchTest.java.
With the fix I run several hundred runs and saw no
failures. Let me know if there is any additional testing I
should do. <br>
<br>
As a side note, I see there are a couple of open issues
related with jshell failures (8209848) which could be
related to this bug and therefore might be fixed by this
patch. <br>
<br>
Thanks, <br>
Patricio <br>
<br>
</blockquote>
</blockquote>
<br>
</blockquote>
</blockquote>
<br>
</blockquote>
<br>
</body>
</html>
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic