[prev in list] [next in list] [prev in thread] [next in thread]
List: openjdk-hotspot-runtime-dev
Subject: Re: RFR(L) 8153224 Monitor deflation prolong safepoints (CR7/v2.07/10-for-jdk14)
From: David Holmes <david.holmes () oracle ! com>
Date: 2019-10-31 4:00:44
Message-ID: 4636ebe3-b783-a104-0ea6-23ab7fb47e0a () oracle ! com
[Download RAW message or body]
Hi Dan,
First I've seen your additional tracking emails, but for now just want
to respond to the load_acquire discussion below ...
On 30/10/2019 6:18 am, Daniel D. Daugherty wrote:
> Hi David,
>
> Sorry for the delay in replying to this email. I spent most of last week
> migrating my development environment from my dying MBP13 to a new MBP13.
> I think I have everything recovered, but only time will tell...
>
> More below...
>
> On 10/23/19 12:00 AM, David Holmes wrote:
> > Hi Dan,
> >
> > Trimming to the open issues.
>
> I'll try to do the same (but was only able to trim one thing)...
>
>
> >
> > On 23/10/2019 6:46 am, Daniel D. Daugherty wrote:
> > > Okay, time to return to my uses of release_store() and load_acquire().
> > > Consider this bit of code:
> > >
> > > src/hotspot/share/runtime/synchronizer.cpp:
> > >
> > > L351: // Take an ObjectMonitor from the start of the specified
> > > list. Also
> > > L352: // decrements the specified counter. Returns NULL if none
> > > are available.
> > > L353: static ObjectMonitor*
> > > take_from_start_of_common(ObjectMonitor* volatile * list_p,
> > > L354: \
> > > int volatile * count_p) {
> > > L355: ObjectMonitor* next = NULL;
> > > L356: ObjectMonitor* take = NULL;
> > > L357: // Mark the list head to guard against A-B-A race:
> > > L358: if (!mark_list_head(list_p, &take, &next)) {
> > > L359: return NULL; // None are available.
> > > L360: }
> > > L361: // Switch marked list head to next (which unmarks the
> > > list head, but
> > > L362: // leaves take marked):
> > > L363: OrderAccess::release_store(list_p, next);
> > > L364: Atomic::dec(count_p);
> > > L365: // Unmark take, but leave the next value for any lagging
> > > list
> > > L366: // walkers. It will get cleaned up when take is
> > > prepended to
> > > L367: // the in-use list:
> > > L368: set_next(take, next);
> > > L369: return take;
> > > L370: }
> > >
> > >
> > > On L358, we call mark_list_head() and if that call returns true, then
> > > the calling thread is the "owner" of the list head. My idea is that
> > > once my thread "owns" the list head, I can use release_store() as a
> > > smaller hammer than cmpxchg() to sync out the new value ('next') as
> > > the new list head.
> > >
> > > Should I be doing something other than release_store() here?
> > >
> > > My thinking is that release_store() pairs up with the load_acquire()
> > > of any reader thread that's trying to simply walk the list. And that
> > > release_store() also pairs up with any _other_ writer thread that's
> > > trying to mark the list head using:
> > >
> > > mark_list_head() -> mark_next() -> cmpxchg()
> > >
> > > Here's mark_list_head():
> > >
> > > L194: // Mark the next field in the list head ObjectMonitor. If
> > > marking was
> > > L195: // successful, then the mid and the unmarked next field
> > > are returned
> > > L196: // via parameter and true is returned. Otherwise false is
> > > returned.
> > > L197: static bool mark_list_head(ObjectMonitor* volatile * list_p,
> > > L198: ObjectMonitor** \
> > > mid_p, ObjectMonitor** next_p) {
> > > L199: while (true) {
> > > L200: ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
> > > L201: if (mid == NULL) {
> > > L202: return false; // The list is empty so nothing to mark.
> > > L203: }
> > > L204: if (mark_next(mid, next_p)) {
> > > L205: if (OrderAccess::load_acquire(list_p) != mid) {
> > > L206: // The list head changed so we have to retry.
> > > L207: set_next(mid, *next_p); // unmark mid
> > > L208: continue;
> > > L209: }
> > > L210: // We marked next field to guard against races.
> > > L211: *mid_p = mid;
> > > L212: return true;
> > > L213: }
> > > L214: }
> > > L215: }
> > >
> > > which does a matching load_acquire() to get the current list head
> > > and calls mark_next() to try and mark it. It then calls load_acquire()
> > > again to verify that the list head hasn't changed while doing the
> > > mark_next().
> > >
> > > So here's mark_next():
> > >
> > > 161 // Mark the next field in an ObjectMonitor. If marking was
> > > successful,
> > > 162 // then the unmarked next field is returned via parameter and
> > > true is
> > > 163 // returned. Otherwise false is returned.
> > > 164 static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
> > > 165 // Get current next field without any marking value.
> > > 166 ObjectMonitor* next = (ObjectMonitor*)
> > > 167 ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
> > > 168 if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next)
> > > != next) {
> > > 169 return false; // Could not mark the next field or it was
> > > already marked.
> > > 170 }
> > > 171 *next_p = next;
> > > 172 return true;
> > > 173 }
> > >
> > > We use load_acquire() on L166-7 to make sure that we have the latest
> > > value in the next_om field. That load_acquire() matches up with the
> > > release_store() done by another thread to unmark the next field (see
> > > the set_next() function). Or it matches with the cmpxchg() done by
> > > another to mark the next field. Yes, we could do a regular load when
> > > we know that the field is only changed by cmpxchg(), but it could
> > > have been changed by release_store() and my thinking is that we need
> > > a matching load_acquire().
> > >
> > > We have to use cmpxchg() here to try and set the mark on the next field.
> > > If successful, then we return the value of the unmarked next field via
> > > the 'next_p' param and we return true. However, just because we marked
> > > the next field in the ObjectMonitor doesn't mean that we marked the list
> > > head. That's what the second load_acquire() on L205 above verifies.
> > >
> > >
> > > Okay, that was pretty gory on the details of my thinking! I think it's
> > > pretty clear that I'm doing load_acquire()/release_store() pairs on
> > > the list head pointers or the next_om fields to make sure that I match
> > > loads and stores of _just_ the list head pointer or the next_om field
> > > in question. I'm not trying affect other fields in the ObjectMonitor
> > > or at least I don't think I have to do that.
> > >
> > > Maybe I've misunderstood the load_acquire()/release_store() stuff
> > > (again)
> > > and this code is doing way too much! I would be happy to have my
> > > thinking
> > > about this load_acquire()/release_store() stuff corrected and the code
> > > simplified. Maybe in this sequence:
> > >
> > > mark_list_head() -> mark_next() -> cmpxchg()
> > >
> > > we don't need all the load_acquire() calls because the sequence only
> > > ends
> > > successfully with a cmpxchg() and that makes memory happy everywhere
> > > with
> > > just simple loads. Dunno. I definitely need more feedback here!
> >
> > Okay that was a massive deep dive :) but it is exactly what I was
> > referring to when I said:
> >
> > "My main comment on all the list management code pertains to the
> > difficulty in seeing which release_store is paired with which
> > load_acquire, and why."
>
> Yup. And that comment is exactly why I wrote the "massive deep dive"
> reply... so I _think_ I replied to the comment and thus proved the
> point that this stuff is difficult all at the same time. :-)
>
>
> > To understand the use of acquire/release/cmpxchg in detail you really
> > need to see all that code inlined/flattened so that you can see how
> > they actually arrange themselves.
>
> Agreed...
To be clear I can't see the code that way so it is difficult to give
concrete opinions on appropriateness of load_acquire/release_store at
individual code locations.
>
> > I'm immediately suspicious of a load_acquire that is just before a
> > cmpxchg on the same field because the cmpxchg will subsume any memory
> > affects of the load-acquire.
>
> Okay, I'm trying to figure out if this comment is meant for L166, L167
> and L168 of mark_next():
>
> 165 // Get current next field without any marking value.
> 166 ObjectMonitor* next = (ObjectMonitor*)
> 167 ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
> 168 if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) !=
> next) {
> 169 return false; // Could not mark the next field or it was
> already marked.
> 170 }
> 171 *next_p = next;
> 172 return true;
>
> or if this is just a general comment or both... :-)
Both.
>
> > But the devil is in the detail.
>
> Agreed and with this project there are so many details. :-(
>
>
> > If a field can be set by release_store, and there are previous stores
> > related to that which must be visible, then a read of that field can't
> > be via a plain load (but either load_acquire or as part of cmpxchg).
>
> As the comment on L166 says, the goal of L167 and L168 is to get the
> current next field without any marking value. Since the next field can be
> changed by either cmpxchg() or release_store() and we have no idea which
> one made the most recent update, we have to use a load_acquire() to get
> the latest value of the next field.
>
> As for the cmpxchg() on L168, the potential update operation has to be
> done with cmpxchg() in order to safely and exclusively allow only one
> thread to mark the ObjectMonitor's next field (or no threads if the next
> field is already marked).
>
> So... I think I have good reasons for a load_acquire() to be immediately
> followed by a cmpxchg(), in this case anyway.
Actually no. :) The purpose of a load_acquire is to ensure that
subsequent loads of other variables will return values written before
the release_store that is paired with the load_acquire. In this code
there are no subsequent loads between the load_acquire and the cmpxchg.
The cmpxchg itself provides full bi-directional fence semantics, so it
deals with any loads after the cmpxchg. Hence the acquire part of the
load_acquire is simply not needed.
>
> Of course, I'm assuming that this is the load_acquire(), cmpxchg() pair
> that you are suspicious of. It could be some other pair or it could be
> the pairs in general.
Anywhere there are no intervening loads, between the load_acquire, and
the cmpxchg, the acquire part is not needed.
>
> > But there are so many fields at play here that it is very difficult to
> > keep track of everything.
> >
> > You mention above:
> >
> > "My idea is that once my thread "owns" the list head, I can use
> > release_store() as a smaller hammer than cmpxchg() to sync out the new
> > value ('next') as the new list head."
> >
> > and I tend to agree that the second cmpxchg is certainly not needed as
> > there cannot be any contended update after you've marked the "next"
> > field.
>
> Not quite as strong of an agreement as I was hoping for, but I'll take
> it. :-)
>
> Is there a different memory update/sync operation I should be using for my
> "smaller hammer"? I don't think so, but I could be missing something...
>
> For example, I don't think I can switch from OrderAccess::release_store()
> to Atomic::store() and from OrderAccess::load_acquire() to Atomic::load()
> because I'll lose the {release, acquire} matching up component.
Well you wouldn't need Atomic operations regardless.
Whether you could use plain load/store depends, as we've already agreed,
on whether other accesses are dependent on the release/acquire
semantics. And I can't see whether they are not with the current
structure of the code, so I have no concrete suggestions here so lets
move on.
>
> I also need to clarify one thing. In my reply to your previous set of
> comments, I wrote:
>
> > I think it's
> > pretty clear that I'm doing load_acquire()/release_store() pairs on
> > the list head pointers or the next_om fields to make sure that I match
> > loads and stores of _just_ the list head pointer or the next_om field
> > in question. I'm not trying affect other fields in the ObjectMonitor
> > or at least I don't think I have to do that.
>
> Now that I've had time to think about it, I have to retract that last
> sentence. When doing these lock free list manipulations, we often will
> change some other field in the ObjectMonitor as part of that manipulation.
>
> For example, as part of deflation, we will clear the _object field when
> we extract the ObjectMonitor from the containing in-use list and before
> we prepend it to the global free list. A release_store() is used to update
> the next field in the (cur_mid_in_use) ObjectMonitor that refers to the
> deflated ObjectMonitor. That release_store() is important for sync'ing
> out the clearing of the _object field because once the deflated
> ObjectMonitor is unlinked from an in-use list, that field is no longer
> updated by GC.
Agreed.
>
> > In essence the marking is implementing a spin lock
>
> I like the "spin lock" analogy.
>
>
> > so once you have the "lock" you can use normal load/stores to make
> > updates
>
> Yup. I get it, e.g., we mark an ObjectMonitor so that we can deflate it
> and part of deflation is making some updates...
>
>
> > - but that assumes all updates occur under the "lock", that releasing
> > the "lock" also has the right barriers and that any reads sync
> > correctly with the writes. If any of that doesn't hold then
> > release_store and load_acquire will be needed rather than plain stores
> > and loads.
>
> I _think_ that all updates that we make as part of list manipulation are
> done after marking the ObjectMonitor's next field and before we unmark
> that same next field. So I _think_ the "lock" is properly held...
>
> Of course, it's always possible that I missed something, but that's what
> we have code reviewers for right? :-)
I love your sense of humour ;-)
>
> > So again you'd need to look at all the flattened code paths to see how
> > this all hangs together.
>
> For my load_acquire() calls I have matched them up with the corresponding
> release_store() call. And in some cases the load_acquire() call site
> matches
> up with a call site that does release_store() on one branch and cmpxchg()
> on the other. Yes, complicated and mind numbing... Sorry about that...
>
> I've also looked at the release_store() calls and matched them up with a
> corresponding load_acquire().
>
> However, I've looked at this code so many different times that I might be
> blind to some aspects of it so another pair of eyes looking for missing
> matches is always welcome.
Without seeing an inlined/flattened version of the code it's near
impossible (for me at least) to see the code paths in detail.
>
> >
> > > > src/hotspot/share/runtime/serviceThread.cpp
> > > >
> > > > I have some concerns that the serviceThread must now wakeup
> > > > periodically instead of only on demand. It has a lot of tasks to
> > > > check now and it is not obvious how complex or trivial those checks
> > > > are. We don't currently gather any statistics in this area so we
> > > > have no idea how long the serviceThread typically waits between
> > > > requests. The fact the serviceThread is woken on demand means that
> > > > the desire to be "checked at the same interval" seems exceedingly
> > > > unlikely - the serviceThread would have to waiting for extended
> > > > periods of time for the timed-wait to have any real affect on the
> > > > frequency of the async monitor deflation.
> > >
> > > Periodically waking up the ServiceThread is done to match the safepoint
> > > cleanup period of GuaranteedSafepointInterval when doing safepoint based
> > > deflation. Without waking up periodically, we could go a long time
> > > before
> > > doing any deflation via the ServiceThread and that would definitely be
> > > an observable change in behavior relative to safepoint based deflation.
> > > In some of my test runs, I had seen us go for 8-27 seconds without doing
> > > any async monitor deflation. Of course, that meant that there was
> > > more to
> > > cleanup when we did do it.
> >
> > How does that occur? If the ServiceThread is woken when there are
> > monitors to deflate then that implies that if it was not woken then
> > there were none to deflate. Or turning it around, if after waiting for
> > GuaranteedSafepointInterval ms the Service Thread finds there are
> > monitors to deflate why wasn't it directly notified of that?
>
> I think there's a bit of confusion here.
>
> We don't wake up the ServiceThread when there are ObjectMonitors to
> deflate. We wake up the ServiceThread to see if it can deflate any
> ObjectMonitors. ObjectMonitor idleness or suitability to be deflated is
> determined by the ServiceThread. ObjectMonitor idleness is not a property
> determined by another part of the system and the ServiceThread is woken
> to handle it.
Right my mistake.
That's it for this email.
Thanks,
David
-----
>
> >
> > > I've been thinking about have my own AsyncDeflateIdleMonitorsThread
> > > and move
> > > all the async deflation code from ServiceThread there. So many things
> > > are
> > > being added to the ServiceThread that I do wonder whether it is
> > > overloaded,
> > > but (as you said) there's no way to know that (yet).
> >
> > Some stuff just got moved out of the ServiceThread to a new
> > NotificationThread (which is a non-hidden Java thread), so the
> > ServiceThread has a little more capacity (potentially) to handle async
> > monitor deflation. But yes if the ServiceThread becomes a bottleneck
> > it will defeat the purpose of offloading work from the VMThread at
> > safepoints.
>
> The other "benefit" of having a dedicated AsyncDeflateIdleMonitorsThread
> is that it would be easier for us to determine how much actual work is
> done by the whole ObjectMonitor deflation procedure. We can say XX% of
> the VMs execution time went to the AsyncDeflateIdleMonitorsThread.
>
> Also, the work moved out of the ServiceThread is not an "always on" thing
> so the help for async deflation is limited at best.
>
>
> >
> > > BTW, I think GuaranteedSafepointInterval might be "broken" at this point
> > > in that we don't guarantee a safepoint at that interval anymore. This
> > > has
> > > kind of slipped through the cracks...
> >
> > We can look at that separately.
>
> Agreed. Not directly related to the Async Monitor Deflation project
> except as the explanation for how I went so long without seeing an
> async monitor deflation... Without that cleanup safepoint happening
> every GuaranteedSafepointInterval (and without the wait() timeout value),
> we don't have ObjectSynchronizer::do_safepoint_work() periodically
> waking up the ServiceThread to do an async monitor deflation sweep...
>
>
>
> >
> > >
> > > > ---
> > > >
> > > > src/hotspot/share/runtime/thread.cpp
> > > >
> > > > No changes to this file.
> > > >
> > > > ---
> > > >
> > > > test/jdk/tools/jlink/multireleasejar/JLinkMultiReleaseJarTest.java
> > > >
> > > > No changes to this file.
> > >
> > > Correct. I'm unsure how to deal with this type of thing. This project
> > > is managed as a series of patches. Earlier patches changed those files,
> > > but a later patch undid those changes. So depending on where you are
> > > in the patch queue, the logical 'files.list' contents are different.
> > > I'm currently using a 'files.list' that reflects every file touched by
> > > the current set of patches and that's why those two files are included
> > > with no changes.
> > >
> > > Suggestions?
> >
> > When you think you have the final version ready for review then you
> > should flatten out the patches (I don't know the right terminology) so
> > that the final webrev is accurate.
>
> Good point. I've been so focused on different reviewers wanting to come
> at this mass of code from different phases of the development process and
> that led me to keep all the various version patches intact. I think we're
> past the point where we are going to back up to, e.g., v2.05, and then
> do a completely different lock free monitor list as a new v2.06.
>
> Okay. When we roll to v2.08, v2.07 and predecessors will get flattened so
> that we have an easier review process.
>
>
> >
> > > Thanks for the partial review! I'm looking forward to resolving the
> > > queries about and your next set of review comments.
> >
> > I will try to look at the ObjectMonitor changes themselves this
> > afternoon. So many non-trivial reviews/discussions at the moment, plus
> > my own work :)
>
> Thanks for spending so much time and energy on this code.
>
> Dan
>
>
> >
> > Thanks,
> > David
> > -----
> >
> > > Dan
> > >
> > >
> > > >
> > > > ---
> > > >
> > > >
> > > >
> > > > > Dan
> > > > >
> > > > >
> > > > > On 10/17/19 5:50 PM, Daniel D. Daugherty wrote:
> > > > > > Greetings,
> > > > > >
> > > > > > The Async Monitor Deflation project is reaching the end game. I
> > > > > > have no
> > > > > > changes planned for the project at this time so all that is left
> > > > > > is code
> > > > > > review and any changes that results from those reviews.
> > > > > >
> > > > > > Carsten and Roman! Time for you guys to chime in again on the code
> > > > > > reviews.
> > > > > >
> > > > > > I have attached the list of fixes from CR6 to CR7 instead of
> > > > > > putting it
> > > > > > in the main body of this email.
> > > > > >
> > > > > > Main bug URL:
> > > > > >
> > > > > > JDK-8153224 Monitor deflation prolong safepoints
> > > > > > https://bugs.openjdk.java.net/browse/JDK-8153224
> > > > > >
> > > > > > The project is currently baselined on jdk-14+19.
> > > > > >
> > > > > > Here's the full webrev URL for those folks that want to see all of
> > > > > > the
> > > > > > current Async Monitor Deflation code in one go (v2.07 full):
> > > > > >
> > > > > > http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full \
> > > > > >
> > > > > >
> > > > > > Some folks might want to see just what has changed since the last
> > > > > > review
> > > > > > cycle so here's a webrev for that (v2.07 inc):
> > > > > >
> > > > > > http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/ \
> > > > > >
> > > > > >
> > > > > > The OpenJDK wiki has been updated to match the
> > > > > > CR7/v2.07/10-for-jdk14 changes:
> > > > > >
> > > > > > https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
> > > > > >
> > > > > > The jdk-14+18 based v2.07 version of the patch has been thru Mach5
> > > > > > tier[1-8]
> > > > > > testing on Oracle's usual set of platforms. It has also been
> > > > > > through my usual
> > > > > > set of stress testing on Linux-X64, macOSX and Solaris-X64 with
> > > > > > the addition
> > > > > > of Robbin's "MoCrazy 1024" test running in parallel with the other
> > > > > > tests in
> > > > > > my lab.
> > > > > >
> > > > > > The jdk-14+19 based v2.07 version of the patch has been thru Mach5
> > > > > > tier[1-3]
> > > > > > test on Oracle's usual set of platforms. Mach5 tier[4-8] are in
> > > > > > process.
> > > > > >
> > > > > > I did another round of SPECjbb2015 testing in Oracle's Aurora
> > > > > > Performance lab
> > > > > > using using their tuned SPECjbb2015 Linux-X64 G1 configs:
> > > > > >
> > > > > > - "base" is jdk-14+18
> > > > > > - "v2.07" is the latest version and includes C2
> > > > > > inc_om_ref_count() support
> > > > > > on LP64 X64 and the new HandshakeAfterDeflateIdleMonitors
> > > > > > option
> > > > > > - "off" is with -XX:-AsyncDeflateIdleMonitors specified
> > > > > > - "handshake" is with -XX:+HandshakeAfterDeflateIdleMonitors
> > > > > > specified
> > > > > >
> > > > > > hbIR hbIR
> > > > > > (max attempted) (settled) max-jOPS critical-jOPS runtime
> > > > > > --------------- --------- -------- ------------- -------
> > > > > > 34282.00 30635.90 28831.30 20969.20 3841.30 base
> > > > > > 34282.00 30973.00 29345.80 21025.20 3964.10 v2.07
> > > > > > 34282.00 31105.60 29174.30 21074.00 3931.30
> > > > > > v2.07_handshake
> > > > > > 34282.00 30789.70 27151.60 19839.10 3850.20
> > > > > > v2.07_off
> > > > > >
> > > > > > - The Aurora Perf comparison tool reports:
> > > > > >
> > > > > > Comparison max-jOPS critical-jOPS
> > > > > > ---------------------- --------------------
> > > > > > --------------------
> > > > > > base vs 2.07 +1.78% (s, p=0.000) +0.27% (ns,
> > > > > > p=0.790)
> > > > > > base vs 2.07_handshake +1.19% (s, p=0.007) +0.58% (ns,
> > > > > > p=0.536)
> > > > > > base vs 2.07_off -5.83% (ns, p=0.394) -5.39% (ns,
> > > > > > p=0.347)
> > > > > >
> > > > > > (s) - significant (ns) - not-significant
> > > > > >
> > > > > > - For historical comparison, the Aurora Perf comparision tool
> > > > > > reported for v2.06 with a baseline of jdk-13+31:
> > > > > >
> > > > > > Comparison max-jOPS critical-jOPS
> > > > > > ---------------------- --------------------
> > > > > > --------------------
> > > > > > base vs 2.06 -0.32% (ns, p=0.345) +0.71% (ns,
> > > > > > p=0.646)
> > > > > > base vs 2.06_off +0.49% (ns, p=0.292) -1.21% (ns,
> > > > > > p=0.481)
> > > > > >
> > > > > > (s) - significant (ns) - not-significant
> > > > > >
> > > > > > Thanks, in advance, for any questions, comments or suggestions.
> > > > > >
> > > > > > Dan
> > >
> > > <trimmed older review invites>
> > >
>
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic