'Re: RFR: 8291555: Replace stack-locking with fast-locking [v8]'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       openjdk-serviceability-dev
Subject:    Re: RFR: 8291555: Replace stack-locking with fast-locking [v8]
From:       Roman Kennke <rkennke () openjdk ! org>
Date:       2022-10-28 15:29:39
Message-ID: 7ORZSjVcOQ8IrMAC0iS2pgsf_-vMKZQVmfjxAROqVq4=.267878cb-6392-428c-8a11-b431b2e19cfb () github ! com
[Download RAW message or body]

On Fri, 28 Oct 2022 09:32:58 GMT, Roman Kennke <rkennke@openjdk.org> wrote:

> > This change replaces the current stack-locking implementation with a fast-locking \
> > scheme that retains the advantages of stack-locking (namely fast locking in \
> > uncontended code-paths), while avoiding the overload of the mark word. That \
> > overloading causes massive problems with Lilliput, because it means we have to \
> > check and deal with this situation. And because of the very racy nature, this \
> > turns out to be very complex and involved a variant of the inflation protocol to \
> > ensure that the object header is stable.  
> > What the original stack-locking does is basically to push a stack-lock onto the \
> > stack which consists only of the displaced header, and CAS a pointer to this \
> > stack location into the object header (the lowest two header bits being 00 \
> > indicate 'stack-locked'). The pointer into the stack can then be used to identify \
> > which thread currently owns the lock. 
> > This change basically reverses stack-locking: It still CASes the lowest two \
> > header bits to 00 to indicate 'fast-locked' but does *not* overload the upper \
> > bits with a stack-pointer. Instead, it pushes the object-reference to a \
> > thread-local lock-stack. This is a new structure which is basically a small array \
> > of oops that is associated with each thread. Experience shows that this array \
> > typcially remains very small (3-5 elements). Using this lock stack, it is \
> > possible to query which threads own which locks. Most importantly, the most \
> > common question 'does the current thread own me?' is very quickly answered by \
> > doing a quick scan of the array. More complex queries like 'which thread owns X?' \
> > are not performed in very performance-critical paths (usually in code like JVMTI \
> > or deadlock detection) where it is ok to do more complex operations. The \
> > lock-stack is also a new set of GC roots, and would be scanned during thread \
> > scanning, possibly concurrently, via the normal protocols. 
> > In contrast to stack-locking, fast-locking does *not* support recursive locking \
> > (yet). When that happens, the fast-lock gets inflated to a full monitor. It is \
> > not clear if it is worth to add support for recursive fast-locking. 
> > One trouble is that when a contending thread arrives at a fast-locked object, it \
> > must inflate the fast-lock to a full monitor. Normally, we need to know the \
> > current owning thread, and record that in the monitor, so that the contending \
> > thread can wait for the current owner to properly exit the monitor. However, \
> > fast-locking doesn't have this information. What we do instead is to record a \
> > special marker ANONYMOUS_OWNER. When the thread that currently holds the lock \
> > arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, \
> > fixes the owner to be itself, and then properly exits the monitor, and thus \
> > handing over to the contending thread. 
> > As an alternative, I considered to remove stack-locking altogether, and only use \
> > heavy monitors. In most workloads this did not show measurable regressions. \
> > However, in a few workloads, I have observed severe regressions. All of them have \
> > been using old synchronized Java collections (Vector, Stack), StringBuffer or \
> > similar code. The combination of two conditions leads to regressions without \
> > stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. \
> > single-threaded use of Vector or StringBuffer) and 2. The workload churns such \
> > locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but \
> > creating lots of such single-use, single-threaded-locked objects leads to massive \
> > ObjectMonitor churn, which can lead to a significant performance impact. But \
> > alas, such code exists, and we probably don't want to punish it if we can avoid \
> > it. 
> > This change enables to simplify (and speed-up!) a lot of code:
> > 
> > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) \
> >                 ObjectMonitor pointer to the object header.
> > - Accessing the hashcode could now be done in the fastpath always, if the \
> > hashcode has been installed. Fast-locked headers can be used directly, for \
> > monitor-locked objects we can easily reach-through to the displaced header. This \
> > is safe because Java threads participate in monitor deflation protocol. This \
> > would be implemented in a separate PR 
> > ### Benchmarks
> > 
> > All benchmarks are run on server-class metal machines. The JVM settings are \
> > always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is \
> > better. 
> > #### DaCapo/AArch64
> > 
> > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS \
> > m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa \
> > (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, \
> > h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that \
> > showed results far off the baseline or showed high variance have been repeated \
> > and I am reporting results with the most bias *against* fast-locking. The sunflow \
> > benchmark is really far off the mark - the baseline run with stack-locking \
> > exhibited very high run-to-run variance and generally much worse performance, \
> > while with fast-locking the variance was very low and the results very stable \
> > between runs. I wouldn't trust that benchmark - I mean what is it actually doing \
> > that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size
> > -- | -- | -- | -- | --
> > avrora | 27859 | 27563 | 1.07% | large
> > batik | 20786 | 20847 | -0.29% | large
> > biojava | 27421 | 27334 | 0.32% | default
> > eclipse | 59918 | 60522 | -1.00% | large
> > fop | 3670 | 3678 | -0.22% | default
> > graphchi | 2088 | 2060 | 1.36% | default
> > h2 | 297391 | 291292 | 2.09% | huge
> > jme | 8762 | 8877 | -1.30% | default
> > jython | 18938 | 18878 | 0.32% | default
> > luindex | 1339 | 1325 | 1.06% | default
> > lusearch | 918 | 936 | -1.92% | default
> > pmd | 58291 | 58423 | -0.23% | large
> > sunflow | 32617 | 24961 | 30.67% | large
> > tomcat | 25481 | 25992 | -1.97% | large
> > tradebeans | 314640 | 311706 | 0.94% | huge
> > tradesoap | 107473 | 110246 | -2.52% | huge
> > xalan | 6047 | 5882 | 2.81% | default
> > zxing | 970 | 926 | 4.75% | default
> > 
> > #### DaCapo/x86_64
> > 
> > The following measurements have been taken on an Intel Xeon Scalable Processors \
> > (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and \
> > considerations as in the measurements above. 
> > benchmark | baseline | fast-Locking | % | size
> > -- | -- | -- | -- | --
> > avrora | 127690 | 126749 | 0.74% | large
> > batik | 12736 | 12641 | 0.75% | large
> > biojava | 15423 | 15404 | 0.12% | default
> > eclipse | 41174 | 41498 | -0.78% | large
> > fop | 2184 | 2172 | 0.55% | default
> > graphchi | 1579 | 1560 | 1.22% | default
> > h2 | 227614 | 230040 | -1.05% | huge
> > jme | 8591 | 8398 | 2.30% | default
> > jython | 13473 | 13356 | 0.88% | default
> > luindex | 824 | 813 | 1.35% | default
> > lusearch | 962 | 968 | -0.62% | default
> > pmd | 40827 | 39654 | 2.96% | large
> > sunflow | 53362 | 43475 | 22.74% | large
> > tomcat | 27549 | 28029 | -1.71% | large
> > tradebeans | 190757 | 190994 | -0.12% | huge
> > tradesoap | 68099 | 67934 | 0.24% | huge
> > xalan | 7969 | 8178 | -2.56% | default
> > zxing | 1176 | 1148 | 2.44% | default
> > 
> > #### Renaissance/AArch64
> > 
> > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with \
> > same JVM settings. 
> > benchmark | baseline | fast-locking | %
> > -- | -- | -- | --
> > AkkaUct | 2558.832 | 2513.594 | 1.80%
> > Reactors | 14715.626 | 14311.246 | 2.83%
> > Als | 1851.485 | 1869.622 | -0.97%
> > ChiSquare | 1007.788 | 1003.165 | 0.46%
> > GaussMix | 1157.491 | 1149.969 | 0.65%
> > LogRegression | 717.772 | 733.576 | -2.15%
> > MovieLens | 7916.181 | 8002.226 | -1.08%
> > NaiveBayes | 395.296 | 386.611 | 2.25%
> > PageRank | 4294.939 | 4346.333 | -1.18%
> > FjKmeans | 496.076 | 493.873 | 0.45%
> > FutureGenetic | 2578.504 | 2589.255 | -0.42%
> > Mnemonics | 4898.886 | 4903.689 | -0.10%
> > ParMnemonics | 4260.507 | 4210.121 | 1.20%
> > Scrabble | 139.37 | 138.312 | 0.76%
> > RxScrabble | 320.114 | 322.651 | -0.79%
> > Dotty | 1056.543 | 1068.492 | -1.12%
> > ScalaDoku | 3443.117 | 3449.477 | -0.18%
> > ScalaKmeans | 259.384 | 258.648 | 0.28%
> > Philosophers | 24333.311 | 23438.22 | 3.82%
> > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
> > FinagleChirper | 6814.192 | 6853.38 | -0.57%
> > FinagleHttp | 4762.902 | 4807.564 | -0.93%
> > 
> > #### Renaissance/x86_64
> > 
> > benchmark | baseline | fast-locking | %
> > -- | -- | -- | --
> > AkkaUct | 1117.185 | 1116.425 | 0.07%
> > Reactors | 11561.354 | 11812.499 | -2.13%
> > Als | 1580.838 | 1575.318 | 0.35%
> > ChiSquare | 459.601 | 467.109 | -1.61%
> > GaussMix | 705.944 | 685.595 | 2.97%
> > LogRegression | 659.944 | 656.428 | 0.54%
> > MovieLens | 7434.303 | 7592.271 | -2.08%
> > NaiveBayes | 413.482 | 417.369 | -0.93%
> > PageRank | 3259.233 | 3276.589 | -0.53%
> > FjKmeans | 946.429 | 938.991 | 0.79%
> > FutureGenetic | 1760.672 | 1815.272 | -3.01%
> > ParMnemonics | 2016.917 | 2033.101 | -0.80%
> > Scrabble | 147.996 | 150.084 | -1.39%
> > RxScrabble | 177.755 | 177.956 | -0.11%
> > Dotty | 673.754 | 683.919 | -1.49%
> > ScalaDoku | 2193.562 | 1958.419 | 12.01%
> > ScalaKmeans | 165.376 | 168.925 | -2.10%
> > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
> > Philosophers | 14268.449 | 13308.87 | 7.21%
> > FinagleChirper | 4722.13 | 4688.3 | 0.72%
> > FinagleHttp | 3497.241 | 3605.118 | -2.99%
> > 
> > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics \
> > are not compatible with JDK20. The remaining benchmarks show very high run-to-run \
> > variance, which I am investigating (and probably addressing with running them \
> > much more often. 
> > I have also run another benchmark, which is a popular Java JVM benchmark, with \
> > workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I \
> > won't publish the results because I am not sure about the licensing terms. They \
> > look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). 
> > Please let me know if you want me to run any other workloads, or, even better, \
> > run them yourself and report here. 
> > ### Testing
> > - [x] tier1 (x86_64, aarch64, x86_32)
> > - [x] tier2 (x86_64, aarch64)
> > - [x] tier3 (x86_64, aarch64)
> > - [x] tier4 (x86_64, aarch64)
> > - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)
> 
> Roman Kennke has updated the pull request with a new target base due to a merge or \
> a rebase. The pull request now contains 37 commits: 
> - Merge remote-tracking branch 'upstream/master' into fast-locking
> - Merge remote-tracking branch 'upstream/master' into fast-locking
> - Merge remote-tracking branch 'upstream/master' into fast-locking
> - More RISC-V fixes
> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
> - RISC-V port
> - Revert "Re-use r0 in call to unlock_object()"
> 
> This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46.
> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
> - Fix number of rt args to complete_monitor_locking_C, remove some comments
> - Re-use r0 in call to unlock_object()
> - ... and 27 more: https://git.openjdk.org/jdk/compare/4b89fce0...3f0acba4

FYI: I am working on an alternative PR for this that makes fast-locking optional and \
opt-in behind an experimental switch. It will also be much less invasive (no \
structural changes except absolutely necessary, no cleanups) and thus easier to \
handle.

-------------

PR: https://git.openjdk.org/jdk/pull/10590


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic