List:       openjdk-serviceability-dev
Subject:    Withdrawn: 8291555: Replace stack-locking with fast-locking
From:       Roman Kennke <rkennke@openjdk.org>
Date:       2022-12-14 15:47:01
Message-ID: p0EgZVB2jmWeP-jWPkdBsETwo6oZKzSe73JyLlhsve8=.e3fd0bca-7c0d-4b9e-97f9-75f2fd5c054a@github.com

On Thu, 6 Oct 2022 10:23:04 GMT, Roman Kennke <rkennke@openjdk.org> wrote:

> This change replaces the current stack-locking implementation with a fast-locking
> scheme that retains the advantages of stack-locking (namely fast locking in
> uncontended code-paths) while avoiding the overloading of the mark word. That
> overloading causes massive problems with Lilliput, because it means we have to
> check for and deal with the stack-locked state, and because of its very racy
> nature this turns out to be very complex and requires a variant of the inflation
> protocol to ensure that the object header is stable.
> What the original stack-locking does is basically to push a stack-lock, consisting
> only of the displaced header, onto the stack, and to CAS a pointer to this stack
> location into the object header (00 in the lowest two header bits indicates
> 'stack-locked'). The pointer into the stack can then be used to identify which
> thread currently owns the lock.
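> 
> As a rough illustration of that legacy scheme, here is a simplified pseudo-C++
> sketch (not the actual HotSpot sources; the `Obj` and `BasicLock` types and the
> bit handling are only stand-ins):
> 
> ```cpp
> #include <atomic>
> #include <cstdint>
> 
> // Simplified, illustrative sketch only -- not the actual HotSpot sources.
> // Low header bits: 01 = unlocked, 00 = stack-locked (stack addresses end in 00).
> struct Obj { std::atomic<uintptr_t> header{0x1}; };
> 
> struct BasicLock {               // lives in the locking frame on the thread's stack
>   uintptr_t displaced_header;    // saved copy of the unlocked mark word
> };
> 
> bool stack_lock(Obj* obj, BasicLock* lock) {
>   uintptr_t mark = obj->header.load(std::memory_order_relaxed);
>   if ((mark & 0x3) == 0x1) {                          // currently unlocked
>     lock->displaced_header = mark;                    // stash original header on the stack
>     // CAS the *stack address* of the BasicLock into the header; its low bits are 00,
>     // so the same CAS both marks the object as locked and identifies the owning thread.
>     if (obj->header.compare_exchange_strong(mark, reinterpret_cast<uintptr_t>(lock),
>                                             std::memory_order_acquire)) {
>       return true;
>     }
>   }
>   return false;                                       // slow path: contended or already locked
> }
> ```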
> This change basically reverses stack-locking: it still CASes the lowest two header
> bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a
> stack pointer. Instead, it pushes the object reference onto a thread-local
> lock-stack. This is a new structure: basically a small array of oops that is
> associated with each thread. Experience shows that this array typically remains
> very small (3-5 elements). Using this lock-stack, it is possible to query which
> threads own which locks. Most importantly, the most common question 'does the
> current thread own me?' is answered very quickly by a scan of the array. More
> complex queries like 'which thread owns X?' are not performed in very
> performance-critical paths (usually in code like JVMTI or deadlock detection),
> where it is ok to do more complex operations. The lock-stack is also a new set of
> GC roots and would be scanned during thread scanning, possibly concurrently, via
> the normal protocols.
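> 
> A minimal sketch of the new scheme (again simplified pseudo-C++; `LockStack`,
> `fast_lock` and the details here are illustrative, not the exact names and code in
> the patch):
> 
> ```cpp
> #include <atomic>
> #include <cstdint>
> 
> // Simplified, illustrative sketch only -- not the actual HotSpot sources.
> struct Obj { std::atomic<uintptr_t> header{0x1}; };   // 01 = unlocked, 00 = fast-locked
> 
> struct LockStack {                  // one per thread; typically only 3-5 entries used
>   Obj* elems[8];
>   int  top = 0;
>   void push(Obj* o) { elems[top++] = o; }
>   bool contains(Obj* o) const {     // answers "does the current thread own o?"
>     for (int i = 0; i < top; i++) if (elems[i] == o) return true;
>     return false;
>   }
> };
> 
> bool fast_lock(Obj* obj, LockStack& ls) {
>   uintptr_t mark = obj->header.load(std::memory_order_relaxed);
>   if ((mark & 0x3) == 0x1) {                          // currently unlocked
>     uintptr_t locked = mark & ~(uintptr_t)0x3;        // upper bits untouched, low bits -> 00
>     if (obj->header.compare_exchange_strong(mark, locked, std::memory_order_acquire)) {
>       ls.push(obj);                                   // ownership is recorded per thread
>       return true;
>     }
>   }
>   return false;                                       // slow path: inflate to a full monitor
> }
> ```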
> In contrast to stack-locking, fast-locking does *not* support recursive locking
> (yet). When recursive locking happens, the fast-lock gets inflated to a full
> monitor. It is not yet clear whether it is worth adding support for recursive
> fast-locking.
> One complication is that when a contending thread arrives at a fast-locked object,
> it must inflate the fast-lock to a full monitor. Normally we need to know the
> current owning thread and record it in the monitor, so that the contending thread
> can wait for the current owner to properly exit the monitor. However, fast-locking
> doesn't have this information. What we do instead is record a special marker,
> ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at
> monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the
> owner to be itself, and then properly exits the monitor, thus handing the lock
> over to the contending thread.
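> 
> A hedged sketch of that hand-over (illustrative pseudo-C++ only; the real
> ObjectMonitor fields and the inflation/exit details are more involved):
> 
> ```cpp
> #include <cstdint>
> 
> // Illustrative sketch only -- not the actual HotSpot sources.
> // A contender that finds an object fast-locked cannot tell which thread owns it,
> // so it installs a placeholder owner; the real owner fixes this up at monitorexit.
> struct Thread;
> static Thread* const ANONYMOUS_OWNER = reinterpret_cast<Thread*>(uintptr_t(1));
> 
> struct ObjectMonitor {
>   Thread* owner = nullptr;
>   // ... entry queue, recursion count, etc. omitted ...
> };
> 
> // Contending thread: inflate a fast-locked object without knowing its owner.
> void inflate_contended(ObjectMonitor* mon) {
>   mon->owner = ANONYMOUS_OWNER;   // "somebody owns this, we just don't know who"
>   // ... CAS the tagged monitor pointer into the object header, then block ...
> }
> 
> // Owning thread, at monitorexit: recognize the placeholder and take over the monitor.
> void exit_inflated(ObjectMonitor* mon, Thread* self) {
>   if (mon->owner == ANONYMOUS_OWNER) {
>     mon->owner = self;            // only the real owner can reach this point
>   }
>   // ... remove the object from this thread's lock-stack, then perform a normal
>   //     monitor exit, waking the contending thread ...
> }
> ```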
> As an alternative, I considered removing stack-locking altogether and only using
> heavy monitors. In most workloads this did not show measurable regressions.
> However, in a few workloads I have observed severe regressions. All of them have
> been using old synchronized Java collections (Vector, Stack), StringBuffer or
> similar code. The combination of two conditions leads to regressions without
> stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g.
> single-threaded use of Vector or StringBuffer) and 2. the workload churns such
> locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but
> creating lots of such single-use, single-threaded-locked objects leads to massive
> ObjectMonitor churn, which can have a significant performance impact. But alas,
> such code exists, and we probably don't want to punish it if we can avoid it.
> This change makes it possible to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the (tagged)
>   ObjectMonitor pointer into the object header.
> - Accessing the hashcode could now always be done in the fast path, once the
>   hashcode has been installed. Fast-locked headers can be used directly; for
>   monitor-locked objects we can easily reach through to the displaced header.
>   This is safe because Java threads participate in the monitor deflation
>   protocol. This would be implemented in a separate PR; a rough sketch follows
>   below.
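> 
> For illustration only, a hypothetical sketch of what that hash-code fast path could
> look like (this is the future work mentioned above; the field offsets and names are
> simplified assumptions, not the actual header layout):
> 
> ```cpp
> #include <cstdint>
> 
> // Hypothetical sketch of the hash-code fast path described above (future work).
> constexpr uintptr_t LOCK_MASK     = 0x3;  // low bits: 01 unlocked, 00 fast-locked, 10 monitor
> constexpr uintptr_t MONITOR_VALUE = 0x2;
> constexpr int       HASH_SHIFT    = 8;    // position/width of the hash field (simplified)
> constexpr uintptr_t HASH_MASK     = 0x7fffffff;
> 
> struct ObjectMonitor { uintptr_t displaced_header; /* ... */ };
> 
> // With fast-locking the upper header bits are never replaced by a stack pointer, so
> // an installed hash can be read directly; for a monitor-locked object we can safely
> // reach through to the displaced header because Java threads participate in the
> // monitor deflation protocol (the monitor cannot be deflated underneath us).
> uintptr_t read_hash(uintptr_t mark) {
>   if ((mark & LOCK_MASK) == MONITOR_VALUE) {
>     ObjectMonitor* mon = reinterpret_cast<ObjectMonitor*>(mark & ~LOCK_MASK);
>     mark = mon->displaced_header;           // header saved when the monitor was inflated
>   }
>   return (mark >> HASH_SHIFT) & HASH_MASK;  // 0 means "no hash installed yet" -> slow path
> }
> ```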
> ### Benchmarks
> 
> All benchmarks are run on server-class metal machines. The JVM settings are always:
> `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmark scores are in ms/op; less is better.
> #### DaCapo/AArch64
> 
> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS
> m6g.metal instance). It is using the DaCapo evaluation version, git hash 309e1fa
> (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude the
> cassandra, h2o & kafka benchmarks because of incompatibility with JDK 20.
> Benchmarks that showed results far off the baseline or that showed high variance
> have been repeated, and I am reporting the results with the most bias *against*
> fast-locking. The sunflow benchmark is really far off the mark - the baseline run
> with stack-locking exhibited very high run-to-run variance and generally much
> worse performance, while with fast-locking the variance was very low and the
> results very stable between runs. I wouldn't trust that benchmark - I mean, what
> is it actually doing that a change in locking shows a >30% perf difference?
> benchmark | baseline | fast-locking | % | size
> -- | -- | -- | -- | --
> avrora | 27859 | 27563 | 1.07% | large
> batik | 20786 | 20847 | -0.29% | large
> biojava | 27421 | 27334 | 0.32% | default
> eclipse | 59918 | 60522 | -1.00% | large
> fop | 3670 | 3678 | -0.22% | default
> graphchi | 2088 | 2060 | 1.36% | default
> h2 | 297391 | 291292 | 2.09% | huge
> jme | 8762 | 8877 | -1.30% | default
> jython | 18938 | 18878 | 0.32% | default
> luindex | 1339 | 1325 | 1.06% | default
> lusearch | 918 | 936 | -1.92% | default
> pmd | 58291 | 58423 | -0.23% | large
> sunflow | 32617 | 24961 | 30.67% | large
> tomcat | 25481 | 25992 | -1.97% | large
> tradebeans | 314640 | 311706 | 0.94% | huge
> tradesoap | 107473 | 110246 | -2.52% | huge
> xalan | 6047 | 5882 | 2.81% | default
> zxing | 970 | 926 | 4.75% | default
> 
> #### DaCapo/x86_64
> 
> The following measurements have been taken on an Intel Xeon Scalable Processor
> (Cascade Lake 8252C) (an AWS m5zn.metal instance), with all the same settings and
> considerations as in the measurements above.
> benchmark | baseline | fast-locking | % | size
> -- | -- | -- | -- | --
> avrora | 127690 | 126749 | 0.74% | large
> batik | 12736 | 12641 | 0.75% | large
> biojava | 15423 | 15404 | 0.12% | default
> eclipse | 41174 | 41498 | -0.78% | large
> fop | 2184 | 2172 | 0.55% | default
> graphchi | 1579 | 1560 | 1.22% | default
> h2 | 227614 | 230040 | -1.05% | huge
> jme | 8591 | 8398 | 2.30% | default
> jython | 13473 | 13356 | 0.88% | default
> luindex | 824 | 813 | 1.35% | default
> lusearch | 962 | 968 | -0.62% | default
> pmd | 40827 | 39654 | 2.96% | large
> sunflow | 53362 | 43475 | 22.74% | large
> tomcat | 27549 | 28029 | -1.71% | large
> tradebeans | 190757 | 190994 | -0.12% | huge
> tradesoap | 68099 | 67934 | 0.24% | huge
> xalan | 7969 | 8178 | -2.56% | default
> zxing | 1176 | 1148 | 2.44% | default
> 
> #### Renaissance/AArch64
> 
> This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above,
> with the same JVM settings.
> benchmark | baseline | fast-locking | %
> -- | -- | -- | --
> AkkaUct | 2558.832 | 2513.594 | 1.80%
> Reactors | 14715.626 | 14311.246 | 2.83%
> Als | 1851.485 | 1869.622 | -0.97%
> ChiSquare | 1007.788 | 1003.165 | 0.46%
> GaussMix | 1157.491 | 1149.969 | 0.65%
> LogRegression | 717.772 | 733.576 | -2.15%
> MovieLens | 7916.181 | 8002.226 | -1.08%
> NaiveBayes | 395.296 | 386.611 | 2.25%
> PageRank | 4294.939 | 4346.333 | -1.18%
> FjKmeans | 496.076 | 493.873 | 0.45%
> FutureGenetic | 2578.504 | 2589.255 | -0.42%
> Mnemonics | 4898.886 | 4903.689 | -0.10%
> ParMnemonics | 4260.507 | 4210.121 | 1.20%
> Scrabble | 139.37 | 138.312 | 0.76%
> RxScrabble | 320.114 | 322.651 | -0.79%
> Dotty | 1056.543 | 1068.492 | -1.12%
> ScalaDoku | 3443.117 | 3449.477 | -0.18%
> ScalaKmeans | 259.384 | 258.648 | 0.28%
> Philosophers | 24333.311 | 23438.22 | 3.82%
> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
> FinagleChirper | 6814.192 | 6853.38 | -0.57%
> FinagleHttp | 4762.902 | 4807.564 | -0.93%
> 
> #### Renaissance/x86_64
> 
> benchmark | baseline | fast-locking | %
> -- | -- | -- | --
> AkkaUct | 1117.185 | 1116.425 | 0.07%
> Reactors | 11561.354 | 11812.499 | -2.13%
> Als | 1580.838 | 1575.318 | 0.35%
> ChiSquare | 459.601 | 467.109 | -1.61%
> GaussMix | 705.944 | 685.595 | 2.97%
> LogRegression | 659.944 | 656.428 | 0.54%
> MovieLens | 7434.303 | 7592.271 | -2.08%
> NaiveBayes | 413.482 | 417.369 | -0.93%
> PageRank | 3259.233 | 3276.589 | -0.53%
> FjKmeans | 946.429 | 938.991 | 0.79%
> FutureGenetic | 1760.672 | 1815.272 | -3.01%
> ParMnemonics | 2016.917 | 2033.101 | -0.80%
> Scrabble | 147.996 | 150.084 | -1.39%
> RxScrabble | 177.755 | 177.956 | -0.11%
> Dotty | 673.754 | 683.919 | -1.49%
> ScalaDoku | 2193.562 | 1958.419 | 12.01%
> ScalaKmeans | 165.376 | 168.925 | -2.10%
> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
> Philosophers | 14268.449 | 13308.87 | 7.21%
> FinagleChirper | 4722.13 | 4688.3 | 0.72%
> FinagleHttp | 3497.241 | 3605.118 | -2.99%
> 
> Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics
> are not compatible with JDK 20. The remaining benchmarks show very high
> run-to-run variance, which I am investigating (and will probably address by
> running them much more often).
> I have also run another benchmark, which is a popular Java JVM benchmark, with
> workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I
> won't publish the results because I am not sure about the licensing terms. They
> look similar to the measurements above (i.e. +/- 2%, nothing very suspicious).
> Please let me know if you want me to run any other workloads, or, even better, run
> them yourself and report here.
> ### Testing
> - [x] tier1 (x86_64, aarch64, x86_32)
> - [x] tier2 (x86_64, aarch64)
> - [x] tier3 (x86_64, aarch64)
> - [x] tier4 (x86_64, aarch64)
> - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/10590

