'Re: RFR: 8291555: Implement alternative fast-locking scheme [v12]'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       openjdk-serviceability-dev
Subject:    Re: RFR: 8291555: Implement alternative fast-locking scheme [v12]
From:       Daniel D. Daugherty <dcubed () openjdk ! org>
Date:       2023-02-24 21:23:14
Message-ID: wz9WND-HKV59yoZxavxiHaFyxy--8QcE4b2x9TUt7uY=.97fe152c-10d1-4523-82fb-92c761194cfa () github ! com
[Download RAW message or body]

On Wed, 15 Feb 2023 21:15:54 GMT, Roman Kennke <rkennke@openjdk.org> wrote:

> > This change adds a fast-locking scheme as an alternative to the current \
> > stack-locking implementation. It retains the advantages of stack-locking (namely \
> > fast locking in uncontended code-paths), while avoiding the overload of the mark \
> > word. That overloading causes massive problems with Lilliput, because it means we \
> > have to check and deal with this situation when trying to access the mark-word. \
> > And because of the very racy nature, this turns out to be very complex and would \
> > involve a variant of the inflation protocol to ensure that the object header is \
> > stable. (The current implementation of setting/fetching the i-hash provides a \
> > glimpse into the complexity). 
> > What the original stack-locking does is basically to push a stack-lock onto the \
> > stack which consists only of the displaced header, and CAS a pointer to this \
> > stack location into the object header (the lowest two header bits being 00 \
> > indicate 'stack-locked'). The pointer into the stack can then be used to identify \
> > which thread currently owns the lock. 
> > This change basically reverses stack-locking: It still CASes the lowest two \
> > header bits to 00 to indicate 'fast-locked' but does *not* overload the upper \
> > bits with a stack-pointer. Instead, it pushes the object-reference to a \
> > thread-local lock-stack. This is a new structure which is basically a small array \
> > of oops that is associated with each thread. Experience shows that this array \
> > typcially remains very small (3-5 elements). Using this lock stack, it is \
> > possible to query which threads own which locks. Most importantly, the most \
> > common question 'does the current thread own me?' is very quickly answered by \
> > doing a quick scan of the array. More complex queries like 'which thread owns X?' \
> > are not performed in very performance-critical paths (usually in code like JVMTI \
> > or deadlock detection) where it is ok to do more complex operations (and we \
> > already do). The lock-stack is also a new set of GC roots, and would be scanned \
> > during thread scanning, possibly concurrently, via the normal 
 protocols.
> > 
> > The lock-stack is grown when needed. This means that we need to check for \
> > potential overflow before attempting locking. When that is the case, locking \
> > fast-paths would call into the runtime to grow the stack and handle the locking. \
> > Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check on method \
> > entry to avoid (possibly lots) of such checks at locking sites. 
> > In contrast to stack-locking, fast-locking does *not* support recursive locking \
> > (yet). When that happens, the fast-lock gets inflated to a full monitor. It is \
> > not clear if it is worth to add support for recursive fast-locking. 
> > One trouble is that when a contending thread arrives at a fast-locked object, it \
> > must inflate the fast-lock to a full monitor. Normally, we need to know the \
> > current owning thread, and record that in the monitor, so that the contending \
> > thread can wait for the current owner to properly exit the monitor. However, \
> > fast-locking doesn't have this information. What we do instead is to record a \
> > special marker ANONYMOUS_OWNER. When the thread that currently holds the lock \
> > arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, \
> > fixes the owner to be itself, and then properly exits the monitor, and thus \
> > handing over to the contending thread. 
> > As an alternative, I considered to remove stack-locking altogether, and only use \
> > heavy monitors. In most workloads this did not show measurable regressions. \
> > However, in a few workloads, I have observed severe regressions. All of them have \
> > been using old synchronized Java collections (Vector, Stack), StringBuffer or \
> > similar code. The combination of two conditions leads to regressions without \
> > stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. \
> > single-threaded use of Vector or StringBuffer) and 2. The workload churns such \
> > locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but \
> > creating lots of such single-use, single-threaded-locked objects leads to massive \
> > ObjectMonitor churn, which can lead to a significant performance impact. But \
> > alas, such code exists, and we probably don't want to punish it if we can avoid \
> > it. 
> > This change enables to simplify (and speed-up!) a lot of code:
> > 
> > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) \
> >                 ObjectMonitor pointer to the object header.
> > - Accessing the hashcode could now be done in the fastpath always, if the \
> > hashcode has been installed. Fast-locked headers can be used directly, for \
> > monitor-locked objects we can easily reach-through to the displaced header. This \
> > is safe because Java threads participate in monitor deflation protocol. This \
> > would be implemented in a separate PR 
> > 
> > Testing:
> > - [x] tier1 x86_64 x aarch64 x +UseFastLocking
> > - [x] tier2 x86_64 x aarch64 x +UseFastLocking
> > - [x] tier3 x86_64 x aarch64 x +UseFastLocking
> > - [x] tier4 x86_64 x aarch64 x +UseFastLocking
> > - [x] tier1 x86_64 x aarch64 x -UseFastLocking
> > - [x] tier2 x86_64 x aarch64 x -UseFastLocking
> > - [x] tier3 x86_64 x aarch64 x -UseFastLocking
> > - [x] tier4 x86_64 x aarch64 x -UseFastLocking
> > - [x] Several real-world applications have been tested with this change in tandem \
> > with Lilliput without any problems, yet 
> > ### Performance
> > 
> > #### Simple Microbenchmark
> > 
> > The microbenchmark exercises only the locking primitives for monitorenter and \
> > monitorexit, without contention. The benchmark can be found \
> > (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops. 
> > > > x86_64 | aarch64 |
> > > -- | -- | -- |
> > > -UseFastLocking | 20.651 | 20.764 |
> > > +UseFastLocking | 18.896 | 18.908 |
> > 
> > 
> > #### Renaissance
> > 
> > > x86_64 |   |   |   | aarch64 |   |  
> > -- | -- | -- | -- | -- | -- | -- | --
> > > stack-locking | fast-locking |   |   | stack-locking | fast-locking |  
> > AkkaUct | 841.884 | 836.948 | 0.59% |   | 1475.774 | 1465.647 | 0.69%
> > Reactors | 11041.427 | 11181.451 | -1.25% |   | 11381.751 | 11521.318 | -1.21%
> > Als | 1367.183 | 1359.358 | 0.58% |   | 1678.103 | 1688.067 | -0.59%
> > ChiSquare | 577.021 | 577.398 | -0.07% |   | 986.619 | 988.063 | -0.15%
> > GaussMix | 817.459 | 819.073 | -0.20% |   | 1154.293 | 1155.522 | -0.11%
> > LogRegression | 598.343 | 603.371 | -0.83% |   | 638.052 | 644.306 | -0.97%
> > MovieLens | 8248.116 | 8314.576 | -0.80% |   | 7569.219 | 7646.828 | -1.01%%
> > NaiveBayes | 587.607 | 581.608 | 1.03% |   | 541.583 | 550.059 | -1.54%
> > PageRank | 3260.553 | 3263.472 | -0.09% |   | 4376.405 | 4381.101 | -0.11%
> > FjKmeans | 979.978 | 976.122 | 0.40% |   | 774.312 | 771.235 | 0.40%
> > FutureGenetic | 2187.369 | 2183.271 | 0.19% |   | 2685.722 | 2689.056 | -0.12%
> > ParMnemonics | 2434.551 | 2468.763 | -1.39% |   | 4278.225 | 4263.863 | 0.34%
> > Scrabble | 111.882 | 111.768 | 0.10% |   | 151.796 | 153.959 | -1.40%
> > RxScrabble | 210.252 | 211.38 | -0.53% |   | 310.116 | 315.594 | -1.74%
> > Dotty | 750.415 | 752.658 | -0.30% |   | 1033.636 | 1036.168 | -0.24%
> > ScalaDoku | 3072.05 | 3051.2 | 0.68% |   | 3711.506 | 3690.04 | 0.58%
> > ScalaKmeans | 211.427 | 209.957 | 0.70% |   | 264.38 | 265.788 | -0.53%
> > ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% |   | 1088.182 | 1092.266 | -0.37%
> > Philosophers | 6450.124 | 6565.705 | -1.76% |   | 12017.964 | 11902.559 | 0.97%
> > FinagleChirper | 3953.623 | 3972.647 | -0.48% |   | 4750.751 | 4769.274 | -0.39%
> > FinagleHttp | 3970.526 | 4005.341 | -0.87% |   | 5294.125 | 5296.224 | -0.04%
> 
> Roman Kennke has updated the pull request incrementally with one additional commit \
> since the last revision: 
> Fix merge error (move done label into correct places)

This project is currently baselined on jdk-21+10-761.

-------------

PR: https://git.openjdk.org/jdk/pull/10907


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic