List: openjdk-hotspot-runtime-dev
Subject: Using x86 pause instr in SpinPause
From: peter.levart () marand ! si (Peter Levart)
Date: 2012-08-29 16:43:21
Message-ID: 6205263.7S5AjWIolE () peterl ! marand ! si
I played with the PAUSE instruction a little on my i7 machine.
What I found is that on the i7 (4 cores, 2 threads per core) the wait loop spins
about 3-4 times faster with NOP than with PAUSE, although all 4 CPU cores run at
max. frequency in both cases.
Also, a loop using PAUSE seems to spin at a much more "constant" speed, so the
lock becomes fairer than the same lock using NOP in the spin loop.
With a high number of threads, efficiency is also better when using PAUSE in the
loop instead of just NOP.
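Separately from the lock test described next, the raw speed difference between the two instructions can be checked in isolation. This is only a minimal sketch, assuming GCC-style inline asm on x86 and POSIX clock_gettime(). The names cpu_nop, cpu_pause and ns_per_call are made up for illustration and are not HotSpot code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* One NOP; compiles to an empty function on non-x86 targets. */
static inline void cpu_nop(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("nop");
#endif
}

/* One PAUSE (encoded as rep; nop); empty on non-x86 targets. */
static inline void cpu_pause(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("pause");
#endif
}

/* Average wall-clock nanoseconds per call of fn over iters calls.
 * The figure includes loop and call overhead, so only the relative
 * difference between two measurements is meaningful. */
static inline double ns_per_call(void (*fn)(void), long iters) {
    struct timespec t0, t1;
    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Comparing ns_per_call(cpu_nop, n) with ns_per_call(cpu_pause, n) for a large n gives a rough ratio. The result depends on the CPU model, so it need not match the 3-4x seen on this i7.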
Here's how I tested:
I tried to mimic HotSpot's Thread::SpinAcquire/SpinRelease, so I lifted some
code fragments from the HotSpot sources. Here's a simplified SpinLock.c
implementation that does not resort to park after 5 yields but continues to
yield every 4096 spins indefinitely:
#include <sched.h>

inline void fence() {
  // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
  __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
  __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
}

inline int cmpxchg(int exchange_value, volatile int* dest, int compare_value) {
  __asm__ volatile ("lock; cmpxchgl %1,(%3)"
                    : "=a" (exchange_value)
                    : "r" (exchange_value), "a" (compare_value), "r" (dest)
                    : "cc", "memory");
  return exchange_value;
}

int SpinPause () ;

int SpinAcquire (volatile int * adr) {
  if (cmpxchg (1, adr, 0) == 0) {
    return 0;   // normal fast-path return
  }

  // Slow-path : We've encountered contention -- Spin/Yield strategy.
  int ctr = 0 ;
  int Yields = 0 ;
  for (;;) {
    while (*adr != 0) {
      ++ctr ;
      if ((ctr & 0xFFF) == 0) {
        sched_yield() ;
        ++Yields ;
      } else {
        SpinPause() ;
      }
    }
    if (cmpxchg (1, adr, 0) == 0) return Yields;
  }
}

void SpinRelease (volatile int * adr) {
  fence() ;   // guarantee at least release consistency.
  // Roach-motel semantics.
  // It's safe if subsequent LDs and STs float "up" into the critical section,
  // but prior LDs and STs within the critical section can't be allowed
  // to reorder or float past the ST that releases the lock.
  *adr = 0 ;
}
... the SpinPause.s is either NOP or PAUSE:
        .globl SpinPause
        .align 16
        .type  SpinPause,@function
SpinPause:
        rep
        nop
        movq   $1, %rax
        ret
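For anyone who wants to repeat this without a separate .s file, the same spin/yield strategy can be sketched with C11 atomics. This is not the code I measured. It swaps the inline-asm cmpxchg/fence for <stdatomic.h> calls and uses a release store where the original uses a locked addl (a full fence) followed by a plain store. USE_PAUSE and the lower-case function names are hypothetical, chosen only for this sketch.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical switch: build with -DUSE_PAUSE=0 for the NOP variant. */
#ifndef USE_PAUSE
#define USE_PAUSE 1
#endif

/* Returns 1 if a spin hint was executed, 0 on non-x86 targets. */
static inline int spin_pause(void) {
#if defined(__x86_64__) || defined(__i386__)
#if USE_PAUSE
    __asm__ volatile ("pause" ::: "memory");   /* same encoding as rep; nop */
#else
    __asm__ volatile ("nop" ::: "memory");
#endif
    return 1;
#else
    return 0;
#endif
}

/* Acquire the lock; returns the number of sched_yield() calls made. */
static inline int spin_acquire(atomic_int *adr) {
    int expected = 0;
    assert(adr != NULL);
    if (atomic_compare_exchange_strong(adr, &expected, 1))
        return 0;                               /* uncontended fast path */

    int ctr = 0, yields = 0;
    for (;;) {
        /* Spin on a plain load; yield on every 4096th iteration. */
        while (atomic_load_explicit(adr, memory_order_relaxed) != 0) {
            if ((++ctr & 0xFFF) == 0) {
                sched_yield();
                ++yields;
            } else {
                spin_pause();
            }
        }
        expected = 0;
        if (atomic_compare_exchange_strong(adr, &expected, 1))
            return yields;
    }
}

/* Release store: earlier accesses cannot move past the unlock. */
static inline void spin_release(atomic_int *adr) {
    atomic_store_explicit(adr, 0, memory_order_release);
}
```

Since the release path differs from the original, timings from this variant are not directly comparable with the numbers below.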
The benchmark is designed around a relatively small "critical" section of code
that is guarded by a single spin/yield lock and executed repeatedly by multiple
threads.
The size of the critical section is tuned so that, when using PAUSE in the spin
loop with 10 concurrent threads, each lock acquire takes on average 3-4 yields,
with 4096 spins between yields, before returning.
Each thread executes a constant number of acquires and critical sections, so the
amount of "useful work" is constant.
The number of threads is then varied (10, 30, 100), as is the type of spin lock
(NOP / PAUSE).
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include "SpinLock.h"

#define THREADS 10
#define OUTER_LOOP_SIZE 100000
#define WORK_LOOP_SIZE 100

void *runnable( void *ptr );

int main(int argc, char *argv[])
{
  pthread_t thread[THREADS];
  int yields[THREADS];
  int i;

  for (i = 0; i < THREADS; i++)
  {
    yields[i] = 0;
    pthread_create( &thread[i], NULL, runnable, (void*) &yields[i]);
  }

  for (i = 0; i < THREADS; i++)
    pthread_join( thread[i], NULL);

  printf("Acquires per thread: %d\n", OUTER_LOOP_SIZE);
  for (i = 0; i < THREADS; i++)
    printf("Thread %d total yields: %d\n", i, yields[i]);

  exit(0);
}

int lock = 0;

void *runnable( void *ptr )
{
  int *thread_yields;
  thread_yields = (int *) ptr;
  int yields = 0;
  int i, j;
  /* initialized, and unsigned so that wraparound is well defined */
  unsigned int sum = 0;

  for (i = 0; i < OUTER_LOOP_SIZE; i++)
  {
    yields += SpinAcquire(&lock);
    /* the "critical" section: busy work only */
    for (j = 0; j < WORK_LOOP_SIZE; j++)
    {
      int k;
      for (k = 0; k < 100; k++)
        sum += yields;
    }
    SpinRelease(&lock);
  }

  /* sum is referenced so the work loop has a visible use
     (an optimizing compiler may still fold this away) */
  *thread_yields = (int) (sum + yields - sum);
  return NULL;
}
Here are the results:
10 threads
----------
NOP:
[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 1569102
Thread 1 total yields: 1240318
Thread 2 total yields: 1339807
Thread 3 total yields: 1529634
Thread 4 total yields: 1215736
Thread 5 total yields: 1400302
Thread 6 total yields: 1501579
Thread 7 total yields: 1066652
Thread 8 total yields: 699626
Thread 9 total yields: 1295480
real 0m18.428s
user 2m15.209s
sys 0m4.084s
PAUSE:
[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 151316
Thread 1 total yields: 285822
Thread 2 total yields: 321386
Thread 3 total yields: 493483
Thread 4 total yields: 306425
Thread 5 total yields: 420804
Thread 6 total yields: 330618
Thread 7 total yields: 176552
Thread 8 total yields: 434332
Thread 9 total yields: 474757
real 0m17.761s
user 1m58.394s
sys 0m0.936s
30 threads
----------
NOP:
[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 1516779
Thread 1 total yields: 2007878
Thread 2 total yields: 1027554
Thread 3 total yields: 2085868
Thread 4 total yields: 1152372
Thread 5 total yields: 575832
Thread 6 total yields: 1742543
Thread 7 total yields: 1083592
Thread 8 total yields: 1522944
Thread 9 total yields: 1906178
Thread 10 total yields: 1127148
Thread 11 total yields: 1872796
Thread 12 total yields: 951841
Thread 13 total yields: 1997159
Thread 14 total yields: 1347902
Thread 15 total yields: 1490622
Thread 16 total yields: 1836486
Thread 17 total yields: 2037498
Thread 18 total yields: 1648107
Thread 19 total yields: 1415434
Thread 20 total yields: 2044485
Thread 21 total yields: 644851
Thread 22 total yields: 599960
Thread 23 total yields: 1328082
Thread 24 total yields: 612360
Thread 25 total yields: 640560
Thread 26 total yields: 541422
Thread 27 total yields: 2018600
Thread 28 total yields: 605493
Thread 29 total yields: 1994882
real 0m57.296s
user 7m12.540s
sys 0m18.230s
PAUSE:
[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 470464
Thread 1 total yields: 361301
Thread 2 total yields: 349293
Thread 3 total yields: 364812
Thread 4 total yields: 360606
Thread 5 total yields: 517756
Thread 6 total yields: 421567
Thread 7 total yields: 301124
Thread 8 total yields: 555139
Thread 9 total yields: 481553
Thread 10 total yields: 634827
Thread 11 total yields: 344431
Thread 12 total yields: 308426
Thread 13 total yields: 405738
Thread 14 total yields: 349320
Thread 15 total yields: 563739
Thread 16 total yields: 301356
Thread 17 total yields: 350707
Thread 18 total yields: 329854
Thread 19 total yields: 471155
Thread 20 total yields: 380747
Thread 21 total yields: 533402
Thread 22 total yields: 638196
Thread 23 total yields: 628245
Thread 24 total yields: 342036
Thread 25 total yields: 356593
Thread 26 total yields: 351789
Thread 27 total yields: 300021
Thread 28 total yields: 290741
Thread 29 total yields: 484845
real 0m56.744s
user 7m12.831s
sys 0m5.127s
100 threads
-----------
NOP:
[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 715692
Thread 1 total yields: 3956169
Thread 2 total yields: 1043740
Thread 3 total yields: 1029961
Thread 4 total yields: 1499409
Thread 5 total yields: 2012256
Thread 6 total yields: 2683384
Thread 7 total yields: 3229792
Thread 8 total yields: 2053604
Thread 9 total yields: 537083
Thread 10 total yields: 3068300
Thread 11 total yields: 485165
Thread 12 total yields: 1294566
Thread 13 total yields: 477668
Thread 14 total yields: 1935177
Thread 15 total yields: 1320217
Thread 16 total yields: 3630811
Thread 17 total yields: 2002443
Thread 18 total yields: 2573688
Thread 19 total yields: 656375
Thread 20 total yields: 723496
Thread 21 total yields: 2867001
Thread 22 total yields: 3462940
Thread 23 total yields: 2638588
Thread 24 total yields: 3591878
Thread 25 total yields: 674437
Thread 26 total yields: 1767265
Thread 27 total yields: 3254028
Thread 28 total yields: 1148442
Thread 29 total yields: 3232056
Thread 30 total yields: 1710429
Thread 31 total yields: 487365
Thread 32 total yields: 475716
Thread 33 total yields: 672629
Thread 34 total yields: 2235400
Thread 35 total yields: 1073231
Thread 36 total yields: 1564212
Thread 37 total yields: 1232321
Thread 38 total yields: 1668370
Thread 39 total yields: 3926584
Thread 40 total yields: 3639128
Thread 41 total yields: 2135553
Thread 42 total yields: 2410193
Thread 43 total yields: 465033
Thread 44 total yields: 2267986
Thread 45 total yields: 2556756
Thread 46 total yields: 1233673
Thread 47 total yields: 2296487
Thread 48 total yields: 1569566
Thread 49 total yields: 2087966
Thread 50 total yields: 1141489
Thread 51 total yields: 2895012
Thread 52 total yields: 1318840
Thread 53 total yields: 463860
Thread 54 total yields: 3063963
Thread 55 total yields: 1602932
Thread 56 total yields: 2998556
Thread 57 total yields: 3052040
Thread 58 total yields: 2936234
Thread 59 total yields: 1608002
Thread 60 total yields: 1947606
Thread 61 total yields: 3650902
Thread 62 total yields: 481011
Thread 63 total yields: 2946211
Thread 64 total yields: 2657741
Thread 65 total yields: 1854059
Thread 66 total yields: 612458
Thread 67 total yields: 3858631
Thread 68 total yields: 3645990
Thread 69 total yields: 2916354
Thread 70 total yields: 1587217
Thread 71 total yields: 625513
Thread 72 total yields: 810526
Thread 73 total yields: 3230899
Thread 74 total yields: 3117595
Thread 75 total yields: 680967
Thread 76 total yields: 1925092
Thread 77 total yields: 2205682
Thread 78 total yields: 2669335
Thread 79 total yields: 699507
Thread 80 total yields: 462614
Thread 81 total yields: 1108081
Thread 82 total yields: 998706
Thread 83 total yields: 1625907
Thread 84 total yields: 1364484
Thread 85 total yields: 2698464
Thread 86 total yields: 1132631
Thread 87 total yields: 1272493
Thread 88 total yields: 544296
Thread 89 total yields: 642514
Thread 90 total yields: 1659716
Thread 91 total yields: 3657423
Thread 92 total yields: 1152010
Thread 93 total yields: 864437
Thread 94 total yields: 1914716
Thread 95 total yields: 665765
Thread 96 total yields: 470625
Thread 97 total yields: 1515056
Thread 98 total yields: 1694343
Thread 99 total yields: 656651
real 4m9.802s
user 31m31.862s
sys 1m30.147s
PAUSE:
[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 423635
Thread 1 total yields: 432373
Thread 2 total yields: 525277
Thread 3 total yields: 403625
Thread 4 total yields: 435605
Thread 5 total yields: 535598
Thread 6 total yields: 508091
Thread 7 total yields: 548162
Thread 8 total yields: 532789
Thread 9 total yields: 769968
Thread 10 total yields: 442565
Thread 11 total yields: 638749
Thread 12 total yields: 488601
Thread 13 total yields: 579624
Thread 14 total yields: 653364
Thread 15 total yields: 437042
Thread 16 total yields: 515524
Thread 17 total yields: 502716
Thread 18 total yields: 445556
Thread 19 total yields: 569855
Thread 20 total yields: 436675
Thread 21 total yields: 434850
Thread 22 total yields: 409440
Thread 23 total yields: 733155
Thread 24 total yields: 708836
Thread 25 total yields: 504327
Thread 26 total yields: 501003
Thread 27 total yields: 772928
Thread 28 total yields: 404170
Thread 29 total yields: 564273
Thread 30 total yields: 447199
Thread 31 total yields: 518266
Thread 32 total yields: 751953
Thread 33 total yields: 528898
Thread 34 total yields: 469453
Thread 35 total yields: 443822
Thread 36 total yields: 453571
Thread 37 total yields: 483523
Thread 38 total yields: 673307
Thread 39 total yields: 419745
Thread 40 total yields: 420812
Thread 41 total yields: 579195
Thread 42 total yields: 534738
Thread 43 total yields: 558074
Thread 44 total yields: 404649
Thread 45 total yields: 690615
Thread 46 total yields: 457234
Thread 47 total yields: 623036
Thread 48 total yields: 700575
Thread 49 total yields: 608860
Thread 50 total yields: 405334
Thread 51 total yields: 577808
Thread 52 total yields: 449998
Thread 53 total yields: 473125
Thread 54 total yields: 558360
Thread 55 total yields: 406760
Thread 56 total yields: 621827
Thread 57 total yields: 456095
Thread 58 total yields: 700446
Thread 59 total yields: 696581
Thread 60 total yields: 657749
Thread 61 total yields: 771747
Thread 62 total yields: 425028
Thread 63 total yields: 416165
Thread 64 total yields: 416922
Thread 65 total yields: 748436
Thread 66 total yields: 710466
Thread 67 total yields: 431879
Thread 68 total yields: 407904
Thread 69 total yields: 516825
Thread 70 total yields: 404993
Thread 71 total yields: 432439
Thread 72 total yields: 762656
Thread 73 total yields: 512795
Thread 74 total yields: 443227
Thread 75 total yields: 627807
Thread 76 total yields: 506496
Thread 77 total yields: 550415
Thread 78 total yields: 420525
Thread 79 total yields: 547715
Thread 80 total yields: 693714
Thread 81 total yields: 708321
Thread 82 total yields: 438307
Thread 83 total yields: 400358
Thread 84 total yields: 555737
Thread 85 total yields: 555062
Thread 86 total yields: 518418
Thread 87 total yields: 448384
Thread 88 total yields: 776252
Thread 89 total yields: 584543
Thread 90 total yields: 578234
Thread 91 total yields: 426361
Thread 92 total yields: 572331
Thread 93 total yields: 511610
Thread 94 total yields: 573504
Thread 95 total yields: 596209
Thread 96 total yields: 462665
Thread 97 total yields: 455619
Thread 98 total yields: 670371
Thread 99 total yields: 634330
real 3m50.676s
user 30m8.317s
sys 0m24.138s
With 100 threads the difference starts to show.
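To compare runs without eyeballing the listings, the per-thread totals can be reduced to a few summary figures. The helper below is a small sketch, not part of the benchmark. summarize_yields and struct yield_stats are names invented for this example, and the two arrays are the 10-thread totals copied from the output above.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one run's per-thread yield counts. */
struct yield_stats {
    long long total;   /* sum over all threads */
    int min;           /* fewest yields by any one thread */
    int max;           /* most yields by any one thread */
};

static inline struct yield_stats summarize_yields(const int *yields, size_t n) {
    assert(yields != NULL && n > 0);
    struct yield_stats s = { 0, yields[0], yields[0] };
    for (size_t i = 0; i < n; i++) {
        s.total += yields[i];
        if (yields[i] < s.min) s.min = yields[i];
        if (yields[i] > s.max) s.max = yields[i];
    }
    return s;
}

/* Per-thread totals from the 10-thread NOP run above. */
static const int nop_10[10] = {
    1569102, 1240318, 1339807, 1529634, 1215736,
    1400302, 1501579, 1066652,  699626, 1295480
};

/* Per-thread totals from the 10-thread PAUSE run above. */
static const int pause_10[10] = {
    151316, 285822, 321386, 493483, 306425,
    420804, 330618, 176552, 434332, 474757
};
```

For the 10-thread runs the totals come out at 12858236 yields with NOP against 3395495 with PAUSE, roughly a quarter as many. The longer listings can be pasted into arrays and summarized the same way.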
Regards, Peter
On Wednesday, August 29, 2012 08:31:23 AM Vitaly Davidovich wrote:
By the way, only thing I can think of as being a possible issue is the pause
instruction killing (or reducing) out of order/pipelined execution of the rest
of the loop body. But then I don't know if this is such an issue as the loop
exit will probably have a branch misprediction anyway, killing whatever is in
the pipeline. At any rate, would be interesting to see an explanation. Cheers
Sent from my phone
On Aug 29, 2012 8:23 AM, "Vitaly Davidovich" <vitalyd at gmail.com> wrote:
I'm actually curious to know if Eric can explain a bit more why pause is an
issue here, possibly with some benchmark results. David's point earlier was that
he doesn't think there's benefit to it in the way hotspot spins, but removing
pause implies it can actually do harm rather than simply being unhelpful. I'm
also assuming this is not AMD specific but Intel as well? Thanks
Sent from my phone
On Aug 29, 2012 8:04 AM, "Peter Levart" <peter.levart at marand.si> wrote:
Here's an interesting explanation about the impact of pause instruction in
spin-wait loops:
http://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors/
4 years later: It may be that newer CPUs are more clever now.
Regards, Peter
On Tuesday, August 28, 2012 11:36:23 AM Eric Caspole wrote:
> Hi everybody,
> I have made a webrev making a one-line change to remove use of PAUSE
> in linux x64. This will bring linux into sync with windows where
> SpinPause is just "return 0" as Dan indicates below.
>
> http://cr.openjdk.java.net/~ecaspole/nopause/webrev.00/webrev/
>
> We find that it is better not to use PAUSE in this kind of spin
> routine. Apparently someone discovered that on windows x64 years ago.
> Thanks,
> Eric
>
> On Aug 16, 2012, at 5:07 PM, Daniel D. Daugherty wrote:
> > On Win64, SpinPause() has been "return 0" since mid-2005. Way back
> >
> > when Win64 code was in os_win32_amd64.cpp:
> > SCCS/s.os_win32_amd64.cpp:
> >
> > D 1.9.1.1 05/07/04 03:20:45 dice 12 10 00025/00000/00334
> > MRs:
> > COMMENTS:
> > 5030359 -- back-end synchonization improvements - adaptive
> >
> > spinning, etc
> >
> > When the i486 and amd64 cpu dirs were merged back in 2007, the code
> >
> > became like it is below (#ifdef'ed):
> > D 1.32 07/09/17 09:11:33 sgoldman 37 35 00264/00008/00218
> > MRs:
> > COMMENTS:
> > 5108146 Merge i486 and amd64 cpu directories.
> > Macro-ized register names. Inserted amd64 specific code.
> >
> > Looks like on Linux-X64, the code has used the PAUSE instruction
> >
> > since mid-2005:
> > D 1.3 05/07/04 03:14:09 dice 4 3 00031/00000/00353
> > MRs:
> > COMMENTS:
> > 5030359 -- back-end synchonization improvements - adaptive
> >
> > spinning, etc
> >
> > We'll have to see if Dave Dice remembers why he implemented
> > it this way...
> >
> > Dan
> >
> > On 8/16/12 12:01 PM, Eric Caspole wrote:
> > > Hi everybody,
> > > Does anybody know the reason why SpinPause is simply "return 0" on
> > > Win64 but uses PAUSE on Linux in a .s file?
> > > We would like to remove PAUSE from linux too.
> > >
> > > Thanks,
> > > Eric
> > >
> > >
> > > ./src/os_cpu/windows_x86/vm/os_windows_x86.cpp
> > >
> > > extern "C" int SpinPause () {
> > > #ifdef AMD64
> > >   return 0 ;
> > > #else
> > >   // pause == rep:nop
> > >   // On systems that don't support pause a rep:nop
> > >   // is executed as a nop. The rep: prefix is ignored.
> > >   _asm {
> > >     pause ;
> > >   };
> > >   return 1 ;
> > > #endif // AMD64
> > > }
> > >
> > > src/os_cpu/linux_x86/vm/linux_x86_64.s
> > >
> > >         .globl SpinPause
> > >         .align 16
> > >         .type  SpinPause,@function
> > > SpinPause:
> > >         rep
> > >         nop
> > >         movq   $1, %rax
> > >         ret