
List:       openjdk-hotspot-runtime-dev
Subject:    Using x86 pause instr in SpinPause
From:       peter.levart () marand ! si (Peter Levart)
Date:       2012-08-29 16:43:21
Message-ID: 6205263.7S5AjWIolE () peterl ! marand ! si

I played with the PAUSE instruction a little on my i7 machine.

What I found is that on the i7 (4 cores, 2 threads per core) the wait loop spins about 3-4 times faster with NOP than with the PAUSE instruction, although all 4 CPU cores run at maximum frequency in both cases.

It also seems that a loop using PAUSE spins at a much more "constant" speed, so the lock becomes fairer than the same lock using NOP in the spin-loop.

With a high number of threads, efficiency is also better with PAUSE in the loop instead of just NOP.

Here's how I tested:

I tried to mimic hotspot's Thread::SpinAcquire/SpinRelease, so I ripped some code fragments from the hotspot sources. Here's a simplified SpinLock.c implementation that does not resort to park after 5 yields but instead continues to yield every 4096 spins indefinitely:

#include <sched.h>

inline void fence() {
    // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
}

inline int cmpxchg(int exchange_value, volatile int* dest, int compare_value) {
  // atomically: if (*dest == compare_value) *dest = exchange_value;
  // returns the previous value of *dest (so it equals compare_value on success)
  __asm__ volatile ("lock; cmpxchgl %1,(%3)"
                    : "=a" (exchange_value)
                    : "r" (exchange_value), "a" (compare_value), "r" (dest)
                    : "cc", "memory");
  return exchange_value;
}

int SpinPause () ;

int SpinAcquire (volatile int * adr) {
  if (cmpxchg (1, adr, 0) == 0) {
     return 0;   // normal fast-path return
  }

  // Slow-path : We've encountered contention -- Spin/Yield strategy.
  int ctr = 0 ;
  int Yields = 0 ;
  for (;;) {
     while (*adr != 0) {
        ++ctr ;
        if ((ctr & 0xFFF) == 0) {
           sched_yield() ;
           ++Yields ;
        } else {
           SpinPause() ;
        }
     }
     if (cmpxchg (1, adr, 0) == 0) return Yields;
  }
}

void SpinRelease (volatile int * adr) {
  fence() ;      // guarantee at least release consistency.
  // Roach-motel semantics.
  // It's safe if subsequent LDs and STs float "up" into the critical section,
  // but prior LDs and STs within the critical section can't be allowed
  // to reorder or float past the ST that releases the lock.
  *adr = 0 ;
}


... and SpinPause.s contains either NOP or PAUSE; shown here is the PAUSE variant (the rep; nop pair is the encoding of PAUSE):

        .globl SpinPause
        .align 16
        .type  SpinPause,@function
SpinPause:
        rep
        nop
        movq   $1, %rax
        ret
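
The NOP variant simply replaces the rep; nop pair with a plain nop, along these lines:

        .globl SpinPause
        .align 16
        .type  SpinPause,@function
SpinPause:
        nop
        movq   $1, %rax
        ret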
 


The benchmark is designed around a relatively small "critical" section of code that is guarded by a single spin/yield lock and executed repeatedly by multiple threads.

The size of the critical section is tuned so that, with PAUSE in the spin-loop and 10 concurrent threads, each lock acquire takes on average 3-4 yields (with 4096 spins between yields) before returning.

Each thread executes a constant number of acquires and critical sections, so the amount of "useful work" is constant.

The number of threads is then varied (10, 30, 100), and so is the type of spin-lock: NOP / PAUSE.
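
The test program below includes a SpinLock.h header that isn't shown here; a minimal version, assuming it just declares the functions from SpinLock.c and SpinPause.s, could look like this:

/* SpinLock.h - minimal header assumed by the test program below */
#ifndef SPINLOCK_H
#define SPINLOCK_H

int  SpinPause(void);
int  SpinAcquire(volatile int *adr);   /* returns the number of yields taken */
void SpinRelease(volatile int *adr);

#endif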

Here's the code:


#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#include "SpinLock.h"

#define THREADS 10
#define OUTER_LOOP_SIZE 100000
#define WORK_LOOP_SIZE 100

void *runnable( void *ptr );
 
int main(int argc, char *argv[])
{
     pthread_t thread[THREADS];
     int yields[THREADS];
     int i;
     
     for (i = 0; i < THREADS; i++)
     {
       yields[i] = 0;
       pthread_create( &thread[i], NULL, runnable, (void*) &yields[i]);
     }
      
     for (i = 0; i < THREADS; i++)
       pthread_join( thread[i], NULL);
 
     printf("Acquires per thread: %d\n", OUTER_LOOP_SIZE);
     
     for (i = 0; i < THREADS; i++)
       printf("Thread %d total yields: %d\n", i, yields[i]);
     
     exit(0);
}

int lock = 0;

void *runnable( void *ptr )
{
     int *thread_yields = (int *) ptr;

     int yields = 0;
     int i, j;
     int sum = 0;

     for (i = 0; i < OUTER_LOOP_SIZE; i++)
     {
       yields += SpinAcquire(&lock);

       // some "useful work" inside the critical section
       for (j = 0; j < WORK_LOOP_SIZE; j++)
       {
         int k;
         for (k = 0; k < 100; k++)
           sum += yields;
       }

       SpinRelease(&lock);
     }

     // the "sum + yields - sum" dance is a crude attempt to keep 'sum'
     // (and with it the work loop) from being optimized away
     *thread_yields = sum + yields - sum;
     return NULL;
}
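
Everything can be built with a single gcc invocation along these lines (file names and flags are illustrative; the NOP or PAUSE version of SpinPause.s is swapped in before each build):

gcc -O2 -o SpinLockTest SpinLockTest.c SpinLock.c SpinPause.s -lpthread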



Here are the results:


10 threads
----------

NOP:

[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 1569102
Thread 1 total yields: 1240318
Thread 2 total yields: 1339807
Thread 3 total yields: 1529634
Thread 4 total yields: 1215736
Thread 5 total yields: 1400302
Thread 6 total yields: 1501579
Thread 7 total yields: 1066652
Thread 8 total yields: 699626
Thread 9 total yields: 1295480

real    0m18.428s
user    2m15.209s
sys     0m4.084s


PAUSE:

[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 151316
Thread 1 total yields: 285822
Thread 2 total yields: 321386
Thread 3 total yields: 493483
Thread 4 total yields: 306425
Thread 5 total yields: 420804
Thread 6 total yields: 330618
Thread 7 total yields: 176552
Thread 8 total yields: 434332
Thread 9 total yields: 474757

real    0m17.761s
user    1m58.394s
sys     0m0.936s


30 threads
----------

NOP:

[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 1516779
Thread 1 total yields: 2007878
Thread 2 total yields: 1027554
Thread 3 total yields: 2085868
Thread 4 total yields: 1152372
Thread 5 total yields: 575832
Thread 6 total yields: 1742543
Thread 7 total yields: 1083592
Thread 8 total yields: 1522944
Thread 9 total yields: 1906178
Thread 10 total yields: 1127148
Thread 11 total yields: 1872796
Thread 12 total yields: 951841
Thread 13 total yields: 1997159
Thread 14 total yields: 1347902
Thread 15 total yields: 1490622
Thread 16 total yields: 1836486
Thread 17 total yields: 2037498
Thread 18 total yields: 1648107
Thread 19 total yields: 1415434
Thread 20 total yields: 2044485
Thread 21 total yields: 644851
Thread 22 total yields: 599960
Thread 23 total yields: 1328082
Thread 24 total yields: 612360
Thread 25 total yields: 640560
Thread 26 total yields: 541422
Thread 27 total yields: 2018600
Thread 28 total yields: 605493
Thread 29 total yields: 1994882

real    0m57.296s
user    7m12.540s
sys     0m18.230s

PAUSE:

[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 470464
Thread 1 total yields: 361301
Thread 2 total yields: 349293
Thread 3 total yields: 364812
Thread 4 total yields: 360606
Thread 5 total yields: 517756
Thread 6 total yields: 421567
Thread 7 total yields: 301124
Thread 8 total yields: 555139
Thread 9 total yields: 481553
Thread 10 total yields: 634827
Thread 11 total yields: 344431
Thread 12 total yields: 308426
Thread 13 total yields: 405738
Thread 14 total yields: 349320
Thread 15 total yields: 563739
Thread 16 total yields: 301356
Thread 17 total yields: 350707
Thread 18 total yields: 329854
Thread 19 total yields: 471155
Thread 20 total yields: 380747
Thread 21 total yields: 533402
Thread 22 total yields: 638196
Thread 23 total yields: 628245
Thread 24 total yields: 342036
Thread 25 total yields: 356593
Thread 26 total yields: 351789
Thread 27 total yields: 300021
Thread 28 total yields: 290741
Thread 29 total yields: 484845

real    0m56.744s
user    7m12.831s
sys     0m5.127s


100 threads
-----------

NOP:

[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 715692
Thread 1 total yields: 3956169
Thread 2 total yields: 1043740
Thread 3 total yields: 1029961
Thread 4 total yields: 1499409
Thread 5 total yields: 2012256
Thread 6 total yields: 2683384
Thread 7 total yields: 3229792
Thread 8 total yields: 2053604
Thread 9 total yields: 537083
Thread 10 total yields: 3068300
Thread 11 total yields: 485165
Thread 12 total yields: 1294566
Thread 13 total yields: 477668
Thread 14 total yields: 1935177
Thread 15 total yields: 1320217
Thread 16 total yields: 3630811
Thread 17 total yields: 2002443
Thread 18 total yields: 2573688
Thread 19 total yields: 656375
Thread 20 total yields: 723496
Thread 21 total yields: 2867001
Thread 22 total yields: 3462940
Thread 23 total yields: 2638588
Thread 24 total yields: 3591878
Thread 25 total yields: 674437
Thread 26 total yields: 1767265
Thread 27 total yields: 3254028
Thread 28 total yields: 1148442
Thread 29 total yields: 3232056
Thread 30 total yields: 1710429
Thread 31 total yields: 487365
Thread 32 total yields: 475716
Thread 33 total yields: 672629
Thread 34 total yields: 2235400
Thread 35 total yields: 1073231
Thread 36 total yields: 1564212
Thread 37 total yields: 1232321
Thread 38 total yields: 1668370
Thread 39 total yields: 3926584
Thread 40 total yields: 3639128
Thread 41 total yields: 2135553
Thread 42 total yields: 2410193
Thread 43 total yields: 465033
Thread 44 total yields: 2267986
Thread 45 total yields: 2556756
Thread 46 total yields: 1233673
Thread 47 total yields: 2296487
Thread 48 total yields: 1569566
Thread 49 total yields: 2087966
Thread 50 total yields: 1141489
Thread 51 total yields: 2895012
Thread 52 total yields: 1318840
Thread 53 total yields: 463860
Thread 54 total yields: 3063963
Thread 55 total yields: 1602932
Thread 56 total yields: 2998556
Thread 57 total yields: 3052040
Thread 58 total yields: 2936234
Thread 59 total yields: 1608002
Thread 60 total yields: 1947606
Thread 61 total yields: 3650902
Thread 62 total yields: 481011
Thread 63 total yields: 2946211
Thread 64 total yields: 2657741
Thread 65 total yields: 1854059
Thread 66 total yields: 612458
Thread 67 total yields: 3858631
Thread 68 total yields: 3645990
Thread 69 total yields: 2916354
Thread 70 total yields: 1587217
Thread 71 total yields: 625513
Thread 72 total yields: 810526
Thread 73 total yields: 3230899
Thread 74 total yields: 3117595
Thread 75 total yields: 680967
Thread 76 total yields: 1925092
Thread 77 total yields: 2205682
Thread 78 total yields: 2669335
Thread 79 total yields: 699507
Thread 80 total yields: 462614
Thread 81 total yields: 1108081
Thread 82 total yields: 998706
Thread 83 total yields: 1625907
Thread 84 total yields: 1364484
Thread 85 total yields: 2698464
Thread 86 total yields: 1132631
Thread 87 total yields: 1272493
Thread 88 total yields: 544296
Thread 89 total yields: 642514
Thread 90 total yields: 1659716
Thread 91 total yields: 3657423
Thread 92 total yields: 1152010
Thread 93 total yields: 864437
Thread 94 total yields: 1914716
Thread 95 total yields: 665765
Thread 96 total yields: 470625
Thread 97 total yields: 1515056
Thread 98 total yields: 1694343
Thread 99 total yields: 656651

real    4m9.802s
user    31m31.862s
sys     1m30.147s

PAUSE:

[peter@peterl src]$ time ./SpinLockTest
Acquires per thread: 100000
Thread 0 total yields: 423635
Thread 1 total yields: 432373
Thread 2 total yields: 525277
Thread 3 total yields: 403625
Thread 4 total yields: 435605
Thread 5 total yields: 535598
Thread 6 total yields: 508091
Thread 7 total yields: 548162
Thread 8 total yields: 532789
Thread 9 total yields: 769968
Thread 10 total yields: 442565
Thread 11 total yields: 638749
Thread 12 total yields: 488601
Thread 13 total yields: 579624
Thread 14 total yields: 653364
Thread 15 total yields: 437042
Thread 16 total yields: 515524
Thread 17 total yields: 502716
Thread 18 total yields: 445556
Thread 19 total yields: 569855
Thread 20 total yields: 436675
Thread 21 total yields: 434850
Thread 22 total yields: 409440
Thread 23 total yields: 733155
Thread 24 total yields: 708836
Thread 25 total yields: 504327
Thread 26 total yields: 501003
Thread 27 total yields: 772928
Thread 28 total yields: 404170
Thread 29 total yields: 564273
Thread 30 total yields: 447199
Thread 31 total yields: 518266
Thread 32 total yields: 751953
Thread 33 total yields: 528898
Thread 34 total yields: 469453
Thread 35 total yields: 443822
Thread 36 total yields: 453571
Thread 37 total yields: 483523
Thread 38 total yields: 673307
Thread 39 total yields: 419745
Thread 40 total yields: 420812
Thread 41 total yields: 579195
Thread 42 total yields: 534738
Thread 43 total yields: 558074
Thread 44 total yields: 404649
Thread 45 total yields: 690615
Thread 46 total yields: 457234
Thread 47 total yields: 623036
Thread 48 total yields: 700575
Thread 49 total yields: 608860
Thread 50 total yields: 405334
Thread 51 total yields: 577808
Thread 52 total yields: 449998
Thread 53 total yields: 473125
Thread 54 total yields: 558360
Thread 55 total yields: 406760
Thread 56 total yields: 621827
Thread 57 total yields: 456095
Thread 58 total yields: 700446
Thread 59 total yields: 696581
Thread 60 total yields: 657749
Thread 61 total yields: 771747
Thread 62 total yields: 425028
Thread 63 total yields: 416165
Thread 64 total yields: 416922
Thread 65 total yields: 748436
Thread 66 total yields: 710466
Thread 67 total yields: 431879
Thread 68 total yields: 407904
Thread 69 total yields: 516825
Thread 70 total yields: 404993
Thread 71 total yields: 432439
Thread 72 total yields: 762656
Thread 73 total yields: 512795
Thread 74 total yields: 443227
Thread 75 total yields: 627807
Thread 76 total yields: 506496
Thread 77 total yields: 550415
Thread 78 total yields: 420525
Thread 79 total yields: 547715
Thread 80 total yields: 693714
Thread 81 total yields: 708321
Thread 82 total yields: 438307
Thread 83 total yields: 400358
Thread 84 total yields: 555737
Thread 85 total yields: 555062
Thread 86 total yields: 518418
Thread 87 total yields: 448384
Thread 88 total yields: 776252
Thread 89 total yields: 584543
Thread 90 total yields: 578234
Thread 91 total yields: 426361
Thread 92 total yields: 572331
Thread 93 total yields: 511610
Thread 94 total yields: 573504
Thread 95 total yields: 596209
Thread 96 total yields: 462665
Thread 97 total yields: 455619
Thread 98 total yields: 670371
Thread 99 total yields: 634330

real    3m50.676s
user    30m8.317s
sys     0m24.138s



With 100 threads the difference starts to show.
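
To summarize the timings from the runs above:

                     NOP                                        PAUSE
 10 threads   real 0m18.4s, user  2m15.2s, sys 0m04.1s   real 0m17.8s, user  1m58.4s, sys 0m00.9s
 30 threads   real 0m57.3s, user  7m12.5s, sys 0m18.2s   real 0m56.7s, user  7m12.8s, sys 0m05.1s
100 threads   real 4m09.8s, user 31m31.9s, sys 1m30.1s   real 3m50.7s, user 30m08.3s, sys 0m24.1s

Note how the system time in particular drops with PAUSE.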


Regards, Peter


On Wednesday, August 29, 2012 08:31:23 AM Vitaly Davidovich wrote:

By the way, the only thing I can think of as being a possible issue is the pause instruction killing (or reducing) out-of-order/pipelined execution of the rest of the loop body. But then I don't know if this is such an issue, as the loop exit will probably have a branch misprediction anyway, killing whatever is in the pipeline. At any rate, it would be interesting to see an explanation.

Cheers

Sent from my phone
On Aug 29, 2012 8:23 AM, "Vitaly Davidovich" <vitalyd at gmail.com> wrote:

I'm actually curious to know if Eric can explain a bit more why pause is an issue here, possibly with some benchmark results. David's point earlier was that he doesn't think there's a benefit to it in the way hotspot spins, but removing pause implies it can actually do harm rather than simply being unhelpful. I'm also assuming this is not AMD-specific but applies to Intel as well?

Thanks

Sent from my phone
On Aug 29, 2012 8:04 AM, "Peter Levart" <peter.levart at marand.si> wrote:

Here's an interesting explanation about the impact of the pause instruction in spin-wait loops:

http://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors/


The article is 4 years old, though; it may be that newer CPUs are more clever now.

Regards, Peter

On Tuesday, August 28, 2012 11:36:23 AM Eric Caspole wrote:
> Hi everybody,
> I have made a webrev making a one-line change to remove use of PAUSE
> in linux x64. This will bring linux into sync with windows where
> SpinPause is just "return 0" as Dan indicates below.
> 
> http://cr.openjdk.java.net/~ecaspole/nopause/webrev.00/webrev/
> 
> We find that it is better not to use PAUSE in this kind of spin
> routine. Apparently someone discovered that on windows x64 years ago.
> Thanks,
> Eric
> 
> On Aug 16, 2012, at 5:07 PM, Daniel D. Daugherty wrote:
> > On Win64, SpinPause() has been "return 0" since mid-2005. Way back
> > 
> > when Win64 code was in os_win32_amd64.cpp:
> > SCCS/s.os_win32_amd64.cpp:
> > 
> > D 1.9.1.1 05/07/04 03:20:45 dice 12 10  00025/00000/00334
> > MRs:
> > COMMENTS:
> > 5030359 -- back-end synchonization improvements - adaptive
> > 
> > spinning, etc
> > 
> > When the i486 and amd64 cpu dirs were merged back in 2007, the code
> > 
> > became like it is below (#ifdef'ed):
> > D 1.32 07/09/17 09:11:33 sgoldman 37 35 00264/00008/00218
> > MRs:
> > COMMENTS:
> > 5108146 Merge i486 and amd64 cpu directories.
> > Macro-ized register names. Inserted amd64 specific code.
> > 
> > Looks like on Linux-X64, the code has used the PAUSE instruction
> > 
> > since mid-2005:
> > D 1.3 05/07/04 03:14:09 dice 4 3        00031/00000/00353
> > MRs:
> > COMMENTS:
> > 5030359 -- back-end synchonization improvements - adaptive
> > 
> > spinning, etc
> > 
> > We'll have to see if Dave Dice remember why he implemented
> > it this way...
> > 
> > Dan
> > 
> > On 8/16/12 12:01 PM, Eric Caspole wrote:
> > > Hi everybody,
> > > Does anybody know the reason why SpinPause is simply "return 0" on
> > > Win64 but uses PAUSE on Linux in a .s file?
> > > We would like to remove PAUSE from linux too.
> > > 
> > > Thanks,
> > > Eric
> > > 
> > > 
> > > ./src/os_cpu/windows_x86/vm/os_windows_x86.cpp
> > > 
> > > 548 extern "C" int SpinPause () {
> > > 549 #ifdef AMD64
> > > 550    return 0 ;
> > > 551 #else
> > > 552    // pause == rep:nop
> > > 553    // On systems that don't support pause a rep:nop
> > > 554    // is executed as a nop.  The rep: prefix is ignored.
> > > 555    _asm {
> > > 556       pause ;
> > > 557    };
> > > 558    return 1 ;
> > > 559 #endif // AMD64
> > > 560 }
> > > 
> > > src/os_cpu/linux_x86/vm/linux_x86_64.s
> > > 
> > > 63         .globl SpinPause
> > > 64         .align 16
> > > 65         .type  SpinPause,@function
> > > 66 SpinPause:
> > > 67         rep
> > > 68         nop
> > > 69         movq   $1, %rax
> > > 70         ret






