
List:       hpux-cxx-dev
Subject:    RE: CXX-DEV: Performance problems HP-UX 11.23 PA-RISC, pthread_mutex_unlock
From:       "John Morris" <john () coyotebush ! net>
Date:       2006-07-10 16:55:37
Message-ID: 000c01c6a441$ac74e7d0$6500a8c0 () americas ! hpqcorp ! net

I'm not able to recreate the CPU profile on Itanium 11.23.

Using very old hardware, the unit test takes 4-4.5 seconds on both Itanium
and PA-RISC, which *is* a problem compared to the Sun numbers. But the
profile I see isn't even close to the one reported. Caliper shows ~50% of
the time in lock/unlock, and the other 50% scattered among other routines.
Setmask shows up at only 10%. There is room for a 4x improvement, but I
don't see 90% of the time in setmask in my profile.

I'm confused about the role of MxN threads. In the unit test, we are not
spawning child threads. What steps did you take to get different numbers?
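
To keep such comparisons honest, each timing could be logged together
with the threading model the environment asked for. A minimal sketch,
assuming the variable names and the valid values (ON, on, 1) quoted
further down in this thread. The helper names isEnabledValue and
oneToOneRequested are made up for illustration, and the check only
reports what was requested, not what libpthread actually selected:

```cpp
#include <cstdlib>
#include <cstring>

// True for the values the quoted white-paper excerpt lists: ON, on, 1.
// A null pointer (variable not set) counts as "not enabled".
bool isEnabledValue(const char* value) {
    return value != 0 &&
           (std::strcmp(value, "ON") == 0 ||
            std::strcmp(value, "on") == 0 ||
            std::strcmp(value, "1") == 0);
}

// True if either variable named in this thread requests the 1x1 model.
// Assumption: no other setting influences the choice; the default when
// neither is set depends on the 11.23 version and is not detected here.
bool oneToOneRequested() {
    return isEnabledValue(std::getenv("PTHREAD_COMPAT_MODE")) ||
           isEnabledValue(std::getenv("PTHREAD_FORCE_SCOPE_SYSTEM"));
}
```

Printing the result of oneToOneRequested() next to "Total time" would
make it obvious which runs were forced to 1x1.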

If the primary problem is that the unit test runs 4x slower, then I believe
we can do something about it. If the unit test reflects your application,
I'll push especially hard to get it fixed.

However, are you sure the unit test models the application? Horrible
application performance is usually related to mutex contention, not
uncontended mutexes. I want to be sure we're solving the right problem.
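
One way to check is to time the same lock/unlock pair with and without
competing threads. A rough sketch, assuming plain POSIX pthreads and
gettimeofday(); timeLockLoop and its arguments are hypothetical names,
not part of the original unit test:

```cpp
#include <pthread.h>
#include <sys/time.h>
#include <vector>

static pthread_mutex_t gMutex = PTHREAD_MUTEX_INITIALIZER;

struct WorkerArgs { long iterations; };

// Repeats the lock/unlock pair from the original test program.
static void* worker(void* p) {
    long n = static_cast<WorkerArgs*>(p)->iterations;
    for (long i = 0; i < n; ++i) {
        pthread_mutex_lock(&gMutex);
        pthread_mutex_unlock(&gMutex);
    }
    return 0;
}

// Wall-clock time in seconds with microsecond resolution.
static double nowSeconds() {
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Runs the loop in the calling thread plus extraThreads child threads.
// extraThreads == 0 is the uncontended case (no child threads spawned).
// Returns elapsed seconds, including thread creation and join time,
// or -1.0 if a child thread could not be created.
double timeLockLoop(int extraThreads, long iterationsPerThread) {
    std::vector<pthread_t> ids(extraThreads);
    WorkerArgs args = { iterationsPerThread };
    double start = nowSeconds();
    int started = 0;
    for (; started < extraThreads; ++started) {
        if (pthread_create(&ids[started], 0, worker, &args) != 0) break;
    }
    worker(&args);  // the calling thread takes part as well
    for (int i = 0; i < started; ++i) pthread_join(ids[i], 0);
    double elapsed = nowSeconds() - start;
    return started == extraThreads ? elapsed : -1.0;
}
```

timeLockLoop(0, 10000000) mirrors the unit test, while
timeLockLoop(3, 10000000) adds three competing threads. If the
application looks like the second case, contention is the thing to fix.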

> -----Original Message-----
> From: owner-cxx-dev@cxx.cup.hp.com 
> [mailto:owner-cxx-dev@cxx.cup.hp.com] On Behalf Of Maucci, Cyrille
> Sent: Sunday, July 09, 2006 6:41 AM
> To: John Morris; Johan Piculell (KA/EAB)
> Cc: Fredrik Lannergren (KA/EAB); cxx-dev@cxx.cup.hp.com; Ono, 
> Shigeru (Presales Japan); Maucci, Cyrille
> Subject: RE: CXX-DEV: Performance problems HP-UX 11.23 
> PA-RISC, pthread_mutex_unlock
> 
> FYI,
> 
> 
> Not directly related to this unit test case performance and 
> the MxN vs 1x1, but still related to pthread mutexes perf on 11.23:
> PHKL_34739 is a patch that fixes 'PTHREAD_PROCESS_SHARED rwlocks'
> specific performance level on 11.23 (to bring it back to 11.11 level).
> 
> 
> ++Cyrille
> 
> -----Original Message-----
> From: owner-cxx-dev@cxx.cup.hp.com 
> [mailto:owner-cxx-dev@cxx.cup.hp.com]
> On Behalf Of Maucci, Cyrille
> Sent: Saturday, July 08, 2006 10:34 PM
> To: John Morris; Johan Piculell (KA/EAB)
> Cc: Fredrik Lannergren (KA/EAB); cxx-dev@cxx.cup.hp.com; Ono, 
> Shigeru (Presales Japan)
> Subject: RE: CXX-DEV: Performance problems HP-UX 11.23 
> PA-RISC, pthread_mutex_unlock
> 
> Hi John, Johan,
> 
> 
> 
> John, I am sorry but I have to disagree on the following
> 
> " For a simple lock/unlock loop, it doesn't matter whether 
> you use 1x1 or MxN threads. The code path is the same for both cases."
> 
> This is not true. Of course, here we are not using 'threads' 
> but 'pthread mutexes', so you could argue that 1x1 or MxN 
> should not make the difference. But it actually DOES make a 
> difference.
> Wallclock time agrees with me, as well as ALL profiling 
> tools, as well as HP people that have access to the pthread lib code.
> 
> 
> 
> On my rp7410/8@875Mhz
> 
> bene:/root#aCC -mt toto.C
> bene:/root#./a.out (i.e. MxN)
> Total time 7
> bene:/root#PTHREAD_COMPAT_MODE=1 ./a.out (i.e. 1x1)
> Total time 2
> 
> A gprof report for the MxN case will show flat
>  20.9    1.44    1.44                    __pthread_mutex_unlock
>  18.0    2.67    1.24                    _lw_mxn_setsigmask    <- Here's the MxN contrib!
>  16.0    3.77    1.10                    _mcount
>  13.0    4.66    0.89 10000124       0.00 sigaddset
>   9.5    5.31    0.65                    __pthread_mutex_lock
>   8.3    5.88    0.57       1     570.00 main
>   7.1    6.37    0.49 10000124       0.00 sigdelset
>   6.1    6.79    0.42                    __mutex_unlock_handoff_disabled
>   0.4    6.82    0.03                    _mcleanup
>   0.4    6.84    0.03                    __pthread_mutex_lock_wait
> ...
> 
> A gprof report for the 1x1 case will show flat
>  41.2    0.61    0.61                    __pthread_mutex_unlock
>  23.0    0.95    0.34                    __mutex_unlock_handoff_disabled
>  20.3    1.25    0.30                    __pthread_mutex_lock
>  15.5    1.48    0.23       1     230.00 main
>   0.0    1.48    0.00      73       0.00 __thread_rec_mutex_init
>   0.0    1.48    0.00      43       0.00 __errno
> ...
> 
> A quantify report for the MxN case will show flat
>  50.31%   _lw_mxn_setsigmask		<- Here's the MxN contrib!
>  22.10%   pthread_mutex_unlock
>   9.63%   pthread_mutex_lock
>   8.76%   uwss_248
>   2.98%   sigdelset
>   2.98%   sigaddset
>   2.98%   main
>   0.24%   _lw_sa_pending_signals
> 
> A quantify report for the 1x1 case will show flat
>  40.17%   pthread_mutex_unlock
>  26.94%   pthread_mutex_lock
>  24.49%   uwss_248
>   8.33%   main
>   0.05%   _crt_read
> 
> A prospect fprof report for the MxN case will show flat
>  41.69    41.69               168  libpthread.1::pthread_mutex_unlock
>  17.62    59.31                71  libpthread.1::_lw_mxn_setsigmask    <- Here's the MxN contrib!
>  12.41    71.71                50  libpthread.1::pthread_mutex_lock
>   5.71    77.42                23  a.out::main
>   5.46    82.88                22  libc.2::sigaddset
>   5.21    88.09                21  libc.2::sigdelset
>   4.71    92.80                19  libpthread.1::sigaddset
>   4.47    97.27                18  libpthread.1::sigdelset
>   2.23    99.50                 9  a.out::pthread_mutex_unlock
>   0.25    99.75                 1  a.out::pthread_mutex_lock
>   0.25   100.00                 1  libpthread.1::_lw_sa_pending_signals
> 
> A prospect fprof report for the 1x1 case will show flat
>  50.72    50.72               105  libpthread.1::pthread_mutex_unlock
>  27.05    77.78                56  libpthread.1::pthread_mutex_lock
>  15.94    93.72                33  a.out::main
>   5.31    99.03                11  a.out::pthread_mutex_unlock
>   0.48    99.52                 1  a.out::pthread_mutex_lock
>   0.48   100.00                 1  libstd.2::seed__18__random_generatorFUl
> 
> 
> 
> 
> 
> On HP-UX 11.23, depending on the specific version of 11.23 (cf.
> http://devresource.hp.com/drc/resources/pthread_wp_jul2004.pdf),
> a program linked with libpthread (-mt) will use either 1x1 or MxN
> by default. And even for a very simple example such as the one
> we've been playing with, 1x1 performs much better because
> "The _lw_mxn_setsigmask() calls are all mxn related. They are
> called from a couple of macros: ENTER_PTHREAD_LIBRARY and
> LEAVE_PTHREAD_LIBRARY. These primarily mask off signals while in
> the pthread library, preventing embarrassing situations like the
> library acquires an internal lock, a context switch timer pops so
> we enter the mxn scheduler which then requires the very same lock -
> deadlock with ourself..." (these words are not mine but from one of
> the pthread lib experts)
> 
> 
> 
> >>     1) Use the latest 11.23 pthread performance patch.
> >>  (Can somebody mention the number? I can't seem to find it.)
> 
> As I said in my previous post, the newest pthread lib patch 
> is PHCO_34718.
> 
> 
> 
> 
> Now. Back to Johan's initial thread, I agree with you. Johan 
> must have posted this unit test case because he has seen 
> something wrong with his application.
> 
> So Johan, maybe you can enlighten us on the application level 
> perf issues you saw.
> Have you used tools like prospect (www.hp.com/go/prospect) on 
> PA-RISC or caliper (www.hp.com/go/caliper) on Itanium?
> If you can profile your whole application with either tool 
> (or probably also quantify since you talked about it), maybe 
> we can be of further help.
> 
> 
> 
> Best Regards
> Cyrille
> 
> 
> 
> 
> 
> -----Original Message-----
> From: owner-cxx-dev@cxx.cup.hp.com 
> [mailto:owner-cxx-dev@cxx.cup.hp.com]
> On Behalf Of John Morris
> Sent: Saturday, July 08, 2006 9:22 PM
> To: Maucci, Cyrille; 'Johan Piculell (KA/EAB)'
> Cc: 'Fredrik Lannergren (KA/EAB)'; cxx-dev@cxx.cup.hp.com; 
> Ono, Shigeru (Presales Japan)
> Subject: RE: CXX-DEV: Performance problems HP-UX 11.23 
> PA-RISC, pthread_mutex_unlock
> 
> Some information about mutexes:
> 
> *   For a simple lock/unlock loop, it doesn't matter whether you use
>     1x1 or MxN threads. The code path is the same for both cases.
> 
> *   Likewise, it doesn't matter if you install the new pthread patch.
>     The patch works miracles for mutex contention, but it doesn't
>     change the uncontended path used by your short loop.
> 
> *   The profile threw me off track.
>         90.18% _lw_mxn_setsigmask
>          4.17% pthread_mutex_unlock
>          2.02% pthread_mutex_lock
>     I think the pthread library has been "stripped", and the
>     profiling tool gives the name of the closest external entry
>     point. Neither lock() nor unlock() calls any form of
>     setsigmask(), but they do call internal procedures.
> 
> *   Currently, HPUX counts the number of lock/unlock calls, and
>     maintaining these counters is a significant overhead. I
>     haven't seen the Sun code, but I suspect they are not
>     maintaining counters. You could verify by stepping through
>     with an assembly language debugger.
> 
> I have a question. How important are the performance counters 
> to you? If we had a version of lock/unlock which was 4x 
> faster but didn't monitor performance, would that version be 
> preferable?
> 
> Another question. What is the behaviour in your actual 
> application? Does the application suffer from contention, or 
> is the straight through lock/unlock path dominating your 
> performance? (The easiest way to identify contention is to do a 
> short "tusc" system call trace. Mutex contention results in calls to
> sched_yield() and ksleep().)
> 
> And yet another question. If we doubled uncontended 
> lock/unlock performance again (4x-->8x) would it make a 
> measurable difference to your application?
> (Keep in mind, a single cache miss would consume 5-10x the savings.)
> 
>   - John Morris
> 
> FYI. Some tips for reducing contention, just in case they are
> relevant.
>     1) Use the latest 11.23 pthread performance patch. (Can
>        somebody mention the number? I can't seem to find it.)
>     2) Run using  rtprio -s SCHED_NOAGE
>     3) Make sure your malloc library is tuned
>     4) If using C++ strings, use the recent option to disable
>        mutexes for reference counts.
>        (The presence of the mutex is a bug, but removing it
>        introduces compatibility problems when passing strings to
>        older libraries. The option is not a problem if all your
>        code has been recently recompiled.)
>     5) If you suspect contention, attach the debugger and collect
>        a stack trace. You should be able to figure out which
>        mutexes are the problem from the procedure names in the
>        traces.
>     6) Use inlined atomic ops for reference counts (Itanium).
>        Emulate the atomic ops using handcoded spinlocks (PA).
> 
> 
> > -----Original Message-----
> > From: owner-cxx-dev@cxx.cup.hp.com
> > [mailto:owner-cxx-dev@cxx.cup.hp.com] On Behalf Of Maucci, Cyrille
> > Sent: Friday, July 07, 2006 11:17 PM
> > To: Johan Piculell (KA/EAB)
> > Cc: Fredrik Lannergren (KA/EAB); cxx-dev@cxx.cup.hp.com; Maucci, 
> > Cyrille; Ono, Shigeru (Presales Japan)
> > Subject: RE: CXX-DEV: Performance problems HP-UX 11.23 PA-RISC, 
> > pthread_mutex_unlock
> >
> > Johan,
> >
> >
> >
> > I've had it confirmed internally that
> >
> > - 'Sun use 1x1 threads'
> > So we should compare 1x1 perf on HP with Sun's perf.
> >
> > - 'The workaround is to use "export PTHREAD_FORCE_SCOPE_SYSTEM=1",
> > or "export PTHREAD_COMPAT_MODE=1" (they both do exactly the same
> > thing).'
> >
> > - 'Another good thing to do whenever mutexes are involved (although
> > it won't help in this ultra simple test) is "export
> > PTHREAD_DISABLE_HANDOFF=1".'
> >
> >
> >
> > Please get back to us if with either workaround, you're not back to 
> > Solaris performance.
> >
> >
> >
> > Regards
> > Cyrille
> >
> >
> > -----Original Message-----
> > From: owner-cxx-dev@cxx.cup.hp.com
> > [mailto:owner-cxx-dev@cxx.cup.hp.com]
> > On Behalf Of Maucci, Cyrille
> > Sent: Saturday, July 08, 2006 7:23 AM
> > To: Ono, Shigeru (Presales Japan); Johan Piculell (KA/EAB)
> > Cc: Fredrik Lannergren (KA/EAB); cxx-dev@cxx.cup.hp.com
> > Subject: RE: CXX-DEV: Performance problems HP-UX 11.23 PA-RISC, 
> > pthread_mutex_unlock
> >
> > Hello,
> >
> > As far as I understand it from the WP I pointed to,
> > PTHREAD_COMPAT_MODE=1 and PTHREAD_FORCE_SCOPE_SYSTEM should lead to 
> > the same behavior (1x1) hence the same perf.
> >
> > ++Cyrille
> >
> > PTHREAD_COMPAT_MODE This variable is used to enable the 1x1 
> > compatibility mode.
> > Valid values: ON, on, 1
> >
> > PTHREAD_FORCE_SCOPE_SYSTEM This variable is used to specify When this
> > variable is set, the application gets the 1x1 behavior.
> > Valid values: ON, on, 1
> >
> > -----Original Message-----
> > From: Ono, Shigeru (Presales Japan)
> > Sent: Saturday, July 08, 2006 5:17 AM
> > To: Maucci, Cyrille; Johan Piculell (KA/EAB)
> > Cc: Fredrik Lannergren (KA/EAB); cxx-dev@cxx.cup.hp.com
> > Subject: RE: CXX-DEV: Performance problems HP-UX 11.23 PA-RISC, 
> > pthread_mutex_unlock
> >
> > Hello,
> >
> > Setting PTHREAD_FORCE_SCOPE_SYSTEM=1 env could help.
> >
> > Here is the result on rp7420/hpux11.23
> >
> > $ aCC -mt c.c
> > $ a.out
> > Total time 4
> > $ export PTHREAD_FORCE_SCOPE_SYSTEM=1
> > $ a.out
> > Total time 1
> > $
> >
> > Also, there are some other env for mutex performance.
> >
> >   PTHREAD_DISABLE_HANDOFF
> >   PERF_ENABLE
> >
> >
> > ono
> >
> > -----Original Message-----
> > From: owner-cxx-dev@cxx.cup.hp.com
> > [mailto:owner-cxx-dev@cxx.cup.hp.com]
> > On Behalf Of Maucci, Cyrille
> > Sent: Saturday, July 08, 2006 5:23 AM
> > To: Johan Piculell (KA/EAB)
> > Cc: Fredrik Lannergren (KA/EAB); cxx-dev@cxx.cup.hp.com
> > Subject: RE: CXX-DEV: Performance problems HP-UX 11.23 PA-RISC, 
> > pthread_mutex_unlock
> >
> > Hello Johan, All,
> >
> >
> >
> > First, I assume that all the tests are compiled then run on the same
> > machine.
> > So, when Johan says he has tested on
> > HP-UX 11.23 PA 8700 750MHz
> > HP-UX 11.23 PA 8800 800MHz
> > HP-UX 11.23 IA64 1299MHz
> > HP-UX 11.11 PA 8700 650MHz
> >
> > I guess it means 1 compile on each machine followed by its execution.
> > Compile then run on HP-UX 11.23 PA 8700 750MHz
> > Compile then run on HP-UX 11.23 PA 8800 800MHz
> > Compile then run on HP-UX 11.23 IA64 1299MHz
> > Compile then run on HP-UX 11.11 PA 8700 650MHz
> >
> > Also, I won't talk about running 'emulated' (PA) bits on an IA
> > platform; to me it is off topic here since we obviously can't
> > compare native perf with 'emulated' perf.
> >
> > I think Johan's problem is elsewhere. (MxN vs 1x1 threading model)
> >
> >
> >
> > -----Original Message-----
> > From: Stan Sieler [mailto:sieler@allegro.com]
> > >> Interestingly, no one seems to have caught the following from the
> > >> original numbers posted (looking only at PA numbers) ...
> > >> and asked the obvious question:
> > >>
> > >> Why is 11.11 so much better than 11.23?
> >
> > Yes I did (to myself), that's why I proposed the PTHREAD_COMPAT_MODE
> > investigation.
> > Indeed
> > - 11.11 only knows the 1x1 threading model
> > - 11.23 knows both 1x1 and MxN and, depending on the 11.23 version,
> > MxN or 1x1 comes by default.
> >
> > When I saw Quantify's report putting an MxN related function at the
> > top, I thought of the MxN threading model being the default in his
> > test.
> > (In the early 11iv2 it was made the default and then brought back as
> > non default in 11iv2 Sep04.)
> >
> > (http://devresource.hp.com/drc/resources/pthread_wp_jul2004.pdf)
> > (http://docs.hp.com/en/B2355-60105/pthread_scope_options.5.html)
> >
> > Obviously, he had to give a try to 1x1 threading model on 11.23, 
> > either by 'export PTHREAD_COMPAT_MODE=1' at runtime or 
> > '-DPTHREAD_COMPAT_MODE'
> > at compile time.
> >
> > Sure you're gonna tell me this program does not manipulate pthreads 
> > but only pthread_mutexes, but for some reason it has some impact.
> > Somebody from the thread team could surely enlighten us...
> >
> >
> >
> > -----Original Message-----
> > From: Wilbur, Stacey V
> > >> I also tried this little test and the -DPTHREAD_COMPAT_MODE did
> > >> not make a difference on an old A500 PA-RISC with 11.23. I was
> > >> not sure if ordering made a difference so I tried it both ways.
> > >> Maybe the Define is incorrect?
> > >> But the export seemed to work fine, well about 3 to 4 times faster.
> >
> > That's strange... According to the wp pointed to above, it should 
> > work.
> > What version of 11.23 do you have?
> >
> >
> >
> > -----Original Message-----
> > From: John Morris [mailto:john@coyotebush.net]
> > >> I don't know the patch numbers, but there is a recent pthread
> > >> performance patch which significantly enhances mutex performance.
> > >> It is well worth trying.
> >
> > You may be referring to PHKL_33820, but AFAIK, it only fixes shared
> > mutex perf on 11.23 for 64-bit apps (which is not the case here in
> > the given aCC lines).
> > Or maybe also to PHCO_34718 (which supersedes PHCO_33675), but I am
> > not sure it'll bring anything in Johan's test case.
> >
> >
> >
> > To sum up, going back to the 1x1 threading model (either at compile
> > or run time) should solve Johan's issue.
> > Johan, can you confirm?
> >
> >
> >
> > My few cents
> > ++Cyrille
> >
> >
> > -----Original Message-----
> > From: owner-cxx-dev@cxx.cup.hp.com
> > [mailto:owner-cxx-dev@cxx.cup.hp.com]
> > On Behalf Of Johan Piculell (KA/EAB)
> > Sent: Friday, July 07, 2006 4:32 PM
> > To: cxx-dev@cxx.cup.hp.com
> > Cc: Fredrik Lannergren (KA/EAB)
> > Subject: CXX-DEV: Performance problems HP-UX 11.23 PA-RISC, 
> > pthread_mutex_unlock
> >
> > Hi all.
> > I'll go directly to my problem; I think the figures say it all:
> >
> > Take the following test program:
> >
> > #include <iostream.h>
> > #include <time.h>
> > #include <stdlib.h>
> > #include <pthread.h>
> >
> > static pthread_mutex_t aMutex = PTHREAD_MUTEX_INITIALIZER;
> >
> > int main()
> > {
> >     int i;
> >     int startTime = time(0);
> >
> >     for (i=0;i<10000000;i++ )
> >     {
> >         pthread_mutex_lock(&aMutex);
> >         pthread_mutex_unlock(&aMutex);
> >     }
> >
> >     int endTime = time(0);
> >     cerr << "Total time " << endTime - startTime << endl;
> >
> >     return 0;
> > }
> >
> > Compile this just like this "aCC -mt mutex.cc"
> >
> > Here are some execution benchmarks:
> > HP-UX 11.23 PA 8700 750MHz, PHCO_33675 installed  -  9 seconds
> > HP-UX 11.23 PA 8800 800MHz, no PHCO_33675         -  6 seconds
> > HP-UX 11.23 IA64 1299MHz, no PHCO_33675           -  3 seconds
> > HP-UX 11.11 PA 8700 650MHz                        -  4 seconds
> > Solaris 10, IIIi 1593MHz                          -  1 second
> >
> > The conclusion I draw is that there are some serious problems in
> > 11.23, mainly on PA-RISC. The pthread performance patch makes no
> > difference to this fact.
> > We have also run Rational Quantify on this and in the slow cases
> > above (11.23, PA-RISC) we can see this call distribution:
> >
> > 90.18% _lw_mxn_setsigmask
> > 4.17%  pthread_mutex_unlock
> > 2.02%  pthread_mutex_lock
> >
> > _lw_mxn_setsigmask is called 100% from pthread_mutex_unlock.
> > So this means that the program spends some 95% of the total time in 
> > unlocking the mutex.
> > Now this program might seem stupid, but we have the same problem in
> > allocating memory for example, and reference counting for RWC
> > strings.
> > So in a real case we see some 100-200% slower processing on HP
> > compared to Solaris on similar hardware and it all (well, most of
> > it) seems to pin down to this _lw_mxn_setsigmask call.
> >
> > We will open a ticket for this with HP, but maybe someone has some 
> > quick comments on this? Heard of any patches?
> >
> > thanks
> > /Johan Piculell
> > Ericsson AB
> >  _________________________________________________________________
> >  To leave this mailing list, send mail to majordomo@cxx.cup.hp.com
> >     with the message UNSUBSCRIBE cxx-dev 
> > _________________________________________________________________
> >
> 