'Re: [pthreads-users] performance....'

List:       pthreads-users
Subject:    Re: [pthreads-users] performance....
From:       "Richard Seaman, Jr." <dick () seaman ! org>
Date:       2002-04-16 17:54:05

On Mon, Apr 15, 2002 at 08:42:06PM -0500, Corey Minyard wrote:
> Richard Seaman, Jr. wrote:

[snip]

> > Well, I was very wrong about the context switching.  NGPT is very slow.
> > I'm attaching some very simple and very crude benchmark code.
> > 
> > Basically, I started 10 threads each doing 1,000,000 yields simultaneously,
> > as a very crude measure of context switch overhead. The results (average of
> > 3 runs, details attached):
> > 
> > NGPT          44.1 secs
> > Linuxthreads   7.3 secs
> > 
> > Or, NGPT is slower by a factor of 6! By all logic, NGPT should be faster.
> > 
> I'm not so sure about that. NGPT should be able to do much better than 6 
> times, of course. But Linux is very optimized on task switching already 
> and switching between LWP doesn't require MMU reloads, TLB misses, or 
> any other costly operations. NGPT currently has to make two kernel trips 
> for every task switch for signal mask updates because it's using 
> setjmp/longjmp to do the task switch (which itself is a problem because 
> it doesn't save floating point or MMX context). It could reduce this to 
> one, perhaps, by implementing a context switch routine for the 
> architecture. But if you have to make a kernel call to do a task switch, 
> it's probably a wash on performance. If you could avoid the kernel call, 
> it would be better, but I'm not sure that's possible to do and still 
> meet POSIX.
> 
> If you put the signal mask in user memory and had the kernel fetch it 
> from a specific location when it needed it, that would be a big win, 
> much like the ability to have a thread set itself unpreemptable by 
> setting a userspace variable that the scheduler looks at (NGPT doesn't 
> do this, I don't think, but I've read about it in the threading 
> research). But the update would probably need to be atomic somehow.

I ran an strace on my simple benchmark.  It looks like an NGPT context
switch in this case involves 7 syscalls: 3 to gettimeofday and 4 to
rt_sigprocmask. This would certainly explain the slow context switch times.
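
For readers without the attachment, a benchmark of this kind can be
sketched roughly as below. This is not the attached code. It assumes the
threads yield via sched_yield(), and the names yield_worker and
run_yield_benchmark are hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <sys/time.h>

#define MAX_BENCH_THREADS 64   /* arbitrary cap for this sketch */

/* Each worker yields the CPU `*iterations` times and then exits. */
static void *yield_worker(void *p)
{
    long iterations = *(const long *)p;
    for (long i = 0; i < iterations; i++)
        sched_yield();         /* ask the scheduler to run another thread */
    return NULL;
}

/* Start nthreads workers and return the elapsed wall-clock seconds,
 * or -1.0 on a bad argument or thread-creation failure. */
double run_yield_benchmark(int nthreads, long iterations)
{
    pthread_t tid[MAX_BENCH_THREADS];
    struct timeval t0, t1;
    int started = 0, failed = 0;

    if (nthreads < 1 || nthreads > MAX_BENCH_THREADS)
        return -1.0;

    gettimeofday(&t0, NULL);
    for (; started < nthreads; started++) {
        if (pthread_create(&tid[started], NULL, yield_worker,
                           &iterations) != 0) {
            failed = 1;
            break;
        }
    }
    /* Join whatever was started, even if creation failed part-way. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&t1, NULL);

    if (failed)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
}
```

Calling run_yield_benchmark(10, 1000000) would correspond to the 10
threads and 1,000,000 yields described in the quoted text; running the
binary under strace -f shows the syscalls made per switch.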

For example, I think this is a context switch sequence:

rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62], NULL, 8) = 0
gettimeofday({1018979153, 711055}, NULL) = 0
gettimeofday({1018979153, 711085}, NULL) = 0
gettimeofday({1018979153, 711116}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

A few years ago the FreeBSD user thread context switches were slow too (but
not this bad).  However, they found they could eliminate calls to gettimeofday()
from the scheduler except when the scheduler is called by a preempt timer signal.
This eliminates all calls to gettimeofday during a context switch except during
time-slice forced preemption.  This is a BIG saving.
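
The idea can be illustrated with a toy round-robin scheduler that reads
the clock only when it is entered from the preemption timer. This is a
minimal sketch of the technique as described above, not FreeBSD's code.
The struct, enum and function names are all hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>

/* Why the scheduler was entered. */
enum switch_reason { SWITCH_YIELD, SWITCH_PREEMPT_TIMER };

/* Hypothetical scheduler state, for illustration only. */
struct sched_state {
    struct timeval slice_start; /* start of the current time slice */
    long clock_reads;           /* number of gettimeofday() calls made */
    int current;                /* index of the running thread */
    int nthreads;               /* number of runnable threads */
};

/* Pick the next thread round-robin. The clock is read only when the
 * switch was forced by the preemption timer; a voluntary switch (yield,
 * blocking on a mutex, ...) does not touch the clock at all.
 * Returns the new thread index, or -1 if there are no threads. */
int sched_pick_next(struct sched_state *s, enum switch_reason why)
{
    if (s->nthreads <= 0)
        return -1;
    if (why == SWITCH_PREEMPT_TIMER) {
        gettimeofday(&s->slice_start, NULL);   /* the only clock read */
        s->clock_reads++;
    }
    s->current = (s->current + 1) % s->nthreads;
    return s->current;
}
```

With this structure a yield-heavy workload makes no gettimeofday calls
at all, while timer-driven preemption still gets a fresh timestamp.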

Also, at the expense of much more complex pthread code, FreeBSD user threads
intercept (via wrappers) most of the signal syscalls.  They insert a pthread
library signal handler to handle all signals and dispatch them to the
appropriate threads, and they manage to avoid any signal related syscalls
within the scheduler. I think their signal handling is POSIX compliant.

As a result, they are able to do many (most?) user thread context switches
without any syscalls.  It seems to me that this is one area where user
threads should be able to excel -- very fast context switches.

> > I also did some read/write comparisons.  Basically I did 10 threads each
> > doing 100,000 reads and writes, yielding after each read and write.  The results
> > (average of 3 runs, details attached):
> > 
> > NGPT          59.3 secs
> > Linuxthreads  40.3 secs
> > 
> > Or, NGPT is close to 50% slower.  This might be expected based on the overhead
> > in the read/write wrappers in the NGPT code.
> > 
> > 
> Linuxthreads also has read/write wrappers (for cancellation). I guess 
> NGPT has to do more work, though.

I confirmed via strace that each read in NGPT seems to translate into
3 syscalls in the wrapper code, fcntl() plus select() plus the read():

fcntl64(0xa, 0x3, 0, 0xa)               = 2
select(11, [10], NULL, NULL, {0, 0})    = 1 (in [10], left {0, 0})
read(10, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 10000) = 10000
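
A wrapper that would produce this three-call pattern might look roughly
like the sketch below. This is a guess at the shape, not NGPT's source.
ut_read is a hypothetical name, and returning EAGAIN stands in for the
point where a user-thread library would run another thread instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Read without ever blocking the whole process.
 * Returns the byte count, or -1 with errno set (EAGAIN if no data). */
ssize_t ut_read(int fd, void *buf, size_t n)
{
    fd_set rfds;
    struct timeval zero = { 0, 0 };          /* poll, do not wait */
    int ready;

    int flags = fcntl(fd, F_GETFL);          /* syscall 1: F_GETFL */
    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)                  /* assumption: already safe */
        return read(fd, buf, n);
    if (fd >= FD_SETSIZE) {                  /* select() cannot watch it */
        errno = EINVAL;
        return -1;
    }

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    ready = select(fd + 1, &rfds, NULL, NULL, &zero);   /* syscall 2 */
    if (ready < 0)
        return -1;
    if (ready == 0) {                        /* nothing to read yet */
        errno = EAGAIN;
        return -1;
    }
    return read(fd, buf, n);                 /* syscall 3 */
}
```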

However, writes seem to generate 5 syscalls, 4 to fcntl() plus the write():

fcntl64(0x5, 0x3, 0, 0x5)               = 2
fcntl64(0x5, 0x4, 0x802, 0x5)           = 0
write(5, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 10000) = 10000
fcntl64(0x5, 0x3, 0, 0x5)               = 2050
fcntl64(0x5, 0x4, 0x2, 0x5)             = 0
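
In the trace, fcntl command 3 is F_GETFL and 4 is F_SETFL, and 0x802 is
O_RDWR with O_NONBLOCK (0x800 on Linux) added. So the wrapper appears to
switch the descriptor to non-blocking around the write and then switch
it back. A minimal sketch that reproduces the same five-call sequence,
assuming that is the intent (ut_write is a hypothetical name):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write with O_NONBLOCK temporarily set, then restore the original
 * blocking mode. Returns write()'s result and preserves its errno. */
ssize_t ut_write(int fd, const void *buf, size_t n)
{
    ssize_t rc;
    int saved_errno, now;

    int flags = fcntl(fd, F_GETFL);                    /* 1: F_GETFL */
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)  /* 2: F_SETFL */
        return -1;

    rc = write(fd, buf, n);                            /* 3: write */
    saved_errno = errno;

    now = fcntl(fd, F_GETFL);                          /* 4: F_GETFL */
    if (now != -1) {
        /* Keep any other flag changes, restore only O_NONBLOCK. */
        int restored = (now & ~O_NONBLOCK) | (flags & O_NONBLOCK);
        fcntl(fd, F_SETFL, restored);                  /* 5: F_SETFL */
    }

    errno = saved_errno;
    return rc;
}
```

Caching the descriptor's mode in the library, rather than asking the
kernel before and after every call, would cut this back to one syscall
per write.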

-- 
Richard Seaman, Jr.        email:    dick@seaman.org
5182 N. Maple Lane         phone:    262-367-5450
Nashotah WI 53058            fax:    262-367-5852
_______________________________________________
pthreads-users mailing list
pthreads-users@www-124.ibm.com
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/pthreads-users