List:       linux-smp
Subject:    Re: Athlon/Intel floating point
From:       Neil Conway <nconway.list@ukaea.org.uk>
Date:       1999-10-27 14:54:25

A comment and a weird sort-of-problem:

On the stream_d benchmark, let's not forget that it's only measuring the
out-of-cache FP speed.  Most of the codes that I run (and that I help
people to run here) are a lot closer to the in-cache speed.  If someone
would care to run a mostly in-cache FP code on an Athlon, that would be
an interesting comparison number too.  I have a tiny code here (flops.c
from Al Aburto; I've had it for years, though it's only a scalar
benchmark) which I can post if no one has a better one.
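To give a flavour of what I mean by "in-cache", a loop like the sketch
below (just something knocked up for illustration, not flops.c itself)
keeps three small arrays resident in cache and hammers them, so the
memory system barely matters:

/* incache.c -- sketch of an in-cache FP test (not flops.c itself).
 * Three 32 KB arrays (96 KB total) sit comfortably in a PIII's
 * 512 KB L2, so this measures FP speed, not memory bandwidth.
 * Build with e.g. "gcc -O2 incache.c -o incache".               */
#include <stdio.h>
#include <time.h>

#define N      4096          /* 4096 doubles = 32 KB per array */
#define SWEEPS 100000

static double a[N], b[N], c[N];

int main(void)
{
    double s = 3.0, secs;
    clock_t t0, t1;
    long i, k;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    t0 = clock();
    for (k = 0; k < SWEEPS; k++)
        for (i = 0; i < N; i++)
            c[i] = a[i] + s * b[i];    /* 2 flops per element */
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("%.1f MFLOPS (check %g)\n",
           2.0 * N * (double)SWEEPS / secs / 1e6, c[0]);
    return 0;
}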

The weird "problem": using stream_d on a pair of "identical" Dell dual
PIII-550 Xeons with RH6.0 and 1 GB of RAM, I'm seeing differences in
the speeds - one machine is consistently a little faster than the
other.  I first noticed this when they both had about 800-900 MB of
cache and buffers in use, and I thought it might be a funny issue with
memory non-contiguity, so I flushed out the RAM to leave >900 MB
free.  This made *both* of them speed up substantially (more than 10%!),
but I'm still left with a very stable ~3% difference.  I've checked that
the bogomips match, that the /proc/cpuinfo and dmesg output matches, and
that the bootup RAID-checksum speeds match, and I am now stuck.  The
final option is to reboot them both, but they are being intermittently
used for production codes, so I need to wait a bit for that.  They are
both running the original RH6.0 SMP kernel ("2.2.5-15") and have been up
for about 3 weeks (same reboot time).  Any ideas, folks?
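(In case anyone wants to reproduce the flush: one crude way is simply
to allocate most of RAM and touch every page, something like the sketch
below - the 900 MB figure is a guess tuned to these 1 GB boxes.)

/* flush.c -- crude buffer-cache flusher: grab ~900 MB and touch
 * one byte per 4 KB page so the kernel evicts cached pages.
 * The size is a guess for a 1 GB box; adjust MB to taste.      */
#include <stdio.h>
#include <stdlib.h>

#define MB 900

int main(void)
{
    size_t bytes = (size_t)MB * 1024 * 1024;
    char *p = malloc(bytes);
    size_t i;

    if (p == NULL) {
        fprintf(stderr, "malloc of %d MB failed\n", MB);
        return 1;
    }
    for (i = 0; i < bytes; i += 4096)   /* one write per page */
        p[i] = 1;
    free(p);
    return 0;
}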

Output is:
"slow" machine:
Copy:         332.1854       1.4454       1.4450       1.4464
Scale:        330.4472       1.4528       1.4526       1.4535
Add:          394.0810       1.8275       1.8270       1.8282
Triad:        363.3719       1.9824       1.9814       1.9841
69.030u 3.830s 1:13.95 98.5%    0+0k 0+0io 114pf+0w

"fast" machine:
Copy:         342.6267       1.4019       1.4009       1.4055
Scale:        341.3278       1.4067       1.4063       1.4071
Add:          404.8233       1.7793       1.7786       1.7801
Triad:        370.7472       1.9438       1.9420       1.9490
67.130u 3.800s 1:12.00 98.5%    0+0k 0+0io 115pf+0w
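(For anyone who hasn't seen stream_d: the four kernels are the standard
STREAM loops, sketched below minus the timing scaffolding, and the MB/s
column is just bytes moved divided by the best time - e.g. Copy on the
"slow" box moves 2 x 30000000 x 8 bytes = 480 MB in 1.4450 s, i.e.
332.2 MB/s.)

/* stream_kernels.c -- the four STREAM loops without the timing
 * scaffolding.  Copy and Scale move 2*N*8 bytes per sweep, Add
 * and Triad move 3*N*8 bytes; Rate = bytes moved / best time.  */
#include <stdio.h>

#define N 30000000                   /* matches the runs above  */
static double a[N], b[N], c[N];      /* 3 x 240 MB = 720 MB     */

int main(void)
{
    double s = 3.0;
    long i;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    for (i = 0; i < N; i++) c[i] = a[i];             /* Copy  */
    for (i = 0; i < N; i++) b[i] = s * c[i];         /* Scale */
    for (i = 0; i < N; i++) c[i] = a[i] + b[i];      /* Add   */
    for (i = 0; i < N; i++) a[i] = b[i] + s * c[i];  /* Triad */

    printf("check: %g\n", a[0]);     /* keep the loops honest  */
    return 0;
}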

(Full output from stream_d is in the attached files.)

Anyone got any ideas about this one?

cheers
Neil
["stream_d_huge_cycle.linb" (text/plain)]

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 30000000, Offset = 0
Total memory required = 686.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Cycles/second = 547184347.141847
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1123813 microseconds.
   (= 1123813 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         332.1854       1.4454       1.4450       1.4464
Scale:        330.4472       1.4528       1.4526       1.4535
Add:          394.0810       1.8275       1.8270       1.8282
Triad:        363.3719       1.9824       1.9814       1.9841
69.030u 3.830s 1:13.95 98.5%    0+0k 0+0io 114pf+0w

["stream_d_huge_cycle.linc" (text/plain)]

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 30000000, Offset = 0
Total memory required = 686.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Cycles/second = 547184480.165958
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1074164 microseconds.
   (= 1074164 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         342.6267       1.4019       1.4009       1.4055
Scale:        341.3278       1.4067       1.4063       1.4071
Add:          404.8233       1.7793       1.7786       1.7801
Triad:        370.7472       1.9438       1.9420       1.9490
67.130u 3.800s 1:12.00 98.5%    0+0k 0+0io 115pf+0w

-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/dmentre/smp-howto/
To Unsubscribe: send "unsubscribe linux-smp" to majordomo@vger.rutgers.edu

