List: linux-smp
Subject: Re: Athlon/Intel floating point
From: Neil Conway <nconway.list () ukaea ! org ! uk>
Date: 1999-10-27 14:54:25
A comment and a weird sort-of-problem:
On the stream_d benchmark, let's not forget that it's only measuring the
out-of-cache FP speed. Most of the codes that I run (and that I help
people to run here) are a lot closer to the in-cache speed. If someone
would care to run an Athlon FP code that's mostly in-cache that would be
an interesting comparison number too. I have a tiny code here (flops.c
from Al Aburto -- I've had it for years, though it's only a scalar
benchmark) which I can post if no one has a better one.
The weird "problem": using stream_d on a pair of "identical" Dell dual
PIII-550 Xeons with RH6.0 and 1 gig of RAM, I'm seeing differences in
the speeds - one machine is consistently a little faster than the
other. I first noticed this when they both had about 800-900 MB of
cache & buffers in use, and I thought it might be a quirk of memory
non-contiguity, so I flushed out the RAM to leave >900 MB free. This
made *both* of them speed up substantially (by more than 10%!), but I'm
still left with a very stable ~3% difference. I've checked that the
bogomips match, that the /proc/cpuinfo and dmesg output matches, and
that the bootup RAID-checksum speeds match, and I am now stuck. The
final option is to reboot them both, but they are intermittently used
for production codes, so I need to wait a bit for that. They are both
running the original RH6.0 SMP kernel ("2.2.5-15") and have been up
for about 3 weeks (same reboot time). Any ideas, folks?
Output is:
"slow" machine:
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:            332.1854     1.4454       1.4450       1.4464
Scale:           330.4472     1.4528       1.4526       1.4535
Add:             394.0810     1.8275       1.8270       1.8282
Triad:           363.3719     1.9824       1.9814       1.9841
69.030u 3.830s 1:13.95 98.5% 0+0k 0+0io 114pf+0w
"fast" machine:
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:            342.6267     1.4019       1.4009       1.4055
Scale:           341.3278     1.4067       1.4063       1.4071
Add:             404.8233     1.7793       1.7786       1.7801
Triad:           370.7472     1.9438       1.9420       1.9490
67.130u 3.800s 1:12.00 98.5% 0+0k 0+0io 115pf+0w
(Full output from stream_d in the attached files)
Anyone got any ideas about this one?
cheers
Neil
["stream_d_huge_cycle.linb" (text/plain)]
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 30000000, Offset = 0
Total memory required = 686.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Cycles/second = 547184347.141847
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1123813 microseconds.
(= 1123813 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:            332.1854     1.4454       1.4450       1.4464
Scale:           330.4472     1.4528       1.4526       1.4535
Add:             394.0810     1.8275       1.8270       1.8282
Triad:           363.3719     1.9824       1.9814       1.9841
69.030u 3.830s 1:13.95 98.5% 0+0k 0+0io 114pf+0w
["stream_d_huge_cycle.linc" (text/plain)]
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 30000000, Offset = 0
Total memory required = 686.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Cycles/second = 547184480.165958
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1074164 microseconds.
(= 1074164 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:            342.6267     1.4019       1.4009       1.4055
Scale:           341.3278     1.4067       1.4063       1.4071
Add:             404.8233     1.7793       1.7786       1.7801
Triad:           370.7472     1.9438       1.9420       1.9490
67.130u 3.800s 1:12.00 98.5% 0+0k 0+0io 115pf+0w
-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/dmentre/smp-howto/
To Unsubscribe: send "unsubscribe linux-smp" to majordomo@vger.rutgers.edu