
List:       gcc-fortran
Subject:    Re: [gomp] timings (was: [gomp] omp performance question)
From:       Tim Prince <timothyprince () sbcglobal ! net>
Date:       2006-10-31 14:18:26
Message-ID: 45475B32.4080105 () sbcglobal ! net

Daniel Franke wrote:
> This is to summarize the OMP timings I posted yesterday.
> 
> A relevant code fragment and explanations may be found here:
> http://gcc.gnu.org/ml/fortran/2006-10/msg00753.html
> 
> $> uname -mo
> x86_64 GNU/Linux
> 
> Four cores, dual CPU, dual core each.
> 
> 
> Single threaded, optimized build (FCFLAGS="-O1"):
> real    64m36.502s
> user    64m36.886s
> sys      0m00.040s
> 
> OpenMP-enabled builds (FCFLAGS="-O1 -fopenmp"):
> 
> -------------------+------+-------------------+------------------+
>                    |      | OMP_DYNAMIC=FALSE | OMP_DYNAMIC=TRUE |
> -------------------+------+-------------------+------------------+
>                    | real |    57m36.233s     |    165m54.954s   |
> OMP_NUM_THREADS=4  | user |    91m59.685s     |     98m13.528s   |
>                    | sys  |    26m31.735s     |    109m26.858s   |
> -------------------+------+-------------------+------------------+
>                    | real |    85m54.649s     |    168m46.983s   |
> OMP_NUM_THREADS=8  | user |   125m20.442s     |     97m53.903s   |
>                    | sys  |    48m15.253s     |    113m00.108s   |
> -------------------+------+-------------------+------------------+
> 
> Processes that ran with OMP_DYNAMIC=TRUE employed three threads
> (most of the time). With the default values of
> OMP_NUM_THREADS/OMP_DYNAMIC, i.e. not specifying them explicitly,
> the resulting timings are:
> 
> real     67m16.611s
> user    112m22.885s
> sys      25m44.641s
> 
> All these numbers are single runs; no variances were measured or
> are available.
> 
> Hints and suggestions are still highly welcome =)
> 
>     Daniel
> 
> 
If you run 2 threads, does it make a difference whether you pin both 
to the same socket or spread them out one per socket (e.g. using 
taskset)?  Why do you care about performance with 8 threads?  If you 
do care, what are you doing to pair them up efficiently?
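For what it's worth, such an affinity experiment could look roughly 
like this (a sketch only: ./bench stands in for your benchmark binary, 
and the logical-CPU numbering is an assumption -- check /proc/cpuinfo 
to see which CPU ids share a physical socket on your board):

```shell
# Placeholder binary ./bench; CPU numbering is board/BIOS dependent.
export OMP_NUM_THREADS=2
export OMP_DYNAMIC=FALSE

# Both threads on the two cores of one socket (shared bus, and a
# shared cache on some dual-core parts):
time taskset -c 0,1 ./bench

# One thread on each socket (each thread gets a cache to itself):
time taskset -c 0,2 ./bench
```

Comparing the two runs should show directly whether cache/socket 
placement matters for this workload.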
Wouldn't it perform better if you organized the data so the inner 
loops could run at stride 1?  If the current version runs better with 
threads paired properly on shared caches, that would tend to confirm 
you have a cache-sharing problem.
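To illustrate the stride-1 point with a generic sketch (not Daniel's 
actual code -- the array a and bounds here are made up): Fortran 
stores arrays column-major, so the first subscript should vary in the 
innermost loop:

```fortran
program stride_demo
  implicit none
  integer, parameter :: n = 1000
  real :: a(n,n), s
  integer :: i, j
  call random_number(a)
  s = 0.0
  ! Column-major storage: the first index varies fastest in memory.
  ! Keeping i (the first index) innermost walks memory at stride 1.
  do j = 1, n
     do i = 1, n
        s = s + a(i,j)
     end do
  end do
  ! The reversed nesting (j innermost, still indexing a(i,j)) touches
  ! memory at stride n, which thrashes the cache -- and with several
  ! OpenMP threads sharing a cache, the effect is amplified.
  print *, s
end program stride_demo
```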
