[prev in list] [next in list] [prev in thread] [next in thread] 

List:       gmp-devel
Subject:    mpn_sqrtrem1
From:       adrien.prost-boucle () laposte ! net (Adrien Prost-Boucle)
Date:       2016-12-23 16:22:29
Message-ID: 1482510149.966.38.camel () laposte ! net
[Download RAW message or body]

Hi,

I have modified my code, following advice from Marco Bodrato (thx!).
I now have the same API for my implementation and for the code I've taken from GMP.
See the archive on my server (somehow my email is rejected if with attachment):
http://94.23.21.190/publicshare/sqrt.tar.gz

Updated speedup:
? laptop Core2 Duo .. 0% for 32-bit, 3% for 64-bit, 4% for 64x2
? machine i7-4790 ... 3% for 32-bit, 4% for 64-bit, still 20% for 64x2
? Note: my machines are under some load right now, so you may find slightly different \
values

So the speedup of my implementation is lower that my initial code.
But there is still a bit of speedup, particularly for 64x2 on my higher-end CPU.
(well, that'll hold until only you find another possible optim to my code xD).
However, that won't bring the magic speedup Paul Zimmermann initially wanted in this \
message: https://gmplib.org/list-archives/gmp-devel/2016-July/004308.html


It seems most of the overhead came from that test:

  // Separately handle when the number is already normalized
  if(val & (((uint64_t)3) << (64 - 2))) {
    return mpn_sqrtrem1_norm64(&rl, val);
  }

That test was present in GMP in function mpn_sqrtrem(), file sqrtrem.c line 447
So I kept it for coherence, to compare?against what's actually in GMP.
That test costs a lot because it prevents efficient branch prediction.
But that test was only in function mpn_sqrtrem() and not in "core" functions that \
call mpn_sqrtrem1() and mpn_sqrtrem2(). So my updated "mpn" code should better look \
like what's currently in GMP.


Something else I'm not sure of:
In my 64x2 implementation, function sqrt64x2_in(), there are many uses of macros 64x2 \
for add, sub, mul. Whereas the GMP algorithm consists of one call to sqrt64 \
(one-word), then some operations to get the sqrt64x2 (one div + 2 mul + misc arith \
and logic). So for me it's not obvious that my version would bring speedup on ALL \
platforms, particularly when having hardware division. Like what I measured on my \
laptop. Any idea?

But when not having hardware division, then my 64x2 implementation should be better \
for sure. However don't forget that nothing is fully checked nor proven in my \
implementation! It may still be wrong in some corner cases!

Adrien


On Tue, 2016-12-20 at 22:36 +0100, Marco Bodrato wrote:
> Ciao,
> 
> Il Lun, 19 Dicembre 2016 6:21 pm, Adrien Prost-Boucle ha scritto:
> > The program enables to launch benchmark with these commands:
> > ./sqrt bench32 10000000
> > ./sqrt bench64 10000000
> > ./sqrt bench64x2 10000000
> 
> It is quite difficult to interpret the numbers, times spent by direct
> calls to some functions is compared to the time spent by other functions
> called trough a wrapper...
> 
> You should decide a single interface and adapt to it all the to be
> compared algorithms.
> 
> Of course, from the point of view of GMP, the best would be to compare
> functions that can be used as a drop-in replacement of the current
> mpn_sqrtrem1 (and mpn_sqrtrem2) functions in GMP mpn/generic/sqrtrem.c
> code.
> Obviously you can choose a different interface, but you should avoid
> wrappers.
> 
> Regards,
> m
> 


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic