'Surprise performance from Apple M1'

List:       gmp-discuss
Subject:    Surprise performance from Apple M1
From:       Torbjörn_Granlund <tg () gmplib ! org>
Date:       2020-11-21 18:50:27
Message-ID: 86y2iuy26k.fsf () shell ! gmplib ! org

The GMP project got a low-end Apple Mac Mini M1 in order to make sure
GMP works for arm-macos systems.

We had a major surprise from the GMP performance of these CPUs!

No other CPU runs GMP this well.  Almost every inner loop runs at < 1
cycle/limb.  That includes mpn_mul_1, but not the most important loop,
mpn_addmul_1.  And that is before any attempt at optimising things for
the M1.

The 3.2 GHz M1 in our system takes the #2 spot in the GMPbench top-list.
The #1 spot is an AMD Ryzen, but that runs at 4.4 GHz.

Getting mpn_addmul_1 to run closer to 1 cycle/limb would mean a lot for
GMP's performance.  There is an architectural shortcoming which might
make it tricky, though: there is just one carry/borrow flag, unlike
x86's two (as used by adcx/adox), and there is also no instruction for
highword(a*b+c).  As a result, addmul_1, which needs a 3-way add for its
product accumulation needs to add some words, save carry, restore carry,
add to the same words again, save carry, restore carry, etc.  That's
quite expensive.
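
To make the 3-way add concrete, here is a minimal portable C sketch of
what addmul_1 computes.  It is not GMP's code: limb_t and ref_addmul_1
are hypothetical names, limbs are assumed to be 64 bits, and unsigned
__int128 is a GCC/Clang extension.  Each limb adds the low product word,
the incoming carry limb and the old rp[i], so there are two separate
carry-outs per limb that must both end up in the next carry limb.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for GMP's mp_limb_t */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb.
   Illustrative only; real implementations are hand-written assembly. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Full 64x64 -> 128-bit product (compiler extension). */
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);

        /* First add: bring in the carry limb from the previous step. */
        lo += carry;
        hi += (lo < carry);     /* carry-out #1; hi cannot overflow here */

        /* Second add: accumulate into the destination limb. */
        limb_t r = rp[i] + lo;
        hi += (r < lo);         /* carry-out #2; still cannot overflow */

        rp[i] = r;
        carry = hi;
    }
    return carry;
}
```

With two independent carry flags, or with a highword(a*b+c)
instruction, the two carry-outs above need not be serialised through a
single flag, which is the saving described here.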

x86 used to have that same problem.  They added adox/adcx which greatly
helped GMP.  IBM's Power used to have the same problem, and they added
both highword(a*b+c) *and* multiple carry flags.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-discuss mailing list
gmp-discuss@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-discuss
