'Re: [john-dev] JtR on ARM (NEON)'


List:       john-dev
Subject:    Re: [john-dev] JtR on ARM (NEON)
From:       Solar Designer <solar () openwall ! com>
Date:       2015-07-31 8:35:17
Message-ID: 20150731083517.GB31035 () openwall ! com

On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
> A schoolmate of mine got an ARM board in his lab and gave me access to it.
> It's some model of Nvidia Tegra, with 4 cores and NEON support, though I
> don't know which specific model it is.

You should check /proc/cpuinfo under Linux.
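
A minimal sketch of that check in C, assuming the usual Linux layout where
/proc/cpuinfo carries a "Features" line (ARM) or "flags" line (x86) of
space-separated tokens. On 32-bit ARM kernels the NEON token is normally
"neon"; 64-bit ARM kernels typically report "asimd" instead. The function
names here are hypothetical and not part of JtR.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole token after the ':' in `line`. */
int line_has_flag(const char *line, const char *flag)
{
    const char *p = strchr(line, ':');
    size_t n = strlen(flag);

    if (p == NULL || n == 0)
        return 0;
    p++;
    while (*p) {
        size_t len;
        p += strspn(p, " \t\r\n");      /* skip separators */
        len = strcspn(p, " \t\r\n");    /* length of the next token */
        if (len == n && strncmp(p, flag, n) == 0)
            return 1;
        p += len;
    }
    return 0;
}

/* Scan an open cpuinfo-style stream for `flag` on a Features/flags line. */
int cpuinfo_has_flag(FILE *f, const char *flag)
{
    char line[4096];

    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "Features", 8) != 0 &&
            strncmp(line, "flags", 5) != 0)
            continue;
        if (line_has_flag(line, flag))
            return 1;
    }
    return 0;
}
```

Typical use would be to fopen("/proc/cpuinfo", "r") and pass the stream to
cpuinfo_has_flag(f, "neon"). The "Hardware" or "model name" lines of the
same file are where the SoC model usually shows up.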

> (OpenMP is disabled in this test. PBKDF2-HMAC-SHA512 failed somehow, so I
> chose sha512crypt here.)

You'll need to investigate why PBKDF2-HMAC-SHA512 fails.  This might
provide a clue as to why sha512crypt became slower.

> Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 64/32 OpenSSL]... DONE

BTW, the 64/32 here is wrong.  Should be 32/32.  Just because an
algorithm uses 64-bit integers logically doesn't mean we should report
it as using 64 out of 32 physical bits, since it can't.  magnum?

> From the figures above, MD4 and MD5 get a 2x speedup; SHA1 and SHA256 have
> no speedup; SHA512 gets a lot slower.

Yes.  That's weird.

I assume you haven't started playing with interleaving factors yet?

> In my current implementation, most pseudo-intrinsics are directly mapped to
> NEON intrinsics. The only exceptions are vcmov and vroti, which have to be
> emulated.

As I told you before, no, vcmov must not be emulated - we have it on
NEON natively.  Please see how it's done in DES_bs_b.c.
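
To illustrate the point, here is a hedged sketch of a bitwise select
("conditional move") in C. The scalar functions show the two common
emulations. The NEON variant uses the vbslq_u32 intrinsic (bitwise select),
whose first argument is the mask. The function names and argument order are
illustrative assumptions, not the actual vcmov definition in JtR or the
macro used in DES_bs_b.c.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Take bits from a where mask is 1 and from b where mask is 0
 * (3 operations: AND, AND-NOT, OR). */
uint32_t sel_andor(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Same result with XOR, AND, XOR. */
uint32_t sel_xor(uint32_t a, uint32_t b, uint32_t mask)
{
    return b ^ ((a ^ b) & mask);
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
/* Native NEON bitwise select: a single instruction, no emulation. */
uint32x4_t sel_neon(uint32x4_t a, uint32x4_t b, uint32x4_t mask)
{
    return vbslq_u32(mask, a, b);
}
#endif
```

The NEON path only compiles when the compiler targets NEON. On other
targets only the scalar reference versions are built.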

As to vroti, yes, although there's a 2-instruction way to emulate it,
see page 4 in:

https://cryptojedi.org/papers/neoncrypto-20120320.pdf

Maybe it'd work faster at high interleaving factors (and slower at low
interleaving factors, since it's higher latency than the straightforward
3-instruction approach).
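
A sketch of the two approaches for a 32-bit left rotate, assuming the
2-instruction form is a shift left followed by NEON's "shift right and
insert" (VSRI). That reading of the paper is an assumption here. The scalar
functions model the semantics for rotate counts 1..31; the NEON macros are
hypothetical names and need a compile-time constant count.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Straightforward form: shift left, shift right, combine (3 operations). */
uint32_t rotl32_3op(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Scalar model of shift-right-and-insert: keep the top s bits of dst,
 * fill the remaining low bits with src >> s. Valid for s in 1..31. */
uint32_t sri32(uint32_t dst, uint32_t src, unsigned s)
{
    uint32_t keep = ~(UINT32_MAX >> s);
    return (dst & keep) | (src >> s);
}

/* 2-operation form: the insert depends on the shift-left result, so the
 * two operations form a serial chain. */
uint32_t rotl32_sri(uint32_t x, unsigned n)
{
    return sri32(x << n, x, 32 - n);
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
/* NEON equivalents; n must be a constant in 1..31. */
#define ROTL32_NEON_3OP(x, n) \
    vorrq_u32(vshlq_n_u32((x), (n)), vshrq_n_u32((x), 32 - (n)))
#define ROTL32_NEON_SRI(x, n) \
    vsriq_n_u32(vshlq_n_u32((x), (n)), (x), 32 - (n))
#endif
```

In the 3-operation form the two shifts are independent of each other, while
in the 2-operation form the second instruction waits on the first. That is
consistent with the latency versus instruction-count trade-off described
above.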

BTW, when you emulate a rotate with two shifts, you may sometimes see
better results when you combine them with a XOR rather than an OR,
because crypto code tends to use XORs nearby, so the compiler will be
able to re-order the XORs if it sees an opportunity to hide latencies
that way.  With an OR and a XOR, it won't be easy for the compiler to
see that the OR is equivalent to a XOR in this particular case.
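
A small C sketch of this equivalence. For rotate counts 1..31 the two
shifted halves never have a set bit in the same position, so OR and XOR
give the same result. The function names are hypothetical.

```c
#include <stdint.h>

/* Rotate left, halves combined with OR. Valid for n in 1..31. */
uint32_t rotl32_or(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Rotate left, halves combined with XOR. Same result, because
 * (x << n) has zeros in the low n bits and (x >> (32 - n)) has
 * zeros everywhere else. */
uint32_t rotl32_xor(uint32_t x, unsigned n)
{
    return (x << n) ^ (x >> (32 - n));
}

/* Typical crypto-style use: the rotate feeds a chain of XORs. With the
 * XOR form the whole expression is one associative XOR chain that the
 * compiler is free to re-order. */
uint32_t mix_xor(uint32_t a, uint32_t b, uint32_t c)
{
    return rotl32_xor(a, 7) ^ b ^ c;
}
```

With the OR form, the same expression mixes an OR into the XOR chain, and
the compiler would first have to prove that the OR operands share no set
bits before it could re-associate.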

> But I don't think they're the excuses for the poor performance, since
> they're also emulated in an AVX build.

Yes, there must be something else as well.  Maybe unaligned accesses.

Alexander

