List: john-dev
Subject: Re: [john-dev] JtR on ARM (NEON)
From: Solar Designer <solar () openwall ! com>
Date: 2015-07-31 8:35:17
Message-ID: 20150731083517.GB31035 () openwall ! com
On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
> A schoolmate of mine got an ARM board in his lab and gave me access to it.
> It's some model of Nvidia Tegra, with 4 cores and NEON support, though I
> don't know which specific model it is.
You should check /proc/cpuinfo under Linux.
> (OpenMP is disabled in this test. PBKDF2-HMAC-SHA512 failed somehow, so I
> chose sha512crypt here.)
You'll need to investigate why PBKDF2-HMAC-SHA512 fails. This might
provide a clue as to why sha512crypt became slower.
> Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 64/32 OpenSSL]... DONE
BTW, the 64/32 here is wrong. Should be 32/32. Just because an
algorithm uses 64-bit integers logically doesn't mean we should report
it as using 64 out of 32 physical bits, since it can't. magnum?
> From the figures above, MD4 and MD5 get 2x speedup; SHA1 and SHA256 have no
> speedup; SHA512 gets a lot slower.
Yes. That's weird.
I assume you haven't started playing with interleaving factors yet?
> In my current implementation, most pseudo-intrinsics are directly mapped to
> NEON intrinsics. The only exceptions are vcmov and vroti, which have to be
> emulated.
As I told you before, no, vcmov must not be emulated - we have it on
NEON natively. Please see how it's done in DES_bs_b.c.
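
To illustrate the vcmov point: NEON's bitwise select (vbslq_u32) does a
conditional move per bit in one instruction. Below is a minimal sketch,
not the code from DES_bs_b.c. The helper names (select_neon, select_ref,
select_emulated) and their argument order are made up for illustration
and may not match JtR's vcmov macro. The scalar functions model the same
operation so the file also builds on non-ARM hosts.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/*
 * Hypothetical helper: vbslq_u32(mask, a, b) takes bits from a where the
 * mask bit is 1 and bits from b where the mask bit is 0.
 */
static inline uint32x4_t select_neon(uint32x4_t a, uint32x4_t b,
                                     uint32x4_t mask)
{
    return vbslq_u32(mask, a, b);
}
#endif

/* Scalar reference for the same per-bit select. */
static inline uint32_t select_ref(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/*
 * A common emulation (XOR, AND, XOR) for targets without a native
 * select. It computes the same result as select_ref().
 */
static inline uint32_t select_emulated(uint32_t a, uint32_t b, uint32_t mask)
{
    return b ^ (mask & (a ^ b));
}
```

On a NEON target, select_neon() should compile to a single VBSL (or one
of its VBIT/VBIF variants), versus three instructions for the emulated
form.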
As to vroti, yes, although there's a 2-instruction way to emulate it,
see page 4 in:
https://cryptojedi.org/papers/neoncrypto-20120320.pdf
Maybe it'd work faster at high interleaving factors (and slower at low
interleaving factors, since it's higher latency than the straightforward
3-instruction approach).
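
A sketch of the two variants follows. It assumes the 2-instruction method
is a left shift followed by a shift-right-and-insert (VSHL + VSRI), which
is my reading of the paper; check it against page 4. The macro and
function names are hypothetical. sri32() is a scalar model of the insert
semantics, included so the equivalence can be checked on any host.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Macros, because the shift counts must be compile-time constants. */
/* Straightforward form: shift left, shift right, OR (3 instructions). */
#define ROTL32_NEON_3OP(x, n) \
    vorrq_u32(vshlq_n_u32((x), (n)), vshrq_n_u32((x), 32 - (n)))
/* Shift left, then shift-right-and-insert (2 instructions). */
#define ROTL32_NEON_2OP(x, n) \
    vsriq_n_u32(vshlq_n_u32((x), (n)), (x), 32 - (n))
#endif

/*
 * Scalar model of shift-right-and-insert for 1 <= n <= 31: the top n
 * bits of dst are kept, the remaining bits are replaced by src >> n.
 */
static inline uint32_t sri32(uint32_t dst, uint32_t src, unsigned n)
{
    uint32_t keep = ~(UINT32_MAX >> n);
    return (dst & keep) | (src >> n);
}

/* Rotate left modelled as shift + insert; valid for 1 <= n <= 31. */
static inline uint32_t rotl32_2op_model(uint32_t x, unsigned n)
{
    return sri32(x << n, x, 32 - n);
}

/* Rotate left as shift, shift, OR; valid for 1 <= n <= 31. */
static inline uint32_t rotl32_3op(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}
```

In the 2-instruction form the insert depends on the result of the first
shift, so the two instructions form a serial chain. In the 3-instruction
form the two shifts are independent of each other.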
BTW, when you emulate a rotate with two shifts, you may sometimes see
better results when you combine them with a XOR rather than an OR,
because crypto code tends to use XORs nearby, so the compiler will be
able to re-order the XORs if it sees an opportunity to hide latencies
that way. With an OR and a XOR, it won't be easy for the compiler to
see that the OR is equivalent to a XOR in this particular case.
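
A small scalar sketch of why this is safe: the two shifted halves of a
rotate never have a set bit in the same position, so OR and XOR give the
same result. The function names and the mix_xor() round step are
hypothetical and only serve as illustration.

```c
#include <stdint.h>

/* Rotate left combined with OR; valid for 1 <= n <= 31. */
static inline uint32_t rotl32_or(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/*
 * Rotate left combined with XOR. (x << n) has zeros in its low n bits
 * and (x >> (32 - n)) has zeros everywhere else, so no bit position is
 * set in both operands and XOR equals OR here.
 */
static inline uint32_t rotl32_xor(uint32_t x, unsigned n)
{
    return (x << n) ^ (x >> (32 - n));
}

/*
 * Hypothetical round step: with the XOR form the whole expression is a
 * chain of XORs, which the compiler is free to re-associate.
 */
static inline uint32_t mix_xor(uint32_t a, uint32_t b, uint32_t x, unsigned n)
{
    return a ^ b ^ rotl32_xor(x, n);
}
```

Whether the XOR form is actually faster depends on the compiler and the
surrounding code, so it needs to be benchmarked per format.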
> But I don't think they explain the poor performance, since they're also
> emulated in an AVX build.
Yes, there must be something else as well. Maybe unaligned accesses.
Alexander