'Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv'


List:       qemu-arm
Subject:    Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
From:       Richard Henderson <richard.henderson () linaro ! org>
Date:       2023-05-31 17:08:51
Message-ID: c8499cae-befb-7130-3114-350ee97bf49d () linaro ! org

On 5/31/23 09:47, Ard Biesheuvel wrote:
> On Wed, 31 May 2023 at 18:33, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 5/31/23 04:22, Ard Biesheuvel wrote:
>>> Use the host native instructions to implement the AES instructions
>>> exposed by the emulated target. The mapping is not 1:1, so it requires a
>>> bit of fiddling to get the right result.
>>>
>>> This is still RFC material - the current approach feels too ad-hoc, but
>>> given the non-1:1 correspondence, doing a proper abstraction is rather
>>> difficult.
>>>
>>> Changes since v1/RFC:
>>> - add second patch to implement x86 AES instructions on ARM hosts - this
>>>     helps illustrate what an abstraction should cover.
>>> - use cpuinfo framework to detect host support for AES instructions.
>>> - implement ARM aesimc using x86 aesimc directly
>>>
>>> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
>>> tcrypt benchmark (mode=500)
>>>
>>> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
>>> the fact that ARM uses two instructions to implement a single AES round,
>>> whereas x86 only uses one.
>>
>> Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and
>> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>>
> 
> I don't understand what 'overhead' means in this context. Are you
> saying you saw barely any improvement?

Without any changes, I saw just over 1% of total system emulation time devoted to AES, 
which gives an upper limit to the runtime improvement possible there.  But I'll have a 
look at tcrypt.
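
As a rough sketch of that bound (this is just Amdahl's law; the function names are illustrative and not taken from QEMU), the overall gain when a fraction f of the run time is sped up by a factor s can be computed as:

```c
/*
 * Amdahl's law: overall speedup when a fraction f of the total run time
 * is accelerated by a factor s.  Names are hypothetical, for illustration.
 */
static double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Limit as s grows without bound: the accelerated part costs nothing. */
static double speedup_ceiling(double f)
{
    return 1.0 / (1.0 - f);
}
```

With f = 0.01, speedup_ceiling() gives roughly 1.0101, so about 1% is the most that workload could gain; a 3x faster AES path gives overall_speedup(0.01, 3.0) of roughly 1.0067.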

> aesenc_MC() can be implemented on x86 the way I did in patch #1, using
> aesdeclast+aesenc

Oh, nice.  I have not read the actual patches yet.
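
For reference, a minimal portable sketch of why aesdeclast+aesenc can yield a bare MixColumns, assuming that is what aesenc_MC() denotes. This is a software model of the two x86 instructions, not the code from the patch, and every function name in it is hypothetical. With an all-zero round key, aesdeclast applies InvShiftRows and InvSubBytes, and the following aesenc undoes both before applying MixColumns:

```c
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256], inv_sbox[256];

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Build the AES S-box and its inverse from the GF(2^8) definition. */
static void init_sboxes(void)
{
    uint8_t p = 1, q = 1;               /* invariant: p * q == 1 */
    do {
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0)); /* p *= 3 */
        q ^= (uint8_t)(q << 1);                                /* q /= 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (q & 0x80) ? 0x09 : 0;
        /* affine transformation of the inverse */
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;                     /* 0 has no inverse */
    for (int i = 0; i < 256; i++)
        inv_sbox[sbox[i]] = (uint8_t)i;
}

static uint8_t xtime(uint8_t x)         /* multiply by 2 in GF(2^8) */
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0));
}

static void sub_bytes(uint8_t s[16], const uint8_t box[256])
{
    for (int i = 0; i < 16; i++)
        s[i] = box[s[i]];
}

/* The state is column-major: byte r + 4*c holds row r, column c. */
static void shift_rows(uint8_t s[16], int inverse)
{
    uint8_t t[16];
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++) {
            int src = r + 4 * ((c + r) % 4);
            if (inverse)
                t[src] = s[r + 4 * c];
            else
                t[r + 4 * c] = s[src];
        }
    memcpy(s, t, 16);
}

static void mix_columns(uint8_t s[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *p = s + 4 * c;
        uint8_t a = p[0], b = p[1], d = p[2], e = p[3];
        p[0] = (uint8_t)(xtime(a) ^ xtime(b) ^ b ^ d ^ e);
        p[1] = (uint8_t)(a ^ xtime(b) ^ xtime(d) ^ d ^ e);
        p[2] = (uint8_t)(a ^ b ^ xtime(d) ^ xtime(e) ^ e);
        p[3] = (uint8_t)(xtime(a) ^ a ^ b ^ d ^ xtime(e));
    }
}

static void xor_key(uint8_t s[16], const uint8_t k[16])
{
    for (int i = 0; i < 16; i++)
        s[i] ^= k[i];
}

/* Model of x86 AESENC: ShiftRows, SubBytes, MixColumns, AddRoundKey. */
static void model_aesenc(uint8_t s[16], const uint8_t rk[16])
{
    shift_rows(s, 0);
    sub_bytes(s, sbox);
    mix_columns(s);
    xor_key(s, rk);
}

/* Model of x86 AESDECLAST: InvShiftRows, InvSubBytes, AddRoundKey. */
static void model_aesdeclast(uint8_t s[16], const uint8_t rk[16])
{
    shift_rows(s, 1);
    sub_bytes(s, inv_sbox);
    xor_key(s, rk);
}

/* MixColumns alone, built from the two modelled instructions. */
static void mix_columns_via_x86(uint8_t s[16])
{
    static const uint8_t zero[16] = {0};
    model_aesdeclast(s, zero);
    model_aesenc(s, zero);
}
```

Call init_sboxes() once before using the models. The trick relies on SubBytes and ShiftRows commuting, since one works on byte values and the other only on byte positions.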

>> ppc64:
>>
>>       asm("lxvd2x 32,0,%1;"
>>           "lxvd2x 33,0,%2;"
>>           "vcipher 0,0,1;"
>>           "stxvd2x 32,0,%0"
>>           : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>>
>> ppc64le:
>>
>>       unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>>       asm("lxvd2x 32,0,%1;"
>>           "lxvd2x 33,0,%2;"
>>           "lxvd2x 34,0,%3;"
>>           "vperm 0,0,0,2;"
>>           "vperm 1,1,1,2;"
>>           "vcipher 0,0,1;"
>>           "vperm 0,0,0,2;"
>>           "stxvd2x 32,0,%0"
>>           : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>>
>> There are also differences in their AES_Te* based C routines as well, which made me wonder
>> if we are handling host endianness differences correctly in emulation right now.  I think
>> I should most definitely add some generic-ish tests for this...
>>
> 
> The above kind of sums it up, no? Or isn't this working code?

It sums up the problem.  It works to produce the same output as the x86 instructions, with 
input bytes in the same order.  It shows that we have to be extra careful emulating vcipher 
etc, and should have unit tests.


r~

