[prev in list] [next in list] [prev in thread] [next in thread]
List: squeak-vm-dev
Subject: Re: [Vm-dev] Primitives, interpreterProxy, function calls, lto
From: Eliot Miranda <eliot.miranda () gmail ! com>
Date: 2018-04-20 17:34:31
Message-ID: CAC20JE14CYUhPOHrGLxKLb=L9C6Mvw7aiUcVmeha9=WYKU8_8A () mail ! gmail ! com
[Download RAW message or body]
[Attachment #2 (text/plain)]
[Attachment #3 (multipart/alternative)]
Hi Levente,
On Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <leves@caesar.elte.hu>
wrote:
>
> Hi All,
>
> I'm in the progress of rewriting the primitives of MiscPrimitivePlugin
> (mostly done, I'm stuck with the tests). My main goal was to add the
> necessary checks to make these primitives safer.
> But while I was doing that, I decided to optimize the code a bit by
> getting rid of a few repeated function calls, notably stackValue() calls,
> because those are pretty much the only ones that are called more than once
> with the same argument from these primitives.
>
> This kind of rewrite resulted in small, but measurable speedup (~5-15%),
> despite of the new checks being in place.
>
This is cool.
> Unfortunately, the generated assembly code for almost all interpeterProxy
> methods will be a function call (callq).
> IMHO the best solution would be if these calls were just macros, so that
> the compiler could optimize them away, but I don't know how to achieve that.
>
Look at
VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:.
This has to generate code to
- declare interpreterProxy functions imported from the VM (for builtin
plugins)
- declare function pointers for
- assign the function pointers from the interpreterProxy in the
plugin's setInterpreter routine.
So you would have to add a third option for builtin plugins which declared
those functions as macros, or generated them as local static inlined
functions.
> So, I decided to check if the C compiler could do that for us, and yes,
> gcc starting from 4.5 supports link time optimization (lto)[1].
> The results are promising, performance is better, but it still won't
> optimize everything it could[2].
>
> I wrote a small benchmark to see how lto and my rewrite affects the
> overhead of primitive calls:
>
> | string collation |
> string := ''.
> collation := (0 to: 255) asByteArray.
> [ 1 to: 10000000 do: [ :i |
> ByteString compare: string with: string collated: collation ] ]
> timeToRun.
>
> The results on my machine were;
> original: 16 function calls - 795 ms
> rewrite-no lto: 13 function calls - 762 ms
> rewrite-with lto: 10 function calls - 674 ms
>
> So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when
> compiling with gcc 4.5+ to see how stable it is.
> Also, I would be happy if someone could point me to the direction to
> generate macros and use them instead of the function calls for
> interpeterProxy functions.
>
The easiest thing would be to generate them as local inlined static
functions. But I would only bother to do this for performance-critical
primitives, for example by having plugin classes (such as
LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as
performance4-critical via a class-side method.
> Levente
>
> [1] https://gcc.gnu.org/wiki/LinkTimeOptimization
> [2] Generated assembly code for comparison
>
> Without lto:
> 00000000004ed310 <primitiveCompareString>:
> 4ed310: 41 55 push %r13
> 4ed312: 31 ff xor %edi,%edi
> 4ed314: 41 54 push %r12
> 4ed316: 55 push %rbp
> 4ed317: 53 push %rbx
> 4ed318: 48 83 ec 08 sub $0x8,%rsp
> 4ed31c: e8 2f 19 f8 ff callq 46ec50 <stackValue>
> 4ed321: bf 01 00 00 00 mov $0x1,%edi
> 4ed326: 48 89 c3 mov %rax,%rbx
> 4ed329: e8 22 19 f8 ff callq 46ec50 <stackValue>
> 4ed32e: bf 02 00 00 00 mov $0x2,%edi
> 4ed333: 48 89 c5 mov %rax,%rbp
> 4ed336: e8 15 19 f8 ff callq 46ec50 <stackValue>
> 4ed33b: 48 89 df mov %rbx,%rdi
> 4ed33e: 49 89 c4 mov %rax,%r12
> 4ed341: e8 4a d4 f3 ff callq 42a790 <isBytes>
> 4ed346: 48 85 c0 test %rax,%rax
> 4ed349: 74 0d je 4ed358
> <primitiveCompareString+0x48>
> ...
>
> With lto:
> 00000000004fe4f0 <primitiveCompareString.35928>:
> 4fe4f0: 41 55 push %r13
> 4fe4f2: 41 54 push %r12
> 4fe4f4: 55 push %rbp
> 4fe4f5: 53 push %rbx
> 4fe4f6: 48 83 ec 08 sub $0x8,%rsp
> 4fe4fa: 48 8b 05 77 6b 32 00 mov 0x326b77(%rip),%rax
> # 825078 <stackPointer.7294>
> 4fe501: 48 8b 18 mov (%rax),%rbx
> 4fe504: 48 8b 68 08 mov 0x8(%rax),%rbp
> 4fe508: 4c 8b 60 10 mov 0x10(%rax),%r12
> 4fe50c: 48 89 df mov %rbx,%rdi
> 4fe50f: e8 bc fe fb ff callq 4be3d0 <isBytes>
> 4fe514: 48 85 c0 test %rax,%rax
> 4fe517: 74 0d je 4fe526
> <primitiveCompareString.35928+0x36>
> ...
>
> So, gcc could optimize away the three stackvalue calls with 4 mov
> instructions, but failed to inline isBytes() and the rest of the functions.
>
V nice.
Also look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack. The
idea here is that if a primitive does not have a deep call chain (only
contains local loops and simple argument checking, does not call allocation
or GC, etc) then
- the interpreter can implement it as a simple C function that takes its
arguments as parameters of the function and returns the result object, or 0
on failure (0 not being an up), and
- the JIT can call it directly from machine code, avoiding the slow stack
switch
Currently we don't use this at all. We used this only for hash multiply
(see mcprimHashMultiply:), but superseded it with a JIT primitive.
However, it could be used for several MiscPrimitivePlugin primitives. And
of course, this is orthogonal to the improvements you're achieving through
inlining argument marshaling.
_,,,^..^,,,_
best, Eliot
[Attachment #6 (text/html)]
<div dir="ltr">Hi Levente,<div class="gmail_extra"><br><div class="gmail_quote">On \
Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <span dir="ltr"><<a \
href="mailto:leves@caesar.elte.hu" \
target="_blank">leves@caesar.elte.hu</a>></span> wrote:<br><blockquote \
class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><br>
Hi All,<br>
<br>
I'm in the progress of rewriting the primitives of MiscPrimitivePlugin (mostly \
done, I'm stuck with the tests). My main goal was to add the necessary checks to \
make these primitives safer.<br> But while I was doing that, I decided to optimize \
the code a bit by getting rid of a few repeated function calls, notably stackValue() \
calls, because those are pretty much the only ones that are called more than once \
with the same argument from these primitives.<br> <br>
This kind of rewrite resulted in small, but measurable speedup (~5-15%), despite of \
the new checks being in place.<br></blockquote><div><br></div><div>This is \
cool.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Unfortunately, \
the generated assembly code for almost all interpeterProxy methods will be a function \
call (callq).<br> IMHO the best solution would be if these calls were just macros, so \
that the compiler could optimize them away, but I don't know how to achieve \
that.<br></blockquote><div><br></div><div>Look at \
VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:. \
This has to generate code to</div><div>- declare interpreterProxy functions imported \
from the VM (for builtin plugins)</div><div>- declare function pointers for \
</div><div>- assign the function pointers from the interpreterProxy in the \
plugin's setInterpreter routine.</div><div><br></div><div>So you would have to \
add a third option for builtin plugins which declared those functions as macros, or \
generated them as local static inlined functions.</div><div> </div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
So, I decided to check if the C compiler could do that for us, and yes, gcc starting \
from 4.5 supports link time optimization (lto)[1].<br> The results are promising, \
performance is better, but it still won't optimize everything it could[2].<br> \
<br> I wrote a small benchmark to see how lto and my rewrite affects the overhead of \
primitive calls:<br> <br>
> string collation |<br>
string := ''.<br>
collation := (0 to: 255) asByteArray.<br>
[ 1 to: 10000000 do: [ :i |<br>
ByteString compare: string with: string collated: collation ] ] \
timeToRun.<br> <br>
The results on my machine were;<br>
original: 16 function calls - 795 ms<br>
rewrite-no lto: 13 function calls - 762 ms<br>
rewrite-with lto: 10 function calls - 674 ms<br>
<br>
So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when compiling with \
gcc 4.5+ to see how stable it is.<br> Also, I would be happy if someone could point \
me to the direction to generate macros and use them instead of the function calls for \
interpeterProxy functions.<br></blockquote><div><br></div><div>The easiest thing \
would be to generate them as local inlined static functions. But I would only \
bother to do this for performance-critical primitives, for example by having plugin \
classes (such as LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as \
performance4-critical via a class-side method.</div><div><br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
<br>
Levente<br>
<br>
[1] <a href="https://gcc.gnu.org/wiki/LinkTimeOptimization" rel="noreferrer" \
target="_blank">https://gcc.gnu.org/wiki/LinkT<wbr>imeOptimization</a><br> [2] \
Generated assembly code for comparison<br> <br>
Without lto:<br>
00000000004ed310 <primitiveCompareString>:<br>
4ed310: 41 55 push %r13<br>
4ed312: 31 ff xor %edi,%edi<br>
4ed314: 41 54 push %r12<br>
4ed316: 55 push %rbp<br>
4ed317: 53 push %rbx<br>
4ed318: 48 83 ec 08 sub $0x8,%rsp<br>
4ed31c: e8 2f 19 f8 ff callq 46ec50 \
<stackValue><br> 4ed321: bf 01 00 00 00 mov \
$0x1,%edi<br> 4ed326: 48 89 c3 mov \
%rax,%rbx<br> 4ed329: e8 22 19 f8 ff callq 46ec50 \
<stackValue><br> 4ed32e: bf 02 00 00 00 mov \
$0x2,%edi<br> 4ed333: 48 89 c5 mov \
%rax,%rbp<br> 4ed336: e8 15 19 f8 ff callq 46ec50 \
<stackValue><br> 4ed33b: 48 89 df mov \
%rbx,%rdi<br> 4ed33e: 49 89 c4 mov \
%rax,%r12<br> 4ed341: e8 4a d4 f3 ff callq 42a790 \
<isBytes><br> 4ed346: 48 85 c0 test \
%rax,%rax<br> 4ed349: 74 0d je 4ed358 \
<primitiveCompareString+0x48><br>
...<br>
<br>
With lto:<br>
00000000004fe4f0 <primitiveCompareString.35928><wbr>:<br>
4fe4f0: 41 55 push %r13<br>
4fe4f2: 41 54 push %r12<br>
4fe4f4: 55 push %rbp<br>
4fe4f5: 53 push %rbx<br>
4fe4f6: 48 83 ec 08 sub $0x8,%rsp<br>
4fe4fa: 48 8b 05 77 6b 32 00 mov 0x326b77(%rip),%rax \
# 825078 <stackPointer.7294><br> 4fe501: 48 8b 18 \
mov (%rax),%rbx<br> 4fe504: 48 8b 68 08 mov \
0x8(%rax),%rbp<br> 4fe508: 4c 8b 60 10 mov \
0x10(%rax),%r12<br> 4fe50c: 48 89 df mov \
%rbx,%rdi<br> 4fe50f: e8 bc fe fb ff callq 4be3d0 \
<isBytes><br> 4fe514: 48 85 c0 test \
%rax,%rax<br> 4fe517: 74 0d je 4fe526 \
<primitiveCompareString.35928+<wbr>0x36><br>
...<br>
<br>
So, gcc could optimize away the three stackvalue calls with 4 mov instructions, but \
failed to inline isBytes() and the rest of the functions.<br> \
</blockquote></div><br><br clear="all"><div>V nice.</div><div><br></div><div>Also \
look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack. The idea here is \
that if a primitive does not have a deep call chain (only contains local loops and \
simple argument checking, does not call allocation or GC, etc) then</div><div>- the \
interpreter can implement it as a simple C function that takes its arguments as \
parameters of the function and returns the result object, or 0 on failure (0 not \
being an up), and</div><div>- the JIT can call it directly from machine code, \
avoiding the slow stack switch</div><div class="gmail_extra"><br></div><div \
class="gmail_extra">Currently we don't use this at all. We used this only for \
hash multiply (see mcprimHashMultiply:), but superseded it with a JIT primitive. \
However, it could be used for several MiscPrimitivePlugin primitives. And of \
course, this is orthogonal to the improvements you're achieving through inlining \
argument marshaling.</div><div class="gmail_extra"><br></div><div \
class="gmail_signature"><div dir="ltr"><div><span \
style="font-size:small;border-collapse:separate"><div>_,,,^..^,,,_<br></div><div>best, \
Eliot</div></span></div></div></div> </div></div>
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic