[prev in list] [next in list] [prev in thread] [next in thread] 

List:       squeak-vm-dev
Subject:    Re: [Vm-dev] Primitives, interpreterProxy, function calls, lto
From:       Eliot Miranda <eliot.miranda () gmail ! com>
Date:       2018-04-20 17:34:31
Message-ID: CAC20JE14CYUhPOHrGLxKLb=L9C6Mvw7aiUcVmeha9=WYKU8_8A () mail ! gmail ! com
[Download RAW message or body]

[Attachment #2 (text/plain)]

 
[Attachment #3 (multipart/alternative)]


Hi Levente,

On Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <leves@caesar.elte.hu>
wrote:

>
> Hi All,
>
> I'm in the progress of rewriting the primitives of MiscPrimitivePlugin
> (mostly done, I'm stuck with the tests). My main goal was to add the
> necessary checks to make these primitives safer.
> But while I was doing that, I decided to optimize the code a bit by
> getting rid of a few repeated function calls, notably stackValue() calls,
> because those are pretty much the only ones that are called more than once
> with the same argument from these primitives.
>
> This kind of rewrite resulted in small, but measurable speedup (~5-15%),
> despite of the new checks being in place.
>

This is cool.


> Unfortunately, the generated assembly code for almost all interpeterProxy
> methods will be a function call (callq).
> IMHO the best solution would be if these calls were just macros, so that
> the compiler could optimize them away, but I don't know how to achieve that.
>

Look at
VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:.
This has to generate code to
- declare interpreterProxy functions imported from the VM (for builtin
plugins)
- declare function pointers for
- assign the function pointers from the interpreterProxy in the
plugin's setInterpreter routine.

So you would have to add a third option for builtin plugins which declared
those functions as macros, or generated them as local static inlined
functions.


> So, I decided to check if the C compiler could do that for us, and yes,
> gcc starting from 4.5 supports link time optimization (lto)[1].
> The results are promising, performance is better, but it still won't
> optimize everything it could[2].
>
> I wrote a small benchmark to see how lto and my rewrite affects the
> overhead of primitive calls:
>
> | string collation |
> string := ''.
> collation := (0 to: 255) asByteArray.
> [ 1 to: 10000000 do: [ :i |
>         ByteString compare: string with: string collated: collation ] ]
> timeToRun.
>
> The results on my machine were;
> original: 16 function calls - 795 ms
> rewrite-no lto: 13 function calls - 762 ms
> rewrite-with lto: 10 function calls - 674 ms
>
> So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when
> compiling with gcc 4.5+ to see how stable it is.
> Also, I would be happy if someone could point me to the direction to
> generate macros and use them instead of the function calls for
> interpeterProxy functions.
>

The easiest thing would be to generate them as local inlined static
functions.  But I would only bother to do this for performance-critical
primitives, for example by having plugin classes (such as
LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as
performance4-critical via a class-side method.


> Levente
>
> [1] https://gcc.gnu.org/wiki/LinkTimeOptimization
> [2] Generated assembly code for comparison
>
> Without lto:
> 00000000004ed310 <primitiveCompareString>:
>   4ed310:       41 55                   push   %r13
>   4ed312:       31 ff                   xor    %edi,%edi
>   4ed314:       41 54                   push   %r12
>   4ed316:       55                      push   %rbp
>   4ed317:       53                      push   %rbx
>   4ed318:       48 83 ec 08             sub    $0x8,%rsp
>   4ed31c:       e8 2f 19 f8 ff          callq  46ec50 <stackValue>
>   4ed321:       bf 01 00 00 00          mov    $0x1,%edi
>   4ed326:       48 89 c3                mov    %rax,%rbx
>   4ed329:       e8 22 19 f8 ff          callq  46ec50 <stackValue>
>   4ed32e:       bf 02 00 00 00          mov    $0x2,%edi
>   4ed333:       48 89 c5                mov    %rax,%rbp
>   4ed336:       e8 15 19 f8 ff          callq  46ec50 <stackValue>
>   4ed33b:       48 89 df                mov    %rbx,%rdi
>   4ed33e:       49 89 c4                mov    %rax,%r12
>   4ed341:       e8 4a d4 f3 ff          callq  42a790 <isBytes>
>   4ed346:       48 85 c0                test   %rax,%rax
>   4ed349:       74 0d                   je     4ed358
> <primitiveCompareString+0x48>
> ...
>
> With lto:
> 00000000004fe4f0 <primitiveCompareString.35928>:
>   4fe4f0:       41 55                   push   %r13
>   4fe4f2:       41 54                   push   %r12
>   4fe4f4:       55                      push   %rbp
>   4fe4f5:       53                      push   %rbx
>   4fe4f6:       48 83 ec 08             sub    $0x8,%rsp
>   4fe4fa:       48 8b 05 77 6b 32 00    mov    0x326b77(%rip),%rax
> # 825078 <stackPointer.7294>
>   4fe501:       48 8b 18                mov    (%rax),%rbx
>   4fe504:       48 8b 68 08             mov    0x8(%rax),%rbp
>   4fe508:       4c 8b 60 10             mov    0x10(%rax),%r12
>   4fe50c:       48 89 df                mov    %rbx,%rdi
>   4fe50f:       e8 bc fe fb ff          callq  4be3d0 <isBytes>
>   4fe514:       48 85 c0                test   %rax,%rax
>   4fe517:       74 0d                   je     4fe526
> <primitiveCompareString.35928+0x36>
> ...
>
> So, gcc could optimize away the three stackvalue calls with 4 mov
> instructions, but failed to inline isBytes() and the rest of the functions.
>


V nice.

Also look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack.  The
idea here is that if a primitive does not have a deep call chain (only
contains local loops and simple argument checking, does not call allocation
or GC, etc) then
- the interpreter can implement it as a simple C function that takes its
arguments as parameters of the function and returns the result object, or 0
on failure (0 not being an up), and
- the JIT can call it directly from machine code, avoiding the slow stack
switch

Currently we don't use this at all.  We used this only for hash multiply
(see mcprimHashMultiply:), but superseded it with a JIT primitive.
However, it could be used for several MiscPrimitivePlugin primitives.  And
of course, this is orthogonal to the improvements you're achieving through
inlining argument marshaling.

_,,,^..^,,,_
best, Eliot

[Attachment #6 (text/html)]

<div dir="ltr">Hi Levente,<div class="gmail_extra"><br><div class="gmail_quote">On \
Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <span dir="ltr">&lt;<a \
href="mailto:leves@caesar.elte.hu" \
target="_blank">leves@caesar.elte.hu</a>&gt;</span> wrote:<br><blockquote \
class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><br>
 Hi All,<br>
<br>
I&#39;m in the progress of rewriting the primitives of MiscPrimitivePlugin (mostly \
done, I&#39;m stuck with the tests). My main goal was to add the necessary checks to \
make these primitives safer.<br> But while I was doing that, I decided to optimize \
the code a bit by getting rid of a few repeated function calls, notably stackValue() \
calls, because those are pretty much the only ones that are called more than once \
with the same argument from these primitives.<br> <br>
This kind of rewrite resulted in small, but measurable speedup (~5-15%), despite of \
the new checks being in place.<br></blockquote><div><br></div><div>This is \
cool.</div><div>  <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Unfortunately, \
the generated assembly code for almost all interpeterProxy methods will be a function \
call (callq).<br> IMHO the best solution would be if these calls were just macros, so \
that the compiler could optimize them away, but I don&#39;t know how to achieve \
that.<br></blockquote><div><br></div><div>Look at \
VMPluginCodeGenerator&gt;&gt;#generateInterpreterProxyFunctionDereference:on:indent:. \
This has to generate code to</div><div>- declare interpreterProxy functions imported \
from the VM (for builtin plugins)</div><div>- declare function pointers for  \
</div><div>- assign the function pointers from the interpreterProxy in the \
plugin&#39;s  setInterpreter routine.</div><div><br></div><div>So you would have to \
add a third option for builtin plugins which declared those functions as macros, or \
generated them as local static inlined functions.</div><div>  </div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
 So, I decided to check if the C compiler could do that for us, and yes, gcc starting \
from 4.5 supports link time optimization (lto)[1].<br> The results are promising, \
performance is better, but it still won&#39;t optimize everything it could[2].<br> \
<br> I wrote a small benchmark to see how lto and my rewrite affects the overhead of \
primitive calls:<br> <br>
> string collation |<br>
string := &#39;&#39;.<br>
collation := (0 to: 255) asByteArray.<br>
[ 1 to: 10000000 do: [ :i |<br>
            ByteString compare: string with: string collated: collation ] ] \
timeToRun.<br> <br>
The results on my machine were;<br>
original: 16 function calls - 795 ms<br>
rewrite-no lto: 13 function calls - 762 ms<br>
rewrite-with lto: 10 function calls - 674 ms<br>
<br>
So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when compiling with \
gcc 4.5+ to see how stable it is.<br> Also, I would be happy if someone could point \
me to the direction to generate macros and use them instead of the function calls for \
interpeterProxy functions.<br></blockquote><div><br></div><div>The easiest thing \
would be to generate them as local inlined static functions.   But I would only \
bother to do this for performance-critical primitives, for example by having plugin \
classes (such as LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as \
performance4-critical via a class-side method.</div><div><br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
 <br>
Levente<br>
<br>
[1] <a href="https://gcc.gnu.org/wiki/LinkTimeOptimization" rel="noreferrer" \
target="_blank">https://gcc.gnu.org/wiki/LinkT<wbr>imeOptimization</a><br> [2] \
Generated assembly code for comparison<br> <br>
Without lto:<br>
00000000004ed310 &lt;primitiveCompareString&gt;:<br>
   4ed310:           41 55                             push     %r13<br>
   4ed312:           31 ff                             xor      %edi,%edi<br>
   4ed314:           41 54                             push     %r12<br>
   4ed316:           55                                 push     %rbp<br>
   4ed317:           53                                 push     %rbx<br>
   4ed318:           48 83 ec 08                    sub      $0x8,%rsp<br>
   4ed31c:           e8 2f 19 f8 ff               callq   46ec50 \
&lt;stackValue&gt;<br>  4ed321:           bf 01 00 00 00               mov      \
$0x1,%edi<br>  4ed326:           48 89 c3                        mov      \
%rax,%rbx<br>  4ed329:           e8 22 19 f8 ff               callq   46ec50 \
&lt;stackValue&gt;<br>  4ed32e:           bf 02 00 00 00               mov      \
$0x2,%edi<br>  4ed333:           48 89 c5                        mov      \
%rax,%rbp<br>  4ed336:           e8 15 19 f8 ff               callq   46ec50 \
&lt;stackValue&gt;<br>  4ed33b:           48 89 df                        mov      \
%rbx,%rdi<br>  4ed33e:           49 89 c4                        mov      \
%rax,%r12<br>  4ed341:           e8 4a d4 f3 ff               callq   42a790 \
&lt;isBytes&gt;<br>  4ed346:           48 85 c0                        test     \
%rax,%rax<br>  4ed349:           74 0d                             je        4ed358 \
                &lt;primitiveCompareString+0x48&gt;<br>
...<br>
<br>
With lto:<br>
00000000004fe4f0 &lt;primitiveCompareString.35928&gt;<wbr>:<br>
   4fe4f0:           41 55                             push     %r13<br>
   4fe4f2:           41 54                             push     %r12<br>
   4fe4f4:           55                                 push     %rbp<br>
   4fe4f5:           53                                 push     %rbx<br>
   4fe4f6:           48 83 ec 08                    sub      $0x8,%rsp<br>
   4fe4fa:           48 8b 05 77 6b 32 00      mov      0x326b77(%rip),%rax           \
# 825078 &lt;stackPointer.7294&gt;<br>  4fe501:           48 8b 18                    \
mov      (%rax),%rbx<br>  4fe504:           48 8b 68 08                    mov      \
0x8(%rax),%rbp<br>  4fe508:           4c 8b 60 10                    mov      \
0x10(%rax),%r12<br>  4fe50c:           48 89 df                        mov      \
%rbx,%rdi<br>  4fe50f:           e8 bc fe fb ff               callq   4be3d0 \
&lt;isBytes&gt;<br>  4fe514:           48 85 c0                        test     \
%rax,%rax<br>  4fe517:           74 0d                             je        4fe526 \
                &lt;primitiveCompareString.35928+<wbr>0x36&gt;<br>
...<br>
<br>
So, gcc could optimize away the three stackvalue calls with 4 mov instructions, but \
failed to inline isBytes() and the rest of the functions.<br> \
</blockquote></div><br><br clear="all"><div>V nice.</div><div><br></div><div>Also \
look at  CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack.   The idea here is \
that if a primitive does not have a deep call chain (only contains local loops and \
simple argument checking, does not call allocation or GC, etc) then</div><div>- the \
interpreter can implement it as a simple C function that takes its arguments as \
parameters of the function and returns the result object, or 0 on failure (0 not \
being an up), and</div><div>- the JIT can call it directly from machine code, \
avoiding the slow stack switch</div><div class="gmail_extra"><br></div><div \
class="gmail_extra">Currently we don&#39;t use this at all.   We used this only for \
hash multiply (see  mcprimHashMultiply:), but superseded it with a JIT primitive.   \
However, it could be used for several MiscPrimitivePlugin primitives.   And of \
course, this is orthogonal to the improvements you&#39;re achieving through inlining \
argument marshaling.</div><div class="gmail_extra"><br></div><div \
class="gmail_signature"><div dir="ltr"><div><span \
style="font-size:small;border-collapse:separate"><div>_,,,^..^,,,_<br></div><div>best, \
Eliot</div></span></div></div></div> </div></div>



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic