
List:       gcc
Subject:    Re: LTO slows down calculix by more than 10% on aarch64
From:       Prathamesh Kulkarni via Gcc <gcc@gcc.gnu.org>
Date:       2020-10-28 6:56:18
Message-ID: <CAAgBjMkiqB4WREozAn7ZJJ6p67_1eNFh5BE9hrk6SYj8rEh-ZA@mail.gmail.com>

On Wed, 21 Oct 2020 at 16:10, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Wed, Oct 21, 2020 at 12:04 PM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> > 
> > On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guenther@gmail.com> wrote:
> > > 
> > > On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > 
> > > > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > 
> > > > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > 
> > > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > 
> > > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > 
> > > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > 
> > > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I obtained perf stat results for the following benchmark runs:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >            3758      context-switches          #    0.000 K/sec
> > > > > > > > > > > > > > > >              40      cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > > > > > >           40847      page-faults               #    0.005 K/sec
> > > > > > > > > > > > > > > >   7856782413676      cycles                    #    1.000 GHz
> > > > > > > > > > > > > > > >   6034510093417      instructions              #    0.77  insn per cycle
> > > > > > > > > > > > > > > >    363937274287      branches                  #   46.321 M/sec
> > > > > > > > > > > > > > > >     48557110132      branch-misses             #   13.34% of all branches
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute
> > > > > > > > > > > > > > > should be enough for this kind of code)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  8319643.114380      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >            4285      context-switches          #    0.001 K/sec
> > > > > > > > > > > > > > > >              28      cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > > > > > >           40843      page-faults               #    0.005 K/sec
> > > > > > > > > > > > > > > >   8319591038295      cycles                    #    1.000 GHz
> > > > > > > > > > > > > > > >   6276338800377      instructions              #    0.75  insn per cycle
> > > > > > > > > > > > > > > >    467400726106      branches                  #   56.180 M/sec
> > > > > > > > > > > > > > > >     45986364011      branch-misses             #    9.84% of all branches
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably
> > > > > > > > > > > > > > > implying that extra instructions are appearing in this loop nest, but
> > > > > > > > > > > > > > > not in the innermost loop. As a reminder for others, the innermost loop
> > > > > > > > > > > > > > > has only 3 iterations.
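> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (Spelling those deltas out from the two perf stat blocks above:
> > > > > > > > > > > > > > >    branches:        467400726106 -  363937274287  ~  +103e9
> > > > > > > > > > > > > > >    instructions:   6276338800377 - 6034510093417  ~  +242e9
> > > > > > > > > > > > > > >    cycles:         8319591038295 - 7856782413676  ~  +463e9
> > > > > > > > > > > > > > > i.e. roughly 2.3 extra instructions and 4.5 extra cycles per extra branch.)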
> > > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  8207331.088040      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >            2266      context-switches          #    0.000 K/sec
> > > > > > > > > > > > > > > >              32      cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > > > > > >           40846      page-faults               #    0.005 K/sec
> > > > > > > > > > > > > > > >   8207292032467      cycles                    #    1.000 GHz
> > > > > > > > > > > > > > > >   6035724436440      instructions              #    0.74  insn per cycle
> > > > > > > > > > > > > > > >    364415440156      branches                  #   44.401 M/sec
> > > > > > > > > > > > > > > >     53138327276      branch-misses             #   14.58% of all branches
> > > > > > > > > > > > > > > This seems to match the baseline in terms of instruction count, but
> > > > > > > > > > > > > > > without PRE the loop nest may be carrying some dependencies over memory.
> > > > > > > > > > > > > > > I would simply check the assembly for the entire 6-level loop nest in
> > > > > > > > > > > > > > > question; I hope it's not very complicated (though Fortran array
> > > > > > > > > > > > > > > addressing...).
> > > > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> > > > > > > > > > > > > > > >            3139      context-switches          #    0.000 K/sec
> > > > > > > > > > > > > > > >              20      cpu-migrations            #    0.000 K/sec
> > > > > > > > > > > > > > > >           40846      page-faults               #    0.005 K/sec
> > > > > > > > > > > > > > > >   7797221351467      cycles                    #    1.000 GHz
> > > > > > > > > > > > > > > >   6187348757324      instructions              #    0.79  insn per cycle
> > > > > > > > > > > > > > > >    461840800061      branches                  #   59.231 M/sec
> > > > > > > > > > > > > > > >     26920311761      branch-misses             #    5.83% of all branches
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle
> > > > > > > > > > > > > > > count. I don't think the former fully covers the latter (there's also a
> > > > > > > > > > > > > > > 90e9 reduction in insn count).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is
> > > > > > > > > > > > > > > to consider what the profile for the entire loop nest looks like (it's
> > > > > > > > > > > > > > > 6 loops deep, each iterating only 3 times).
> > > > > > > > > > > > > > > > Perf profiles for -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >       3196866 │ 1f04:   ldur   d1, [x1, #-248]
> > > > > > > > > > > > > > > >  216348301800 │         add    w0, w0, #0x1
> > > > > > > > > > > > > > > >        985098 │         add    x2, x2, #0x18
> > > > > > > > > > > > > > > >  216215999206 │         add    x1, x1, #0x48
> > > > > > > > > > > > > > > >  215630376504 │         fmul   d1, d5, d1
> > > > > > > > > > > > > > > >  863829148015 │         fmul   d1, d1, d6
> > > > > > > > > > > > > > > >  864228353526 │         fmul   d0, d1, d0
> > > > > > > > > > > > > > > >  864568163014 │         fmadd  d2, d0, d16, d2
> > > > > > > > > > > > > > > >               │         cmp    w0, #0x4
> > > > > > > > > > > > > > > >  216125427594 │       ↓ b.eq   1f34
> > > > > > > > > > > > > > > >      15010377 │         ldur   d0, [x2, #-8]
> > > > > > > > > > > > > > > >  143753737468 │       ↑ b      1f04
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >  359871503840 │ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > > > > >  144055883055 │         add    w0, w0, #0x1
> > > > > > > > > > > > > > > >   72262104254 │         add    x2, x2, #0x18
> > > > > > > > > > > > > > > >  143991169721 │         add    x1, x1, #0x48
> > > > > > > > > > > > > > > >  288648917780 │         fmul   d15, d17, d15
> > > > > > > > > > > > > > > >  864665644756 │         fmul   d15, d15, d18
> > > > > > > > > > > > > > > >  863868426387 │         fmul   d14, d15, d14
> > > > > > > > > > > > > > > >  865228159813 │         fmadd  d16, d14, d31, d16
> > > > > > > > > > > > > > > >        245967 │         cmp    w0, #0x4
> > > > > > > > > > > > > > > >  215396760545 │       ↓ b.eq   1f28
> > > > > > > > > > > > > > > >     704732365 │         ldur   d14, [x2, #-8]
> > > > > > > > > > > > > > > >  143775979620 │       ↑ b      1ef8
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > High count on the initial ldur instruction could be explained if the
> > > > > > > > > > > > > > > loop is not entered by "fallthru" from the preceding block, or if its
> > > > > > > > > > > > > > > backedge is mispredicted. Sampling mispredictions should be possible
> > > > > > > > > > > > > > > with perf record, and you may be able to check if loop entry is
> > > > > > > > > > > > > > > fallthrough by inspecting assembly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It may also be possible to check if code alignment matters, by
> > > > > > > > > > > > > > > compiling with -falign-loops=32.
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > > > > > 4 loops
> > > > > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > > > > {
> > > > > > > > > > > > > > orthonl inlined into basic block;
> > > > > > > > > > > > > > loads w[0] .. w[8]
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > > else
> > > > > > > > > > > > > > 6 loops  // load anisox
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > followed by basic block:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > senergy=
> > > > > > > > > > > > > > &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > > > > &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > > > > &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy blocks
> > > > > > > > > > > > > > right into block 181, which is:
> > > > > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > and they are then further hoisted to block 173:
> > > > > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > From block 181, we have two paths towards the senergy block (bb 194):
> > > > > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > > > AND
> > > > > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194,
> > > > > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173.)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181 (by simply avoiding
> > > > > > > > > > > > > > insertion of a pre_expr if pre_expr->kind == REFERENCE) avoids hoisting of
> > > > > > > > > > > > > > the 'w' array and brings back most of the performance. Does that verify
> > > > > > > > > > > > > > that it is the hoisting of the 'w' array (w[0] ... w[8]) which is causing
> > > > > > > > > > > > > > the slowdown?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf profiles for full hoisting, and for disabled hoisting of
> > > > > > > > > > > > > > the 'w' array for the 6 loops, and the most drastic difference was for
> > > > > > > > > > > > > > the ldur instruction:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > > > > 359871503840 │ 1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > > > >      3441224 │ 1edc:   ldur   d1, [x1, #-248]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (The loop entry seems to be fall-thru in both cases. I have attached
> > > > > > > > > > > > > > profiles for both cases.)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > IIUC, the instruction seems to be loading the first element from the
> > > > > > > > > > > > > > anisox array, which makes me wonder if the issue is a data-cache miss in
> > > > > > > > > > > > > > the slower version. I ran perf script on the perf data for
> > > > > > > > > > > > > > L1-dcache-load-misses with period = 1 million, and it reported two cache
> > > > > > > > > > > > > > misses on the ldur instruction in the full-hoisting case, while it
> > > > > > > > > > > > > > reported zero for the disabled load hoisting case. So I wonder if the
> > > > > > > > > > > > > > slowdown happens because hoisting of the 'w' array possibly results in
> > > > > > > > > > > > > > eviction of anisox, thus causing a cache miss inside the inner loop and
> > > > > > > > > > > > > > making the load slower?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hoisting also seems to improve the number of overall cache misses though.
> > > > > > > > > > > > > > For the disabled hoisting of 'w' case, there were a total of 463 cache
> > > > > > > > > > > > > > misses, while with full hoisting there were 357 cache misses (with
> > > > > > > > > > > > > > period = 1 million). Does that happen because hoisting probably reduces
> > > > > > > > > > > > > > cache misses along the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > In general I feel that for this case, or the PR80155 case, the issues come
> > > > > > > > > > > > > with long-range hoistings inside a large CFG, since we don't have an
> > > > > > > > > > > > > accurate way to model target resources (register pressure in the PR80155
> > > > > > > > > > > > > case, or possibly cache pressure in this case?) at the tree level, and we
> > > > > > > > > > > > > end up with a register spill or a cache miss inside loops, which may
> > > > > > > > > > > > > offset the benefit of hoisting. As previously discussed, the right way to
> > > > > > > > > > > > > go is a live-range splitting pass at the GIMPLE -> RTL border, which can
> > > > > > > > > > > > > also help with other code-movement optimizations (or if the source had
> > > > > > > > > > > > > variables with long live ranges).
> > > > > > > > > > > > >
> > > > > > > > > > > > > I was wondering though, as a cheap workaround, would it make sense to
> > > > > > > > > > > > > check if we are hoisting across a "large" region of nested loops, and
> > > > > > > > > > > > > avoid it in that case, since hoisting may exert resource pressure inside
> > > > > > > > > > > > > the loop region? (Especially in the cases where the hoisted expressions
> > > > > > > > > > > > > were not originally AVAIL in any of the loop blocks, and the loop region
> > > > > > > > > > > > > doesn't benefit from hoisting.)
> > > > > > > > > > > > >
> > > > > > > > > > > > > For instance:
> > > > > > > > > > > > >
> > > > > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > > > >   {
> > > > > > > > > > > > >     /* Avoid hoisting across more than 3 nested loops.  */
> > > > > > > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > > > > > > >       return false;
> > > > > > > > > > > > >   }
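> > > > > > > > > > > > >
> > > > > > > > > > > > > A self-contained sketch of that check on a toy CFG, just to make the intent
> > > > > > > > > > > > > concrete (the real thing would use GCC's edge/loop structures; struct block,
> > > > > > > > > > > > > starts_loop and loop_depth here are made-up stand-ins):
> > > > > > > > > > > > >
> > > > > > > > > > > > > #include <stdbool.h>
> > > > > > > > > > > > >
> > > > > > > > > > > > > #define MAX_HOIST_LOOP_DEPTH 3   /* arbitrary cutoff, as above */
> > > > > > > > > > > > >
> > > > > > > > > > > > > /* Toy CFG node: successor list plus loop-entry information.  */
> > > > > > > > > > > > > struct block {
> > > > > > > > > > > > >   struct block **succs;
> > > > > > > > > > > > >   int n_succs;
> > > > > > > > > > > > >   bool starts_loop;   /* loop header or pre-header */
> > > > > > > > > > > > >   int loop_depth;     /* nesting depth of that loop, 0 otherwise */
> > > > > > > > > > > > > };
> > > > > > > > > > > > >
> > > > > > > > > > > > > /* Return false if some successor of BLOCK enters a loop nested more
> > > > > > > > > > > > >    than MAX_HOIST_LOOP_DEPTH deep, i.e. hoisting into BLOCK would
> > > > > > > > > > > > >    extend live ranges across that whole loop region.  */
> > > > > > > > > > > > > bool
> > > > > > > > > > > > > hoist_across_succs_ok (const struct block *block)
> > > > > > > > > > > > > {
> > > > > > > > > > > > >   for (int i = 0; i < block->n_succs; i++)
> > > > > > > > > > > > >     {
> > > > > > > > > > > > >       const struct block *dest = block->succs[i];
> > > > > > > > > > > > >       if (dest->starts_loop && dest->loop_depth > MAX_HOIST_LOOP_DEPTH)
> > > > > > > > > > > > >         return false;
> > > > > > > > > > > > >     }
> > > > > > > > > > > > >   return true;
> > > > > > > > > > > > > }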
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think this would work for resolving the calculix issue because it
> > > > > > > > > > > > > hoists across one region of 6 loops and another of 4 loops (didn't test
> > > > > > > > > > > > > yet). It's not bulletproof in that it will miss cases where the loop
> > > > > > > > > > > > > header (or pre-header) isn't a successor of the candidate block (checking
> > > > > > > > > > > > > for that might get expensive though?). I will test it on the gcc
> > > > > > > > > > > > > testsuite and SPEC for any regressions.
> > > > > > > > > > > > > Does this sound like a reasonable heuristic?
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > The attached patch implements the above heuristic.
> > > > > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions, and it
> > > > > > > > > > > > brings back most of the performance for calculix, on par with -O2
> > > > > > > > > > > > (without inlining orthonl). I verified that with the patch there is no
> > > > > > > > > > > > cache miss happening on the load insn inside the loop
> > > > > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > > > > > > >
> > > > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC speed
> > > > > > > > > > > > and will report numbers in a couple of days. (If required, we could
> > > > > > > > > > > > parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in
> > > > > > > > > > > > this patch, and set it in the backend so as not to affect other targets.)
> > > > > > > > > > > 
> > > > > > > > > > > I don't think this patch captures the case in a sensible way - it will
> > > > > > > > > > > simply never hoist computations out of loop header blocks with depth > 3
> > > > > > > > > > > which is certainly not what you want.  Also the pre-header check is odd -
> > > > > > > > > > > we're looking for computations in successors of BLOCK but clearly a
> > > > > > > > > > > computation in a pre-header is not at the same loop level as one in the
> > > > > > > > > > > header itself.
> > > > > > > > > > Well, my intent was to check if we are hoisting across a region which has
> > > > > > > > > > multiple nested loops, and in that case avoid hoisting expressions that do
> > > > > > > > > > not belong to any of the loop blocks, because that may increase resource
> > > > > > > > > > pressure inside the loops. For instance, in the calculix issue we hoist
> > > > > > > > > > the 'w' array from the post-dominator and neither loop region has any uses
> > > > > > > > > > of 'w'.  I agree checking just for loop level is too coarse.
> > > > > > > > > > The check with the pre-header was essentially the same - to see if we are
> > > > > > > > > > hoisting across a loop region, not necessarily from within the loops.
> > > > > > > > > 
> > > > > > > > > But it will fail to hoist *p in
> > > > > > > > > 
> > > > > > > > > if (x)
> > > > > > > > >   {
> > > > > > > > >     a = *p;
> > > > > > > > >   }
> > > > > > > > > else
> > > > > > > > >   {
> > > > > > > > >     b = *p;
> > > > > > > > >   }
> > > > > > > > >
> > > > > > > > > <make distance large>
> > > > > > > > > pdom:
> > > > > > > > >   c = *p;
> > > > > > > > > 
> > > > > > > > > so it isn't what matters either.  What happens at the immediate
> > > > > > > > > post-dominator isn't directly relevant - what matters would be if the pdom
> > > > > > > > > is the one making the value antic on one of the outgoing edges.  In that
> > > > > > > > > case we're also going to PRE *p into the arm not computing *p (albeit in a
> > > > > > > > > later position).  But that property is impossible to compute from the sets
> > > > > > > > > itself (not even mentioning the arbitrary CFG that can be inbetween the
> > > > > > > > > block and its pdom or the weird pdoms we compute for regions not having a
> > > > > > > > > path to exit, like infinite loops).
> > > > > > > > >
> > > > > > > > > You could eventually look at the pdom predecessors and if *p is not
> > > > > > > > > AVAIL_OUT in each of them we _might_ have the situation you want to
> > > > > > > > > protect against.  But as said PRE insertion will likely have made sure it
> > > > > > > > > _is_ AVAIL_OUT in each of them ...
> > > > > > > > Hi Richard,
> > > > > > > > Thanks for the suggestions. Right, the issue here seems to be that the
> > > > > > > > post-dom block is making the expressions ANTIC. Before doing insert, could
> > > > > > > > we simply copy the AVAIL_OUT set of each block into another set, say
> > > > > > > > ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions in
> > > > > > > > the pred blocks of the pdom and making them available?
> > > > > > > > And during hoisting, we could check if each expr that is ANTIC_IN (pdom)
> > > > > > > > is ORIG_AVAIL_OUT in each pred of the pdom, if the distance is "large".
> > > > > > > 
> > > > > > > Did you try if it works w/o copying AVAIL_OUT?  Because AVAIL_OUT is very
> > > > > > > large (it's actually quadratic in size of the CFG * # values), in
> > > > > > > particular we're inserting in RPO and update AVAIL_OUT only up to the
> > > > > > > current block (from dominators) so the PDOM block should have the original
> > > > > > > AVAIL_OUT (but from the last iteration - we do iterate insertion).
> > > > > > >
> > > > > > > Note I'm still not happy with pulling off this kind of heuristics.
> > > > > > > What the suggestion
> > > > > > > means is that for
> > > > > > > 
> > > > > > > if (x)
> > > > > > >   y = *p;
> > > > > > > z = *p;
> > > > > > >
> > > > > > > we'll end up with
> > > > > > >
> > > > > > > if (x)
> > > > > > >   y = *p;
> > > > > > > else
> > > > > > >   z = *p;
> > > > > > >
> > > > > > > instead of
> > > > > > >
> > > > > > > tem = *p;
> > > > > > > if (x)
> > > > > > >   y = tem;
> > > > > > > else
> > > > > > >   z = tem;
> > > > > > >
> > > > > > > that is, we get the runtime benefit but not the code-size one
> > > > > > > (hoisting mostly helps code size plus allows if-conversion as
> > > > > > > followup in some cases).  Now, if we iterate (like if we'd have a
> > > > > > > second hoisting pass) then the above would still cause hoisting -
> > > > > > > so the heuristic isn't sustainable.  Again, sth like "distance"
> > > > > > > isn't really available.
> > > > > > Hi Richard,
> > > > > > It doesn't work without copying AVAIL_OUT.
> > > > > > I tried it on a small example with the attached patch:
> > > > > > 
> > > > > > int foo(int cond, int x, int y)
> > > > > > {
> > > > > >   int t;
> > > > > >   void f(int);
> > > > > >
> > > > > >   if (cond)
> > > > > >     t = x + y;
> > > > > >   else
> > > > > >     t = x - y;
> > > > > >
> > > > > >   f (t);
> > > > > >   int t2 = (x + y) * 10;
> > > > > >   return t2;
> > > > > > }
> > > > > > 
> > > > > > By intersecting availout_in_some with the AVAIL_OUT of the preds of the
> > > > > > pdom, it does not hoist in the first pass, but then PRE inserts x + y in
> > > > > > the "else" block, and we eventually hoist before the if (cond). Similarly
> > > > > > for the e_c3d hoistings in calculix.
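> > > > > >
> > > > > > At the source level that two-step behaviour amounts to roughly the
> > > > > > following (hand-written illustration of what is described above, not an
> > > > > > actual dump):
> > > > > >
> > > > > > int foo(int cond, int x, int y)
> > > > > > {
> > > > > >   int t, tem;
> > > > > >   void f(int);
> > > > > >
> > > > > >   if (cond)
> > > > > >     t = tem = x + y;
> > > > > >   else
> > > > > >     {
> > > > > >       t = x - y;
> > > > > >       tem = x + y;   /* inserted by PRE to remove the partial redundancy */
> > > > > >     }
> > > > > >
> > > > > >   f (t);
> > > > > >   int t2 = tem * 10;   /* x + y is now available on both arms */
> > > > > >   return t2;
> > > > > > }
> > > > > >
> > > > > > and once x + y is available on both arms, the next insertion iteration
> > > > > > hoists it above the if (cond) again.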
> > > > > > 
> > > > > > IIUC, we want hoisting to compute:
> > > > > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > > > to ensure that hoisted expressions lie along all paths from block to
> > > > > > post-dom?  If copying the AVAIL_OUT sets is too large, could we keep
> > > > > > another set that precomputes the intersection of AVAIL_OUT over the pdom's
> > > > > > preds for each block, and then use this info during hoisting?
> > > > > > 
> > > > > > For computing "distance", I implemented a simple DFS walk from the block
> > > > > > to the post-dom, which gives up if the depth crosses a threshold before
> > > > > > reaching the post-dom. I am not sure though how expensive that can get.
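> > > > > >
> > > > > > For reference, a self-contained sketch of such a cutoff walk on a toy CFG
> > > > > > (illustrative only, not the actual patch; the names are made up):
> > > > > >
> > > > > > #include <stdbool.h>
> > > > > >
> > > > > > /* Toy CFG node: just a successor list.  */
> > > > > > struct block {
> > > > > >   struct block **succs;
> > > > > >   int n_succs;
> > > > > > };
> > > > > >
> > > > > > /* Return true if PDOM is reachable from BB along some path of at most
> > > > > >    BUDGET edges.  Purely depth-bounded, so it may re-visit blocks and
> > > > > >    explore many paths - hence the question about cost.  */
> > > > > > bool
> > > > > > pdom_within_distance (const struct block *bb, const struct block *pdom,
> > > > > >                       int budget)
> > > > > > {
> > > > > >   if (bb == pdom)
> > > > > >     return true;
> > > > > >   if (budget == 0)
> > > > > >     return false;
> > > > > >   for (int i = 0; i < bb->n_succs; i++)
> > > > > >     if (pdom_within_distance (bb->succs[i], pdom, budget - 1))
> > > > > >       return true;
> > > > > >   return false;
> > > > > > }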
> > > > > 
> > > > > As written it is quadratic in the CFG size.
> > > > > 
> > > > > You can optimize away the
> > > > > 
> > > > > +    FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > > > +      bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > > > > 
> > > > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
> > > > > 
> > > > > As said, I don't think this is the way to go - trying to avoid code
> > > > > hoisting isn't
> > > > > what we'd want to do - your quoted assembly instead points to a loop
> > > > > with a non-empty latch which is usually caused by PRE and avoided with -O3
> > > > > because it also harms vectorization.
> > > > But disabling PRE (which removes the non-empty latch) only results in a
> > > > marginal performance improvement, while disabling hoisting of the 'w'
> > > > array, with the non-empty latch left in place, seems to regain most of
> > > > the performance.
> > > > AFAIU, that was happening because after disabling hoisting of 'w' there
> > > > wasn't a cache miss (as seen with perf -e L1-dcache-load-misses) on the
> > > > load instruction inside the inner loop.
> > > 
> > > But that doesn't make much sense then.  If code generation isn't
> > > an issue I don't see how the hoisted loads should cause an L1
> > > dcache load miss for data that is accessed in the respective loop
> > > as well (though not hoisted from it, since at -O2 it is not
> > > sufficiently unrolled).
> > Hi Richard,
> > I am very sorry to respond late; I was away from work for some personal
> > commitments and couldn't respond earlier.
> > Yes, I agree this doesn't seem to make much sense, but I am consistently
> > seeing L1 dcache load misses, which go away after disabling hoisting of
> > 'w'. I am not sure though why this happens. Also the load instruction is
> > the only one that shows the most significant performance difference
> > across several runs. Or maybe I am interpreting the results incorrectly.
> > Do you have suggestions for any benchmarking experiment I could try?
> 
> I think you want to edit the bad assembly manually and try a few things,
> like ordering the hoisted loads and then re-measure.  Unfortunately
> modern CPU pipelines have no easy way to tell us why they're unhappy.
Hi Richard,
Thanks for the suggestions. I reordered the hoisted loads to be in order,
but that didn't seem to improve performance.

Thanks,
Prathamesh
> 
> > > 
> > > > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per node,
> > > > since that's quadratic in terms of CFG size.
> > > > Would it make sense to break the interaction between PRE and hoisting
> > > > only for the case of inserting into the preds of the pdom?
> > > > I tried doing that in the attached patch, where insert runs in two phases:
> > > > (a) PRE and hoisting, where hoisting marks blocks not to do PRE for.
> > > > (b) A second phase, which only runs PRE on all blocks.
> > > > This (expectedly) regresses ssa-hoist-3.c.
> > > >
> > > > If the heuristic isn't acceptable, I suppose encoding the distance of an
> > > > expr within the ANTIC sets during compute_antic would be the right way to
> > > > fix this?
> > > > So ANTIC_IN (block) contains the anticipated expressions, and for each
> > > > antic expr, the "distance" from the furthest block it's computed in?
> > > > Could you please elaborate a bit on how we could go about encoding
> > > > distance during compute_antic?
> > > 
> > > But the distance in this case is just one CFG node ... we have
> > > 
> > > if (mattyp.eq.1)
> > >   ... use of w but not with constant indices
> > > else if (mattyp.eq.2)
> > >   ... inlined orthonl with constant index w() accesses, single BB
> > > else
> > >   ... use of w but not with constant indices - the actual relevant
> > >       loop of calculix
> > > endif
> > > ... constant index w() accesses, single BB
> > > 
> > > so the CFG distance is one node - unless you want to compute the
> > > maximum distance?  Btw, I only see 9 loads hoisted.
> > > 
> > > I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
> > > 
> > > And indeed this case is exactly one where hoisting is superior to
> > > PRE which would happily insert the 9 loads into the two variable-access
> > > predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
> > > 
> > > In .optimized I see
> > > 
> > > pretmp_5573 = w[0];
> > > pretmp_5574 = w[3];
> > > pretmp_5575 = w[6];
> > > pretmp_5576 = w[1];
> > > pretmp_5577 = w[4];
> > > pretmp_5578 = w[7];
> > > pretmp_5579 = w[2];
> > > pretmp_5580 = w[5];
> > > pretmp_5581 = w[8];
> > > if (mattyp.157_742 == 1)
> > > 
> > > I do remember talks/patches about ordering of such sequences of loads
> > > to make them prefetch-happier.  Are the loads actually emitted in-order
> > > for arm?  Thus w[0]...w[8] rather than as seen above with some random
> > > permutes inbetween?  On x86 they are emitted in random order
> > > (they are also spilled immediately).
> > On aarch64, they are emitted in random order as well.
> > 
> > Thanks,
> > Prathamesh
> > > 
> > > Richard.
> > > 
> > > > Thanks,
> > > > Prathamesh
> > > > > 
> > > > > Richard.
> > > > > 
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > > 
> > > > > > > Richard.
> > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Note the difficulty in capturing "distance" is that the distance is
> > > > > > > > > > > simply not available at this point - it is the anticipated values from
> > > > > > > > > > > the successors that do _not_ compute the value itself that are the
> > > > > > > > > > > issue.  To capture "distance" you'd need to somehow "age" anticipated
> > > > > > > > > > > values when propagating them upwards during compute_antic (where it
> > > > > > > > > > > then would also apply to PRE in general).
> > > > > > > > > > Yes, indeed.  As a hack, would it make sense to avoid inserting an
> > > > > > > > > > expression in the block if it's ANTIC in the post-dom block, as a
> > > > > > > > > > trade-off between extending live ranges and hoisting, if the "distance"
> > > > > > > > > > between the block and the post-dom is "too far"?  In general, as you
> > > > > > > > > > point out, we'd need to compute distance info for successor blocks
> > > > > > > > > > during compute_antic, but special-casing the post-dom should be easy
> > > > > > > > > > enough during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > > > > > > > > > the post-dom could potentially be "long range" if the region is large.
> > > > > > > > > > It's still a coarse heuristic though. I tried it in the attached patch.
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > As with all other heuristics the only place one could do hackish
> > > > > > > > > > > attempts with at least some reasoning is the elimination phase where we
> > > > > > > > > > > make use of the (hoist) insertions - of course for hoisting we already
> > > > > > > > > > > know we'll get the "close" use in one of the successors so I fear even
> > > > > > > > > > > there it will be impossible to do something sensible.
> > > > > > > > > > > Richard.
> > > > > > > > > > > 
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Alexander

