
List:       gcc-bugs
Subject:    [Bug rtl-optimization/59393] [4.9/5/6 regression] mips16 code size
From:       "law at redhat dot com" <gcc-bugzilla@gcc.gnu.org>
Date:       2016-03-31 23:48:13
Message-ID: bug-59393-4-SXgIhLvjTI@http.gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59393

Jeffrey A. Law <law at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at redhat dot com

--- Comment #7 from Jeffrey A. Law <law at redhat dot com> ---
I was looking at this and noticed we have several sequences like

  _18 = l_11 >> 16;
  _19 = _18 & 255;
  _20 = _19 + 256;
  _21 = _20 * 8;
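
In C terms, the repeating shape is (an illustrative rendering, not the
benchmark source; the names are mine):

  #include <stdint.h>

  /* Extract a byte field, bias it, and scale it by 8 -- the same
     computation as the GIMPLE sequence above.  */
  static uint32_t
  byte_offset (uint32_t l)
  {
    uint32_t b = (l >> 16) & 255;   /* _18, _19 */
    return (b + 256) * 8;           /* _20, _21 */
  }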

There are variations in the constants, but the pattern repeats regularly.  My
first thought was to fold the multiply into the shift, mask and addend,
rewriting that as

 _18 = l_11 >> 13;
 _19 = _18 & 0x7f8;
 _20 = _19 + 0x800;

That seemed to be slightly worse on x86_64.  I'd already noticed that the
addition was setting bits we knew to be zero, so it could be rewritten using an
IOR like this:


 _18 = l_11 >> 13;
 _19 = _18 & 0x7f8;
 _20 = _19 | 0x800;
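
For reference, a quick throwaway check (hypothetical, not part of the PR)
that the three forms really are equivalent, and why the PLUS can legally
become an IOR:

  #include <assert.h>
  #include <stdint.h>

  static uint32_t
  orig_form (uint32_t l)
  {
    return (((l >> 16) & 255) + 256) * 8;
  }

  static uint32_t
  plus_form (uint32_t l)
  {
    /* 0x7f8 == 255 << 3 and 0x800 == 256 * 8, so this is the
       original with the *8 folded into the shift, mask and addend.  */
    return ((l >> 13) & 0x7f8) + 0x800;
  }

  static uint32_t
  ior_form (uint32_t l)
  {
    /* (x & 0x7f8) never has bit 11 set and 0x800 is only bit 11,
       so the addition cannot carry and PLUS == IOR here.  */
    return ((l >> 13) & 0x7f8) | 0x800;
  }

  int
  main (void)
  {
    /* Sampled sweep of the 32-bit input space.  */
    for (uint64_t l = 0; l <= UINT32_MAX; l += 0xffff)
      {
        uint32_t x = (uint32_t) l;
        assert (orig_form (x) == plus_form (x));
        assert (orig_form (x) == ior_form (x));
      }
    return 0;
  }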

In isolation, that looked good on x86_64.  So my thought was that we might have
a gcc-7 improvement that could be made for this code.  But then I coded up a
quick pattern in match.pd and tested it, and the resulting assembly code was
considerably worse on x86_64 for the benchmark code.

There are a couple of things in play here on x86_64.  In the benchmark code
these are address computations.  The *8 and the +256 in the original sequence
can become part of the effective address in the memory reference (a scale of 8
and a displacement of 2048).  Furthermore, the masking is a 2-byte movzbl in
the original sequence, but a 3-byte and-immediate in the later sequences.  This
negates all the gain from using IOR instead of PLUS, which in isolation was
shorter for x86_64.
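
Concretely, a hypothetical stand-in for those address computations (not the
actual testcase) looks like this:

  #include <stdint.h>

  /* Indexing an 8-byte table lets the *8 become the scale and the
     +256 become a 2048-byte displacement in the effective address,
     while the masking can be a 2-byte movzbl.  With the 0x7f8 form,
     the mask needs a longer and-immediate and the 0x800 a separate
     instruction.  */
  uint64_t
  lookup (const uint64_t *table, uint32_t l)
  {
    return table[((l >> 16) & 255) + 256];
  }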

mips16 does slightly better with the second sequence, saving ~76 bytes on the
included testcase.

However, given how highly dependent this is on the target's addressing modes,
match.pd is probably not the place to attack this problem.  Combine is likely a
better place, using either a generic splitting sequence that self-tunes via
rtx_cost, or a target-specific splitter.

The closest we get right now is this combine attempt:

(set (reg:SI 1077)
    (plus:SI (ashift:SI (and:SI (lshiftrt:SI (reg:SI 1073)
                    (const_int 8 [0x8]))
                (reg:SI 1074))
            (const_int 2 [0x2]))
        (const_int 1024 [0x400])))


reg:SI 1074 is (const_int 255), but we can't blindly substitute it in because
reg 1074 has other uses, as seen by this attempt:

(parallel [
        (set (reg:SI 1077)
            (plus:SI (and:SI (ashift:SI (reg:SI 1072)
                        (const_int 2 [0x2]))
                    (const_int 1020 [0x3fc]))
                (const_int 1024 [0x400])))
        (set (reg:SI 1074)
            (const_int 255 [0xff]))
    ])