List: gcc-bugs
Subject: [Bug rtl-optimization/59393] [4.9/5/6 regression] mips16 code size
From: "law at redhat dot com" <gcc-bugzilla () gcc ! gnu ! org>
Date: 2016-03-31 23:48:13
Message-ID: bug-59393-4-SXgIhLvjTI () http ! gcc ! gnu ! org/bugzilla/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59393
Jeffrey A. Law <law at redhat dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at redhat dot com
--- Comment #7 from Jeffrey A. Law <law at redhat dot com> ---
I was looking at this and noticed we have several sequences like:

  _18 = l_11 >> 16;
  _19 = _18 & 255;
  _20 = _19 + 256;
  _21 = _20 * 8;
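
For context, a sequence of this shape falls out of ordinary table
indexing.  A minimal C sketch (function and table names are hypothetical,
not from the testcase; only the constants match the quoted GIMPLE):

  #include <stdint.h>

  /* Extract a byte field, bias the index by 256, and let the *8 come
     from indexing a table of 8-byte elements.  */
  uint64_t
  lookup (const uint64_t *tbl, uint32_t l)
  {
    uint32_t b = (l >> 16) & 255;   /* _18, _19 */
    return tbl[b + 256];            /* _20 = _19 + 256;  _21 = _20 * 8 */
  }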
There are variations in the constants, but the pattern repeats regularly.  My
first thought was to rewrite that, folding the *8 into the shift and mask, as:

  _18 = l_11 >> 13;
  _19 = _18 & 0x7f8;
  _20 = _19 + 0x800;
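
Spelling out the algebra behind that rewrite:

  (((l >> 16) & 255) + 256) * 8
    = (((l >> 16) & 255) * 8) + 2048      distribute the *8
    = ((l >> 13) & (255 << 3)) + 0x800    fold the *8 into the shift/mask
    = ((l >> 13) & 0x7f8) + 0x800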
That seemed to be slightly worse on x86_64.  I'd already noticed that the
addition was only setting a bit position we knew to be zero, so it could be
rewritten using an IOR like this:

  _18 = l_11 >> 13;
  _19 = _18 & 0x7f8;
  _20 = _19 | 0x800;
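
As a quick sanity check (illustrative only, not part of the testcase),
all three forms compute the same value; the PLUS and IOR variants agree
because bit 11 is known to be clear after the 0x7f8 mask:

  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  static uint32_t f_orig (uint32_t l) { return (((l >> 16) & 255) + 256) * 8; }
  static uint32_t f_plus (uint32_t l) { return ((l >> 13) & 0x7f8) + 0x800; }
  static uint32_t f_ior  (uint32_t l) { return ((l >> 13) & 0x7f8) | 0x800; }

  int
  main (void)
  {
    /* Sampled sweep over the 32-bit input space.  */
    for (uint64_t l = 0; l <= UINT32_MAX; l += 0x10001)
      {
        uint32_t x = (uint32_t) l;
        assert (f_orig (x) == f_plus (x));
        assert (f_plus (x) == f_ior (x));
      }
    puts ("all three forms agree");
    return 0;
  }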
In isolation, that looked good on x86_64, so my thought was that we might have
a gcc-7 improvement that could be made for this code.  But when I coded up a
quick pattern in match.pd and tested it, the resulting assembly code was
considerably worse on x86_64 for the benchmark code.
There are a couple of things in play here on x86_64.  In the benchmark code
these are address computations, and the *8 and +256 in the original sequence
can be part of the effective address in the memory reference.  Furthermore,
the masking is a 2 byte movzbl in the original sequence, but a 3 byte and with
an immediate in the later sequences.  This negates all the gain from using IOR
instead of PLUS, which was otherwise shorter on x86_64.
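
To make the addressing-mode point concrete, here is an illustrative pair
of functions (hypothetical, not from the benchmark; the assembly in the
comments is a plausible lowering, not verified compiler output, and an
optimizer may of course canonicalize these differently):

  #include <stdint.h>

  uint64_t
  load_orig (const uint64_t *tbl, uint32_t l)
  {
    /* The +256 and *8 fold into the memory operand as a displacement
       and a scale, e.g.  mov 0x800(%rdi,%rax,8),%rax  */
    return tbl[((l >> 16) & 255) + 256];
  }

  uint64_t
  load_ior (const uint64_t *tbl, uint32_t l)
  {
    /* As written, the IOR form computes a byte offset up front: the
       0x7f8 mask needs a full and-immediate, and the |0x800 cannot
       fold into the address the way a +0x800 displacement can.  */
    uint32_t off = ((l >> 13) & 0x7f8) | 0x800;
    return *(const uint64_t *) ((const char *) tbl + off);
  }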
mips16 does slightly better with the second sequence, saving ~76 bytes on the
included testcase.
However, given how highly dependent this is on the target's addressing modes,
match.pd is probably not the place to attack this problem.  Combine is likely
a better place, using either a generic splitting sequence that self-tunes via
rtx_cost, or a target-specific splitter.
The closest we get right now is this combine attempt:
(set (reg:SI 1077)
     (plus:SI (ashift:SI (and:SI (lshiftrt:SI (reg:SI 1073)
                                              (const_int 8 [0x8]))
                                 (reg:SI 1074))
                         (const_int 2 [0x2]))
              (const_int 1024 [0x400])))
reg:SI 1074 is (const_int 255), but we can't blindly substitute it in because
reg 1074 has other uses, as seen by this attempt:
(parallel [
        (set (reg:SI 1077)
             (plus:SI (and:SI (ashift:SI (reg:SI 1072)
                                         (const_int 2 [0x2]))
                              (const_int 1020 [0x3fc]))
                      (const_int 1024 [0x400])))
        (set (reg:SI 1074)
             (const_int 255 [0xff]))
    ])