[prev in list] [next in list] [prev in thread] [next in thread]
List: gcc-bugs
Subject: [Bug rtl-optimization/89895] New: Unable to sink high half of widening multiply out of loop
From: "lkml at sdf dot org" <gcc-bugzilla () gcc ! gnu ! org>
Date: 2019-03-31 7:54:05
Message-ID: bug-89895-4 () http ! gcc ! gnu ! org/bugzilla/
[Download RAW message or body]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89895
Bug ID: 89895
Summary: Unable to sink high half of widening multiply out of
loop
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lkml at sdf dot org
Target Milestone: ---
This is part of gcc's general problem with double-word values, but
I was encouraged to submit a PR, since it's a particularly simple but
real-world-applicable test case.
Lemire's algorithm for uniform random number generation in a range
(https://arxiv.org/abs/1805.10941) has the following core:
static uint64_t __attribute__((noinline)) get_random_u64(void);
u64 get_random_range(uint64_t range, uint64_t lim)
{
unsigned __int128 prod;
do {
prod = (unsigned __int128)range * get_random_u64();
} while ((uint64_t)prod < lim);
return prod >> 64;
}
(In practice, get_random_u64() would be inlined, but I've left it
out of line for exposition.)
GCC's isn't sinking generation of the high half of the product out
of the loop. This particularly applies on platforms with a separate
multiply-high instruction like alpha:
$L9:
bsr $26,get_random_u64 !samegp
mulq $0,$9,$1
umulh $0,$9,$0
cmpule $10,$1,$1
beq $1,$L9
and PowerPC:
.L12:
bl get_random_u64
mulld 9,3,31
mulhdu 3,3,31
cmpld 7,30,9
bgt+ 7,.L12
But is also applies to MIPS, where the mfhi could be sunk out of the
loop:
.L10:
jal get_random_u64
nop
dmultu $2,$17
mflo $2
sltu $6,$2,$16
bne $6,$0,.L10
mfhi $3
In this case, there's nothing *better* to do in the delay slot than mfhi,
but that's kind of an accident.
The code I'd hope to see is
Alpha:
$L9:
bsr $26,get_random_u64
mulq $0,$9,$1
cmpule $10,$1,$1
beq $1,$L9
umulh $0,$9,$0
PowerPC:
.L12:
bl get_random_u64
mulld 9,3,31
cmpld 7,30,9
bgt+ 7,.L12
mulhdu 3,3,31
and (when the mulditi3 expander is added) MIPS r6:
.L10:
balc get_random_u64
dmulu $3, $2, $17
sltu $3, $3, $16
bnezc $3, .L10
dmuhu $2, $2, $17
In these cases, since the low-half multiply is the last multiply in
the loop, the high half will still catch the hardware-optimized case
for both halves of a multiply.=
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic