'[Bug rtl-optimization/89895] New: Unable to sink high half of widening multiply out of loop'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       gcc-bugs
Subject:    [Bug rtl-optimization/89895] New: Unable to sink high half of widening multiply out of loop
From:       "lkml at sdf dot org" <gcc-bugzilla () gcc ! gnu ! org>
Date:       2019-03-31 7:54:05
Message-ID: bug-89895-4 () http ! gcc ! gnu ! org/bugzilla/
[Download RAW message or body]

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89895

            Bug ID: 89895
           Summary: Unable to sink high half of widening multiply out of
                    loop
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lkml at sdf dot org
  Target Milestone: ---

This is part of gcc's general problem with double-word values, but
I was encouraged to submit a PR, since it's a particularly simple but
real-world-applicable test case.

Lemire's algorithm for uniform random number generation in a range
(https://arxiv.org/abs/1805.10941) has the following core:

static uint64_t __attribute__((noinline)) get_random_u64(void);

u64 get_random_range(uint64_t range, uint64_t lim)
{
        unsigned __int128 prod;

        do {
                prod = (unsigned __int128)range * get_random_u64();
        } while ((uint64_t)prod < lim);
        return prod >> 64;
}

(In practice, get_random_u64() would be inlined, but I've left it
out of line for exposition.)

GCC's isn't sinking generation of the high half of the product out
of the loop.  This particularly applies on platforms with a separate
multiply-high instruction like alpha:
$L9:
        bsr $26,get_random_u64          !samegp
        mulq $0,$9,$1
        umulh $0,$9,$0
        cmpule $10,$1,$1
        beq $1,$L9
and PowerPC:
.L12:
        bl get_random_u64
        mulld 9,3,31
        mulhdu 3,3,31
        cmpld 7,30,9
        bgt+ 7,.L12
But is also applies to MIPS, where the mfhi could be sunk out of the
loop:
.L10:
        jal     get_random_u64
        nop

        dmultu  $2,$17
        mflo    $2
        sltu    $6,$2,$16
        bne     $6,$0,.L10
        mfhi    $3

In this case, there's nothing *better* to do in the delay slot than mfhi,
but that's kind of an accident.

The code I'd hope to see is
Alpha:
$L9:
        bsr $26,get_random_u64
        mulq $0,$9,$1
        cmpule $10,$1,$1
        beq $1,$L9
        umulh $0,$9,$0
PowerPC:
.L12:
        bl get_random_u64
        mulld 9,3,31
        cmpld 7,30,9
        bgt+ 7,.L12
        mulhdu 3,3,31
and (when the mulditi3 expander is added) MIPS r6:
.L10:
        balc    get_random_u64
        dmulu   $3, $2, $17
        sltu    $3, $3, $16
        bnezc   $3, .L10
        dmuhu   $2, $2, $17

In these cases, since the low-half multiply is the last multiply in
the loop, the high half will still catch the hardware-optimized case
for both halves of a multiply.=
[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic