[prev in list] [next in list] [prev in thread] [next in thread] 

List:       gcc
Subject:    current NEON status
From:       Marat Zakirov <m.zakirov () samsung ! com>
Date:       2014-07-30 13:38:24
Message-ID: 53D8F550.2060007 () samsung ! com
[Download RAW message or body]

Hi there!

My question came from bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725. I found that GCC 
considers NEON register ranges as unsplittable. So any subregister may 
be used only after whole chunk is dead. This issue leads to redundant 
spill fills which is performance trouble.

Example 1: RAL trouble

#include <arm_neon.h>
#include <inttypes.h>

extern  uint16x8x4_t m0;
extern  uint16x8x4_t m1;
extern  uint16x8x4_t m2;
extern  uint16x8x4_t m3;
extern  uint16x8_t   m4;

void foo1(uint16_t * in_ptr)
{
      uint16x8x4_t t0, t1, t2, t3;
      t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
      t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
      t2 = vld4q_u16((uint16_t *)&in_ptr[128]);
      t3 = vld4q_u16((uint16_t *)&in_ptr[192]);
      m4 = t0.val[3];
      m4 = m4 * 3;                       <<< *
      t0.val[3] = t1.val[3];
      m0 = t3;
      m1 = t2;
      m2 = t1;
      m3 = t0;
}

Here test uses all NEON registers. No spill is needed. Because 
multiplication requires one Q register which may be obtained from dead 
t0.val[3] subregister. But GCC makes spill if multiplication (*) exists 
because of issue described above.

Example 2: CSE makes trouble for IRA

#include <arm_neon.h>
#include <inttypes.h>

extern  uint16x8x4_t m0;
extern  uint16x8x4_t m1;

void foo2(uint16_t * in_ptr)
{
      uint16x8x4_t t0, t1;
      t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
      t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
      t0.val[0] *= 333;
      t0.val[1] *= 333;
      t0.val[2] *= 333;
      t0.val[3] *= 333;
      t1.val[0] *= 333;
      t1.val[1] *= 333;
      t1.val[2] *= 333;
      t1.val[3] *= 333;
      m0 = t0;
      m1 = t1;
}

Here test uses only half NEON + one Q for '333' factor. But GCC makes 
spills here too! Briefly speak problem is in partial CSE. GCC generates 
rtl with the listed bellow form:

Before CSE:

a = b
a0 = a0 * 3
a1 = a1 * 3
a2 = a2 * 3
a3 = a3 * 3

After:

a = b
a0 = b0 * 3
a1 = a1 * 3 <<< *
a2 = a2 * 3
a3 = a3 * 3

CSE do not substitute b1 to a1 because at the moment (*) a0 was already 
defined so actually a != b. Yes but a1 = b1, unfortunately CSE also do 
not handle register-ranges parts as RA does. Strange thing here is that 
even if we fix CSE, so CSE could propagate register-ranges subregs, this 
will make trouble to RAL also because of the same reason: IRA do not 
handle precisely register ranges parts. I attached a demo patch which 
forbids partial CSE propagation and removes spills from Ex2. Is this 
patch OK? Or maybe CSE should be fixed in a different way? Or maybe 
partial substitution is OK?

Main question: Are there any plans to fix/upgrade IRA?

--Marat

["ximode.diff" (text/x-patch)]

gcc/ChangeLog:

2014-07-30  Marat Zakirov  <m.zakirov@samsung.com>

	* cse.c (canon_reg): Forbid partial CSE.
	* fwprop.c (forward_propagate_and_simplify): Likewise.

diff --git a/gcc/cse.c b/gcc/cse.c
index 34f9364..a9e0442 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -2862,6 +2862,9 @@ canon_reg (rtx x, rtx insn)
 	    || ! REGNO_QTY_VALID_P (REGNO (x)))
 	  return x;
 
+        if (GET_MODE (x) == XImode)
+          return x;
+
 	q = REG_QTY (REGNO (x));
 	ent = &qty_table[q];
 	first = ent->first_reg;
diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index 547fcd6..eadc729 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1317,6 +1317,9 @@ forward_propagate_and_simplify (df_ref use, rtx def_insn, rtx def_set)
   if (!new_rtx)
     return false;
 
+  if (GET_MODE (reg) == XImode)
+    return false;
+
   return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
 }
 


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic