'Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling'


List:       openjdk-hotspot-compiler-dev
Subject:    Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling
From:       Jie Fu <fujie () loongson ! cn>
Date:       2019-09-27 1:07:09
Message-ID: d6443a91-b888-cf47-646d-36cf60a8431f () loongson ! cn

Thanks, Vivek, for your help.

On 2019/9/27 7:16 AM, Deshpande, Vivek R wrote:
> Hi Jie
> 
> I tried the patch from webrev.04 with NUM=4096, and it looks like AVX512
> instructions are being generated. I will do some more perf runs and let
> you know.
> 
> Regards,
> Vivek
> 
> -----Original Message-----
> From: Jie Fu [mailto:fujie@loongson.cn]
> Sent: Wednesday, September 25, 2019 8:51 AM
> To: Deshpande, Vivek R <vivek.r.deshpande@intel.com>; Vladimir Kozlov
> <vladimir.kozlov@oracle.com>; hotspot-compiler-dev@openjdk.java.net;
> Viswanathan, Sandhya <sandhya.viswanathan@intel.com>
> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
> over loop unrolling
> 
> Hi Vivek,
> 
> Thanks for your review and help. Please see responses below.
> 
> 1. In my experience, compiling with the full vector length is not always
> the best choice, especially for small loops. For example, when I run your
> test case with NUM = 256 and NUM = 128 on my avx-256 machine, performance
> improves by 28% and 36% respectively with 16-byte vectors instead of the
> full available vector width (32-byte vectors).
> 
> 2. My fix [1] aims to improve performance for small loops while keeping
> performance for large loops the same as with the original implementation.
> The patch adds a heuristic to protect against over-unrolling with
> SuperWordLoopUnrollAnalysis. For a more detailed quantitative analysis,
> please refer to [2].
> 
> 3. I don't quite understand why your test case has to be compiled with
> 512-bit vectors. Could you please explain why? For your test case, my
> patch uses 256-bit vectors to protect against over-unrolling. If I recall
> correctly, there is no performance difference between 512-bit and 256-bit
> vectors on your machine. However, that doesn't mean 512-bit vectors are
> never generated. If you increase NUM in your program (e.g., NUM=4096), you
> will find that 512-bit vectors are generated on your machine. I can't see
> the benefit of using 512-bit vectors here, which is why I keep asking this
> question. I'd really appreciate it if you could answer it.
> 
> To validate the effectiveness of the patch, you can test the performance
> for NUM = 256 and 128 on your avx-512 machine.
> 
> Looking forward to your reply.
> 
> Thanks a lot.
> Best regards,
> Jie
> 
> [1] http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
> [2]
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>  
> 
> On 2019/9/25 1:20 AM, Deshpande, Vivek R wrote:
> > Hi Jie
> > 
> > Maybe you missed my earlier reply; I tried your patch from webrev.04.
> > It does not use the full 512 bits of the vector and generates 256-bit
> > vector instructions. The log is similar to that of the earlier patch
> > from webrev.03.
> > Maybe it would work if you tweaked this condition:
> > if (future_unroll_factor > cur_trip_cnt) break;
> > 
> > Regards,
> > Vivek
> > 
> > 
> > -----Original Message-----
> > From: Jie Fu [mailto:fujie@loongson.cn]
> > Sent: Tuesday, September 24, 2019 7:59 AM
> > To: Deshpande, Vivek R <vivek.r.deshpande@intel.com>; Vladimir Kozlov
> > <vladimir.kozlov@oracle.com>; hotspot-compiler-dev@openjdk.java.net;
> > Viswanathan, Sandhya <sandhya.viswanathan@intel.com>
> > Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
> > over loop unrolling
> > 
> > Hi Vivek,
> > 
> > Could you tell me whether the not-unroll-after-vectorization problem was
> > fixed by webrev.04 on your avx-512 machine? If not, could you please
> > share the compile log with me?
> > 
> > Thanks a lot.
> > Best regards,
> > Jie
> > 
> > [1]
> > https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
> >  
> > On 2019/9/18 9:46 AM, Jie Fu wrote:
> > > Hi Vivek,
> > > 
> > > Thank you for your help.
> > > 
> > > Does webrev.04 fix the not-unroll-after-vectorization problem you
> > > mentioned in [1] on your avx-512 machine?
> > > 
> > > The patch just adds a heuristic [2] to protect against over-unrolling
> > > with SuperWordLoopUnrollAnalysis.
> > > In order to use the full available vector width,
> > > SuperWordLoopUnrollAnalysis unrolls loops much more
> > > aggressively, which may hurt performance in some cases.
> > > One important reason for this performance degradation is that
> > > SuperWordLoopUnrollAnalysis doesn't consider the negative
> > > impact on the pre/post-loops at all.
> > > The current SuperWordLoopUnrollAnalysis focuses on reducing the
> > > iterations of the main-loop, but ignores the increase in iterations
> > > in the pre/post-loops.
> > > For a more detailed quantitative analysis of that case, please refer
> > > to [2].
> > > 
> > > Thanks a lot.
> > > Best regards,
> > > Jie
> > > 
> > > [1]
> > > https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
> > > [2]
> > > https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
> > > 
> > > On 2019/9/17 10:55 PM, Deshpande, Vivek R wrote:
> > > > Hi Jie
> > > > 
> > > > I tried your patch from webrev.04. I still see behavior similar to
> > > > that of the earlier patch, so I am trying to understand what your
> > > > new patch is doing and how we can fix it.
> > > > 
> > > > Regards,
> > > > Vivek
> > > > 
> > > > -----Original Message-----
> > > > From: Jie Fu [mailto:fujie@loongson.cn]
> > > > Sent: Tuesday, September 10, 2019 8:42 PM
> > > > To: Deshpande, Vivek R <vivek.r.deshpande@intel.com>; Vladimir Kozlov
> > > > <vladimir.kozlov@oracle.com>; hotspot-compiler-dev@openjdk.java.net;
> > > > Viswanathan, Sandhya <sandhya.viswanathan@intel.com>
> > > > Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
> > > > over loop unrolling
> > > > 
> > > > Hi Vivek,
> > > > 
> > > > Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
> > > > 
> > > > With the help of your compile logs, I successfully reproduced the
> > > > not-unroll-after-vectorization problem you mentioned in [1].
> > > > It is fixed on my avx-256 machine with this version.
> > > > The patch just adds a heuristic [2] to protect against over-unrolling
> > > > with SuperWordLoopUnrollAnalysis.
> > > > Please review it and give me some advice.
> > > > 
> > > > Again, if you see any problems on your avx-512 machine, could you
> > > > please share the compile logs with me, especially for NUM = 256,
> > > > 2048 and 4096?
> > > > Please see comments inline.
> > > > 
> > > > Thanks a lot.
> > > > Best regards,
> > > > Jie
> > > > 
> > > > [1]
> > > > https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
> > > > 
> > > > [2]
> > > > https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
> > > > 
> > > > 
> > > > On 2019/9/7 7:35 AM, Deshpande, Vivek R wrote:
> > > > > Hi Jie
> > > > > 
> > > > > I experimented with both sizes, 1024 and 2048 bytes, and it looks
> > > > > like the 2nd compilation generates suboptimal code with a shorter
> > > > > vector width.
> > > > I still don't think it's a problem, since according to your
> > > > performance analysis there is no performance gain from using the
> > > > full available vector width.
> > > > 
> > > > 
> > > > > Please find it attached.
> > > > > IMO, the fix you have should be able to unroll enough to use the
> > > > > full available vector width.
> > > > Why?
> > > > Unfortunately, compiling with full available vector width can be
> > > > harmful to performance.
> > > > I ran your test case with NUM = 256 and 128 on my avx-256
> > > > machine and found that performance suffered with the full
> > > > available vector width (32-byte vectors).
> > > > After the patch, performance (with 16-byte vectors) for NUM = 256
> > > > and 128 improved by 28% and 36% respectively.
> > > > 
> > > > So I wonder what the performance is before and after the patch for
> > > > NUM = 256 and 128 on your avx-512 machine.
> > > > Could you please share that with us as well?
> > > > 
> > > > Thanks.
> > > > 
> > > > 
> > > > > Regards,
> > > > > Vivek

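The argument in the quoted messages, that aggressive unrolling shortens the
main-loop but pushes more iterations into the pre/post-loops, can be
illustrated with a small model. This is only a sketch under simplifying
assumptions, not HotSpot's implementation: the class UnrollModel and all of
its methods are hypothetical names, every leftover element is assumed to run
as one scalar pre/post-loop iteration, and the guard in cappedUnrollFactor
merely mimics the shape of the condition quoted in the thread
(future_unroll_factor > cur_trip_cnt) rather than reproducing the patch.

```java
// Hypothetical model of loop unrolling cost; not taken from HotSpot sources.
final class UnrollModel {
    private UnrollModel() {}

    /** Iterations executed by the unrolled (vectorized) main-loop. */
    public static int mainLoopIterations(int tripCount, int unrollFactor) {
        return tripCount / unrollFactor;
    }

    /**
     * Elements the main-loop cannot cover. Assumption: each one costs a
     * scalar iteration in a pre/post-loop.
     */
    public static int leftoverIterations(int tripCount, int unrollFactor) {
        return tripCount % unrollFactor;
    }

    /** Total iterations across the main-loop and the pre/post-loops. */
    public static int totalIterations(int tripCount, int unrollFactor) {
        return mainLoopIterations(tripCount, unrollFactor)
             + leftoverIterations(tripCount, unrollFactor);
    }

    /**
     * Doubles the unroll factor up to maxUnrollFactor, but stops as soon as
     * the next factor would exceed the trip count. The doubling scheme is an
     * assumption made for illustration.
     */
    public static int cappedUnrollFactor(int tripCount, int maxUnrollFactor) {
        int factor = 1;
        while (factor * 2 <= maxUnrollFactor) {
            int futureUnrollFactor = factor * 2;
            if (futureUnrollFactor > tripCount) {
                break; // unrolling further would leave the main-loop empty
            }
            factor = futureUnrollFactor;
        }
        return factor;
    }

    public static void main(String[] args) {
        int tripCount = 100;
        for (int factor : new int[] {16, 64}) {
            System.out.println("factor=" + factor
                + " main=" + mainLoopIterations(tripCount, factor)
                + " leftover=" + leftoverIterations(tripCount, factor)
                + " total=" + totalIterations(tripCount, factor));
        }
    }
}
```

With a trip count of 100, this model gives 6 main-loop iterations plus 4
leftover iterations at factor 16, versus 1 main-loop iteration plus 36
leftover iterations at factor 64. The larger factor wins on the main-loop
but loses overall, which is the effect described for small loops.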
