'Re: RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange()'


List:       openjdk-hotspot-compiler-dev
Subject:    Re: RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange()
From:       Xiaohong Gong <xgong () openjdk ! org>
Date:       2023-01-30 3:29:16
Message-ID: 2qVUjzI54duZendHLzYBgzMfXTts5E_ehYabh38Du8o=.267353cc-b74c-4aec-8ba2-c55e5e2ca4f4 () github ! com

On Wed, 18 Jan 2023 08:58:42 GMT, Xiaohong Gong <xgong@openjdk.org> wrote:

> The Vector API `"indexInRange(int offset, int limit)"` is used
> to compute a vector mask whose lanes are set to true if the
> index of the lane is inside the range specified by the `"offset"`
> and `"limit"` arguments, otherwise the lanes are set to false.
> 
> There are two special cases for this API:
> 1) If `"offset >= 0 && offset >= limit"`, all the lanes of the
> generated mask are false.
> 2) If `"offset >= 0 && limit - offset >= vlength"`, all the
> lanes of the generated mask are true. Note that `"vlength"` is
> the number of vector lanes.
> 
> For these special cases, we can simply use `"maskAll(false|true)"`
> to implement the API. Otherwise, the original comparison with the
> `"iota"` vector is needed. As a further optimization, SVE provides
> a dedicated instruction (i.e. whilelo [1]) that can implement the
> API directly if `"offset >= 0"`.
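
To make the two special cases concrete, here is a minimal scalar sketch of the range-mask definition. It is only a model: `RangeMaskModel` and `rangeMask` are hypothetical names, a plain `boolean[]` stands in for `VectorMask`, and the final AND with the receiver mask is left out.

```java
import java.util.Arrays;

/** Scalar model of the range mask; boolean[] stands in for VectorMask (hypothetical helper). */
final class RangeMaskModel {

    /** Lane n is set iff 0 <= offset + n < limit. */
    static boolean[] rangeMask(int vlength, int offset, int limit) {
        boolean[] lanes = new boolean[vlength];
        for (int n = 0; n < vlength; n++) {
            long index = (long) offset + n; // widen so offset + n cannot overflow
            lanes[n] = index >= 0 && index < limit;
        }
        return lanes;
    }

    public static void main(String[] args) {
        int vlength = 4; // e.g. 4 double lanes in a 256-bit vector
        // Case 1: offset >= limit, every lane is false.
        System.out.println(Arrays.toString(rangeMask(vlength, 8, 8)));
        // Case 2: limit - offset >= vlength, every lane is true.
        System.out.println(Arrays.toString(rangeMask(vlength, 0, 1027)));
        // Loop tail: only the first limit - offset lanes are true.
        System.out.println(Arrays.toString(rangeMask(vlength, 1024, 1027)));
        // Negative offset: the leading lanes fall outside the range.
        System.out.println(Arrays.toString(rangeMask(vlength, -2, 3)));
    }
}
```

The tail call with `offset = 1024` and `limit = 1027` sets the first three lanes only, which is the case the `"iota"` comparison (or `whilelo`) has to handle.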
> 
> In summary, to optimize the API, we can use if-else branches to
> handle the special cases at the Java level and have the C2 compiler
> intrinsify the remaining case:
> 
> 
> public VectorMask<E> indexInRange(int offset, int limit) {
>     if (offset < 0) {
>         return this.and(indexInRange0Helper(offset, limit));
>     } else if (offset >= limit) {
>         return this.and(vectorSpecies().maskAll(false));
>     } else if (limit - offset >= length()) {
>         return this.and(vectorSpecies().maskAll(true));
>     }
>     return this.and(indexInRange0(offset, limit));
> }
> 
> 
> The last part (i.e. `"indexInRange0"`) in the above implementation
> is expected to be intrinsified by the C2 compiler if the necessary
> IR nodes are supported. Otherwise, it falls back to the original API
> implementation (i.e. `"indexInRange0Helper"`). Regarding the
> intrinsification, the compiler generates the `"VectorMaskGen"` IR
> with `"limit - offset"` as the input if the current platform supports
> it. Otherwise, it generates `"VectorLoadConst + VectorMaskCmp"` based
> on `"iota < limit - offset"`.
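
As a sanity check on the branch structure, the following scalar sketch compares the dispatch against the plain per-lane definition. Everything here is an assumption made for illustration: `IndexInRangeDispatchModel` and its methods are hypothetical names, `boolean[]` stands in for `VectorMask`, `generalCompare` plays the role of the fallback helper, `iotaCompare` plays the role of the intrinsified `"iota < limit - offset"` path, and the AND with the receiver mask is omitted.

```java
import java.util.Arrays;

/** Scalar sketch of the if-else dispatch; not the JDK implementation. */
final class IndexInRangeDispatchModel {

    /** Models maskAll(bit): every lane gets the same value. */
    static boolean[] maskAll(int vlength, boolean bit) {
        boolean[] lanes = new boolean[vlength];
        Arrays.fill(lanes, bit);
        return lanes;
    }

    /** Models the general fallback: lane n is set iff 0 <= offset + n < limit. */
    static boolean[] generalCompare(int vlength, int offset, int limit) {
        boolean[] lanes = new boolean[vlength];
        for (int n = 0; n < vlength; n++) {
            long index = (long) offset + n; // widen to avoid int overflow
            lanes[n] = index >= 0 && index < limit;
        }
        return lanes;
    }

    /** Models "iota < limit - offset"; only valid when 0 <= offset < limit. */
    static boolean[] iotaCompare(int vlength, int offset, int limit) {
        int remaining = limit - offset; // cannot overflow under the precondition
        boolean[] lanes = new boolean[vlength];
        for (int n = 0; n < vlength; n++) {
            lanes[n] = n < remaining;
        }
        return lanes;
    }

    /** Same branch order as the Java-level code quoted above. */
    static boolean[] indexInRange(int vlength, int offset, int limit) {
        if (offset < 0) {
            return generalCompare(vlength, offset, limit);
        } else if (offset >= limit) {
            return maskAll(vlength, false);
        } else if (limit - offset >= vlength) {
            return maskAll(vlength, true);
        }
        return iotaCompare(vlength, offset, limit);
    }

    public static void main(String[] args) {
        int vlength = 4;
        for (int offset = -8; offset <= 40; offset++) {
            for (int limit = -8; limit <= 40; limit++) {
                boolean[] expected = generalCompare(vlength, offset, limit);
                boolean[] actual = indexInRange(vlength, offset, limit);
                if (!Arrays.equals(expected, actual)) {
                    throw new AssertionError("mismatch at offset=" + offset + ", limit=" + limit);
                }
            }
        }
        System.out.println("dispatch model agrees with the per-lane definition");
    }
}
```

The point of the sketch is that the last branch is only reached with `0 <= offset < limit`, so a single unsigned-style comparison of the lane index against `limit - offset` is enough there.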
> 
> For the following Java code which uses `"indexInRange"`:
> 
> 
> static final VectorSpecies<Double> SPECIES =
>     DoubleVector.SPECIES_PREFERRED;
> static final int LENGTH = 1027;
> 
> public static double[] da;
> public static double[] db;
> public static double[] dc;
> 
> private static void func() {
>     for (int i = 0; i < LENGTH; i += SPECIES.length()) {
>         var m = SPECIES.indexInRange(i, LENGTH);
>         var av = DoubleVector.fromArray(SPECIES, da, i, m);
>         av.lanewise(VectorOperators.NEG).intoArray(dc, i, m);
>     }
> }
> 
> 
> The core code generated with SVE 256-bit vector size is:
> 
> 
> ptrue   p2.d                  ; maskAll(true)
> ...
> LOOP:
> ...
> sub     w11, w13, w14         ; limit - offset
> cmp     w14, w13
> b.cs    LABEL-1               ; if (offset >= limit) => uncommon-trap
> cmp     w11, #0x4
> b.lt    LABEL-2               ; if (limit - offset < vlength)
> mov     p1.b, p2.b
> LABEL-3:
> ld1d    {z16.d}, p1/z, [x10]  ; load vector masked
> ...
> cmp     w14, w29
> b.cc    LOOP
> ...
> LABEL-2:
> whilelo p1.d, x16, x10        ; VectorMaskGen
> ...
> b       LABEL-3
> ...
> LABEL-1:
> uncommon-trap
> 
> 
> Please note that if the array size `LENGTH` is aligned with
> the 256-bit vector size (e.g. `LENGTH = 1024`), the "LABEL-2"
> branch will be optimized out by the compiler and becomes another
> uncommon-trap.
> 
> For NEON, the main CFG is the same as above, but the compiler
> intrinsification is different. Here is the code:
> 
> 
> sub     x10, x10, x12          ; limit - offset
> scvtf   d16, x10
> dup     v16.2d, v16.d[0]       ; replicateD
> 
> mov     x8, #0xd8d0
> movk    x8, #0x84cb, lsl #16
> movk    x8, #0xffff, lsl #32
> ldr     q17, [x8], #0          ; load the "iota" const vector
> fcmgt   v18.2d, v16.2d, v17.2d ; mask = iota < limit - offset
> 
> 
> Here is the performance data of the newly added benchmark on an ARM
> SVE 256-bit platform:
> 
> 
> Benchmark                               (size)  Before    After   Units
> IndexInRangeBenchmark.byteIndexInRange   1024 11203.697 41404.431 ops/ms
> IndexInRangeBenchmark.byteIndexInRange   1027  2365.920  8747.004 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1024  1227.505  6092.194 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1027   351.215  1156.683 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1024  1468.876 11032.580 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1027   699.645  2439.671 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1024  2842.187 11903.544 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1027   689.866  2547.424 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1024  1394.135  5902.973 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1027   355.621  1189.458 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1024  5521.468 21578.340 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1027  1264.816  4640.504 ops/ms
> 
> 
> And the performance data with ARM NEON:
> 
> 
> Benchmark                               (size)  Before    After   Units
> IndexInRangeBenchmark.byteIndexInRange   1024  4026.548 15562.880 ops/ms
> IndexInRangeBenchmark.byteIndexInRange   1027   305.314   576.559 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1024   289.224  2244.080 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange 1027    39.740    76.499 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1024   675.264  4457.470 ops/ms
> IndexInRangeBenchmark.floatIndexInRange  1027    79.918   144.952 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1024   740.139  4014.583 ops/ms
> IndexInRangeBenchmark.intIndexInRange    1027    78.608   147.903 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1024   400.683  2209.551 ops/ms
> IndexInRangeBenchmark.longIndexInRange   1027    41.146    69.599 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1024  1821.736  8153.546 ops/ms
> IndexInRangeBenchmark.shortIndexInRange  1027   158.810   243.205 ops/ms
> 
> 
> Performance improves by about `3.7x ~ 7.8x` on the vector-size-aligned
> (size 1024) benchmarks with both NEON and SVE, and by about
> `3.5x/1.8x` on the unaligned (size 1027) benchmarks with SVE/NEON
> respectively. A similar improvement can be observed on x86
> platforms.
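
For anyone who wants to recheck the ratios, a small sketch that recomputes the After/Before speedups from the two tables above. The numbers are copied from those tables; the class and method names (`SpeedupSummary`, `minSpeedup`, `maxSpeedup`) are hypothetical.

```java
/** Recomputes After/Before ratios from the benchmark tables (illustrative only). */
final class SpeedupSummary {

    // Rows are {before, after} in ops/ms, ordered byte, double, float, int, long, short.
    static final double[][] SVE_1024 = {
        {11203.697, 41404.431}, {1227.505, 6092.194}, {1468.876, 11032.580},
        {2842.187, 11903.544}, {1394.135, 5902.973}, {5521.468, 21578.340}};
    static final double[][] SVE_1027 = {
        {2365.920, 8747.004}, {351.215, 1156.683}, {699.645, 2439.671},
        {689.866, 2547.424}, {355.621, 1189.458}, {1264.816, 4640.504}};
    static final double[][] NEON_1024 = {
        {4026.548, 15562.880}, {289.224, 2244.080}, {675.264, 4457.470},
        {740.139, 4014.583}, {400.683, 2209.551}, {1821.736, 8153.546}};
    static final double[][] NEON_1027 = {
        {305.314, 576.559}, {39.740, 76.499}, {79.918, 144.952},
        {78.608, 147.903}, {41.146, 69.599}, {158.810, 243.205}};

    /** Smallest After/Before ratio in the given rows. */
    static double minSpeedup(double[][] rows) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] row : rows) {
            min = Math.min(min, row[1] / row[0]);
        }
        return min;
    }

    /** Largest After/Before ratio in the given rows. */
    static double maxSpeedup(double[][] rows) {
        double max = Double.NEGATIVE_INFINITY;
        for (double[] row : rows) {
            max = Math.max(max, row[1] / row[0]);
        }
        return max;
    }

    static void report(String name, double[][] rows) {
        System.out.printf("%-10s %.2fx to %.2fx%n", name, minSpeedup(rows), maxSpeedup(rows));
    }

    public static void main(String[] args) {
        report("SVE 1024", SVE_1024);
        report("SVE 1027", SVE_1027);
        report("NEON 1024", NEON_1024);
        report("NEON 1027", NEON_1027);
    }
}
```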
> 
> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-
> 

Hi @PaulSandoz, @jatin-bhateja, @sviswa7, could you please help to take a look at
this optimization? Any feedback is welcome. Thanks in advance!

-------------

PR: https://git.openjdk.org/jdk/pull/12064

