List:       openjdk-hotspot-compiler-dev
Subject:    Re: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2]
From:       Yuri Gaevsky <duke () openjdk ! org>
Date:       2024-01-30 16:45:25
Message-ID: -BozG265CkpO9U1kgyog_37ezPgkxUgGj_XFGgiMaWI=.d2203393-0fe7-42a4-ab47-43d39a8c5240 () github ! com

On Tue, 30 Jan 2024 16:35:14 GMT, Yuri Gaevsky <duke@openjdk.org> wrote:

> > Hi, I don't quite understand why there is a need to change LMUL from `m4` to `m2`
> > if we are switching to the stripmining approach. The tail calculation should
> > normally share the code for `VEC_LOOP`, which also means we need to use some
> > vector mask instructions to filter out the active elements for each loop
> > iteration, especially the iteration for handling the tail elements. And the vl
> > returned by `vsetvli` tells us the number of elements which can be processed in
> > parallel in a given iteration ([1] is one example). I am not sure if you are
> > trying it this way. Do you have more details or code changes to share? Thanks.
> > [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew
> > 
> 
> I used the m4->m2 change to process 8 elements in the tail with vector instructions
> after the main vector loop. IIUC, the m4->m2 change at runtime is very costly, so I've
> created another patch with the same goal but **without** the m4->m2 change:
> void C2_MacroAssembler::arrays_hashcode_v(Register ary, Register cnt, Register result,
>                                           Register tmp1, Register tmp2, Register tmp3,
>                                           Register tmp4, Register tmp5, Register tmp6,
>                                           BasicType eltype)
> {
> ...
> const int nof_vec_elems = MaxVectorSize;
> const int hof_vec_elems = nof_vec_elems >> 1;
> const int elsize_bytes = arrays_hashcode_elsize(eltype);
> const int elsize_shift = exact_log2(elsize_bytes);
> const int vec_step_bytes = nof_vec_elems << elsize_shift;
> const int half_vec_step_bytes = vec_step_bytes >> 1;
> const address adr_pows31 = StubRoutines::riscv::arrays_hashcode_powers_of_31()
> + sizeof(jint);
> 
> ...
> 
> const Register chunks = tmp1;
> const Register chunks_end = chunks;
> const Register pows31 = tmp2;
> const Register powmax = tmp3;
> 
> const VectorRegister v_coeffs =  v4;
> const VectorRegister v_src    =  v8;
> const VectorRegister v_sum    = v12;
> const VectorRegister v_powmax = v16;
> const VectorRegister v_result = v20;
> const VectorRegister v_tmp    = v24;
> const VectorRegister v_zred   = v28;
> 
> Label DONE, TAIL, TAIL_LOOP, PRE_TAIL, SAVE_VRESULT, WIDE_TAIL, VEC_LOOP;
> 
> // result has a value initially
> 
> beqz(cnt, DONE);
> 
> andi(chunks, cnt, ~(hof_vec_elems-1));
> beqz(chunks, TAIL);
> 
> // load pre-calculated powers of 31
> la(pows31, ExternalAddress(adr_pows31));
> mv(t1, nof_vec_elems);
> vsetvli(t0, t1, Assembler::e32, Assembler::m4);
> vle32_v(v_coeffs, pows31);
> // clear vector registers used in intermediate calculations
> vmv_v_i(v_sum, 0);
> vmv_v_i(v_powmax, 0);
> vmv_v_i(v_result, 0);
> // set initial values
> vmv_s_x(v_result, result);
> vmv_s_x(v_zred, x0);
> 
> andi(chunks, cnt, ~(nof_vec_elems-1));
> beqz(chunks, WIDE_TAIL);
> 
> subw(cnt, cnt, chunks);
> slli(chunks_end, chunks, elsize_shift);
> add(chunks_end, ary, chunks_end);
> // get value of 31^^nof_vec_elems
> lw(powmax, Address(pows31, -1 * sizeof(jint)));
> vmv_s_x(v_powmax, powmax);
> 
> bind(VEC_LOOP);
> // result = result * 31^^(hof_vec_elems) + v_src[0] * 31^^(hof_vec_elems-1)
> //                                + ...  + v_src[hof_vec_elems-1] * 31^^(0)
> vmul_vv(v_result, v_result, v...

Of course, any ideas for improving the code are very welcome.
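
For reference, the arithmetic described in the VEC_LOOP comment above can be modelled in plain scalar C++. This is only a sketch under assumptions, not the patch itself: `hash_scalar`, `hash_chunked`, `pows` and `vlmax` are hypothetical names introduced here, and `vl` is computed as min(remaining, vlmax) to imitate the element count a stripmining loop would get back from `vsetvli`. Unsigned 32-bit arithmetic is used so that overflow wraps like Java `int`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference form: result = 31 * result + a[i] for every element,
// with wrapping 32-bit arithmetic.
uint32_t hash_scalar(const std::vector<int32_t>& a, uint32_t result) {
    for (int32_t x : a) {
        result = 31u * result + static_cast<uint32_t>(x);
    }
    return result;
}

// Chunked form (hypothetical model): each iteration consumes vl elements and
// computes result = result * 31^vl + a[i] * 31^(vl-1) + ... + a[i+vl-1] * 31^0.
// vlmax must be greater than zero.
uint32_t hash_chunked(const std::vector<int32_t>& a, uint32_t result,
                      std::size_t vlmax) {
    // pows[k] = 31^k modulo 2^32, for k = 0..vlmax
    std::vector<uint32_t> pows(vlmax + 1, 1u);
    for (std::size_t k = 1; k <= vlmax; ++k) {
        pows[k] = pows[k - 1] * 31u;
    }

    std::size_t i = 0;
    while (i < a.size()) {
        // Number of elements handled in this iteration; the last iteration
        // may be shorter, which is how the tail shares the loop body.
        const std::size_t vl = std::min(vlmax, a.size() - i);
        uint32_t sum = 0;
        for (std::size_t j = 0; j < vl; ++j) {
            sum += static_cast<uint32_t>(a[i + j]) * pows[vl - 1 - j];
        }
        result = result * pows[vl] + sum;
        i += vl;
    }
    return result;
}
```

For any `vlmax > 0` both functions should return the same value, which makes the scalar form a convenient check when experimenting with different chunk sizes.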

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r1471587439

