List: mesa3d-dev
Subject: Re: [Mesa-dev] [PATCH v4 23/44] i965/fs: Enables 16-bit load_ubo with sampler
From: Jason Ekstrand <jason () jlekstrand ! net>
Date: 2017-11-30 22:58:03
Message-ID: CAOFGe966ig19T7uqsXqp+hEn7MaeG5vdQdKSX-oHFQiTK+yRUg () mail ! gmail ! com
On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo <
jmcasanova@igalia.com> wrote:
> load_ubo uses 32-bit loads, as uniform surfaces have a 32-bit
> surface format defined. So when reading 16-bit components with the
> sampler we need to unshuffle two 16-bit components from each 32-bit
> component.
>
> Using the sampler avoids the use of the byte_scattered_read message
> that needs one message for each component and is supposed to be
> slower.
>
> In the case of SKL+ we take advantage of a hardware feature that
> automatically defines a channel mask based on the rlen value, so on
> SKL+ we only use half of the registers without using a header in the
> payload.
> ---
> src/intel/compiler/brw_fs.cpp           | 31 +++++++++++++++++++++++++++----
> src/intel/compiler/brw_fs_generator.cpp | 10 ++++++++--
> 2 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 1ca4d416b2..9c543496ba 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -184,9 +184,17 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const fs_builder &bld,
>      * a double this means we are only loading 2 elements worth of data.
>      * We also want to use a 32-bit data type for the dst of the load operation
>      * so other parts of the driver don't get confused about the size of the
> -     * result.
> +     * result. In the case of 16-bit data we only need half of the 32-bit
> +     * components on SKL+, as we take advantage of the message return size
> +     * to define an xy channel mask.
>      */
> -   fs_reg vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
> +   fs_reg vec4_result;
> +   if (type_sz(dst.type) == 2 && (devinfo->gen >= 9)) {
> +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 2);
> +      vec4_result = retype(vec4_result, BRW_REGISTER_TYPE_HF);
> +   } else {
> +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
> +   }
>
>    fs_inst *inst = bld.emit(FS_OPCODE_VARYING_PULL_CONSTANT_LOAD_LOGICAL,
>                             vec4_result, surf_index, vec4_offset);
>    inst->size_written = 4 * vec4_result.component_size(inst->exec_size);
> @@ -197,8 +205,23 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const fs_builder &bld,
>    }
>
>    vec4_result.type = dst.type;
> -   bld.MOV(dst, offset(vec4_result, bld,
> -                      (const_offset & 0xf) / type_sz(vec4_result.type)));
> +
> +   if (type_sz(dst.type) == 2) {
> +      /* 16-bit types need to be unshuffled, as each pair of 16-bit
> +       * components is packed into a 32-bit component because we use a
> +       * 32-bit format in the uniform surface that is read by the sampler.
> +       * TODO: On BDW+ mark when a uniform has a 16-bit type so we could
> +       * set up a 16-bit surface format and use the 16-bit return format
> +       * at the sampler.
> +       */
> +      vec4_result.stride = 2;
> +      bld.MOV(dst, byte_offset(offset(vec4_result, bld,
> +                                       (const_offset & 0x7) / 4),
> +                               (const_offset & 0x7) / 2 % 2 * 2));
> +   } else {
> +      bld.MOV(dst, offset(vec4_result, bld,
> +                          (const_offset & 0xf) / type_sz(vec4_result.type)));
> +   }
>
This seems overly complicated. How about something like:

   fs_reg dw = offset(vec4_result, bld, (const_offset & 0xf) / 4);
   switch (type_sz(dst.type)) {
   case 2:
      shuffle_32bit_load_result_to_16bit_data(bld, dst, dw, 1);
      bld.MOV(dst, subscript(dw, dst.type, (const_offset / 2) & 1));
      break;
   case 4:
      bld.MOV(dst, dw);
      break;
   case 8:
      shuffle_32bit_load_result_to_64bit_data(bld, dst, dw, 1);
      break;
   default:
      unreachable("Invalid data size");
   }
> }
>
> /**
> diff --git a/src/intel/compiler/brw_fs_generator.cpp b/src/intel/compiler/brw_fs_generator.cpp
> index a3861cd68e..00a4e29147 100644
> --- a/src/intel/compiler/brw_fs_generator.cpp
> +++ b/src/intel/compiler/brw_fs_generator.cpp
> @@ -1381,12 +1381,18 @@ fs_generator::generate_varying_pull_constant_load_gen7(fs_inst *inst,
>    uint32_t simd_mode, rlen, mlen;
>    if (inst->exec_size == 16) {
>       mlen = 2;
> -      rlen = 8;
> +      if (type_sz(dst.type) == 2 && (devinfo->gen >= 9))
> +         rlen = 4;
> +      else
> +         rlen = 8;
>
I'm not sure what I think of this. We intentionally use a vec4 today
instead of a single float because it lets us potentially batch up a bunch
of loads in a single texture operation. Are we sure that we never need
more than 2 of those result registers?
--Jason
>       simd_mode = BRW_SAMPLER_SIMD_MODE_SIMD16;
>    } else {
>       assert(inst->exec_size == 8);
>       mlen = 1;
> -      rlen = 4;
> +      if (type_sz(dst.type) == 2 && (devinfo->gen >= 9))
> +         rlen = 2;
> +      else
> +         rlen = 4;
>       simd_mode = BRW_SAMPLER_SIMD_MODE_SIMD8;
>    }
>
> --
> 2.14.3
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>