List: llvm-dev
Subject: Re: [LLVMdev] Vectorized LLVM IR
From: Stéphane Letz <letz () grame ! fr>
Date: 2010-05-29 7:42:15
Message-ID: 45E47D93-A61E-4631-8D93-E2C295F1818B () grame ! fr
On 29 May 2010 at 01:08, Bill Wendling wrote:
> Hi Stéphane,
>
> The SSE support in the LLVM backend is fine. What is the code that's generated? Do
> you have some short examples of where LLVM doesn't do as well as the equivalent
> scalar code?
> -bw
>
> On May 28, 2010, at 12:13 PM, Stéphane Letz wrote:
We are actually testing LLVM for the Faust language (http://faust.grame.fr/).
Currently Faust generates a C++ class from its .dsp Faust source file. Take the
following simple Faust example:
process = (+,+):*;
It can be displayed as the following processor, which takes 4 streams of float samples,
applies a "+" to each pair of streams and then a "*" to the two sums, producing a
single output:
[Attachment: "plus.png" (diagram of the processor)]
The scalar C++ code is:
virtual void compute (int count, FAUSTFLOAT** input, FAUSTFLOAT** output) {
    FAUSTFLOAT* input0 = input[0];
    FAUSTFLOAT* input1 = input[1];
    FAUSTFLOAT* input2 = input[2];
    FAUSTFLOAT* input3 = input[3];
    FAUSTFLOAT* output0 = output[0];
    for (int i=0; i<count; i++) {
        output0[i] = (FAUSTFLOAT)(((float)input2[i] + (float)input3[i]) * ((float)input0[i] + (float)input1[i]));
    }
}
The "vectorized" C++ code is:
virtual void compute (int fullcount, FAUSTFLOAT** input, FAUSTFLOAT** output) {
    for (int index = 0; index < fullcount; index += 32) {
        int count = min(32, fullcount-index);
        FAUSTFLOAT* input0 = &input[0][index];
        FAUSTFLOAT* input1 = &input[1][index];
        FAUSTFLOAT* input2 = &input[2][index];
        FAUSTFLOAT* input3 = &input[3][index];
        FAUSTFLOAT* output0 = &output[0][index];
        // SECTION : 1
        for (int i=0; i<count; i++) {
            output0[i] = (FAUSTFLOAT)(((float)input2[i] + (float)input3[i]) * ((float)input0[i] + (float)input1[i]));
        }
    }
}
(So basically the C++ code is split into "vectors" [here 32 samples], each computed
in a separate loop that can be auto-vectorized by some compilers such as Intel ICC.
This works quite well.)
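For reference, here is a minimal self-contained sketch (not the code Faust emits) that checks that the two loop shapes above produce identical output, including when the buffer length is not a multiple of 32. The names `compute_scalar`, `compute_chunked` and `variants_agree` are hypothetical, and plain `float` stands in for FAUSTFLOAT:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the scalar loop: out = (in2 + in3) * (in0 + in1).
void compute_scalar(int count, const float* const* input, float* const* output) {
    const float* in0 = input[0];
    const float* in1 = input[1];
    const float* in2 = input[2];
    const float* in3 = input[3];
    float* out0 = output[0];
    for (int i = 0; i < count; i++) {
        out0[i] = (in2[i] + in3[i]) * (in0[i] + in1[i]);
    }
}

// Hypothetical stand-in for the chunked loop: same arithmetic, processed in
// blocks of at most 32 samples so the inner loop has a small, bounded trip count.
void compute_chunked(int fullcount, const float* const* input, float* const* output) {
    const int kChunk = 32;  // block size taken from the generated code above
    for (int index = 0; index < fullcount; index += kChunk) {
        int count = std::min(kChunk, fullcount - index);
        const float* in0 = &input[0][index];
        const float* in1 = &input[1][index];
        const float* in2 = &input[2][index];
        const float* in3 = &input[3][index];
        float* out0 = &output[0][index];
        for (int i = 0; i < count; i++) {
            out0[i] = (in2[i] + in3[i]) * (in0[i] + in1[i]);
        }
    }
}

// Runs both variants on the same n-sample test signals and compares the results.
bool variants_agree(int n) {
    assert(n >= 0);
    std::vector<float> a(n), b(n), c(n), d(n), out_s(n), out_c(n);
    for (int i = 0; i < n; i++) {
        a[i] = 0.5f * i;
        b[i] = 1.0f + i;
        c[i] = 2.0f * i;
        d[i] = 0.25f * i;
    }
    const float* in[4] = { a.data(), b.data(), c.data(), d.data() };
    float* os[1] = { out_s.data() };
    float* oc[1] = { out_c.data() };
    compute_scalar(n, in, os);
    compute_chunked(n, in, oc);
    return out_s == out_c;
}
```

Calling `variants_agree(100)` exercises three full blocks plus a 4-sample tail. Both variants do the same operations in the same order per sample, so an exact comparison is reasonable here.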
The scalar LLVM code is:
define void @llvm_compute(%struct.llvm_dsp* %obj, i32 %count, float** noalias %inputs, float** noalias %outputs) nounwind readnone ssp {
entry:
%input_array_ptr0 = getelementptr inbounds float** %inputs, i64 0
%input0 = load float** %input_array_ptr0, align 8
%input_array_ptr1 = getelementptr inbounds float** %inputs, i64 1
%input1 = load float** %input_array_ptr1, align 8
%input_array_ptr2 = getelementptr inbounds float** %inputs, i64 2
%input2 = load float** %input_array_ptr2, align 8
%input_array_ptr3 = getelementptr inbounds float** %inputs, i64 3
%input3 = load float** %input_array_ptr3, align 8
%output_array_ptr0 = getelementptr inbounds float** %outputs, i64 0
%output0 = load float** %output_array_ptr0, align 8
%out = icmp sgt i32 %count, 0
br i1 %out, label %convert, label %return
convert:
%count_64 = zext i32 %count to i64
br label %loop
loop:
%indvar = phi i64 [ 0, %convert ], [ %indvar.next, %loop ]
%output_ptr0 = getelementptr float* %output0, i64 %indvar
%input_ptr1 = getelementptr float* %input1, i64 %indvar
%fTemp0 = load float* %input_ptr1, align 4
%input_ptr0 = getelementptr float* %input0, i64 %indvar
%fTemp1 = load float* %input_ptr0, align 4
%fTemp2 = fadd float %fTemp1, %fTemp0
%input_ptr3 = getelementptr float* %input3, i64 %indvar
%fTemp3 = load float* %input_ptr3, align 4
%input_ptr2 = getelementptr float* %input2, i64 %indvar
%fTemp4 = load float* %input_ptr2, align 4
%fTemp5 = fadd float %fTemp4, %fTemp3
%fTemp6 = fmul float %fTemp5, %fTemp2
store float %fTemp6, float* %output_ptr0, align 4
%indvar.next = add i64 %indvar, 1
%exitcond = icmp eq i64 %indvar.next, %count_64
br i1 %exitcond, label %return, label %loop
return:
ret void
}
And the vectorized LLVM code is:
define void @llvm_compute(%struct.llvm_dsp* noalias %obj, i32 %count, <32 x float>** noalias %inputs, <32 x float>** noalias %outputs) nounwind readnone ssp {
entry:
%input_array_ptr0 = getelementptr inbounds <32 x float>** %inputs, i64 0
%input0 = load <32 x float>** %input_array_ptr0
%input_array_ptr1 = getelementptr inbounds <32 x float>** %inputs, i64 1
%input1 = load <32 x float>** %input_array_ptr1
%input_array_ptr2 = getelementptr inbounds <32 x float>** %inputs, i64 2
%input2 = load <32 x float>** %input_array_ptr2
%input_array_ptr3 = getelementptr inbounds <32 x float>** %inputs, i64 3
%input3 = load <32 x float>** %input_array_ptr3
%output_array_ptr0 = getelementptr inbounds <32 x float>** %outputs, i64 0
%output0 = load <32 x float>** %output_array_ptr0
%out = icmp sgt i32 %count, 0
br i1 %out, label %convert, label %return
convert:
%count_64 = zext i32 %count to i64
br label %loop0
loop0:
%indvar = phi i64 [ 0, %convert ], [ %indvar.next, %loop0 ]
%output_ptr0 = getelementptr <32 x float>* %output0, i64 %indvar
%input_ptr1 = getelementptr <32 x float>* %input1, i64 %indvar
%fVector0 = load <32 x float>* %input_ptr1, align 16;
%input_ptr0 = getelementptr <32 x float>* %input0, i64 %indvar
%fVector1 = load <32 x float>* %input_ptr0, align 16;
%fVector2 = fadd <32 x float> %fVector1, %fVector0;
%input_ptr3 = getelementptr <32 x float>* %input3, i64 %indvar
%fVector3 = load <32 x float>* %input_ptr3, align 16;
%input_ptr2 = getelementptr <32 x float>* %input2, i64 %indvar
%fVector4 = load <32 x float>* %input_ptr2, align 16;
%fVector5 = fadd <32 x float> %fVector4, %fVector3;
%fVector6 = fmul <32 x float> %fVector5, %fVector2;
store <32 x float> %fVector6, <32 x float>* %output_ptr0, align 16
%indvar.next = add i64 %indvar, 1
%exitcond = icmp eq i64 %indvar.next, %count_64
br i1 %exitcond, label %return, label %loop0
return:
ret void
}
We tried playing with the "align" on the loads/stores and with "noalias" on the compute
function parameters, without any real change.
Do you see anything clearly incorrect in the generated vectorized LLVM code?
Or maybe memory bandwidth is the limiting factor in this simple example, which does
little computation per sample?
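To make the bandwidth question concrete, here is a rough back-of-the-envelope sketch. It assumes 4-byte floats and that every sample is read from or written to memory exactly once (no cache reuse); all names are hypothetical:

```cpp
#include <cassert>

// Assumed per-sample cost of: output0[i] = (in2[i] + in3[i]) * (in0[i] + in1[i])
constexpr int kLoadsPerSample  = 4;  // one load from each of the four input streams
constexpr int kStoresPerSample = 1;  // one store to the single output stream
constexpr int kFlopsPerSample  = 3;  // two fadd + one fmul

// Bytes moved per output sample (20 when sizeof(float) == 4).
constexpr int kBytesPerSample =
    (kLoadsPerSample + kStoresPerSample) * static_cast<int>(sizeof(float));

// Arithmetic intensity: floating-point operations per byte of memory traffic.
constexpr double flops_per_byte() {
    return static_cast<double>(kFlopsPerSample) / kBytesPerSample;
}

// Upper bound on throughput (samples per second) imposed by memory traffic
// alone, for a sustained bandwidth supplied by the caller (a measured or
// assumed figure; none is hard-coded here).
constexpr double max_samples_per_second(double bytes_per_second) {
    return bytes_per_second / kBytesPerSample;
}
```

Under these assumptions the loop performs about 0.15 floating-point operations per byte moved. A ratio this low is consistent with the loop being limited by memory traffic rather than by arithmetic, although it does not prove it, especially for small buffers that stay in cache.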
Thanks.
Stéphane Letz
_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev