
List:       llvm-dev
Subject:    Re: [LLVMdev] Vectorized  LLVM IR
From:       Stéphane Letz <letz () grame ! fr>
Date:       2010-05-29 7:42:15
Message-ID: 45E47D93-A61E-4631-8D93-E2C295F1818B () grame ! fr

On 29 May 2010, at 01:08, Bill Wendling wrote:

> Hi Stéphane,
> 
> The SSE support in the LLVM backend is fine. What is the code that's generated? Do
> you have some short examples of where LLVM doesn't do as well as the equivalent
> scalar code?
> 
> -bw
> 
> On May 28, 2010, at 12:13 PM, Stéphane Letz wrote:


We are currently testing LLVM for the Faust language (http://faust.grame.fr/).

Currently Faust generates a C++ class from a .dsp Faust source file. Take the
following simple Faust example:

process = (+,+):*;

It can be displayed as the following processor, which takes 4 streams of float
samples, applies a "+" to each pair of streams, and then a "*" to the two sums to
produce a single output stream:


["plus.png" (plus.png)]



The scalar C++ code is:

virtual void compute (int count, FAUSTFLOAT** input, FAUSTFLOAT** output) {
	FAUSTFLOAT* input0 = input[0];
	FAUSTFLOAT* input1 = input[1];
	FAUSTFLOAT* input2 = input[2];
	FAUSTFLOAT* input3 = input[3];
	FAUSTFLOAT* output0 = output[0];
	for (int i=0; i<count; i++) {
		output0[i] = (FAUSTFLOAT)(((float)input2[i] + (float)input3[i]) * ((float)input0[i] + (float)input1[i]));
	}
}

The "vectorized" C++ code is:

virtual void compute (int fullcount, FAUSTFLOAT** input, FAUSTFLOAT** output) {
	for (int index = 0; index < fullcount; index += 32) {
		int count = min(32, fullcount-index);
		FAUSTFLOAT* input0 = &input[0][index];
		FAUSTFLOAT* input1 = &input[1][index];
		FAUSTFLOAT* input2 = &input[2][index];
		FAUSTFLOAT* input3 = &input[3][index];
		FAUSTFLOAT* output0 = &output[0][index];
		// SECTION : 1
		for (int i=0; i<count; i++) {
			output0[i] = (FAUSTFLOAT)(((float)input2[i] + (float)input3[i]) * ((float)input0[i] + (float)input1[i]));
		}
	}
}

(So the C++ code is split into "vectors" [here 32 samples], each computed in a
separate inner loop that compilers such as Intel ICC can auto-vectorize. This works
quite well.)
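
To make the chunking concrete, here is a minimal standalone sketch of both variants as free functions, with a small check that they produce the same output. It is an illustration, not the code Faust emits: the typedef of FAUSTFLOAT to float, the constant kVecSize, and the helper chunked_matches_scalar are assumptions introduced for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef float FAUSTFLOAT;  // assumption: single-precision samples

// Scalar variant: one loop over all samples.
void compute_scalar(int count, FAUSTFLOAT** input, FAUSTFLOAT** output) {
    FAUSTFLOAT* input0 = input[0];
    FAUSTFLOAT* input1 = input[1];
    FAUSTFLOAT* input2 = input[2];
    FAUSTFLOAT* input3 = input[3];
    FAUSTFLOAT* output0 = output[0];
    for (int i = 0; i < count; i++) {
        output0[i] = (input2[i] + input3[i]) * (input0[i] + input1[i]);
    }
}

// Chunked variant: the outer loop walks blocks of kVecSize samples and the
// inner loop has a small bounded trip count, which an auto-vectorizer may
// be able to turn into SIMD code.
void compute_chunked(int fullcount, FAUSTFLOAT** input, FAUSTFLOAT** output) {
    const int kVecSize = 32;  // hypothetical name for the block size
    for (int index = 0; index < fullcount; index += kVecSize) {
        int count = std::min(kVecSize, fullcount - index);  // last block may be short
        FAUSTFLOAT* input0 = &input[0][index];
        FAUSTFLOAT* input1 = &input[1][index];
        FAUSTFLOAT* input2 = &input[2][index];
        FAUSTFLOAT* input3 = &input[3][index];
        FAUSTFLOAT* output0 = &output[0][index];
        for (int i = 0; i < count; i++) {
            output0[i] = (input2[i] + input3[i]) * (input0[i] + input1[i]);
        }
    }
}

// Hypothetical helper: runs both variants on the same n samples and reports
// whether the two output buffers are identical.
bool chunked_matches_scalar(int n) {
    assert(n >= 0);
    std::vector<std::vector<FAUSTFLOAT> > in(4, std::vector<FAUSTFLOAT>(n));
    for (int c = 0; c < 4; c++) {
        for (int i = 0; i < n; i++) {
            in[c][i] = 0.5f * (c + 1) + 0.25f * i;  // arbitrary test signal
        }
    }
    std::vector<FAUSTFLOAT> outA(n), outB(n);
    FAUSTFLOAT* inputs[4] = { in[0].data(), in[1].data(), in[2].data(), in[3].data() };
    FAUSTFLOAT* outputsA[1] = { outA.data() };
    FAUSTFLOAT* outputsB[1] = { outB.data() };
    compute_scalar(n, inputs, outputsA);
    compute_chunked(n, inputs, outputsB);
    return outA == outB;
}
```

Calling chunked_matches_scalar with a count that is not a multiple of 32 (for example 100) exercises the short final block handled by the min().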

The scalar LLVM code is:

define void @llvm_compute(%struct.llvm_dsp*  %obj, i32 %count, float** noalias %inputs, float** noalias %outputs) nounwind readnone ssp {
	entry:
	    %input_array_ptr0 = getelementptr inbounds float** %inputs, i64 0
	    %input0 = load float** %input_array_ptr0, align 8
	    %input_array_ptr1 = getelementptr inbounds float** %inputs, i64 1
	    %input1 = load float** %input_array_ptr1, align 8
	    %input_array_ptr2 = getelementptr inbounds float** %inputs, i64 2
	    %input2 = load float** %input_array_ptr2, align 8
	    %input_array_ptr3 = getelementptr inbounds float** %inputs, i64 3
	    %input3 = load float** %input_array_ptr3, align 8
	    %output_array_ptr0 = getelementptr inbounds float** %outputs, i64 0
	    %output0 = load float** %output_array_ptr0, align 8
	%out = icmp sgt i32 %count, 0
	br i1 %out, label %convert, label %return
	convert:
		%count_64 = zext i32 %count to i64
		br label %loop
	loop:
		%indvar = phi i64 [ 0, %convert ], [ %indvar.next, %loop ]
		%output_ptr0 = getelementptr float* %output0, i64 %indvar
		%input_ptr1 = getelementptr float* %input1, i64 %indvar
		%fTemp0 = load float* %input_ptr1, align 4
		%input_ptr0 = getelementptr float* %input0, i64 %indvar
		%fTemp1 = load float* %input_ptr0, align 4
		%fTemp2 = fadd float %fTemp1, %fTemp0
		%input_ptr3 = getelementptr float* %input3, i64 %indvar
		%fTemp3 = load float* %input_ptr3, align 4
		%input_ptr2 = getelementptr float* %input2, i64 %indvar
		%fTemp4 = load float* %input_ptr2, align 4
		%fTemp5 = fadd float %fTemp4, %fTemp3
		%fTemp6 = fmul float %fTemp5, %fTemp2
		store float %fTemp6, float* %output_ptr0, align 4
		%indvar.next = add i64 %indvar, 1
		%exitcond = icmp eq i64 %indvar.next, %count_64
		br i1 %exitcond, label %return, label %loop
	return:
		ret void
}


And the vectorized LLVM code is:

define void @llvm_compute(%struct.llvm_dsp* noalias %obj, i32 %count, <32 x float>** noalias %inputs, <32 x float>** noalias %outputs) nounwind readnone ssp {
		entry:
		    %input_array_ptr0 = getelementptr inbounds <32 x float>** %inputs, i64 0
		    %input0 = load <32 x float>** %input_array_ptr0
		    %input_array_ptr1 = getelementptr inbounds <32 x float>** %inputs, i64 1
		    %input1 = load <32 x float>** %input_array_ptr1
		    %input_array_ptr2 = getelementptr inbounds <32 x float>** %inputs, i64 2
		    %input2 = load <32 x float>** %input_array_ptr2
		    %input_array_ptr3 = getelementptr inbounds <32 x float>** %inputs, i64 3
		    %input3 = load <32 x float>** %input_array_ptr3
		    %output_array_ptr0 = getelementptr inbounds <32 x float>** %outputs, i64 0
		    %output0 = load <32 x float>** %output_array_ptr0
		    %out = icmp sgt i32 %count, 0
		    br i1 %out, label %convert, label %return
		convert:
			%count_64 = zext i32 %count to i64
			br label %loop0
		loop0:
			%indvar = phi i64 [ 0, %convert ], [ %indvar.next, %loop0 ]
			%output_ptr0 = getelementptr <32 x float>* %output0, i64 %indvar
			%input_ptr1 = getelementptr <32 x float>* %input1, i64 %indvar
			%fVector0 = load <32 x float>* %input_ptr1, align 16;
			%input_ptr0 = getelementptr <32 x float>* %input0, i64 %indvar
			%fVector1 = load <32 x float>* %input_ptr0, align 16;
			%fVector2 = fadd <32 x float> %fVector1, %fVector0;
			%input_ptr3 = getelementptr <32 x float>* %input3, i64 %indvar
			%fVector3 = load <32 x float>* %input_ptr3, align 16;
			%input_ptr2 = getelementptr <32 x float>* %input2, i64 %indvar
			%fVector4 = load <32 x float>* %input_ptr2, align 16;
			%fVector5 = fadd <32 x float> %fVector4, %fVector3;
			%fVector6 = fmul <32 x float> %fVector5, %fVector2;
			store <32 x float> %fVector6, <32 x float>* %output_ptr0, align 16
		
			%indvar.next = add i64 %indvar, 1
			%exitcond = icmp eq i64 %indvar.next, %count_64
			br i1 %exitcond, label %return, label %loop0
		return:
			ret void
}
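
As a back-of-the-envelope aid for reading the IR above, the sketch below computes how many 128-bit SSE registers one <32 x float> value occupies. It assumes a 4-byte float and 16-byte XMM registers; the names kSseRegisterBytes and sse_registers_for are hypothetical. Whether this register footprint explains the timings we see is not established here.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: plain SSE, where one XMM register is 128 bits = 16 bytes.
const std::size_t kSseRegisterBytes = 16;

// Number of SSE registers needed to hold one vector of `lanes` floats,
// rounding up when the vector does not fill the last register.
std::size_t sse_registers_for(std::size_t lanes) {
    assert(lanes > 0);
    std::size_t bytes = lanes * sizeof(float);
    return (bytes + kSseRegisterBytes - 1) / kSseRegisterBytes;
}
```

Under these assumptions sse_registers_for(32) gives 8, so a single <32 x float> value is much wider than one hardware register, while a <4 x float> value fits exactly in one.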

We tried changing the "align" on the loads/stores and adding "noalias" to the compute
function parameters, without any real change.

Do you see anything clearly incorrect in the generated vectorized LLVM code?
Or maybe memory bandwidth is the limiting factor in this simple example, which does
little computation per sample?
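
As a rough sanity check on the bandwidth hypothesis, here is a small sketch that counts the memory traffic and arithmetic per output sample for this loop: 4 float loads and 1 float store, against 2 additions and 1 multiplication. The struct LoopCost and both function names are hypothetical, and the count ignores caches entirely.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical record of the work done for one output sample.
struct LoopCost {
    std::size_t bytes_moved;  // bytes loaded plus bytes stored
    std::size_t flops;        // floating-point operations
};

// Per output sample the loop loads input0..input3 and stores output0,
// and performs two fadd and one fmul.
LoopCost cost_per_sample() {
    LoopCost c;
    c.bytes_moved = (4 + 1) * sizeof(float);
    c.flops = 2 + 1;
    return c;
}

// Arithmetic intensity: floating-point operations per byte of memory traffic.
double flops_per_byte(const LoopCost& c) {
    assert(c.bytes_moved > 0);
    return static_cast<double>(c.flops) / static_cast<double>(c.bytes_moved);
}
```

With a 4-byte float this comes to 3 operations per 20 bytes, about 0.15 flops per byte, which is consistent with the idea that the loop spends most of its time moving data rather than computing.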

Thanks.

Stéphane Letz



_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

