optimization - Vectorizing Code for efficient implementation
I have the following IIR code. I need to vectorize it so that I can write NEON code efficiently.
Example of vectorization. Non-vectorized code:
    for (i = 0; i < 100; i++)
        a[i] = a[i] * b[i];   // only 1 independent multiplication; cannot take
                              // advantage of multiple multiplication units
Vectorized code:
    for (i = 0; i < 25; i++) {
        a[i*4]   = a[i*4]   * b[i*4];     // four independent multiplications can use
        a[i*4+1] = a[i*4+1] * b[i*4+1];   // multiple multiplication units to perform
        a[i*4+2] = a[i*4+2] * b[i*4+2];   // the operation in parallel
        a[i*4+3] = a[i*4+3] * b[i*4+3];
    }
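For reference, a minimal sketch of the same element-wise multiply handling four lanes (elements i, i+1, i+2, i+3) per iteration. It assumes the arrays hold `float` and that the length is a multiple of 4, neither of which is stated above. The function name `mul4` is made up for illustration. When the compiler targets NEON it uses the `vld1q_f32` / `vmulq_f32` / `vst1q_f32` intrinsics, otherwise it falls back to four scalar multiplies.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: a[i] = a[i] * b[i], four elements per iteration.
   Assumes float data and n divisible by 4. */
void mul4(float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON)
        float32x4_t va = vld1q_f32(a + i);    /* load a[i..i+3] */
        float32x4_t vb = vld1q_f32(b + i);    /* load b[i..i+3] */
        vst1q_f32(a + i, vmulq_f32(va, vb));  /* four products in one instruction */
#else
        /* Scalar fallback: four independent multiplications. */
        a[i + 0] *= b[i + 0];
        a[i + 1] *= b[i + 1];
        a[i + 2] *= b[i + 2];
        a[i + 3] *= b[i + 3];
#endif
    }
}
```

Calling `mul4(a, b, 100)` would cover the 100-element case from the example.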
Please help me vectorize the loop below so that the code can be implemented efficiently using the vector capability of the hardware (my hardware can perform 4 multiplications simultaneously).
    main()
    {
        for (j = 0; j < numbquad; j++) {
            for (i = 2; i < samples + 2; i++) {
                w[i]   = x[i-2] + a1[j] * w[i-1] + a2[j] * w[i-2];
                y[i-2] = w[i]   + b1[j] * w[i-1] + b2[j] * w[i-2];
            }
            w[0] = 0;
            w[1] = 0;
        }
    }
Once you have fixed (or verified) the equations, you should notice that there are 4 independent multiplications in each round of the equation. The task then becomes finding the proper, and smallest, set of instructions to permute the input vectors x[...], y[...], w[...] into registers:
    q0 = | w[i-1] | w[i-2] | w[i-1] | w[i-2] |
    q1 = | a1[j]  | a2[j]  | b1[j]  | b2[j]  |   // vld1.32 {d0,d1}, [r1]!
    q2 = q0 .* q1
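To make the lane layout concrete, here is a minimal sketch in plain C of one inner-loop step, with a 4-element array standing in for each q register. The function name `biquad_step` and the use of `float` are assumptions for illustration only. On NEON the element-wise product loop would correspond to a single `vmulq_f32`, and the two pairwise sums are the part that still needs a cheap permute or pairwise add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: one step of
     w[i]   = x[i-2] + a1*w[i-1] + a2*w[i-2]
     y[i-2] = w[i]   + b1*w[i-1] + b2*w[i-2]
   coef = { a1, a2, b1, b2 } (the q1 layout).
   Returns y[i-2] and stores the new w[i] in *w_out. */
float biquad_step(float x, float w1, float w2, const float coef[4], float *w_out)
{
    assert(coef != NULL && w_out != NULL);

    const float q0[4] = { w1, w2, w1, w2 };  /* | w[i-1] | w[i-2] | w[i-1] | w[i-2] | */
    float q2[4];
    for (int k = 0; k < 4; k++)
        q2[k] = q0[k] * coef[k];             /* four independent multiplications */

    float w = x + q2[0] + q2[1];             /* lanes 0 and 1 feed w[i] */
    *w_out = w;
    return w + q2[2] + q2[3];                /* lanes 2 and 3 feed y[i-2] */
}
```

Note that all four products depend only on the previous two w values, which is why they can be issued together before either sum is formed.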
A potentially more effective method, wavefront parallelism, can be achieved by inverting the loops.
    x0 = *x++;
    w0 = x0 + a*w1 + b*w2;   // pipeline warming stage
    y0 = w0 + c*w1 + d*w2;   //

    [REPEAT THIS]
                             // W2 = W1; W1 = W0;
    W0 = y0 + A*W1 + B*W2;
    Y0 = W0 + C*W1 + D*W2;
                             // w2 = w1; w1 = w0;
    x0 = *x++;
    *output++ = Y0;
    w0 = x0 + a*w1 + b*w2;
    y0 = w0 + c*w1 + d*w2;
    [REPEAT ENDS]

    W0 = y0 + A*W1 + B*W2;   // pipeline cooling stage
    Y0 = W0 + C*W1 + D*W2;
    *output++ = Y0;
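Written out as compilable C for a two-stage cascade, the scheme above might look roughly like the sketch below. Lower-case names belong to the first stage and upper-case names to the second. The `Stage` struct, the function names and the use of `float` are assumptions for illustration. `cascade_plain` is included only as a sample-by-sample baseline to compare against.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coefficients of one stage:
   w0 = in + a*w1 + b*w2;  out = w0 + c*w1 + d*w2 */
typedef struct { float a, b, c, d; } Stage;

/* Baseline: on every sample, stage 2 has to wait for stage 1. */
void cascade_plain(const float *x, float *out, size_t n, Stage s1, Stage s2)
{
    float w0, w1 = 0.0f, w2 = 0.0f, y0;
    float W0, W1 = 0.0f, W2 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        w0 = x[i] + s1.a*w1 + s1.b*w2;
        y0 = w0 + s1.c*w1 + s1.d*w2;
        w2 = w1; w1 = w0;
        W0 = y0 + s2.a*W1 + s2.b*W2;
        out[i] = W0 + s2.c*W1 + s2.d*W2;
        W2 = W1; W1 = W0;
    }
}

/* Wavefront: stage 2 handles sample i-1 while stage 1 handles sample i. */
void cascade_wavefront(const float *x, float *out, size_t n, Stage s1, Stage s2)
{
    float w0, w1 = 0.0f, w2 = 0.0f, y0;
    float W0, W1 = 0.0f, W2 = 0.0f, Y0;
    assert(n >= 1);

    w0 = x[0] + s1.a*w1 + s1.b*w2;        /* pipeline warming stage */
    y0 = w0 + s1.c*w1 + s1.d*w2;

    for (size_t i = 1; i < n; i++) {
        W0 = y0 + s2.a*W1 + s2.b*W2;      /* stage 2, sample i-1 */
        Y0 = W0 + s2.c*W1 + s2.d*W2;
        W2 = W1; W1 = W0;
        w2 = w1; w1 = w0;
        out[i - 1] = Y0;
        w0 = x[i] + s1.a*w1 + s1.b*w2;    /* stage 1, sample i: reads no W0/Y0 */
        y0 = w0 + s1.c*w1 + s1.d*w2;
    }

    W0 = y0 + s2.a*W1 + s2.b*W2;          /* pipeline cooling stage */
    out[n - 1] = W0 + s2.c*W1 + s2.d*W2;
}
```

Both functions perform the same arithmetic per sample; only the order in which the two stages are interleaved differs.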
While there are still dependencies between x0->w0->y0->w0->y0, there is an opportunity for full 2-way parallelism between the lower-case and upper-case expressions. One can also try to get rid of the shifting of values (w2=w1; w1=w0;) by unrolling the loop and doing manual register renaming.
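One possible way to do the renaming, shown for a single stage: unroll by three so that the three state variables rotate through the roles "newest", "previous" and "oldest" instead of being copied. This is only a sketch under assumptions that are not in the text above: `float` data, a sample count that is a multiple of 3, and made-up function names. `biquad_shift` is the version with explicit shifts, kept for comparison.

```c
#include <assert.h>
#include <stddef.h>

/* With shifts: two register moves after every sample. */
void biquad_shift(const float *x, float *y, size_t n,
                  float a, float b, float c, float d)
{
    float w0, w1 = 0.0f, w2 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        w0   = x[i] + a*w1 + b*w2;
        y[i] = w0   + c*w1 + d*w2;
        w2 = w1; w1 = w0;
    }
}

/* Unrolled by 3 with renamed registers: no moves at all.
   Each step overwrites the register holding the oldest value.
   Assumes n is a multiple of 3. */
void biquad_renamed(const float *x, float *y, size_t n,
                    float a, float b, float c, float d)
{
    float r0, r1 = 0.0f, r2 = 0.0f;
    assert(n % 3 == 0);
    for (size_t i = 0; i < n; i += 3) {
        r0 = x[i]     + a*r1 + b*r2;  y[i]     = r0 + c*r1 + d*r2;  /* newest in r0 */
        r2 = x[i + 1] + a*r0 + b*r1;  y[i + 1] = r2 + c*r0 + d*r1;  /* newest in r2 */
        r1 = x[i + 2] + a*r2 + b*r0;  y[i + 2] = r1 + c*r2 + d*r0;  /* newest in r1 */
    }
}
```

After the third step the newest value sits in r1 and the previous one in r2, which is exactly the state the first step expects, so the loop body can repeat unchanged.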