optimization - Vectorizing Code for efficient implementation

I have the following IIR code. I need to vectorize it so that I can write NEON code for it efficiently.

Example of vectorization. Non-vectorized code:

    for(i=0;i<100;i++)
        a[i] = a[i]*b[i];   // only one independent multiplication per iteration;
                            // cannot take advantage of multiple multiplication units

Vectorized code:

    for(i=0;i<25;i++)
    {
        a[i*4]   = a[i*4]  *b[i*4];     // four independent multiplications can use
        a[i*4+1] = a[i*4+1]*b[i*4+1];   // multiple multiplication units to perform
        a[i*4+2] = a[i*4+2]*b[i*4+2];   // the operation in parallel
        a[i*4+3] = a[i*4+3]*b[i*4+3];
    }
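As a rough sketch of how the unrolled loop above could map onto NEON, the function below uses the `vld1q_f32`, `vmulq_f32` and `vst1q_f32` intrinsics when the compiler defines `__ARM_NEON`, and falls back to the plain 4-way unrolled form otherwise. The element type `float` and the function name `mul_inplace_x4` are assumptions made for illustration; the question does not state either.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: a[i] = a[i] * b[i] for i in [0, n), four elements
   per iteration. The float element type is an assumption. */
void mul_inplace_x4(float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    size_t i = 0;

    /* Main loop: one group of four independent multiplications per pass. */
    for (; i + 4 <= n; i += 4) {
#if defined(__ARM_NEON)
        float32x4_t va = vld1q_f32(a + i);     /* load a[i..i+3]    */
        float32x4_t vb = vld1q_f32(b + i);     /* load b[i..i+3]    */
        vst1q_f32(a + i, vmulq_f32(va, vb));   /* 4 products, store */
#else
        a[i]     *= b[i];
        a[i + 1] *= b[i + 1];
        a[i + 2] *= b[i + 2];
        a[i + 3] *= b[i + 3];
#endif
    }

    /* Tail: handle the remaining 0..3 elements one at a time. */
    for (; i < n; i++)
        a[i] *= b[i];
}
```

With n = 100 as in the example, the main loop runs 25 times and the tail loop does nothing.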

Please help me vectorize the loop below so that the code can be implemented efficiently using the vector capability of the hardware (my hardware can perform 4 multiplications simultaneously).

    main()
    {
        for(j=0;j<numbquad;j++)
        {
            for(i=2;i<samples+2;i++)
            {
                w[i]   = x[i-2] + a1[j]*w[i-1] + a2[j]*w[i-2];
                y[i-2] = w[i]   + b1[j]*w[i-1] + b2[j]*w[i-2];
            }
            w[0] = 0;
            w[1] = 0;
        }
    }

Once you have fixed (or verified) the equations, you should notice that there are 4 independent multiplications in each round of the equation. The task then becomes finding the proper, and the smallest, set of instructions to permute the input vectors x[...], y[...], w[...] into registers:

    q0 = | w[i-1] | w[i-2] | w[i-1] | w[i-2] |
    q1 = | a1[j]  | a2[j]  | b1[j]  | b2[j]  |   // vld1.32 {d0,d1}, [r1]!
    q2 = q0 .* q1
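To make the register layout above concrete, here is a minimal portable C sketch of one biquad section that forms all four products as a single 4-lane element-wise multiply and then sums the lanes. The function name `biquad_lanes`, the `float` type, and packing the coefficients as `{a1, a2, b1, b2}` in one array are assumptions for illustration. On NEON the inner `k` loop would correspond to one vector multiply, and the permutation that builds `q0` is the part the answer says needs the fewest possible instructions.

```c
#include <assert.h>
#include <stddef.h>

/* One section of the filter from the question, written so that the four
   products per sample form one 4-lane multiply:
     q0 = | w[i-1] | w[i-2] | w[i-1] | w[i-2] |
     q1 = | a1     | a2     | b1     | b2     |   (coeffs, assumed packing)
     q2 = q0 .* q1
   The state starts at zero, matching w[0] = w[1] = 0 in the question. */
void biquad_lanes(const float *x, float *y, size_t n, const float coeffs[4])
{
    assert(x != NULL && y != NULL && coeffs != NULL);
    float w1 = 0.0f;  /* w[i-1] */
    float w2 = 0.0f;  /* w[i-2] */

    for (size_t i = 0; i < n; i++) {
        float q0[4] = { w1, w2, w1, w2 };
        float q2[4];

        /* Four independent multiplications: one vector multiply on NEON. */
        for (int k = 0; k < 4; k++)
            q2[k] = q0[k] * coeffs[k];

        float w0 = x[i] + q2[0] + q2[1];   /* x + a1*w[i-1] + a2*w[i-2]    */
        y[i]     = w0   + q2[2] + q2[3];   /* w[i] + b1*w[i-1] + b2*w[i-2] */

        w2 = w1;                           /* shift the delay line */
        w1 = w0;
    }
}
```

The lane sums still depend on the previous sample through `w1` and `w2`, so only the multiplications are parallel here, not successive samples.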

A potentially more effective method, wavefront parallelism, can be achieved by inverting the loops:

    x0 = *x++;
    w0 = x0 + a*w1 + b*w2;   // pipeline warming stage
    y0 = w0 + c*w1 + d*w2;

    [REPEAT THIS]
    // W2 = W1; W1 = W0;
    W0 = y0 + A*W1 + B*W2;
    Y0 = W0 + C*W1 + D*W2;
    // w2 = w1; w1 = w0;
    x0 = *x++;
    *output++ = Y0;
    w0 = x0 + a*w1 + b*w2;
    y0 = w0 + c*w1 + d*w2;
    [REPEAT ENDS]

    W0 = y0 + A*W1 + B*W2;   // pipeline cooling stage
    Y0 = W0 + C*W1 + D*W2;
    *output++ = Y0;

Here the lower-case names belong to the first filter section and the upper-case names to the second. While there are still dependencies along the chain x0 -> w0 -> y0 -> W0 -> Y0, there is an opportunity for full 2-way parallelism between the lower-case and upper-case expressions. One can also try to get rid of the value shifting (w2 = w1; w1 = w0;) by unrolling the loop and doing manual register renaming.
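The reordered loop above can be sketched in runnable C for a cascade of two sections, next to a straightforward per-sample version for comparison. This assumes the output of the first section feeds the second (which is what the pseudocode implies) and that all state starts at zero; the type `biquad_coeffs` and both function names are hypothetical. It is a scalar illustration of the scheduling idea, not tuned NEON code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coefficient pack: a, b feed back; c, d feed forward. */
typedef struct { float a, b, c, d; } biquad_coeffs;

/* Straightforward form: each sample runs through section 0, then section 1. */
void cascade2_reference(const float *x, float *out, size_t n,
                        biquad_coeffs s0, biquad_coeffs s1)
{
    assert(x != NULL && out != NULL);
    float w1 = 0.0f, w2 = 0.0f, W1 = 0.0f, W2 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float w0 = x[i] + s0.a * w1 + s0.b * w2;
        float y0 = w0   + s0.c * w1 + s0.d * w2;
        w2 = w1; w1 = w0;
        float W0 = y0 + s1.a * W1 + s1.b * W2;
        float Y0 = W0 + s1.c * W1 + s1.d * W2;
        W2 = W1; W1 = W0;
        out[i] = Y0;
    }
}

/* Wavefront form: section 1 works on sample i-1 while section 0 works on
   sample i, so the two groups inside the loop do not depend on each other. */
void cascade2_wavefront(const float *x, float *out, size_t n,
                        biquad_coeffs s0, biquad_coeffs s1)
{
    assert(x != NULL && out != NULL);
    if (n == 0)
        return;
    float w1 = 0.0f, w2 = 0.0f, W1 = 0.0f, W2 = 0.0f;

    /* Pipeline warming stage: section 0 on the first sample only. */
    float w0 = x[0] + s0.a * w1 + s0.b * w2;
    float y0 = w0   + s0.c * w1 + s0.d * w2;
    w2 = w1; w1 = w0;

    for (size_t i = 1; i < n; i++) {
        /* Upper-case group: section 1 consumes y0 of sample i-1. */
        float W0 = y0 + s1.a * W1 + s1.b * W2;
        float Y0 = W0 + s1.c * W1 + s1.d * W2;
        /* Lower-case group: section 0 on sample i (independent of W0, Y0). */
        w0 = x[i] + s0.a * w1 + s0.b * w2;
        y0 = w0   + s0.c * w1 + s0.d * w2;

        W2 = W1; W1 = W0;
        w2 = w1; w1 = w0;
        out[i - 1] = Y0;
    }

    /* Pipeline cooling stage: section 1 on the last y0. */
    float W0 = y0 + s1.a * W1 + s1.b * W2;
    float Y0 = W0 + s1.c * W1 + s1.d * W2;
    out[n - 1] = Y0;
}
```

Both functions perform the same arithmetic per sample; only the order in which the two sections are interleaved differs, which is what exposes the two independent groups to the hardware.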
