C - Loop tiling/blocking for large dense matrix multiplication


I am wondering if someone can show me how to use loop tiling/loop blocking for large dense matrix multiplication effectively. I am computing C = A*B with 1000x1000 matrices. I have followed the example on Wikipedia for loop tiling, but I get worse results using tiling than without it.

http://en.wikipedia.org/wiki/loop_tiling

http://software.intel.com/en-us/articles/how-to-use-loop-blocking-to-optimize-memory-use-on-32-bit-intel-architecture

I have provided the code below. The naive method is very slow due to cache misses. The transpose method creates the transpose of B in a buffer. This method gives the fastest result (matrix multiplication goes as O(n^3) and the transpose as O(n^2), so doing the transpose is at least 1000x faster). The wiki method without blocking is also fast and does not need a buffer. The blocking method is slower. Another problem with blocking is that it has to update each block several times. This is a challenge for threading/OpenMP because it can cause race conditions if one is not careful.

I should point out that using AVX with a modification of the transpose method I get results faster than Eigen. However, my results with SSE are a bit slower than Eigen, so I think it uses caching better.
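
For reference, here is a minimal sketch (my own illustration, not the poster's actual code) of what an AVX modification of the transpose method's inner dot product could look like, assuming FMA support and that m is a multiple of 8:

#include <immintrin.h>

/* Hypothetical AVX+FMA dot product for the transpose method: a_row is a
   row of a, b2_col is a row of the transposed b buffer (b2 below). */
static float dot_avx(const float* a_row, const float* b2_col, const int m) {
    __m256 acc = _mm256_setzero_ps();
    for(int l=0; l<m; l+=8) {
        __m256 va = _mm256_loadu_ps(a_row + l);
        __m256 vb = _mm256_loadu_ps(b2_col + l);
        acc = _mm256_fmadd_ps(va, vb, acc);  /* acc += va * vb */
    }
    /* Reduce the 8 lanes to a single float. */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}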

Edit: I think I have an idea of what I want to do. It comes from Cannon's algorithm from 1969.
http://en.wikipedia.org/wiki/matrix_multiplication#communication-avoiding_and_distributed_algorithms

I need to do block matrix multiplication and have each thread handle a sub-matrix of C rather than of A and B. For example, if I divide the matrices into four blocks, I would do:

[c1 c2     [a1 a2     [b1 b2
 c3 c4]  =  a3 a4]  x  b3 b4]

Thread 1: c1 = a1b1 + a2b3
Thread 2: c2 = a1b2 + a2b4
Thread 3: c3 = a3b1 + a4b3
Thread 4: c4 = a3b2 + a4b4

That removes the race condition. I'll have to think about this.
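
A minimal OpenMP sketch of that idea (my own illustration, assuming row-major storage and the same signatures as the code below): each thread writes only its own tile of c, so there is nothing to race on.

void matrix_mult_c_blocks(const float* a, const float* b, float* c,
                          const int n, const int m, const int k) {
    const int bs = 64;  /* tile size, to be tuned */
    #pragma omp parallel for collapse(2)
    for(int i0=0; i0<n; i0+=bs) {
        for(int j0=0; j0<k; j0+=bs) {
            /* This thread is the only writer of the tile
               c[i0..imax) x [j0..jmax). */
            const int imax = i0 + bs < n ? i0 + bs : n;
            const int jmax = j0 + bs < k ? j0 + bs : k;
            for(int i=i0; i<imax; i++) {
                for(int j=j0; j<jmax; j++) {
                    float tmp = 0;
                    for(int l=0; l<m; l++) {
                        tmp += a[m*i+l]*b[k*l+j];
                    }
                    c[k*i + j] = tmp;
                }
            }
        }
    }
}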

/* min, aligned_malloc, aligned_free, and transpose are assumed to be
   defined elsewhere in the original code; a simple min for completeness: */
#define min(x, y) ((x) < (y) ? (x) : (y))

void matrix_mult_naive(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            float tmp = 0;
            for(int l=0; l<m; l++) {
                tmp += a[m*i+l]*b[k*l+j];
            }
            c[k*i + j] = tmp;
        }
    }
}

void matrix_mult_transpose(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    float *b2 = (float*)aligned_malloc(m*k*sizeof(float), 32);
    transpose(b, b2, m, k, 1);
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            float tmp = 0;
            for(int l=0; l<m; l++) {
                tmp += a[m*i+l]*b2[m*j+l];
            }
            c[k*i + j] = tmp;
        }
    }
    aligned_free(b2);
}

void matrix_mult_wiki(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            c[k*i + j] = 0;
        }
    }
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int l=0; l<m; l++) {
            float a_il = a[m*i+l];  /* hoisted invariant element of a */
            for(int j=0; j<k; j++) {
                c[k*i + j] += a_il*b[k*l+j];
            }
        }
    }
}

void matrix_mult_wiki_block(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    const int block_size = 8;  /* I have tried several different block sizes */
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            c[k*i + j] = 0;
        }
    }
    for(int l2=0; l2<m; l2+=block_size) {
        for(int j2=0; j2<k; j2+=block_size) {
            #pragma omp parallel for
            for(int i=0; i<n; i++) {
                for(int l=l2; l<min(m, l2+block_size); l++) {
                    for(int j=j2; j<min(k, j2+block_size); j++) {
                        c[k*i + j] += a[m*i+l]*b[k*l+j];
                    }
                }
            }
        }
    }
}
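
For completeness, a hypothetical driver (my own sketch, not from the original post) showing the calling convention assumed above: row-major a is n x m, b is m x k, c is n x k, here with 1000x1000 matrices.

#include <stdlib.h>

int main(void) {
    const int n = 1000, m = 1000, k = 1000;
    float *a = (float*)malloc(sizeof(float)*n*m);
    float *b = (float*)malloc(sizeof(float)*m*k);
    float *c = (float*)malloc(sizeof(float)*n*k);
    for(int i=0; i<n*m; i++) a[i] = 1.0f;  /* dummy data */
    for(int i=0; i<m*k; i++) b[i] = 2.0f;
    matrix_mult_wiki(a, b, c, n, m, k);  /* or any of the variants above */
    free(a); free(b); free(c);
    return 0;
}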

The best results I've gotten are from adding one more for loop that blocks over n, and from rearranging the loops. I also hoisted loop-invariant code, though the compiler's optimizer should do that automatically. The block size should be the cache line size divided by sizeof(float). This got it ~50% faster than the transposed approach.

If you have to pick just one of AVX or blocking, using AVX extensions (vfmadd###ps and haddps) is still substantially faster. Using both is best, and straightforward to add given that you're already testing whether the block size is a multiple of 64 / sizeof(float) == 16 floats == two 256-bit AVX registers (a sketch of this combination follows the tiling code below).

  • Transposed: 1,816,522 ticks
  • Tiling: 892,431 (51% faster)
  • AVX+tiling: 230,512 (87% faster)

Tiling:

void matrix_mult_wiki_block(const float* a, const float* b, float* c,
                            const int n, const int m, const int k) {
    const int block_size = 64 / sizeof(float); /* 64 = common cache line size */
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            c[k*i + j] = 0;
        }
    }
    for(int i0 = 0; i0 < n; i0 += block_size) {
        int imax = i0 + block_size > n ? n : i0 + block_size;

        for(int j0 = 0; j0 < m; j0 += block_size) {
            int jmax = j0 + block_size > m ? m : j0 + block_size;

            for(int k0 = 0; k0 < k; k0 += block_size) {
                int kmax = k0 + block_size > k ? k : k0 + block_size;

                for(int j1 = j0; j1 < jmax; ++j1) {
                    int sj = m * j1;  /* b is read in transposed layout: b[m*j1 + k1] */

                    for(int i1 = i0; i1 < imax; ++i1) {
                        int mi = m * i1;
                        int ki = k * i1;
                        int kij = ki + j1;

                        for(int k1 = k0; k1 < kmax; ++k1) {
                            c[kij] += a[mi + k1] * b[sj + k1];
                        }
                    }
                }
            }
        }
    }
}
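
To illustrate the AVX+tiling combination from the timings above (a hedged sketch of my own, not the answerer's code): assuming FMA support and a full 16-float block, i.e. two 256-bit registers, the innermost k1 loop's contribution can be computed as one partial dot product, with b in the same transposed layout the tiled code reads it in.

#include <immintrin.h>

/* Hypothetical helper: partial dot product over one 16-float block
   using vfmadd###ps and haddps, as mentioned above. */
static inline float block16_dot(const float* a_blk, const float* b_blk) {
    __m256 acc = _mm256_mul_ps(_mm256_loadu_ps(a_blk),
                               _mm256_loadu_ps(b_blk));
    acc = _mm256_fmadd_ps(_mm256_loadu_ps(a_blk + 8),
                          _mm256_loadu_ps(b_blk + 8), acc);
    /* Same haddps reduction as in the earlier sketch. */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

With that, the innermost loop above would become c[kij] += block16_dot(&a[mi + k0], &b[sj + k0]) whenever kmax - k0 is a full 16-float block.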

As for the Cannon reference, the SUMMA algorithm is a better one to follow.


In case anyone else is optimizing tall-skinny multiplications ({~1e9 x 50} x {50 x 50}, which is how I ended up here): the transposed approach is nearly identical in performance to the blocked approach up to n=18 (floats). n=18 is a pathological case (way worse than 17 or 19) and I don't quite see the cache access patterns that would cause this. Larger n improved the blocked approach.

