C - Loop tiling/blocking for large dense matrix multiplication
I was wondering if someone could show me how to use loop tiling/loop blocking for large dense matrix multiplication effectively. I am doing C = AB with 1000x1000 matrices. I have followed the example on Wikipedia for loop tiling, but I get worse results using tiling than without it.
http://en.wikipedia.org/wiki/loop_tiling
I have provided the code below. The naive method is very slow due to cache misses. The transpose method creates a transpose of B in a buffer. This method gives the fastest result (matrix multiplication goes as O(n^3) and the transpose as O(n^2), so doing the transpose is at least 1000x faster). The wiki method without blocking is also fast and does not need a buffer. The blocking method is slower. Another problem with blocking is that it has to update the same block several times. This is a challenge for threading/OpenMP because it can cause race conditions if one is not careful.
I should point out that using AVX with a modification of the transpose method I get results faster than Eigen. However, my results with SSE are a bit slower than Eigen, so I think it uses caching better.
Edit: I think I have an idea of what I want to do. It comes from Cannon's algorithm from 1969.
http://en.wikipedia.org/wiki/matrix_multiplication#communication-avoiding_and_distributed_algorithms
i need block matrix multiplication , have each thread handle sub-matrix of c rather a , b. example if divide matrices 4 blocks. do:
[c1 c2 [a1 a2 [b1 b2 c3 c4] = a3 a4] x b3 b4] thread 1: c1 = a1b1 + a2b3 thread 2: c2 = a1b2 + a2b4 thread 3: c3 = a3b1 + a4b3 thread 4: c4 = a3b2 + a4b4
that removes race condition. i'll have think this.
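A minimal sketch of that idea in OpenMP, for illustration: parallelize over tiles of c so that each thread writes only its own tile of c and only reads the shared a and b. It assumes square n-by-n row-major matrices with n divisible by the tile count nb; the function name matrix_mult_c_blocks and the collapse(2) scheduling are my own choices, not taken from the code below.

void matrix_mult_c_blocks(const float* a, const float* b, float* c, const int n, const int nb) {
    const int bs = n / nb;                 // tile edge length (assumes n % nb == 0)
    #pragma omp parallel for collapse(2)   // one (bi, bj) tile of c per iteration
    for (int bi = 0; bi < nb; bi++) {
        for (int bj = 0; bj < nb; bj++) {
            for (int i = bi*bs; i < (bi+1)*bs; i++) {
                for (int j = bj*bs; j < (bj+1)*bs; j++) {
                    float tmp = 0;
                    for (int l = 0; l < n; l++) {      // full sum over the inner dimension
                        tmp += a[n*i + l] * b[n*l + j];
                    }
                    c[n*i + j] = tmp;                  // only the owning thread writes this element
                }
            }
        }
    }
}

Because no two iterations of the parallel loop touch the same element of c, there is no race condition and no need for atomics or per-thread buffers.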
#define min(x, y) (((x) < (y)) ? (x) : (y))  // helper used by the blocked version

void matrix_mult_naive(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            float tmp = 0;
            for(int l=0; l<m; l++) {
                tmp += a[m*i+l]*b[k*l+j];
            }
            c[k*i + j] = tmp;
        }
    }
}

void matrix_mult_transpose(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    float *b2 = (float*)aligned_malloc(m*k*sizeof(float), 32);
    transpose(b, b2, m, k, 1);
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            float tmp = 0;
            for(int l=0; l<m; l++) {
                tmp += a[m*i+l]*b2[m*j+l];
            }
            c[k*i + j] = tmp;
        }
    }
    aligned_free(b2);
}

void matrix_mult_wiki(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            c[k*i + j] = 0;
        }
    }
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int l=0; l<m; l++) {
            const float a_il = a[m*i+l];  // hoist the loop-invariant element of a
            for(int j=0; j<k; j++) {
                c[k*i + j] += a_il*b[k*l+j];
            }
        }
    }
}

void matrix_mult_wiki_block(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    const int block_size = 8; // I have tried several different block sizes
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            c[k*i + j] = 0;
        }
    }
    for(int l2=0; l2<m; l2+=block_size) {
        for(int j2=0; j2<k; j2+=block_size) {
            #pragma omp parallel for
            for(int i=0; i<n; i++) {
                for(int l=l2; l<min(m, l2+block_size); l++) {
                    for(int j=j2; j<min(k, j2+block_size); j++) {
                        c[k*i + j] += a[m*i+l]*b[k*l+j];
                    }
                }
            }
        }
    }
}
The best results I've gotten are by adding one more for loop that blocks over your N, and by rearranging the loops. I also hoisted loop-invariant code, but the compiler's optimizer should do that automatically. The block size should be the cache line size divided by sizeof(float). This got it ~50% faster than the transposed approach.
If you have to pick just one of AVX or blocking, using AVX extensions (vfmadd###ps and haddps) is still substantially faster. Using both is best and straightforward to add, given that you're already testing whether the block size is a multiple of 64 / sizeof(float) == 16 floats == two 256-bit AVX registers.
- Transposed: 1,816,522 ticks
- Tiling: 892,431 (51% faster)
- AVX+tiling: 230,512 (87% faster)
Tiling:
void matrix_mult_wiki_block(const float* a, const float* b, float* c, const int n, const int m, const int k) {
    const int block_size = 64 / sizeof(float); // 64 = common cache line size
    for(int i=0; i<n; i++) {
        for(int j=0; j<k; j++) {
            c[k*i + j] = 0;
        }
    }
    for (int i0 = 0; i0 < n; i0 += block_size) {
        int imax = i0 + block_size > n ? n : i0 + block_size;
        for (int j0 = 0; j0 < m; j0 += block_size) {
            int jmax = j0 + block_size > m ? m : j0 + block_size;
            for (int k0 = 0; k0 < k; k0 += block_size) {
                int kmax = k0 + block_size > k ? k : k0 + block_size;
                for (int j1 = j0; j1 < jmax; ++j1) {
                    int sj = m * j1;
                    for (int i1 = i0; i1 < imax; ++i1) {
                        int mi = m * i1;
                        int ki = k * i1;
                        int kij = ki + j1;
                        for (int k1 = k0; k1 < kmax; ++k1) {
                            c[kij] += a[mi + k1] * b[sj + k1];
                        }
                    }
                }
            }
        }
    }
}
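For reference, a minimal sketch of what an AVX/FMA inner kernel could look like for the blocked loop above, since both a[mi + k1] and b[sj + k1] are contiguous in k1. This is my own illustration, not the code that produced the timings: it assumes a block size that is a multiple of 16 floats and AVX2+FMA support (compile with -mavx2 -mfma), and the helper name dot16_avx is made up.

#include <immintrin.h>

// dot product of 16 consecutive floats = two 256-bit AVX registers per operand
static inline float dot16_avx(const float* pa, const float* pb) {
    __m256 acc = _mm256_mul_ps(_mm256_loadu_ps(pa), _mm256_loadu_ps(pb));          // first 8 floats
    acc = _mm256_fmadd_ps(_mm256_loadu_ps(pa + 8), _mm256_loadu_ps(pb + 8), acc);  // next 8, fused multiply-add
    // horizontal reduction of the 8 partial sums to a single float
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

// the innermost loop of the blocked code would then become:
//     for (int k1 = k0; k1 < kmax; k1 += 16)
//         c[kij] += dot16_avx(&a[mi + k1], &b[sj + k1]);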
As for the Cannon reference, the SUMMA algorithm is a better one to follow.
In case anyone else is optimizing tall-skinny multiplications ({~1e9 x 50} x {50 x 50}, which is how I ended up here), the transposed approach is identical in performance to the blocked approach up to n=18 (floats). n=18 is a pathological case (way worse than 17 or 19), and I don't quite see the cache access patterns that cause this. Larger n improved the blocked approach.