CUDA, low performance in storing data in shared memory

Here is my problem: in order to speed up my project, I want to save a value generated inside the kernel to shared memory. However, I found that it takes a very long time to save that value. If I remove "this line" (see the code below), saving the value becomes very fast (a 100x speed-up!).

    extern __shared__ int sh_try[];

    __global__ void xxxkernel (...)
    {
      float v, e0, e1;
      float t;
      int count(0);
      (...)
      {
         v = fetchtexture();
         e0 = fetchtexture();
         e1 = fetchtexture();
         t = somedevicefunction(v, e0, e1);
         if (t > 0.0 && t < 1.0)   // <========== "this line"
           count++;
      }
      sh_try[threadIdx.x] = count;
    }

    main()
    {
      sth..
      start timing:
      xxxkernel<<<gridDim.x, blockDim.x, blockDim.x*sizeof(int)>>> (...);
      cudaDeviceSynchronize();
      end timing.
      sth...
    }

In order to figure out the problem, I simplified the code so that it only saves the data to shared memory and then stops. As far as I know, shared memory is the most efficient memory besides registers, so I wonder whether this high latency is normal or whether I've done something wrong. Please give me some advice! Thank you in advance!

trudi

Update: if I replace shared memory with global memory, it takes about the same time: 33 ms without "this line" and 297 ms with it. Is it normal that saving data to global memory takes the same time as saving it to shared memory? Is this also part of "compiler optimization"?

I have also checked other similar problems on Stack Overflow, i.e., cases where there is a huge time gap between saving data to shared memory and not saving it. That gap may be caused by compiler optimization: since it is pointless to calculate data without saving it, the compiler simply "removes" the pointless code.

I am not sure whether my case shares the same cause, since the line that changes the game is a conditional, "this line": when I comment it out, the variable 'count' increases in every iteration; when I uncomment it, 'count' increases only when t is meaningful.

Any ideas? Please...

Frequently, when large performance changes are seen as a result of relatively small code changes (such as adding or deleting a line of code in a kernel), the performance changes are not due to the actual performance impact of that line of code, but are due to the compiler making different optimization decisions, which can result in wholesale additions or deletions of machine code in the kernels.

A relatively easy way to help confirm this is to look at the generated machine code. For example, if the size of the generated machine code changes substantially due to the addition or deletion of a single line of source code, it may be the case that the compiler made an optimization decision that drastically affected the code.

Although it's not machine code, for these purposes a reasonable proxy is to look at the generated PTX code, which is an intermediate code that the compiler creates.

You can generate the PTX by adding the -ptx switch to your compile command:

nvcc -ptx mycode.cu

This will generate a file called mycode.ptx which you can inspect. Naturally, if your regular compile command requires extra switches (e.g. -I/path/to/include/files), then this command may require those same switches. The nvcc manual provides more information on code generation options, and there is a PTX manual to learn about PTX, but you may be able to get a rough idea based simply on the size of the generated PTX (e.g. the number of lines in the .ptx file).
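To make the effect concrete, here is a minimal sketch of the two variants from the question, under several assumptions: the texture fetches are replaced by plain global-memory loads, the body of some_device_function is invented, and the kernel names and the run_both helper are hypothetical. In the first kernel the stored count does not depend on t, so a compiler is free to discard the loads and the function call as dead code; in the second kernel it has to keep them.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's somedevicefunction().
__device__ float some_device_function(float v, float e0, float e1)
{
    return (v - e0) / (e1 - e0);
}

// Variant A: "this line" removed. count no longer depends on t, so the
// compiler is allowed to drop the loads and the call as dead code.
__global__ void count_unconditional(const float *in, int n_iter, int *out)
{
    int count = 0;
    for (int i = 0; i < n_iter; ++i) {
        float t = some_device_function(in[3 * i], in[3 * i + 1], in[3 * i + 2]);
        (void)t;  // computed but never used
        count++;
    }
    out[threadIdx.x] = count;
}

// Variant B: "this line" present. count depends on t, so the loads and
// the call must actually be executed.
__global__ void count_conditional(const float *in, int n_iter, int *out)
{
    int count = 0;
    for (int i = 0; i < n_iter; ++i) {
        float t = some_device_function(in[3 * i], in[3 * i + 1], in[3 * i + 2]);
        if (t > 0.0f && t < 1.0f)
            count++;
    }
    out[threadIdx.x] = count;
}

// Hypothetical host helper: runs both kernels on the same input
// (n_iter triples of v, e0, e1) and returns thread 0's count for each.
void run_both(const float *h_in, int n_iter, int *uncond, int *cond)
{
    const int threads = 32;
    const size_t in_bytes = 3 * static_cast<size_t>(n_iter) * sizeof(float);
    float *d_in = nullptr;
    int *d_out = nullptr;
    int h_out[threads];

    cudaError_t err = cudaMalloc(&d_in, in_bytes);
    assert(err == cudaSuccess);
    err = cudaMalloc(&d_out, sizeof(h_out));
    assert(err == cudaSuccess);
    err = cudaMemcpy(d_in, h_in, in_bytes, cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    count_unconditional<<<1, threads>>>(d_in, n_iter, d_out);
    err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    *uncond = h_out[0];

    count_conditional<<<1, threads>>>(d_in, n_iter, d_out);
    err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    *cond = h_out[0];

    cudaFree(d_in);
    cudaFree(d_out);
}
```

Compiling this file with nvcc -ptx and comparing the bodies of the two kernels in the resulting .ptx file is one way to check whether the loads and the division survive in the unconditional variant. The exact output depends on the compiler version and optimization flags, so treat the comparison as an experiment rather than a guaranteed result.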
