CUDA dynamic allocation and coalescence (compute capability >2.0)
Maybe someone can help me out. I want to use dynamic allocation in a CUDA kernel for the simple reason that each block requires a significant amount of global memory as a scratchpad, and the number of blocks is on the order of 4000. Statically allocating the scratchpad would have my preference, but that is not possible due to memory size restrictions. I figured that dynamic allocation is useful in this case, because the amount of memory needed then scales with the number of active blocks, which is on the order of 100-200. As a side note, let me know if this would not have your preference either; to me it seems the only way forward.
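To make the setup concrete, here is a minimal sketch of what I have in mind. All names (`scratchKernel`, `heapBytesNeeded`, `runScratchDemo`) and all sizes are hypothetical, and the figure of 200 resident blocks is an assumption that depends on the device and the launch configuration. One thread per block allocates the scratchpad from the device heap and shares the pointer through shared memory; blocks whose allocation fails are counted rather than silently ignored.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: heap budget for the blocks that can be resident at once,
// with a factor-of-two headroom (an assumption) for allocator overhead.
inline size_t heapBytesNeeded(size_t maxResidentBlocks, size_t bytesPerBlock)
{
    return 2 * maxResidentBlocks * bytesPerBlock;
}

// Each block allocates its own scratchpad with the in-kernel malloc().
__global__ void scratchKernel(size_t floatsPerBlock, int *failedBlocks)
{
    __shared__ float *scratch;  // one pointer shared by the whole block

    if (threadIdx.x == 0)
        scratch = static_cast<float *>(malloc(floatsPerBlock * sizeof(float)));
    __syncthreads();            // make the pointer visible to all threads

    if (scratch == nullptr) {   // same value for every thread in the block
        if (threadIdx.x == 0) atomicAdd(failedBlocks, 1);
        return;
    }

    // Placeholder work: consecutive threads touch consecutive floats.
    for (size_t i = threadIdx.x; i < floatsPerBlock; i += blockDim.x)
        scratch[i] = 0.0f;
    __syncthreads();            // nobody may still be using the scratchpad

    if (threadIdx.x == 0)
        free(scratch);
}

// Returns the number of blocks whose allocation failed, or -1 on a CUDA error.
int runScratchDemo()
{
    const size_t floatsPerBlock = 1 << 16;  // 256 KiB per block (assumed)
    const int    numBlocks      = 4000;
    const size_t maxResident    = 200;      // assumed bound on active blocks

    // The heap size must be set before launching a kernel that uses malloc().
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                           heapBytesNeeded(maxResident,
                                           floatsPerBlock * sizeof(float)))
        != cudaSuccess)
        return -1;

    int *dFailed = nullptr;
    if (cudaMalloc(&dFailed, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dFailed, 0, sizeof(int));

    scratchKernel<<<numBlocks, 128>>>(floatsPerBlock, dFailed);

    int failed = -1;
    cudaError_t err = cudaMemcpy(&failed, dFailed, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dFailed);
    return err == cudaSuccess ? failed : -1;
}
```

Calling `runScratchDemo()` from `main()` reports how many blocks could not get their scratchpad. A nonzero result would suggest that more blocks were resident than the assumed 200, or that the heap was fragmented.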
Coming to the point, according to the CUDA C Programming Guide, section B.17:
"The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary."
And according to section F.4.2:
"A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses."
And section 5.3.2:
"Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes."
So does this mean that in-kernel allocations are not correctly aligned for warp-parallel access to consecutive floats to be coalesced, and that this is only possible with memory allocated through cudaMalloc()?
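One way to probe this, and a possible workaround if the 16-byte guarantee turns out to be all one gets, is sketched below. It does not settle the question. The names `alignUp`, `alignedScratch` and `runAlignmentDemo` are hypothetical, and the 128-byte constant is taken from the F.4.2 quote. Each block over-allocates by 127 bytes, rounds the pointer up to the next 128-byte boundary, and prints the offset of both pointers within a cache line. The raw pointer is kept because free() must receive the address that malloc() returned.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

constexpr size_t kLine = 128;  // cache-line size quoted from section F.4.2

// Round a pointer up to the next multiple of `alignment` (a power of two).
__host__ __device__ inline float *alignUp(void *p, size_t alignment)
{
    uintptr_t a = reinterpret_cast<uintptr_t>(p);
    uintptr_t mask = static_cast<uintptr_t>(alignment) - 1;
    return reinterpret_cast<float *>((a + mask) & ~mask);
}

__global__ void alignedScratch(size_t n)
{
    __shared__ void  *raw;      // what malloc() returned, needed for free()
    __shared__ float *scratch;  // 128-byte aligned view into the allocation

    if (threadIdx.x == 0) {
        // Over-allocate so that rounding up never runs past the end.
        raw = malloc(n * sizeof(float) + kLine - 1);
        scratch = raw ? alignUp(raw, kLine) : nullptr;
        if (raw)
            printf("block %u: raw %% 128 = %u, aligned %% 128 = %u\n",
                   blockIdx.x,
                   (unsigned)(reinterpret_cast<uintptr_t>(raw) % kLine),
                   (unsigned)(reinterpret_cast<uintptr_t>(scratch) % kLine));
    }
    __syncthreads();
    if (scratch == nullptr) return;  // allocation failed for this block

    // A warp's 32 consecutive floats (128 bytes) now start on a line boundary.
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        scratch[i] = static_cast<float>(i);
    __syncthreads();

    if (threadIdx.x == 0)
        free(raw);
}

// Launches a few blocks; returns true if the kernel ran without a CUDA error.
bool runAlignmentDemo()
{
    alignedScratch<<<4, 64>>>(1024);
    return cudaDeviceSynchronize() == cudaSuccess;
}
```

If the printed raw offsets are ever nonzero, a warp reading 32 consecutive floats from the unadjusted pointer would straddle two 128-byte lines, which is the situation I am worried about. The price of the workaround is up to 127 wasted bytes per allocation.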