malloc - CUDA dynamic allocation and coalescence (compute capability >2.0)

- September 15, 2012

Maybe someone can help me out. I want to use dynamic allocation in a CUDA kernel, for the simple reason that each block will require a significant amount of global memory as a scratchpad, and the number of blocks is on the order of 4000. Statically allocating this scratchpad would have my preference, but that is not possible due to memory size restrictions. So I figured that in this case dynamic allocation would be useful, since memory is then needed only for the number of active blocks, which is on the order of 100-200. As a side note, let me know if this would not have your preference too, but it seems to me to be the way forward.
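To make the idea concrete, here is a minimal sketch of what I have in mind, not a finished implementation. The names scratch_kernel and run_scratch_demo, the 64 MiB heap size, and the dummy workload are all assumptions made up for illustration. One thread per block calls the in-kernel malloc(), shares the pointer through shared memory, and frees it before the block exits.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block takes its own scratchpad from the device heap.
__global__ void scratch_kernel(float *out, size_t scratch_floats)
{
    __shared__ float *scratch;  // one pointer per block, visible to all its threads

    if (threadIdx.x == 0)
        scratch = static_cast<float *>(malloc(scratch_floats * sizeof(float)));
    __syncthreads();

    // malloc() returns NULL when the heap is exhausted. Every thread reads the
    // same shared pointer, so the whole block takes this branch together.
    if (scratch == nullptr) {
        if (threadIdx.x == 0) out[blockIdx.x] = -1.0f;
        return;
    }

    // Stand-in workload: fill the scratchpad with a block-strided loop.
    for (size_t i = threadIdx.x; i < scratch_floats; i += blockDim.x)
        scratch[i] = 1.0f;
    __syncthreads();

    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < scratch_floats; ++i) sum += scratch[i];
        out[blockIdx.x] = sum;
        free(scratch);  // return the memory so later blocks can reuse it
    }
}

// Hypothetical host wrapper: returns how many blocks produced the expected sum.
int run_scratch_demo(int blocks, int threads, size_t scratch_floats)
{
    // Assumed heap size. The limit has to be set before the first kernel that
    // uses in-kernel malloc() is launched; later calls fail and are ignored here.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    float *d_out = nullptr;
    cudaMalloc(&d_out, blocks * sizeof(float));
    scratch_kernel<<<blocks, threads>>>(d_out, scratch_floats);
    cudaDeviceSynchronize();

    std::vector<float> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int ok = 0;
    for (float v : h_out)
        if (v == static_cast<float>(scratch_floats)) ++ok;
    return ok;
}
```

Because every block frees its scratchpad before it exits, the heap should only have to cover the blocks resident at any one time rather than all 4000, which is the whole point of the approach. A call such as run_scratch_demo(4000, 128, 1024) would exercise it.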

Coming to the point, according to the CUDA C Programming Guide, section B.17:

The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.

And according to section F.4.2:

A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

And section 5.3.2:

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

So does this mean that in-kernel allocations are not correctly aligned for warp-parallel, coalesced access to consecutive floats, as would be possible with cudaMalloc()?
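The arithmetic behind the concern can be sketched as follows. The helpers segments_touched and align_up_128 are hypothetical names of my own, not part of any CUDA API. The first counts how many 128-byte segments a warp touches when its 32 threads read consecutive floats starting at a given address, and the second rounds an address up to the next 128-byte boundary.

```cuda
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of 128-byte segments touched when the 32 threads
// of a warp each read one 4-byte float, consecutively, starting at addr.
__host__ __device__ inline unsigned segments_touched(std::uintptr_t addr)
{
    const std::uintptr_t line = 128;                      // cache line size in bytes
    const std::uintptr_t bytes = 32 * sizeof(float);      // 128 bytes per warp read
    std::uintptr_t first = addr / line;                   // segment of the first byte
    std::uintptr_t last  = (addr + bytes - 1) / line;     // segment of the last byte
    return static_cast<unsigned>(last - first + 1);
}

// Hypothetical helper: round addr up to the next multiple of 128.
__host__ __device__ inline std::uintptr_t align_up_128(std::uintptr_t addr)
{
    return (addr + 127) & ~static_cast<std::uintptr_t>(127);
}
```

With a 128-byte aligned start the warp's read fits in one segment, while a start that is only 16-byte aligned can straddle two. If that turns out to matter, one possible workaround (an assumption on my part, not something the guide prescribes) would be to request 127 extra bytes from malloc(), use the rounded-up address for the data, and keep the original pointer for free(). Whether the extra transaction is measurable in practice is something I would have to profile.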
