Strided Accesses

Although the relaxed coalescing restrictions for devices with compute capability 1.2 or higher achieve one-half full bandwidth for the offset copy case just described, performance on such devices can degrade when successive threads access memory locations that have non-unit strides. This pattern occurs frequently when dealing with multidimensional data or matrices; for example, when a half warp of threads accesses matrix elements column-wise and the matrix is stored in row-major order.
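For concreteness, the following minimal sketch shows how such a column-wise read turns into a strided access; the kernel name readColumn and its parameters are illustrative assumptions, not code from this guide:

__global__ void readColumn(float *odata, const float *mat, int width, int col)
{
    // Consecutive threads read consecutive rows of one column. In a
    // row-major matrix those elements are 'width' floats apart, so the
    // half warp accesses memory with a stride of 'width'.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    odata[row] = mat[row * width + col];
}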

To illustrate the effect of strided access on effective bandwidth, see the kernel strideCopy() in the listing A kernel to illustrate non-unit stride data copy, which copies data with a stride of stride elements between threads from idata to odata.

A kernel to illustrate non-unit stride data copy

__global__ void strideCopy(float *odata, float *idata, int stride)
{
    // Each thread copies one element; consecutive threads touch
    // elements that are 'stride' apart in both idata and odata.
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}
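
To reproduce the bandwidth falloff, a host-side harness along the following lines can time strideCopy() for a range of strides using CUDA events. It assumes the kernel above is defined in the same file; the array size and launch configuration are arbitrary illustrative choices:

#include <stdio.h>

int main(void)
{
    const int n = 1 << 20;  // elements copied per launch
    float *d_in, *d_out;
    // Allocate enough room for the largest stride tested (16).
    cudaMalloc(&d_in,  (size_t)n * 16 * sizeof(float));
    cudaMalloc(&d_out, (size_t)n * 16 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int stride = 1; stride <= 16; stride++) {
        cudaEventRecord(start);
        strideCopy<<<n / 256, 256>>>(d_out, d_in, stride);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        // Effective bandwidth counts only the useful bytes:
        // n floats read plus n floats written.
        double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
        printf("stride %2d: %6.2f GB/s\n", stride, gbps);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}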

Figure 1 illustrates a situation that can be created using the strideCopy() kernel above; namely, threads within a half warp access memory with a stride of 2. This access pattern is coalesced into a single 128-byte transaction on an NVIDIA GeForce GTX 280 (compute capability 1.3).

Figure 1. A half warp accessing memory with a stride of 2

Although a stride of 2 results in a single transaction, note that half the elements in the transaction are not used and represent wasted bandwidth. As the stride increases, the effective bandwidth decreases until the point where 16 transactions are issued for the 16 threads in a half warp, as indicated in Figure 2.

Figure 2. Performance of strideCopy Kernel
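
The shape of the curve in Figure 2 can be approximated with simple arithmetic: a half warp always uses 16 threads × 4 bytes = 64 useful bytes, while the number of 128-byte transactions it triggers grows with the stride. The helper below is a rough model under assumptions not stated in this guide, namely that accesses are aligned and that the hardware always issues full 128-byte transactions (for unit stride it therefore overestimates waste, since the hardware can issue a smaller 64-byte transaction):

int halfWarpTransactions(int stride)
{
    // 16 threads copying 4-byte floats span 16 * stride * 4 bytes.
    int spanBytes = 16 * stride * 4;
    // Each 128-byte segment touched costs one transaction ...
    int transactions = (spanBytes + 127) / 128;
    // ... up to a maximum of one transaction per thread.
    return transactions > 16 ? 16 : transactions;
}

For a stride of 2 this gives one transaction with 64 of 128 fetched bytes used, the 50% waste noted above; for a stride of 32 or more it gives 16 transactions, of which only 64 of 2048 fetched bytes are useful.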

Note, however, that on the NVIDIA GeForce 8800 GTX device (compute capability 1.0), any non-unit stride results in 16 separate transactions per half warp.

As illustrated in Figure 2, non-unit stride global memory accesses should be avoided whenever possible. One method for doing so utilizes shared memory, which is discussed in the next section.
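
As a preview of that technique, the sketch below stages each block's data through shared memory: the global reads and writes stay unit-stride across threads, and the strided (here, tile-transposing) accesses happen in fast on-chip memory instead. The kernel name, TILE_DIM, and the one-column padding are illustrative choices, and the kernel assumes a launch with TILE_DIM threads per block:

#define TILE_DIM 16

__global__ void stagedStrideCopy(float *odata, float *idata)
{
    // Pad the tile by one column so the strided shared memory reads
    // below fall in different banks (bank conflicts are covered in
    // the next section).
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int base = blockIdx.x * TILE_DIM * TILE_DIM + threadIdx.x;

    // Coalesced: consecutive threads read consecutive global addresses.
    for (int i = 0; i < TILE_DIM; i++)
        tile[i][threadIdx.x] = idata[base + i * TILE_DIM];
    __syncthreads();

    // The stride-TILE_DIM accesses now hit shared memory, and the
    // global memory writes remain unit-stride across threads.
    for (int i = 0; i < TILE_DIM; i++)
        odata[base + i * TILE_DIM] = tile[threadIdx.x][i];
}

A launch such as stagedStrideCopy<<<n / (TILE_DIM * TILE_DIM), TILE_DIM>>>(d_out, d_in) processes one tile per block.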