Writing CUDA Kernels

Introduction

CUDA has an execution model unlike the traditional sequential model used for programming CPUs. In CUDA, the code you write will be executed by multiple threads at once (often hundreds or thousands). Your solution will be modeled by defining a thread hierarchy of grid, blocks and threads.

Numba’s CUDA support exposes facilities to declare and manage this hierarchy of threads. The facilities are largely similar to those exposed by NVidia’s CUDA C language.

Numba also exposes three kinds of GPU memory: global device memory (the large, relatively slow off-chip memory that’s connected to the GPU itself), on-chip shared memory and local memory. For all but the simplest algorithms, it is important that you carefully consider how to use and access memory in order to minimize bandwidth requirements and contention.

Kernel declaration

A kernel function is a GPU function that is meant to be called from CPU code (*). It gives it two fundamental characteristics:

  • kernels cannot explicitly return a value; all result data must be written to an array passed to the function (if computing a scalar, you will probably pass a one-element array);

  • kernels explicitly declare their thread hierarchy when called: i.e. the number of thread blocks and the number of threads per block (note that while a kernel is compiled once, it can be called multiple times with different block sizes or grid sizes).

At first sight, writing a CUDA kernel with Numba looks very much like writing a JIT function for the CPU:

@cuda.jit
def increment_by_one(an_array):
    """
    Increment all array elements by one.
    """
    # code elided here; read further for different implementations

(*) Note: newer CUDA devices support device-side kernel launching; this feature is called dynamic parallelism but Numba does not support it currently)

Kernel invocation

A kernel is typically launched in the following way:

threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock
increment_by_one[blockspergrid, threadsperblock](an_array)

We notice two steps here:

  • Instantiate the kernel proper, by specifying a number of blocks (or “blocks per grid”), and a number of threads per block. The product of the two will give the total number of threads launched. Kernel instantiation is done by taking the compiled kernel function (here increment_by_one) and indexing it with a tuple of integers.

  • Running the kernel, by passing it the input array (and any separate output arrays if necessary). Kernels run asynchronously: launches queue their execution on the device and then return immediately. You can use cuda.synchronize() to wait for all previous kernel launches to finish executing.

Note

Passing an array that resides in host memory will implicitly cause a copy back to the host, which will be synchronous. In this case, the kernel launch will not return until the data is copied back, and therefore appears to execute synchronously.

Choosing the block size

It might seem curious to have a two-level hierarchy when declaring the number of threads needed by a kernel. The block size (i.e. number of threads per block) is often crucial:

  • On the software side, the block size determines how many threads share a given area of shared memory.

  • On the hardware side, the block size must be large enough for full occupation of execution units; recommendations can be found in the CUDA C Programming Guide.

Multi-dimensional blocks and grids

To help deal with multi-dimensional arrays, CUDA allows you to specify multi-dimensional blocks and grids. In the example above, you could make blockspergrid and threadsperblock tuples of one, two or three integers. Compared to 1D declarations of equivalent sizes, this doesn’t change anything to the efficiency or behaviour of generated code, but can help you write your algorithms in a more natural way.

Thread positioning

When running a kernel, the kernel function’s code is executed by every thread once. It therefore has to know which thread it is in, in order to know which array element(s) it is responsible for (complex algorithms may define more complex responsibilities, but the underlying principle is the same).

One way is for the thread to determine its position in the grid and block and manually compute the corresponding array position:

@cuda.jit
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

Note

Unless you are sure the block size and grid size is a divisor of your array size, you must check boundaries as shown above.

threadIdx, blockIdx, blockDim and gridDim are special objects provided by the CUDA backend for the sole purpose of knowing the geometry of the thread hierarchy and the position of the current thread within that geometry.

These objects can be 1D, 2D or 3D, depending on how the kernel was invoked. To access the value at each dimension, use the x, y and z attributes of these objects, respectively.

numba.cuda.threadIdx

The thread indices in the current thread block. For 1D blocks, the index (given by the x attribute) is an integer spanning the range from 0 inclusive to numba.cuda.blockDim exclusive. A similar rule exists for each dimension when more than one dimension is used.

numba.cuda.blockDim

The shape of the block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”).

numba.cuda.blockIdx

The block indices in the grid of threads launched a kernel. For a 1D grid, the index (given by the x attribute) is an integer spanning the range from 0 inclusive to numba.cuda.gridDim exclusive. A similar rule exists for each dimension when more than one dimension is used.

numba.cuda.gridDim

The shape of the grid of blocks, i.e. the total number of blocks launched by this kernel invocation, as declared when instantiating the kernel.

Absolute positions

Simple algorithms will tend to always use thread indices in the same way as shown in the example above. Numba provides additional facilities to automate such calculations:

numba.cuda.grid(ndim)

Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.

numba.cuda.gridsize(ndim)

Return the absolute size (or shape) in threads of the entire grid of blocks. ndim has the same meaning as in grid() above.

With these functions, the incrementation example can become:

@cuda.jit
def increment_by_one(an_array):
    pos = cuda.grid(1)
    if pos < an_array.size:
        an_array[pos] += 1

The same example for a 2D array and grid of threads would be:

@cuda.jit
def increment_a_2D_array(an_array):
    x, y = cuda.grid(2)
    if x < an_array.shape[0] and y < an_array.shape[1]:
       an_array[x, y] += 1

Note the grid computation when instantiating the kernel must still be done manually, for example:

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
increment_a_2D_array[blockspergrid, threadsperblock](an_array)

Further Reading

Please refer to the the CUDA C Programming Guide for a detailed discussion of CUDA programming.