CUDA Kernel API¶
Kernel declaration¶
The @cuda.jit decorator is used to create a CUDA dispatcher object that can be configured and launched:
- numba.cuda.jit(func_or_sig=None, argtypes=None, device=False, inline=False, link=[], debug=None, opt=True, **kws)¶
  JIT compile a Python function conforming to the CUDA Python specification. If a signature is supplied, then a function is returned that takes a function to compile.

  Parameters:
  - func_or_sig – A function to JIT compile, or a signature of a function to compile. If a function is supplied, then a numba.cuda.compiler.AutoJitCUDAKernel is returned. If a signature is supplied, then a function is returned. The returned function accepts another function, which it will compile and then return a numba.cuda.compiler.AutoJitCUDAKernel.

    Note: A kernel cannot have any return value.
  - device (bool) – Indicates whether this is a device function.
  - bind (bool) – (Deprecated) Force binding to CUDA context immediately.
  - link (list) – A list of files containing PTX source to link with the function.
  - debug – If True, check for exceptions thrown when executing the kernel. Since this degrades performance, this should only be used for debugging purposes. Defaults to False. (The default value can be overridden by setting the environment variable NUMBA_CUDA_DEBUGINFO=1.)
  - fastmath – If True, enables flush-to-zero and fused-multiply-add, and disables precise division and square root. This parameter has no effect on device functions, whose fastmath setting depends on the kernel function from which they are called.
  - max_registers – Limit the kernel to using at most this number of registers per thread. Useful for increasing occupancy.
  - opt (bool) – Whether to compile from LLVM IR to PTX with optimization enabled. When True, -opt=3 is passed to NVVM. When False, -opt=0 is passed to NVVM. Defaults to True.
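
For example, a minimal sketch of a kernel declaration; the explicit string signature and the fastmath and max_registers values below are illustrative choices, not requirements:

from numba import cuda

# Illustrative elementwise kernel; kernels must not return a value.
@cuda.jit('void(float32[:], float32[:], float32[:])', fastmath=True, max_registers=32)
def add_kernel(a, b, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]
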
Dispatcher objects¶
The usual syntax for configuring a Dispatcher with a launch configuration uses subscripting, with the arguments being as in the following:
# func is some function decorated with @cuda.jit
func[griddim, blockdim, stream, sharedmem]
The griddim and blockdim arguments specify the size of the grid and thread blocks, and may be either integers or tuples of length up to 3. The stream parameter is an optional stream on which the kernel will be launched, and the sharedmem parameter specifies the size of dynamic shared memory in bytes.
Subscripting the Dispatcher returns a configuration object that can be called with the kernel arguments:
configured = func[griddim, blockdim, stream, sharedmem]
configured(x, y, z)
However, it is more idiomatic to configure and call the kernel within a single statement:
func[griddim, blockdim, stream, sharedmem](x, y, z)
This is similar to launch configuration in CUDA C/C++:
func<<<griddim, blockdim, sharedmem, stream>>>(x, y, z)
Note
The order of stream and sharedmem is reversed in Numba compared to CUDA C/C++.
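
For instance, a minimal sketch of launching a kernel on an explicit stream; the kernel, the sizes, and the block shape are illustrative:

import numpy as np
from numba import cuda

@cuda.jit
def copy_kernel(src, dst):
    i = cuda.grid(1)
    if i < src.size:
        dst[i] = src[i]

stream = cuda.stream()
src = cuda.to_device(np.arange(256, dtype=np.float32), stream=stream)
dst = cuda.device_array_like(src, stream=stream)
# stream comes before sharedmem; this kernel needs no dynamic shared memory
copy_kernel[1, 256, stream, 0](src, dst)
stream.synchronize()
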
Dispatcher objects also provide several utility methods for inspection and creating a specialized instance:
- class numba.cuda.compiler.Dispatcher(func, sigs, bind, targetoptions)¶
  CUDA Dispatcher object. When configured and called, the dispatcher will specialize itself for the given arguments (if no suitable specialized version already exists) and compute capability, and launch on the device associated with the current context.
  Dispatcher objects are not to be constructed by the user, but instead are created using the numba.cuda.jit() decorator.
  - property extensions¶
    A list of objects that must have a prepare_args function. When a specialized kernel is called, each argument will be passed through to the prepare_args (from the last object in this list to the first). The arguments to prepare_args are:
    - ty: the numba type of the argument
    - val: the argument value itself
    - stream: the CUDA stream used for the current call to the kernel
    - retr: a list of zero-arg functions that you may want to append post-call cleanup work to
    The prepare_args function must return a tuple (ty, val), which will be passed in turn to the next right-most extension. After all the extensions have been called, the resulting (ty, val) will be passed into Numba's default argument marshalling logic.
  - forall(ntasks, tpb=0, stream=0, sharedmem=0)¶
    Returns a 1D-configured kernel for a given number of tasks.
    This assumes that:
    - the kernel maps the Global Thread ID cuda.grid(1) to tasks on a 1-1 basis.
    - the kernel checks that the Global Thread ID is upper-bounded by ntasks, and does nothing if it is not.

    Parameters:
    - ntasks – The number of tasks.
    - tpb – The size of a block. An appropriate value is chosen if this parameter is not supplied.
    - stream – The stream on which the configured kernel will be launched.
    - sharedmem – The number of bytes of dynamic shared memory required by the kernel.

    Returns: A configured kernel, ready to launch on a set of arguments.
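
    As a rough sketch of how forall might be used (the kernel and sizes below are illustrative):

    from numba import cuda
    import numpy as np

    @cuda.jit
    def scale(out, x, factor):
        i = cuda.grid(1)
        if i < x.size:            # the bounds check that forall relies on
            out[i] = x[i] * factor

    x = np.arange(100_000, dtype=np.float32)
    out = np.zeros_like(x)
    # forall chooses a grid and block size for x.size tasks
    scale.forall(x.size)(out, x, np.float32(2.0))
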
  - inspect_asm(signature=None, compute_capability=None)¶
    Return the generated PTX assembly code for all signatures encountered thus far, or the PTX assembly code for a specific signature and compute_capability if given. If the dispatcher is specialized, the assembly code for the single specialization is returned.
  - inspect_llvm(signature=None, compute_capability=None)¶
    Return the LLVM IR for all signatures encountered thus far, or the LLVM IR for a specific signature and compute_capability if given. If the dispatcher is specialized, the IR for the single specialization is returned.
  - inspect_sass(signature=None, compute_capability=None)¶
    Return the generated SASS code for all signatures encountered thus far, or the SASS code for a specific signature and compute_capability if given.
    Requires nvdisasm to be available on the PATH.
  - inspect_types(file=None)¶
    Produce a dump of the Python source of this function annotated with the corresponding Numba IR and type information. The dump is written to file, or sys.stdout if file is None.
  - specialize(*args)¶
    Create a new instance of this dispatcher specialized for the given args.
  - property specialized¶
    True if the Dispatcher has been specialized.
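
    As a sketch, specializing a dispatcher for concrete argument types and inspecting the resulting PTX might look like the following; the kernel and array sizes are illustrative:

    from numba import cuda
    import numpy as np

    @cuda.jit
    def axpy(r, a, x, y):
        i = cuda.grid(1)
        if i < r.size:
            r[i] = a * x[i] + y[i]

    r = cuda.device_array(1024, dtype=np.float32)
    x = cuda.device_array(1024, dtype=np.float32)
    y = cuda.device_array(1024, dtype=np.float32)

    # Compile a version fixed to these argument types
    axpy_f32 = axpy.specialize(r, np.float32(2.0), x, y)
    print(axpy_f32.specialized)      # True
    print(axpy_f32.inspect_asm())    # PTX for the single specialization
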
Intrinsic Attributes and Functions¶
The remainder of the attributes and functions in this section may only be called from within a CUDA Kernel.
Thread Indexing¶
- numba.cuda.threadIdx¶
  The thread indices in the current thread block, accessed through the attributes x, y, and z. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute in numba.cuda.blockDim exclusive.
- numba.cuda.blockIdx¶
  The block indices in the grid of thread blocks, accessed through the attributes x, y, and z. Each index is an integer spanning the range from 0 inclusive to the corresponding value of the attribute in numba.cuda.gridDim exclusive.
- numba.cuda.blockDim¶
  The shape of a block of threads, as declared when instantiating the kernel. This value is the same for all threads in a given kernel, even if they belong to different blocks (i.e. each block is “full”).
- numba.cuda.gridDim¶
  The shape of the grid of blocks, accessed through the attributes x, y, and z.
- numba.cuda.laneid¶
  The thread index in the current warp, as an integer spanning the range from 0 inclusive to numba.cuda.warpsize exclusive.
- numba.cuda.warpsize¶
  The size in threads of a warp on the GPU. Currently this is always 32.
- numba.cuda.grid(ndim)¶
  Return the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel. If ndim is 1, a single integer is returned. If ndim is 2 or 3, a tuple of the given number of integers is returned.
  Computation of the first integer is as follows:
  cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
  and is similar for the other two indices, but using the y and z attributes.
- numba.cuda.gridsize(ndim)¶
  Return the absolute size (or shape) in threads of the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel.
  Computation of the first integer is as follows:
  cuda.blockDim.x * cuda.gridDim.x
  and is similar for the other two indices, but using the y and z attributes.
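
A common pattern combining grid and gridsize is a grid-stride loop, sketched below with an illustrative kernel:

from numba import cuda

@cuda.jit
def increment_all(arr):
    start = cuda.grid(1)        # absolute index of this thread
    stride = cuda.gridsize(1)   # total number of threads in the grid
    # Grid-stride loop: works even when the array is larger than the grid
    for i in range(start, arr.size, stride):
        arr[i] += 1
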
Memory Management¶
- numba.cuda.shared.array(shape, dtype)¶
  Creates an array in the shared memory space of the CUDA kernel with the given shape and dtype.
  Returns an array with its content uninitialized.
  Note: All threads in the same thread block see the same array.
- numba.cuda.local.array(shape, dtype)¶
  Creates an array in the local memory space of the CUDA kernel with the given shape and dtype.
  Returns an array with its content uninitialized.
  Note: Each thread sees a unique array.
- numba.cuda.const.array_like(ary)¶
  Copies the ary into constant memory space on the CUDA kernel at compile time.
  Returns an array like the ary argument.
  Note: All threads and blocks see the same array.
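
A minimal sketch contrasting the three memory spaces; the names, sizes, and block shape are illustrative, and the constant array must be a global known at compile time:

import numpy as np
from numba import cuda, float32

# Hypothetical lookup table, copied into constant memory at compile time
LUT = np.arange(16, dtype=np.float32)

@cuda.jit
def memory_spaces(out):
    block_buf = cuda.shared.array(shape=128, dtype=float32)   # one array per block
    thread_buf = cuda.local.array(shape=4, dtype=float32)     # one array per thread
    lut = cuda.const.array_like(LUT)                          # read-only, seen by all

    i = cuda.grid(1)
    if i < out.size:
        block_buf[cuda.threadIdx.x] = lut[i % 16]
        thread_buf[0] = block_buf[cuda.threadIdx.x]
        out[i] = thread_buf[0]
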
Synchronization and Atomic Operations¶
- numba.cuda.atomic.add(array, idx, value)¶
  Perform array[idx] += value. Supports int32, int64, float32 and float64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.
  Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
- numba.cuda.atomic.sub(array, idx, value)¶
  Perform array[idx] -= value. Supports int32, int64, float32 and float64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.
  Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
- numba.cuda.atomic.and_(array, idx, value)¶
  Perform array[idx] &= value. Supports int32, uint32, int64, and uint64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.
  Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
- numba.cuda.atomic.or_(array, idx, value)¶
  Perform array[idx] |= value. Supports int32, uint32, int64, and uint64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.
  Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
- numba.cuda.atomic.xor(array, idx, value)¶
  Perform array[idx] ^= value. Supports int32, uint32, int64, and uint64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.
  Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
- numba.cuda.atomic.max(array, idx, value)¶
  Perform array[idx] = max(array[idx], value). Supports int32, int64, float32 and float64 only. The idx argument can be an integer or a tuple of integer indices for indexing into multi-dimensional arrays. The number of elements in idx must match the number of dimensions of array.
  Returns the value of array[idx] before storing the new value. Behaves like an atomic load.
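
For instance, a sketch of a simple histogram where many threads may update the same bin; the kernel and sizes are illustrative:

import numpy as np
from numba import cuda

@cuda.jit
def histogram(samples, bins):
    i = cuda.grid(1)
    if i < samples.size:
        # Several threads may target the same bin; atomic.add keeps the count correct
        cuda.atomic.add(bins, samples[i], 1)

samples = np.random.randint(0, 32, size=10_000).astype(np.int32)
bins = np.zeros(32, dtype=np.int32)
histogram.forall(samples.size)(samples, bins)
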
- numba.cuda.syncthreads()¶
  Synchronize all threads in the same thread block. This function implements the same pattern as barriers in traditional multi-threaded programming: this function waits until all threads in the block call it, at which point it returns control to all its callers.
- numba.cuda.syncthreads_count(predicate)¶
  An extension to numba.cuda.syncthreads where the return value is a count of the threads where predicate is true.
- numba.cuda.syncthreads_and(predicate)¶
  An extension to numba.cuda.syncthreads where 1 is returned if predicate is true for all threads or 0 otherwise.
- numba.cuda.syncthreads_or(predicate)¶
  An extension to numba.cuda.syncthreads where 1 is returned if predicate is true for any thread or 0 otherwise.

Warning
All syncthreads functions must be called by every thread in the thread block. Failing to do so may result in undefined behavior.
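
As a sketch, a block-level sum reduction that uses shared memory and syncthreads; the block size of 128 is an illustrative assumption that must match the launch configuration:

from numba import cuda, float32

@cuda.jit
def block_sum(x, partial):
    buf = cuda.shared.array(shape=128, dtype=float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    if i < x.size:
        buf[tid] = x[i]
    else:
        buf[tid] = 0.0
    cuda.syncthreads()              # all writes to shared memory are visible

    s = 64
    while s > 0:
        if tid < s:
            buf[tid] += buf[tid + s]
        cuda.syncthreads()          # every thread in the block reaches the barrier
        s //= 2

    if tid == 0:
        partial[cuda.blockIdx.x] = buf[0]
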
Cooperative Groups¶
- numba.cuda.cg.this_grid()¶
  Get the current grid group.
  Returns: The current grid group.
  Return type: GridGroup
class
numba.cuda.cg.
GridGroup
¶ A grid group. Users should not construct a GridGroup directly - instead, get the current grid group using
cg.this_grid()
.-
sync
()¶ Synchronize the current grid group.
-
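
A sketch of grid-wide synchronization; cooperative launches require a supported device and driver and a grid that fits entirely on the device, and the kernel below is illustrative:

from numba import cuda

@cuda.jit
def two_phase(data):
    g = cuda.cg.this_grid()
    i = cuda.grid(1)
    if i < data.size:
        data[i] += 1
    g.sync()    # every thread in the entire grid has finished phase one
    if i < data.size:
        data[i] *= 2
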
Memory Fences¶
Memory fences are used to guarantee that the effects of memory operations are visible to other threads within the same thread block, the same GPU device, and the same system (across GPUs on global memory). Memory loads and stores are guaranteed not to move across the memory fences by optimization passes.
Warning
The memory fences are considered to be an advanced API and most use cases should use the thread barrier (e.g. syncthreads()).
- numba.cuda.threadfence()¶
  A memory fence at device level (within the GPU).
- numba.cuda.threadfence_block()¶
  A memory fence at thread block level.
- numba.cuda.threadfence_system()¶
  A memory fence at system level (across GPUs).
Warp Intrinsics¶
The argument membermask is a 32 bit integer mask with each bit corresponding to a thread in the warp, with 1 meaning the thread is in the subset of threads within the function call. The membermask must be all 1 if the GPU compute capability is below 7.x.
- numba.cuda.syncwarp(membermask)¶
  Synchronize a masked subset of the threads in a warp.
- numba.cuda.all_sync(membermask, predicate)¶
  If the predicate is true for all threads in the masked warp, then a non-zero value is returned, otherwise 0 is returned.
- numba.cuda.any_sync(membermask, predicate)¶
  If the predicate is true for any thread in the masked warp, then a non-zero value is returned, otherwise 0 is returned.
- numba.cuda.eq_sync(membermask, predicate)¶
  If the boolean predicate is the same for all threads in the masked warp, then a non-zero value is returned, otherwise 0 is returned.
- numba.cuda.ballot_sync(membermask, predicate)¶
  Returns a mask of all threads in the warp whose predicate is true, and are within the given mask.
- numba.cuda.shfl_sync(membermask, value, src_lane)¶
  Shuffles value across the masked warp and returns the value from src_lane. If this is outside the warp, then the given value is returned.
- numba.cuda.shfl_up_sync(membermask, value, delta)¶
  Shuffles value across the masked warp and returns the value from laneid - delta. If this is outside the warp, then the given value is returned.
- numba.cuda.shfl_down_sync(membermask, value, delta)¶
  Shuffles value across the masked warp and returns the value from laneid + delta. If this is outside the warp, then the given value is returned.
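
A sketch of a warp-level sum reduction using shfl_down_sync; the mask 0xffffffff assumes all 32 lanes are active, and the grid and array sizing are illustrative:

from numba import cuda

@cuda.jit
def warp_sums(x, out):
    i = cuda.grid(1)
    value = x[i] if i < x.size else 0.0
    # Tree reduction within a warp: combine lanes 16, 8, 4, 2 and 1 apart
    offset = 16
    while offset > 0:
        value += cuda.shfl_down_sync(0xffffffff, value, offset)
        offset //= 2
    if cuda.laneid == 0 and i // 32 < out.size:
        out[i // 32] = value    # one partial sum per warp
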
- numba.cuda.shfl_xor_sync(membermask, value, lane_mask)¶
  Shuffles value across the masked warp and returns the value from laneid ^ lane_mask.
- numba.cuda.match_any_sync(membermask, value, lane_mask)¶
  Returns a mask of threads that have the same value as the given value from within the masked warp.
- numba.cuda.match_all_sync(membermask, value, lane_mask)¶
  Returns a tuple of (mask, pred), where mask is a mask of threads that have the same value as the given value from within the masked warp, if they all have the same value, otherwise it is 0. And pred is a boolean of whether or not all threads in the masked warp have the same value.
Integer Intrinsics¶
A subset of the CUDA Math API’s integer intrinsics are available. For further documentation, including semantics, please refer to the CUDA Toolkit documentation.
- numba.cuda.popc()¶
  Returns the number of set bits in the given value.
- numba.cuda.brev()¶
  Reverses the bit pattern of an integer value, for example 0b10110110 becomes 0b01101101.
- numba.cuda.clz()¶
  Counts the number of leading zeros in a value.
- numba.cuda.ffs()¶
  Find the position of the least significant bit set to 1 in an integer.
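
A sketch exercising these intrinsics; the kernel and output layout are illustrative:

import numpy as np
from numba import cuda

@cuda.jit
def bit_stats(values, out):
    i = cuda.grid(1)
    if i < values.size:
        v = values[i]
        out[i, 0] = cuda.popc(v)   # number of set bits
        out[i, 1] = cuda.clz(v)    # leading zeros
        out[i, 2] = cuda.ffs(v)    # position of the least significant set bit
        out[i, 3] = cuda.brev(v)   # bit-reversed value
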
Floating Point Intrinsics¶
A subset of the CUDA Math API’s floating point intrinsics are available. For further documentation, including semantics, please refer to the single and double precision parts of the CUDA Toolkit documentation.
- numba.cuda.fma()¶
  Perform the fused multiply-add operation. Named after the fma and fmaf functions in the C API, but maps to the fma.rn.f32 and fma.rn.f64 (round-to-nearest-even) PTX instructions.
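
For example, a minimal sketch with an illustrative kernel:

from numba import cuda

@cuda.jit
def fused(a, b, c, out):
    i = cuda.grid(1)
    if i < out.size:
        # One rounding step, rather than rounding the product and the sum separately
        out[i] = cuda.fma(a[i], b[i], c[i])
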
Control Flow Instructions¶
A subset of CUDA's control flow instructions are directly available as intrinsics. Avoiding branches is a key way to improve CUDA performance, and using these intrinsics means you don't have to rely on the nvcc optimizer identifying and removing branches. For further documentation, including semantics, please refer to the relevant CUDA Toolkit documentation.
- numba.cuda.selp()¶
  Select between two expressions, depending on the value of the first argument. Similar to LLVM's select instruction.
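
A branch-free clamp might be sketched as follows, with an illustrative kernel:

from numba import cuda

@cuda.jit
def clamp_negative(x, out):
    i = cuda.grid(1)
    if i < x.size:
        # Selects 0.0 when x[i] is negative, x[i] otherwise, without a branch
        out[i] = cuda.selp(x[i] < 0.0, 0.0, x[i])
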