cuda-0.10.1.0: FFI binding to the CUDA interface for programming NVIDIA GPUs

Copyright: [2009..2018] Trevor L. McDonell
License: BSD
Safe Haskell: None
Language: Haskell98

Foreign.CUDA.Analysis.Occupancy

Description

Occupancy calculations for CUDA kernels

Determining Registers Per Thread and Shared Memory Per Block

To determine the number of registers used per thread in your kernel, compile the kernel code with the option

--ptxas-options=-v

to nvcc. This outputs register, local memory, shared memory, and constant memory usage for each kernel in the .cu file.

Alternatively, you can compile with the -cubin option to nvcc. This generates a .cubin file, which you can open in a text editor. Look for the code section with your kernel's name. Within the curly braces ({ ... }) for that code block, you will see a line with reg = X, where X is the number of registers used by your kernel. You can also see the amount of shared memory used as smem = Y. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the number in the .cubin file to the amount you allocate at run time to get the correct shared memory usage.

Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth bound, then increasing occupancy will not necessarily increase performance. If a kernel invocation is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off chip), divergent branches, etc. As with any optimization, you should experiment to see how changes affect the *wall clock time* of the kernel execution. For bandwidth bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance.


# Documentation

data Occupancy

Constructors

Occupancy
  activeThreads      :: !Int     Active threads per multiprocessor
  activeThreadBlocks :: !Int     Active thread blocks per multiprocessor
  activeWarps        :: !Int     Active warps per multiprocessor
  occupancy100       :: !Double  Occupancy of each multiprocessor (percent)

Instances

  Eq Occupancy    Defined in Foreign.CUDA.Analysis.Occupancy
  Ord Occupancy   Defined in Foreign.CUDA.Analysis.Occupancy
  Show Occupancy  Defined in Foreign.CUDA.Analysis.Occupancy

occupancy

Arguments

  :: DeviceProperties   Properties of the card in question
  -> Int                Threads per block
  -> Int                Registers per thread
  -> Int                Shared memory per block (bytes)
  -> Occupancy

Calculate occupancy data for a given GPU and kernel resource usage
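To illustrate the kind of calculation this performs, here is a minimal, self-contained sketch (not the library's actual implementation): the number of resident blocks per multiprocessor is bounded by each resource in turn, and occupancy is the resulting fraction of active warps. The device limits below are hypothetical placeholders; the real function reads them from DeviceProperties.

```haskell
module Main where

-- Hypothetical per-multiprocessor limits (NOT taken from a real device
-- query; the library obtains these from DeviceProperties).
maxThreadsPerSM, maxBlocksPerSM, regsPerSM, smemPerSM, warpSize :: Int
maxThreadsPerSM = 2048
maxBlocksPerSM  = 32
regsPerSM       = 65536
smemPerSM       = 98304
warpSize        = 32

-- Occupancy (percent) for a kernel with the given threads per block,
-- registers per thread, and shared memory (bytes) per block.
occupancySketch :: Int -> Int -> Int -> Double
occupancySketch threads regs smem =
  let limitThreads = maxThreadsPerSM `div` threads
      limitRegs    = if regs == 0 then maxBlocksPerSM
                                  else regsPerSM `div` (regs * threads)
      limitSmem    = if smem == 0 then maxBlocksPerSM
                                  else smemPerSM `div` smem
      -- resident blocks: the tightest of the per-resource limits
      blocks       = minimum [limitThreads, limitRegs, limitSmem, maxBlocksPerSM]
      activeWarps  = blocks * (threads `div` warpSize)
      maxWarps     = maxThreadsPerSM `div` warpSize
  in  100 * fromIntegral activeWarps / fromIntegral maxWarps

main :: IO ()
main = print (occupancySketch 256 32 0)
```

For 256 threads per block and 32 registers per thread, both the thread limit and the register limit allow 8 blocks under these hypothetical figures, so all 64 warps are active.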

optimalBlockSize

Arguments

  :: DeviceProperties   Architecture to optimise for
  -> (Int -> Int)       Register count as a function of thread block size
  -> (Int -> Int)       Shared memory usage (bytes) as a function of thread block size
  -> (Int, Occupancy)

Optimise multiprocessor occupancy as a function of thread block size and resource usage. This returns the smallest satisfying block size in increments of a single warp.
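A usage sketch (requires the cuda package and a CUDA-capable device). The register and shared-memory figures here are hypothetical placeholders; in practice you would take them from the --ptxas-options=-v output for your kernel.

```haskell
import Foreign.CUDA.Analysis
import qualified Foreign.CUDA.Driver as CUDA

main :: IO ()
main = do
  CUDA.initialise []
  props <- CUDA.props =<< CUDA.device 0
  -- Hypothetical resource usage: 32 registers per thread and 1 kB of
  -- shared memory per block, independent of block size.
  let (blockSize, occ) = optimalBlockSize props (const 32) (const 1024)
  putStrLn $ "optimal block size: " ++ show blockSize
  putStrLn $ "occupancy: " ++ show (occupancy100 occ) ++ " %"
```

If register or shared memory usage varies with block size (for example, a reduction kernel allocating one word of shared memory per thread), pass a genuine function such as `\t -> 4 * t` instead of `const`.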

optimalBlockSizeBy

Arguments

  :: DeviceProperties   Architecture to optimise for
  -> [Int]              Thread block sizes to consider
  -> (Int -> Int)       Register count as a function of thread block size
  -> (Int -> Int)       Shared memory usage (bytes) as a function of thread block size
  -> (Int, Occupancy)

As optimalBlockSize, but with a generator that produces the specific thread block sizes that should be tested. The generated list can produce values in any order, but the last satisfying block size will be returned. Hence, values should be monotonically decreasing to return the smallest block size yielding maximum occupancy, and vice-versa.

maxResidentBlocks

Arguments

  :: DeviceProperties   Properties of the card in question
  -> Int                Threads per block
  -> Int                Registers per thread
  -> Int                Shared memory per block (bytes)
  -> Int                Maximum number of resident blocks

Determine the maximum number of CTAs that can be run simultaneously for a given kernel / device combination.

incPow2 :: DeviceProperties -> [Int]

Increments in powers-of-two, over the range of supported thread block sizes for the given device.

incWarp :: DeviceProperties -> [Int]

Increments in the warp size of the device, over the range of supported thread block sizes.

decPow2 :: DeviceProperties -> [Int]

Decrements in powers-of-two, over the range of supported thread block sizes for the given device.

decWarp :: DeviceProperties -> [Int]

Decrements in the warp size of the device, over the range of supported thread block sizes.
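The shapes of these block-size sequences can be pictured with a small sketch. The warp size (32) and maximum block size (1024) used here are hypothetical constants; the real generators take them from DeviceProperties.

```haskell
-- Sketches of the increment/decrement behaviour described above, for a
-- hypothetical device with warp size 32 and at most 1024 threads/block.
incWarpSketch, decWarpSketch, incPow2Sketch :: [Int]
incWarpSketch = [32, 64 .. 1024]                       -- step up by one warp
decWarpSketch = [1024, 992 .. 32]                      -- step down by one warp
incPow2Sketch = takeWhile (<= 1024) (iterate (*2) 32)  -- powers of two

main :: IO ()
main = do
  print (take 4 incWarpSketch)
  print (take 4 decWarpSketch)
  print incPow2Sketch
```

A decreasing sequence such as decWarpSketch is what you would feed to optimalBlockSizeBy to obtain the smallest block size that achieves maximum occupancy.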