cuda-0.6.0.0: FFI binding to the CUDA interface for programming NVIDIA GPUs

Copyright    (c) [2009..2012] Trevor L. McDonell
License      BSD
Safe Haskell None
Language     Haskell98

Foreign.CUDA.Analysis.Occupancy

Description

Occupancy calculations for CUDA kernels

http://developer.download.nvidia.com/compute/cuda/3_0/sdk/docs/CUDA_Occupancy_calculator.xls

Determining Registers Per Thread and Shared Memory Per Block

To determine the number of registers used per thread in your kernel, simply compile the kernel code using the option

--ptxas-options=-v

to nvcc. This will output information about register, local memory, shared memory, and constant memory usage for each kernel in the .cu file.

Alternatively, you can compile with the -cubin option to nvcc. This generates a .cubin file, which you can open in a text editor. Look for the code section with your kernel's name: within the curly braces ({ ... }) for that code block you will see a line reg = X, where X is the number of registers used by your kernel, and the amount of statically allocated shared memory as smem = Y. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the number in the .cubin file to the amount you allocate at run time to get the total shared memory usage.

Notes About Occupancy

Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth bound, then increasing occupancy will not necessarily increase performance. If a kernel invocation is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off-chip), divergent branches, etc. As with any optimization, you should experiment to see how changes affect the wall clock time of the kernel execution. For bandwidth-bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance.


Documentation

data Occupancy

Constructors

Occupancy
  activeThreads      :: !Int     -- Active threads per multiprocessor
  activeThreadBlocks :: !Int     -- Active thread blocks per multiprocessor
  activeWarps        :: !Int     -- Active warps per multiprocessor
  occupancy100       :: !Double  -- Occupancy of each multiprocessor (percent)

occupancy

  :: DeviceProperties  -- Properties of the card in question
  -> Int               -- Threads per block
  -> Int               -- Registers per thread
  -> Int               -- Shared memory per block (bytes)
  -> Occupancy

Calculate occupancy data for a given GPU and kernel resource usage.
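A minimal usage sketch follows. The resource figures (256 threads per block, 21 registers per thread, 3072 bytes of static shared memory) are hypothetical stand-ins for the numbers reported by ptxas, and the DeviceProperties value is assumed to have been queried elsewhere; adjust the import if DeviceProperties is re-exported from a different module in your setup.

import Foreign.CUDA.Analysis.Device    (DeviceProperties)
import Foreign.CUDA.Analysis.Occupancy

-- Print occupancy statistics for a hypothetical kernel that uses
-- 21 registers per thread and 3072 bytes of static shared memory,
-- launched with 256 threads per block.
reportOccupancy :: DeviceProperties -> IO ()
reportOccupancy dev = do
  let occ = occupancy dev 256 21 3072
  putStrLn $ "Active threads per MP: " ++ show (activeThreads occ)
  putStrLn $ "Active warps per MP:   " ++ show (activeWarps occ)
  putStrLn $ "Active blocks per MP:  " ++ show (activeThreadBlocks occ)
  putStrLn $ "Occupancy (%):         " ++ show (occupancy100 occ)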

optimalBlockSize

  :: DeviceProperties  -- Architecture to optimise for
  -> (Int -> Int)      -- Register count as a function of thread block size
  -> (Int -> Int)      -- Shared memory usage (bytes) as a function of thread block size
  -> (Int, Occupancy)

Optimise multiprocessor occupancy as a function of thread block size and resource usage. Candidate block sizes are stepped in increments of a single warp, and the smallest block size yielding maximum occupancy is returned.
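For example, a minimal sketch assuming a kernel whose register and shared memory requirements do not depend on the block size (here, 32 registers per thread and no statically allocated shared memory; dev is a DeviceProperties value as in the sketch above):

-- Smallest block size (in whole warps) that yields maximum occupancy
-- for a kernel using 32 registers per thread and no static shared memory.
bestBlockSize :: DeviceProperties -> (Int, Occupancy)
bestBlockSize dev = optimalBlockSize dev (const 32) (const 0)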

optimalBlockSizeBy :: DeviceProperties -> (DeviceProperties -> [Int]) -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)

As optimalBlockSize, but with a generator that produces the specific thread block sizes to test. The generated list may produce values in any order, but the last satisfying block size is the one returned. Hence, values should be monotonically decreasing to return the smallest block size yielding maximum occupancy, and vice versa.
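To illustrate the generator argument (the register and shared memory figures are again hypothetical): passing incPow2 tests increasing powers of two, so the last, and therefore largest, power-of-two block size attaining maximum occupancy is returned; passing decWarp tests decreasing whole-warp sizes, so the smallest such block size is returned, matching the behaviour documented for optimalBlockSize.

-- Largest power-of-two block size attaining maximum occupancy.
largestPow2Block :: DeviceProperties -> (Int, Occupancy)
largestPow2Block dev = optimalBlockSizeBy dev incPow2 (const 32) (const 0)

-- Smallest whole-warp block size attaining maximum occupancy.
smallestWarpBlock :: DeviceProperties -> (Int, Occupancy)
smallestWarpBlock dev = optimalBlockSizeBy dev decWarp (const 32) (const 0)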

maxResidentBlocks

  :: DeviceProperties  -- Properties of the card in question
  -> Int               -- Threads per block
  -> Int               -- Registers per thread
  -> Int               -- Shared memory per block (bytes)
  -> Int               -- Maximum number of resident blocks

Determine the maximum number of CTAs (thread blocks) that can be run simultaneously for a given kernel / device combination.
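A small sketch, reusing the hypothetical resource figures from the occupancy example above:

-- Maximum number of 256-thread blocks, each using 21 registers per thread
-- and 3072 bytes of static shared memory, that can be resident at once.
residentBlocks :: DeviceProperties -> Int
residentBlocks dev = maxResidentBlocks dev 256 21 3072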

incPow2 :: DeviceProperties -> [Int]

Increments in powers-of-two, over the range of supported thread block sizes for the given device.

incWarp :: DeviceProperties -> [Int]

Increments in the warp size of the device, over the range of supported thread block sizes.

decPow2 :: DeviceProperties -> [Int]

Decrements in powers-of-two, over the range of supported thread block sizes for the given device.

decWarp :: DeviceProperties -> [Int]

Decrements in the warp size of the device, over the range of supported thread block sizes.