| Copyright | (c) [2009..2012] Trevor L. McDonell | 
|---|---|
| License | BSD | 
| Safe Haskell | None | 
| Language | Haskell98 | 
Foreign.CUDA.Analysis.Occupancy
Description
Occupancy calculations for CUDA kernels
http://developer.download.nvidia.com/compute/cuda/3_0/sdk/docs/CUDA_Occupancy_calculator.xls
Determining Registers Per Thread and Shared Memory Per Block
To determine the number of registers used per thread in your kernel, compile the kernel code with the option
--ptxas-options=-v
passed to nvcc. This will output information about register, local memory, shared
 memory, and constant memory usage for each kernel in the .cu file.
 Alternatively, you can compile with the -cubin option to nvcc. This will
 generate a .cubin file, which you can open in a text editor. Look for the
 code section with your kernel's name. Within the curly braces ({ ... })
 for that code block, you will see a line with reg = X, where X is the
 number of registers used by your kernel. You can also see the amount of
 shared memory used as smem = Y. However, if your kernel declares any
 external shared memory that is allocated dynamically, you will need to add
 the number in the .cubin file to the amount you dynamically allocate at run
 time to get the correct shared memory usage.
Notes About Occupancy
Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth bound, then increasing occupancy will not necessarily increase performance. If a kernel invocation is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off chip), divergent branches, etc. As with any optimization, you should experiment to see how changes affect the *wall clock time* of the kernel execution. For bandwidth bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance.
- data Occupancy = Occupancy { activeThreads :: !Int, activeThreadBlocks :: !Int, activeWarps :: !Int, occupancy100 :: !Double }
- occupancy :: DeviceProperties -> Int -> Int -> Int -> Occupancy
- optimalBlockSize :: DeviceProperties -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)
- optimalBlockSizeBy :: DeviceProperties -> (DeviceProperties -> [Int]) -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)
- maxResidentBlocks :: DeviceProperties -> Int -> Int -> Int -> Int
- incPow2 :: DeviceProperties -> [Int]
- incWarp :: DeviceProperties -> [Int]
- decPow2 :: DeviceProperties -> [Int]
- decWarp :: DeviceProperties -> [Int]
Documentation
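data Occupancy
Constructors
| Occupancy | |
| Fields | activeThreads :: !Int, activeThreadBlocks :: !Int, activeWarps :: !Int, occupancy100 :: !Double |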
occupancy
Arguments
| :: DeviceProperties | Properties of the card in question | 
| -> Int | Threads per block | 
| -> Int | Registers per thread | 
| -> Int | Shared memory per block (bytes) | 
| -> Occupancy | 
Calculate occupancy data for a given GPU and kernel resource usage
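The relationship between the four fields of the result can be sketched as follows. This is a simplified, self-contained model and not the library's implementation: the record `Occ`, the function `occupancyFrom`, and the limits passed in `main` (a warp size of 32 and 48 resident warps per multiprocessor) are assumptions made for illustration, and the number of resident blocks is taken as an input rather than derived from register and shared memory usage.

```haskell
module Main where

-- | Hypothetical stand-in for the result record; the field names mirror
-- the synopsis, but this type is not the library's 'Occupancy'.
data Occ = Occ
  { activeThreads      :: !Int     -- threads resident per multiprocessor
  , activeThreadBlocks :: !Int     -- blocks resident per multiprocessor
  , activeWarps        :: !Int     -- warps resident per multiprocessor
  , occupancy100       :: !Double  -- occupancy as a percentage
  } deriving (Show, Eq)

-- | Integer division, rounding up.
ceilDiv :: Int -> Int -> Int
ceilDiv a b = (a + b - 1) `div` b

-- | Derive occupancy figures from a known number of resident blocks.
occupancyFrom
  :: Int  -- ^ warp size (assumed)
  -> Int  -- ^ maximum resident warps per multiprocessor (assumed)
  -> Int  -- ^ threads per block
  -> Int  -- ^ resident blocks per multiprocessor
  -> Occ
occupancyFrom warpSize maxWarps threads blocks = Occ
  { activeThreads      = blocks * threads
  , activeThreadBlocks = blocks
  , activeWarps        = warps
  , occupancy100       = 100 * fromIntegral warps / fromIntegral maxWarps
  }
  where
    -- A partially filled warp still occupies a whole warp slot.
    warps = blocks * (threads `ceilDiv` warpSize)

main :: IO ()
main = print (occupancyFrom 32 48 256 4)
```

With these made-up limits, four resident blocks of 256 threads give 32 active warps out of 48, so the multiprocessor is about two-thirds occupied.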
optimalBlockSize
Arguments
| :: DeviceProperties | Architecture to optimise for | 
| -> (Int -> Int) | Register count as a function of thread block size | 
| -> (Int -> Int) | Shared memory usage (bytes) as a function of thread block size | 
| -> (Int, Occupancy) | 
Optimise multiprocessor occupancy as a function of thread block size and resource usage. This returns the smallest satisfying block size in increments of a single warp.
optimalBlockSizeBy :: DeviceProperties -> (DeviceProperties -> [Int]) -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)
As optimalBlockSize, but with a generator that produces the specific thread
 block sizes that should be tested. The generated list can produce values in
 any order, but the last satisfying block size will be returned. Hence, values
 should be monotonically decreasing to return the smallest block size yielding
 maximum occupancy, and monotonically increasing to return the largest.
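The effect of the candidate ordering can be illustrated with a small stand-alone sketch. It does not use the library: `lastBest` is a hypothetical selection rule that keeps the last candidate attaining the best score, which is one way to obtain the behaviour described above, and `toyOccupancy` is a made-up occupancy curve.

```haskell
module Main where

-- | Keep the last candidate whose score is at least as good as the best
-- seen so far (hypothetical helper, not part of the library).
lastBest :: (Int -> Double) -> [Int] -> Maybe (Int, Double)
lastBest score = foldl step Nothing
  where
    step Nothing x = Just (x, score x)
    step (Just (b, sb)) x
      | sx >= sb  = Just (x, sx)   -- ties go to the later candidate
      | otherwise = Just (b, sb)
      where sx = score x

-- | Made-up occupancy (percent) as a function of block size.
toyOccupancy :: Int -> Double
toyOccupancy n
  | n < 128   = 50
  | n <= 512  = 100
  | otherwise = 75

main :: IO ()
main = do
  -- Increasing candidates: the largest block size at maximum occupancy.
  print (lastBest toyOccupancy [32, 64 .. 1024])    -- Just (512,100.0)
  -- Decreasing candidates: the smallest block size at maximum occupancy.
  print (lastBest toyOccupancy [1024, 992 .. 32])   -- Just (128,100.0)
```

Both runs test the same 32 block sizes; only the traversal order differs, and that alone decides which of the equally good sizes is returned.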
maxResidentBlocks
Arguments
| :: DeviceProperties | Properties of the card in question | 
| -> Int | Threads per block | 
| -> Int | Registers per thread | 
| -> Int | Shared memory per block (bytes) | 
| -> Int | Maximum number of resident blocks | 
Determine the maximum number of cooperative thread arrays (CTAs, i.e. thread blocks) that can be resident simultaneously on a multiprocessor for a given kernel / device combination.
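As a rough illustration of how such a limit arises, the sketch below takes the minimum over the constraints imposed by each resource. It is a simplified model under stated assumptions, not the library's implementation: the `Limits` record, its field names, and the numbers in `main` are hypothetical, and real devices also round register and shared memory allocations up to a granularity that is ignored here. The thread count is assumed to be positive.

```haskell
module Main where

-- | Hypothetical per-multiprocessor limits (not 'DeviceProperties').
data Limits = Limits
  { warpSize     :: Int
  , maxWarps     :: Int  -- resident warps per multiprocessor
  , maxBlocks    :: Int  -- resident blocks per multiprocessor
  , maxRegisters :: Int  -- registers per multiprocessor
  , maxSharedMem :: Int  -- shared memory per multiprocessor (bytes)
  }

-- | Integer division, rounding up.
ceilDiv :: Int -> Int -> Int
ceilDiv a b = (a + b - 1) `div` b

-- | Resident blocks per multiprocessor, limited by whichever resource
-- runs out first. Allocation granularity is deliberately ignored.
residentBlocks
  :: Limits
  -> Int  -- ^ threads per block (must be positive)
  -> Int  -- ^ registers per thread
  -> Int  -- ^ shared memory per block (bytes)
  -> Int
residentBlocks lim threads regs smem =
  minimum [maxBlocks lim, byWarps, byRegs, bySmem]
  where
    warps   = threads `ceilDiv` warpSize lim
    byWarps = maxWarps lim `div` warps
    byRegs  | regs <= 0 = maxBlocks lim
            | otherwise = maxRegisters lim `div` (regs * warps * warpSize lim)
    bySmem  | smem <= 0 = maxBlocks lim
            | otherwise = maxSharedMem lim `div` smem

main :: IO ()
main = do
  let sm = Limits 32 48 8 32768 49152  -- made-up device limits
  print (residentBlocks sm 256 32 4096)  -- 4
```

In this example the register file is the binding constraint: warps alone would allow six blocks and shared memory twelve, but 32 registers per thread leave room for only four.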
incPow2 :: DeviceProperties -> [Int]
Increments in powers-of-two, over the range of supported thread block sizes for the given device.
incWarp :: DeviceProperties -> [Int]
Increments in the warp size of the device, over the range of supported thread block sizes.
decPow2 :: DeviceProperties -> [Int]
Decrements in powers-of-two, over the range of supported thread block sizes for the given device.
decWarp :: DeviceProperties -> [Int]
Decrements in the warp size of the device, over the range of supported thread block sizes.
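The shape of the lists these four generators produce can be sketched without a device. The functions below are hypothetical stand-ins that take the warp size and the maximum threads per block directly instead of a DeviceProperties value, and they assume the range starts at one warp; the library's exact bounds may differ.

```haskell
module Main where

-- Hypothetical stand-ins for the generators; each takes the warp size and
-- the maximum number of threads per block as plain integers.
incWarpFrom, decWarpFrom, incPow2From, decPow2From :: Int -> Int -> [Int]

-- Multiples of the warp size, ascending.
incWarpFrom warp maxThreads = [warp, 2 * warp .. maxThreads]

-- Multiples of the warp size, descending.
decWarpFrom warp maxThreads = reverse (incWarpFrom warp maxThreads)

-- Repeated doubling starting from one warp, ascending.
incPow2From warp maxThreads = takeWhile (<= maxThreads) (iterate (* 2) warp)

-- Repeated doubling starting from one warp, descending.
decPow2From warp maxThreads = reverse (incPow2From warp maxThreads)

main :: IO ()
main = do
  print (incPow2From 32 1024)            -- [32,64,128,256,512,1024]
  print (decPow2From 32 1024)            -- [1024,512,256,128,64,32]
  print (take 4 (incWarpFrom 32 1024))   -- [32,64,96,128]
  print (take 4 (decWarpFrom 32 1024))   -- [1024,992,960,928]
```

The power-of-two lists test far fewer candidates than the warp-sized ones (6 versus 32 here), at the cost of possibly skipping the best block size.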