cuda- FFI binding to the CUDA interface for programming NVIDIA GPUs

Copyright[2009..2018] Trevor L. McDonell
Safe HaskellNone



This module defines an interface to the CUDA driver API. The Driver API is a lower-level interface to CUDA devices than that provided by the Runtime API. Using the Driver API, the programmer must deal explicitly with operations such as initialisation, context management, and loading (kernel) modules. Although more difficult to use initially, the Driver API provides more control over how CUDA is used. Furthermore, since it does not require compiling and linking the program with nvcc, the Driver API provides better inter-language compatibility.

The following is a short tutorial on using the Driver API. The steps can be copied into a file, or run directly in ghci, in which case ghci should be launched with the option -fno-ghci-sandbox. This is because CUDA maintains CPU-local state, so operations should always be run from a bound thread.

Using the Driver API

Before any operation can be performed, the Driver API must be initialised:

>>> import Foreign.CUDA.Driver
>>> initialise []

Next, we must select a GPU that we will execute operations on. Each GPU is assigned a unique identifier (beginning at zero). We can get a handle to a compute device at a given ordinal using the device operation. Given a device handle, we can query the properties of that device using props. The number of available CUDA-capable devices is given via count. For example:

>>> count
>>> dev0 <- device 0
>>> props dev0
DeviceProperties {deviceName = "GeForce GT 650M", computeCapability = 3.0, ...}

This package also includes the executable 'nvidia-device-query', which when executed displays the key properties of all available devices. See Foreign.CUDA.Driver.Device for additional operations to query the capabilities or status of a device.

Once you have chosen a device to use, the next step is to create a CUDA context. A context is associated with a particular device, and all operations, such as memory allocation and kernel execution, take place with respect to that context. For example, to create a new execution context on CUDA device 0:

>>> ctx <- create dev0 []

The second argument is a set of ContextFlags which control how the context behaves in various situations, for example, whether or not the CPU should actively spin when waiting for results from the GPU (SchedSpin), or to yield control to other threads instead (SchedYield).

The newly created context is now the active context, and all subsequent operations take place within that context. More than one context can be created per device, but resources, such as memory allocated in the GPU, are unique to each context. The module Foreign.CUDA.Driver.Context contains operations for managing multiple contexts. Some devices allow data to be shared between contexts without copying, see Foreign.CUDA.Driver.Context.Peer for more information.

Once the context is no longer needed, it should be destroyed in order to free up any resources that were allocated to it.

>>> destroy ctx

Each device also has a unique context which is used by the Runtime API. This context can be accessed with the module Foreign.CUDA.Driver.Context.Primary.

Executing kernels onto the GPU

Once the Driver API is initialised and an execution context is created on the GPU, we can begin to interact with it.

At an example, we'll step through executing the CUDA equivalent of the following Haskell function, which element-wise adds the elements of two arrays:

>>> vecAdd xs ys = zipWith (+) xs ys

The following CUDA kernel can be used to implement this on the GPU:

extern "C" __global__ void vecAdd(float *xs, float *ys, float *zs, int N)
    int ix = blockIdx.x * blockDim.x + threadIdx.x;

    if ( ix < N ) {
        zs[ix] = xs[ix] + ys[ix];

Here, the global keyword marks the function as a kernel that should be computed on the GPU in data parallel. When we execute this function on the GPU, (at least) N threads will execute N individual instances of the kernel function vecAdd. Each thread will operate on a single element of each input array to create a single value in the result. See the CUDA programming guide for more details.

We can save this to a file, and compile it using nvcc into a form that we can then load onto the GPU and execute:

$ nvcc --ptx

The module Foreign.CUDA.Driver.Module contains functions for loading the resulting .ptx file (or .cubin files) into the running program.

>>> mdl <- loadFile "vector_add.ptx"

Once finished with the module, it is also a good idea to unload it.

Modules may export kernel functions, global variables, and texture references. Before we can execute our function, we need to look it up in the module by name.

>>> vecAdd <- getFun mdl "vecAdd"

Given this reference to our kernel function, we are almost ready to execute it on the device using launchKernel, but first, we must create some data that we can execute the function on.

Transferring data to and from the GPU

GPUs typically have their own memory which is separate from the CPU's memory, and we need to explicitly copy data back and forth between these two regions. The module Foreign.CUDA.Driver.Marshal provides functions for allocating memory on the GPU, and copying data between the CPU and GPU, as well as directly between multiple GPUs.

For simplicity, we'll use standard Haskell lists for our input and output data structure. Note however that this will have significantly lower effective bandwidth than reading a single contiguous region of memory, so for most practical purposes you will want to use some kind of unboxed array.

>>> let xs = [1..1024]   :: [Float]
>>> let ys = [2,4..2048] :: [Float]

In CUDA, like C, all memory management is explicit, and arrays on the device must be explicitly allocated and freed. As mentioned previously, data transfer is also explicit. However, we do provide convenience functions for combined allocation and marshalling, as well as bracketed operations.

>>> xs_dev <- newListArray xs
>>> ys_dev <- newListArray ys
>>> zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)

After executing the kernel (see next section), we transfer the result back to the host, and free the memory that was allocated on the GPU.

>>> zs <- peekListArray 1024 zs_dev
>>> free xs_dev
>>> free ys_dev
>>> free zs_dev
Piecing it all together

Finally, we have everything in place to execute our operation on the GPU. Launching a kernel on the GPU consists of creating many threads on the GPU which all execute the same function, and each thread has a unique identifier in the grid/block hierarchy which can be used to identify exactly which element this thread should process (the blockIdx and threadIdx parameters that we saw earlier, respectively).

To execute our function, we will use a grid of 4 blocks, each containing 256 threads. Thus, a total of 1024 threads will be launched, which will each compute a single element of the output array (recall that our input arrays each have 1024 elements). The module Foreign.CUDA.Analysis.Occupancy contains functions to help determine the ideal thread block size for a given kernel and GPU combination.

>>> launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]

Note that kernel execution is asynchronous, so we should also wait for the operation to complete before attempting to read the results back.

>>> sync

And that's it!

Next steps

As mentioned at the end of the previous section, kernels on the GPU are executed asynchronously with respect to the host, and other operations such as data transfers can also be executed asynchronously. This allows the CPU to continue doing other work while the GPU is busy. Events can be used to check whether an operation has completed yet.

It is also possible to execute multiple kernels or data transfers concurrently with each other, by assigning those operations to different execution Streams. Used in conjunction with Events, operations will be scheduled efficiently only once all dependencies (in the form of Events) have been cleared.

See Foreign.CUDA.Driver.Event and Foreign.CUDA.Driver.Stream for more information on this topic.



newtype Context Source #

A device context




Eq Context Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Base


(==) :: Context -> Context -> Bool #

(/=) :: Context -> Context -> Bool #

Show Context Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Base

create :: Device -> [ContextFlag] -> IO Context Source #

Create a new CUDA context and associate it with the calling thread. The context is created with a usage count of one, and the caller of create must call destroy when done using the context. If a context is already current to the thread, it is supplanted by the newly created context and must be restored by a subsequent call to pop.

attach :: Context -> [ContextFlag] -> IO () Source #

Deprecated: as of CUDA-4.0

Increments the usage count of the context. API: no context flags are currently supported, so this parameter must be empty.

detach :: Context -> IO () Source #

Deprecated: as of CUDA-4.0

Detach the context, and destroy if no longer used

destroy :: Context -> IO () Source #

Destroy the specified context, regardless of how many threads it is current to. The context will be poped from the current thread's context stack, but if it is current on any other threads it will remain current to those threads, and attempts to access it will result in an error.

pop :: IO Context Source #

Pop the current CUDA context from the CPU thread. The context may then be attached to a different CPU thread by calling push.

push :: Context -> IO () Source #

Push the given context onto the CPU's thread stack of current contexts. The specified context becomes the CPU thread's current context, so all operations that operate on the current context are affected.

sync :: IO () Source #

Block until the device has completed all preceding requests. If the context was created with the SchedBlockingSync flag, the CPU thread will block until the GPU has finished its work.

type StreamPriority = Int Source #

Priority of an execution stream. Work submitted to a higher priority stream may preempt execution of work already executing in a lower priority stream. Lower numbers represent higher priorities.

data PeerFlag Source #

Possible option values for direct peer memory access

accessible :: Device -> Device -> IO Bool Source #

Queries if the first device can directly access the memory of the second. If direct access is possible, it can then be enabled with add.

Requires CUDA-4.0.

add :: Context -> [PeerFlag] -> IO () Source #

If the devices of both the current and supplied contexts support unified addressing, then enable allocations in the supplied context to be accessible by the current context.

Note that access is unidirectional, and in order to access memory in the current context from the peer context, a separate symmetric call to add is required.

Requires CUDA-4.0.

remove :: Context -> IO () Source #

Disable direct memory access from the current context to the supplied peer context, and unregisters any registered allocations.

Requires CUDA-4.0.

data Cache Source #

Device cache configuration preference

Enum Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config

Eq Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config


(==) :: Cache -> Cache -> Bool #

(/=) :: Cache -> Cache -> Bool #

Show Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config


showsPrec :: Int -> Cache -> ShowS #

show :: Cache -> String #

showList :: [Cache] -> ShowS #

setLimit :: Limit -> Int -> IO () Source #

Specify the size of the call stack, for compute 2.0 devices.

Requires CUDA-3.1.

getCache :: IO Cache Source #

On devices where the L1 cache and shared memory use the same hardware resources, this function returns the preferred cache configuration for the current context.

Requires CUDA-3.2.

setCache :: Cache -> IO () Source #

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the current context. This is only a preference.

Any function configuration set via setCacheConfigFun will be preferred over this context-wide setting.

Requires CUDA-3.2.

getSharedMem :: IO SharedMem Source #

Return the current size of the shared memory banks in the current context. On devices with configurable shared memory banks, setSharedMem can be used to change the configuration, so that subsequent kernel launches will by default us the new bank size. On devices without configurable shared memory, this function returns the fixed bank size of the hardware.

Requires CUDA-4.2

setSharedMem :: SharedMem -> IO () Source #

On devices with configurable shared memory banks, this function will set the context's shared memory bank size that will be used by default for subsequent kernel launches.

Changing the shared memory configuration between launches may insert a device synchronisation.

Shared memory bank size does not affect shared memory usage or kernel occupancy, but may have major effects on performance. Larger bank sizes allow for greater potential bandwidth to shared memory, but change the kinds of accesses which result in bank conflicts.

Requires CUDA-4.2

getStreamPriorityRange :: IO (StreamPriority, StreamPriority) Source #

Returns the numerical values that correspond to the greatest and least priority execution streams in the current context respectively. Stream priorities follow the convention that lower numerical numbers correspond to higher priorities. The range of meaningful stream priorities is given by the inclusive range [greatestPriority,leastPriority].

Requires CUDA-5.5.