Copyright	[2008..2014] Manuel M T Chakravarty, Gabriele Keller [2008..2009] Sean Lee [2009..2014] Trevor L. McDonell
License	BSD3
Maintainer	Trevor L. McDonell <tmcdonell@cse.unsw.edu.au>
Stability	experimental
Portability	non-portable (GHC extensions)
Safe Haskell	None
Language	Haskell98

Data.Array.Accelerate.CUDA

Contents

Synchronous execution
Asynchronous execution
Execution contexts

Description

This module implements the CUDA backend for the embedded array language Accelerate. Expressions are on-line translated into CUDA code, compiled, and executed in parallel on the GPU.

The accelerate-cuda library is hosted at: https://github.com/AccelerateHS/accelerate-cuda. Comments, bug reports, and patches, are always welcome.

Data transfer:

GPUs typically have their own attached memory, which is separate from the computer's main memory. Hence, every use operation implies copying data to the device, and every run operation must copy the results of a computation back to the host.

Thus, it is best to keep all computations in the Acc meta-language form and only run the computation once at the end, to avoid transferring (unused) intermediate results.

Note that once an array has been transferred to the GPU, it will remain there for as long as that array remains alive on the host. Any subsequent calls to use will find the array cached on the device and not re-transfer the data.

Caching and performance:

When the program runs, the Accelerate library evaluates the expression passed to run to make a series of CUDA kernels. Each kernel takes some arrays as inputs and produces arrays as output. Each kernel is a piece of CUDA code that has to be compiled and loaded onto the GPU; this can take a while, so we remember which kernels we have seen before and try to re-use them.

The goal is to make kernels that can be re-used. If we don't, the overhead of compiling new kernels can ruin performance.

For example, consider the following implementation of the function drop for vectors:

drop :: Elt e => Exp Int -> Acc (Vector e) -> Acc (Vector e)
drop n arr =
  let n' = the (unit n)
  in  backpermute (ilift1 (subtract n') (shape arr)) (ilift1 (+ n')) arr

Why did we go to the trouble of converting the n value into a scalar array using unit, and then immediately extracting that value using the?

We can look at the expression Accelerate sees by evaluating the argument to run. Here is what a typical call to drop evaluates to:

>>> drop (constant 4) (use (fromList (Z:.10) [1..]))
let a0 = use (Array (Z :. 10) [1,2,3,4,5,6,7,8,9,10]) in
let a1 = unit 4
in backpermute
     (let x0 = Z in x0 :. (indexHead (shape a0)) - (a1!x0))
     (\x0 -> let x1 = Z in x1 :. (indexHead x0) + (a1!x1))
     a0

The important thing to note is the line let a1 = unit 4. This corresponds to the scalar array we created for the n argument to drop and it is outside the call to backpermute. The backpermute function is what turns into a CUDA kernel, and to ensure that we get the same kernel each time we need the arguments to it to remain constant.

Let us see what happens if we change drop to instead use its argument n directly:

>>> drop (constant 4) (use (fromList (Z:.10) [1..]))
let a0 = use (Array (Z :. 10) [1,2,3,4,5,6,7,8,9,10])
in backpermute (Z :. -4 + (indexHead (shape a0))) (\x0 -> Z :. 4 + (indexHead x0)) a0

Instead of n being outside the call to backpermute, it is now embedded in it. This will defeat Accelerate's caching of CUDA kernels. Whenever the value of n changes, a new kernel will need to be compiled.

The rule of thumb is to make sure that any arguments that change are always passed in as arrays, not embedded in the code as constants.

How can you tell if you got it wrong? One way is to look at the code directly, as in this example. Another is to use the debugging options provided by the library. See debugging options below.

Hardware support:

CUDA devices are categorised into different 'compute capabilities', indicating what operations are supported by the hardware. For example, double precision arithmetic is only supported on devices of compute capability 1.3 or higher.

Devices generally perform best when dealing with (tuples of) 32-bit types, so be cautious when introducing 8-, 16-, or 64-bit elements. Keep in mind the size of Int and Word changes depending on the architecture GHC runs on.

In particular:

Double precision requires compute-1.3.
Bool is represented internally using Word8, Char by Word32.
If the permutation function to permute resolves to non-unique indices, the combination function requires compute-1.1.

Debugging options:

When the library is installed with the -fdebug flag, a few extra debugging options are available, input via the command line arguments. The most useful ones are:

-dverbose: Print some information on the type and capabilities of the GPU being used.
-ddump-cc: Print information about the CUDA kernels as they are compiled and run. Using this option will indicate whether your program is generating the number of kernels that you were expecting. Note that compiled kernels are cached in your home directory, and the generated code will only be displayed if it was not located in this persistent cache. To clear the cache and always print the generated code, use -fflush-cache as well.
-ddump-exec: Print each kernel as it is being executed, with timing information.

See the accelerate-cuda.cabal file for the full list of options.

Automatic Graphics Switching on Mac OS X:

Some Apple computers contain two graphics processors: a low-power integrated graphics chipset, as well as a higher-performance NVIDIA GPU. The latter is of course the one we want to use. Usually Mac OS X detects whenever a program attempts to run a CUDA function and switches to the NVIDIA GPU automatically.

However, sometimes this does not work correctly and the problem can manifest in several ways:

The program may report an error such as "No CUDA-capable device is available" or "invalid context handle".
For programs that also use OpenGL, the graphics switching might occur and the Accelerate computation complete as expected, but no OpenGL updates appear on screen.

There are several solutions:

Use a tool such as gfxCardStatus to manually select either the integrated or discrete GPU: http://gfx.io
Disable automatic graphics switching in the Energy Saver pane of System Preferences. Since this disables use of the low-power integrated GPU, this can decrease battery life.
When executing the program, disable the RTS clock by appending +RTS -V0 to the command line arguments. This disables the RTS clock and all timers that depend on it: the context switch timer and the heap profiling timer. Context switches still happen, but deterministically and at a rate much faster than normal. Automatic graphics switching will work correctly, but this method has the disadvantage of reducing performance of the program.

Synopsis

Documentation

class (Typeable * (ArrRepr a), Typeable * (ArrRepr' a), Typeable * a) => Arrays a

Minimal complete definition

arrays, arrays', toArr, toArr', fromArr, fromArr'

Instances

Arrays ()
(Arrays b, Arrays a) => Arrays (b, a)
(Shape sh, Elt e) => Arrays (Array sh e)
(Arrays c, Arrays b, Arrays a) => Arrays (c, b, a)
(Arrays d, Arrays c, Arrays b, Arrays a) => Arrays (d, c, b, a)
(Arrays e, Arrays d, Arrays c, Arrays b, Arrays a) => Arrays (e, d, c, b, a)
(Arrays f, Arrays e, Arrays d, Arrays c, Arrays b, Arrays a) => Arrays (f, e, d, c, b, a)
(Arrays g, Arrays f, Arrays e, Arrays d, Arrays c, Arrays b, Arrays a) => Arrays (g, f, e, d, c, b, a)
(Arrays h, Arrays g, Arrays f, Arrays e, Arrays d, Arrays c, Arrays b, Arrays a) => Arrays (h, g, f, e, d, c, b, a)
(Arrays i, Arrays h, Arrays g, Arrays f, Arrays e, Arrays d, Arrays c, Arrays b, Arrays a) => Arrays (i, h, g, f, e, d, c, b, a)

Synchronous execution

run :: Arrays a => Acc a -> a Source

Compile and run a complete embedded array program using the CUDA backend. This will select the fastest device available on which to execute computations, based on compute capability and estimated maximum GFLOPS.

Note that it is recommended you use run1 whenever possible.

run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b Source

Prepare and execute an embedded array program of one argument.

This function can be used to improve performance in cases where the array program is constant between invocations, because it allows us to bypass all front-end conversion stages and move directly to the execution phase. If you have a computation applied repeatedly to different input data, use this. If the function is only evaluated once, this is equivalent to run.

To use run1 you must express your program as a function of one argument. If your program takes more than one argument, you can use lift and unlift to tuple up the arguments.

At an example, once your program is expressed as a function of one argument, instead of the usual:

step :: Acc (Vector a) -> Acc (Vector b)
step = ...

simulate :: Vector a -> Vector b
simulate xs = run $ step (use xs)

Instead write:

simulate xs = run1 step xs

You can use the debugging options to check whether this is working successfully by, for example, observing no output from the -ddump-cc flag at the second and subsequent invocations.

See the programs in the 'accelerate-examples' package for examples.

stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b] Source

Stream a lazily read list of input arrays through the given program, collecting results as we go.

runIn :: Arrays a => Context -> Acc a -> a Source

As run, but execute using the specified device context rather than using the default, automatically selected device.

Contexts passed to this function may all refer to the same device, or to separate devices of differing compute capabilities.

Note that each thread has a stack of current contexts, and calling create pushes the new context on top of the stack and makes it current with the calling thread. You should call pop to make the context floating before passing it to runIn, which will make it current for the duration of evaluating the expression. See the CUDA C Programming Guide (G.1) for more information.

run1In :: (Arrays a, Arrays b) => Context -> (Acc a -> Acc b) -> a -> b Source

As run1, but execute in the specified context.

streamIn :: (Arrays a, Arrays b) => Context -> (Acc a -> Acc b) -> [a] -> [b] Source

As stream, but execute in the specified context.

Asynchronous execution

data Async a Source

wait :: Async a -> IO a Source

Block the calling thread until the computation completes, then return the result.

poll :: Async a -> IO (Maybe a) Source

Test whether the asynchronous computation has already completed. If so, return the result, else Nothing.

cancel :: Async a -> IO () Source

Cancel a running asynchronous computation.

runAsync :: Arrays a => Acc a -> Async a Source

As run, but allow the computation to continue running in a thread and return immediately without waiting for the result. The status of the computation can be queried using wait, poll, and cancel.

Note that a CUDA Context can be active on only one host thread at a time. If you want to execute multiple computations in parallel, use runAsyncIn.

run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> Async b Source

As run1, but the computation is executed asynchronously.

runAsyncIn :: Arrays a => Context -> Acc a -> Async a Source

As runIn, but execute asynchronously. Be sure not to destroy the context, or attempt to attach it to a different host thread, before all outstanding operations have completed.

run1AsyncIn :: (Arrays a, Arrays b) => Context -> (Acc a -> Acc b) -> a -> Async b Source

As run1In, but execute asynchronously.

Execution contexts

data Context Source

The execution context

Instances

Eq Context
MonadReader Context CIO

create :: Device -> [ContextFlag] -> IO Context Source

Create a new CUDA context associated with the calling thread

destroy :: Context -> IO () Source

Destroy the specified context. This will fail if the context is more than single attachment.