| Copyright | [2014..2017] Trevor L. McDonell, [2014..2014] Vinod Grover (NVIDIA Corporation) |
|---|---|
| License | BSD3 | 
| Maintainer | Trevor L. McDonell <tmcdonell@cse.unsw.edu.au> | 
| Stability | experimental | 
| Portability | non-portable (GHC extensions) | 
| Safe Haskell | None | 
| Language | Haskell2010 | 
Data.Array.Accelerate.LLVM.PTX
Contents
Description
This module implements a backend for the Accelerate language targeting NVPTX for execution on NVIDIA GPUs. Expressions are translated on-line into LLVM code, which is then just-in-time compiled and executed in parallel on the GPU.
- data Acc a :: * -> *
- class (Typeable * a, Typeable * (ArrRepr a)) => Arrays a
- run :: Arrays a => Acc a -> a
- runWith :: Arrays a => PTX -> Acc a -> a
- run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b
- run1With :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> b
- stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b]
- streamWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> [a] -> [b]
- data Async a :: * -> *
- wait :: Async a -> IO a
- poll :: Async a -> IO (Maybe a)
- cancel :: Async a -> IO ()
- runAsync :: Arrays a => Acc a -> IO (Async a)
- runAsyncWith :: Arrays a => PTX -> Acc a -> IO (Async a)
- run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> IO (Async b)
- run1AsyncWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> IO (Async b)
- data PTX
- createTargetForDevice :: Device -> DeviceProperties -> [ContextFlag] -> IO PTX
- createTargetFromContext :: Context -> IO PTX
- registerPinnedAllocator :: IO ()
- registerPinnedAllocatorWith :: PTX -> IO ()
Documentation
Accelerate is an embedded language that distinguishes between vanilla arrays (e.g. in Haskell memory on the CPU) and embedded arrays (e.g. in device memory on a GPU), as well as the computations on both of these. Since Accelerate is an embedded language, programs written in Accelerate are not compiled by the Haskell compiler (GHC). Rather, each Accelerate backend is a runtime compiler which generates and executes parallel SIMD code of the target language at application runtime.
The type constructor Acc represents embedded collective array operations.
 A term of type Acc a is an Accelerate program which, once executed, will
 produce a value of type a (an Array or a tuple of Arrays). Collective
 operations of type Acc a comprise many scalar expressions, wrapped in
 type constructor Exp, which will be executed in parallel. Although
 collective operations comprise many scalar operations executed in parallel,
 scalar operations cannot initiate new collective operations: this
 stratification between scalar operations in Exp and array operations in
 Acc helps statically exclude nested data parallelism, which is difficult
 to execute efficiently on constrained hardware such as GPUs.
For example, to compute a vector dot product we could write:
dotp :: Num a => Vector a -> Vector a -> Acc (Scalar a)
dotp xs ys =
  let
      xs' = use xs
      ys' = use ys
  in
  fold (+) 0 ( zipWith (*) xs' ys' )

The function dotp consumes two one-dimensional arrays (Vectors) of
 values, and produces a single (Scalar) result as output. As the return type
 is wrapped in the type Acc, we see that it is an embedded Accelerate
 computation - it will be evaluated in the object language of dynamically
 generated parallel code, rather than the meta language of vanilla Haskell.
As the arguments to dotp are plain Haskell arrays, to make these available
 to Accelerate computations they must be embedded with the
 use function.
An Accelerate backend is used to evaluate the embedded computation and return
 the result back to vanilla Haskell. Calling the run function of a backend
 will generate code for the target architecture, compile, and execute it. For
 example, the following backends are available:
- accelerate-llvm-native: for execution on multicore CPUs
- accelerate-llvm-ptx: for execution on NVIDIA CUDA-capable GPUs
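As a minimal sketch of using this backend, the dot product above can be compiled and executed with run. The qualified module aliases and the concrete Float element type are illustrative assumptions; dotp is restated here against qualified names to avoid clashes with the Prelude:

```haskell
import qualified Data.Array.Accelerate          as A
import qualified Data.Array.Accelerate.LLVM.PTX as PTX

-- the dot product from above, written against qualified names
dotp :: A.Vector Float -> A.Vector Float -> A.Acc (A.Scalar Float)
dotp xs ys = A.fold (+) 0 ( A.zipWith (*) (A.use xs) (A.use ys) )

main :: IO ()
main = do
  let xs = A.fromList (A.Z A.:. 10) [0..] :: A.Vector Float
      ys = A.fromList (A.Z A.:. 10) [1..] :: A.Vector Float
  -- 'run' compiles the embedded program and executes it on the GPU
  print ( PTX.run (dotp xs ys) )
```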
See also Exp, which encapsulates embedded scalar computations.
- Fusion:
Array computations of type Acc will be subject to array fusion;
 Accelerate will combine individual Acc computations into a single
 computation, which reduces the number of traversals over the input data and
 thus improves performance. As such, it is often useful to have some intuition
 on when fusion should occur.
The main idea is to first partition array operations into two categories:
- Element-wise operations, such as map, generate, and backpermute. Each element of these operations can be computed independently of all others.
- Collective operations, such as fold, scanl, and stencil. Computing each output element of these operations requires reading multiple elements from the input array(s).
Element-wise operations fuse together whenever the consumer operation uses a single element of the input array. Element-wise operations can both fuse their inputs into themselves, as well as be fused into later operations. Both of these examples should fuse into a single loop:
map -> reverse -> reshape -> map -> map
map -> backpermute ->
                      zipWith -> map
          generate ->

If the consumer operation uses more than one element of the input array
 (typically, via generate indexing an array multiple
 times), then the input array will be completely evaluated first; no fusion
 occurs in this case, because fusing the first operation into the second
 implies duplicating work.
On the other hand, collective operations can fuse their input arrays into themselves, but on output always evaluate to an array; collective operations will not be fused into a later step. For example:
     use ->
            zipWith -> fold |-> map
generate ->

Here the element-wise sequence (use + generate + zipWith) will fuse into a single operation, which then fuses into the collective fold operation. At this point in the program the fold must now be evaluated. In the final step the map reads in the array produced by fold. As there is no fusion between the fold and map steps, this program consists of two "loops": one for the use + generate + zipWith + fold step, and one for the final map step.
You can see how many operations will be executed in the fused program by
 Show-ing the Acc program, or by using the debugging option -ddump-dot
 to save the program as a graphviz DOT file.
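For instance, the following hedged sketch (names and values are illustrative) prints the optimised program for two composed maps; assuming the Show instance for Acc described above, the output should contain a single fused map rather than two:

```haskell
import qualified Data.Array.Accelerate as A

-- Show-ing an Acc term prints the optimised program after fusion; the
-- two element-wise maps below should show up as one combined 'map'.
main :: IO ()
main =
  let xs = A.fromList (A.Z A.:. 10) [0..] :: A.Vector Float
  in  print ( A.map (+1) (A.map (*2) (A.use xs)) )
```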
As a special note, the operations unzip and
 reshape, when applied to a real array, are executed
 in constant time, so in this situation these operations will not be fused.
- Tips:
- Since Acc represents embedded computations that will only be executed when evaluated by a backend, we can programmatically generate these computations using the meta language Haskell; for example, unrolling loops or embedding input values into the generated code (see the sketch following this list).
- It is usually best to keep all intermediate computations in Acc, and only run the computation at the very end to produce the final result. This enables optimisations between intermediate results (e.g. array fusion) and, if the target architecture has a separate memory space, as is the case for GPUs, avoids excessive data transfers.
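The following minimal sketch illustrates the first tip; compose and blur3 are hypothetical names, not part of this API:

```haskell
import qualified Data.Array.Accelerate as A

-- Because the pipeline is assembled in Haskell (the meta language), we
-- can "unroll" it at program-generation time: composing the function n
-- times yields one fused Acc program, not a runtime loop.
compose :: Int -> (A.Acc a -> A.Acc a) -> A.Acc a -> A.Acc a
compose n f = foldr (.) id (replicate n f)

blur3 :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float)
blur3 = compose 3 (A.map (* 0.5))
```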
class (Typeable * a, Typeable * (ArrRepr a)) => Arrays a #
Arrays consists of nested tuples of individual Arrays, currently up to 15 elements wide. Accelerate computations can thereby return multiple results.
Minimal complete definition
arrays, flavour, toArr, fromArr
Instances
| Arrays () | |
| (Arrays a, Arrays b) => Arrays (a, b) | |
| (Shape sh, Elt e) => Arrays (Array sh e) | |
| (Arrays a, Arrays b, Arrays c) => Arrays (a, b, c) | |
| (Arrays a, Arrays b, Arrays c, Arrays d) => Arrays (a, b, c, d) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e) => Arrays (a, b, c, d, e) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f) => Arrays (a, b, c, d, e, f) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g) => Arrays (a, b, c, d, e, f, g) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h) => Arrays (a, b, c, d, e, f, g, h) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i) => Arrays (a, b, c, d, e, f, g, h, i) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j) => Arrays (a, b, c, d, e, f, g, h, i, j) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k) => Arrays (a, b, c, d, e, f, g, h, i, j, k) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m, Arrays n) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m, n) | |
| (Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m, Arrays n, Arrays o) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) | |
Synchronous execution
run :: Arrays a => Acc a -> a Source #
Compile and run a complete embedded array program.
Note that it is recommended that you use run1 whenever possible.
runWith :: Arrays a => PTX -> Acc a -> a Source #
As run, but execute using the specified target rather than using the
 default, automatically selected device.
Contexts passed to this function may all target the same device, or separate devices of differing compute capabilities.
run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b Source #
Prepare and execute an embedded array program of one argument.
This function can be used to improve performance in cases where the array
 program is constant between invocations, because it enables us to bypass
 front-end conversion stages and move directly to the execution phase. If you
 have a computation applied repeatedly to different input data, use this,
 specifying any changing aspects of the computation via the input parameter.
 If the function is only evaluated once, this is equivalent to run.
To use run1 effectively you must express your program as a function of one
 argument. If your program takes more than one argument, you can use
 lift and unlift to tuple up
 the arguments.
As an example, once your program is expressed as a function of one argument, instead of the usual:

step :: Acc (Vector a) -> Acc (Vector b)
step = ...

simulate :: Vector a -> Vector b
simulate xs = run $ step (use xs)
Instead write:
simulate xs = run1 step xs
You can use the debugging options to check whether this is working
 successfully by, for example, observing no output from the -ddump-cc flag
 at the second and subsequent invocations.
See the programs in the 'accelerate-examples' package for examples.
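For instance, a hedged sketch of the lift/unlift pattern (the names dotpAcc and dotp' are illustrative) packs two vectors into a tuple and curries the compiled function:

```haskell
import qualified Data.Array.Accelerate          as A
import qualified Data.Array.Accelerate.LLVM.PTX as PTX

-- the two arguments packed into a single tuple, unpacked with 'unlift'
dotpAcc :: A.Acc (A.Vector Float, A.Vector Float) -> A.Acc (A.Scalar Float)
dotpAcc xys =
  let (xs, ys) = A.unlift xys :: (A.Acc (A.Vector Float), A.Acc (A.Vector Float))
  in  A.fold (+) 0 ( A.zipWith (*) xs ys )

-- Partially applying 'run1' yields a reusable compiled function: the
-- front-end stages run once, not on every invocation.
dotp' :: A.Vector Float -> A.Vector Float -> A.Scalar Float
dotp' = curry (PTX.run1 dotpAcc)
```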
run1With :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> b Source #
As run1, but execute using the specified target rather than using the
 default, automatically selected device.
stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b] Source #
Stream a lazily read list of input arrays through the given program, collecting results as we go.
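A minimal sketch, where the computation and names are illustrative: the same compiled program is applied to each input array in turn, and results are collected lazily.

```haskell
import qualified Data.Array.Accelerate          as A
import qualified Data.Array.Accelerate.LLVM.PTX as PTX

-- apply one compiled program to each chunk of the input list
scaleAll :: [A.Vector Float] -> [A.Vector Float]
scaleAll = PTX.stream (A.map (* 2))
```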
streamWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> [a] -> [b] Source #
As stream, but execute using the specified target.
Asynchronous execution
wait :: Async a -> IO a #
Block the calling thread until the computation completes, then return the result.
poll :: Async a -> IO (Maybe a) #
Test whether the asynchronous computation has already completed. If so,
 return the result, else Nothing.
cancel :: Async a -> IO () #
Cancel a running asynchronous computation.
runAsync :: Arrays a => Acc a -> IO (Async a) Source #
As run, but run the computation asynchronously and return immediately
 without waiting for the result. The status of the computation can be queried
 using wait, poll, and cancel.
Note that a CUDA context can be active on only one host thread at a time. If
 you want to execute multiple computations in parallel, on the same or
 different devices, use runAsyncWith.
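A hedged sketch of the asynchronous workflow (input data and names are illustrative): launch the computation, overlap other host work, then block for the result.

```haskell
import qualified Data.Array.Accelerate          as A
import qualified Data.Array.Accelerate.LLVM.PTX as PTX

main :: IO ()
main = do
  let xs = A.fromList (A.Z A.:. 1000) [1..] :: A.Vector Float
  -- launch the reduction without blocking the calling thread
  async <- PTX.runAsync (A.fold (+) 0 (A.use xs))
  -- ... the host is free to do other work here ...
  r <- PTX.wait async              -- block until the result is ready
  print r
```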
runAsyncWith :: Arrays a => PTX -> Acc a -> IO (Async a) Source #
As runWith, but execute asynchronously. Be sure not to destroy the context,
 or attempt to attach it to a different host thread, before all outstanding
 operations have completed.
run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> IO (Async b) Source #
As run1, but the computation is executed asynchronously.
run1AsyncWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> IO (Async b) Source #
As run1With, but execute asynchronously.
Execution targets
data PTX Source #
The PTX execution target for NVIDIA GPUs.
The execution target carries state specific to the current execution context. The data here, device memory and execution streams, are implicitly tied to this CUDA execution context.
Don't store anything here that is independent of the context; for example, state related to (persistent) kernel caching should not go here.
createTargetForDevice :: Device -> DeviceProperties -> [ContextFlag] -> IO PTX Source #
Create a new PTX execution target for the given device.
createTargetFromContext :: Context -> IO PTX Source #
Create a PTX execution target for the given device context.
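A hedged sketch of explicit target creation, assuming the initialise, device, and props functions from the cuda package's Foreign.CUDA.Driver module:

```haskell
import qualified Data.Array.Accelerate          as A
import qualified Data.Array.Accelerate.LLVM.PTX as PTX
import qualified Foreign.CUDA.Driver            as CUDA

main :: IO ()
main = do
  CUDA.initialise []                            -- initialise the CUDA driver
  dev <- CUDA.device 0                          -- select the first device
  prp <- CUDA.props dev
  ptx <- PTX.createTargetForDevice dev prp []   -- default context flags
  let xs = A.fromList (A.Z A.:. 10) [0..] :: A.Vector Float
  print ( PTX.runWith ptx (A.map (* 2) (A.use xs)) )
```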
Controlling host-side allocation
registerPinnedAllocator :: IO () Source #
Configure the default execution target to allocate all future host-side arrays using (CUDA) pinned memory. Any newly allocated arrays will be page-locked and directly accessible from the device, enabling high-speed (asynchronous) DMA.
Note that since the amount of available pageable memory will be reduced, overall system performance can suffer.
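A brief, illustrative sketch: register the pinned allocator once at start-up, before any arrays are allocated.

```haskell
import qualified Data.Array.Accelerate          as A
import qualified Data.Array.Accelerate.LLVM.PTX as PTX

main :: IO ()
main = do
  PTX.registerPinnedAllocator        -- before any arrays are allocated
  let xs = A.fromList (A.Z A.:. 1000) [1..] :: A.Vector Float
  print ( PTX.run (A.fold (+) 0 (A.use xs)) )
```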
registerPinnedAllocatorWith :: PTX -> IO () Source #
As with registerPinnedAllocator, but configure the given execution
 context.