accelerate-llvm-ptx-1.3.0.0: Accelerate backend for NVIDIA GPUs
Copyright    [2014..2020] The Accelerate Team
License      BSD3
Maintainer   Trevor L. McDonell <trevor.mcdonell@gmail.com>
Stability    experimental
Portability  non-portable (GHC extensions)
Safe Haskell None
Language     Haskell2010

Data.Array.Accelerate.LLVM.PTX

Description

This module implements a backend for the Accelerate language targeting NVPTX for execution on NVIDIA GPUs. Expressions are translated into LLVM code on the fly, which is then just-in-time compiled and executed in parallel on the GPU.

Documentation

data Acc a #

Accelerate is an embedded language that distinguishes between vanilla arrays (e.g. in Haskell memory on the CPU) and embedded arrays (e.g. in device memory on a GPU), as well as the computations on both of these. Since Accelerate is an embedded language, programs written in Accelerate are not compiled by the Haskell compiler (GHC). Rather, each Accelerate backend is a runtime compiler which generates and executes parallel SIMD code of the target language at application runtime.

The type constructor Acc represents embedded collective array operations. A term of type Acc a is an Accelerate program which, once executed, will produce a value of type a (an Array or a tuple of Arrays). Collective operations of type Acc a comprise many scalar expressions, wrapped in type constructor Exp, which will be executed in parallel. Although collective operations comprise many scalar operations executed in parallel, scalar operations cannot initiate new collective operations: this stratification between scalar operations in Exp and array operations in Acc helps statically exclude nested data parallelism, which is difficult to execute efficiently on constrained hardware such as GPUs.

A simple example

As a simple example, to compute a vector dot product we can write:

dotp :: Num a => Vector a -> Vector a -> Acc (Scalar a)
dotp xs ys =
  let
      xs' = use xs
      ys' = use ys
  in
  fold (+) 0 ( zipWith (*) xs' ys' )

The function dotp consumes two one-dimensional arrays (Vectors) of values, and produces a single (Scalar) result as output. As the return type is wrapped in the type Acc, we see that it is an embedded Accelerate computation - it will be evaluated in the object language of dynamically generated parallel code, rather than the meta language of vanilla Haskell.

As the arguments to dotp are plain Haskell arrays, to make these available to Accelerate computations they must be embedded with the use function.

An Accelerate backend is used to evaluate the embedded computation and return the result back to vanilla Haskell. Calling the run function of a backend will generate code for the target architecture, compile, and execute it. This package provides such a backend targeting NVIDIA GPUs via NVPTX; a CPU counterpart is available in the accelerate-llvm-native package.
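As a minimal sketch (the program structure and import aliases are illustrative, assuming the dotp definition above), running the dot product on the GPU with this package looks like:

import Data.Array.Accelerate              as A
import Data.Array.Accelerate.LLVM.PTX     as PTX

main :: IO ()
main = do
  let xs = A.fromList (Z :. 10) [0..]   :: Vector Float   -- plain, host-side arrays
      ys = A.fromList (Z :. 10) [1,3..] :: Vector Float
  print (PTX.run (dotp xs ys))   -- compile for the GPU, execute, and copy the result back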

See also Exp, which encapsulates embedded scalar computations.

Avoiding nested parallelism

As mentioned above, embedded scalar computations of type Exp cannot initiate further collective operations.

Suppose we wanted to extend our above dotp function to matrix-vector multiplication. First, let's rewrite our dotp function to take Acc arrays as input (which is typically what we want):

dotp :: Num a => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )

We might then be inclined to lift our dot-product program to the following (incorrect) matrix-vector product, by applying dotp to each row of the input matrix:

mvm_ndp :: Num a => Acc (Matrix a) -> Acc (Vector a) -> Acc (Vector a)
mvm_ndp mat vec =
  let Z :. rows :. cols  = unlift (shape mat)  :: Z :. Exp Int :. Exp Int
  in  generate (index1 rows)
               (\row -> the $ dotp vec (slice mat (lift (row :. All))))

Here, we use generate to create a one-dimensional vector by applying a function at each index, which slices out the corresponding row of the matrix and passes it to the dotp function. However, since both generate and slice are data-parallel operations, and moreover slice depends on the argument row given to it by generate, this definition requires nested data-parallelism and is thus not permitted. The clue that the definition is invalid is that, in order to create a program accepted by the type checker, we must use the function the to retrieve the result of the dotp operation, effectively concealing that dotp is a collective array computation so that it matches the type expected by generate, namely that of scalar expressions. Moreover, because we have fooled the type checker, the problem will only be discovered at program runtime.

In order to avoid this problem, we can make use of the fact that operations in Accelerate are rank polymorphic. The fold operation reduces along the innermost dimension of an array of arbitrary rank, reducing the rank (dimensionality) of the array by one. Thus, we can replicate the input vector to as many rows as there are in the input matrix, and perform the dot-product of the vector with every row simultaneously:

mvm :: A.Num a => Acc (Matrix a) -> Acc (Vector a) -> Acc (Vector a)
mvm mat vec =
  let Z :. rows :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
      vec'              = A.replicate (lift (Z :. rows :. All)) vec
  in
  A.fold (+) 0 ( A.zipWith (*) mat vec' )

Note that the intermediate, replicated array vec' is never actually created in memory; it will be fused directly into the operation which consumes it. We discuss fusion next.

Fusion

Array computations of type Acc will be subject to array fusion; Accelerate will combine individual Acc computations into a single computation, which reduces the number of traversals over the input data and thus improves performance. As such, it is often useful to have some intuition on when fusion should occur.

The main idea is to first partition array operations into two categories:

  1. Element-wise operations, such as map, generate, and backpermute. Each element of these operations can be computed independently of all others.
  2. Collective operations such as fold, scanl, and stencil. To compute each output element of these operations requires reading multiple elements from the input array(s).

Element-wise operations fuse together whenever the consumer operation uses a single element of the input array. Element-wise operations can both fuse their inputs into themselves, as well as be fused into later operations. Both of these examples should fuse into a single loop:
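The following two definitions are illustrative sketches (the function names are made up, and Data.Array.Accelerate is assumed imported as A); each should compile to a single traversal, with no intermediate arrays written to memory:

-- the inner maps fuse into the zipWith, which in turn fuses into the outer map
norm2 :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Vector Float)
norm2 xs ys = A.map sqrt (A.zipWith (+) (A.map (^2) xs) (A.map (^2) ys))

-- the index permutation (reverse) fuses into the following map
revNeg :: Acc (Vector Float) -> Acc (Vector Float)
revNeg xs = A.map negate (A.reverse xs)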

If the consumer operation uses more than one element of the input array (typically, via generate indexing an array multiple times), then the input array will be completely evaluated first; no fusion occurs in this case, because fusing the first operation into the second implies duplicating work.
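As a sketch of this second case (the function name is hypothetical), the generate below reads its input array at two different indices per output element, so the argument array is evaluated to memory first rather than fused into the generate:

addFlip :: Acc (Vector Float) -> Acc (Vector Float)
addFlip xs =
  let n = A.size xs
  in  A.generate (A.shape xs) $ \ix ->
        let i = A.unindex1 ix
        in  xs A.! ix + xs A.! A.index1 (n - 1 - i)   -- two reads of xs per output element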

On the other hand, collective operations can fuse their input arrays into themselves, but on output always evaluate to an array; collective operations will not be fused into a later step. For example:
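A sketch of such a program (the name is hypothetical, Data.Array.Accelerate imported as A):

twoKernels :: Vector Float -> Acc (Scalar Float)
twoKernels vec =
  let xs   = A.use vec                                        -- embed the input array
      iota = A.generate (A.shape xs) (\ix -> A.fromIntegral (A.unindex1 ix))
  in  A.map (/ A.fromIntegral (A.size xs))                    -- second loop: reads the result of the fold
    $ A.fold (+) 0                                            -- first loop: use + generate + zipWith fuse into this fold
    $ A.zipWith (*) xs iota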

Here the element-wise sequence (use + generate + zipWith) will fuse into a single operation, which then fuses into the collective fold operation. At this point in the program the fold must now be evaluated. In the final step the map reads in the array produced by fold. As there is no fusion between the fold and map steps, this program consists of two "loops"; one for the use + generate + zipWith + fold step, and one for the final map step.

You can see how many operations will be executed in the fused program by Show-ing the Acc program, or by using the debugging option -ddump-dot to save the program as a graphviz DOT file.

As a special note, the operations unzip and reshape, when applied to a real array, are executed in constant time, so in this situation these operations will not be fused.

Tips
  • Since Acc represents embedded computations that will only be executed when evaluated by a backend, we can programmatically generate these computations using the meta language Haskell; for example, unrolling loops or embedding input values into the generated code (see the sketch after these tips).
  • It is usually best to keep all intermediate computations in Acc, and only run the computation at the very end to produce the final result. This enables optimisations between intermediate results (e.g. array fusion) and, if the target architecture has a separate memory space, as is the case with GPUs, avoids excessive data transfers.
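As a small illustration of the first tip, loop unrolling can be done entirely in the meta language. The helper below is hypothetical (not part of the library); it composes an Accelerate transformation a fixed number of times in plain Haskell, so the backend sees one flat pipeline:

iterateN :: Int -> (Acc a -> Acc a) -> Acc a -> Acc a
iterateN n f = Prelude.foldr (.) id (Prelude.replicate n f)   -- ordinary Haskell composition; no A.fold or A.replicate involved

-- e.g. blur5 = iterateN 5 blurStep, for some blurStep :: Acc (Matrix Float) -> Acc (Matrix Float)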

Instances

Instances details
Arrays b => Afunction (Acc b) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

Associated Types

type AfunctionR (Acc b) #

type ArraysFunctionR (Acc b)

Methods

afunctionRepr :: AfunctionRepr (Acc b) (AfunctionR (Acc b)) (ArraysFunctionR (Acc b))

convertOpenAfun :: HasCallStack => Config -> ArrayLayout aenv aenv -> Acc b -> OpenAfun aenv (ArraysFunctionR (Acc b))

(Arrays a, Arrays b, ApplyAcc t) => ApplyAcc ((Acc a -> Acc b) -> t) 
Instance details

Defined in Data.Array.Accelerate.Smart

Associated Types

type FromApplyAcc ((Acc a -> Acc b) -> t)

Methods

applyAcc :: FromApplyAcc ((Acc a -> Acc b) -> t) -> (Acc a -> Acc b) -> t

(Arrays a, ApplyAcc t) => ApplyAcc (Acc a -> t) 
Instance details

Defined in Data.Array.Accelerate.Smart

Associated Types

type FromApplyAcc (Acc a -> t)

Methods

applyAcc :: FromApplyAcc (Acc a -> t) -> Acc a -> t

(Arrays a, Afunction r) => Afunction (Acc a -> r) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

Associated Types

type AfunctionR (Acc a -> r) #

type ArraysFunctionR (Acc a -> r)

Methods

afunctionRepr :: AfunctionRepr (Acc a -> r) (AfunctionR (Acc a -> r)) (ArraysFunctionR (Acc a -> r))

convertOpenAfun :: HasCallStack => Config -> ArrayLayout aenv aenv -> (Acc a -> r) -> OpenAfun aenv (ArraysFunctionR (Acc a -> r))

type ArraysFunctionR (Acc b) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

type ArraysFunctionR (Acc b) = ArraysR b
type AfunctionR (Acc b) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

type AfunctionR (Acc b) = b
type ArraysFunctionR (Acc a -> r) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

type ArraysFunctionR (Acc a -> r) = ArraysR a -> ArraysFunctionR r
type FromApplyAcc ((Acc a -> Acc b) -> t) 
Instance details

Defined in Data.Array.Accelerate.Smart

type FromApplyAcc ((Acc a -> Acc b) -> t) = (SmartAcc (ArraysR a) -> SmartAcc (ArraysR b)) -> FromApplyAcc t
type FromApplyAcc (Acc a -> t) 
Instance details

Defined in Data.Array.Accelerate.Smart

type FromApplyAcc (Acc a -> t) = SmartAcc (ArraysR a) -> FromApplyAcc t
type AfunctionR (Acc a -> r) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

type AfunctionR (Acc a -> r) = a -> AfunctionR r

class Arrays a #

The Arrays class characterises the types which can appear in collective Accelerate computations of type Acc.

Arrays consists of nested tuples of individual Arrays, currently up to 16 elements wide. Accelerate computations can thereby return multiple results.
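For example, a single computation can return a pair of arrays. A minimal sketch (the function name is hypothetical, Data.Array.Accelerate imported as A):

sumAndSquares :: Acc (Vector Float) -> Acc (Scalar Float, Vector Float)
sumAndSquares xs = A.lift (A.fold (+) 0 xs, A.map (^2) xs)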

Instances

Instances details
Arrays () 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR ()

Methods

arraysR :: ArraysR (ArraysR ())

toArr :: ArraysR () -> ()

fromArr :: () -> ArraysR ()

(Arrays x0, Arrays x1) => Arrays (x0, x1) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1)

Methods

arraysR :: ArraysR (ArraysR (x0, x1))

toArr :: ArraysR (x0, x1) -> (x0, x1)

fromArr :: (x0, x1) -> ArraysR (x0, x1)

(Shape sh, Elt e) => Arrays (Array sh e) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (Array sh e)

Methods

arraysR :: ArraysR (ArraysR (Array sh e))

toArr :: ArraysR (Array sh e) -> Array sh e

fromArr :: Array sh e -> ArraysR (Array sh e)

(Arrays x0, Arrays x1, Arrays x2) => Arrays (x0, x1, x2) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2))

toArr :: ArraysR (x0, x1, x2) -> (x0, x1, x2)

fromArr :: (x0, x1, x2) -> ArraysR (x0, x1, x2)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3) => Arrays (x0, x1, x2, x3) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3))

toArr :: ArraysR (x0, x1, x2, x3) -> (x0, x1, x2, x3)

fromArr :: (x0, x1, x2, x3) -> ArraysR (x0, x1, x2, x3)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4) => Arrays (x0, x1, x2, x3, x4) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4))

toArr :: ArraysR (x0, x1, x2, x3, x4) -> (x0, x1, x2, x3, x4)

fromArr :: (x0, x1, x2, x3, x4) -> ArraysR (x0, x1, x2, x3, x4)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5) => Arrays (x0, x1, x2, x3, x4, x5) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5) -> (x0, x1, x2, x3, x4, x5)

fromArr :: (x0, x1, x2, x3, x4, x5) -> ArraysR (x0, x1, x2, x3, x4, x5)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6) => Arrays (x0, x1, x2, x3, x4, x5, x6) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6) -> (x0, x1, x2, x3, x4, x5, x6)

fromArr :: (x0, x1, x2, x3, x4, x5, x6) -> ArraysR (x0, x1, x2, x3, x4, x5, x6)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7) -> (x0, x1, x2, x3, x4, x5, x6, x7)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8, Arrays x9) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8, Arrays x9, Arrays x10) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8, Arrays x9, Arrays x10, Arrays x11) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8, Arrays x9, Arrays x10, Arrays x11, Arrays x12) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8, Arrays x9, Arrays x10, Arrays x11, Arrays x12, Arrays x13) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8, Arrays x9, Arrays x10, Arrays x11, Arrays x12, Arrays x13, Arrays x14) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14)

(Arrays x0, Arrays x1, Arrays x2, Arrays x3, Arrays x4, Arrays x5, Arrays x6, Arrays x7, Arrays x8, Arrays x9, Arrays x10, Arrays x11, Arrays x12, Arrays x13, Arrays x14, Arrays x15) => Arrays (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15) 
Instance details

Defined in Data.Array.Accelerate.Sugar.Array

Associated Types

type ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15)

Methods

arraysR :: ArraysR (ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15))

toArr :: ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15) -> (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15)

fromArr :: (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15) -> ArraysR (x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15)

class Afunction f #

Minimal complete definition

afunctionRepr, convertOpenAfun

Instances

Instances details
Arrays b => Afunction (Acc b) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

Associated Types

type AfunctionR (Acc b) #

type ArraysFunctionR (Acc b)

Methods

afunctionRepr :: AfunctionRepr (Acc b) (AfunctionR (Acc b)) (ArraysFunctionR (Acc b))

convertOpenAfun :: HasCallStack => Config -> ArrayLayout aenv aenv -> Acc b -> OpenAfun aenv (ArraysFunctionR (Acc b))

(Arrays a, Afunction r) => Afunction (Acc a -> r) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

Associated Types

type AfunctionR (Acc a -> r) #

type ArraysFunctionR (Acc a -> r)

Methods

afunctionRepr :: AfunctionRepr (Acc a -> r) (AfunctionR (Acc a -> r)) (ArraysFunctionR (Acc a -> r))

convertOpenAfun :: HasCallStack => Config -> ArrayLayout aenv aenv -> (Acc a -> r) -> OpenAfun aenv (ArraysFunctionR (Acc a -> r))

type family AfunctionR f #

Instances

Instances details
type AfunctionR (Acc b) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

type AfunctionR (Acc b) = b
type AfunctionR (Acc a -> r) 
Instance details

Defined in Data.Array.Accelerate.Trafo.Sharing

type AfunctionR (Acc a -> r) = a -> AfunctionR r

Synchronous execution

run :: Arrays a => Acc a -> a Source #

Compile and run a complete embedded array program.

This will execute using the first available CUDA device. If you wish to run on a specific device, use runWith.

The result is copied back to the host only once the arrays are demanded (or the result is forced to normal form). For results consisting of multiple components (a tuple of arrays or array of tuples) this applies per primitive array. Evaluating the result of run to WHNF will initiate the computation, but does not copy the results back from the device.

NOTE: it is recommended to use runN or runQ whenever possible.

runWith :: Arrays a => PTX -> Acc a -> a Source #

As run, but execute using the specified target rather than using the default, automatically selected device.

run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b Source #

This is runN, specialised to an array program of one argument.

run1With :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> b Source #

As run1, but execute using the specified target rather than using the default, automatically selected device.

runN :: forall f. Afunction f => f -> AfunctionR f Source #

Prepare and execute an embedded array program.

This function can be used to improve performance in cases where the array program is constant between invocations, because it enables us to bypass front-end conversion stages and move directly to the execution phase. If you have a computation applied repeatedly to different input data, use this, specifying any changing aspects of the computation via the input parameters. If the function is only evaluated once, this is equivalent to run.

In order to use runN you must express your Accelerate program as a function of array terms:

f :: (Arrays a, Arrays b, ... Arrays c) => Acc a -> Acc b -> ... -> Acc c

This function then returns the compiled version of f:

runN f :: (Arrays a, Arrays b, ... Arrays c) => a -> b -> ... -> c

As an example, rather than:

step :: Acc (Vector a) -> Acc (Vector b)
step = ...

simulate :: Vector a -> Vector b
simulate xs = run $ step (use xs)

Instead write:

simulate = runN step

You can use the debugging options to check whether this is working successfully. For example, running with the -ddump-phases flag should show that the compilation steps only happen once, not on the second and subsequent invocations of simulate. Note that this typically relies on GHC knowing that it can lift out the function returned by runN and reuse it.

As with run, the resulting array(s) are only copied back to the host once they are actually demanded (forced to normal form). Thus, splitting a program into multiple runN steps does not imply transferring intermediate computations back and forth between host and device. However note that Accelerate is not able to optimise (fuse) across separate runN invocations.
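As a concrete sketch of this pattern (the program text is illustrative, reusing the Acc-level dotp from earlier):

import Data.Array.Accelerate              as A
import Data.Array.Accelerate.LLVM.PTX     (runN)

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

dotpc :: Vector Float -> Vector Float -> Scalar Float
dotpc = runN dotp        -- converted and compiled once, when dotpc is first used

main :: IO ()
main = do
  let xs = A.fromList (Z :. 5) [1..]   :: Vector Float
      ys = A.fromList (Z :. 5) [2,4..] :: Vector Float
  print (dotpc xs ys)    -- first application compiles and executes the program
  print (dotpc ys xs)    -- later applications reuse the compiled program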

See the programs in the 'accelerate-examples' package for examples.

See also runQ, which compiles the Accelerate program at Haskell compile time, thus eliminating the runtime overhead altogether.

runNWith :: forall f. Afunction f => PTX -> f -> AfunctionR f Source #

As runN, but execute using the specified target device.

stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b] Source #

Stream a lazily read list of input arrays through the given program, collecting results as we go.
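For example, a minimal sketch (the function name is hypothetical) that sums each array of a lazily produced list using a single compiled program:

sumEach :: [Vector Float] -> [Scalar Float]
sumEach = stream (A.fold (+) 0)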

streamWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> [a] -> [b] Source #

As stream, but execute using the specified target.

Asynchronous execution

data Async a #

wait :: Async a -> IO a #

Block the calling thread until the computation completes, then return the result.

poll :: Async a -> IO (Maybe a) #

Test whether the asynchronous computation has already completed. If so, return the result, else Nothing.

cancel :: Async a -> IO () #

Cancel a running asynchronous computation.

runAsync :: Arrays a => Acc a -> IO (Async a) Source #

As run, but run the computation asynchronously and return immediately without waiting for the result. The status of the computation can be queried using wait, poll, and cancel.

This will run on the first available CUDA device. If you wish to run on a specific device, use runAsyncWith.
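A minimal sketch (the surrounding program is hypothetical): launch the computation, overlap some host-side work, then block for the result:

import Data.Array.Accelerate              as A
import Data.Array.Accelerate.LLVM.PTX

main :: IO ()
main = do
  let xs = A.fromList (Z :. 1000000) [0..] :: Vector Float
  a <- runAsync (A.fold (+) 0 (A.use xs))   -- returns immediately
  putStrLn "kernel launched"                -- do other host-side work here
  r <- wait a                               -- block until the GPU result is available
  print r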

runAsyncWith :: Arrays a => PTX -> Acc a -> IO (Async a) Source #

As runWith, but execute asynchronously. Be sure not to destroy the context, or attempt to attach it to a different host thread, before all outstanding operations have completed.

run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> IO (Async b) Source #

As run1, but the computation is executed asynchronously.

run1AsyncWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> IO (Async b) Source #

As run1With, but execute asynchronously.

runNAsync :: (Afunction f, RunAsync r, ArraysFunctionR f ~ RunAsyncR r) => f -> r Source #

As runN, but execute asynchronously.

runNAsyncWith :: (Afunction f, RunAsync r, ArraysFunctionR f ~ RunAsyncR r) => PTX -> f -> r Source #

As runNWith, but execute asynchronously.

Ahead-of-time compilation

runQ :: Afunction f => f -> ExpQ Source #

Ahead-of-time compilation for an embedded array program.

This function will generate, compile, and link into the final executable, code to execute the given Accelerate computation at Haskell compile time. This eliminates any runtime overhead associated with the other run* operations. The generated code will be compiled for the current (default) GPU architecture.

Since the Accelerate program will be generated at Haskell compile time, construction of the Accelerate program, in particular via meta-programming, will be limited to operations available to that phase. Also note that any arrays which are embedded into the program via use will be stored as part of the final executable.

Usage of this function in your program is similar to that of runN. First, express your Accelerate program as a function of array terms:

f :: (Arrays a, Arrays b, ... Arrays c) => Acc a -> Acc b -> ... -> Acc c

This function then returns a compiled version of f as a Template Haskell splice, to be added into your program at Haskell compile time:

{-# LANGUAGE TemplateHaskell #-}

f' :: a -> b -> ... -> c
f' = $( runQ f )

Note that at the splice point the usage of f must be monomorphic; i.e. the types a, b, and c must be instantiated at known concrete types.

See the lulesh-accelerate project for an example.

Note:

Due to GHC#13587, this must currently be used as an untyped splice.

The correct type of this function is similar to that of runN:

runQ :: Afunction f => f -> Q (TExp (AfunctionR f))

Since: 1.1.0.0

runQWith :: Afunction f => f -> ExpQ Source #

Ahead-of-time analogue of runNWith. See runQ for more information.

NOTE: The supplied (at runtime) target must be compatible with the architecture that this function was compiled for (the defaultTarget of the compiling machine). Running on a device with the same compute capability is best, but this should also be forward compatible to newer architectures.

The correct type of this function is:

runQWith :: Afunction f => f -> Q (TExp (PTX -> AfunctionR f))

Since: 1.1.0.0

runQAsync :: Afunction f => f -> ExpQ Source #

Ahead-of-time analogue of runNAsync. See runQ for more information.

The correct type of this function is:

runQAsync :: (Afunction f, RunAsync r, AfunctionR f ~ RunAsyncR r) => f -> Q (TExp r)

Since: 1.1.0.0

runQAsyncWith :: Afunction f => f -> ExpQ Source #

Ahead-of-time analogue of runNAsyncWith. See runQWith for more information.

The correct type of this function is:

runQAsyncWith :: (Afunction f, RunAsync r, AfunctionR f ~ RunAsyncR r) => f -> Q (TExp (PTX -> r))

Since: 1.1.0.0

Execution targets

data PTX Source #

The PTX execution target for NVIDIA GPUs.

The execution target carries state specific to the current execution context. The data here (device memory and execution streams) are implicitly tied to this CUDA execution context.

Don't store anything here that is independent of the context; for example, state related to (persistent) kernel caching should not go here.

Instances

Instances details
Skeleton PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.CodeGen

Methods

generate :: UID -> Gamma aenv -> ArrayR (Array sh e) -> IRFun1 PTX aenv (sh -> e) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh e))

transform :: UID -> Gamma aenv -> ArrayR (Array sh a) -> ArrayR (Array sh' b) -> IRFun1 PTX aenv (sh' -> sh) -> IRFun1 PTX aenv (a -> b) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh' b))

map :: UID -> Gamma aenv -> ArrayR (Array sh a) -> TypeR b -> IRFun1 PTX aenv (a -> b) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh b))

fold :: UID -> Gamma aenv -> ArrayR (Array sh e) -> IRFun2 PTX aenv (e -> e -> e) -> Maybe (IRExp PTX aenv e) -> MIRDelayed PTX aenv (Array (sh, Int) e) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh e))

foldSeg :: UID -> Gamma aenv -> ArrayR (Array (sh, Int) e) -> IntegralType i -> IRFun2 PTX aenv (e -> e -> e) -> Maybe (IRExp PTX aenv e) -> MIRDelayed PTX aenv (Array (sh, Int) e) -> MIRDelayed PTX aenv (Segments i) -> CodeGen PTX (IROpenAcc PTX aenv (Array (sh, Int) e))

scan :: UID -> Gamma aenv -> ArrayR (Array (sh, Int) e) -> Direction -> IRFun2 PTX aenv (e -> e -> e) -> Maybe (IRExp PTX aenv e) -> MIRDelayed PTX aenv (Array (sh, Int) e) -> CodeGen PTX (IROpenAcc PTX aenv (Array (sh, Int) e))

scan' :: UID -> Gamma aenv -> ArrayR (Array (sh, Int) e) -> Direction -> IRFun2 PTX aenv (e -> e -> e) -> IRExp PTX aenv e -> MIRDelayed PTX aenv (Array (sh, Int) e) -> CodeGen PTX (IROpenAcc PTX aenv (Array (sh, Int) e, Array sh e))

permute :: UID -> Gamma aenv -> ArrayR (Array sh e) -> ShapeR sh' -> IRPermuteFun PTX aenv (e -> e -> e) -> IRFun1 PTX aenv (sh -> PrimMaybe sh') -> MIRDelayed PTX aenv (Array sh e) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh' e))

backpermute :: UID -> Gamma aenv -> ArrayR (Array sh e) -> ShapeR sh' -> IRFun1 PTX aenv (sh' -> sh) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh' e))

stencil1 :: UID -> Gamma aenv -> StencilR sh a stencil -> TypeR b -> IRFun1 PTX aenv (stencil -> b) -> IRBoundary PTX aenv (Array sh a) -> MIRDelayed PTX aenv (Array sh a) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh b))

stencil2 :: UID -> Gamma aenv -> StencilR sh a stencil1 -> StencilR sh b stencil2 -> TypeR c -> IRFun2 PTX aenv (stencil1 -> stencil2 -> c) -> IRBoundary PTX aenv (Array sh a) -> MIRDelayed PTX aenv (Array sh a) -> IRBoundary PTX aenv (Array sh b) -> MIRDelayed PTX aenv (Array sh b) -> CodeGen PTX (IROpenAcc PTX aenv (Array sh c))

Persistent PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Compile.Cache

Embed PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Embed

Methods

embedForTarget :: PTX -> ObjectR PTX -> Q (TExp (ExecutableR PTX))

Execute PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute

Methods

map :: Maybe (a :~: b) -> ArrayR (Array sh a) -> TypeR b -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Array sh a -> Par PTX (FutureR PTX (Array sh b))

generate :: ArrayR (Array sh e) -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> sh -> Par PTX (FutureR PTX (Array sh e))

transform :: ArrayR (Array sh a) -> ArrayR (Array sh' b) -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> sh' -> Array sh a -> Par PTX (FutureR PTX (Array sh' b))

backpermute :: ArrayR (Array sh e) -> ShapeR sh' -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> sh' -> Array sh e -> Par PTX (FutureR PTX (Array sh' e))

fold :: HasInitialValue -> ArrayR (Array sh e) -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Delayed (Array (sh, Int) e) -> Par PTX (FutureR PTX (Array sh e))

foldSeg :: IntegralType i -> HasInitialValue -> ArrayR (Array (sh, Int) e) -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Delayed (Array (sh, Int) e) -> Delayed (Segments i) -> Par PTX (FutureR PTX (Array (sh, Int) e))

scan :: Direction -> HasInitialValue -> ArrayR (Array (sh, Int) e) -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Delayed (Array (sh, Int) e) -> Par PTX (FutureR PTX (Array (sh, Int) e))

scan' :: Direction -> ArrayR (Array (sh, Int) e) -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Delayed (Array (sh, Int) e) -> Par PTX (FutureR PTX (Array (sh, Int) e, Array sh e))

permute :: Bool -> ArrayR (Array sh e) -> ShapeR sh' -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Array sh' e -> Delayed (Array sh e) -> Par PTX (FutureR PTX (Array sh' e))

stencil1 :: TypeR a -> ArrayR (Array sh b) -> sh -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Delayed (Array sh a) -> Par PTX (FutureR PTX (Array sh b))

stencil2 :: TypeR a -> TypeR b -> ArrayR (Array sh c) -> sh -> ExecutableR PTX -> Gamma aenv -> ValR PTX aenv -> Delayed (Array sh a) -> Delayed (Array sh b) -> Par PTX (FutureR PTX (Array sh c))

aforeign :: String -> ArraysR as -> ArraysR bs -> (as -> Par PTX (FutureR PTX bs)) -> as -> Par PTX (FutureR PTX bs)

Link PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Link

Associated Types

data ExecutableR PTX

Methods

linkForTarget :: ObjectR PTX -> LLVM PTX (ExecutableR PTX)

Compile PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Compile

Associated Types

data ObjectR PTX

Methods

compileForTarget :: PreOpenAcc DelayedOpenAcc aenv a -> Gamma aenv -> LLVM PTX (ObjectR PTX)

Marshal PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Marshal

Associated Types

type ArgR PTX

Methods

marshalInt :: Int -> ArgR PTX

marshalScalarData' :: SingleType e -> ScalarArrayData e -> Par PTX (DList (ArgR PTX))

Foreign PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Foreign

Methods

foreignAcc :: Foreign asm => asm (a -> b) -> Maybe (a -> Par PTX (FutureR PTX b))

foreignExp :: Foreign asm => asm (x -> y) -> Maybe (IRFun1 PTX () (x -> y))

Intrinsic PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.CodeGen.Intrinsic

Target PTX Source # 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Target

Remote PTX

Remote memory management for the PTX target. Data can be copied asynchronously using multiple execution engines whenever possible.

Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Array.Data

Methods

allocateRemote :: ArrayR (Array sh e) -> sh -> Par PTX (Array sh e)

useRemoteR :: SingleType e -> Int -> ArrayData e -> Par PTX (FutureR PTX (ArrayData e))

copyToRemoteR :: SingleType e -> Int -> ArrayData e -> Par PTX (FutureR PTX (ArrayData e))

copyToHostR :: SingleType e -> Int -> ArrayData e -> Par PTX (FutureR PTX (ArrayData e))

copyToPeerR :: PTX -> SingleType e -> Int -> ArrayData e -> Par PTX (FutureR PTX (ArrayData e))

useRemoteAsync :: ArraysR arrs -> arrs -> Par PTX (FutureArraysR PTX arrs)

copyToRemoteAsync :: ArraysR arrs -> arrs -> Par PTX (FutureArraysR PTX arrs)

copyToHostAsync :: ArraysR arrs -> arrs -> Par PTX (FutureArraysR PTX arrs)

copyToPeerAsync :: PTX -> ArraysR arrs -> arrs -> Par PTX (FutureArraysR PTX arrs)

indexRemoteAsync :: TypeR e -> Array sh e -> Int -> Par PTX (FutureR PTX e)

Async PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

Associated Types

data Par PTX :: Type -> Type

type FutureR PTX :: Type -> Type

Methods

new :: HasCallStack => Par PTX (FutureR PTX a)

put :: HasCallStack => FutureR PTX a -> a -> Par PTX ()

get :: HasCallStack => FutureR PTX a -> Par PTX a

fork :: Par PTX () -> Par PTX ()

liftPar :: HasCallStack => LLVM PTX a -> Par PTX a

block :: HasCallStack => FutureR PTX a -> Par PTX a

spawn :: HasCallStack => Par PTX a -> Par PTX a

newFull :: HasCallStack => a -> Par PTX (FutureR PTX a)

MonadReader ParState (Par PTX) Source # 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

Methods

ask :: Par PTX ParState #

local :: (ParState -> ParState) -> Par PTX a -> Par PTX a #

reader :: (ParState -> a) -> Par PTX a #

MonadState PTX (Par PTX) Source # 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

Methods

get :: Par PTX PTX #

put :: PTX -> Par PTX () #

state :: (PTX -> (a, PTX)) -> Par PTX a #

Monad (Par PTX) Source # 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

Methods

(>>=) :: Par PTX a -> (a -> Par PTX b) -> Par PTX b #

(>>) :: Par PTX a -> Par PTX b -> Par PTX b #

return :: a -> Par PTX a #

Functor (Par PTX) Source # 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

Methods

fmap :: (a -> b) -> Par PTX a -> Par PTX b #

(<$) :: a -> Par PTX b -> Par PTX a #

Applicative (Par PTX) Source # 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

Methods

pure :: a -> Par PTX a #

(<*>) :: Par PTX (a -> b) -> Par PTX a -> Par PTX b #

liftA2 :: (a -> b -> c) -> Par PTX a -> Par PTX b -> Par PTX c #

(*>) :: Par PTX a -> Par PTX b -> Par PTX b #

(<*) :: Par PTX a -> Par PTX b -> Par PTX a #

RemoteMemory (LLVM PTX) 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Array.Remote

Associated Types

type RemotePtr (LLVM PTX) :: Type -> Type

Methods

mallocRemote :: Int -> LLVM PTX (Maybe (RemotePtr (LLVM PTX) Word8))

pokeRemote :: SingleType e -> Int -> RemotePtr (LLVM PTX) (ScalarArrayDataR e) -> ArrayData e -> LLVM PTX ()

peekRemote :: SingleType e -> Int -> RemotePtr (LLVM PTX) (ScalarArrayDataR e) -> MutableArrayData e -> LLVM PTX ()

castRemotePtr :: RemotePtr (LLVM PTX) a -> RemotePtr (LLVM PTX) b

totalRemoteMem :: LLVM PTX Int64

availableRemoteMem :: LLVM PTX Int64

remoteAllocationSize :: LLVM PTX Int

MonadIO (Par PTX) Source # 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

Methods

liftIO :: IO a -> Par PTX a #

data ExecutableR PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Link

data ExecutableR PTX = PTXR {}
data ObjectR PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Compile

data ObjectR PTX = ObjectR {}
type ArgR PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Marshal

type ArgR PTX = FunParam
type FutureR PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

type FutureR PTX = Future
newtype Par PTX a 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async

newtype Par PTX a = Par {}
data KernelMetadata PTX 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.CodeGen.Base

data KernelMetadata PTX = KM_PTX LaunchConfig
type RemotePtr (LLVM PTX) 
Instance details

Defined in Data.Array.Accelerate.LLVM.PTX.Array.Remote

type RemotePtr (LLVM PTX) = DevicePtr

createTargetForDevice :: Device -> DeviceProperties -> [ContextFlag] -> IO PTX Source #

Create a new PTX execution target for the given device.

createTargetFromContext :: Context -> IO PTX Source #

Create a new PTX execution target for the given device context.
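A sketch of selecting a specific device explicitly and running on it; this assumes the Foreign.CUDA.Driver API from the cuda package (initialise, device, props), which accelerate-llvm-ptx builds upon:

import Data.Array.Accelerate              as A
import Data.Array.Accelerate.LLVM.PTX
import qualified Foreign.CUDA.Driver      as CUDA

main :: IO ()
main = do
  CUDA.initialise []
  dev <- CUDA.device 1                        -- pick the second CUDA device
  prp <- CUDA.props dev
  ptx <- createTargetForDevice dev prp []     -- default context flags
  let xs = A.fromList (Z :. 100) [0..] :: Vector Float
  print (runWith ptx (A.fold (+) 0 (A.use xs)))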

Controlling host-side allocation

registerPinnedAllocator :: IO () Source #

Configure the default execution target to allocate all future host-side arrays using (CUDA) pinned memory. Any newly allocated arrays will be page-locked and directly accessible from the device, enabling high-speed (asynchronous) DMA.

Note that since the amount of available pageable memory will be reduced, overall system performance can suffer.

registerPinnedAllocator = registerPinnedAllocatorWith defaultTarget

registerPinnedAllocatorWith :: HasCallStack => PTX -> IO () Source #

All future array allocations will use pinned memory associated with the given execution context. These arrays will be directly accessible from the device, enabling high-speed asynchronous DMA.

Note that since the amount of available pageable memory will be reduced, overall system performance can suffer.
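A minimal usage sketch (program structure hypothetical): call it once near program start, before allocating the host-side arrays that will be transferred to the device:

import Data.Array.Accelerate              as A
import Data.Array.Accelerate.LLVM.PTX

main :: IO ()
main = do
  registerPinnedAllocator                                   -- subsequent host-side arrays are page-locked
  let xs = A.fromList (Z :. 100) [0..] :: Vector Float      -- allocated using pinned memory
  print (run (A.fold (+) 0 (A.use xs)))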