Copyright | [2014..2017] Trevor L. McDonell [2014..2014] Vinod Grover (NVIDIA Corporation) |
---|---|
License | BSD3 |
Maintainer | Trevor L. McDonell <tmcdonell@cse.unsw.edu.au> |
Stability | experimental |
Portability | non-portable (GHC extensions) |
Safe Haskell | None |
Language | Haskell2010 |
This module implements a backend for the Accelerate language targeting NVPTX for execution on NVIDIA GPUs. Expressions are translated online into LLVM code, which is then compiled just in time and executed in parallel on the GPU.
Synopsis
- data Acc a
- class (Typeable a, Typeable (ArrRepr a)) => Arrays a
- class Afunction f
- type family AfunctionR f :: Type
- run :: Arrays a => Acc a -> a
- runWith :: Arrays a => PTX -> Acc a -> a
- run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b
- run1With :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> b
- runN :: Afunction f => f -> AfunctionR f
- runNWith :: Afunction f => PTX -> f -> AfunctionR f
- stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b]
- streamWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> [a] -> [b]
- data Async a
- wait :: Async a -> IO a
- poll :: Async a -> IO (Maybe a)
- cancel :: Async a -> IO ()
- runAsync :: Arrays a => Acc a -> IO (Async a)
- runAsyncWith :: Arrays a => PTX -> Acc a -> IO (Async a)
- run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> IO (Async b)
- run1AsyncWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> IO (Async b)
- runNAsync :: (Afunction f, RunAsync r, AfunctionR f ~ RunAsyncR r) => f -> r
- runNAsyncWith :: (Afunction f, RunAsync r, AfunctionR f ~ RunAsyncR r) => PTX -> f -> r
- runQ :: Afunction f => f -> ExpQ
- runQWith :: Afunction f => f -> ExpQ
- runQAsync :: Afunction f => f -> ExpQ
- runQAsyncWith :: Afunction f => f -> ExpQ
- data PTX
- createTargetForDevice :: Device -> DeviceProperties -> [ContextFlag] -> IO PTX
- createTargetFromContext :: Context -> IO PTX
- registerPinnedAllocatorWith :: PTX -> IO ()
Documentation
Accelerate is an embedded language that distinguishes between vanilla arrays (e.g. in Haskell memory on the CPU) and embedded arrays (e.g. in device memory on a GPU), as well as the computations on both of these. Since Accelerate is an embedded language, programs written in Accelerate are not compiled by the Haskell compiler (GHC). Rather, each Accelerate backend is a runtime compiler which generates and executes parallel SIMD code of the target language at application runtime.
The type constructor Acc represents embedded collective array operations. A term of type Acc a is an Accelerate program which, once executed, will produce a value of type a (an Array or a tuple of Arrays). Collective operations of type Acc a comprise many scalar expressions, wrapped in the type constructor Exp, which will be executed in parallel. Although collective operations comprise many scalar operations executed in parallel, scalar operations cannot initiate new collective operations: this stratification between scalar operations in Exp and array operations in Acc helps statically exclude nested data parallelism, which is difficult to execute efficiently on constrained hardware such as GPUs.
- A simple example
As a simple example, to compute a vector dot product we can write:
dotp :: Num a => Vector a -> Vector a -> Acc (Scalar a)
dotp xs ys =
  let
      xs' = use xs
      ys' = use ys
  in
  fold (+) 0 ( zipWith (*) xs' ys' )
The function dotp consumes two one-dimensional arrays (Vectors) of values, and produces a single (Scalar) result as output. As the return type is wrapped in the type Acc, we see that it is an embedded Accelerate computation: it will be evaluated in the object language of dynamically generated parallel code, rather than the meta language of vanilla Haskell.
As the arguments to dotp are plain Haskell arrays, to make these available to Accelerate computations they must be embedded with the use function.
An Accelerate backend is used to evaluate the embedded computation and return the result back to vanilla Haskell. Calling the run function of a backend will generate code for the target architecture, compile, and execute it. For example, the following backends are available:
- accelerate-llvm-native: for execution on multicore CPUs
- accelerate-llvm-ptx: for execution on NVIDIA CUDA-capable GPUs
See also Exp, which encapsulates embedded scalar computations.
- Avoiding nested parallelism
As mentioned above, embedded scalar computations of type Exp cannot initiate further collective operations.
Suppose we wanted to extend our above dotp function to matrix-vector multiplication. First, let's rewrite our dotp function to take Acc arrays as input (which is typically what we want):
dotp :: Num a => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )
We might then be inclined to lift our dot-product program to the following (incorrect) matrix-vector product, by applying dotp to each row of the input matrix:
mvm_ndp :: Num a => Acc (Matrix a) -> Acc (Vector a) -> Acc (Vector a)
mvm_ndp mat vec =
  let Z :. rows :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
  in  generate (index1 rows)
               (\row -> the $ dotp vec (slice mat (lift (row :. All))))
Here, we use generate to create a one-dimensional vector by applying at each index a function to slice out the corresponding row of the matrix to pass to the dotp function.
However, since both generate and slice are data-parallel operations, and moreover slice depends on the argument row given to it by the generate function, this definition requires nested data-parallelism, and is thus not permitted. The clue that this definition is invalid is that in order to create a program which will be accepted by the type checker, we must use the function the to retrieve the result of the dotp operation, effectively concealing that dotp is a collective array computation in order to match the type expected by generate, which is that of scalar expressions. Additionally, since we have fooled the type checker, this problem will only be discovered at program runtime.
In order to avoid this problem, we can make use of the fact that operations in Accelerate are rank polymorphic. The fold operation reduces along the innermost dimension of an array of arbitrary rank, reducing the rank (dimensionality) of the array by one. Thus, we can replicate the input vector to as many rows as there are in the input matrix, and perform the dot-product of the vector with every row simultaneously:
mvm :: A.Num a => Acc (Matrix a) -> Acc (Vector a) -> Acc (Vector a)
mvm mat vec =
  let Z :. rows :. cols = unlift (shape mat) :: Z :. Exp Int :. Exp Int
      vec'              = A.replicate (lift (Z :. rows :. All)) vec
  in
  A.fold (+) 0 ( A.zipWith (*) mat vec' )
Note that the intermediate, replicated array vec' is never actually created in memory; it will be fused directly into the operation which consumes it. We discuss fusion next.
- Fusion
Array computations of type Acc will be subject to array fusion; Accelerate will combine individual Acc computations into a single computation, which reduces the number of traversals over the input data and thus improves performance. As such, it is often useful to have some intuition about when fusion should occur.
The main idea is to first partition array operations into two categories:
- Element-wise operations, such as map, generate, and backpermute. Each element of these operations can be computed independently of all others.
- Collective operations, such as fold, scanl, and stencil. To compute each output element of these operations requires reading multiple elements from the input array(s).
Element-wise operations fuse together whenever the consumer operation uses a single element of the input array. Element-wise operations can both fuse their inputs into themselves, as well as be fused into later operations. Both these examples should fuse into a single loop:
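A minimal sketch of two programs with this shape, assuming single-precision vectors and the hypothetical names mapMap and zipMaps (neither appears in the library). In both, every consumer reads exactly one element from each producer, so each program would be expected to fuse into a single traversal:

```haskell
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Vector)

-- Hypothetical example: a map consuming another map. The outer map
-- reads one element of the inner map per output element.
mapMap :: Acc (Vector Float) -> Acc (Vector Float)
mapMap xs = A.map (* 2) (A.map (+ 1) xs)

-- Hypothetical example: a zipWith consuming two mapped inputs. Each
-- output element reads one element from each producer.
zipMaps :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Vector Float)
zipMaps xs ys = A.zipWith (+) (A.map (* 2) xs) (A.map (+ 1) ys)
```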
If the consumer operation uses more than one element of the input array (typically, via generate indexing an array multiple times), then the input array will be completely evaluated first; no fusion occurs in this case, because fusing the first operation into the second implies duplicating work.
On the other hand, collective operations can fuse their input arrays into themselves, but on output always evaluate to an array; collective operations will not be fused into a later step. For example:
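A minimal sketch of a program with this structure, assuming a single-precision input vector and the hypothetical name twoLoops. It chains use, generate, and zipWith into a fold, and then applies a final map to the result of the fold:

```haskell
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Scalar, Vector)

-- Hypothetical example: element-wise producers feeding a collective
-- fold, followed by an element-wise map over the fold's result.
twoLoops :: Vector Float -> Acc (Scalar Float)
twoLoops xs =
  let xs'   = A.use xs
      -- An index vector 0, 1, 2, ... with the same shape as the input.
      ys    = A.generate (A.shape xs') (\ix -> A.fromIntegral (A.unindex1 ix))
      total = A.fold (+) 0 (A.zipWith (*) xs' ys)
  in  A.map (+ 1) total
```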
Here the element-wise sequence (use + generate + zipWith) will fuse into a single operation, which then fuses into the collective fold operation. At this point in the program the fold must now be evaluated. In the final step the map reads in the array produced by fold. As there is no fusion between the fold and map steps, this program consists of two "loops"; one for the use + generate + zipWith + fold step, and one for the final map step.
You can see how many operations will be executed in the fused program by Show-ing the Acc program, or by using the debugging option -ddump-dot to save the program as a graphviz DOT file.
As a special note, the operations unzip and reshape, when applied to a real array, are executed in constant time, so in this situation these operations will not be fused.
- Tips
- Since Acc represents embedded computations that will only be executed when evaluated by a backend, we can programmatically generate these computations using the meta language Haskell; for example, unrolling loops or embedding input values into the generated code.
- It is usually best to keep all intermediate computations in Acc, and only run the computation at the very end to produce the final result. This enables optimisations between intermediate results (e.g. array fusion) and, if the target architecture has a separate memory space, as is the case for GPUs, prevents excessive data transfers.
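A small sketch of the first tip, assuming the hypothetical helpers powN and cubeAll. The recursion over the list happens in ordinary Haskell while the embedded program is being built, so the generated scalar code contains an unrolled chain of multiplications rather than a loop:

```haskell
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Exp, Vector)

-- Hypothetical helper: the Int argument is a plain Haskell value, so the
-- product is expanded at program-generation time into n multiplications.
powN :: Int -> Exp Float -> Exp Float
powN n x = product (replicate n x)

-- Hypothetical example: the exponent 3 is embedded into the generated code.
cubeAll :: Acc (Vector Float) -> Acc (Vector Float)
cubeAll = A.map (powN 3)
```

Because cubeAll stays in Acc, it can still be composed (and fused) with other array operations before a backend runs the final program.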
Instances
Arrays b => Afunction (Acc b) | |
Defined in Data.Array.Accelerate.Trafo.Sharing type AfunctionR (Acc b) :: Type # aconvert :: Config -> Layout aenv aenv -> Acc b -> OpenAfun aenv (AfunctionR (Acc b)) | |
(Arrays a, Afunction r) => Afunction (Acc a -> r) | |
Defined in Data.Array.Accelerate.Trafo.Sharing type AfunctionR (Acc a -> r) :: Type # aconvert :: Config -> Layout aenv aenv -> (Acc a -> r) -> OpenAfun aenv (AfunctionR (Acc a -> r)) | |
type AfunctionR (Acc b) | |
Defined in Data.Array.Accelerate.Trafo.Sharing | |
type AfunctionR (Acc a -> r) | |
Defined in Data.Array.Accelerate.Trafo.Sharing |
class (Typeable a, Typeable (ArrRepr a)) => Arrays a #
Arrays consists of nested tuples of individual Arrays, currently up to 15-elements wide. Accelerate computations can thereby return multiple results.
arrays, flavour, toArr, fromArr
Instances
Arrays () | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Arrays a, Arrays b) => Arrays (a, b) | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Shape sh, Elt e) => Arrays (Array sh e) | |
(Arrays a, Arrays b, Arrays c) => Arrays (a, b, c) | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Arrays a, Arrays b, Arrays c, Arrays d) => Arrays (a, b, c, d) | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e) => Arrays (a, b, c, d, e) | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f) => Arrays (a, b, c, d, e, f) | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g) => Arrays (a, b, c, d, e, f, g) | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h) => Arrays (a, b, c, d, e, f, g, h) | |
Defined in Data.Array.Accelerate.Array.Sugar | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i) => Arrays (a, b, c, d, e, f, g, h, i) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i)) flavour :: (a, b, c, d, e, f, g, h, i) -> ArraysFlavour (a, b, c, d, e, f, g, h, i) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i) -> (a, b, c, d, e, f, g, h, i) fromArr :: (a, b, c, d, e, f, g, h, i) -> ArrRepr (a, b, c, d, e, f, g, h, i) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j) => Arrays (a, b, c, d, e, f, g, h, i, j) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i, j) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i, j)) flavour :: (a, b, c, d, e, f, g, h, i, j) -> ArraysFlavour (a, b, c, d, e, f, g, h, i, j) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i, j) -> (a, b, c, d, e, f, g, h, i, j) fromArr :: (a, b, c, d, e, f, g, h, i, j) -> ArrRepr (a, b, c, d, e, f, g, h, i, j) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k) => Arrays (a, b, c, d, e, f, g, h, i, j, k) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i, j, k) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i, j, k)) flavour :: (a, b, c, d, e, f, g, h, i, j, k) -> ArraysFlavour (a, b, c, d, e, f, g, h, i, j, k) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i, j, k) -> (a, b, c, d, e, f, g, h, i, j, k) fromArr :: (a, b, c, d, e, f, g, h, i, j, k) -> ArrRepr (a, b, c, d, e, f, g, h, i, j, k) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i, j, k, l) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l)) flavour :: (a, b, c, d, e, f, g, h, i, j, k, l) -> ArraysFlavour (a, b, c, d, e, f, g, h, i, j, k, l) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l) -> (a, b, c, d, e, f, g, h, i, j, k, l) fromArr :: (a, b, c, d, e, f, g, h, i, j, k, l) -> ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i, j, k, l, m) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m)) flavour :: (a, b, c, d, e, f, g, h, i, j, k, l, m) -> ArraysFlavour (a, b, c, d, e, f, g, h, i, j, k, l, m) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m) -> (a, b, c, d, e, f, g, h, i, j, k, l, m) fromArr :: (a, b, c, d, e, f, g, h, i, j, k, l, m) -> ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m, Arrays n) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m, n) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n)) flavour :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n) -> ArraysFlavour (a, b, c, d, e, f, g, h, i, j, k, l, m, n) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n) -> (a, b, c, d, e, f, g, h, i, j, k, l, m, n) fromArr :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n) -> ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m, Arrays n, Arrays o) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o)) flavour :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) -> ArraysFlavour (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) -> (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) fromArr :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) -> ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m, Arrays n, Arrays o, Arrays p) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) | |
Defined in Data.Array.Accelerate.Array.Sugar arrays :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) -> ArraysR (ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p)) flavour :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) -> ArraysFlavour (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) toArr :: ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) -> (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) fromArr :: (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) -> ArrRepr (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) |
aconvert
Instances
Arrays b => Afunction (Acc b) | |
Defined in Data.Array.Accelerate.Trafo.Sharing type AfunctionR (Acc b) :: Type # aconvert :: Config -> Layout aenv aenv -> Acc b -> OpenAfun aenv (AfunctionR (Acc b)) | |
(Arrays a, Afunction r) => Afunction (Acc a -> r) | |
Defined in Data.Array.Accelerate.Trafo.Sharing type AfunctionR (Acc a -> r) :: Type # aconvert :: Config -> Layout aenv aenv -> (Acc a -> r) -> OpenAfun aenv (AfunctionR (Acc a -> r)) |
type family AfunctionR f :: Type #
Instances
type AfunctionR (Acc b) | |
Defined in Data.Array.Accelerate.Trafo.Sharing | |
type AfunctionR (Acc a -> r) | |
Defined in Data.Array.Accelerate.Trafo.Sharing |
Synchronous execution
run :: Arrays a => Acc a -> a Source #
Compile and run a complete embedded array program.
This will execute using the first available CUDA device. If you wish to run on a specific device, use runWith.
The result is copied back to the host only once the arrays are demanded (or the result is forced to normal form). For results consisting of multiple components (a tuple of arrays or array of tuples) this applies per primitive array. Evaluating the result of run to WHNF will initiate the computation, but does not copy the results back from the device.
NOTE: it is recommended to use runN or runQ whenever possible.
runWith :: Arrays a => PTX -> Acc a -> a Source #
As run, but execute using the specified target rather than using the default, automatically selected device.
run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b Source #
This is runN, specialised to an array program of one argument.
run1With :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> b Source #
As run1, but execute using the specified target rather than using the default, automatically selected device.
runN :: Afunction f => f -> AfunctionR f Source #
Prepare and execute an embedded array program.
This function can be used to improve performance in cases where the array program is constant between invocations, because it enables us to bypass front-end conversion stages and move directly to the execution phase. If you have a computation applied repeatedly to different input data, use this, specifying any changing aspects of the computation via the input parameters. If the function is only evaluated once, this is equivalent to run.
In order to use runN you must express your Accelerate program as a function of array terms:
f :: (Arrays a, Arrays b, ... Arrays c) => Acc a -> Acc b -> ... -> Acc c
This function then returns the compiled version of f:
runN f :: (Arrays a, Arrays b, ... Arrays c) => a -> b -> ... -> c
As an example, rather than:
step :: Acc (Vector a) -> Acc (Vector b)
step = ...

simulate :: Vector a -> Vector b
simulate xs = run $ step (use xs)
Instead write:
simulate = runN step
You can use the debugging options to check whether this is working successfully. For example, running with the -ddump-phases flag should show that the compilation steps only happen once, not on the second and subsequent invocations of simulate. Note that this typically relies on GHC knowing that it can lift out the function returned by runN and reuse it.
As with run, the resulting array(s) are only copied back to the host once they are actually demanded (forced to normal form). Thus, splitting a program into multiple runN steps does not imply transferring intermediate computations back and forth between host and device. However, note that Accelerate is not able to optimise (fuse) across separate runN invocations.
See the programs in the 'accelerate-examples' package for examples.
See also runQ, which compiles the Accelerate program at Haskell compile time, thus eliminating the runtime overhead altogether.
runNWith :: Afunction f => PTX -> f -> AfunctionR f Source #
As runN, but execute using the specified target device.
stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b] Source #
Stream a lazily read list of input arrays through the given program, collecting results as we go.
streamWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> [a] -> [b] Source #
As stream, but execute using the specified target.
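A minimal sketch of streaming a few input arrays through one program, assuming this module is imported as Data.Array.Accelerate.LLVM.PTX and that a CUDA device is available. The input list and its contents are made up for illustration:

```haskell
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Vector, Z(..), (:.)(..))
import Data.Array.Accelerate.LLVM.PTX (stream)

-- Hypothetical input data: three small vectors.
inputs :: [Vector Float]
inputs = [ A.fromList (Z :. 4) [k, k + 1, k + 2, k + 3] | k <- [0, 10, 20] ]

-- The same array program is applied to every element of the list, and
-- the results are produced as a list in the same order.
main :: IO ()
main = mapM_ print (stream (A.map (* 2)) inputs)
```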
Asynchronous execution
wait :: Async a -> IO a #
Block the calling thread until the computation completes, then return the result.
poll :: Async a -> IO (Maybe a) #
Test whether the asynchronous computation has already completed. If so, return the result, else Nothing.
runAsync :: Arrays a => Acc a -> IO (Async a) Source #
As run, but run the computation asynchronously and return immediately without waiting for the result. The status of the computation can be queried using wait, poll, and cancel.
This will run on the first available CUDA device. If you wish to run on a specific device, use runAsyncWith.
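A minimal sketch of the asynchronous interface, assuming this module is imported as Data.Array.Accelerate.LLVM.PTX and that a CUDA device is available. The input vector is made up for illustration:

```haskell
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Vector, Z(..), (:.)(..))
import Data.Array.Accelerate.LLVM.PTX (runAsync, poll, wait)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 10) [1 .. 10] :: Vector Float
  -- Start the computation; this returns without waiting for the result.
  job  <- runAsync (A.map (* 2) (A.use xs))
  -- Check once whether the result is already available.
  done <- poll job
  case done of
    Just ys -> print ys
    -- Not finished yet: block until the computation completes.
    Nothing -> wait job >>= print
```

In a real program the host thread would typically do other useful work between starting the job and calling wait.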
runAsyncWith :: Arrays a => PTX -> Acc a -> IO (Async a) Source #
As runWith, but execute asynchronously. Be sure not to destroy the context, or attempt to attach it to a different host thread, before all outstanding operations have completed.
run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> IO (Async b) Source #
As run1, but the computation is executed asynchronously.
run1AsyncWith :: (Arrays a, Arrays b) => PTX -> (Acc a -> Acc b) -> a -> IO (Async b) Source #
As run1With, but execute asynchronously.
runNAsync :: (Afunction f, RunAsync r, AfunctionR f ~ RunAsyncR r) => f -> r Source #
As runN, but execute asynchronously.
runNAsyncWith :: (Afunction f, RunAsync r, AfunctionR f ~ RunAsyncR r) => PTX -> f -> r Source #
As runNWith, but execute asynchronously.
Ahead-of-time compilation
runQ :: Afunction f => f -> ExpQ Source #
Ahead-of-time compilation for an embedded array program.
At Haskell compile time, this function generates and compiles the code needed to execute the given Accelerate computation, and links it into the final executable. This eliminates any runtime overhead associated with the other run* operations. The generated code will be compiled for the current (default) GPU architecture.
Since the Accelerate program will be generated at Haskell compile time, construction of the Accelerate program, in particular via meta-programming, will be limited to operations available to that phase. Also note that any arrays which are embedded into the program via use will be stored as part of the final executable.
Usage of this function in your program is similar to that of runN. First, express your Accelerate program as a function of array terms:
f :: (Arrays a, Arrays b, ... Arrays c) => Acc a -> Acc b -> ... -> Acc c
This function then returns a compiled version of f as a Template Haskell splice, to be added into your program at Haskell compile time:
{-# LANGUAGE TemplateHaskell #-}

f' :: a -> b -> ... -> c
f' = $( runQ f )
Note that at the splice point the usage of f must be monomorphic; i.e. the types a, b and c must be at some known concrete type.
See the lulesh-accelerate project for an example.
- Note:
Due to GHC#13587, this must currently be used as an untyped splice.
The correct type of this function is similar to that of runN:
runQ :: Afunction f => f -> Q (TExp (AfunctionR f))
Since: 1.1.0.0
runQWith :: Afunction f => f -> ExpQ Source #
Ahead-of-time analogue of runNWith. See runQ for more information.
NOTE: The supplied (at runtime) target must be compatible with the architecture that this function was compiled for (the defaultTarget of the compiling machine). Running on a device with the same compute capability is best, but this should also be forward compatible to newer architectures.
The correct type of this function is:
runQWith :: Afunction f => f -> Q (TExp (PTX -> AfunctionR f))
Since: 1.1.0.0
runQAsyncWith :: Afunction f => f -> ExpQ Source #
Ahead-of-time analogue of runNAsyncWith. See runQWith for more information.
The correct type of this function is:
runQAsyncWith :: (Afunction f, RunAsync r, AfunctionR f ~ RunAsyncR r) => f -> Q (TExp (PTX -> r))
Since: 1.1.0.0
Execution targets
The PTX execution target for NVIDIA GPUs.
The execution target carries state specific to the current execution context. The data here (device memory and execution streams) are implicitly tied to this CUDA execution context.
Don't store anything here that is independent of the context; for example, state related to [persistent] kernel caching should not go here.
Instances
Skeleton PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.CodeGen generate :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun1 PTX aenv (sh -> e) -> CodeGen (IROpenAcc PTX aenv (Array sh e)) transform :: (Shape sh, Shape sh', Elt a, Elt b) => PTX -> UID -> Gamma aenv -> IRFun1 PTX aenv (sh' -> sh) -> IRFun1 PTX aenv (a -> b) -> IRDelayed PTX aenv (Array sh a) -> CodeGen (IROpenAcc PTX aenv (Array sh' b)) map :: (Shape sh, Elt a, Elt b) => PTX -> UID -> Gamma aenv -> IRFun1 PTX aenv (a -> b) -> IRDelayed PTX aenv (Array sh a) -> CodeGen (IROpenAcc PTX aenv (Array sh b)) fold :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRExp PTX aenv e -> IRDelayed PTX aenv (Array (sh :. Int) e) -> CodeGen (IROpenAcc PTX aenv (Array sh e)) fold1 :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRDelayed PTX aenv (Array (sh :. Int) e) -> CodeGen (IROpenAcc PTX aenv (Array sh e)) foldSeg :: (Shape sh, Elt e, Elt i, IsIntegral i) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRExp PTX aenv e -> IRDelayed PTX aenv (Array (sh :. Int) e) -> IRDelayed PTX aenv (Segments i) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e)) fold1Seg :: (Shape sh, Elt e, Elt i, IsIntegral i) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRDelayed PTX aenv (Array (sh :. Int) e) -> IRDelayed PTX aenv (Segments i) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e)) scanl :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRExp PTX aenv e -> IRDelayed PTX aenv (Array (sh :. Int) e) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e)) scanl' :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRExp PTX aenv e -> IRDelayed PTX aenv (Array (sh :. Int) e) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e, Array sh e)) scanl1 :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRDelayed PTX aenv (Array (sh :. 
Int) e) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e)) scanr :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRExp PTX aenv e -> IRDelayed PTX aenv (Array (sh :. Int) e) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e)) scanr' :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRExp PTX aenv e -> IRDelayed PTX aenv (Array (sh :. Int) e) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e, Array sh e)) scanr1 :: (Shape sh, Elt e) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (e -> e -> e) -> IRDelayed PTX aenv (Array (sh :. Int) e) -> CodeGen (IROpenAcc PTX aenv (Array (sh :. Int) e)) permute :: (Shape sh, Shape sh', Elt e) => PTX -> UID -> Gamma aenv -> IRPermuteFun PTX aenv (e -> e -> e) -> IRFun1 PTX aenv (sh -> sh') -> IRDelayed PTX aenv (Array sh e) -> CodeGen (IROpenAcc PTX aenv (Array sh' e)) backpermute :: (Shape sh, Shape sh', Elt e) => PTX -> UID -> Gamma aenv -> IRFun1 PTX aenv (sh' -> sh) -> IRDelayed PTX aenv (Array sh e) -> CodeGen (IROpenAcc PTX aenv (Array sh' e)) stencil :: (Stencil sh a stencil, Elt b) => PTX -> UID -> Gamma aenv -> IRFun1 PTX aenv (stencil -> b) -> IRBoundary PTX aenv (Array sh a) -> IRDelayed PTX aenv (Array sh a) -> CodeGen (IROpenAcc PTX aenv (Array sh b)) stencil2 :: (Stencil sh a stencil1, Stencil sh b stencil2, Elt c) => PTX -> UID -> Gamma aenv -> IRFun2 PTX aenv (stencil1 -> stencil2 -> c) -> IRBoundary PTX aenv (Array sh a) -> IRDelayed PTX aenv (Array sh a) -> IRBoundary PTX aenv (Array sh b) -> IRDelayed PTX aenv (Array sh b) -> CodeGen (IROpenAcc PTX aenv (Array sh c)) | |
Persistent PTX | |
Embed PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Embed | |
Execute PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Execute map :: (Shape sh, Elt b) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> sh -> LLVM PTX (Array sh b) generate :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> sh -> LLVM PTX (Array sh e) transform :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> sh -> LLVM PTX (Array sh e) backpermute :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> sh -> LLVM PTX (Array sh e) fold :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array sh e) fold1 :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array sh e) foldSeg :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> DIM1 -> LLVM PTX (Array (sh :. Int) e) fold1Seg :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> DIM1 -> LLVM PTX (Array (sh :. Int) e) scanl :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array (sh :. Int) e) scanl1 :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array (sh :. Int) e) scanl' :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array (sh :. Int) e, Array sh e) scanr :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array (sh :. Int) e) scanr1 :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array (sh :. Int) e) scanr' :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> (sh :. Int) -> LLVM PTX (Array (sh :. 
Int) e, Array sh e) permute :: (Shape sh, Shape sh', Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> Bool -> sh -> Array sh' e -> LLVM PTX (Array sh' e) stencil1 :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> sh -> LLVM PTX (Array sh e) stencil2 :: (Shape sh, Elt e) => ExecutableR PTX -> Gamma aenv -> AvalR PTX aenv -> StreamR PTX -> sh -> sh -> LLVM PTX (Array sh e) aforeign :: (Arrays as, Arrays bs) => String -> (StreamR PTX -> as -> LLVM PTX bs) -> StreamR PTX -> as -> LLVM PTX bs | |
Link PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Link linkForTarget :: ObjectR PTX -> LLVM PTX (ExecutableR PTX) | |
Compile PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Compile compileForTarget :: DelayedOpenAcc aenv a -> Gamma aenv -> LLVM PTX (ObjectR PTX) | |
Foreign PTX | |
Intrinsic PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.CodeGen.Intrinsic intrinsicForTarget :: PTX -> HashMap ShortByteString Label | |
Target PTX Source # | |
Defined in Data.Array.Accelerate.LLVM.PTX.Target targetTriple :: PTX -> Maybe ShortByteString targetDataLayout :: PTX -> Maybe DataLayout | |
Remote PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Array.Data allocateRemote :: (Shape sh, Elt e) => sh -> LLVM PTX (Array sh e) useRemoteR :: (ArrayElt e, ArrayPtrs e ~ Ptr a, Storable a, Typeable a, Typeable e) => Int -> Maybe (StreamR PTX) -> ArrayData e -> LLVM PTX () copyToRemoteR :: (ArrayElt e, ArrayPtrs e ~ Ptr a, Storable a, Typeable a, Typeable e) => Int -> Int -> Maybe (StreamR PTX) -> ArrayData e -> LLVM PTX () copyToHostR :: (ArrayElt e, ArrayPtrs e ~ Ptr a, Storable a, Typeable a, Typeable e) => Int -> Int -> Maybe (StreamR PTX) -> ArrayData e -> LLVM PTX () copyToPeerR :: (ArrayElt e, ArrayPtrs e ~ Ptr a, Storable a, Typeable a, Typeable e) => Int -> Int -> PTX -> Maybe (StreamR PTX) -> ArrayData e -> LLVM PTX () indexRemote :: Array sh e -> Int -> LLVM PTX e | |
Async PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async | |
Marshalable PTX Int | |
Marshalable PTX Int32 | |
ArrayElt e => Marshalable PTX (ArrayData e) | |
RemoteMemory (LLVM PTX) | |
Defined in Data.Array.Accelerate.LLVM.PTX.Array.Remote mallocRemote :: Int -> LLVM PTX (Maybe (RemotePtr (LLVM PTX) Word8)) pokeRemote :: PrimElt e a => Int -> RemotePtr (LLVM PTX) a -> ArrayData e -> LLVM PTX () peekRemote :: PrimElt e a => Int -> RemotePtr (LLVM PTX) a -> MutableArrayData e -> LLVM PTX () castRemotePtr :: proxy (LLVM PTX) -> RemotePtr (LLVM PTX) a -> RemotePtr (LLVM PTX) b totalRemoteMem :: LLVM PTX Int64 | |
data ExecutableR PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Link | |
data ObjectR PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.Compile data ObjectR PTX = ObjectR {
| |
type ArgR PTX | |
type EventR PTX Source # | |
Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async | |
type StreamR PTX Source # | |
Defined in Data.Array.Accelerate.LLVM.PTX.Execute.Async | |
data KernelMetadata PTX | |
Defined in Data.Array.Accelerate.LLVM.PTX.CodeGen.Base | |
type RemotePtr (LLVM PTX) | |
Defined in Data.Array.Accelerate.LLVM.PTX.Array.Remote |
createTargetForDevice :: Device -> DeviceProperties -> [ContextFlag] -> IO PTX Source #
Create a new PTX execution target for the given device
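A minimal sketch of building a target for a chosen device and running on it with runWith. It assumes the cuda package's Foreign.CUDA.Driver module for device selection (initialise, device, and props), that this module is imported as Data.Array.Accelerate.LLVM.PTX, and that device ordinal 0 exists; targetForDevice is a hypothetical helper name:

```haskell
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Vector, Z(..), (:.)(..))
import Data.Array.Accelerate.LLVM.PTX (PTX, createTargetForDevice, runWith)
import qualified Foreign.CUDA.Driver as CUDA

-- Hypothetical helper: create an execution target for the n-th CUDA
-- device, using no additional context flags.
targetForDevice :: Int -> IO PTX
targetForDevice n = do
  CUDA.initialise []          -- must precede any other driver call
  dev <- CUDA.device n        -- handle to the device with ordinal n
  prp <- CUDA.props dev       -- query its properties
  createTargetForDevice dev prp []

main :: IO ()
main = do
  ptx <- targetForDevice 0
  let xs = A.fromList (Z :. 5) [1 .. 5] :: Vector Float
  print (runWith ptx (A.map (+ 1) (A.use xs)))
```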
createTargetFromContext :: Context -> IO PTX Source #
Create a PTX execution target for the given device context
Controlling host-side allocation
registerPinnedAllocatorWith :: PTX -> IO () Source #
Configure the default execution target to allocate all future host-side arrays using (CUDA) pinned memory. Any newly allocated arrays will be page-locked and directly accessible from the device, enabling high-speed (asynchronous) DMA.
Note that since the amount of available pageable memory will be reduced, overall system performance can suffer.
registerPinnedAllocator :: IO ()
registerPinnedAllocator = registerPinnedAllocatorWith defaultTarget
All future array allocations will use pinned memory associated with the given execution context. These arrays will be directly accessible from the device, enabling high-speed asynchronous DMA.