Which target are we ultimately generating code for? While most of the kernel code is the same, there are some cases where we generate special code based on the ultimate low-level API we are targeting.

Constructors

CUDA
OpenCL
HIP

data KernelEnv

Constructors

KernelEnv
Fields

kernelAtomics :: AtomicBinOp
kernelConstants :: KernelConstants
kernelLocks :: Map VName Locks

blockReduce :: TExp Int32 -> Lambda GPUMem -> [VName] -> InKernelGen ()

blockScan :: Maybe (TExp Int32 -> TExp Int32 -> TExp Bool) -> TExp Int64 -> TExp Int64 -> Lambda GPUMem -> [VName] -> InKernelGen ()

blockLoop :: IntExp t => TExp t -> (TExp t -> InKernelGen ()) -> InKernelGen ()

Assign iterations of a for-loop to threads in the threadblock. The passed-in function is invoked with the (symbolic) iteration. For multidimensional loops, use blockCoverSpace.
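As a rough illustration of what assigning iterations to threads can mean, here is a plain-Haskell model; it is not the code generator itself. It assumes a strided scheme in which the thread with local id ltid, in a block of blockSize threads, handles iterations ltid, ltid + blockSize, and so on. The names iterationsFor and coveredBy are hypothetical, and the code actually generated may distribute iterations differently.

```haskell
import Data.List (sort)

-- Hypothetical model: the iterations of an n-iteration loop handled by
-- the thread with local id ltid, assuming a strided assignment across
-- a block of blockSize threads.
iterationsFor :: Int -> Int -> Int -> [Int]
iterationsFor blockSize n ltid
  | blockSize <= 0 = []
  | otherwise = takeWhile (< n) [ltid, ltid + blockSize ..]

-- All iterations handled by the block as a whole, in ascending order.
-- Under the strided assumption each iteration appears exactly once.
coveredBy :: Int -> Int -> [Int]
coveredBy blockSize n =
  sort (concatMap (iterationsFor blockSize n) [0 .. blockSize - 1])
```

In this model a block of 4 threads running a 10-iteration loop gives thread 1 the iterations 1, 5 and 9.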

isActive :: [(VName, SubExp)] -> TExp Bool

sKernel :: Operations GPUMem KernelEnv KernelOp -> (KernelConstants -> TExp Int32) -> String -> VName -> KernelAttrs -> InKernelGen () -> CallKernelGen ()

sKernelThread :: String -> VName -> KernelAttrs -> InKernelGen () -> CallKernelGen ()

data KernelAttrs

Various extra configuration of the kernel being generated.

Constructors

KernelAttrs

Fields

kAttrFailureTolerant :: Bool
Can this kernel execute correctly even if previous kernels failed?
kAttrCheckSharedMemory :: Bool
Does whatever launches this kernel check for shared memory capacity itself?
kAttrNumBlocks :: Count NumBlocks SubExp
Number of blocks.
kAttrBlockSize :: Count BlockSize SubExp
Block size.
kAttrConstExps :: Map VName KernelConstExp
Variables that are specially in scope inside the kernel. Operationally, these will be available at kernel compile time (which happens at run-time, with access to machine-specific information).

defKernelAttrs :: Count NumBlocks SubExp -> Count BlockSize SubExp -> KernelAttrs

The default kernel attributes.

lvlKernelAttrs :: SegLevel -> CallKernelGen KernelAttrs

Compute kernel attributes from a SegLevel, synthesising the block size and thread count if no grid is provided.

allocLocal :: AllocCompiler GPUMem r KernelOp

compileThreadResult :: SegSpace -> PatElem LetDecMem -> KernelResult -> InKernelGen ()

virtualiseBlocks :: SegVirt -> TExp Int32 -> (TExp Int32 -> InKernelGen ()) -> InKernelGen ()

For many kernels, we may not have enough physical blocks to cover the logical iteration space. Some blocks thus have to perform double duty; we put an outer loop to accomplish this. The advantage over just launching a bazillion threads is that the cost of memory expansion should be proportional to the number of *physical* threads (hardware parallelism), not the amount of application parallelism.
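The double-duty idea can be modelled in plain Haskell as follows. This is only a sketch under an assumption: that the outer loop advances by the number of physical blocks on each iteration, so that physical block p serves logical blocks p, p + P, p + 2P, and so on. The names virtualBlocksFor and outerIterations are hypothetical and do not appear in the module.

```haskell
-- Hypothetical model: the logical (virtual) block ids served by one
-- physical block, assuming the outer loop steps by numPhysical.
virtualBlocksFor :: Int -> Int -> Int -> [Int]
virtualBlocksFor numPhysical numLogical physicalId
  | numPhysical <= 0 = []
  | otherwise =
      takeWhile (< numLogical) [physicalId, physicalId + numPhysical ..]

-- Upper bound on the outer-loop trip count of any physical block:
-- the ceiling of numLogical / numPhysical (numPhysical must be positive).
outerIterations :: Int -> Int -> Int
outerIterations numPhysical numLogical =
  (numLogical + numPhysical - 1) `div` numPhysical
```

With 3 physical blocks and 8 logical blocks, physical block 0 serves logical blocks 0, 3 and 6, while physical block 2 serves only 2 and 5.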

kernelLoop :: IntExp t => TExp t -> TExp t -> TExp t -> (TExp t -> InKernelGen ()) -> InKernelGen ()

Assign iterations of a for-loop to all threads in the kernel. The passed-in function is invoked with the (symbolic) iteration. The body must contain thread-level code. For multidimensional loops, use blockCoverSpace.

blockCoverSpace :: IntExp t => [TExp t] -> ([TExp t] -> InKernelGen ()) -> InKernelGen ()

Iterate collectively through a multidimensional space, such that all threads in the block participate. The passed-in function is invoked with a (symbolic) point in the index space.
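To make "a point in the index space" concrete, the sketch below converts a flat iteration number into a multidimensional point. It assumes a row-major layout and uses plain Int values in place of symbolic expressions; unflatten and points are hypothetical helper names, not functions of this module.

```haskell
-- Hypothetical helper: convert a flat index into a point of a row-major
-- index space with the given dimensions. Every dimension must be nonzero.
unflatten :: [Int] -> Int -> [Int]
unflatten dims flat = snd (foldr step (flat, []) dims)
  where
    -- Peel off the innermost remaining dimension first.
    step d (rest, acc) = (rest `div` d, rest `mod` d : acc)

-- All points of the space, in the order a flat iteration would visit them.
-- An empty space (some dimension is zero) yields no points.
points :: [Int] -> [[Int]]
points dims = map (unflatten dims) [0 .. product dims - 1]
```

For a 2 by 3 space, flat index 4 corresponds to the point [1, 1].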

fenceForArrays :: [VName] -> InKernelGen Fence

If we are touching these arrays, which kind of fence do we need?

updateAcc :: VName -> [SubExp] -> [SubExp] -> InKernelGen ()

genZeroes :: String -> Int -> CallKernelGen VName

Generate a constant device array of 32-bit integer zeroes with the given number of elements. Initialised with a replicate.

isPrimParam :: Typed p => Param p -> Bool

kernelConstToExp :: KernelConstExp -> CallKernelGen Exp

getChunkSize :: [Type] -> KernelConstExp

Given the available registers and a list of parameter types, compute the largest chunk size that fits, taking into account the parameters for which we want chunking and the available resources. Used in compileSegScan, and compileSegRed (with primitive non-commutative operators only).

Host-level bulk operations

sReplicate :: VName -> SubExp -> CallKernelGen ()

Perform a Replicate with a kernel.

sIota :: VName -> TExp Int64 -> Exp -> Exp -> IntType -> CallKernelGen ()

Perform an Iota with a kernel.

Atomics

type AtomicBinOp = BinOp -> Maybe (VName -> VName -> Count Elements (TExp Int64) -> Exp -> AtomicOp)

Is there an atomic BinOp corresponding to this BinOp?

atomicUpdateLocking :: AtomicBinOp -> Lambda GPUMem -> AtomicUpdate GPUMem KernelEnv

Do an atomic update corresponding to a binary operator lambda.

data Locking

Locking strategy used for an atomic update.

Constructors

Locking

Fields

lockingArray :: VName
Array containing the lock.
lockingIsUnlocked :: TExp Int32
Value for us to consider the lock free.
lockingToLock :: TExp Int32
What to write when we lock it.
lockingToUnlock :: TExp Int32
What to write when we unlock it.
lockingMapping :: [TExp Int64] -> [TExp Int64]
A transformation from the logical lock index to the physical position in the array. This can also be used to make the lock array smaller.
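One way a mapping could make the lock array smaller is to flatten the logical index and reduce it modulo the number of locks. The sketch below shows this with plain Int64 values standing in for TExp Int64. The function sharedLockMapping and its parameters numLocks and dims are hypothetical; whether the code generator uses this exact scheme is an assumption.

```haskell
import Data.Int (Int64)

-- Hypothetical mapping: fold a multidimensional logical lock index into
-- a single physical slot of a lock array with numLocks entries.
-- Distinct logical indices may then share a lock.
sharedLockMapping :: Int64 -> [Int64] -> [Int64] -> [Int64]
sharedLockMapping numLocks dims idx = [flat `mod` numLocks]
  where
    -- Row-major flattening of idx within dims.
    flat = foldl (\acc (d, i) -> acc * d + i) 0 (zip dims idx)
```

Partially applying the first two arguments, as in sharedLockMapping 100 [10, 20], gives a function of the same shape as the lockingMapping field, modulo the stand-in element type.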

data AtomicUpdate rep r

The mechanism that will be used for performing the atomic update. Approximates how efficient it will be. Ordered from most to least efficient.

Constructors

AtomicPrim (DoAtomicUpdate rep r)	Supported directly by primitive.
AtomicCAS (DoAtomicUpdate rep r)	Can be done by efficient swaps.
AtomicLocking (Locking -> DoAtomicUpdate rep r)	Requires explicit locking.
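The compare-and-swap strategy can be illustrated on the host with a retry loop. The following is a CPU-side model using IORef, not the GPU code that is generated; compareAndSwap, casUpdate and demo are hypothetical names, and the stand-in swap is built on atomicModifyIORef' rather than on a hardware instruction.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

-- Stand-in for a hardware compare-and-swap: store new only if the cell
-- still holds expected, and report whether the store happened.
compareAndSwap :: Eq a => IORef a -> a -> a -> IO Bool
compareAndSwap ref expected new =
  atomicModifyIORef' ref $ \cur ->
    if cur == expected then (new, True) else (cur, False)

-- Apply a binary operator to the cell, retrying until the swap succeeds.
casUpdate :: Eq a => (a -> a -> a) -> IORef a -> a -> IO ()
casUpdate op ref x = do
  old <- readIORef ref
  ok <- compareAndSwap ref old (old `op` x)
  if ok then pure () else casUpdate op ref x

-- Small usage example: update a cell with max, then with (+).
demo :: IO Int
demo = do
  ref <- newIORef 10
  casUpdate max ref 42
  casUpdate (+) ref 1
  readIORef ref
```

The loop only retries when another writer changed the cell between the read and the swap, which is why this strategy sits between a direct primitive and explicit locking in the ordering above.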

type DoAtomicUpdate rep r = Space -> [VName] -> [TExp Int64] -> ImpM rep r KernelOp ()

A function for generating code for an atomic update. Assumes that the bucket is in-bounds.