| Safe Haskell | Safe-Inferred |
| --- | --- |
| Language | GHC2021 |
We generate code for non-segmented/single-segment SegRed using the basic approach outlined in the paper "Design and GPGPU Performance of Futhark’s Redomap Construct" (ARRAY '16). The main deviations are:
- While we still use two-phase reduction, we use only a single kernel: the last threadblock to write its result (tracked via an atomic counter) also performs the final reduction.
- Instead of depending on storage layout transformations to handle non-commutative reductions efficiently, we slide a tblocksize-sized window over the input and perform a parallel reduction for each window. This sacrifices efficient sequentialisation, but is sometimes faster, and is definitely simpler and more predictable (and uses less auxiliary storage). A sequential model of this windowed traversal is sketched after this list.
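To make the sliding-window traversal concrete, here is a minimal sequential Haskell model (purely illustrative; windowedReduce and its helpers are hypothetical names, not part of this module). Each tblocksize-sized window is reduced with an order-preserving tree reduction, mirroring the intra-block parallel reduction, and the per-window results are folded left to right, which is valid for any associative (including non-commutative) operator:

```haskell
-- Hypothetical sequential model of the windowed traversal; the real code
-- generator emits a GPU kernel instead.
windowedReduce :: Int -> (a -> a -> a) -> a -> [a] -> a
windowedReduce tblocksize op ne =
  foldl op ne . map treeReduce . chunksOf tblocksize
  where
    -- Order-preserving tree reduction of one window, standing in for the
    -- intra-block parallel reduction.
    treeReduce [x] = x
    treeReduce xs  = treeReduce (pairwise xs)
    pairwise (a : b : rest) = op a b : pairwise rest
    pairwise rest           = rest
    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t
```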
For segmented reductions we use the approach from "Strategies for Regular Segmented Reductions on GPU" (FHPC '17). This involves having two different strategies, and dynamically deciding which one to use based on the number of segments and segment size. We use the (static) tblock_size to decide which of the following two strategies to choose (a sketch of this choice follows the list):
- Large: uses one or more blocks to process a single segment. If multiple blocks are used per segment, the intermediate reduction results must be recursively reduced until there is only a single value per segment. Each thread can read multiple elements, which greatly increases performance; however, if the reduction is non-commutative, we have to use a less efficient traversal (with interim block-wide reductions) to enable coalesced memory accesses, just as in the non-segmented case.
- Small: lets a single block process *multiple* segments. We only use this approach when at least two segments fit within a single block; in those cases, the large strategy would allocate a whole block per segment, yet at most 50% of the threads in each block would have any element to read, which is highly inefficient.
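As a rough illustration, the choice between the two strategies can be modelled by a predicate like the following (a hypothetical sketch; SegRedStrategy and chooseStrategy are illustrative names, and the compiler's actual decision logic may involve further considerations):

```haskell
data SegRedStrategy = Large | Small

-- Pick Small only when at least two whole segments fit in one block of
-- the (static) tblock_size; otherwise use one or more blocks per segment.
chooseStrategy :: Int -> Int -> SegRedStrategy
chooseStrategy tblockSize segmentSize
  | 2 * segmentSize <= tblockSize = Small
  | otherwise                     = Large
```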
An optimization specifically targeted at non-segmented and large-segment segmented reductions with non-commutative operators is made: the stage-one main loop is essentially strip-mined by a factor *chunk*, inserting collective copies via local memory of each reduction parameter going into the intra-block (partial) reductions. This saves a factor of *chunk* intra-block reductions, at the cost of some overhead in the collective copies. A sequential model of the strip-mined loop is sketched below.
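To see why strip-mining helps, here is a minimal sequential Haskell model of the strip-mined stage-one loop (hypothetical names; the staging of each thread's chunk through local memory is elided). Each of the tblocksize "threads" first folds its *chunk* consecutive elements privately, so only one intra-block reduction is paid per chunk * tblocksize elements instead of *chunk* of them:

```haskell
-- Hypothetical model of the strip-mined loop; operand order is preserved,
-- so it is valid for non-commutative (but associative) operators.
chunkedReduce :: Int -> Int -> (a -> a -> a) -> a -> [a] -> a
chunkedReduce tblocksize chunk op ne =
  foldl op ne . map blockStep . chunksOf (tblocksize * chunk)
  where
    -- One strip-mined iteration: each thread folds its private chunk,
    -- then a single intra-block reduction combines the per-thread results.
    blockStep = foldl1 op . map (foldl1 op) . chunksOf chunk
    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t
```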
Synopsis
- compileSegRed :: Pat LetDecMem -> SegLevel -> SegSpace -> [SegBinOp GPUMem] -> KernelBody GPUMem -> CallKernelGen ()
- compileSegRed' :: Pat LetDecMem -> KernelGrid -> SegSpace -> [SegBinOp GPUMem] -> DoSegBody -> CallKernelGen ()
- type DoSegBody = ([(SubExp, [TExp Int64])] -> InKernelGen ()) -> InKernelGen ()
Documentation
compileSegRed :: Pat LetDecMem -> SegLevel -> SegSpace -> [SegBinOp GPUMem] -> KernelBody GPUMem -> CallKernelGen () Source #
Compile SegRed instance to host-level code with calls to various kernels.
compileSegRed' :: Pat LetDecMem -> KernelGrid -> SegSpace -> [SegBinOp GPUMem] -> DoSegBody -> CallKernelGen () Source #
Like compileSegRed, but where the body is a monadic action.
type DoSegBody = ([(SubExp, [TExp Int64])] -> InKernelGen ()) -> InKernelGen () Source #
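DoSegBody is a body in continuation-passing style: it is given a continuation that consumes the computed result values (pairs of a SubExp and its index expressions). A simplified stand-in may help illustrate the shape (hypothetical; IO stands in for InKernelGen, and plain Ints stand in for the SubExp/TExp pairs):

```haskell
-- Hypothetical, simplified analogue of DoSegBody.
type DoBody r = ([r] -> IO ()) -> IO ()

-- A body that computes its results and hands them to the consumer once.
exampleBody :: DoBody Int
exampleBody k = do
  let results = [1, 2, 3]  -- stand-in for the computed reduction operands
  k results

-- The caller supplies the continuation that consumes the results.
runBody :: DoBody Int -> IO ()
runBody body = body (print . sum)
```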