Safe Haskell	None
Language	Haskell2010

Futhark.CodeGen.ImpGen.Kernels.Transpose

Synopsis

data TransposeType
type TransposeArgs = (VName, Exp, VName, Exp, Exp, Exp, Exp, Exp, Exp, Exp, Exp, VName)
mapTranspose :: Exp -> TransposeArgs -> PrimType -> TransposeType -> KernelCode
mapTransposeKernel :: String -> Integer -> TransposeArgs -> PrimType -> TransposeType -> Kernel

Documentation

data TransposeType

Which form of transposition to generate code for.

Constructors

TransposeNormal
TransposeLowWidth
TransposeLowHeight
TransposeSmall	For small arrays that do not benefit from coalescing.

Instances

Eq TransposeType
    Instance details
    Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose
    Methods
        (==) :: TransposeType -> TransposeType -> Bool
        (/=) :: TransposeType -> TransposeType -> Bool
Ord TransposeType
    Instance details
    Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose
    Methods
        compare :: TransposeType -> TransposeType -> Ordering
        (<) :: TransposeType -> TransposeType -> Bool
        (<=) :: TransposeType -> TransposeType -> Bool
        (>) :: TransposeType -> TransposeType -> Bool
        (>=) :: TransposeType -> TransposeType -> Bool
        max :: TransposeType -> TransposeType -> TransposeType
        min :: TransposeType -> TransposeType -> TransposeType
Show TransposeType
    Instance details
    Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose
    Methods
        showsPrec :: Int -> TransposeType -> ShowS
        show :: TransposeType -> String
        showList :: [TransposeType] -> ShowS

type TransposeArgs = (VName, Exp, VName, Exp, Exp, Exp, Exp, Exp, Exp, Exp, Exp, VName)

mapTranspose :: Exp -> TransposeArgs -> PrimType -> TransposeType -> KernelCode

Generate a transpose kernel. There is special support to handle input arrays with low width, low height, or both.
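
As a rough illustration of how the four forms relate to the shape of the input, the sketch below classifies an array by its width and height. Everything in it is an assumption made for illustration: the function chooseTransposeType is hypothetical, the threshold of half the block dimension is a guess rather than the rule this module's callers use, and the data type is a local copy of the constructors listed above so that the example stands alone.

```haskell
-- Local copy of the constructors documented above, so the sketch compiles
-- on its own. It is not imported from Futhark.
data TransposeType
  = TransposeNormal
  | TransposeLowWidth
  | TransposeLowHeight
  | TransposeSmall
  deriving (Eq, Ord, Show)

-- Hypothetical classifier. The threshold (half the block dimension) is an
-- assumption chosen for illustration, not the rule used by Futhark.
chooseTransposeType :: Int -> Int -> Int -> TransposeType
chooseTransposeType blockDim width height
  | lowWidth && lowHeight = TransposeSmall      -- low in both dimensions
  | lowWidth              = TransposeLowWidth
  | lowHeight             = TransposeLowHeight
  | otherwise             = TransposeNormal
  where
    lowWidth  = width  <= blockDim `div` 2
    lowHeight = height <= blockDim `div` 2
```

Under these assumed thresholds, with a block dimension of 16, a [2][1000] array (height 2, width 1000) would be classified as TransposeLowHeight and a [4][4] array as TransposeSmall.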

Normally, when transposing a [2][n] array, we would use a FUT_BLOCK_DIM x FUT_BLOCK_DIM group to process a [2][FUT_BLOCK_DIM] slice of the input array. This would leave most of the threads in the group inactive. We remedy this with a special kernel that uses more complex indexing to process a larger part of the input per group. In our example, all threads in a group can be used if each group processes a slice of each row that is FUT_BLOCK_DIM/2 times as large. The variable mulx contains this factor for the kernel that handles input arrays with low height.
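
The arithmetic behind the low-height case can be sketched as follows. The helper names activeThreads and lowHeightFactor are hypothetical and not part of this module, and the sketch assumes that the height divides the block dimension evenly.

```haskell
-- Hypothetical helper: number of threads in a blockDim x blockDim group
-- that have an element to process when the input has only `height` rows
-- and each group covers a slice of blockDim columns.
activeThreads :: Int -> Int -> Int
activeThreads blockDim height = min blockDim height * blockDim

-- Hypothetical helper: factor by which the per-group slice of each row must
-- grow so that every thread in the group has an element to process.
-- Assumes 0 < height <= blockDim and that height divides blockDim.
lowHeightFactor :: Int -> Int -> Int
lowHeightFactor blockDim height = blockDim `div` height
```

With a block dimension of 16 (an illustrative value) and a height of 2, only 32 of the 256 threads in a group are active in the naive scheme. Widening the slice by a factor of 8 gives each group 2 x 128 = 256 elements, one per thread.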

See issue #308 on GitHub for more details.

These kernels are optimized to ensure all global reads and writes are coalesced, and to avoid bank conflicts in shared memory. Each thread group transposes a 2D tile of block_dim*2 by block_dim*2 elements. The size of a thread group is block_dim/2 by block_dim*2, meaning that each thread will process 4 elements in a 2D tile. The shared memory array containing the 2D tile consists of block_dim*2 by (block_dim*2+1) elements. Padding each row with an additional element prevents bank conflicts from occurring when the tile is accessed column-wise.
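
The tile arithmetic and the effect of the padding can be checked with a small sketch. The names TileGeometry, tileGeometry and distinctBanks are hypothetical. The bank model is also an assumption: it treats shared memory as a fixed number of banks with consecutive elements mapped to consecutive banks, which matches common GPU hardware for 4-byte elements but is not stated by this module.

```haskell
import Data.List (nub)

-- Hypothetical summary of the tile layout described above.
data TileGeometry = TileGeometry
  { tileElems      :: Int  -- elements in one 2D tile
  , groupThreads   :: Int  -- threads in one thread group
  , elemsPerThread :: Int  -- tile elements handled by each thread
  , sharedElems    :: Int  -- shared memory elements, including padding
  } deriving (Eq, Show)

tileGeometry :: Int -> TileGeometry
tileGeometry blockDim =
  TileGeometry
    { tileElems      = tileSide * tileSide
    , groupThreads   = threads
    , elemsPerThread = (tileSide * tileSide) `div` threads
    , sharedElems    = tileSide * (tileSide + 1)
    }
  where
    tileSide = blockDim * 2
    threads  = (blockDim `div` 2) * (blockDim * 2)

-- Number of distinct banks touched when numBanks threads read column 0 of
-- consecutive rows, given the row stride in elements. Assumes element i
-- lives in bank (i mod numBanks).
distinctBanks :: Int -> Int -> Int
distinctBanks numBanks rowStride =
  length (nub [ (row * rowStride) `mod` numBanks | row <- [0 .. numBanks - 1] ])
```

For a block dimension of 16, tileGeometry 16 gives a tile of 1024 elements, 256 threads, 4 elements per thread, and 1056 shared memory elements. Under the assumed 32-bank model, distinctBanks 32 32 is 1 (every row of a column lands in the same bank), while distinctBanks 32 33 is 32 (the padded stride spreads the column across all banks).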

Note that input_size and output_size may not equal width*height if we are dealing with a truncated array; this sometimes happens as a result of coalescing optimisations.

mapTransposeKernel :: String -> Integer -> TransposeArgs -> PrimType -> TransposeType -> Kernel