futhark-0.16.3: An optimising compiler for a functional, array-oriented language.
Safe Haskell: Safe-Inferred



Carefully optimised implementations of GPU transpositions. They are written in ImpCode so that they can be compiled to both CUDA and OpenCL.



type TransposeArgs = (VName, Exp, VName, Exp, Exp, Exp, Exp, Exp, Exp, Exp, Exp, VName)

The types of the arguments accepted by a transposition function.

mapTransposeKernel :: String -> Integer -> TransposeArgs -> PrimType -> TransposeType -> Kernel

Generate a transpose kernel. There is special support to handle input arrays with low width, low height, or both.

Normally, when transposing a [2][n] array, we would use a FUT_BLOCK_DIM x FUT_BLOCK_DIM group to process a [2][FUT_BLOCK_DIM] slice of the input array. This would leave many of the threads in the group inactive. We remedy this with a special kernel that uses more complex indexing to process a larger part of the input per group. In our example, all threads in a group are used if each group processes a slice of each row that is FUT_BLOCK_DIM/2 times as large. The variable mulx contains this factor for the kernel that handles input arrays with low height.

See issue #308 on GitHub for more details.
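The thread-utilisation argument for low-height inputs can be checked with a little arithmetic. The following is a minimal sketch, not the compiler's code: the function names are hypothetical, and generalising the factor to blockDim `div` height is an assumption extrapolated from the [2][n] example.

```haskell
-- Hypothetical helper names; none of these are part of the Futhark API.

-- | Threads doing useful work when one square group of
-- @blockDim * blockDim@ threads is handed a @[height][blockDim]@ slice.
activeThreadsPlain :: Int -> Int -> Int
activeThreadsPlain blockDim height = min height blockDim * blockDim

-- | Assumed widening factor for low-height inputs: how many times wider
-- the per-group slice of each row becomes. For height 2 this is blockDim / 2.
mulxFor :: Int -> Int -> Int
mulxFor blockDim height = max 1 (blockDim `div` height)

-- | Threads doing useful work once the slice is widened by that factor,
-- capped at the total number of threads in the group.
activeThreadsWidened :: Int -> Int -> Int
activeThreadsWidened blockDim height =
  min (blockDim * blockDim) (height * blockDim * mulxFor blockDim height)
```

Under these assumptions, with a block dimension of 16 and a height of 2, the plain scheme keeps 32 of 256 threads busy, while the widened slice (factor 8) keeps all 256 busy.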

These kernels are optimized to ensure all global reads and writes are coalesced, and to avoid bank conflicts in shared memory. Each thread group transposes a 2D tile of block_dim*2 by block_dim*2 elements. The size of a thread group is block_dim/2 by block_dim*2, meaning that each thread will process 4 elements in a 2D tile. The shared memory array containing the 2D tile consists of block_dim*2 by block_dim*2+1 elements. Padding each row with an additional element prevents bank conflicts from occurring when the tile is accessed column-wise.
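The tile geometry and the effect of the padding element can be modelled on the host. This is a rough sketch under stated assumptions: shared memory is taken to have 32 banks with one tile element per bank word, and every name below is hypothetical rather than taken from the module.

```haskell
import Data.List (nub)

-- Hypothetical model; the names and the 32-bank layout are illustrative.

-- | Side length of the square tile handled by one thread group.
tileDim :: Int -> Int
tileDim blockDim = blockDim * 2

-- | Elements per thread: tile elements divided by the group size
-- of (blockDim / 2) by (blockDim * 2) threads.
elemsPerThread :: Int -> Int
elemsPerThread blockDim =
  (tileDim blockDim * tileDim blockDim)
    `div` ((blockDim `div` 2) * (blockDim * 2))

-- | Bank hit by tile element (row, col), given the row stride in elements.
bank :: Int -> Int -> (Int, Int) -> Int
bank numBanks stride (row, col) = (row * stride + col) `mod` numBanks

-- | Number of distinct banks touched when numBanks threads each read a
-- different row of the same column.
distinctBanksInColumn :: Int -> Int -> Int -> Int
distinctBanksInColumn numBanks stride col =
  length (nub [bank numBanks stride (row, col) | row <- [0 .. numBanks - 1]])
```

With block_dim = 16 the tile is 32 by 32 and each thread handles 4 elements. In this model an unpadded row stride of 32 sends every element of a column to the same bank, whereas the padded stride of 33 spreads the column over all 32 banks.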

Note that input_size and output_size might not equal width*height if we are dealing with a truncated array. This sometimes happens as a result of coalescing optimisations.