futhark-0.25.22: An optimising compiler for a functional, array-oriented language.
Safe HaskellSafe-Inferred
LanguageGHC2021

Futhark.Pass.ExtractKernels

Description

Kernel extraction.

In the following, I will use the term "width" to denote the amount of immediate parallelism in a map - that is, the outer size of the array(s) being used as input.

Basic Idea

If we have:

  map
    map(f)
    stms_a...
    map(g)

Then we want to distribute to:

  map
    map(f)
  map
    stms_a
  map
    map(g)

But for now only if

  1. it can be done without creating irregular arrays. Specifically, the size of the arrays created by map(f), by map(g) and whatever is created by stms_a that is also used in map(g), must be invariant to the outermost loop.
  2. the maps are _balanced_. That is, the functions f and g must do the same amount of work for every iteration.

The advantage is that the map-nests containing map(f) and map(g) can now be trivially flattened at no cost, thus exposing more parallelism. Note that the stms_a map constitutes array expansion, which requires additional storage.

Distributing Sequential Loops

As a starting point, sequential loops are treated like scalar expressions. That is, not distributed. However, sometimes it can be worthwhile to distribute if they contain a map:

  map
    loop
      map
    map

If we distribute the loop and interchange the outer map into the loop, we get this:

  loop
    map
      map
  map
    map

Now more parallelism may be available.

Unbalanced Maps

Unbalanced maps will as a rule be sequentialised, but sometimes, there is another way. Assume we find this:

  map
    map(f)
      map(g)
    map

Presume that map(f) is unbalanced. By the simple rule above, we would then fully sequentialise it, resulting in this:

  map
    loop
  map
    map

Balancing by Loop Interchange

The above is not ideal, as we cannot flatten the map-loop nest, and we are thus limited in the amount of parallelism available.

But assume now that the width of map(g) is invariant to the outer loop. Then if possible, we can interchange map(f) and map(g), sequentialise map(f) and distribute, interchanging the outer parallel loop into the sequential loop:

  loop(f)
    map
      map(g)
  map
    map

After flattening the two nests we can obtain more parallelism.

When distributing a map, we also need to distribute everything that the map depends on - possibly as its own map. When distributing a set of scalar bindings, we will need to know which of the binding results are used afterwards. Hence, we will need to compute usage information.

Redomap

Redomap can be handled much like map. Distributed loops are distributed as maps, with the parameters corresponding to the neutral elements added to their bodies. The remaining loop will remain a redomap. Example:

redomap(op,
        fn (v) =>
          map(f)
          map(g),
        e,a)

distributes to

let b = map(fn v =>
              let acc = e
              map(f),
              a)
redomap(op,
        fn (v,dist) =>
          map(g),
        e,a,b)

Note that there may be further kernel extraction opportunities inside the map(f). The downside of this approach is that the intermediate array (b above) must be written to main memory. An often better approach is to just turn the entire redomap into a single kernel.

Synopsis

Documentation

extractKernels :: Pass SOACS GPU Source #

Transform a program using SOACs to a program using explicit kernels, using the kernel extraction transformation.