Safe Haskell | Safe-Inferred |
---|---|
Language | GHC2021 |
Kernel extraction.
In the following, I will use the term "width" to denote the amount of immediate parallelism in a map - that is, the outer size of the array(s) being used as input.
Basic Idea
If we have:
map map(f) stms_a... map(g)
Then we want to distribute to:
map map(f) map stms_a map map(g)
But for now only if
- it can be done without creating irregular arrays.
Specifically, the size of the arrays created by
map(f)
, bymap(g)
and whatever is created bystms_a
that is also used inmap(g)
, must be invariant to the outermost loop. - the maps are _balanced_. That is, the functions
f
andg
must do the same amount of work for every iteration.
The advantage is that the map-nests containing map(f)
and
map(g)
can now be trivially flattened at no cost, thus exposing
more parallelism. Note that the stms_a
map constitutes array
expansion, which requires additional storage.
Distributing Sequential Loops
As a starting point, sequential loops are treated like scalar expressions. That is, not distributed. However, sometimes it can be worthwhile to distribute if they contain a map:
map loop map map
If we distribute the loop and interchange the outer map into the loop, we get this:
loop map map map map
Now more parallelism may be available.
Unbalanced Maps
Unbalanced maps will as a rule be sequentialised, but sometimes, there is another way. Assume we find this:
map map(f) map(g) map
Presume that map(f)
is unbalanced. By the simple rule above, we
would then fully sequentialise it, resulting in this:
map loop map map
Balancing by Loop Interchange
The above is not ideal, as we cannot flatten the map-loop
nest,
and we are thus limited in the amount of parallelism available.
But assume now that the width of map(g)
is invariant to the outer
loop. Then if possible, we can interchange map(f)
and map(g)
,
sequentialise map(f)
and distribute, interchanging the outer
parallel loop into the sequential loop:
loop(f) map map(g) map map
After flattening the two nests we can obtain more parallelism.
When distributing a map, we also need to distribute everything that the map depends on - possibly as its own map. When distributing a set of scalar bindings, we will need to know which of the binding results are used afterwards. Hence, we will need to compute usage information.
Redomap
Redomap can be handled much like map. Distributed loops are distributed as maps, with the parameters corresponding to the neutral elements added to their bodies. The remaining loop will remain a redomap. Example:
redomap(op, fn (v) => map(f) map(g), e,a)
distributes to
let b = map(fn v => let acc = e map(f), a) redomap(op, fn (v,dist) => map(g), e,a,b)
Note that there may be further kernel extraction opportunities
inside the map(f)
. The downside of this approach is that the
intermediate array (b
above) must be written to main memory. An
often better approach is to just turn the entire redomap
into a
single kernel.