datafix-0.0.1.0: Fixing data-flow problems

Datafix.Tutorial

Description

# What is This?

The purpose of datafix is to separate declaring data-flow problems from computing their solutions by fixed-point iteration.

The need for this library arose when I was combining two analyses within GHC for my master's thesis. I held a talk on that topic, feel free to click through if you want to know the details.

You can think of data-flow problems as problems that are solvable by dynamic programming or memoization, except that the dependency graph of data-flow problems doesn't need to be acyclic.

Data-flow problems are declared with the primitives in Datafix.Description and solved by Datafix.Worklist.solveProblem.

With that out of the way, let's set in place the GHCi environment of our examples:

>>> :set -XScopedTypeVariables
>>> :set -XTypeApplications
>>> :set -XTypeFamilies
>>> import Datafix
>>> import Data.Proxy (Proxy (..))
>>> import Algebra.Lattice (JoinSemiLattice (..), BoundedJoinSemiLattice (..))
>>> import Numeric.Natural


# Use Case: Solving Recurrences

Let's start out by computing the fibonacci series:

>>> :{
  fib :: Natural -> Natural
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
:}

>>> fib 3
2
>>> fib 10
55


Bring your rabbits to the vet while you can still count them...

Anyway, the fibonacci series is a typical problem exhibiting overlapping subproblems. As a result, our fib function from above scales badly in the size of its input argument n. Because we repeatedly recompute solutions, the time complexity of our above function is in $$\mathcal{O}(2^n)$$!

We can do better by using dynamic programming or memoization to keep a cache of already computed sub-problems, which helps computing the $$n$$th item in $$\mathcal{O}(n)$$ time and space:

>>> :{
  fib2 :: Natural -> Natural
fib2 n = fibs !! fromIntegral n
where
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
:}

>>> fib2 3
2
>>> fib2 10
55


That's one of Haskell's pet issues: Expressing dynamic programs as lists through laziness.

As promised in the previous section, we can do the same using datafix. First, we need to declare a transfer function that makes the data dependencies for the recursive case explicit, as if we were using fix to eliminate the recursion:

>>> :{
  transferFib
:: forall m
. (MonadDependency m, Domain m ~ Natural)
=> Node
-> LiftedFunc Natural m
transferFib (Node 0) = return 0
transferFib (Node 1) = return 1
transferFib (Node n) = do
a <- dependOn @m (Node (n-1))
b <- dependOn @m (Node (n-2))
return (a + b)
:}


MonadDependency contains a single primitive dependOn for that purpose.

Every point of the fibonacci series is modeled as a seperate Node of the data-flow graph. By looking at the definition of LiftedFunc, we can see that LiftedFunc Natural m ~ m Natural, so for our simple Natural Domain, the transfer function is specified directly in MonadDependency.

Note that indeed we eliminated explicit recursion in transferFib. This allows the solution algorithm to track and discover dependencies of the transfer function as it is executed!

With our transfer function (which denotes data-flow nodes in the semantics of Naturals) in place, we can construct a DataFlowFramework:

>>> :{
  fibDff :: forall m . (MonadDependency m, Domain m ~ Natural) => DataFlowFramework m
fibDff = DFF transferFib (const (eqChangeDetector @(Domain m)))
:}


The eqChangeDetector is important for cyclic dependency graphs and makes sure we detect when a fixed-point has been reached.

That's it for describing the data-flow problem of fibonacci numbers. We can ask Datafix.Worklist.solveProblem for a solution in a minute.

The solveProblem solver demands an instance of BoundedJoinSemiLattice on the Domain for when the data-flow graph is cyclic. We conveniently delegate to the total Ord instance for Natural, knowing that its semantic interpretation is irrelevant to us:

>>> instance JoinSemiLattice Natural where (\/) = max
>>> instance BoundedJoinSemiLattice Natural where bottom = 0


And now the final incantation of the solver:

>>> solveProblem fibDff Sparse NeverAbort (dependOn @(DependencyM _ Natural) (Node 10))
55


This will also execute in $$\mathcal{O}(n)$$ space and time, all without worrying about a smart solution strategy involving how to tie knots or allocate vectors. Granted, this doesn't really pay off for simple problems like computing fibonacci numbers because of the boilerplate involved and the somewhat devious type-level story, but the intended use case is that of static analysis of programming languages.

Before I delegate you to a blog post about strictness analysis, we will look at a more devious reccurence relation with actual cycles in the resulting data-flow graph.

# Use Case: Solving Cyclic Recurrences

The recurrence relation describing fibonacci numbers admits a clear plan of how to compute a solution, because the dependency graph is obviously acyclic: To compute the next new value of the sequence, only the prior two values are needed.

This is not true of the following reccurence relation:

$f(n) = \begin{cases} 2 \cdot f(\frac{n}{2}), & n \text{ even}\\ f(n+1)-1, & n \text{ odd} \end{cases}$

The identity function is the only solution to this, but it is unclear how we could arrive at that conclusion just by translating that relation into Haskell:

>>> :{
f n
| even n = 2 * f (n div 2)
| odd n  = f (n + 1) - 1
:}


Imagine a call f 1: This will call f 2 recursively, which again will call f 1. We hit a cyclic dependency!

Fortunately, we can use datafix to compute the solution by fixed-point iteration (which assumes monotonicity of the function to approximate):

>>> :{
  transferF
:: forall m
. (MonadDependency m, Domain m ~ Int)
=> Node
-> LiftedFunc Int m
transferF (Node n)
| even n = (* 2) <$> dependOn @m (Node (n div 2)) | odd n = (subtract 1) <$> dependOn @m (Node (n + 1))
:}

>>> :{
  fDff :: forall m . (MonadDependency m, Domain m ~ Int) => DataFlowFramework m
fDff = DFF transferF (const (eqChangeDetector @(Domain m)))
:}


Specification of the data-flow problem works the same as for the fib function.

As for Natural, we need an instance of BoundedJoinSemiLattice for Int to compute a solution:

>>> instance JoinSemiLattice Int where (\/) = max
>>> instance BoundedJoinSemiLattice Int where bottom = minBound


Now it's just a matter of calling solveProblem with the right parameters:

>>> solveProblem fDff Sparse NeverAbort (dependOn @(DependencyM _ Int) (Node 0))
0
>>> solveProblem fDff Sparse NeverAbort (dependOn @(DependencyM _ Int) (Node 5))
5
>>> solveProblem fDff Sparse NeverAbort (dependOn @(DependencyM _ Int) (Node 42))
42
>>> solveProblem fDff Sparse NeverAbort (dependOn @(DependencyM _ Int) (Node (-10)))
-10


Note how the specification of the data-flow problem was as unexciting as it was for the fibonacci sequence (modulo boilerplate), yet the recurrence we solved was pretty complicated already.

Of course, encoding the identity function this way is inefficient. But keep in mind that in general, we don't know the solution to a particular recurrence! It's always possible to solve the recurrence by hand upfront, but that's trading precious developer time for what might be a throw-away problem anyway.

Which brings us to the prime and final use case...

# Use Case: Static Analysis

Recurrence equations occur all the time in denotational semantics and static data-flow analysis.

For every invocation of the compiler, for every module, for every analysis within the compiler, a recurrence relation representing program semantics is to be solved. Naturally, we can't task a human with solving a bunch of complicated recurrences everytime we hit compile.

In the imperative world, it's common-place to have some kind of fixed-point iteration framework carry out the iteration of the data-flow graph, but I could not find a similar abstraction for functional programming languages yet. Analyses for functional languages are typically carried out as iterated traversals of the syntax tree, but that is unsatisfying for a number of reasons:

1. Solution logic of the data-flow problem is intertwined with its specification.
2. Solution logic is duplicated among multiple analyses, violating DRY.
3. A consequence of the last two points is that performance tweaks have to be adapted for every analysis separately. In the case of GHC's Demand Analyser, going from chaotic iteration (which corresponds to naive iterated tree traversals) to an iteration scheme that caches results of inner let-bindings, annotations to the syntax tree are suddenly used like State threads, which makes the analysis logic even more complex than it already was.

So, I can only encourage any compiler dev who wants to integrate static analyses into their compiler to properly specify the data-flow problems in terms of datafix and leave the intricacies of finding a good iteration order to this library :)

# Comparison to Datalog/Soufflé

In its most declarative form, datafix is an embedded DSL for specifying static analyses. In that regard, it is really similar to Soufflé, only that Soufflé uses an external DSL (a Datalog dialect) to specify the analysis. The resulting compiled executable needs to run in a separate process and gets the facts of the input program encoded in datalog facts. datafix analyses, on the other hand, will be compiled into the host program and don't need an additional communication layer.