Motivation
==========

Haskell is a great language for data processing. 
You load some data in the IO monad, parse it, 
funnel the data through various functions and 
write the result back to disk or display it 
via a web server. 

The programmer has the `let` and `where` constructs at hand, 
which can be used to sub-structure a single function, e.g.

    workflow x y = let
       a = f x
       b = g a y
       in h a b

To the surrounding program, however, 
the values of the intermediate steps `a` and `b` 
are invisible and the reader does not know you used 
the auxiliary functions `f`, `g` and `h`, 
although they might be important 
when an outsider tries to check the correctness of 
the result of the `workflow` function. 
This is where the Provenience monad comes in. 

How it works
============

The Provenience monad is an ordinary state monad transformer. 
The state is a data flow 
[graph](https://hackage.haskell.org/package/fgl "fgl"), 
which we call the *variable store*. Nodes are 
[Pandoc](https://hackage.haskell.org/package/pandoc "pandoc") renderings 
of so-called *variables*. A variable is simply a pair of an ordinary 
Haskell value together with its node in the graph. 
A computation in the Provenience monad performs any number 
of the following five actions. 

* Register a new variable in the variable store 
* Provide a description of a registered variable 
(in the form of a Pandoc [Block](http://hackage.haskell.org/package/pandoc-types/docs/Text-Pandoc-Definition.html#t:Block "Block"))
* Provide a short name for a registered variable (used in hyperlinks)
* Render the value of a registered variable into 
its node in the variable store (as a Pandoc `Block`). 
There is a class for default rendering methods akin to the `Show` class. 
* Apply a variable holding a function to a variable holding a value, 
similar to the `<*>` operator of `Applicative` functors. 
In the Provenience monad, we write `<%>` instead. 

The fifth action is the only action that adds edges to the 
data dependency graph. Suppose we have registered a variable `f` 
holding a value of type `a -> b` and a variable `x` holding a 
value of type `a`. The description of `f` should explain to the reader 
what the function that is the value of `f` does. 
The monadic action 

    y <- pure f <%> x

does not register `y` as a new variable; instead `y` points to the same 
node in the variable store as `f`. However, the value of `y` is the 
application of the value of `f` to the value of `x` and there is now 
an edge from `x` to `y` in the data flow graph labelled with the 
description of `f`. If `y` is not itself a function 
but the desired result, you should overwrite the node's description 
(which is still the description of `f`) with a new description of 
the value of `y`. 
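
The node-sharing behaviour described above can be modelled in a few lines of plain Haskell. The following is only a sketch under simplifying assumptions: the store is a pair of lists instead of an fgl graph, descriptions are plain strings instead of Pandoc blocks, and the store is threaded by hand instead of through a state monad. All names (`Store`, `register`, `describe`, `applyVar`) are hypothetical and not the package's API.

```haskell
-- A toy variable store: node descriptions plus labelled edges.
type Node = Int

data Store = Store
  { descriptions :: [(Node, String)]        -- node -> description
  , edges        :: [(Node, Node, String)]  -- (source, target, edge label)
  } deriving (Show, Eq)

-- A variable pairs a Haskell value with its node in the store.
data Var a = Var { node :: Node, value :: a }

emptyStore :: Store
emptyStore = Store [] []

-- Register a value under a fresh node with an initial description.
register :: String -> a -> Store -> (Var a, Store)
register desc v st =
  let n = length (descriptions st)
  in (Var n v, st { descriptions = descriptions st ++ [(n, desc)] })

-- Overwrite the description of an existing node.
describe :: String -> Var a -> Store -> Store
describe desc (Var n _) st =
  st { descriptions = [ (m, if m == n then desc else d)
                      | (m, d) <- descriptions st ] }

-- Apply a function variable to an argument variable.
-- The result reuses the function's node; the new edge is labelled
-- with the description the function's node has at that moment.
applyVar :: Var (a -> b) -> Var a -> Store -> (Var b, Store)
applyVar (Var nf f) (Var nx x) st =
  let label = maybe "" id (lookup nf (descriptions st))
  in (Var nf (f x), st { edges = edges st ++ [(nx, nf, label)] })

-- Register succ and an input, apply, then re-describe the result node.
example :: (Int, Store)
example =
  let (f, s1) = register "successor function" (succ :: Int -> Int) emptyStore
      (x, s2) = register "input" (4 :: Int) s1
      (y, s3) = applyVar f x s2
      s4      = describe "input plus one" y s3
  in (value y, s4)
```

In this model `example` yields the value `5`, a single edge from the node of `x` to the node shared by `f` and `y` labelled "successor function", and a final description "input plus one" on that shared node.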

Why this design choice? Because otherwise partial 
application is impossible. If `<%>` always registered new variables, 
then 

    f <%> a <%> b

would register both `f(a)` and `f(a)(b)` as variables, which might not be 
what the user intended. But overwriting `f` also means that we cannot 
re-use the same function variable in several applications. When that is 
desired, use a Provenience action producing a variable instead of the 
variable itself. Consider the following.

    let f = var succ
    x <- input 4
    y <- f <%> x
    z <- f <%> y

Since the Haskell identifier `f` is bound to a Provenience action 
that registers a new variable holding the `succ` function, all 
three of `x`, `y` and `z` are distinct variables. 
The take-home message is that 

    f <- var succ
    x <- input 4
    y <- pure f <%> x

is a dangerous style because, after the application, the node that `f` 
points to in the graph no longer represents the value of `f`. 
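
The difference between binding an action with `let` and binding its result with `<-` can be seen without the package at all. The sketch below uses a hypothetical `freshNode` action, backed by an `IORef` counter, as a stand-in for registering a variable. It is not the Provenience API and only illustrates the binding behaviour.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)

-- Hypothetical stand-in for registering a variable:
-- hands out the next unused node id.
freshNode :: IORef Int -> IO Int
freshNode counter = do
  n <- readIORef counter
  modifyIORef' counter (+ 1)
  pure n

-- `let` binds the action itself, so every use allocates a new node.
viaLet :: IO (Int, Int)
viaLet = do
  counter <- newIORef 0
  let f = freshNode counter
  a <- f
  b <- f
  pure (a, b)

-- `<-` runs the action once, so every later use refers to the same node.
viaBind :: IO (Int, Int)
viaBind = do
  counter <- newIORef 0
  f <- freshNode counter
  let a = f
      b = f
  pure (a, b)
```

Here `viaLet` returns two distinct node ids, whereas `viaBind` returns the same id twice, which mirrors why the `let f = var succ` style keeps function applications on separate nodes.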

Alternative representation
--------------------------

The variable store can also hold an alternative representation 
of each variable in addition to the Pandoc rendering, 
since you might want to provide a machine-readable data flow graph 
in addition to a Pandoc document. 
Similarly to the `IHaskellDisplay` class, 
each type used in a variable must have a type class instance 
that allows automatic conversion into the alternative representation. 
If you don't need this feature, simply choose `()` as the alternative 
representation type. 
The graph of alternative representations can be extracted from 
the variable store. We provide code to assemble the store into a 
spreadsheet (of static cells). Foldable structures 
of basic values become columns while doubly-nested structures 
become tables. 
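
To make the idea concrete, here is a minimal sketch of what such a conversion class could look like. The class `Representation`, the `Cell` type and the helpers `column` and `table` are assumptions made for illustration; the class and spreadsheet types actually shipped with the package may differ in name and shape.

```haskell
{-# LANGUAGE FlexibleContexts, FlexibleInstances, MultiParamTypeClasses #-}
import Data.Foldable (toList)

-- Hypothetical class: convert a value of type a into the
-- alternative representation type alt.
class Representation a alt where
  represent :: a -> alt

-- Choosing () as the representation type stores nothing.
instance Representation a () where
  represent _ = ()

-- A toy static spreadsheet cell.
newtype Cell = Cell String deriving (Show, Eq)

instance Representation Int Cell where
  represent = Cell . show

instance Representation Char Cell where
  represent c = Cell [c]

-- A Foldable structure of basic values becomes one column.
column :: (Foldable t, Representation a Cell) => t a -> [Cell]
column = map represent . toList

-- A doubly-nested structure becomes a table (a list of columns).
table :: (Foldable t, Foldable s, Representation a Cell) => t (s a) -> [[Cell]]
table = map column . toList
```

With these definitions, `column [1, 2 :: Int]` gives two cells, and `table ["ab", "c"]` gives one column per inner string.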


Example
=======

Continuing the example above, in the Provenience monad you would 
write something like the following. Of course it is up to the programmer 
to decide how fine-grained the decomposition into Provenience actions 
should be. 
 
    workflow x' y' = do
      ---------- register and render the input variables ------------------
      x <- input x' --                               register and render x'
      y <- input y'
      x `named` "x" --                          links to x show "x" as text
      y `named` "y"
      x <? renderDefault "first item of input data" --           describe x
      y <? renderDefault "second item of input data"
      linkx <- linkto x --                   create a hyperlink, used below
      let what_f_does = Para [Str "auxiliary function f applied to ",linkx]
      ---------------------------------------------------------------------
      ------ the actual computation is three lines as in the pure code ----
      a <- func f what_f_does <%> x
      b <- func g (renderDefault "auxiliary function g") <%> a <%> y
      c <- func h (renderDefault "auxiliary function h") <%> a <%> b
      ------ only book-keeping below --------------------------------------
      ---------------------------------------------------------------------
      a `named` "a" >> b `named` "b" >> c `named` "result"
      a <? renderDefault "first intermediate result"
      b <? renderDefault "second intermediate result"
      c <? renderDefault "the workflow result"
      render a >> render b >> render c
      return c

Above, the action `func` registers a new variable and immediately 
supplies a description, which is then used as edge label by the 
`<%>` operator on the same line.   
You see that for each line of pure Haskell you are burdened 
with writing four kinds of Provenience actions: 
*register*, *describe*, *alias* and *render*. But three of the four 
actions are only concerned with providing descriptions that the pure code 
did not contain. 

Remarks
=======

This package was inspired by the 
[Javelin](https://en.wikipedia.org/wiki/Javelin_Software "wikipedia") 
Software. Thanks to John R Levine, one of the authors of Javelin, 
for explaining the concepts underlying Javelin.  

By using [Pandoc](https://hackage.haskell.org/package/pandoc "pandoc") 
the user has a number of output format choices. 
With a little CSS, the above example may be rendered as follows. 
Unfortunately, Hackage does not allow raw HTML in Markdown, so 
you have to convert the Markdown yourself. 

(For the sake of example, 
we used `f = abs`, `g = replicate` and `h = fmap concat . replicate`). 

<div id="variables">
<div id="provenienceVar1" class="variable" style="border: 2px solid;padding: 10px 40px;width: 400px;border-radius: 25px;">
<div class="shortname" style="font-weight:bold;color:OliveDrab;">
<p>y</p>
</div>
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Used in:</p>
</div>
<ul>
<li><a href="#provenienceVar3" class="outgoing" style="color:DodgerBlue;" title="b">b</a></li>
</ul>
<hr />
<div class="description">
<p>second item of input data</p>
</div>
<div class="valueRendering" style="background-color:#ebebe0;">
<p>t</p>
</div>
</div>
<div id="provenienceVar0" class="variable" style="border: 2px solid;padding: 10px 40px;width: 400px;border-radius: 25px;">
<div class="shortname" style="font-weight:bold;color:OliveDrab;">
<p>x</p>
</div>
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Used in:</p>
</div>
<ul>
<li><a href="#provenienceVar2" class="outgoing" style="color:DodgerBlue;" title="a">a</a></li>
</ul>
<hr />
<div class="description">
<p>first item of input data</p>
</div>
<div class="valueRendering" style="background-color:#ebebe0;">
<p>-4</p>
</div>
</div>
<div id="provenienceVar2" class="variable" style="border: 2px solid;padding: 10px 40px;width: 400px;border-radius: 25px;">
<div class="shortname" style="font-weight:bold;color:OliveDrab;">
<p>a</p>
</div>
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Sources:</p>
</div>
<ul>
<li><a href="#provenienceVar0" class="incoming" style="color:Tomato;" title="x">x</a></li>
</ul>
<div class="edges">
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Construction:</p>
</div>
<p>auxiliary function f applied to <a href="#provenienceVar0" title="x">x</a></p>
</div>
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Used in:</p>
</div>
<ul>
<li><a href="#provenienceVar3" class="outgoing" style="color:DodgerBlue;" title="b">b</a></li>
<li><a href="#provenienceVar4" class="outgoing" style="color:DodgerBlue;" title="result">result</a></li>
</ul>
<hr />
<div class="description">
<p>first intermediate result</p>
</div>
<div class="valueRendering" style="background-color:#ebebe0;">
<p>4</p>
</div>
</div>
<div id="provenienceVar3" class="variable" style="border: 2px solid;padding: 10px 40px;width: 400px;border-radius: 25px;">
<div class="shortname" style="font-weight:bold;color:OliveDrab;">
<p>b</p>
</div>
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Sources:</p>
</div>
<ul>
<li><a href="#provenienceVar1" class="incoming" style="color:Tomato;" title="y">y</a></li>
<li><a href="#provenienceVar2" class="incoming" style="color:Tomato;" title="a">a</a></li>
</ul>
<div class="edges">
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Construction:</p>
</div>
<p>auxiliary function g</p>
</div>
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Used in:</p>
</div>
<ul>
<li><a href="#provenienceVar4" class="outgoing" style="color:DodgerBlue;" title="result">result</a></li>
</ul>
<hr />
<div class="description">
<p>second intermediate result</p>
</div>
<div class="valueRendering" style="background-color:#ebebe0;">
<p>tttt</p>
</div>
</div>
<div id="provenienceVar4" class="variable" style="border: 2px solid;padding: 10px 40px;width: 400px;border-radius: 25px;">
<div class="shortname" style="font-weight:bold;color:OliveDrab;">
<p>result</p>
</div>
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Sources:</p>
</div>
<ul>
<li><a href="#provenienceVar2" class="incoming" style="color:Tomato;" title="a">a</a></li>
<li><a href="#provenienceVar3" class="incoming" style="color:Tomato;" title="b">b</a></li>
</ul>
<div class="edges">
<div class="provenienceKeyword" style="font-weight:bold;">
<p>Construction:</p>
</div>
<p>auxiliary function h</p>
</div>
<hr />
<div class="description">
<p>the workflow result</p>
</div>
<div class="valueRendering" style="background-color:#ebebe0;">
<p>tttttttttttttttt</p>
</div>
</div>
</div>
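
The values in the rendering can be reproduced with the pure version of the workflow from the motivation, using the auxiliary functions named above. The type signatures are assumptions chosen to match the rendered values (`x = -4`, `y = 't'`).

```haskell
f :: Int -> Int
f = abs

g :: Int -> Char -> String
g = replicate

-- fmap over the function functor: h n s = concat (replicate n s)
h :: Int -> String -> String
h = fmap concat . replicate

workflow :: Int -> Char -> String
workflow x y =
  let a = f x      -- 4
      b = g a y    -- "tttt"
  in h a b         -- four copies of b, concatenated
```

Evaluating `workflow (-4) 't'` gives a string of sixteen `t` characters, as shown in the node named "result".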

Alternatives and shortcomings
=============================

Spreadsheets excel at displaying intermediate values, 
although comprehending the meaning and intention of 
spreadsheet formulas requires experience. Needless to say, 
Haskell is way more expressive than the formula language 
of contemporary spreadsheets.

Limiting yourself to one of Pandoc's output formats, 
you might write a 
[haskintex](http://hackage.haskell.org/package/haskintex "haskintex") 
document instead of a Provenience expression. 
Haskintex may be regarded as a form of literate programming, 
and you can use any of the numerous string interpolation packages 
or a library for type-safe construction of the target document format 
(e.g. 
[HaTeX](http://hackage.haskell.org/package/HaTeX "HaTeX"), 
[blaze](http://hackage.haskell.org/package/blaze-html "blaze") or 
[Pandoc](https://hackage.haskell.org/package/pandoc "pandoc"))
to achieve the same outcome. 
However, you will not get the complete data flow graph for free. 

Some programming languages provide interactive *notebooks*, e.g. the 
[IPython](https://github.com/ipython "IPython") interactive notebooks 
or the [IHaskell](http://hackage.haskell.org/package/ihaskell "ihaskell") 
variant. 
[Hyper-Haskell](https://github.com/HeinrichApfelmus/hyper-haskell "hyper-haskell") 
is another approach to Haskell workbooks and shares some ideas with this package. 
[Typed Spreadsheets](http://hackage.haskell.org/package/typed-spreadsheet "typed-spreadsheet") 
sort of does what this package does in an interactive way, although it does not show 
intermediate results. 
With notebooks and interactive spreadsheets, working with large quantities of input data 
might become unwieldy, and rendering them requires the underlying language 
to be installed on the system. 
In contrast, the philosophy of Provenience 
is that computation and display take place on different machines, 
and that a Provenience computation is only a small part 
of an actual application.  

Perhaps the biggest shortcoming at present is 
the behaviour of `mapM`.  
When you `mapM` a Provenience function over a 
traversable structure like a list, 
each element registers the same number of variables, 
whereas your intention probably was to register the 
entire traversable structure as input and obtain 
a single output variable. 
At present you have to do this transformation yourself. 
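
The effect can be illustrated with a toy model that merely counts registrations. The `registerVar` action below is a hypothetical stand-in backed by an `IORef`, not the package's API, and the manual transformation shown is only one plausible way to restructure the computation.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)

-- Hypothetical stand-in for registering a variable: only counts calls.
registerVar :: IORef Int -> a -> IO a
registerVar counter v = modifyIORef' counter (+ 1) >> pure v

-- A step that registers its input and its output (two variables per call).
step :: IORef Int -> Int -> IO Int
step counter x = do
  x' <- registerVar counter x
  registerVar counter (x' * 2)

-- mapM runs the step once per element: 2 * n registrations.
perElement :: [Int] -> IO ([Int], Int)
perElement xs = do
  counter <- newIORef 0
  ys <- mapM (step counter) xs
  n  <- readIORef counter
  pure (ys, n)

-- Manual transformation: register the whole list as one input,
-- map the pure function over it, register the result once.
wholeStructure :: [Int] -> IO ([Int], Int)
wholeStructure xs = do
  counter <- newIORef 0
  xs' <- registerVar counter xs
  ys  <- registerVar counter (map (* 2) xs')
  n   <- readIORef counter
  pure (ys, n)
```

For a three-element list, `perElement` counts six registrations while `wholeStructure` counts two, although both compute the same result list.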

TODOs
=====

Add support for an actual graph layout of the variable store 
by means of a library such as [graphviz](http://hackage.haskell.org/package/graphviz "graphviz"). 

Perhaps it would be wise to make it illegal to use variables that have since been overwritten. 

How could one retain the functions in the data flow graph, 
making the data flow interactive? 
We'd have to map a subset of Haskell onto some serializable type, 
e.g. spreadsheet formulas or JavaScript.