---
title: Performance & Optimizations
---

Performance and Optimizations
=============================

```haskell top hide
{-# LANGUAGE DataKinds          #-}
{-# LANGUAGE DeriveGeneric      #-}
{-# LANGUAGE FlexibleContexts   #-}
{-# LANGUAGE FlexibleInstances  #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE TemplateHaskell    #-}
{-# LANGUAGE TypeFamilies       #-}
{-# LANGUAGE ViewPatterns       #-}

import           GHC.Generics (Generic)
import           GHC.TypeNats
import           Inliterate.Import
import           Lens.Micro
import           Lens.Micro.TH
import           Numeric.Backprop
import           Numeric.Backprop.Class
import           Numeric.LinearAlgebra.Static (L, R)
import qualified Numeric.LinearAlgebra        as HU
import qualified Numeric.LinearAlgebra.Static as H
```

We can use the [MNIST tutorial][bench] as an example to compare automatic differentiation with "manual" differentiation:

[bench]: https://github.com/mstksg/backprop/blob/master/bench/bench.hs

![benchmarks](https://i.imgur.com/7L5NV4P.png)

In the above, we compare:

1.  "Manual" differentiation of a 784 x 300 x 100 x 10 fully-connected feed-forward ANN.
2.  Automatic differentiation using *backprop* and the lens-based accessor interface.
3.  Automatic differentiation using *backprop* and the "higher-kinded data"-based pattern matching interface.
4.  A hybrid approach that manually provides gradients for individual layers but uses automatic differentiation for chaining the layers together. See the section "Dealing with Overhead from Redundant Updates" for details.

Sources of Overhead
-------------------

One immediate result is that simply *running* the network and functions (using `evalBP`) incurs virtually zero overhead.  This means that library authors could actually export *only* backprop-lifted functions, and users would be able to use them without losing any performance.

As for computing gradients, there is some associated overhead.  There are three main sources:

1.  The construction and traversal of the [Wengert tape][] used to implement automatic differentiation.  However, this overhead is typically negligible for backpropagating any numerical computation of non-trivial complexity.
2.  Redundant updates of entire data types during gradient accumulation.  This will be, **by far**, the *dominating* source of any overhead compared to manual differentiation for any numerical computation of non-trivial complexity.
3.  Inefficiencies associated with "naive" differentiation, compared to manual symbolic differentiation.  However, this inefficiency is typically negligible except in edge cases.

[Wengert tape]: https://dl.acm.org/citation.cfm?doid=355586.364791

In addition, usage of the "Higher-Kinded Data"-based pattern matching interface (over the lens-based accessor interface) incurs some penalties from the inefficient nature of GHC Generics in general.

Optimization Techniques
-----------------------

### Dealing with Overhead from Redundant Updates

By far the dominating source of overhead when using *backprop* is the redundant update of data type fields when accumulating gradients.

#### Example

That is, if we had a data type like:

```haskell top
data MyType = MT { _mtX :: Double
                 , _mtY :: Double
                 , _mtZ :: Double
                 }
  deriving (Show, Generic)

makeLenses ''MyType

instance Backprop MyType
```

```haskell top hide
instance AskInliterate MyType
```

and we *use* all three fields somehow:

```haskell top
myFunc :: Reifies s W => BVar s MyType -> BVar s Double
myFunc mt = (mt ^^. mtX) * (mt ^^. mtY) + (mt ^^. mtZ)
```

and we compute its gradient:

```haskell eval
gradBP myFunc (MT 5 7 2)
```

The library will first compute the derivative of the first field, and embed it into `MyType`:

```haskell
MT { _mtX = 7.0, _mtY = 0.0, _mtZ = 0.0 }
```

Then it'll compute the derivative of the second field and embed it:

```haskell
MT { _mtX = 0.0, _mtY = 5.0, _mtZ = 0.0 }
```

And finally compute the derivative of the third field and embed it:

```haskell
MT { _mtX = 0.0, _mtY = 0.0, _mtZ = 1.0 }
```

And it'll compute the final derivative by `add`-ing all three of those together.

This is not too bad with `Double`s, but when you have huge matrices, there will be *six redundant additions of zero* for a data type with three fields...and those additions of zero matrices can incur a huge cost.  In general, for a data type with \\(n\\) fields where you use \\(m\\) of those fields, you will have something on the order of \\(\mathcal{O}(n m)\\) redundant additions by zero.

#### Mitigating

One way to mitigate these redundant updates is to prefer data types with fewer fields if possible, or to re-factor your data types into multiple "levels" of nesting, to reduce the number of redundant additions by zero.  That is, instead of having a giant ten-field data type, have two five-field data types, and one type holding a value of each.  This works well with recursive "linked list" data types as well, as long as you write functions on your linked lists inductively.

You can also be careful about how many times you use `^^.` (`viewVar`), because each usage site incurs another addition-by-zero in the gradient accumulation.  If possible, refactor all of your `^^.` into a single binding and share it within your expression, instead of using it several times for the same field in the same expression.

You can also use clever lenses to "simulate" having a data type with fewer fields than you actually have.  For example, you can have a lens on the first two fields:

```haskell top
mtXY :: Lens' MyType (Double, Double)
mtXY f (MT x y z) = (\(x', y') -> MT x' y' z) <$> f (x, y)
```

This treats accessing both fields as effectively a single access to a single tuple field, and so cuts out an extra addition by zero.

As a last resort, you can *completely eliminate* redundant additions by zero by providing *manual gradients* to functions using your data type.

```haskell top
myFunc' :: Reifies s W => BVar s MyType -> BVar s Double
myFunc' = liftOp1 . op1 $ \(MT x y z) ->
    ( (x * y) + z
    , \d -> MT (d * y) (x * d) d
    )
```

```haskell eval
gradBP myFunc' (MT 5 7 2)
```

See the [writing manual gradients][manual-gradients] page for more information on exactly how to specify your operations with manual gradients.

[manual-gradients]: https://backprop.jle.im/06-manual-gradients.html

Once you do this, you can use `myFunc'` as a part of any larger computation; backpropagation will still work the same, and you avoid any redundant additions of zero:

```haskell eval
gradBP (negate . sqrt . myFunc) (MT 5 7 2)
```

```haskell eval
gradBP (negate . sqrt . myFunc') (MT 5 7 2)
```

When you *use* `myFunc'` in a function, it will be efficiently backpropagated by the *backprop* library.  This is useful for situations like optimizing artificial neural networks that are a composition of multiple "layers": you can manually specify the derivative of each layer, but let the *backprop* library take care of finding the derivative of *their composition*.  This is exactly the "hybrid" mode mentioned in the benchmarks above, sketched below.
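To make that concrete, here is a minimal sketch of what a manually differentiated "layer" might look like.  The names `denseLayer` and `twoLayers` are purely illustrative (they are not part of the library), and the sketch assumes `Backprop` instances for `R` and `L`, such as the ones used later on this page or the ones provided by the *hmatrix-backprop* package:

```haskell
-- A hypothetical dense layer with a hand-written gradient: the forward pass
-- is a matrix-vector product, and the backward pass returns the gradients
-- with respect to the weights and to the input.
denseLayer
    :: (KnownNat i, KnownNat o, Reifies s W)
    => BVar s (L o i)      -- weights
    -> BVar s (R i)        -- input
    -> BVar s (R o)
denseLayer = liftOp2 . op2 $ \w x ->
    ( w H.#> x
    , \d -> (d `H.outer` x, HU.tr w H.#> d)
    )

-- The *composition* of layers is still differentiated automatically
-- (an activation function could be composed in between, too).
twoLayers
    :: (KnownNat i, KnownNat h, KnownNat o, Reifies s W)
    => BVar s (L o h)
    -> BVar s (L h i)
    -> BVar s (R i)
    -> BVar s (R o)
twoLayers w2 w1 = denseLayer w2 . denseLayer w1
```

Each layer's gradient is written by hand, but chaining the layers together with ordinary function composition is left to *backprop*.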
As can be seen from the benchmark results, this hybrid approach brings the manual and automatic backprop results to almost within range of random variance of each other.

However, I don't recommend doing this, except as a last resort for optimization.  This is because:

1.  The whole point of the *backprop* library is to allow you to never have to specify manual gradients.
2.  It is *very very easy* to make a mistake in your gradient computation and introduce subtle bugs.
3.  It is difficult to *modify* your function if you want to tweak what it returns.  Compare changing the multiplication to division in the original `myFunc` vs. the manual `myFunc'`.
4.  It makes it harder to read and understand (and subsequently refactor) your code.

However, this option is available as a low-level performance hack.

### Dealing with Overhead from Naive Differentiation

[Automatic differentiation][ad] is a mechanical process that is nothing more than glorified book-keeping and accumulation.  It essentially "hitches a ride" on your normal computation in order to automatically accumulate its gradient.  It isn't aware of the analytical nature of computations, and cannot do any symbolic or analytical simplifications (like re-associating additions or canceling out factors) that humans might perform if manually differentiating.

[ad]: https://en.wikipedia.org/wiki/Automatic_differentiation

In most cases, this is "good enough" and will not be any significant source of inefficiency in the larger picture.  At least, it won't be worth the cognitive overhead of squeezing out a one or two percent increase in performance.

However, there are some edge cases where this might become a concern worth looking at.  A common example is the composition of the [softmax][] activation function and the [cross-entropy][] error function often used in deep learning.  Together, their derivatives are somewhat complex, computationally.  However, the derivative of their *composition*, `crossEntropy x . softMax`, actually has an extremely "simple" form, because of how some factors cancel out.  To get around this, libraries like *tensorflow* offer an [optimized version of the composition with manually computed gradients][smce].

[softmax]: https://en.wikipedia.org/wiki/Softmax_function
[cross-entropy]: https://en.wikipedia.org/wiki/Cross_entropy
[smce]: https://www.tensorflow.org/api_docs/python/tf/losses/softmax_cross_entropy

```haskell top hide
instance Backprop (R n) where
    zero = zeroNum
    add  = addNum
    one  = oneNum

instance (KnownNat n, KnownNat m) => Backprop (L n m) where
    zero = zeroNum
    add  = addNum
    one  = oneNum

instance KnownNat n => AskInliterate (R n) where
    askInliterate = answerWith (show . H.extract)

konst :: (KnownNat n, Reifies s W) => BVar s Double -> BVar s (R n)
konst = liftOp1 . op1 $ \x ->
    ( H.konst x
    , HU.sumElements . H.extract
    )

sumElements :: (KnownNat n, Reifies s W) => BVar s (R n) -> BVar s Double
sumElements = liftOp1 . op1 $ \x ->
    ( HU.sumElements . H.extract $ x
    , H.konst
    )

dot :: (KnownNat n, Reifies s W) => BVar s (R n) -> BVar s (R n) -> BVar s Double
dot = liftOp2 . op2 $ \x y ->
    ( x `H.dot` y
    , \d -> let d' = H.konst d
            in  (d' * y, x * d')
    )
```

```haskell top
-- import Numeric.LinearAlgebra.Static.Backprop

softMax :: (KnownNat n, Reifies s W) => BVar s (R n) -> BVar s (R n)
softMax x = konst (1 / totx) * expx
  where
    expx = exp x
    totx = sumElements expx

crossEntropy :: (KnownNat n, Reifies s W) => R n -> BVar s (R n) -> BVar s Double
crossEntropy x y = -(log y `dot` auto x)
```

(Note the usage of `auto :: a -> BVar s a` to lift a normal value into a `BVar`.)

Now, you can use `crossEntropy x . softMax` as a `BVar s (R n) -> BVar s Double` function, and the result and gradient would be correct.  It would backpropagate the gradient of `crossEntropy` into `softMax`.  However, you can take advantage of the fact that some factors in the result "cancel out", and you can drastically simplify the computation.  Their normal composition would naively be:

```haskell top
softMaxCrossEntropy :: (KnownNat n, Reifies s W) => R n -> BVar s (R n) -> BVar s Double
softMaxCrossEntropy x y = -(log softMaxY `dot` auto x)
  where
    expy     = exp y
    toty     = sumElements expy
    softMaxY = konst (1 / toty) * expy
```

As you can probably guess, this has a decently complex gradient, just from all of the chained operations going on.  However, if you work things out with pencil and paper, you'll find a nice form for the gradient of the cross-entropy composed with softmax, \\(f(\mathbf{x}, \mathbf{y})\\):

\\[
\nabla_y f(\mathbf{x}, \mathbf{y}) = \mathrm{softmax}(\mathbf{y}) - \mathbf{x}
\\]

Basically, the gradient is just the result of `softMax` minus the target vector.  After computing the gradient by hand, we can write `softMaxCrossEntropy` with our manual gradient:

```haskell top
-- using the non-lifted interfaces
-- import qualified Numeric.LinearAlgebra        as HU
-- import qualified Numeric.LinearAlgebra.Static as H

softMaxCrossEntropy' :: (KnownNat n, Reifies s W) => R n -> BVar s (R n) -> BVar s Double
softMaxCrossEntropy' x = liftOp1 . op1 $ \y ->
    let expy     = exp y
        toty     = HU.sumElements (H.extract expy)
        softMaxY = H.konst (1 / toty) * expy
        smce     = -(log softMaxY `H.dot` x)
    in  ( smce
        , \d -> H.konst d * (softMaxY - x)
        )
```

Our gradient is now just `softMaxY - x`, which I can assure you is much, much simpler than the gradient derived by automatic differentiation.  This is because a lot of factors show up in the numerators and denominators and cancel out, and a lot of positive and negative terms also end up canceling out.

Again, refer to the [writing manual gradients][manual-gradients] page for more information on exactly how to specify your operations with manual gradients.

Once you do this, `softMaxCrossEntropy'` is a function you can use normally and compose with other backpropagatable functions.  You won't be able to functionally tell `crossEntropy x . softMax` apart from `softMaxCrossEntropy'`; the two behave identically, propagating gradients through other `BVar` functions:

```haskell eval
gradBP ((**2) . crossEntropy (H.vec3 1 0 0) . softMax) (H.vec3 0.9 0.2 0.3)
```

```haskell eval
gradBP ((**2) . softMaxCrossEntropy (H.vec3 1 0 0)) (H.vec3 0.9 0.2 0.3)
```

```haskell eval
gradBP ((**2) . softMaxCrossEntropy' (H.vec3 1 0 0)) (H.vec3 0.9 0.2 0.3)
```

`softMaxCrossEntropy'` will simply be more efficient in computing gradients.

Again, I don't recommend doing this in most cases; it should always be a last resort.  To me, this is even less warranted than the situation above (concerning redundant additions), because any losses due to naive AD should be negligible.
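If you do go down this route, it is worth sanity-checking the hand-written gradient against the automatically derived one on a few sample inputs.  A minimal sketch (the helper name and tolerance are illustrative only):

```haskell
-- Hypothetical check: the manual and automatically derived gradients of the
-- softmax/cross-entropy composition should agree up to floating-point noise.
gradientsAgree :: KnownNat n => R n -> R n -> Bool
gradientsAgree x y = HU.norm_2 (H.extract (gManual - gAuto)) < 1e-9
  where
    gManual = gradBP (softMaxCrossEntropy' x) y
    gAuto   = gradBP (crossEntropy x . softMax) y

-- e.g. gradientsAgree (H.vec3 1 0 0) (H.vec3 0.9 0.2 0.3)
```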
Only do this *after profiling and benchmarking*, when you are *sure* that a particular function composition is causing your bottleneck.  Don't do this for any ol' composition you write, because:

1.  Again, the *whole point* of this library is to allow you to *avoid* computing gradients by hand.
2.  Computing gradients by hand is very tricky, and there are many places where you could introduce a subtle bug that might not be apparent even through initial testing.
3.  This is very fragile, and any future changes to your function will require you to completely re-compute and re-write your giant lifted function.
4.  It is, again, much harder to read and understand your code.

But, if you profile and benchmark and conclude that a bad composition is the bottleneck, know that this path is available.