% Learning MNIST with Neural Networks and the backprop library
% Justin Le

The *backprop* library performs back-propagation over a *heterogeneous*
system of relationships.  Back-propagation is done automatically (as
reverse-mode automatic differentiation), and you work with your values as
if you were writing normal functions with them, with the help of [lens][].

[lens]: http://hackage.haskell.org/package/lens

Repository source is [on github][repo], and docs are [on hackage][hackage].

[repo]: https://github.com/mstksg/backprop
[hackage]: http://hackage.haskell.org/package/backprop

If you're reading this as a literate haskell file, you should know that a
[rendered pdf version is available on github][rendered].  If you are
reading this as a pdf file, you should know that a [literate haskell
version that you can run][lhs] is also available on github!

[rendered]: https://github.com/mstksg/backprop/blob/master/renders/backprop-mnist.pdf
[lhs]: https://github.com/mstksg/backprop/blob/master/samples/backprop-mnist.lhs

The (extra) packages involved are:

*   hmatrix
*   microlens
*   microlens-th
*   mnist-idx
*   mwc-random
*   one-liner-instances
*   split

> {-# LANGUAGE BangPatterns        #-}
> {-# LANGUAGE DataKinds           #-}
> {-# LANGUAGE DeriveGeneric       #-}
> {-# LANGUAGE FlexibleContexts    #-}
> {-# LANGUAGE GADTs               #-}
> {-# LANGUAGE LambdaCase          #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE TemplateHaskell     #-}
> {-# LANGUAGE TupleSections       #-}
> {-# LANGUAGE TypeApplications    #-}
> {-# LANGUAGE ViewPatterns        #-}
> {-# OPTIONS_GHC -Wno-incomplete-patterns #-}
> {-# OPTIONS_GHC -Wno-orphans             #-}
> {-# OPTIONS_GHC -Wno-unused-top-binds    #-}
>
> import           Control.DeepSeq
> import           Control.Exception
> import           Control.Monad
> import           Control.Monad.IO.Class
> import           Control.Monad.Trans.Maybe
> import           Control.Monad.Trans.State
> import           Data.Bitraversable
> import           Data.Foldable
> import           Data.IDX
> import           Data.List.Split
> import           Data.Time.Clock
> import           Data.Traversable
> import           Data.Tuple
> import           GHC.Generics                    (Generic)
> import           GHC.TypeLits
> import           Lens.Micro
> import           Lens.Micro.TH
> import           Numeric.Backprop
> import           Numeric.Backprop.Class
> import           Numeric.LinearAlgebra.Static
> import           Numeric.OneLiner
> import           Text.Printf
> import qualified Data.Vector                     as V
> import qualified Data.Vector.Generic             as VG
> import qualified Data.Vector.Unboxed             as VU
> import qualified Numeric.LinearAlgebra           as HM
> import qualified System.Random.MWC               as MWC
> import qualified System.Random.MWC.Distributions as MWC

Introduction
============

In this walkthrough, we'll be building a classifier for the *[MNIST][]*
data set.  This is meant to mirror the [Tensorflow Tutorial][tf-intro] for
beginners.

[tf-intro]: https://www.tensorflow.org/versions/r1.2/get_started/mnist/beginners

Essentially, we use a feed-forward artificial neural network with two
hidden layers -- a series of matrix multiplications, differentiable
function applications, and vector additions.  We feed our input image to
the ANN and then try to get a label from it.  Training an ANN is a matter
of finding the right matrices to multiply by, and the right vectors to add.

To do that, we train our network by treating our network's error as a
function `Network -> Error`.  If we can find the gradient of the error with
respect to the network, we can perform [gradient descent][], and slowly
make our network better and better.

[gradient descent]: https://en.wikipedia.org/wiki/Gradient_descent
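For the concrete picture: a single step of gradient descent nudges every
parameter in the network (all of the weights and biases, collectively
$\theta$) against the gradient of the error $E$:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta E(\theta)$$

where $\eta$ is a small learning rate.  The `trainStep` function later in
this document implements exactly this update.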
Finding the gradient is usually complicated, but *backprop* makes it
simpler:

1.  Write a function to compute the error from the network
2.  That's it!

Hooray!  Once you do that, the library finds the gradient function
*automatically*, without any further intervention!

Types
=====

For the most part, we're going to be using the great *[hmatrix][]* library
and its vector and matrix types.  It offers a type `L m n` for $m \times n$
matrices, and a type `R n` for an $n$-dimensional vector.

[hmatrix]: http://hackage.haskell.org/package/hmatrix

First things first: let's define our neural networks as simple containers
of parameters (weight matrices and bias vectors).  Here's a type for a
single layer:

> data Layer i o =
>     Layer { _lWeights :: !(L o i)
>           , _lBiases  :: !(R o)
>           }
>   deriving (Show, Generic)
>
> instance NFData (Layer i o)
> makeLenses ''Layer

And a type for a simple feed-forward network with two hidden layers:

> data Network i h1 h2 o =
>     Net { _nLayer1 :: !(Layer i  h1)
>         , _nLayer2 :: !(Layer h1 h2)
>         , _nLayer3 :: !(Layer h2 o)
>         }
>   deriving (Show, Generic)
>
> instance NFData (Network i h1 h2 o)
> makeLenses ''Network

These are pretty straightforward container types...pretty much exactly the
types you'd make to represent these networks!  Note that, following true
Haskell form, we separate out logic from data.  This should be all we need.

Instances
---------

Things are much simpler if we have `Num` and `Fractional` instances for
everything, so let's just go ahead and define those now, as well.  Just a
little bit of boilerplate, made easier using *[one-liner-instances][]* to
auto-derive instances using Generics.

[one-liner-instances]: http://hackage.haskell.org/package/one-liner-instances

> instance (KnownNat i, KnownNat o) => Num (Layer i o) where
>     (+)         = gPlus
>     (-)         = gMinus
>     (*)         = gTimes
>     negate      = gNegate
>     abs         = gAbs
>     signum      = gSignum
>     fromInteger = gFromInteger
>
> instance ( KnownNat i
>          , KnownNat h1
>          , KnownNat h2
>          , KnownNat o
>          ) => Num (Network i h1 h2 o) where
>     (+)         = gPlus
>     (-)         = gMinus
>     (*)         = gTimes
>     negate      = gNegate
>     abs         = gAbs
>     signum      = gSignum
>     fromInteger = gFromInteger
>
> instance (KnownNat i, KnownNat o) => Fractional (Layer i o) where
>     (/)          = gDivide
>     recip        = gRecip
>     fromRational = gFromRational
>
> instance ( KnownNat i
>          , KnownNat h1
>          , KnownNat h2
>          , KnownNat o
>          ) => Fractional (Network i h1 h2 o) where
>     (/)          = gDivide
>     recip        = gRecip
>     fromRational = gFromRational

`KnownNat` comes from *base* (via *GHC.TypeLits*); it's the typeclass that
*hmatrix* uses to refer to the numbers in its types, letting it go about
its normal hmatrixy business.

Now we need instances of `Backprop` for our types in order to use them for
automatic differentiation.  Luckily, these can be generated automatically
using GHC Generics:

> instance (KnownNat i, KnownNat o) => Backprop (Layer i o)
> instance (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) => Backprop (Network i h1 h2 o)

Ops
===

Now, *backprop* does require *primitive* differentiable operations on our
relevant types to be defined.  *backprop* uses these primitive operations
to tie everything together.  Ideally we'd import these from a library that
implements them for us, so that the end user never has to define these
primitives.

But in this case, I'm going to put the definitions here to show that there
isn't any magic going on.  If you're curious, refer to [documentation for
`Op`][opdoc] for more details on how `Op` is implemented and how this
works.

[opdoc]: http://hackage.haskell.org/package/backprop/docs/Numeric-Backprop-Op.html

First, a matrix-vector multiplication primitive, giving an explicit
gradient function:

> infixr 8 #>!
> (#>!)
>     :: (KnownNat m, KnownNat n, Reifies s W)
>     => BVar s (L m n)
>     -> BVar s (R n)
>     -> BVar s (R m)
> (#>!) = liftOp2 . op2 $ \m v ->
>     ( m #> v, \g -> (g `outer` v, tr m #> g) )
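To see where those two gradients come from: if $y = M x$, and
$g = \partial L / \partial y$ is the gradient of the eventual loss $L$ with
respect to the result, then

$$\frac{\partial L}{\partial M} = g \, x^\mathsf{T} \qquad
  \frac{\partial L}{\partial x} = M^\mathsf{T} g$$

which are exactly the outer product and transposed multiplication that the
gradient function above returns.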
Dot products would be nice too:

> infixr 8 <.>!
> (<.>!)
>     :: (KnownNat n, Reifies s W)
>     => BVar s (R n)
>     -> BVar s (R n)
>     -> BVar s Double
> (<.>!) = liftOp2 . op2 $ \x y ->
>     ( x <.> y, \g -> (konst g * y, x * konst g)
>     )

Also a function to fill a vector with the same element:

> konst'
>     :: (KnownNat n, Reifies s W)
>     => BVar s Double
>     -> BVar s (R n)
> konst' = liftOp1 . op1 $ \c -> (konst c, HM.sumElements . extract)

Finally, an operation to sum all of the items in a vector:

> sumElements'
>     :: (KnownNat n, Reifies s W)
>     => BVar s (R n)
>     -> BVar s Double
> sumElements' = liftOp1 . op1 $ \x -> (HM.sumElements (extract x), konst)

Again, these are not intended to be written by end users of *backprop*, but
rather are meant to be provided by libraries as primitive operations for
their users.

Running our Network
===================

Now that we have our primitives in place, let's actually write a function
to run our network!  And, once we do this, we automatically also have
functions to back-propagate our network!

Normally, we'd write this function as:

> runLayerNormal
>     :: (KnownNat i, KnownNat o)
>     => Layer i o
>     -> R i
>     -> R o
> runLayerNormal l x = (l ^. lWeights) #> x + (l ^. lBiases)
> {-# INLINE runLayerNormal #-}

using the `lWeights` and `lBiases` lenses to access the weights and biases
of our layer.  However, we can translate this to *backprop* by operating on
`BVar`s instead of on the types directly, and using our backprop-aware
`#>!`:

> runLayer
>     :: (KnownNat i, KnownNat o, Reifies s W)
>     => BVar s (Layer i o)
>     -> BVar s (R i)
>     -> BVar s (R o)
> runLayer l x = (l ^^. lWeights) #>! x + (l ^^. lBiases)
> {-# INLINE runLayer #-}

`^.` lets you access data within a value using a lens, and `^^.` lets you
access data within a `BVar` using a lens:

```haskell
(^.)  ::        a -> Lens' a b ->        b
(^^.) :: BVar s a -> Lens' a b -> BVar s b
```

(There is also `^^?`, which can use a `Prism` or `Traversal` to extract a
target that might not exist, `^^..`, which uses a `Traversal` to extract
all targets, and `.~~`, which uses a `Lens` to update a value inside a
`BVar`.)

Now `runLayer` is a function on two inputs that can be backpropagated,
automatically!  We can find its gradient given any input, and also run it
to get our expected output as well.

Before writing our final network runner, we need a function to compute the
"softmax" of our output vector.  Writing it normally would look like:

> softMaxNormal :: KnownNat n => R n -> R n
> softMaxNormal x = konst (1 / HM.sumElements (extract expx)) * expx
>   where
>     expx = exp x
> {-# INLINE softMaxNormal #-}

But we can make the mechanical shift to the backpropagatable version:

> softMax :: (KnownNat n, Reifies s W) => BVar s (R n) -> BVar s (R n)
> softMax x = konst' (1 / sumElements' expx) * expx
>   where
>     expx = exp x
> {-# INLINE softMax #-}
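For reference, softmax turns an arbitrary vector of scores into a
probability distribution (entries positive, summing to 1):

$$\mathrm{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

which is exactly what the combination of `konst'` and `sumElements'`
computes above; it's what lets us read the network's output as per-digit
probabilities.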
We also need the [logistic function][], which is our activation function
between layer outputs.  Because `BVar`s have a `Floating` instance, we can
just write it using typeclass functions.

[logistic function]: https://en.wikipedia.org/wiki/Logistic_function

> logistic :: Floating a => a -> a
> logistic x = 1 / (1 + exp (-x))
> {-# INLINE logistic #-}

With those in hand, let's compare how we would normally write a function to
run our network:

> runNetNormal
>     :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
>     => Network i h1 h2 o
>     -> R i
>     -> R o
> runNetNormal n = softMaxNormal
>                . runLayerNormal (n ^. nLayer3)
>                . logistic
>                . runLayerNormal (n ^. nLayer2)
>                . logistic
>                . runLayerNormal (n ^. nLayer1)
> {-# INLINE runNetNormal #-}

Basic function composition, neat.  We use our lenses `nLayer1`, `nLayer2`,
and `nLayer3` to extract the first, second, and third layers from our
network.

Writing it in a way that backprop can use is also very similar:

> runNetwork
>     :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o, Reifies s W)
>     => BVar s (Network i h1 h2 o)
>     -> R i
>     -> BVar s (R o)
> runNetwork n = softMax
>              . runLayer (n ^^. nLayer3)
>              . logistic
>              . runLayer (n ^^. nLayer2)
>              . logistic
>              . runLayer (n ^^. nLayer1)
>              . constVar
> {-# INLINE runNetwork #-}

We use `constVar` on the input vector, because we don't care about its
gradient and so treat it as a constant.  Here again we use `^^.` (instead
of `^.`) to extract a value from our `BVar` of a `Network`, using a lens.

Computing Errors
----------------

Now, training a neural network is about calculating its gradient with
respect to some error function.  The library calculates the gradient for us
-- we just need to tell it how to compute the error function.

For classification problems, we usually use a [cross entropy][] error.
Given a target vector, how does our neural network's output differ from
what is expected?  Lower numbers are better!

[cross entropy]: https://en.wikipedia.org/wiki/Cross_entropy

Again, let's look at a "normal" implementation, with regular variables and
no backprop:

> crossEntropyNormal :: KnownNat n => R n -> R n -> Double
> crossEntropyNormal targ res = -(log res <.> targ)
> {-# INLINE crossEntropyNormal #-}

And we can see that the backpropable version is pretty similar.  We use
`constVar targ` to introduce a `BVar` that is a constant value (one we
don't care about the gradient of):

> crossEntropy
>     :: (KnownNat n, Reifies s W)
>     => R n
>     -> BVar s (R n)
>     -> BVar s Double
> crossEntropy targ res = -(log res <.>! constVar targ)
> {-# INLINE crossEntropy #-}

Our final "error function", then, is:

> netErr
>     :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o, Reifies s W)
>     => R i
>     -> R o
>     -> BVar s (Network i h1 h2 o)
>     -> BVar s Double
> netErr x targ n = crossEntropy targ (runNetwork n x)
> {-# INLINE netErr #-}

The Magic
=========

The actual "magic" of the library happens with the functions to "run" the
functions we defined earlier:

```haskell
evalBP   :: (forall s. Reifies s W => BVar s a -> BVar s b) -> a -> b
gradBP   :: (forall s. Reifies s W => BVar s a -> BVar s b) -> a -> a
backprop :: (forall s. Reifies s W => BVar s a -> BVar s b) -> a -> (b, a)
```

`evalBP` "runs" the function like normal, `gradBP` computes the gradient of
the function, and `backprop` computes both the result and the gradient.
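These functions work on any types with `Backprop` instances -- even plain
`Double`s.  As a quick sanity check, here's a minimal (hypothetical) GHCi
session differentiating an ordinary numeric function:

```haskell
-- d/dx (x^2 + 3x) = 2x + 3, which is 13 at x = 5
ghci> gradBP (\x -> x^2 + 3*x) (5 :: Double)
13.0
```

Our `netErr` works exactly the same way -- the only difference is that the
"input" we differentiate with respect to is an entire `Network`.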
So, if we have a network `net0`, an input vector `x`, and a target vector
`targ`, we could compute its error using:

```haskell
evalBP (netErr x targ) net0 :: Double
```

And we can calculate its *gradient* using:

```haskell
gradBP (netErr x targ) net0 :: Network i h1 h2 o
```

Pulling it all together
=======================

Let's write a simple function to step our network in the direction opposite
of the gradient to train our model:

> trainStep
>     :: forall i h1 h2 o. (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
>     => Double             -- ^ learning rate
>     -> R i                -- ^ input
>     -> R o                -- ^ target
>     -> Network i h1 h2 o  -- ^ initial network
>     -> Network i h1 h2 o
> trainStep r !x !targ !n = n - realToFrac r * gradBP (netErr x targ) n
> {-# INLINE trainStep #-}

Here's a convenient wrapper for training over all of the observations in a
list:

> trainList
>     :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
>     => Double             -- ^ learning rate
>     -> [(R i, R o)]       -- ^ input and target pairs
>     -> Network i h1 h2 o  -- ^ initial network
>     -> Network i h1 h2 o
> trainList r = flip $ foldl' (\n (x,y) -> trainStep r x y n)
> {-# INLINE trainList #-}

`testNet` will be a quick way to test our net by computing the percentage
of correct guesses (mostly using *hmatrix* stuff, so don't mind it too
much):

> testNet
>     :: forall i h1 h2 o. (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
>     => [(R i, R o)]
>     -> Network i h1 h2 o
>     -> Double
> testNet xs n = sum (map (uncurry test) xs) / fromIntegral (length xs)
>   where
>     test :: R i -> R o -> Double          -- test if the max index is correct
>     test x (extract->t)
>         | HM.maxIndex t == HM.maxIndex (extract r) = 1
>         | otherwise                                = 0
>       where
>         r :: R o
>         r = evalBP (`runNetwork` x) n

And now, a main loop!  If you are following along at home, download the
[mnist data set files][MNIST] and uncompress them into the folder `data`,
and everything should work fine.

[MNIST]: http://yann.lecun.com/exdb/mnist/

> main :: IO ()
> main = MWC.withSystemRandom $ \g -> do
>     Just train <- loadMNIST "data/train-images-idx3-ubyte" "data/train-labels-idx1-ubyte"
>     Just test  <- loadMNIST "data/t10k-images-idx3-ubyte"  "data/t10k-labels-idx1-ubyte"
>     putStrLn "Loaded data."
>     net0 <- MWC.uniformR @(Network 784 300 100 10) (-0.5, 0.5) g
>     flip evalStateT net0 . forM_ [1..] $ \e -> do
>       train' <- liftIO . fmap V.toList $ MWC.uniformShuffle (V.fromList train) g
>       liftIO $ printf "[Epoch %d]\n" (e :: Int)
>
>       forM_ ([1..] `zip` chunksOf batch train') $ \(b, chnk) -> StateT $ \n0 -> do
>         printf "(Batch %d)\n" (b :: Int)
>
>         t0 <- getCurrentTime
>         n' <- evaluate . force $ trainList rate chnk n0
>         t1 <- getCurrentTime
>         printf "Trained on %d points in %s.\n" batch (show (t1 `diffUTCTime` t0))
>
>         let trainScore = testNet chnk n'
>             testScore  = testNet test n'
>         printf "Training error:   %.2f%%\n" ((1 - trainScore) * 100)
>         printf "Validation error: %.2f%%\n" ((1 - testScore ) * 100)
>
>         return ((), n')
>   where
>     rate  = 0.02
>     batch = 5000

Each iteration of the loop:

1.  Shuffles the training set
2.  Splits it into chunks of `batch` size
3.  Uses `trainList` to train over the batch
4.  Computes the score with `testNet`, on both the training chunk and the
    test set
5.  Prints out the results

And, that's really it!
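If you want to poke at things interactively first, here's a hypothetical
GHCi session sketching a single training step by hand, assuming a network
`net0` and one sample pair `x`/`targ` are already in scope:

```haskell
-- one step of gradient descent at learning rate 0.02
ghci> let net1 = trainStep 0.02 x targ net0
-- the error on that sample typically decreases (not guaranteed for any
-- single step, but it is the general trend):
ghci> evalBP (netErr x targ) net1 < evalBP (netErr x targ) net0
True
```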
Performance
-----------

Currently, benchmarks show that *running* the network has virtually zero
overhead (~ 4%) over writing the running function directly.

The actual gradient descent process (compute gradient, then descend)
carries about 60% overhead over writing the gradients manually, but it is
unclear how much of this is because of the library, and how much of it is
just because of automatic differentiation giving slightly less efficient
matrix/vector multiplication operations.

The [README][repo] has more detailed benchmarks and statistics, if you want
more information.

Main takeaways
==============

Most of the heavy lifting/logic actually came from the *hmatrix* library
itself.  We just created simple types to wrap up our bare matrices.

Basically, all that *backprop* did was give you an API to define *how to
run* a neural net --- how to *run* a net based on a `Network` and the `R i`
input you were given.  The goal of the library is to let you write down how
to run things in as natural a way as possible.

And then, after things are run, we can just get the gradient and roll from
there!

Because the heavy lifting is done by the data types themselves, we can
presumably plug in *any* type and any tensor/numerical backend, and reap
the benefits of those libraries' optimizations and parallelizations.
*Any* type can be backpropagated! :D

What now?
---------

Check out the docs for the [Numeric.Backprop][] module for a more detailed
picture of what's going on, or find more examples at the [github
repo][repo]!

[Numeric.Backprop]: http://hackage.haskell.org/package/backprop/docs/Numeric-Backprop.html

Also, check out the follow-up writeup to this tutorial, expanding on using
the library with more advanced extensible neural network types, like the
ones described in [this blog post][blog].  Check out the [literate haskell
here][neural-lhs], and the [rendered PDF here][neural-pdf].

[blog]: https://blog.jle.im/entries/series/+practical-dependent-types-in-haskell.html
[neural-lhs]: https://github.com/mstksg/backprop/blob/master/samples/extensible-neural.lhs
[neural-pdf]: https://github.com/mstksg/backprop/blob/master/renders/extensible-neural.pdf

Boring stuff
============

Here is a small wrapper function over the [mnist-idx][] library, loading
the contents of the idx files into *hmatrix* vectors:

[mnist-idx]: http://hackage.haskell.org/package/mnist-idx

> loadMNIST
>     :: FilePath
>     -> FilePath
>     -> IO (Maybe [(R 784, R 10)])
> loadMNIST fpI fpL = runMaybeT $ do
>     i <- MaybeT          $ decodeIDXFile       fpI
>     l <- MaybeT          $ decodeIDXLabelsFile fpL
>     d <- MaybeT . return $ labeledIntData l i
>     r <- MaybeT . return $ for d (bitraverse mkImage mkLabel . swap)
>     liftIO . evaluate $ force r
>   where
>     mkImage :: VU.Vector Int -> Maybe (R 784)
>     mkImage = create . VG.convert . VG.map (\i -> fromIntegral i / 255)
>     mkLabel :: Int -> Maybe (R 10)
>     mkLabel n = create $ HM.build 10 (\i -> if round i == n then 1 else 0)
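As a quick (hypothetical) sanity check that loading works, once the data
files are uncompressed into `data/`:

```haskell
-- the MNIST training set contains 60,000 labeled images
ghci> Just train <- loadMNIST "data/train-images-idx3-ubyte" "data/train-labels-idx1-ubyte"
ghci> length train
60000
```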
And here are instances for generating random
vectors/matrices/layers/networks, used for the initialization step:

> instance KnownNat n => MWC.Variate (R n) where
>     uniform g = randomVector <$> MWC.uniform g <*> pure Uniform
>     uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g
>
> instance (KnownNat m, KnownNat n) => MWC.Variate (L m n) where
>     uniform g = uniformSample <$> MWC.uniform g <*> pure 0 <*> pure 1
>     uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g
>
> instance (KnownNat i, KnownNat o) => MWC.Variate (Layer i o) where
>     uniform g = Layer <$> MWC.uniform g <*> MWC.uniform g
>     uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g
>
> instance ( KnownNat i
>          , KnownNat h1
>          , KnownNat h2
>          , KnownNat o
>          )
>       => MWC.Variate (Network i h1 h2 o) where
>     uniform g = Net <$> MWC.uniform g <*> MWC.uniform g <*> MWC.uniform g
>     uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g

Also, some orphan instances of `Backprop` for vector and matrix types.
These are normally provided by the [hmatrix-backprop][] library:

> instance KnownNat n => Backprop (R n) where
>     zero = zeroNum
>     add  = addNum
>     one  = oneNum
>
> instance (KnownNat n, KnownNat m) => Backprop (L m n) where
>     zero = zeroNum
>     add  = addNum
>     one  = oneNum

[hmatrix-backprop]: http://hackage.haskell.org/package/hmatrix-backprop