---
author:
- Justin Le
fontfamily: 'palatino,cmtt'
geometry: margin=1in
links-as-notes: true
title: Learning MNIST with Neural Networks using the backprop library
---

The *backprop* library performs back-propagation over a *heterogeneous*
system of relationships. It offers both an implicit (*[ad]*-like) and an
explicit graph-building usage style. Let’s use it to build neural networks
and learn MNIST!

[ad]: http://hackage.haskell.org/package/ad

Repository source is [on github], and docs are [on hackage].

[on github]: https://github.com/mstksg/backprop
[on hackage]: http://hackage.haskell.org/package/backprop

If you’re reading this as a literate Haskell file, you should know that a
[rendered pdf version is available on github]. If you are reading this as a
pdf file, you should know that a [literate Haskell version that you can run]
is also available on github!

[rendered pdf version is available on github]: https://github.com/mstksg/backprop/blob/master/renders/backprop-mnist.pdf
[literate Haskell version that you can run]: https://github.com/mstksg/backprop/blob/master/samples/backprop-mnist.lhs

``` {.sourceCode .literate .haskell}
{-# LANGUAGE BangPatterns        #-}
{-# LANGUAGE DataKinds           #-}
{-# LANGUAGE DeriveGeneric       #-}
{-# LANGUAGE GADTs               #-}
{-# LANGUAGE LambdaCase          #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TupleSections       #-}
{-# LANGUAGE TypeApplications    #-}
{-# LANGUAGE ViewPatterns        #-}
{-# OPTIONS_GHC -fno-warn-orphans             #-}
{-# OPTIONS_GHC -fno-warn-incomplete-patterns #-}
{-# OPTIONS_GHC -fno-warn-unused-top-binds    #-}

import           Control.DeepSeq
import           Control.Exception
import           Control.Monad
import           Control.Monad.IO.Class
import           Control.Monad.Trans.Maybe
import           Control.Monad.Trans.State
import           Data.Bitraversable
import           Data.Foldable
import           Data.IDX
import           Data.List.Split
import           Data.Maybe
import           Data.Time.Clock
import           Data.Traversable
import           Data.Tuple
import           GHC.Generics                        (Generic)
import           GHC.TypeLits
import           Numeric.Backprop
import           Numeric.LinearAlgebra.Static hiding (dot)
import           Text.Printf
import qualified Data.Vector                         as V
import qualified Data.Vector.Generic                 as VG
import qualified Data.Vector.Unboxed                 as VU
import qualified Generics.SOP                        as SOP
import qualified Numeric.LinearAlgebra               as HM
import qualified System.Random.MWC                   as MWC
import qualified System.Random.MWC.Distributions     as MWC
```

Types
=====

For the most part, we’re going to be using the great *[hmatrix]* library and
its vector and matrix types. It offers a type `L m n` for $m \times n$
matrices, and a type `R n` for an $n$-vector.

[hmatrix]: http://hackage.haskell.org/package/hmatrix

First things first: let’s define our neural networks as simple containers of
parameters (weight matrices and bias vectors).

First, a type for layers:

``` {.sourceCode .literate .haskell}
data Layer i o =
    Layer { _lWeights :: !(L o i)
          , _lBiases  :: !(R o)
          }
  deriving (Show, Generic)

instance SOP.Generic (Layer i o)
instance NFData (Layer i o)
```

And a type for a simple feed-forward network with two hidden layers:

``` {.sourceCode .literate .haskell}
data Network i h1 h2 o =
    Net { _nLayer1 :: !(Layer i  h1)
        , _nLayer2 :: !(Layer h1 h2)
        , _nLayer3 :: !(Layer h2 o)
        }
  deriving (Show, Generic)

instance SOP.Generic (Network i h1 h2 o)
instance NFData (Network i h1 h2 o)
```

These are pretty straightforward container types…pretty much exactly the
types you’d make to represent these networks! Note that, following true
Haskell form, we separate out logic from data. This should be all we need.
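
For a sense of scale, here is what a concrete instantiation of these types
might look like. This block is purely illustrative (it is not part of the
compiled example, and `MNISTNet` and `zeroLayer` are names made up here, not
used anywhere else in this post):

``` {.haskell}
-- A concrete network type for MNIST: 784 inputs (28x28 pixels), two hidden
-- layers of size 300 and 100, and 10 outputs (one per digit).
type MNISTNet = Network 784 300 100 10

-- A tiny layer built by hand: a 3x4 weight matrix and a 3-element bias
-- vector, all zeros.  `konst` comes from hmatrix's static interface.
zeroLayer :: Layer 4 3
zeroLayer = Layer (konst 0) (konst 0)
```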
We derive an instance of `SOP.Generic` from the *[generics-sop]* package,
which *backprop* uses to propagate derivatives on values inside product
types.

[generics-sop]: http://hackage.haskell.org/package/generics-sop

Instances
---------

Things are much simpler if we have `Num` and `Fractional` instances for
everything, so let’s just go ahead and define those now, as well. Just a
little bit of boilerplate.

``` {.sourceCode .literate .haskell}
instance (KnownNat i, KnownNat o) => Num (Layer i o) where
    Layer w1 b1 + Layer w2 b2 = Layer (w1 + w2) (b1 + b2)
    Layer w1 b1 - Layer w2 b2 = Layer (w1 - w2) (b1 - b2)
    Layer w1 b1 * Layer w2 b2 = Layer (w1 * w2) (b1 * b2)
    abs    (Layer w b)        = Layer (abs    w) (abs    b)
    signum (Layer w b)        = Layer (signum w) (signum b)
    negate (Layer w b)        = Layer (negate w) (negate b)
    fromInteger x             = Layer (fromInteger x) (fromInteger x)

instance (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) => Num (Network i h1 h2 o) where
    Net a b c + Net d e f = Net (a + d) (b + e) (c + f)
    Net a b c - Net d e f = Net (a - d) (b - e) (c - f)
    Net a b c * Net d e f = Net (a * d) (b * e) (c * f)
    abs    (Net a b c)    = Net (abs    a) (abs    b) (abs    c)
    signum (Net a b c)    = Net (signum a) (signum b) (signum c)
    negate (Net a b c)    = Net (negate a) (negate b) (negate c)
    fromInteger x         = Net (fromInteger x) (fromInteger x) (fromInteger x)

instance (KnownNat i, KnownNat o) => Fractional (Layer i o) where
    Layer w1 b1 / Layer w2 b2 = Layer (w1 / w2) (b1 / b2)
    recip (Layer w b)         = Layer (recip w) (recip b)
    fromRational x            = Layer (fromRational x) (fromRational x)

instance (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) => Fractional (Network i h1 h2 o) where
    Net a b c / Net d e f = Net (a / d) (b / e) (c / f)
    recip (Net a b c)     = Net (recip a) (recip b) (recip c)
    fromRational x        = Net (fromRational x) (fromRational x) (fromRational x)
```

`KnownNat` comes from *base* (via `GHC.TypeLits`); it’s the typeclass that
*hmatrix* uses to refer to the numbers in its types so that it can go about
its normal hmatrix-y business.

Ops
===

Now, *backprop* does require *primitive* differentiable operations on our
relevant types to be defined. *backprop* uses these primitive `Op`s to tie
everything together. Ideally we’d import these from a library that implements
them for us, so that the end user never has to make `Op` primitives. But in
this case, I’m going to put the definitions here to show that there isn’t any
magic going on. If you’re curious, refer to the [documentation for `Op`] for
more details on how `Op` is implemented and how this works.

[documentation for `Op`]: http://hackage.haskell.org/package/backprop/docs/Numeric-Backprop-Op.html

First, a matrix-vector multiplication primitive, giving an explicit gradient
function.

``` {.sourceCode .literate .haskell}
matVec
    :: (KnownNat m, KnownNat n)
    => Op '[ L m n, R n ] (R m)
matVec = op2' $ \m v ->
    ( m #> v
    , \(fromMaybe 1 -> g) -> (g `outer` v, tr m #> g)
    )
```

Dot products would be nice too.

``` {.sourceCode .literate .haskell}
dot :: KnownNat n
    => Op '[ R n, R n ] Double
dot = op2' $ \x y ->
    ( x <.> y
    , \case Nothing -> (y, x)
            Just g  -> (konst g * y, x * konst g)
    )
```

Also a “scaling” function, which scales a vector by a given factor.

``` {.sourceCode .literate .haskell}
scale
    :: KnownNat n
    => Op '[ Double, R n ] (R n)
scale = op2' $ \a x ->
    ( konst a * x
    , \case Nothing -> (HM.sumElements (extract x      ), konst a    )
            Just g  -> (HM.sumElements (extract (x * g)), konst a * g)
    )
```
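
If you want to convince yourself that the gradients in these `Op`s are what
they claim to be, a quick finite-difference spot check with plain *hmatrix*
works nicely. Here is one for `dot` (just a sketch, not part of the compiled
example; `fdCheckDot` is a name invented here and isn’t used elsewhere):

``` {.haskell}
-- Nudge the first component of x by eps; the dot product should change by
-- roughly eps * y_0, i.e. eps times the first component of the claimed
-- gradient y.
fdCheckDot :: R 3 -> R 3 -> (Double, Double)
fdCheckDot x y = (numeric, analytic)
  where
    eps      = 1e-6
    numeric  = (((x + vec3 eps 0 0) <.> y) - (x <.> y)) / eps
    analytic = extract y `HM.atIndex` 0
```

Evaluating this on a couple of arbitrary vectors in GHCi should give two
numbers that agree to several decimal places.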
Finally, an operation to sum all of the items in the vector.

``` {.sourceCode .literate .haskell}
vsum
    :: KnownNat n
    => Op '[ R n ] Double
vsum = op1' $ \x -> (HM.sumElements (extract x), maybe 1 konst)
```

And why not, here’s the [logistic function], which we’ll use as an activation
function for the internal layers. We don’t need to define this as an `Op` up
front, because the library can automatically promote any numerically
polymorphic function (an `a -> a` or `a -> a -> a`, etc.) to an `Op` anyway.

[logistic function]: https://en.wikipedia.org/wiki/Logistic_function

``` {.sourceCode .literate .haskell}
logistic :: Floating a => a -> a
logistic x = 1 / (1 + exp (-x))
```

Running our Network
===================

Now that we have our primitives in place, let’s actually write a function to
run our network!

``` {.sourceCode .literate .haskell}
runLayer
    :: (KnownNat i, KnownNat o)
    => BPOp s '[ R i, Layer i o ] (R o)
runLayer = withInps $ \(x :< l :< Ø) -> do
    w :< b :< Ø <- gTuple #<~ l
    y <- matVec ~$ (w :< x :< Ø)
    return $ y + b
```

A `BPOp s '[ R i, Layer i o ] (R o)` is a backpropagatable function that
produces an `R o` (a vector with `o` elements, from the *[hmatrix]* library)
given an input environment of an `R i` (the “input” of the layer) and a
layer.

[hmatrix]: http://hackage.haskell.org/package/hmatrix

We use `withInps` to bring the environment into scope as a bunch of `BVar`s.
`x` is a `BVar` containing the input vector, and `l` is a `BVar` containing
the layer. The first thing we do is split out the parts of the layer so we
can work with the internal matrices. We can use `#<~` to “split out” the
components of a `BVar`, splitting on `gTuple` (which uses `GHC.Generics` to
automatically figure out how to split up a product type). Then we apply
`matVec` (our primitive `Op` that does matrix-vector multiplication) to `w`
and `x`, and add the result to the bias vector `b`.

We can write the `runNetwork` function pretty much the same way.

``` {.sourceCode .literate .haskell}
runNetwork
    :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
    => BPOp s '[ R i, Network i h1 h2 o ] (R o)
runNetwork = withInps $ \(x :< n :< Ø) -> do
    l1 :< l2 :< l3 :< Ø <- gTuple #<~ n
    y <- runLayer -$ (x          :< l1 :< Ø)
    z <- runLayer -$ (logistic y :< l2 :< Ø)
    r <- runLayer -$ (logistic z :< l3 :< Ø)
    softmax -$ (r :< Ø)
  where
    softmax :: KnownNat n => BPOp s '[ R n ] (R n)
    softmax = withInps $ \(x :< Ø) -> do
        expX <- bindVar (exp x)
        totX <- vsum  ~$ (expX :< Ø)
        scale ~$ (1/totX :< expX :< Ø)
```

After splitting out the layers in the input `Network`, we run each layer
successively using our previously defined `runLayer`, giving inputs with
`-$`. We can directly apply `logistic` to `BVar`s. At the end, we run a
[softmax function], because MNIST is a classification challenge. The softmax
is done by applying $e^x$ to every item in the input vector and dividing each
element by the total.

[softmax function]: https://en.wikipedia.org/wiki/Softmax_function

The Magic
---------

What did we just define? Well, with a `BPOp s rs a`, we can *run* it and get
the output:

``` {.sourceCode .literate .haskell}
runNetOnInp
    :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
    => Network i h1 h2 o
    -> R i
    -> R o
runNetOnInp n x = evalBPOp runNetwork (x ::< n ::< Ø)
```
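
To emphasize that the forward pass itself involves no magic, here is what the
same computation looks like written directly against *hmatrix*, with no
*backprop* involvement at all. This is purely an illustrative sketch (not
part of the compiled example); `runLayerPlain`, `runNetworkPlain`, and
`softmaxPlain` are names made up here and aren’t used elsewhere:

``` {.haskell}
-- Forward pass with plain hmatrix operations: W x + b at each layer,
-- logistic between layers, softmax at the end.
runLayerPlain
    :: (KnownNat i, KnownNat o)
    => Layer i o -> R i -> R o
runLayerPlain (Layer w b) x = w #> x + b

runNetworkPlain
    :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
    => Network i h1 h2 o -> R i -> R o
runNetworkPlain (Net l1 l2 l3) x = softmaxPlain r
  where
    y = runLayerPlain l1 x
    z = runLayerPlain l2 (logistic y)
    r = runLayerPlain l3 (logistic z)
    -- e^v, divided elementwise by the sum of its elements
    softmaxPlain v = konst (1 / HM.sumElements (extract (exp v))) * exp v
```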
But, the magic part is that we can also get the gradient!

``` {.sourceCode .literate .haskell}
gradNet
    :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
    => Network i h1 h2 o
    -> R i
    -> Network i h1 h2 o
gradNet n x = case gradBPOp runNetwork (x ::< n ::< Ø) of
    _gradX ::< gradN ::< Ø -> gradN
```

This gives the gradient of all of the parameters in the matrices and vectors
inside the `Network`, which we can use to “train”!

Training
========

Now for the real work. To train a network, we can do gradient descent based
on the gradient of some *error function* with respect to the network
parameters. Let’s use the [cross entropy], which is popular for
classification problems.

[cross entropy]: https://en.wikipedia.org/wiki/Cross_entropy

``` {.sourceCode .literate .haskell}
crossEntropy
    :: KnownNat n
    => R n
    -> BPOpI s '[ R n ] Double
crossEntropy targ (r :< Ø) = negate (dot .$ (log r :< t :< Ø))
  where
    t = constVar targ
```

Given a target vector and a `BVar` referring to the result of the network, we
can directly apply:

$$
H(\mathbf{r}, \mathbf{t}) = - \left( \log(\mathbf{r}) \cdot \mathbf{t} \right)
$$

Just for fun, I implemented `crossEntropy` in “implicit-graph” mode, so you
don’t see any binds or returns.
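
For reference, the same formula written on concrete *hmatrix* vectors, with
no `BVar`s involved, looks like this (again just an illustrative sketch;
`crossEntropyPlain` is a name made up here and isn’t used elsewhere):

``` {.haskell}
-- H(r, t) = -(log r . t), mirroring crossEntropy's argument order
-- (target first, network output second).
crossEntropyPlain :: KnownNat n => R n -> R n -> Double
crossEntropyPlain targ r = negate (log r <.> targ)
```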
Now, a function to make one gradient descent step based on an input vector
and a target, using `gradBPOp`:

``` {.sourceCode .literate .haskell}
trainStep
    :: forall i h1 h2 o. (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
    => Double
    -> R i
    -> R o
    -> Network i h1 h2 o
    -> Network i h1 h2 o
trainStep r !x !t !n = case gradBPOp o (x ::< n ::< Ø) of
    _ ::< gN ::< Ø -> n - (realToFrac r * gN)
  where
    o :: BPOp s '[ R i, Network i h1 h2 o ] Double
    o = do
        y <- runNetwork
        implicitly (crossEntropy t) -$ (y :< Ø)
```

A convenient wrapper for training over all of the observations in a list:

``` {.sourceCode .literate .haskell}
trainList
    :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
    => Double
    -> [(R i, R o)]
    -> Network i h1 h2 o
    -> Network i h1 h2 o
trainList r = flip $ foldl' (\n (x,y) -> trainStep r x y n)
```

Pulling it all together
=======================

`testNet` will be a quick way to test our net, by computing the proportion of
correct guesses (mostly using *hmatrix* operations):

``` {.sourceCode .literate .haskell}
testNet
    :: forall i h1 h2 o. (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o)
    => [(R i, R o)]
    -> Network i h1 h2 o
    -> Double
testNet xs n = sum (map (uncurry test) xs) / fromIntegral (length xs)
  where
    test :: R i -> R o -> Double
    test x (extract->t)
        | HM.maxIndex t == HM.maxIndex (extract r) = 1
        | otherwise                                = 0
      where
        r :: R o
        r = evalBPOp runNetwork (x ::< n ::< Ø)
```

And now, a main loop! If you are following along at home, download the
[mnist data set files] and uncompress them into the folder `data`, and
everything should work fine.

[mnist data set files]: http://yann.lecun.com/exdb/mnist/

``` {.sourceCode .literate .haskell}
main :: IO ()
main = MWC.withSystemRandom $ \g -> do
    Just train <- loadMNIST "data/train-images-idx3-ubyte"
                            "data/train-labels-idx1-ubyte"
    Just test  <- loadMNIST "data/t10k-images-idx3-ubyte"
                            "data/t10k-labels-idx1-ubyte"
    putStrLn "Loaded data."
    net0 <- MWC.uniformR @(Network 784 300 100 10) (-0.5, 0.5) g
    flip evalStateT net0 . forM_ [1..] $ \e -> do
      train' <- liftIO . fmap V.toList $ MWC.uniformShuffle (V.fromList train) g
      liftIO $ printf "[Epoch %d]\n" (e :: Int)

      forM_ ([1..] `zip` chunksOf batch train') $ \(b, chnk) -> StateT $ \n0 -> do
        printf "(Batch %d)\n" (b :: Int)

        t0 <- getCurrentTime
        n' <- evaluate . force $ trainList rate chnk n0
        t1 <- getCurrentTime
        printf "Trained on %d points in %s.\n" batch (show (t1 `diffUTCTime` t0))

        let trainScore = testNet chnk n'
            testScore  = testNet test n'
        printf "Training error: %.2f%%\n" ((1 - trainScore) * 100)
        printf "Validation error: %.2f%%\n" ((1 - testScore ) * 100)

        return ((), n')
  where
    rate  = 0.02
    batch = 5000
```

Each iteration of the loop:

1.  Shuffles the training set
2.  Splits it into chunks of `batch` size
3.  Uses `trainList` to train over each batch
4.  Computes the score with `testNet` on both the training chunk and the test
    set
5.  Prints out the results

And, that’s really it!

Result
------

I haven’t put much into optimizing the library yet, but the network (with
hidden layer sizes 300 and 100) seems to take about 25 seconds on my computer
to finish a batch of 5000 training points. It’s slow (about five minutes per
60,000-point epoch), but it’s a first unoptimized run and a proof of concept!
It’s my goal to get this down to the point where the result has the same
performance characteristics as the actual backend (*hmatrix*), so that the
library adds zero overhead.

Main takeaways
==============

Most of the actual heavy lifting/logic came from the *hmatrix* library
itself. We just created simple types to wrap up our bare matrices.

Basically, all that *backprop* did was give us an API to define *how to run*
a neural net: how to *run* a net based on the `Network` and `R i` input we
were given. The goal of the library is to let you write down how to run
things in as natural a way as possible.

And then, after things are run, we can just get the gradient and roll from
there!

Because the heavy lifting is done by the data types themselves, we can
presumably plug in *any* type and any tensor/numerical backend, and reap the
benefits of those libraries’ optimizations and parallelizations. *Any* type
can be backpropagated! :D

What now?
---------

Check out the docs for the [Numeric.Backprop] module for a more detailed
picture of what’s going on, or find more examples at the [github repo]!

[Numeric.Backprop]: http://hackage.haskell.org/package/backprop/docs/Numeric-Backprop.html
[github repo]: https://github.com/mstksg/backprop

Boring stuff
============

Here is a small wrapper function over the [mnist-idx] library, loading the
contents of the idx files into *hmatrix* vectors:

[mnist-idx]: http://hackage.haskell.org/package/mnist-idx

``` {.sourceCode .literate .haskell}
loadMNIST
    :: FilePath
    -> FilePath
    -> IO (Maybe [(R 784, R 10)])
loadMNIST fpI fpL = runMaybeT $ do
    i <- MaybeT          $ decodeIDXFile       fpI
    l <- MaybeT          $ decodeIDXLabelsFile fpL
    d <- MaybeT . return $ labeledIntData l i
    r <- MaybeT . return $ for d (bitraverse mkImage mkLabel . swap)
    liftIO . evaluate $ force r
  where
    mkImage :: VU.Vector Int -> Maybe (R 784)
    mkImage = create . VG.convert . VG.map (\i -> fromIntegral i / 255)
    mkLabel :: Int -> Maybe (R 10)
    mkLabel n = create $ HM.build 10 (\i -> if round i == n then 1 else 0)
```

And here are instances for generating random vectors/matrices/layers/networks,
used for the initialization step.
``` {.sourceCode .literate .haskell}
instance KnownNat n => MWC.Variate (R n) where
    uniform g = randomVector <$> MWC.uniform g <*> pure Uniform
    uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g

instance (KnownNat m, KnownNat n) => MWC.Variate (L m n) where
    uniform g = uniformSample <$> MWC.uniform g <*> pure 0 <*> pure 1
    uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g

instance (KnownNat i, KnownNat o) => MWC.Variate (Layer i o) where
    uniform g = Layer <$> MWC.uniform g <*> MWC.uniform g
    uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g

instance (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) => MWC.Variate (Network i h1 h2 o) where
    uniform g = Net <$> MWC.uniform g <*> MWC.uniform g <*> MWC.uniform g
    uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g
```
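
If you want to try these instances out on their own, a throwaway sketch like
the following should do it (illustrative only; `sampleLayer` is a name made
up here and isn’t part of the program above):

``` {.haskell}
-- Draw a random Layer with entries uniform in (-0.5, 0.5), the same way
-- `main` initializes the full network.
sampleLayer :: IO (Layer 784 300)
sampleLayer = MWC.withSystemRandom . MWC.asGenIO $ \g ->
    MWC.uniformR (-0.5, 0.5) g
```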