% Learning MNIST with Neural Networks with backprop library % Justin Le The *backprop* library performs back-propagation over a *hetereogeneous* system of relationships. It offers both an implicit (*[ad][]*-like) and explicit graph building usage style. Let's use it to build neural networks and learn mnist! [ad]: http://hackage.haskell.org/package/ad Repository source is [on github][repo], and docs are [on hackage][hackage]. [repo]: https://github.com/mstksg/backprop [hackage]: http://hackage.haskell.org/package/backprop If you're reading this as a literate haskell file, you should know that a [rendered pdf version is available on github.][rendered]. If you are reading this as a pdf file, you should know that a [literate haskell version that you can run][lhs] is also available on github! [rendered]: https://github.com/mstksg/backprop/blob/master/renders/MNIST.pdf [lhs]: https://github.com/mstksg/backprop/blob/master/samples/MNIST.lhs > {-# LANGUAGE BangPatterns #-} > {-# LANGUAGE DataKinds #-} > {-# LANGUAGE DeriveGeneric #-} > {-# LANGUAGE GADTs #-} > {-# LANGUAGE LambdaCase #-} > {-# LANGUAGE ScopedTypeVariables #-} > {-# LANGUAGE TupleSections #-} > {-# LANGUAGE TypeApplications #-} > {-# LANGUAGE ViewPatterns #-} > {-# OPTIONS_GHC -fno-warn-orphans #-} > {-# OPTIONS_GHC -fno-warn-incomplete-patterns #-} > {-# OPTIONS_GHC -fno-warn-unused-top-binds #-} > > import Control.DeepSeq > import Control.Exception > import Control.Monad > import Control.Monad.IO.Class > import Control.Monad.Trans.Maybe > import Control.Monad.Trans.State > import Data.Bitraversable > import Data.Foldable > import Data.IDX > import Data.List.Split > import Data.Maybe > import Data.Time.Clock > import Data.Traversable > import Data.Tuple > import GHC.Generics (Generic) > import GHC.TypeLits > import Numeric.Backprop > import Numeric.LinearAlgebra.Static hiding (dot) > import Text.Printf > import qualified Data.Vector as V > import qualified Data.Vector.Generic as VG > import qualified Data.Vector.Unboxed as VU > import qualified Generics.SOP as SOP > import qualified Numeric.LinearAlgebra as HM > import qualified System.Random.MWC as MWC > import qualified System.Random.MWC.Distributions as MWC Types ===== For the most part, we're going to be using the great *[hmatrix][]* library and its vector and matrix types. It offers a type `L m n` for $m \times n$ matrices, and a type `R n` for an $n$ vector. [hmatrix]: http://hackage.haskell.org/package/hmatrix First things first: let's define our neural networks as simple containers of parameters (weight matrices and bias vectors). First, a type for layers: > data Layer i o = > Layer { _lWeights :: !(L o i) > , _lBiases :: !(R o) > } > deriving (Show, Generic) > > instance SOP.Generic (Layer i o) > instance NFData (Layer i o) And a type for a simple feed-forward network with two hidden layers: > data Network i h1 h2 o = > Net { _nLayer1 :: !(Layer i h1) > , _nLayer2 :: !(Layer h1 h2) > , _nLayer3 :: !(Layer h2 o) > } > deriving (Show, Generic) > > instance SOP.Generic (Network i h1 h2 o) > instance NFData (Network i h1 h2 o) These are pretty straightforward container types...pretty much exactly the type you'd make to represent these networks! Note that, following true Haskell form, we separate out logic from data. This should be all we need. We derive an instance of `SOP.Generic` from the *[generics-sop][]* package, which *backprop* uses to propagate derivatives on values inside product types. [generics-sop]: http://hackage.haskell.org/package/generics-sop Instances --------- Things are much simplier if we had `Num` and `Fractional` instances for everything, so let's just go ahead and define that now, as well. Just a little bit of boilerplate. > instance (KnownNat i, KnownNat o) => Num (Layer i o) where > Layer w1 b1 + Layer w2 b2 = Layer (w1 + w2) (b1 + b2) > Layer w1 b1 - Layer w2 b2 = Layer (w1 - w2) (b1 - b2) > Layer w1 b1 * Layer w2 b2 = Layer (w1 * w2) (b1 * b2) > abs (Layer w b) = Layer (abs w) (abs b) > signum (Layer w b) = Layer (signum w) (signum b) > negate (Layer w b) = Layer (negate w) (negate b) > fromInteger x = Layer (fromInteger x) (fromInteger x) > > instance (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) => Num (Network i h1 h2 o) where > Net a b c + Net d e f = Net (a + d) (b + e) (c + f) > Net a b c - Net d e f = Net (a - d) (b - e) (c - f) > Net a b c * Net d e f = Net (a * d) (b * e) (c * f) > abs (Net a b c) = Net (abs a) (abs b) (abs c) > signum (Net a b c) = Net (signum a) (signum b) (signum c) > negate (Net a b c) = Net (negate a) (negate b) (negate c) > fromInteger x = Net (fromInteger x) (fromInteger x) (fromInteger x) > > instance (KnownNat i, KnownNat o) => Fractional (Layer i o) where > Layer w1 b1 / Layer w2 b2 = Layer (w1 / w2) (b1 / b2) > recip (Layer w b) = Layer (recip w) (recip b) > fromRational x = Layer (fromRational x) (fromRational x) > > instance (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) => Fractional (Network i h1 h2 o) where > Net a b c / Net d e f = Net (a / d) (b / e) (c / f) > recip (Net a b c) = Net (recip a) (recip b) (recip c) > fromRational x = Net (fromRational x) (fromRational x) (fromRational x) `KnownNat` comes from *base*; it's a typeclass that *hmatrix* uses to refer to the numbers in its type and use it to go about its normal hmatrixy business. Ops === Now, *backprop* does require *primitive* differentiable operations on our relevant types to be defined. *backprop* uses these primitive `Op`s to tie everything together. Ideally we'd import these from a library that implements these for you, and the end-user never has to make `Op` primitives. But in this case, I'm going to put the definitions here to show that there isn't any magic going on. If you're curious, refer to [documentation for `Op`][opdoc] for more details on how `Op` is implemented and how this works. [opdoc]: http://hackage.haskell.org/package/backprop/docs/Numeric-Backprop-Op.html First, matrix-vector multiplication primitive, giving an explicit gradient function. > matVec > :: (KnownNat m, KnownNat n) > => Op '[ L m n, R n ] (R m) > matVec = op2' $ \m v -> > ( m #> v, \(fromMaybe 1 -> g) -> > (g `outer` v, tr m #> g) > ) Dot products would be nice too. > dot :: KnownNat n > => Op '[ R n, R n ] Double > dot = op2' $ \x y -> > ( x <.> y, \case Nothing -> (y, x) > Just g -> (konst g * y, x * konst g) > ) Also a "scaling" function, scales a vector by a given factor. > scale > :: KnownNat n > => Op '[ Double, R n ] (R n) > scale = op2' $ \a x -> > ( konst a * x > , \case Nothing -> (HM.sumElements (extract x ), konst a ) > Just g -> (HM.sumElements (extract (x * g)), konst a * g) > ) Finally, an operation to sum all of the items in the vector. > vsum > :: KnownNat n > => Op '[ R n ] Double > vsum = op1' $ \x -> (HM.sumElements (extract x), maybe 1 konst) And why not, here's the [logistic function][], which we'll use as an activation function for internal layers. We don't need to define this as an `Op` up-front right now, because the library can automatically promote any numeric polymorphic function (an `a -> a` or `a -> a -> a`, etc.) to an `Op` anyways. [logistic function]: https://en.wikipedia.org/wiki/Logistic_function > logistic :: Floating a => a -> a > logistic x = 1 / (1 + exp (-x)) Running our Network =================== Now that we have our primitives in place, let's actually write a function to run our network! > runLayer > :: (KnownNat i, KnownNat o) > => BPOp s '[ R i, Layer i o ] (R o) > runLayer = withInps $ \(x :< l :< Ø) -> do > w :< b :< Ø <- gTuple #<~ l > y <- matVec ~$ (w :< x :< Ø) > return $ y + b A `BPOp s '[ R i, Layer i o ] (R o)` is a backpropagatable function that produces an `R o` (a vector with `o` elements, from the *[hmatrix][]* library) given an input environment of an `R i` (the "input" of the layer) and a layer. We use `withInps` to bring the environment into scope as a bunch of `BVar`s. `x` is a `BVar` containing the input vector, and `l` is a `BVar` containing the layer. The first thing we do is split out the parts of the layer so we can work with the internal matrices. We can use `#<~` to "split out" the components of a `BVar`, splitting on `gTuple` (which uses `GHC.Generics` to automatically figure out how to split up a product type). Then we apply `matVec` (our primitive `Op` that does matrix-vector multiplication) to `w` and `x`, and then the result is that added to the bias vector `b`. We can write the `runNetwork` function pretty much the same way. > runNetwork > :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) > => BPOp s '[ R i, Network i h1 h2 o ] (R o) > runNetwork = withInps $ \(x :< n :< Ø) -> do > l1 :< l2 :< l3 :< Ø <- gTuple #<~ n > y <- runLayer -$ (x :< l1 :< Ø) > z <- runLayer -$ (logistic y :< l2 :< Ø) > r <- runLayer -$ (logistic z :< l3 :< Ø) > softmax -$ (r :< Ø) > where > softmax :: KnownNat n => BPOp s '[ R n ] (R n) > softmax = withInps $ \(x :< Ø) -> do > expX <- bindVar (exp x) > totX <- vsum ~$ (expX :< Ø) > scale ~$ (1/totX :< expX :< Ø) After splitting out the layers in the input `Network`, we run each layer successively using our previously defined `runLayer`, giving inputs using `-$`. We can directly apply `logistic` to `BVar`s. At the end, we run a [softmax function][] because MNIST is a classification challenge. The softmax is done by applying $e^x$ for every item in the input vector, and dividing each element by the total. [softmax function]: https://en.wikipedia.org/wiki/Softmax_function The Magic --------- What did we just define? Well, with a `BPOp s rs a`, we can *run* it and get the output: > runNetOnInp > :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) > => Network i h1 h2 o > -> R i > -> R o > runNetOnInp n x = evalBPOp runNetwork (x ::< n ::< Ø) But, the magic part is that we can also get the gradient! > gradNet > :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) > => Network i h1 h2 o > -> R i > -> Network i h1 h2 o > gradNet n x = case gradBPOp runNetwork (x ::< n ::< Ø) of > _gradX ::< gradN ::< Ø -> gradN This gives the gradient of all of the parameters in the matrices and vectors inside the `Network`, which we can use to "train"! Training ======== Now for the real work. To train a network, we can do gradient descent based on the gradient of some type of *error function* with respect to the network parameters. Let's use the [cross entropy][], which is popular for classification problems. [cross entropy]: https://en.wikipedia.org/wiki/Cross_entropy > crossEntropy > :: KnownNat n > => R n > -> BPOpI s '[ R n ] Double > crossEntropy targ (r :< Ø) = negate (dot .$ (log r :< t :< Ø)) > where > t = constVar targ Given a target vector and a `BVar` referring to the result of the network, we can directly apply: $$ H(\mathbf{r}, \mathbf{t}) = - (log(\mathbf{r}) \cdot \mathbf{t}) $$ Just for fun, I implemented `crossEntropy` in "implicit-graph" mode, so you don't see any binds or returns. Now, a function to make one gradient descent step based on an input vector and a target, using `gradBPOp`: > trainStep > :: forall i h1 h2 o. (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) > => Double > -> R i > -> R o > -> Network i h1 h2 o > -> Network i h1 h2 o > trainStep r !x !t !n = case gradBPOp o (x ::< n ::< Ø) of > _ ::< gN ::< Ø -> > n - (realToFrac r * gN) > where > o :: BPOp s '[ R i, Network i h1 h2 o ] Double > o = do > y <- runNetwork > implicitly (crossEntropy t) -$ (y :< Ø) A convenient wrapper for training over all of the observations in a list: > trainList > :: (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) > => Double > -> [(R i, R o)] > -> Network i h1 h2 o > -> Network i h1 h2 o > trainList r = flip $ foldl' (\n (x,y) -> trainStep r x y n) Pulling it all together ======================= `testNet` will be a quick way to test our net by computing the percentage of correct guesses: (mostly using *hmatrix* stuff) > testNet > :: forall i h1 h2 o. (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) > => [(R i, R o)] > -> Network i h1 h2 o > -> Double > testNet xs n = sum (map (uncurry test) xs) / fromIntegral (length xs) > where > test :: R i -> R o -> Double > test x (extract->t) > | HM.maxIndex t == HM.maxIndex (extract r) = 1 > | otherwise = 0 > where > r :: R o > r = evalBPOp runNetwork (x ::< n ::< Ø) And now, a main loop! If you are following along at home, download the [mnist data set files][mnist] and uncompress them into the folder `data`, and everything should work fine. [mnist]: http://yann.lecun.com/exdb/mnist/ > main :: IO () > main = MWC.withSystemRandom $ \g -> do > Just train <- loadMNIST "data/train-images-idx3-ubyte" "data/train-labels-idx1-ubyte" > Just test <- loadMNIST "data/t10k-images-idx3-ubyte" "data/t10k-labels-idx1-ubyte" > putStrLn "Loaded data." > net0 <- MWC.uniformR @(Network 784 300 100 9) (-0.5, 0.5) g > flip evalStateT net0 . forM_ [1..] $ \e -> do > train' <- liftIO . fmap V.toList $ MWC.uniformShuffle (V.fromList train) g > liftIO $ printf "[Epoch %d]\n" (e :: Int) > > forM_ ([1..] `zip` chunksOf batch train') $ \(b, chnk) -> StateT $ \n0 -> do > printf "(Batch %d)\n" (b :: Int) > > t0 <- getCurrentTime > n' <- evaluate . force $ trainList rate chnk n0 > t1 <- getCurrentTime > printf "Trained on %d points in %s.\n" batch (show (t1 `diffUTCTime` t0)) > > let trainScore = testNet chnk n' > testScore = testNet test n' > printf "Training error: %.2f%%\n" ((1 - trainScore) * 100) > printf "Validation error: %.2f%%\n" ((1 - testScore ) * 100) > > return ((), n') > where > rate = 0.02 > batch = 5000 Each iteration of the loop: 1. Shuffles the training set 2. Splits it into chunks of `batch` size 3. Uses `trainList` to train over the batch 4. Computes the score based on `testNet` based on the training set and the test set 5. Prints out the results And, that's really it! Result ------ I haven't put much into optimizing the library yet, but the network (with hidden layer sizes 300 and 100) seems to take 25s on my computer to finish a batch of 5000 training points. It's slow (five minutes per 60000 point epooch), but it's a first unoptimized run and a proof of concept! It's my goal to get this down to a point where the result has the same performance characteristics as the actual backend (*hmatrix*), and so overhead is 0. Main takeaways ============== Most of the actual heavy lifting/logic actually came from the *hmatrix* library itself. We just created simple types to wrap up our bare matrices. Basically, all that *backprop* did was give you an API to define *how to run* a neural net --- how to *run* a net based on a `Network` and `R i` input you were given. The goal of the library is to let you write down how to run things in as natural way as possible. And then, after things are run, we can just get the gradient and roll from there! Because the heavy lifting is done by the data types themselves, we can presumably plug in *any* type and any tensor/numerical backend, and reap the benefits of those libraries' optimizations and parallelizations. *Any* type can be backpropagated! :D What now? --------- Check out the docs for the [Numeric.Backprop][] module for a more detailed picture of what's going on, or find more examples at the [github repo][repo]! [Numeric.Backprop]: http://hackage.haskell.org/package/backprop/docs/Numeric-Backprop.html Boring stuff ============ Here is a small wrapper function over the [mnist-idx][] library loading the contents of the idx files into *hmatrix* vectors: [mnist-idx]: http://hackage.haskell.org/package/mnist-idx > loadMNIST > :: FilePath > -> FilePath > -> IO (Maybe [(R 784, R 9)]) > loadMNIST fpI fpL = runMaybeT $ do > i <- MaybeT $ decodeIDXFile fpI > l <- MaybeT $ decodeIDXLabelsFile fpL > d <- MaybeT . return $ labeledIntData l i > r <- MaybeT . return $ for d (bitraverse mkImage mkLabel . swap) > liftIO . evaluate $ force r > where > mkImage :: VU.Vector Int -> Maybe (R 784) > mkImage = create . VG.convert . VG.map (\i -> fromIntegral i / 255) > mkLabel :: Int -> Maybe (R 9) > mkLabel n = create $ HM.build 9 (\i -> if round i == n then 1 else 0) And here are instances to generating random vectors/matrices/layers/networks, used for the initialization step. > instance KnownNat n => MWC.Variate (R n) where > uniform g = randomVector <$> MWC.uniform g <*> pure Uniform > uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g > > instance (KnownNat m, KnownNat n) => MWC.Variate (L m n) where > uniform g = uniformSample <$> MWC.uniform g <*> pure 0 <*> pure 1 > uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g > > instance (KnownNat i, KnownNat o) => MWC.Variate (Layer i o) where > uniform g = Layer <$> MWC.uniform g <*> MWC.uniform g > uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g > > instance (KnownNat i, KnownNat h1, KnownNat h2, KnownNat o) => MWC.Variate (Network i h1 h2 o) where > uniform g = Net <$> MWC.uniform g <*> MWC.uniform g <*> MWC.uniform g > uniformR (l, h) g = (\x -> x * (h - l) + l) <$> MWC.uniform g