{-| This module provides incremental statistical folds based upon the foldl library

An incremental statistical fold can be thought of as an exponentially-weighted statistic designed to be computed efficiently, in a single pass, over a Foldable.

Some throat clearing is required, however.

The common usage term "exponential moving ..." refers to the cumulative effect
of the fold on the original data: older values end up exponentially down-weighted.
From the point of view of a single step, the algorithm is better described as
"constant proportion" or "geometric" decay. Many other weighting schemes are possible, and future versions of the library may introduce more.

A main point of the library is that the traditional simple moving average
uses a sliding window of past data and thus requires keeping the last n elements
in state (most likely in a FIFO queue). That may be simple for the human brain, but it is a more complex and costly computation than this single-pass version.
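
A sketch of what a single step looks like (this mirrors the internal state of the
folds below: a decayed numerator and a decayed denominator, whose ratio is the
moving average):

>>> step rate (num, denom) x = (rate * num + x, rate * denom + 1)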

For clarity, moving average (and moving whatever) below refers to geometric decay
rather than the common usage. So with the throat clearing out of the way:

To avoid name clashes, Control.Foldl should be imported qualified.

>>> import Control.Foldl.Incremental
>>> import qualified Control.Foldl as L

The folds represent incremental statistics such as moving averages.

The stream of moving averages with a forgetting `rate` of 0.9 is:

>>> L.scan (ma 0.9) [1..10]
[NaN,1.0,1.5263157894736843,2.070110701107011,2.6312881651642916,3.2097140484969837,3.805217699371904,4.4175932632947745,5.046601250122929,5.691970329383086,6.3533993278762955]

or if you just want the moving average at the end.

>>> L.fold (ma 0.9) [1..10]
6.3533993278762955

The simple average is obtained with a rate of 1.0 (ie no decay):

>>> L.fold (ma 1.0) [1..10]
5.5
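
The pair-based folds (cov, corr, beta and alpha) take a stream of tuples, so two
series are usually zipped together. As a usage sketch, with `xs` and `ys` standing
in for any two lists of Doubles:

>>> L.fold (corr 0.9) (zip xs ys)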


Further reading:

<http://queue.acm.org/detail.cfm?id=2534976 Online Algorithms in High-frequency Trading>
<http://www-personal.umich.edu/~annastef/papers_Long_ctrl/JournalPaperMassGrade_Final.pdf recursive least squares with exponential forgetting>

-}

module Control.Foldl.Incremental (
    -- * incrementalize
    incrementalize
    -- * common incremental folds
  , ma
  , absma
  , sqma
  , std
  , cov
  , corr
  , length
  , beta
  , alpha
  , autocorr
  ) where

import Control.Applicative ((<$>), (<*>))
import Control.Foldl (Fold(..),premap)
import Prelude hiding (length)

-- | An Increment is the incremental state within an exponential moving average fold.
data Increment = Increment
   { _adder   :: {-# UNPACK #-} !Double
   , _counter :: {-# UNPACK #-} !Double
   , _rate    :: {-# UNPACK #-} !Double
   } deriving (Show)

{-| `incrementalize` takes a function and turns it into a `Control.Foldl.Fold` whose step updates an `Increment`, in the manner of the typical step of an exponential moving average calculation.

>>> incrementalize id

is a moving average of a foldable

>>> incrementalize (**2)

is a moving average of the square of a foldable

This lets you build an exponential standard deviation computation (using Foldl) as

>>> std r = (\s ss -> sqrt (ss - s**2)) <$> incrementalize id r <*> incrementalize (**2) r

incrementalize works with any function that produces a Double. A correlation fold over tuples is quite intuitive:

>>> cov r = (\xy xbar ybar -> xy - xbar * ybar) <$> incrementalize (uncurry (*)) r <*> incrementalize fst r <*> incrementalize snd r
>>> corr r = (\cov' stdx stdy -> cov' / (stdx * stdy)) <$> cov r <*> L.premap fst (std r) <*> L.premap snd (std r)

The rate is the parameter regulating the discounting (or forgetting) of the accumulated state as each new value is introduced.

>>> incrementalize id 1

tracks the sum/average of an entire Foldable. In other words, prior values are never forgotten.

>>> incrementalize id 0

produces the latest value (ie current state is discounted (or decays) to zero).  In other words, prior values are immediately forgotten.
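
For a rate strictly between these extremes the numerator and denominator of the
fold decay together, so a constant stream stays at that constant whatever the rate
(the leading NaN is the extraction of the empty initial state):

>>> L.scan (incrementalize id 0.5) [1,1,1]
[NaN,1.0,1.0,1.0]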

An exponential moving average with an exponentially-weighted length (a duration, if it's a time series) of 10 (the average lag of the values affecting the calculation) is

>>> incrementalize id (1 - 1/10)

>>> L.fold (length 0.9) [1..100]
9.999734386011127

There is no particular reason for the different parts of a compound fold to share the same rate. A standard deviation where the mean is expected to be static (eg equal to the unconditional sample average) would be:

>>> std' r = (\s ss -> sqrt (ss - s**2)) <$> incrementalize id 1 <*> incrementalize (**2) r

and a standard deviation with a prior for the mean (eg ignoring sample averages) would be:

>>> std'' mean r = sqrt <$> incrementalize (\x -> x**2 - mean**2) r

-}
incrementalize :: (a -> Double) -> Double -> Fold a Double
incrementalize f r =  Fold step (Increment 0 0 r) (\(Increment a c _) -> a / c)
  where
    step (Increment n d r') n' = Increment (r' * n + f n') (r' * d + 1) r'
{-# INLINABLE incrementalize #-}

-- | incremental average
ma :: Double -> Fold Double Double
ma = incrementalize id
{-# INLINABLE ma #-}

-- | incremental average of the absolute value
absma :: Double -> Fold Double Double
absma = incrementalize abs
{-# INLINABLE absma #-}

-- | incremental average of the square
sqma :: Double -> Fold Double Double
sqma = incrementalize (\x -> x*x)
{-# INLINABLE sqma #-}

-- | incremental standard deviation
std :: Double -> Fold Double Double
std rate = (\s ss -> sqrt (ss - s**2)) <$> ma rate <*> sqma rate
{-# INLINABLE std #-}

-- | incremental covariance
cov :: Double -> Fold (Double, Double) Double
cov r = (\xy xbar ybar -> xy - xbar * ybar) <$> incrementalize (uncurry (*)) r <*> incrementalize fst r <*> incrementalize snd r
{-# INLINABLE cov #-}

-- | incremental correlation
corr :: Double -> Fold (Double, Double) Double
corr r = (\cov' stdx stdy -> cov' / (stdx * stdy)) <$> cov r <*> premap fst (std r) <*> premap snd (std r)
{-# INLINABLE corr #-}

-- | the exponentially-weighted length implied by a rate, which converges to 1/(1-rate) as the number of elements goes to infinity
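--
-- For a rate of 0.9, the length converges towards 1/(1-0.9) = 10:
--
-- >>> L.fold (length 0.9) [1..100]
-- 9.999734386011127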
length :: Double -> Fold a Double
length r =  Fold step (Increment 0 0 r) (\(Increment _ d _) -> d)
  where
    step (Increment _ d r') _ = Increment 1 (r' * d + 1) r'
{-# INLINABLE length #-}

-- | the beta in a simple linear regression of `snd` on `fst`
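--
-- The pairs are `(regressor, response)`. A usage sketch, with `xs` and `ys`
-- standing in for the regressor and response series:
--
-- >>> L.fold (beta 0.99) (zip xs ys)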
beta :: Double -> Fold (Double, Double) Double
beta r = (\c sx -> c / (sx * sx)) <$> cov r <*> premap fst (std r)
{-# INLINABLE beta #-}

-- | the alpha in a simple linear regression of `snd` on `fst`
alpha :: Double -> Fold (Double, Double) Double
alpha r = (\y b x -> y - b * x) <$> premap snd (ma r) <*> beta r <*> premap fst (ma r)
{-# INLINABLE alpha #-}

{-| autocorrelation is a slippery concept. This method starts from the idea that there is an underlying random error process (e), and that autocorrelation is a one-step relationship sitting on top of it:

value(t) = e(t) + k * e(t-1)

where k is the autocorrelation.

Two decay rates are thus needed: one for the moving average that stands in for the dependent variable, and one for the decay of the correlation calculation between the most recent value and that moving average.

>>> L.fold (autocorr 0 1)

would estimate the one-step autocorrelation between the previous value and the current value over the entire sample set.

-}
autocorr :: Double -> Double -> Fold Double Double
autocorr maR corrR =
    case ma maR of
        (Fold maStep maBegin maDone) ->
            case corr corrR of
                (Fold corrStep corrBegin corrDone) ->
                    let begin = (maBegin, corrBegin)
                        -- feed the correlation fold the pair (moving average so far, new value),
                        -- skipping the first step where the moving average is still NaN,
                        -- and update the moving average itself with the new value
                        step (maAcc, corrAcc) a = (maStep maAcc a,
                            if isNaN (maDone maAcc)
                            then corrAcc
                            else corrStep corrAcc (maDone maAcc, a))
                        done = corrDone . snd in
                    Fold step begin done