Copyright	(c) 2011 Bryan O'Sullivan 2016 National ICT Australia
License	BSD3
Maintainer	alex.mason@nicta.com.au
Stability	experimental
Portability	portable
Safe Haskell	None
Language	Haskell2010

Control.Foldl.Statistics

Contents

Introduction
Statistics of location
Statistics of dispersion
References

Description

Synopsis

Introduction

Statistical functions from the Statistics.Sample module of the statistics package by Bryan O'Sullivan, implemented as Folds from the foldl package.

This allows many statistics to be computed concurrently with at most two passes over the data, usually by computing the mean first, and passing it to further Folds.

range :: Fold Double Double Source #

The difference between the largest and smallest elements of a sample.

sum' :: Fold Double Double Source #

A numerically stable sum using Kahan-Babuška-Neumaier summation from Numeric.Sum

Statistics of location

mean :: Fold Double Double Source #

Arithmetic mean. This uses Kahan-Babuška-Neumaier summation, so is more accurate than welfordMean unless the input values are very large.

Since foldl-1.2.2, Foldl exports a mean function, so you will have to hide one.

welfordMean :: Fold Double Double Source #

Arithmetic mean. This uses Welford's algorithm to provide numerical stability, using a single pass over the sample data.

Compared to mean, this loses a surprising amount of precision unless the inputs are very large.

meanWeighted :: Fold (Double, Double) Double Source #

Arithmetic mean for weighted sample. It uses a single-pass algorithm analogous to the one used by welfordMean.

harmonicMean :: Fold Double Double Source #

Harmonic mean.

geometricMean :: Fold Double Double Source #

Geometric mean of a sample containing no negative values.

Statistics of dispersion

The variance—and hence the standard deviation—of a sample of fewer than two elements are both defined to be zero.

Many of these Folds take the mean as an argument for constructing the variance, and as such require two passes over the data.

Functions over central moments

centralMoment :: Int -> Double -> Fold Double Double Source #

Compute the kth central moment of a sample. The central moment is also known as the moment about the mean.

This function requires the mean of the data to compute the central moment.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

centralMoments :: Int -> Int -> Double -> Fold Double (Double, Double) Source #

Compute the kth and jth central moments of a sample.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

centralMoments' :: Int -> Int -> Double -> Fold Double (Double, Double) Source #

Compute the kth and jth central moments of a sample.

This fold requires the mean of the data to be known.

This variation of centralMoments uses Kahan-Babuška-Neumaier summation to attempt to improve the accuracy of results, which may make computation slower.

skewness :: Double -> Fold Double Double Source #

Compute the skewness of a sample. This is a measure of the asymmetry of its distribution.

A sample with negative skew is said to be left-skewed. Most of its mass is on the right of the distribution, with the tail on the left.

skewness $ U.to [1,100,101,102,103]
==> -1.497681449918257

A sample with positive skew is said to be right-skewed.

skewness $ U.to [1,2,3,4,100]
==> 1.4975367033335198

A sample's skewness is not defined if its variance is zero.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

kurtosis :: Double -> Fold Double Double Source #

Compute the excess kurtosis of a sample. This is a measure of the "peakedness" of its distribution. A high kurtosis indicates that more of the sample's variance is due to infrequent severe deviations, rather than more frequent modest deviations.

A sample's excess kurtosis is not defined if its variance is zero.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

Functions requiring the mean to be known (numerically robust)

These functions use the compensated summation algorithm of Chan et al. for numerical robustness, but require two passes over the sample data as a result.

variance :: Double -> Fold Double Double Source #

Maximum likelihood estimate of a sample's variance. Also known as the population variance, where the denominator is n.

varianceUnbiased :: Double -> Fold Double Double Source #

Unbiased estimate of a sample's variance. Also known as the sample variance, where the denominator is n-1.

stdDev :: Double -> Fold Double Double Source #

Standard deviation. This is simply the square root of the unbiased estimate of the variance.

varianceWeighted :: Double -> Fold (Double, Double) Double Source #

Weighted variance. This is biased estimation. Requires the weighted mean of the input data.

Single-pass functions (faster, less safe)

The functions prefixed with the name fast below perform a single pass over the sample data using Knuth's algorithm. They usually work well, but see below for caveats. These functions are subject to fusion and do not require the mean to be passed.

Note: in cases where most sample data is close to the sample's mean, Knuth's algorithm gives inaccurate results due to catastrophic cancellation.

fastVariance :: Fold Double Double Source #

Maximum likelihood estimate of a sample's variance.

fastVarianceUnbiased :: Fold Double Double Source #

Maximum likelihood estimate of a sample's variance.

fastStdDev :: Fold Double Double Source #

Standard deviation. This is simply the square root of the maximum likelihood estimate of the variance.

fastLMVSK :: Fold Double LMVSK Source #

Efficiently compute the length, mean, variance, skewness and kurtosis with a single pass.

Since: 0.1.1.0

fastLMVSKu :: Fold Double LMVSK Source #

Efficiently compute the length, mean, unbiased variance, skewness and kurtosis with a single pass.

Since: 0.1.3.0

data LMVSK Source #

When returned by fastLMVSK, contains the count, mean, variance, skewness and kurtosis of a series of samples.

Since: 0.1.1.0

Constructors

LMVSK
Fields lmvskCount :: !Int lmvskMean :: !Double lmvskVariance :: !Double lmvskSkewness :: !Double lmvskKurtosis :: !Double

Instances

Eq LMVSK Source #
Methods (==) :: LMVSK -> LMVSK -> Bool # (/=) :: LMVSK -> LMVSK -> Bool #
Show LMVSK Source #
Methods showsPrec :: Int -> LMVSK -> ShowS # show :: LMVSK -> String # showList :: [LMVSK] -> ShowS #

data LMVSKState Source #

Instances

Semigroup LMVSKState Source #
Methods (<>) :: LMVSKState -> LMVSKState -> LMVSKState # sconcat :: NonEmpty LMVSKState -> LMVSKState # stimes :: Integral b => b -> LMVSKState -> LMVSKState #
Monoid LMVSKState Source #
Methods mempty :: LMVSKState # mappend :: LMVSKState -> LMVSKState -> LMVSKState # mconcat :: [LMVSKState] -> LMVSKState #

foldLMVSKState :: Fold Double LMVSKState Source #

Performs the heavy lifting of fastLMVSK. This is exposed because the internal LMVSKState is monoidal, allowing you to run these statistics in parallel over datasets which are split and then combine the results.

Since: 0.1.2.0

getLMVSK :: LMVSKState -> LMVSK Source #

Returns the stats which have been computed in a LMVSKState.

Since: 0.1.2.0

getLMVSKu :: LMVSKState -> LMVSK Source #

Returns the stats which have been computed in a LMVSKState, with the unbiased variance.

Since: 0.1.2.0

Linear Regression

fastLinearReg :: Fold (Double, Double) LinRegResult Source #

Computes the slope, (Y) intercept and correlation of (x,y) pairs, as well as the LMVSK stats for both the x and y series.

>>> F.fold fastLinearReg $ map (\x -> (x,3*x+7)) [1..100]
LinRegResult
  {lrrSlope = 3.0
  , lrrIntercept = 7.0
  , lrrCorrelation = 100.0
  , lrrXStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 50.5
      , lmvskVariance = 833.25
      , lmvskSkewness = 0.0
      , lmvskKurtosis = -1.2002400240024003}
  , lrrYStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 158.5
      , lmvskVariance = 7499.25
      , lmvskSkewness = 0.0
      , lmvskKurtosis = -1.2002400240024003}
  }

>>> F.fold fastLinearReg $ map (\x -> (x,0.005*x*x+3*x+7)) [1..100]
LinRegResult
  {lrrSlope = 3.5049999999999994
  , lrrIntercept = -1.5849999999999795
  , lrrCorrelation = 99.93226275740273
  , lrrXStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 50.5
      , lmvskVariance = 833.25
      , lmvskSkewness = 0.0
      , lmvskKurtosis = -1.2002400240024003}
  , lrrYStats = LMVSK
      {lmvskCount = 100
      , lmvskMean = 175.4175
      , lmvskVariance = 10250.37902625
      , lmvskSkewness = 9.862971188165422e-2
      , lmvskKurtosis = -1.1923628437011482}
  }

Since: 0.1.1.0

foldLinRegState :: Fold (Double, Double) LinRegState Source #

Performs the heavy lifting for fastLinReg. Exposed because LinRegState is a Monoid, allowing statistics to be computed on datasets in parallel and combined afterwards.

Since: 0.1.4.0

getLinRegResult :: LinRegState -> LinRegResult Source #

Produces the slope, Y intercept, correlation and LMVSK stats from a LinRegState.

Since: 0.1.4.0

data LinRegResult Source #

When returned by fastLinearReg, contains the count, slope, intercept and correlation of combining (x,y) pairs.

Since: 0.1.1.0

Constructors

LinRegResult
Fields lrrSlope :: !Double lrrIntercept :: !Double lrrCorrelation :: !Double lrrXStats :: !LMVSK lrrYStats :: !LMVSK

Instances

Eq LinRegResult Source #
Methods (==) :: LinRegResult -> LinRegResult -> Bool # (/=) :: LinRegResult -> LinRegResult -> Bool #
Show LinRegResult Source #
Methods showsPrec :: Int -> LinRegResult -> ShowS # show :: LinRegResult -> String # showList :: [LinRegResult] -> ShowS #

data LinRegState Source #

The Monoidal state used to compute linear regression, see fastLinearReg.

Since: 0.1.4.0

Instances

Semigroup LinRegState Source #
Methods (<>) :: LinRegState -> LinRegState -> LinRegState # sconcat :: NonEmpty LinRegState -> LinRegState # stimes :: Integral b => b -> LinRegState -> LinRegState #
Monoid LinRegState Source #
Methods mempty :: LinRegState # mappend :: LinRegState -> LinRegState -> LinRegState # mconcat :: [LinRegState] -> LinRegState #

lrrCount :: LinRegResult -> Int Source #

The number of elements which make up this LinRegResult Since: 0.1.4.1

correlation :: (Double, Double) -> (Double, Double) -> Fold (Double, Double) Double Source #

Given the mean and standard deviation of two distributions, computes the correlation between them, given the means and standard deviation of the x and y series. The results may be more accurate than those returned by fastLinearReg

References

Chan, T. F.; Golub, G.H.; LeVeque, R.J. (1979) Updating formulae and a pairwise algorithm for computing sample variances. Technical Report STAN-CS-79-773, Department of Computer Science, Stanford University. ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
Knuth, D.E. (1998) The art of computer programming, volume 2: seminumerical algorithms, 3rd ed., p. 232.
Welford, B.P. (1962) Note on a method for calculating corrected sums of squares and products. Technometrics 4(3):419–420. http://www.jstor.org/stable/1266577
West, D.H.D. (1979) Updating mean and variance estimates: an improved method. Communications of the ACM 22(9):532–535. http://doi.acm.org/10.1145/359146.359153
John D. Cook. Computing skewness and kurtosis in one pass http://www.johndcook.com/blog/skewness_kurtosis/

module Control.Foldl