foldl-statistics-0.1.5.0: Statistical functions from the statistics package implemented as Folds.

Copyright (c) 2011 Bryan O'Sullivan 2016 National ICT Australia 2018 CSIRO BSD3 Alex.Mason@data61.csiro.au experimental portable None Haskell2010

Control.Foldl.Statistics

Description

Synopsis

# Introduction

Statistical functions from the Statistics.Sample module of the statistics package by Bryan O'Sullivan, implemented as Folds from the foldl package.

This allows many statistics to be computed concurrently with at most two passes over the data, usually by computing the mean first, and passing it to further Folds.

The difference between the largest and smallest elements of a sample.

A numerically stable sum using Kahan-Babuška-Neumaier summation from Numeric.Sum

histogram :: Ord a => Fold a (Map a Int) Source #

Create a histogram of each value of type a. Useful for folding over categorical values, for example, a CSV where you have a data type for a selection of categories.

It should not be used for continuous values which would lead to a high number of keys. One way to avoid this is to use the Profunctor instance for Fold to break your values into categories. For an example of doing this, see ordersOfMagnitude.

histogram' :: (Hashable a, Eq a) => Fold a (HashMap a Int) Source #

Like histogram, but for use when hashmaps would be more efficient for the particular type a.

Provides a histogram of the orders of magnitude of the values in a series. Negative values are placed in the 0.0 category due to the behaviour of logBase. it may be useful to use lmap abs on this Fold to get a histogram of the absolute magnitudes.

# Statistics of location

Arithmetic mean. This uses Kahan-Babuška-Neumaier summation, so is more accurate than welfordMean unless the input values are very large.

Since foldl-1.2.2, Foldl exports a mean function, so you will have to hide one.

Arithmetic mean. This uses Welford's algorithm to provide numerical stability, using a single pass over the sample data.

Compared to mean, this loses a surprising amount of precision unless the inputs are very large.

Arithmetic mean for weighted sample. It uses a single-pass algorithm analogous to the one used by welfordMean.

Harmonic mean.

Geometric mean of a sample containing no negative values.

# Statistics of dispersion

The variance—and hence the standard deviation—of a sample of fewer than two elements are both defined to be zero.

Many of these Folds take the mean as an argument for constructing the variance, and as such require two passes over the data.

## Functions over central moments

Compute the kth central moment of a sample. The central moment is also known as the moment about the mean.

This function requires the mean of the data to compute the central moment.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

Compute the kth and jth central moments of a sample.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

Compute the kth and jth central moments of a sample.

This fold requires the mean of the data to be known.

This variation of centralMoments uses Kahan-Babuška-Neumaier summation to attempt to improve the accuracy of results, which may make computation slower.

Compute the skewness of a sample. This is a measure of the asymmetry of its distribution.

A sample with negative skew is said to be left-skewed. Most of its mass is on the right of the distribution, with the tail on the left.

skewness $U.to [1,100,101,102,103] ==> -1.497681449918257 A sample with positive skew is said to be right-skewed. skewness$ U.to [1,2,3,4,100]
==> 1.4975367033335198

A sample's skewness is not defined if its variance is zero.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

Compute the excess kurtosis of a sample. This is a measure of the "peakedness" of its distribution. A high kurtosis indicates that more of the sample's variance is due to infrequent severe deviations, rather than more frequent modest deviations.

A sample's excess kurtosis is not defined if its variance is zero.

This fold requires the mean of the data to be known.

For samples containing many values very close to the mean, this function is subject to inaccuracy due to catastrophic cancellation.

## Functions requiring the mean to be known (numerically robust)

These functions use the compensated summation algorithm of Chan et al. for numerical robustness, but require two passes over the sample data as a result.

Maximum likelihood estimate of a sample's variance. Also known as the population variance, where the denominator is n.

Unbiased estimate of a sample's variance. Also known as the sample variance, where the denominator is n-1.

Standard deviation. This is simply the square root of the unbiased estimate of the variance.

Weighted variance. This is biased estimation. Requires the weighted mean of the input data.

## Single-pass functions (faster, less safe)

The functions prefixed with the name fast below perform a single pass over the sample data using Knuth's algorithm. They usually work well, but see below for caveats. These functions are subject to fusion and do not require the mean to be passed.

Note: in cases where most sample data is close to the sample's mean, Knuth's algorithm gives inaccurate results due to catastrophic cancellation.

Maximum likelihood estimate of a sample's variance.

Maximum likelihood estimate of a sample's variance.

Standard deviation. This is simply the square root of the maximum likelihood estimate of the variance.

Efficiently compute the length, mean, variance, skewness and kurtosis with a single pass.

Since: 0.1.1.0

Efficiently compute the length, mean, unbiased variance, skewness and kurtosis with a single pass.

Since: 0.1.3.0

data LMVSK Source #

When returned by fastLMVSK, contains the count, mean, variance, skewness and kurtosis of a series of samples.

Since: 0.1.1.0

Constructors

 LMVSK FieldslmvskCount :: !Int lmvskMean :: !Double lmvskVariance :: !Double lmvskSkewness :: !Double lmvskKurtosis :: !Double

Instances

 Source # Methods(==) :: LMVSK -> LMVSK -> Bool #(/=) :: LMVSK -> LMVSK -> Bool # Source # MethodsshowsPrec :: Int -> LMVSK -> ShowS #show :: LMVSK -> String #showList :: [LMVSK] -> ShowS #

Instances

 Source # Methodsstimes :: Integral b => b -> LMVSKState -> LMVSKState # Source # Methodsmconcat :: [LMVSKState] -> LMVSKState #

Performs the heavy lifting of fastLMVSK. This is exposed because the internal LMVSKState is monoidal, allowing you to run these statistics in parallel over datasets which are split and then combine the results.

Since: 0.1.2.0

Returns the stats which have been computed in a LMVSKState.

Since: 0.1.2.0

Returns the stats which have been computed in a LMVSKState, with the unbiased variance.

Since: 0.1.2.0

## Linear Regression

Computes the slope, (Y) intercept and correlation of (x,y) pairs, as well as the LMVSK stats for both the x and y series.

>>> F.fold fastLinearReg $map (\x -> (x,3*x+7)) [1..100] LinRegResult {lrrSlope = 3.0 , lrrIntercept = 7.0 , lrrCorrelation = 100.0 , lrrXStats = LMVSK {lmvskCount = 100 , lmvskMean = 50.5 , lmvskVariance = 833.25 , lmvskSkewness = 0.0 , lmvskKurtosis = -1.2002400240024003} , lrrYStats = LMVSK {lmvskCount = 100 , lmvskMean = 158.5 , lmvskVariance = 7499.25 , lmvskSkewness = 0.0 , lmvskKurtosis = -1.2002400240024003} }  >>> F.fold fastLinearReg$ map (\x -> (x,0.005*x*x+3*x+7)) [1..100]
LinRegResult
{lrrSlope = 3.5049999999999994
, lrrIntercept = -1.5849999999999795
, lrrCorrelation = 99.93226275740273
, lrrXStats = LMVSK
{lmvskCount = 100
, lmvskMean = 50.5
, lmvskVariance = 833.25
, lmvskSkewness = 0.0
, lmvskKurtosis = -1.2002400240024003}
, lrrYStats = LMVSK
{lmvskCount = 100
, lmvskMean = 175.4175
, lmvskVariance = 10250.37902625
, lmvskSkewness = 9.862971188165422e-2
, lmvskKurtosis = -1.1923628437011482}
}


Since: 0.1.1.0

Performs the heavy lifting for fastLinReg. Exposed because LinRegState is a Monoid, allowing statistics to be computed on datasets in parallel and combined afterwards.

Since: 0.1.4.0

Produces the slope, Y intercept, correlation and LMVSK stats from a LinRegState.

Since: 0.1.4.0

When returned by fastLinearReg, contains the count, slope, intercept and correlation of combining (x,y) pairs.

Since: 0.1.1.0

Constructors

 LinRegResult FieldslrrSlope :: !Double lrrIntercept :: !Double lrrCorrelation :: !Double lrrXStats :: !LMVSK lrrYStats :: !LMVSK

Instances

 Source # Methods Source # MethodsshowList :: [LinRegResult] -> ShowS #

The Monoidal state used to compute linear regression, see fastLinearReg.

Since: 0.1.4.0

Instances

 Source # Methodsstimes :: Integral b => b -> LinRegState -> LinRegState # Source # Methodsmconcat :: [LinRegState] -> LinRegState #

The number of elements which make up this LinRegResult Since: 0.1.4.1

correlation :: (Double, Double) -> (Double, Double) -> Fold (Double, Double) Double Source #

Given the mean and standard deviation of two distributions, computes the correlation between them, given the means and standard deviation of the x and y series. The results may be more accurate than those returned by fastLinearReg