tdigest-0: On-line accumulation of rank-based statistics

Safe HaskellNone
LanguageHaskell2010

Data.TDigest

Contents

Description

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. . See original paper: "Computing extremely accurate quantiles using t-digest" by Ted Dunning and Otmar Ertl for more details https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf.

Examples

>>> quantile 0.99 (tdigest [1..1000] :: TDigest 25)
Just 990.499...
>>> quantile 0.99 (tdigest [1..1000] :: TDigest 3)
Just 992.7...

t-Digest is more precise in tails, especially median is imprecise:

>>> median (tdigest [1..1000] :: TDigest 25)
Just 502.5...

Synopsis

Construction

data TDigest compression Source #

TDigest is a tree of centroids.

compression is a 1/δ. The greater the value of compression the less likely value merging will happen.

Instances

KnownNat comp => Reducer Double (TDigest comp) Source #

Both cons and snoc are insert

Methods

unit :: Double -> TDigest comp #

snoc :: TDigest comp -> Double -> TDigest comp #

cons :: Double -> TDigest comp -> TDigest comp #

Show (TDigest compression) Source # 

Methods

showsPrec :: Int -> TDigest compression -> ShowS #

show :: TDigest compression -> String #

showList :: [TDigest compression] -> ShowS #

KnownNat comp => Semigroup (TDigest comp) Source # 

Methods

(<>) :: TDigest comp -> TDigest comp -> TDigest comp #

sconcat :: NonEmpty (TDigest comp) -> TDigest comp #

stimes :: Integral b => b -> TDigest comp -> TDigest comp #

KnownNat comp => Monoid (TDigest comp) Source # 

Methods

mempty :: TDigest comp #

mappend :: TDigest comp -> TDigest comp -> TDigest comp #

mconcat :: [TDigest comp] -> TDigest comp #

KnownNat comp => Binary (TDigest comp) Source #

TDigest isn't compressed after de-serialisation, but it can be still smaller.

Methods

put :: TDigest comp -> Put #

get :: Get (TDigest comp) #

putList :: [TDigest comp] -> Put #

NFData (TDigest comp) Source #

TDigest has only strict fields.

Methods

rnf :: TDigest comp -> () #

tdigest :: (Foldable f, KnownNat comp) => f Double -> TDigest comp Source #

Strict foldl' over Foldable structure.

Population

singleton :: KnownNat comp => Double -> TDigest comp Source #

Make a TDigest of a single data point.

insert Source #

Arguments

:: KnownNat comp 
=> Double

element

-> TDigest comp 
-> TDigest comp 

Insert single value into TDigest.

insert' Source #

Arguments

:: KnownNat comp 
=> Double

element

-> TDigest comp 
-> TDigest comp 

Insert single value, don't compress TDigest even if needed.

For sensibly bounded input, it makes sense to let TDigest grow (it might grow linearly in size), and after that compress it once.

Compression

>>> let digest = foldl' (flip insert') mempty [0..1000] :: TDigest 10
>>> (size digest, size $ compress digest)
(1001,54)
>>> (quantile 0.1 digest, quantile 0.1 $ compress digest)
(Just 99.6...,Just 90.1...)

Note: when values are inserted in more random order, t-Digest self-compresses on the fly:

>>> let digest = foldl' (flip insert') mempty (fairshuffle [0..1000]) :: TDigest 10
>>> (size digest, size $ compress digest, size $ forceCompress digest)
(77,77,44)
>>> quantile 0.1 digest
Just 96.9...

compress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp Source #

Compress TDigest.

Reinsert the centroids in "better" order (in original paper: in random) so they have opportunity to merge.

Compression will happen only if size is both: bigger than relMaxSize * comp and bigger than absMaxSize.

forceCompress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp Source #

Perform compression, even if current size says it's not necessary.

Statistics

totalWeight :: TDigest comp -> Double Source #

Total count of samples.

>>> totalWeight (tdigest [1..100] :: TDigest 3)
100.0

minimumValue :: TDigest comp -> Mean Source #

Center of left-most centroid. Note: may be different than min element inserted.

>>> minimumValue (tdigest [1..100] :: TDigest 3)
1.0

maximumValue :: TDigest comp -> Mean Source #

Center of right-most centroid. Note: may be different than max element inserted.

>>> maximumValue (tdigest [1..100] :: TDigest 3)
99.0

Histogram

histogram :: TDigest comp -> [HistBin] Source #

Calculate histogram based on the TDigest.

data HistBin Source #

Histogram bin

Constructors

HistBin 

Fields

Instances

Percentile

median :: TDigest comp -> Maybe Double Source #

Median, i.e. quantile 0.5.

quantile :: Double -> TDigest comp -> Maybe Double Source #

Calculate quantile of a specific value.

CDF

cdf :: Double -> TDigest comp -> Double Source #

Cumulative distribution function.

Note: if this is the only thing you need, it's more efficient to count this directly.

icdf :: Double -> TDigest comp -> Maybe Double Source #

Alias of quantile.

Debug

validate :: TDigest comp -> Either String (TDigest comp) Source #

Check various invariants in the TDigest tree.