data-sketches-0.3.0.1
Safe HaskellNone
LanguageHaskell2010

DataSketches.Quantiles.RelativeErrorQuantile

Description

The Relative Error Quantile (REQ) sketch provides extremely high accuracy at a chosen end of the rank domain. This is best illustrated with some rank domain accuracy plots that compare the KLL quantiles sketch to the REQ sketch.

This first plot illustrates the typical error behavior of the KLL sketch (also the quantiles/DoublesSketch). The error is flat for all ranks (0, 1). The green and yellow lines correspond to +/- one RSE at 68% confidence; the blue and red lines, +- two RSE at 95% confidence; and, the purple and brown lines +- 3 RSE at 99% confidence. The reason all the curves pinch at 0 and 1.0, is because the sketch knows with certainty that a request for a quantile at rank = 0 is the minimum value of the stream; and a request for a quantiles at rank = 1.0, is the maximum value of the stream. Both of which the sketch tracks.

The next plot is the exact same data and queries fed to the REQ sketch set for High Rank Accuracy (HRA) mode. In this plot, starting at a rank of about 0.3, the contour lines start converging and actually reach zero error at rank 1.0. Therefore the error (the inverse of accuracy) is relative to the requested rank, thus the name of the sketch. This means that the user can perform getQuantile(rank) queries, where rank = .99999 and get accurate results.

This next plot is also the same data and queries, except the REQ sketch was configured for Low Rank Accuracy (LRA). In this case the user can perform getQuantiles(rank) queries, where rank = .00001 and get accurate results.

Synopsis

Construction

data ReqSketch s Source #

This Relative Error Quantiles Sketch is the Haskell implementation based on the paper "Relative Error Streaming Quantiles", https://arxiv.org/abs/2004.01668, and loosely derived from a Python prototype written by Pavel Vesely, ported from the Java equivalent.

This implementation differs from the algorithm described in the paper in the following:

The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one variable state may not correspond to the number of compactions performed at a particular level, however, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.

The size of each section (variable k and sectionSize in the code and parameter k in the paper) is initialized with a value set by the user via variable k. When the number of sections doubles, we decrease sectionSize by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).

The merge operation here does not perform "special compactions", which are used in the paper to allow for a tight mathematical analysis of the sketch.

This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.

The Python prototype only implemented high accuracy for low ranks. This implementation provides the user with the ability to choose either high rank accuracy or low rank accuracy at the time of sketch construction.

  • The Python prototype only implemented a comparison criterion of "<". This implementation allows the user to switch back and forth between the "<=" criterion and the "<=" criterion.

Instances

Instances details
TakeSnapshot ReqSketch Source # 
Instance details

Defined in DataSketches.Quantiles.RelativeErrorQuantile

Associated Types

data Snapshot ReqSketch Source #

Show (Snapshot ReqSketch) Source # 
Instance details

Defined in DataSketches.Quantiles.RelativeErrorQuantile

Generic (ReqSketch s) Source # 
Instance details

Defined in DataSketches.Quantiles.RelativeErrorQuantile

Associated Types

type Rep (ReqSketch s) :: Type -> Type #

Methods

from :: ReqSketch s -> Rep (ReqSketch s) x #

to :: Rep (ReqSketch s) x -> ReqSketch s #

NFData (ReqSketch s) Source # 
Instance details

Defined in DataSketches.Quantiles.RelativeErrorQuantile

Methods

rnf :: ReqSketch s -> () #

data Snapshot ReqSketch Source # 
Instance details

Defined in DataSketches.Quantiles.RelativeErrorQuantile

type Rep (ReqSketch s) Source # 
Instance details

Defined in DataSketches.Quantiles.RelativeErrorQuantile

type Rep (ReqSketch s) = D1 ('MetaData "ReqSketch" "DataSketches.Quantiles.RelativeErrorQuantile" "data-sketches-0.3.0.1-6J1wEmzzafm56VnpwQnbyl" 'False) (C1 ('MetaCons "ReqSketch" 'PrefixI 'True) (((S1 ('MetaSel ('Just "k") 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 Word32) :*: (S1 ('MetaSel ('Just "rankAccuracySetting") 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 RankAccuracy) :*: S1 ('MetaSel ('Just "criterion") 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 Criterion))) :*: (S1 ('MetaSel ('Just "sketchRng") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (Gen s)) :*: (S1 ('MetaSel ('Just "totalN") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (URef s Word64)) :*: S1 ('MetaSel ('Just "minValue") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (URef s Double))))) :*: ((S1 ('MetaSel ('Just "maxValue") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (URef s Double)) :*: (S1 ('MetaSel ('Just "sumValue") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (URef s Double)) :*: S1 ('MetaSel ('Just "retainedItems") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (URef s Int)))) :*: (S1 ('MetaSel ('Just "maxNominalCapacitiesSize") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (URef s Int)) :*: (S1 ('MetaSel ('Just "aux") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (MutVar s (Maybe ReqAuxiliary))) :*: S1 ('MetaSel ('Just "compactors") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 (MutVar s (Vector (ReqCompactor s)))))))))

mkReqSketch Source #

Arguments

:: forall m. PrimMonad m 
=> Word32

K

-> RankAccuracy 
-> m (ReqSketch (PrimState m)) 

The K parameter can be increased to trade increased space efficiency for higher accuracy in rank and quantile calculations. Due to the way the compaction algorithm works, it must be an even number between 4 and 1024.

A good starting number when in doubt is 6.

Configuration settings

data RankAccuracy Source #

Constructors

HighRanksAreAccurate

High ranks are prioritized for better accuracy.

LowRanksAreAccurate

Low ranks are prioritized for better accuracy

data Criterion Source #

Constructors

(:<) 
(:<=) 

Sketch summaries

count :: PrimMonad m => ReqSketch (PrimState m) -> m Word64 Source #

Get the total number of items inserted into the sketch

null :: PrimMonad m => ReqSketch (PrimState m) -> m Bool Source #

Returns true if this sketch is empty.

maximum :: PrimMonad m => ReqSketch (PrimState m) -> m Double Source #

Gets the largest value seen by this sketch

minimum :: PrimMonad m => ReqSketch (PrimState m) -> m Double Source #

Gets the smallest value seen by this sketch

relativeStandardError Source #

Arguments

:: Int

k - the given value of k

-> Double

rank - the given normalized rank, a number in [0,1].

-> RankAccuracy 
-> Word64

totalN - an estimate of the total number of items submitted to the sketch.

-> Double

an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).

Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in https://arxiv.org/abs/2004.01668v2, but the constant factors were modified based on empirical measurements.

countWithCriterion :: (PrimMonad m, s ~ PrimState m) => ReqSketch s -> Double -> m Word64 Source #

Returns the approximate count of items satisfying the criterion set in the ReqSketch criterion field.

probabilityMassFunction Source #

Arguments

:: PrimMonad m 
=> ReqSketch (PrimState m) 
-> [Double]

splitPoints - an array of m unique, monotonically increasing double values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these splitpoints.

-> m [Double]

An array of m+1 doubles each of which is an approximation to the fraction of the input stream values (the mass) that fall into one of those intervals. The definition of an "interval" is inclusive of the left splitPoint and exclusive of the right splitPoint, with the exception that the last interval will include maximum value.

Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values). The resulting approximations have a probabilistic guarantee that be obtained, a priori, from the getRSE(int, double, boolean, long) function.

If the sketch is empty this returns an empty list.

quantile Source #

Arguments

:: PrimMonad m 
=> ReqSketch (PrimState m) 
-> Double

normRank - the given normalized rank

-> m Double

the approximate quantile given the normalized rank.

Gets the approximate quantile of the given normalized rank based on the lteq criterion.

quantiles Source #

Arguments

:: PrimMonad m 
=> ReqSketch (PrimState m) 
-> [Double]

normRanks - the given array of normalized ranks.

-> m [Double]

the array of quantiles that correspond to the given array of normalized ranks.

Gets an array of quantiles that correspond to the given array of normalized ranks.

rank Source #

Arguments

:: PrimMonad m 
=> ReqSketch (PrimState m) 
-> Double

value - the given value

-> m Double

the normalized rank of the given value in the stream.

Computes the normalized rank of the given value in the stream. The normalized rank is the fraction of values less than the given value; or if lteq is true, the fraction of values less than or equal to the given value.

rankLowerBound Source #

Arguments

:: PrimMonad m 
=> ReqSketch (PrimState m) 
-> Double

rank - the given rank, a value between 0 and 1.0.

-> Int

numStdDev - the number of standard deviations. Must be 1, 2, or 3.

-> m Double

an approximate lower bound rank.

Returns an approximate lower bound rank of the given normalized rank.

ranks :: (PrimMonad m, s ~ PrimState m) => ReqSketch s -> [Double] -> m [Double] Source #

Gets an array of normalized ranks that correspond to the given array of values. TODO, make it ifaster

rankUpperBound Source #

Arguments

:: PrimMonad m 
=> ReqSketch (PrimState m) 
-> Double

rank - the given rank, a value between 0 and 1.0.

-> Int

numStdDev - the number of standard deviations. Must be 1, 2, or 3.

-> m Double

an approximate upper bound rank.

Returns an approximate upper bound rank of the given rank.

cumulativeDistributionFunction Source #

Arguments

:: PrimMonad m 
=> ReqSketch (PrimState m) 
-> [Double]

Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoint (values).

The resulting approximations have a probabilistic guarantee that be obtained, a priori, from the getRSE(int, double, boolean, long) function.

If the sketch is empty this returns Nothing.

-> m (Maybe [Double]) 

Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoint (values).

Updating the sketch

merge :: (PrimMonad m, s ~ PrimState m) => ReqSketch s -> ReqSketch s -> m (ReqSketch s) Source #

Merge other sketch into this one.

insert :: PrimMonad m => ReqSketch (PrimState m) -> Double -> m () Source #

Updates this sketch with the given item.

isEstimationMode :: PrimMonad m => ReqSketch (PrimState m) -> m Bool Source #

Returns true if this sketch is in estimation mode.

isLessThanOrEqual :: ReqSketch s -> Bool Source #

Returns the current comparison criterion.

Internals used in test. DO NOT USE.

If you see this error, please file an issue in the GitHub repository.