tasty-bench-0.1: Featherlight benchmark framework
Copyright: (c) 2021 Andrew Lelechenko
License: MIT
Safe Haskell: None
Language: Haskell2010

Test.Tasty.Bench

Description

Featherlight benchmark framework (only one file!) for performance measurement, with an API mimicking criterion and gauge.

How lightweight is it?

There is only one source file, Test.Tasty.Bench, and no external dependencies except tasty. So if you already depend on tasty for a test suite, there is nothing else to install.

Compare this to criterion (10+ modules, 50+ dependencies) and gauge (40+ modules, depends on basement and vector).

How is it possible?

Our benchmarks are literally regular tasty tests, so we can leverage all existing machinery for command-line options, resource management, structuring, listing and filtering benchmarks, running and reporting results. It also means that tasty-bench can be used in conjunction with other tasty ingredients.

Unlike criterion and gauge, we use a very simple statistical model, described below. This is arguably a questionable choice, but it works pretty well in practice. Few developers are sufficiently well-versed in probability theory to make sense of, let alone use, all the numbers generated by criterion.

How to switch?

Cabal mixins allow you to try tasty-bench instead of criterion or gauge without changing a single line of code:

cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion)

This works vice versa as well: if you use tasty-bench, but at some point need a more comprehensive statistical analysis, it is easy to switch temporarily back to criterion.

How to write a benchmark?

Benchmarks are declared in a separate section of the cabal file:

cabal-version:   2.0
name:            bench-fibo
version:         0.0
build-type:      Simple
synopsis:        Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench

And here is BenchFibo.hs:

import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]

Since tasty-bench provides an API compatible with criterion, one can refer to its documentation for more examples.

How to read results?

Running the example above (cabal bench or stack bench) results in the following output:

All
  fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ±  73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)

The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall time), its mean time was 63 nanoseconds and, assuming ideal precision of the system clock, execution time does not often diverge from the mean by more than ±3.4 nanoseconds (twice the standard deviation, which for a normal distribution corresponds to 95% probability). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.

Note that this data is not directly comparable with criterion output:

benchmarking fibonacci numbers/fifth
time                 62.78 ns   (61.99 ns .. 63.41 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 62.39 ns   (61.93 ns .. 62.94 ns)
std dev              1.753 ns   (1.427 ns .. 2.258 ns)

One might interpret the second line as saying that 95% of measurements fell into the 61.99–63.41 ns interval, but this is wrong. It states that the OLS regression estimate of execution time (which is not exactly the mean time) is most probably somewhere between 61.99 ns and 63.41 ns, but it says nothing about individual measurements. To understand how far a typical measurement deviates, you need to add and subtract twice the standard deviation yourself (which gives 62.78 ns ± 3.506 ns, similar to tasty-bench above).

To add to the confusion, gauge in --small mode outputs not the second line of the criterion report, as one might expect, but the mean value from the penultimate line and a standard deviation:

fibonacci numbers/fifth                  mean 62.39 ns  ( +- 1.753 ns  )

The interval ±1.753 ns accounts for only 68% of samples; double it to estimate the behavior in 95% of cases.
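The doubling arithmetic used in this section can be spelled out in a few lines of Haskell. This is only an illustrative sketch: the Estimate type and the interval95 function are hypothetical names, not part of tasty-bench, criterion or gauge, and the 95% figure holds only under the normality assumption mentioned above.

```haskell
-- | Hypothetical record: a central estimate and a standard deviation,
-- both in nanoseconds.
data Estimate = Estimate
  { estCentre :: Double
  , estStdev  :: Double
  }

-- | Interval expected to cover roughly 95% of samples, assuming they are
-- normally distributed: centre minus/plus twice the standard deviation.
interval95 :: Estimate -> (Double, Double)
interval95 (Estimate c s) = (c - 2 * s, c + 2 * s)

-- | Numbers copied from the criterion report quoted above
-- (OLS time 62.78 ns, std dev 1.753 ns).
criterionFifth :: Estimate
criterionFifth = Estimate 62.78 1.753
```

Evaluating interval95 criterionFifth yields bounds of roughly 59.27 ns and 66.29 ns, that is 62.78 ns ± 3.506 ns. The figure printed by tasty-bench after ± is already doubled, so it should not be doubled again.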

Statistical model

Here is a procedure used by tasty-bench to measure execution time:

  1. Set \( n \leftarrow 1 \).
  2. Measure execution time \( t_n \) of \( n \) iterations and execution time \( t_{2n} \) of \( 2n \) iterations.
  3. Find \( t \) which minimizes deviation of \( (nt, 2nt) \) from \( (t_n, t_{2n}) \).
  4. If deviation is small enough (see --stdev below), return \( t \) as a mean execution time.
  5. Otherwise set \( n \leftarrow 2n \) and jump back to Step 2.

This is roughly similar to the linear regression approach which criterion takes, but we fit only the last two points. This allows us to simplify away all heavyweight statistical analysis. More importantly, earlier measurements, which are presumably shorter and noisier, do not affect the overall result. This is in contrast to criterion, which fits all measurements and is biased to use more data points corresponding to shorter runs (it employs \( n \leftarrow 1.05n \) progression).
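The procedure can be sketched in Haskell as follows. This is not the code of tasty-bench, only a minimal model under stated assumptions: "deviation" is taken to be the Euclidean distance, in which case Step 3 has the closed form \( t = (t_n + 2 t_{2n}) / (5n) \); the stopping rule compares the deviation with the fitted time of \( n \) iterations; and measure is a hypothetical action that times a given number of iterations. All names are illustrative.

```haskell
import Data.Functor.Identity (Identity (..))

-- | Step 3: per-iteration time t minimizing the squared distance between
-- the prediction (n*t, 2n*t) and the measurements (tN, t2N).
-- Setting the derivative to zero gives t = (tN + 2*t2N) / (5n).
fitTime :: Int -> Double -> Double -> Double
fitTime n tN t2N = (tN + 2 * t2N) / (5 * fromIntegral n)

-- | Euclidean distance between the prediction and the measurements.
deviation :: Int -> Double -> Double -> Double
deviation n tN t2N = sqrt (sq (tN - nt) + sq (t2N - 2 * nt))
  where
    nt = fromIntegral n * fitTime n tN t2N
    sq x = x * x

-- | Steps 1-5: double n until the deviation, relative to the fitted time
-- of n iterations, is within the target (an assumed stopping rule).
-- Returns the fitted per-iteration time and the final n.
estimate :: Monad m => Double -> (Int -> m Double) -> m (Double, Int)
estimate target measure = go 1
  where
    go n = do
      tN  <- measure n
      t2N <- measure (2 * n)
      let t = fitTime n tN t2N
      if deviation n tN t2N <= target * fromIntegral n * t
        then pure (t, n)
        else go (2 * n)

-- | Deterministic stand-in for a clock: 10 units of fixed overhead plus
-- 2 units per iteration, with a 5% target.
demo :: (Double, Int)
demo = runIdentity (estimate 0.05 (\n -> Identity (10 + 2 * fromIntegral n)))
```

With this artificial clock the loop stops at \( n = 64 \) with \( t = 2.09375 \): the fixed overhead is amortized as \( n \) doubles, so the estimate approaches the true cost of 2 units per iteration. A real measuring action is noisy, and this sketch has no timeout, so it may not terminate for an overly strict target.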

An alert reader could object that we measure standard deviation for samples with \( n \) and \( 2n \) iterations, but report it scaled to a single iteration. Strictly speaking, this is justified only if we assume that deviating factors are either roughly periodic (e.g., coarseness of a system clock, garbage collection) or are likely to affect several successive iterations in the same way (e.g., a slowdown caused by another concurrent process).

Obligatory disclaimer: statistics is a tricky matter; there is no one-size-fits-all approach. In the absence of a good theory, simplistic approaches are as (un)sound as obscure ones. Those who seek statistical soundness should rather collect raw data and process it themselves in R/Python. Data reported by tasty-bench is only of indicative and comparative significance.

Tip

Passing +RTS -T (via cabal bench --benchmark-options '+RTS -T' or stack bench --ba '+RTS -T') enables tasty-bench to estimate and report memory usage such as allocated and copied bytes.
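To see what the -T flag changes, here is a small sketch that reads allocation counters through GHC.Stats from base. It is an assumption, not something stated in this documentation, that tasty-bench obtains its memory figures the same way; describeMemory is a hypothetical helper.

```haskell
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Describe the allocation counters if the program runs with +RTS -T,
-- or explain why they are unavailable.
describeMemory :: IO String
describeMemory = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then pure "RTS statistics are disabled; rerun with +RTS -T"
    else do
      stats <- getRTSStats
      pure $
        "allocated " ++ show (allocated_bytes stats)
          ++ " bytes, copied " ++ show (copied_bytes stats) ++ " bytes"
```

Without -T the runtime does not collect these statistics, and getRTSStats would throw an exception, hence the check with getRTSStatsEnabled first.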

Command-line options

Use --help to list command-line options.

-p, --pattern
This is a standard tasty option, which allows filtering benchmarks by a pattern or awk expression. Please refer to tasty documentation for details.
--csv
File to write results in CSV format. If specified, suppresses console output.
-t, --timeout
This is a standard tasty option, setting a timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long: tasty-bench will make an effort to report results (even if of subpar quality) before the timeout. Setting the timeout too tight (insufficient for at least three iterations of a benchmark) will result in a benchmark failure. Do not use --timeout without a reason: it forks an additional thread and thus affects the reliability of measurements.
--stdev
Target relative standard deviation of measurements in percent (5% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise. If a run takes far too long, consider setting --timeout, which will interrupt benchmarks, potentially before reaching the target deviation.
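Because benchmarks are ordinary tasty tests, standard tasty options can presumably also be set from code for a part of the benchmark tree. The sketch below uses localOption and mkTimeout from Test.Tasty (so the benchmark section would also need tasty in build-depends) to give one group a 5-second timeout. The group name and the fibo function are illustrative, and it is an assumption that a locally set timeout is treated like --timeout.

```haskell
import Test.Tasty (localOption, mkTimeout)
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

-- | Only the "slow" group gets a timeout; mkTimeout takes microseconds.
benchmarks :: [Benchmark]
benchmarks =
  [ bench "tenth" $ nf fibo 10
  , localOption (mkTimeout 5000000) $
      bgroup "slow"
        [ bench "thirtieth" $ nf fibo 30 ]
  ]
```

Pass the list to defaultMain to run it. Restricting the timeout to the group that needs it keeps the remaining benchmarks free of the extra thread mentioned under --timeout.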

Running Benchmark

defaultMain :: [Benchmark] -> IO ()

Run benchmarks and report results.

Wrapper around tasty's defaultMain (+ csvReporter) to provide an interface compatible with the defaultMain of criterion and gauge.

type Benchmark = TestTree

Benchmarks are actually just a regular TestTree in disguise.

This is a drop-in replacement for the Benchmark type of criterion and gauge.

bench :: String -> Benchmarkable -> Benchmark

Attach a name to a Benchmarkable.

This is actually a synonym of tasty's singleTest, provided for compatibility with the bench function of criterion and gauge.

bgroup :: String -> [Benchmark] -> Benchmark

Attach a name to a group of Benchmarks.

This is actually a synonym of tasty's testGroup, provided for compatibility with the bgroup function of criterion and gauge.

Creating Benchmarkable

data Benchmarkable

Something that can be benchmarked.

Drop-in replacement for the Benchmarkable type of criterion and gauge.

Instances

IsTest Benchmarkable (defined in Test.Tasty.Bench)

nf :: NFData b => (a -> b) -> a -> Benchmarkable

nf f x measures time to compute a normal form (by means of rnf) of f x.

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios (imagine benchmarking tail), especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.

Drop-in replacement for the nf function of criterion and gauge.

whnf :: (a -> b) -> a -> Benchmarkable

whnf f x measures time to compute a weak head normal form of f x.

Computing only a weak head normal form is rarely what is intuitively meant by "evaluation". Unless you understand precisely what is measured, it is recommended to use nf instead.

Drop-in replacement for the whnf function of criterion and gauge.
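The difference between nf and whnf is easiest to see on a lazy list. In the sketch below (benchmark names and input size are illustrative), whnf only evaluates map (+ 1) xs up to its first cons cell, whereas nf forces every element, so the two benchmarks are expected to report very different times.

```haskell
import Test.Tasty.Bench

-- | The same function and input, evaluated to different depths.
benchmarks :: [Benchmark]
benchmarks =
  [ bgroup "map over 10000 elements"
    [ bench "whnf" $ whnf (map (+ 1)) xs -- stops at the outermost constructor
    , bench "nf"   $ nf   (map (+ 1)) xs -- forces the whole result list
    ]
  ]
  where
    xs = [1 .. 10000] :: [Int]
```

Pass the list to defaultMain to run it. If the whnf variant appears implausibly fast, that is usually a sign that most of the work was never performed rather than that the function is cheap.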

nfIO :: NFData a => IO a -> Benchmarkable

nfIO x measures time to evaluate side effects of x and compute its normal form (by means of rnf).

A pure subexpression of an effectful computation x may be evaluated only once and then cached; use nfAppIO to avoid this.

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios, especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.

Drop-in replacement for the nfIO function of criterion and gauge.

whnfIO :: NFData a => IO a -> Benchmarkable

whnfIO x measures time to evaluate side effects of x and compute its weak head normal form.

A pure subexpression of an effectful computation x may be evaluated only once and then cached; use whnfAppIO to avoid this.

Computing only a weak head normal form is rarely what is intuitively meant by "evaluation". Unless you understand precisely what is measured, it is recommended to use nfIO instead.

Drop-in replacement for the whnfIO function of criterion and gauge.

nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable

nfAppIO f x measures time to evaluate side effects of f x and compute its normal form (by means of rnf).

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios, especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.

Drop-in replacement for the nfAppIO function of criterion and gauge.
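The caching caveat for nfIO can be illustrated with a pure sum wrapped in an IO action. In this sketch (names and sizes are illustrative), the first benchmark passes a closed action whose pure part may be computed once and shared between iterations, while the second passes a function and its argument separately, so the sum depends on the argument supplied at each run. Whether sharing actually happens depends on GHC and optimization settings.

```haskell
import Test.Tasty.Bench

benchmarks :: [Benchmark]
benchmarks =
  [ bgroup "sum of 1..n"
    [ -- The pure subexpression 'sum [1 .. n]' may be evaluated only once.
      bench "nfIO"    $ nfIO (pure (sum [1 .. n]))
      -- The sum is recomputed from the argument passed to the function.
    , bench "nfAppIO" $ nfAppIO (\k -> pure (sum [1 .. k])) n
    ]
  ]
  where
    n = 100000 :: Int
```

Pass the list to defaultMain to run it. A large gap between the two results would suggest that the nfIO variant measured a cached value rather than the computation itself.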

whnfAppIO :: (a -> IO b) -> a -> Benchmarkable

whnfAppIO f x measures time to evaluate side effects of f x and compute its weak head normal form.

Computing only a weak head normal form is rarely what is intuitively meant by "evaluation". Unless you understand precisely what is measured, it is recommended to use nfAppIO instead.

Drop-in replacement for the whnfAppIO function of criterion and gauge.

CSV ingredient

csvReporter :: Ingredient

Add this ingredient to run benchmarks and save results in CSV format. It activates when the --csv FILE command-line option is specified.

defaultMainWithIngredients [listingTests, csvReporter, consoleTestReporter] benchmarks

Remember that successful activation of an ingredient suppresses all subsequent ingredients. If you wish to produce CSV in addition to other reports, use composeReporters:

defaultMainWithIngredients [listingTests, composeReporters csvReporter consoleTestReporter] benchmarks
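For completeness, here is a sketch of a module that wires the composed reporter together with the imports it needs. The names ingredients and benchMain are hypothetical; the imported functions come from tasty, so the benchmark section would also need tasty in build-depends. Wrapping the list in a group called "All" is a choice made here to mirror the console output shown earlier.

```haskell
import Test.Tasty (defaultMainWithIngredients)
import Test.Tasty.Bench
import Test.Tasty.Ingredients (Ingredient, composeReporters)
import Test.Tasty.Ingredients.Basic (consoleTestReporter, listingTests)

-- | Listing support, plus CSV and console reporting at the same time.
ingredients :: [Ingredient]
ingredients = [listingTests, composeReporters csvReporter consoleTestReporter]

-- | Hypothetical replacement for 'defaultMain' that uses the list above.
-- 'defaultMainWithIngredients' expects a single tree, hence the group.
benchMain :: [Benchmark] -> IO ()
benchMain = defaultMainWithIngredients ingredients . bgroup "All"
```

Use it as main = benchMain [...] in place of defaultMain. Test.Tasty is imported with an explicit list because it exports its own defaultMain, which would otherwise clash with the one from Test.Tasty.Bench.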