| Copyright | (c) 2021 Andrew Lelechenko |
|---|---|
| License | MIT |
| Safe Haskell | Safe-Inferred |
| Language | Haskell2010 |
Test.Tasty.Bench
Description
Featherlight benchmark framework (only one file!) for performance
measurement with API
mimicking criterion
and gauge.
A prominent feature is built-in comparison against previous runs
and between benchmarks.
How lightweight is it?
There is only one source file Test.Tasty.Bench and no non-boot
dependencies except tasty. So
if you already depend on tasty for a test suite, there is nothing else
to install.
Compare this to criterion (10+ modules, 50+ dependencies) and gauge
(40+ modules, depends on basement and vector). A build on a clean machine is up to 16x
faster than criterion and up to 4x faster than gauge. A build without dependencies
is up to 6x faster than criterion and up to 8x faster than gauge.
tasty-bench is a native Haskell library and works wherever GHC
does. We support a full range of architectures (i386, amd64, armhf,
arm64, ppc64le, s390x) and operating systems (Linux, Windows, MacOS,
FreeBSD), plus any GHC from 7.0 to 9.4.
How is it possible?
Our benchmarks are literally regular tasty tests, so we can leverage
all existing machinery for command-line options, resource management,
structuring, listing and filtering benchmarks, running and reporting
results. It also means that tasty-bench can be used in conjunction
with other tasty ingredients.
Unlike criterion and gauge we use a very simple statistical model
described below. This is arguably a questionable choice, but it works
pretty well in practice. A rare developer is sufficiently well-versed in
probability theory to make sense and use of all numbers generated by
criterion.
How to switch?
Cabal mixins
allow you to taste tasty-bench instead of criterion or gauge without
changing a single line of code:
cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion, Test.Tasty.Bench as Criterion.Main, Test.Tasty.Bench as Gauge, Test.Tasty.Bench as Gauge.Main)

This works vice versa as well: if you use tasty-bench, but at some
point need a more comprehensive statistical analysis, it is easy to
switch temporarily back to criterion.
How to write a benchmark?
Benchmarks are declared in a separate section of cabal file:
cabal-version: 2.0
name:          bench-fibo
version:       0.0
build-type:    Simple
synopsis:      Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
  ghc-options:   "-with-rtsopts=-A32m"
And here is BenchFibo.hs:
import Test.Tasty.Bench
fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)
main :: IO ()
main = defaultMain
[ bgroup "fibonacci numbers"
[ bench "fifth" $ nf fibo 5
, bench "tenth" $ nf fibo 10
, bench "twentieth" $ nf fibo 20
]
]

Since tasty-bench provides an API compatible with criterion, one can
refer to
its documentation
for more examples.
How to read results?
Running the example above (cabal bench or stack bench) results in
the following output:
All
fibonacci numbers
fifth: OK (2.13s)
63 ns ± 3.4 ns
tenth: OK (1.71s)
809 ns ± 73 ns
twentieth: OK (3.39s)
104 μs ± 4.9 μs
All 3 tests passed (7.25s)

The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall-clock time), its predicted mean CPU time was 63 nanoseconds, and means of individual samples do not often diverge from it further than ±3.4 nanoseconds (twice the standard deviation). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.
Wall-clock time vs. CPU time
What time are we talking about?
Both criterion and gauge by default report wall-clock time, which is
affected by any other application which runs concurrently.
Ideally benchmarks are executed on a dedicated server without any other load,
but — let's face the truth — most developers run benchmarks
on a laptop with a hundred other services and a window manager, and
watch videos while waiting for benchmarks to finish. That's the cause
of a notorious "variance introduced by outliers: 88% (severely inflated)" warning.
To alleviate this issue tasty-bench measures CPU time by getCPUTime
instead of wall-clock time by default.
It does not provide perfect isolation from other processes (e. g.,
if CPU cache is spoiled by others, populating data back from RAM
is your burden), but is a bit more stable.
Caveat: this means that for multithreaded algorithms
tasty-bench reports total elapsed CPU time across all cores, while
criterion and gauge print the maximum of per-core wall-clock times.
It also means that by default tasty-bench does not measure time spent out of process,
e. g., calls to other executables. To work around this limitation
use --time-mode command-line option or set it locally via TimeMode option.
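For example, the following sketch switches one group to wall-clock measurements while leaving the rest of the suite on the default CPU time; here threadDelay merely stands in for a workload that spends time outside of getCPUTime's view (other processes, other cores):

import Control.Concurrent (threadDelay)
import Test.Tasty (localOption)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- wall-clock time for this group only; the rest of the suite keeps CPU time
    localOption WallTime $ bgroup "external waits"
      [ bench "sleep 1 ms" $ whnfIO (threadDelay 1000) ]
  ]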
Statistical model
Here is a procedure used by tasty-bench to measure execution time:
- Set \( n \leftarrow 1 \).
- Measure execution time \( t_n \) of \( n \) iterations and execution time \( t_{2n} \) of \( 2n \) iterations.
- Find \( t \) which minimizes deviation of \( (nt, 2nt) \) from \( (t_n, t_{2n}) \), namely \( t \leftarrow (t_n + 2t_{2n}) / 5n \).
- If deviation is small enough (see --stdev below) or time is running out soon (see --timeout below), return \( t \) as a mean execution time.
- Otherwise set \( n \leftarrow 2n \) and jump back to Step 2.
This is roughly similar to the linear regression approach which
criterion takes, but we fit only two last points. This allows us to
simplify away all heavy-weight statistical analysis. More importantly,
earlier measurements, which are presumably shorter and noisier, do not
affect overall result. This is in contrast to criterion, which fits
all measurements and is biased to use more data points corresponding to
shorter runs (it employs \( n \leftarrow 1.05n \) progression).
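For illustration, here is a simplified sketch of that doubling loop in Haskell. It is not tasty-bench's actual implementation: runIters is a hypothetical action returning the time of n iterations, and the deviation estimate is deliberately crude.

import Data.Word (Word64)

-- Not tasty-bench's real code: a stripped-down model of the loop described above.
-- 'runIters' is a hypothetical action returning the time (in seconds) of n iterations.
measureLoop :: (Word64 -> IO Double) -> Double -> IO Double
measureLoop runIters targetRelStDev = go 1
  where
    go n = do
      tN  <- runIters n             -- t_n
      t2N <- runIters (2 * n)       -- t_2n
      let nD  = fromIntegral n
          t   = (tN + 2 * t2N) / (5 * nD)  -- least-squares fit of (nt, 2nt) to (t_n, t_2n)
          dev = sqrt ((tN - nD * t) ^ (2 :: Int) + (t2N - 2 * nD * t) ^ (2 :: Int)) / (2 * nD)
      if dev <= targetRelStDev * t
        then pure t                 -- precise enough (a timeout check would also go here)
        else go (2 * n)             -- otherwise double n and repeat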
Mean time and its deviation does not say much about the distribution of individual timings. E. g., imagine a computation which (according to a coarse system timer) takes either 0 ms or 1 ms with equal probability. While one would be able to establish that its mean time is 0.5 ms with a very small deviation, this does not imply that individual measurements are anywhere near 0.5 ms. Even assuming an infinite precision of a system timer, the distribution of individual times is not known to be normal.
Obligatory disclaimer: statistics is a tricky matter, there is no
one-size-fits-all approach. In the absence of a good theory simplistic
approaches are as (un)sound as obscure ones. Those who seek statistical
soundness should rather collect raw data and process it themselves using
a proper statistical toolbox. Data reported by tasty-bench is only of
indicative and comparative significance.
Memory usage
Configuring RTS to collect GC statistics
(e. g., via cabal bench --benchmark-options '+RTS -T' or
stack bench --ba '+RTS -T') enables tasty-bench to estimate and
report memory usage:
All
fibonacci numbers
fifth: OK (2.13s)
63 ns ± 3.4 ns, 223 B allocated, 0 B copied, 2.0 MB peak memory
tenth: OK (1.71s)
809 ns ± 73 ns, 2.3 KB allocated, 0 B copied, 4.0 MB peak memory
twentieth: OK (3.39s)
104 μs ± 4.9 μs, 277 KB allocated, 59 B copied, 5.0 MB peak memory
All 3 tests passed (7.25s)

This data is reported as per RTSStats fields: allocated_bytes, copied_bytes
and max_mem_in_use_bytes.
Combining tests and benchmarks
When optimizing an existing function, it is important to check that its
observable behavior remains unchanged. One can rebuild both tests and
benchmarks after each change, but it would be more convenient to run
sanity checks within the benchmark suite itself. Since our benchmarks are
compatible with tasty tests, we can easily do so.
Imagine you come up with a faster function myFibo to generate
Fibonacci numbers:
import Test.Tasty.Bench
import Test.Tasty.QuickCheck -- from tasty-quickcheck package

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

myFibo :: Int -> Integer
myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2)

main :: IO ()
main = Test.Tasty.Bench.defaultMain -- not Test.Tasty.defaultMain
  [ bench "fibo 20" $ nf fibo 20
  , bench "myFibo 20" $ nf myFibo 20
  , testProperty "myFibo = fibo" $ \n -> fibo n === myFibo n
  ]
This outputs:
All
fibo 20: OK (3.02s)
104 μs ± 4.9 μs
myFibo 20: OK (1.99s)
71 μs ± 5.3 μs
myFibo = fibo: FAIL
*** Failed! Falsified (after 5 tests and 1 shrink):
2
1 /= 2
Use --quickcheck-replay=927711 to reproduce.
1 out of 3 tests failed (5.03s)

We see that myFibo is indeed significantly faster than fibo, but
unfortunately does not do the same thing. One should probably look for
another way to speed up generation of Fibonacci numbers.
Troubleshooting
- If benchmarks take too long, set --timeout to limit execution time of individual benchmarks, and tasty-bench will do its best to fit into a given time frame. Without --timeout we rerun benchmarks until achieving a target precision set by --stdev, which in a noisy environment of a modern laptop with GUI may take a lot of time.
  While criterion runs each benchmark at least for 5 seconds, tasty-bench is happy to conclude earlier, if it does not compromise the quality of results. In our experiments tasty-bench suites tend to finish earlier, even if some individual benchmarks take longer than with criterion.
  A common source of noisiness is garbage collection. Setting a larger allocation area (nursery) is often a good idea, either via cabal bench --benchmark-options '+RTS -A32m' or stack bench --ba '+RTS -A32m'. Alternatively bake it into the cabal file as ghc-options: "-with-rtsopts=-A32m".
  For GHC ≥ 8.10 consider switching benchmarks to a non-moving garbage collector, because it decreases GC pauses and corresponding noise: +RTS --nonmoving-gc.
- Never compile benchmarks with -fstatic-argument-transformation, because it breaks a trick we use to force GHC into reevaluation of the same function application over and over again.
- If benchmark results look malformed like below, make sure that you are invoking Test.Tasty.Bench.defaultMain and not Test.Tasty.defaultMain (the difference is consoleBenchReporter vs. consoleTestReporter):
  All
    fibo 20: OK (1.46s)
      Response {respEstimate = Estimate {estMean = Measurement {measTime = 87496728, measAllocs = 0, measCopied = 0}, estStdev = 694487}, respIfSlower = FailIfSlower Infinity, respIfFaster = FailIfFaster Infinity}
- If benchmarks fail with an error message
  Unhandled resource. Probably a bug in the runner you're using.
  or
  Unexpected state of the resource (NotCreated) in getResource. Report as a tasty bug.
  this is likely caused by env or envWithCleanup affecting benchmarks structure. You can use env to read test data from IO, but not to read benchmark names or affect their hierarchy in other way. This is a fundamental restriction of tasty to list and filter benchmarks without launching missiles.
- If benchmarks fail with Test dependencies form a loop or Test dependencies have cycles, this is likely because of bcompare, which compares a benchmark with itself. Locating a benchmark in a global environment may be tricky, please refer to tasty documentation for details and consider using locateBenchmark.
Isolating interfering benchmarks
One difficulty of benchmarking in Haskell is that it is hard to isolate benchmarks so that they do not interfere. Changing the order of benchmarks or skipping some of them has an effect on heap’s layout and thus affects garbage collection. This issue is well attested in both criterion and gauge.
Usually (but not always) skipping some benchmarks speeds up remaining ones. That’s because once a benchmark allocated heap which for some reason was not promptly released afterwards (e. g., it forced a top-level thunk in an underlying library), all further benchmarks are slowed down by garbage collector processing this additional amount of live data over and over again.
There are several mitigation strategies. First of all, giving garbage
collector more breathing space by +RTS -A32m (or more) is often good
enough.
Further, avoid using top-level bindings to store large test data. Once
such thunks are forced, they remain allocated forever, which
detrimentally affects subsequent unrelated benchmarks. Treat them as external
data, supplied via env: instead of
largeData :: String
largeData = replicate 1000000 'a'

main :: IO ()
main = defaultMain
  [ bench "large" $ nf length largeData, ... ]
use
import Control.DeepSeq (force)
import Control.Exception (evaluate)
main :: IO ()
main = defaultMain
[ env (evaluate (force (replicate 1000000 'a'))) $ \largeData ->
bench "large" $ nf length largeData, ... ]Finally, as an ultimate measure to reduce interference between benchmarks, one can run each of them in a separate process. We do not quite recommend this approach, but if you are desperate, here is how.
Assuming that a benchmark is declared in cabal file as
benchmark my-bench component, let’s first find its executable:
cabal build --enable-benchmarks my-bench
MYBENCH=$(cabal list-bin my-bench) # available since cabal-3.4
Now list all benchmark names (hopefully, they do not contain newlines), escape quotes and slashes, and run each of them separately:
$MYBENCH -l | sed -e 's/[\"]/\\\\\\&/g' | while read -r name; do $MYBENCH -p '$0 == "'"$name"'"'; done
Comparison against baseline
One can compare benchmark results against an earlier baseline in an
automatic way. To use this feature, first run tasty-bench with
--csv FILE option to dump results to FILE in CSV format
(it could be a good idea to set smaller --stdev, if possible):
Name,Mean (ps),2*Stdev (ps)
All.fibonacci numbers.fifth,48453,4060
All.fibonacci numbers.tenth,637152,46744
All.fibonacci numbers.twentieth,81369531,3342646
Now modify implementation and rerun benchmarks with --baseline FILE
option. This produces a report as follows:
All
fibonacci numbers
fifth: OK (0.44s)
53 ns ± 2.7 ns, 8% more than baseline
tenth: OK (0.33s)
641 ns ± 59 ns, same as baseline
twentieth: OK (0.36s)
77 μs ± 6.4 μs, 5% less than baseline
All 3 tests passed (1.50s)

You can also fail benchmarks, which deviate too far from baseline, using
--fail-if-slower and --fail-if-faster options. For example, setting
both of them to 6 will fail the first benchmark above (because it is
more than 6% slower), but the last one still succeeds (even though it is
measurably faster than baseline, the deviation is less than 6%). Consider
also using --hide-successes to show only problematic benchmarks, or
even tasty-rerun
package to focus on rerunning failing items only.
If you wish to compare two CSV reports non-interactively, here is a handy awk incantation:
awk 'BEGIN{FS=",";OFS=",";print "Name,Old,New,Ratio"}FNR==1{trueNF=NF;next}NF<trueNF{print "Benchmark names should not contain newlines";exit 1}FNR==NR{oldTime=$(NF-trueNF+2);NF-=trueNF-1;a[$0]=oldTime;next}{newTime=$(NF-trueNF+2);NF-=trueNF-1;print $0,a[$0],newTime,newTime/a[$0];gs+=log(newTime/a[$0]);gc++}END{if(gc>0)print "Geometric mean,,",exp(gs/gc)}' old.csv new.csv

Note that columns in CSV report are different from what criterion or gauge
would produce. If names do not contain commas, missing columns can be faked this way:
awk 'BEGIN{FS=",";OFS=",";print "Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);stddev=$(NF-trueNF+3);NF-=trueNF-1;print $0,mean/1e12,mean/1e12,mean/1e12,stddev/2e12,stddev/2e12,stddev/2e12}'

To fake gauge in --csvraw mode use
awk 'BEGIN{FS=",";OFS=",";print "name,iters,time,cycles,cpuTime,utime,stime,maxrss,minflt,majflt,nvcsw,nivcsw,allocated,numGcs,bytesCopied,mutatorWallSeconds,mutatorCpuSeconds,gcWallSeconds,gcCpuSeconds"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);fourth=$(NF-trueNF+4);fifth=$(NF-trueNF+5);sixth=$(NF-trueNF+6);NF-=trueNF-1;print $0,1,mean/1e12,0,mean/1e12,mean/1e12,0,sixth+0,0,0,0,0,fourth+0,0,fifth+0,0,0,0,0}'

Comparison between benchmarks
You can also compare benchmarks to each other without any external tools, all in the comfort of your terminal.
import Test.Tasty.Bench
fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)
main :: IO ()
main = defaultMain
[ bgroup "fibonacci numbers"
[ bcompare "tenth" $ bench "fifth" $ nf fibo 5
, bench "tenth" $ nf fibo 10
, bcompare "tenth" $ bench "twentieth" $ nf fibo 20
]
  ]

This produces a report, comparing mean times of fifth and twentieth
to tenth:
All
fibonacci numbers
fifth: OK (16.56s)
121 ns ± 2.6 ns, 0.08x
tenth: OK (6.84s)
1.6 μs ± 31 ns
twentieth: OK (6.96s)
203 μs ± 4.1 μs, 128.36x

To locate a baseline benchmark in a larger suite use locateBenchmark.
One can leverage comparisons between benchmarks to implement portable performance
tests, expressing properties like "this algorithm must be at least twice as fast
as that one" or "this operation should not be more than three times slower than that one".
This can be achieved with bcompareWithin, which takes an acceptable interval
of performance as an argument.
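For instance, building on the Fibonacci example above, one might assert that computing the twentieth number stays within a chosen factor of the tenth (the bounds below are purely illustrative):

import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bench "tenth" $ nf fibo 10
      -- fails unless "twentieth" is between 10x and 1000x slower than "tenth"
    , bcompareWithin 10 1000 "tenth" $ bench "twentieth" $ nf fibo 20
    ]
  ]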
Plotting results
Users can dump results into CSV with --csv FILE and plot them using
gnuplot or other software. But for convenience there is also a
built-in quick-and-dirty SVG plotting feature, which can be invoked by
passing --svg FILE.
Build flags
Build flags are a brittle subject and users do not normally need to touch them.
- If you find yourself in an environment where tasty is not available and you have access to boot packages only, you can still use tasty-bench! Just copy Test/Tasty/Bench.hs to your project (imagine it like a header-only C library). It will provide you with functions to build Benchmarkable and run them manually via measureCpuTime. This mode of operation can also be configured by disabling Cabal flag tasty.
- If results are amiss or oscillate wildly and adjusting --timeout and --stdev does not help, you may be interested to investigate individual timings of successive runs by enabling Cabal flag debug. This will pipe raw data into stderr.
Command-line options
Use --help to list command-line options.
-p, --pattern
  This is a standard tasty option, which allows filtering benchmarks by a pattern or awk expression. Please refer to tasty documentation for details.
-t, --timeout
  This is a standard tasty option, setting timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long: tasty-bench will make an effort to report results (even if of subpar quality) before timeout. Setting timeout too tight (insufficient for at least three iterations) will result in a benchmark failure. One can adjust it locally for a group of benchmarks, e. g., localOption (mkTimeout 100000000) for 100 seconds.
--stdev
  Target relative standard deviation of measurements in percents (5% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise. It can also be adjusted locally for a group of benchmarks, e. g., localOption (RelStDev 0.02). If benchmarking takes far too long, consider setting --timeout, which will interrupt benchmarks, potentially before reaching the target deviation.
--csv
  File to write results in CSV format.
--baseline
  File to read baseline results in CSV format (as produced by --csv).
--fail-if-slower, --fail-if-faster
  Upper bounds of acceptable slow down / speed up in percents. If a benchmark is unacceptably slower / faster than baseline (see --baseline), it will be reported as failed. Can be used in conjunction with a standard tasty option --hide-successes to show only problematic benchmarks. Both options can be adjusted locally for a group of benchmarks, e. g., localOption (FailIfSlower 0.10).
--svg
  File to plot results in SVG format.
--time-mode
  Whether to measure CPU time (cpu, default) or wall-clock time (wall).
+RTS -T
  Estimate and report memory usage.
Custom command-line options
As usual with tasty, it is easy to extend benchmarks with custom command-line options.
Here is an example:
import Data.Proxy
import Test.Tasty.Bench
import Test.Tasty.Ingredients.Basic
import Test.Tasty.Options
import Test.Tasty.Runners
newtype RandomSeed = RandomSeed Int
instance IsOption RandomSeed where
defaultValue = RandomSeed 42
parseValue = fmap RandomSeed . safeRead
optionName = pure "seed"
optionHelp = pure "Random seed used in benchmarks"
main :: IO ()
main = do
let customOpts = [Option (Proxy :: Proxy RandomSeed)]
ingredients = includingOptions customOpts : benchIngredients
opts <- parseOptions ingredients benchmarks
let RandomSeed seed = lookupOption opts
defaultMainWithIngredients ingredients benchmarks
benchmarks :: Benchmark
benchmarks = bgroup "All" []

Synopsis
- defaultMain ∷ [Benchmark] → IO ()
- type Benchmark = TestTree
- bench ∷ String → Benchmarkable → Benchmark
- bgroup ∷ String → [Benchmark] → Benchmark
- bcompare ∷ String → Benchmark → Benchmark
- bcompareWithin ∷ Double → Double → String → Benchmark → Benchmark
- env ∷ NFData env ⇒ IO env → (env → Benchmark) → Benchmark
- envWithCleanup ∷ NFData env ⇒ IO env → (env → IO a) → (env → Benchmark) → Benchmark
- newtype Benchmarkable = Benchmarkable {unBenchmarkable ∷ Word64 → IO ()}
- nf ∷ NFData b ⇒ (a → b) → a → Benchmarkable
- whnf ∷ (a → b) → a → Benchmarkable
- nfIO ∷ NFData a ⇒ IO a → Benchmarkable
- whnfIO ∷ IO a → Benchmarkable
- nfAppIO ∷ NFData b ⇒ (a → IO b) → a → Benchmarkable
- whnfAppIO ∷ (a → IO b) → a → Benchmarkable
- measureCpuTime ∷ Timeout → RelStDev → Benchmarkable → IO Double
- benchIngredients ∷ [Ingredient]
- consoleBenchReporter ∷ Ingredient
- csvReporter ∷ Ingredient
- svgReporter ∷ Ingredient
- newtype RelStDev = RelStDev Double
- newtype FailIfSlower = FailIfSlower Double
- newtype FailIfFaster = FailIfFaster Double
- newtype CsvPath = CsvPath FilePath
- newtype BaselinePath = BaselinePath FilePath
- newtype SvgPath = SvgPath FilePath
- data TimeMode
- locateBenchmark ∷ [String] → Expr
- mapLeafBenchmarks ∷ ([String] → Benchmark → Benchmark) → Benchmark → Benchmark
Running Benchmark
defaultMain ∷ [Benchmark] → IO () Source #
Run benchmarks and report results, providing an interface
compatible with Criterion.defaultMain and
Gauge.defaultMain.
Since: 0.1
bench ∷ String → Benchmarkable → Benchmark Source #
Attach a name to Benchmarkable.
This is actually a synonym of singleTest to
provide an interface compatible with Criterion.bench
and Gauge.bench.
Since: 0.1
Arguments
| ∷ String | |
| → Benchmark | Benchmark (or a group of benchmarks) to be compared against the baseline benchmark by dividing measured mean times. The result is reported by consoleBenchReporter. |
| → Benchmark | |
Compare benchmarks, reporting relative speed up or slow down.
This function is a vague reminiscence of bcompare, which existed in pre-1.0
versions of criterion, but their types are incompatible. Under the hood
bcompare is a thin wrapper over after and requires tasty-1.2.
If you use bcompare, it is prudent to add tasty >= 1.2 to build-depends
section of your cabal file.
Here is a basic example:
import Test.Tasty.Bench
fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)
main :: IO ()
main = defaultMain
[ bgroup "fibonacci numbers"
[ bcompare "tenth" $ bench "fifth" $ nf fibo 5
, bench "tenth" $ nf fibo 10
, bcompare "tenth" $ bench "twentieth" $ nf fibo 20
]
  ]

More complex examples:
- https://hackage.haskell.org/package/chimera-0.3.2.0/src/bench/Bench.hs
- https://hackage.haskell.org/package/fast-digits-0.3.1.0/src/bench/Bench.hs
- https://hackage.haskell.org/package/unicode-data-0.3.0/src/bench/Main.hs
Since: 0.2.4
Arguments
| ∷ Double | Lower bound of relative speed up. |
| → Double | Upper bound of relative speed up. |
| → String | |
| → Benchmark | Benchmark to compare against baseline. |
| → Benchmark |
Same as bcompare, but takes expected lower and upper bounds of
comparison. If the result is not within the provided bounds, the benchmark fails.
This allows one to create portable performance tests: instead of comparing
to an absolute timeout or to previous runs, you can state that one implementation
of an algorithm must be faster than another.
E. g., bcompareWithin 2.0 3.0 passes only if a benchmark is at least 2x
and at most 3x slower than a baseline.
Since: 0.3.1
env ∷ NFData env ⇒ IO env → (env → Benchmark) → Benchmark Source #
Run benchmarks in the given environment, usually reading large input data from file.
One might wonder why env is needed,
when we can simply read all input data
before calling defaultMain. The reason is that large data
dangling in the heap causes longer garbage collection
and slows down all benchmarks, even those which do not use it at all.
It is instrumental not only for proper IO actions,
but also for large statically-known data. Instead of a top-level
definition, which once evaluated will slow down garbage collection
during all subsequent benchmarks,
largeData :: String
largeData = replicate 1000000 'a'

main :: IO ()
main = defaultMain
  [ bench "large" $ nf length largeData, ... ]
use
import Control.DeepSeq (force)
import Control.Exception (evaluate)
main :: IO ()
main = defaultMain
[ env (evaluate (force (replicate 1000000 'a'))) $ \largeData ->
bench "large" $ nf length largeData, ... ]Test.Tasty.Bench.env is provided only for the sake of
compatibility with Criterion.env and
Gauge.env, and involves unsafePerformIO. Consider using
withResource instead.
defaultMain requires that the hierarchy of benchmarks and their names is
independent of underlying IO actions. While executing IO inside bench
via nfIO is fine, and reading test data from files via env is also fine,
using env to choose benchmarks or their names depending on IO side effects
will throw a rather cryptic error message:
Unhandled resource. Probably a bug in the runner you're using.
Since: 0.2
envWithCleanup ∷ NFData env ⇒ IO env → (env → IO a) → (env → Benchmark) → Benchmark Source #
Similar to env, but includes an additional argument
to clean up created environment.
Provided only for the sake of compatibility with
Criterion.envWithCleanup and
Gauge.envWithCleanup, and involves
unsafePerformIO. Consider using withResource instead.
Since: 0.2
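A possible use is a temporary input file which is removed after the run. This is only a sketch: the file name is arbitrary and removeFile comes from the directory package.

import Control.DeepSeq (force)
import Control.Exception (evaluate)
import System.Directory (removeFile) -- from the directory package
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ envWithCleanup
      -- create the resource: write a temporary file and return its (forced) name
      (do writeFile "input.tmp" (replicate 1000000 'a')
          evaluate (force "input.tmp"))
      -- clean up: delete the file once the benchmarks in this subtree are done
      removeFile
      -- use the resource
      (\path -> bench "length of file" $ nfIO (length <$> readFile path))
  ]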
Creating Benchmarkable
newtype Benchmarkable Source #
Something that can be benchmarked, produced by nf, whnf, nfIO, whnfIO,
nfAppIO, whnfAppIO below.
Drop-in replacement for Criterion.Benchmarkable and
Gauge.Benchmarkable.
Since: 0.1
Constructors
| Benchmarkable | Since: 0.3 |
Fields
| unBenchmarkable ∷ Word64 → IO () | |
Instances
| IsTest Benchmarkable Source # | |
Defined in Test.Tasty.Bench Methods run ∷ OptionSet → Benchmarkable → (Progress → IO ()) → IO Result Source # testOptions ∷ Tagged Benchmarkable [OptionDescription] Source # | |
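The constructor can also be used directly: as its type suggests, the Word64 argument is the number of times to run the payload in one batch. A small sketch, reusing fibo from the examples above (normally nf and friends below do this wrapping for you):

import Control.Exception (evaluate)
import Control.Monad (forM_)
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

-- A hand-rolled payload: run 'fibo 20' the requested number of times.
manualFibo :: Benchmarkable
manualFibo = Benchmarkable $ \iters -> forM_ [1 .. iters] $ \_ -> evaluate (fibo 20)

main :: IO ()
main = defaultMain [ bench "fibo 20, manual" manualFibo ]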
nf ∷ NFData b ⇒ (a → b) → a → Benchmarkable Source #
nf f x measures time to compute
a normal form (by means of force) of an application of f to x.
This does not include time to evaluate f or x themselves.
Ideally x should be a primitive data type like Int.
The same thunk of x is shared by multiple calls of f. We cannot evaluate
x beforehand: there is no NFData a constraint, and potentially x may
be an infinite structure. Thus x will be evaluated in the course of the first
application of f. This noisy measurement is to be discarded soon,
but if x is not a primitive data type, consider forcing its evaluation
separately, e. g., via env or withResource.
Here is a textbook anti-pattern: nf sum [1..1000000].
Since an input list is shared by multiple invocations of sum,
it will be allocated in memory in full, putting immense pressure
on garbage collector. Also no list fusion will happen.
A better approach is nf (\n -> sum [1..n]) 1000000.
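For illustration, here are the two variants side by side as a runnable module:

import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- anti-pattern: the input list is shared between iterations and retained in full
    bench "sum, shared input" $ nf sum [1 .. 1000000 :: Int]
    -- better: the list is built inside the measured function and can fuse away
  , bench "sum, fused" $ nf (\n -> sum [1 .. n]) (1000000 :: Int)
  ]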
If you are measuring an inlinable function,
it is prudent to ensure that its invocation is fully saturated,
otherwise inlining will not happen. That's why one can often
see nf (\n -> f n) x instead of nf f x.
Same applies to rewrite rules.
While tasty-bench is capable of performing micro- and even nanobenchmarks,
such measurements are noisy and involve an overhead. Results are more reliable
when f x takes at least several milliseconds.
Note that forcing a normal form requires an additional
traversal of the structure. In certain scenarios (imagine benchmarking tail),
especially when NFData instance is badly written,
this traversal may take non-negligible time and affect results.
Drop-in replacement for Criterion.nf and
Gauge.nf.
Since: 0.1
whnf ∷ (a → b) → a → Benchmarkable Source #
whnf f x measures time to compute
a weak head normal form of an application of f to x.
This does not include time to evaluate f or x themselves.
Ideally x should be a primitive data type like Int.
The same thunk of x is shared by multiple calls of f. We cannot evaluate
x beforehand: there is no NFData a constraint, and potentially x may
be an infinite structure. Thus x will be evaluated in the course of the first
application of f. This noisy measurement is to be discarded soon,
but if x is not a primitive data type, consider forcing its evaluation
separately, e. g., via env or withResource.
Computing only a weak head normal form is
rarely what is intuitively meant by "evaluation".
Beware that many educational materials contain examples with whnf:
this is a wrong default.
Unless you understand precisely what is measured,
it is recommended to use nf instead.
Here is a textbook anti-pattern: whnf (replicate 1000000) 1.
This will succeed in a matter of nanoseconds, because weak head
normal form forces only the first element of the list.
Drop-in replacement for Criterion.whnf and Gauge.whnf.
Since: 0.1
nfIO ∷ NFData a ⇒ IO a → Benchmarkable Source #
nfIO x measures time to evaluate side-effects of x
and compute its normal form (by means of force).
A pure subexpression of an effectful computation x
may be evaluated only once and cached.
To avoid surprising results it is usually preferable
to use nfAppIO instead.
Note that forcing a normal form requires an additional
traversal of the structure. In certain scenarios,
especially when NFData instance is badly written,
this traversal may take non-negligible time and affect results.
A typical use case is nfIO (readFile "foo.txt").
However, if your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env or withResource.
Drop-in replacement for Criterion.nfIO and Gauge.nfIO.
Since: 0.1
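To make the contrast with nfAppIO concrete, here are both spellings of the documentation's readFile example side by side (the file must exist, of course):

import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- pure parts of the baked-in action may be computed once and cached
    bench "nfIO readFile"    $ nfIO (readFile "foo.txt")
    -- the argument is supplied anew on each iteration, avoiding accidental sharing
  , bench "nfAppIO readFile" $ nfAppIO readFile "foo.txt"
  ]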
whnfIO ∷ IO a → Benchmarkable Source #
whnfIO x measures time to evaluate side-effects of x
and compute its weak head normal form.
A pure subexpression of an effectful computation x
may be evaluated only once and cached.
To avoid surprising results it is usually preferable
to use whnfAppIO instead.
Computing only a weak head normal form is
rarely what is intuitively meant by "evaluation".
Unless you understand precisely what is measured,
it is recommended to use nfIO instead.
Lazy I/O is treacherous.
If your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env or withResource.
Drop-in replacement for Criterion.whnfIO and Gauge.whnfIO.
Since: 0.1
nfAppIO ∷ NFData b ⇒ (a → IO b) → a → Benchmarkable Source #
nfAppIO f x measures time to evaluate side-effects of
an application of f to x and compute its normal form (by means of force).
This does not include time to evaluate f or x themselves.
Ideally x should be a primitive data type like Int.
The same thunk of x is shared by multiple calls of f. We cannot evaluate
x beforehand: there is no NFData a constraint, and potentially x may
be an infinite structure. Thus x will be evaluated in the course of the first
application of f. This noisy measurement is to be discarded soon,
but if x is not a primitive data type, consider forcing its evaluation
separately, e. g., via env or withResource.
Note that forcing a normal form requires an additional
traversal of the structure. In certain scenarios,
especially when NFData instance is badly written,
this traversal may take non-negligible time and affect results.
A typical use case is nfAppIO readFile "foo.txt".
However, if your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env or withResource.
Drop-in replacement for Criterion.nfAppIO and Gauge.nfAppIO.
Since: 0.1
whnfAppIO ∷ (a → IO b) → a → Benchmarkable Source #
whnfAppIO f x measures time to evaluate side-effects of
an application of f to x and compute its weak head normal form.
This does not include time to evaluate f or x themselves.
Ideally x should be a primitive data type like Int.
The same thunk of x is shared by multiple calls of f. We cannot evaluate
x beforehand: there is no NFData a constraint, and potentially x may
be an infinite structure. Thus x will be evaluated in the course of the first
application of f. This noisy measurement is to be discarded soon,
but if x is not a primitive data type, consider forcing its evaluation
separately, e. g., via env or withResource.
Computing only a weak head normal form is
rarely what is intuitively meant by "evaluation".
Unless you understand precisely what is measured,
it is recommended to use nfAppIO instead.
Lazy I/O is treacherous.
If your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env or withResource.
Drop-in replacement for Criterion.whnfAppIO and Gauge.whnfAppIO.
Since: 0.1
measureCpuTime ∷ Timeout → RelStDev → Benchmarkable → IO Double Source #
An internal routine to measure CPU execution time in seconds
for a given timeout (put NoTimeout, or mkTimeout 100000000 for 100 seconds)
and a target relative standard deviation
(put RelStDev 0.05 for 5% or RelStDev (1/0) to run only one iteration).
Timeout takes soft priority over RelStDev: this function prefers
to finish in time even if at the cost of precision. However, the timeout is guidance,
not a guarantee: measureCpuTime can take longer, if there is not enough time
to run at least thrice or an iteration takes unusually long.
Since: 0.3
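A minimal sketch of calling it directly (NoTimeout lives in Test.Tasty.Options; the workload is arbitrary):

import Test.Tasty.Bench (RelStDev (..), measureCpuTime, nf)
import Test.Tasty.Options (Timeout (NoTimeout))

main :: IO ()
main = do
  -- mean CPU time (in seconds) of one evaluation, aiming for 5% relative deviation
  t <- measureCpuTime NoTimeout (RelStDev 0.05) (nf (\n -> sum [1 .. n]) (1000000 :: Int))
  putStrLn ("Mean time, seconds: " ++ show t)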
Ingredients
benchIngredients ∷ [Ingredient] Source #
List of default benchmark ingredients. This is what defaultMain runs.
Since: 0.2
consoleBenchReporter ∷ Ingredient Source #
Run benchmarks and report results
in a manner similar to consoleTestReporter.
If --baseline FILE command line option is specified,
compare results against an earlier run and mark
too slow / too fast benchmarks as failed in accordance with
the bounds specified by --fail-if-slower PERCENT and --fail-if-faster PERCENT.
Since: 0.2
csvReporter ∷ Ingredient Source #
Run benchmarks and save results in CSV format.
It activates when --csv FILE command line option is specified.
Since: 0.1
svgReporter ∷ Ingredient Source #
Run benchmarks and plot results in SVG format.
It activates when --svg FILE command line option is specified.
Since: 0.2.4
newtype RelStDev Source #
In addition to --stdev command-line option,
one can adjust target relative standard deviation
for individual benchmarks and groups of benchmarks
using adjustOption and localOption.
E. g., set target relative standard deviation to 2% as follows:
import Test.Tasty (localOption)
localOption (RelStDev 0.02) (bgroup [...])
If you set RelStDev to infinity,
a benchmark will be executed
only once and its standard deviation will be recorded as zero.
This is rather a blunt approach, but it might be a necessary evil
for extremely long benchmarks. If you wish to run all benchmarks
only once, use command-line option --stdev Infinity.
Since: 0.2
newtype FailIfSlower Source #
In addition to --fail-if-slower command-line option,
one can adjust an upper bound of acceptable slow down
in comparison to baseline for
individual benchmarks and groups of benchmarks
using adjustOption and localOption.
E. g., set upper bound of acceptable slow down to 10% as follows:
import Test.Tasty (localOption)
localOption (FailIfSlower 0.10) (bgroup [...])
Since: 0.2
Constructors
| FailIfSlower Double |
Instances
| Read FailIfSlower Source # | |
Defined in Test.Tasty.Bench | |
| Show FailIfSlower Source # | |
Defined in Test.Tasty.Bench | |
| IsOption FailIfSlower Source # | |
Defined in Test.Tasty.Bench | |
newtype FailIfFaster Source #
In addition to --fail-if-faster command-line option,
one can adjust an upper bound of acceptable speed up
in comparison to baseline for
individual benchmarks and groups of benchmarks
using adjustOption and localOption.
E. g., set upper bound of acceptable speed up to 10% as follows:
import Test.Tasty (localOption)
localOption (FailIfFaster 0.10) (bgroup [...])
Since: 0.2
Constructors
| FailIfFaster Double |
Instances
| Read FailIfFaster Source # | |
Defined in Test.Tasty.Bench | |
| Show FailIfFaster Source # | |
Defined in Test.Tasty.Bench | |
| IsOption FailIfFaster Source # | |
Defined in Test.Tasty.Bench | |
newtype CsvPath Source #
A path to write results in CSV format, populated by --csv.
This is an option of csvReporter and can be set only globally.
Modifying it via adjustOption or localOption does not have any effect.
One can however pass it to tryIngredients benchIngredients. For example,
here is how to set a default CSV location:
import Data.Maybe
import System.Exit
import Test.Tasty.Bench
import Test.Tasty.Ingredients
import Test.Tasty.Options
import Test.Tasty.Runners
main :: IO ()
main = do
let benchmarks = bgroup "All" ...
opts <- parseOptions benchIngredients benchmarks
let opts' = changeOption (Just . fromMaybe (CsvPath "foo.csv")) opts
case tryIngredients benchIngredients opts' benchmarks of
Nothing -> exitFailure
Just mb -> mb >>= \b -> if b then exitSuccess else exitFailure
Since: 0.3
Instances
| IsOption (Maybe CsvPath) Source # | |
newtype BaselinePath Source #
A path to read baseline results in CSV format, populated by --baseline.
This is an option of csvReporter and can be set only globally.
Modifying it via adjustOption or localOption does not have any effect.
One can however pass it to tryIngredients benchIngredients.
Since: 0.3
Constructors
| BaselinePath FilePath |
Instances
| IsOption (Maybe BaselinePath) Source # | |
Defined in Test.Tasty.Bench | |
newtype SvgPath Source #
A path to plot results in SVG format, populated by --svg.
This is an option of svgReporter and can be set only globally.
Modifying it via adjustOption or localOption does not have any effect.
One can however pass it to tryIngredients benchIngredients.
Since: 0.3
Instances
| IsOption (Maybe SvgPath) Source # | |
data TimeMode Source #
Whether to measure CPU time or wall-clock time.
Normally CpuTime is a better option (and default),
but consider switching to WallTime
to measure multithreaded algorithms or time spent in external processes.
One can switch the default measurement mode globally
using --time-mode command-line option,
but it is usually better to adjust the mode locally:
import Test.Tasty (localOption)
localOption WallTime (bgroup [...])

Since: 0.3.2
Constructors
| CpuTime | Measure CPU time. |
| WallTime | Measure wall-clock time. This requires |
Instances
| IsOption TimeMode Source # | |
Defined in Test.Tasty.Bench | |
Utils
locateBenchmark ∷ [String] → Expr Source #
Construct an AWK expression to locate an individual element or elements in Benchmark
by the suffix of the path. Names are listed in reverse order:
from bench's own name to the name of the outermost bgroup.
This function is meant to be used in conjunction with bcompare, e. g.,
bcompare (printAwkExpr (locateBenchmark path)).
See also mapLeafBenchmarks.
locateBenchmark requires tasty-1.0, so if you use it,
it is prudent to add tasty >= 1.0 to build-depends
section of your cabal file.
Real world example: https://hackage.haskell.org/package/text-builder-linear-0.1/src/bench/Main.hs
Since: 0.3.2
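For example, combined with printAwkExpr from Test.Tasty.Patterns.Printer, a sketch mirroring the Fibonacci examples above:

import Test.Tasty.Bench
import Test.Tasty.Patterns.Printer (printAwkExpr)

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

-- The reversed path: the leaf's own name first, then its enclosing bgroup.
baseline :: String
baseline = printAwkExpr (locateBenchmark ["tenth", "fibonacci numbers"])

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bcompare baseline $ bench "fifth" $ nf fibo 5
    , bench "tenth" $ nf fibo 10
    ]
  ]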
mapLeafBenchmarks ∷ ([String] → Benchmark → Benchmark) → Benchmark → Benchmark Source #
Map leaf benchmarks (bench, not bgroup) with a provided function,
which has access to the leaf's reversed path.
This helper is useful for bulk application of bcompare.
See also locateBenchmark.
Real world example: https://hackage.haskell.org/package/text-builder-linear-0.1/src/bench/Main.hs
Since: 0.3.2
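For instance, the following sketch (in the spirit of the real-world example linked above) attaches a comparison to every leaf named "myFibo", measuring it against its sibling "fibo"; printAwkExpr comes from Test.Tasty.Patterns.Printer:

import Test.Tasty.Bench
import Test.Tasty.Patterns.Printer (printAwkExpr)

fibo, myFibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)
myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2)

-- Wrap every leaf named "myFibo" in a comparison against the sibling "fibo".
addCompare :: [String] -> Benchmark -> Benchmark
addCompare (name : path) b
  | name == "myFibo" = bcompare (printAwkExpr (locateBenchmark ("fibo" : path))) b
addCompare _ b = b

main :: IO ()
main = defaultMain $ map (mapLeafBenchmarks addCompare)
  [ bgroup "fibonacci numbers"
    [ bench "fibo" $ nf fibo 20
    , bench "myFibo" $ nf myFibo 20
    ]
  ]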