| Copyright | (c) 2021 Andrew Lelechenko |
|---|---|
| License | MIT |
| Safe Haskell | None |
| Language | Haskell2010 |
Test.Tasty.Bench
Description
Featherlight benchmark framework (only one file!) for performance measurement with API mimicking criterion and gauge.
How lightweight is it?
There is only one source file Test.Tasty.Bench and no external dependencies
except tasty.
So if you already depend on tasty for a test suite, there
is nothing else to install.
Compare this to criterion (10+ modules, 50+ dependencies) and gauge (40+ modules, depends on basement and vector).
How is it possible?
Our benchmarks are literally regular tasty tests, so we can leverage all existing
machinery for command-line options, resource management, structuring,
listing and filtering benchmarks, running and reporting results. It also means
that tasty-bench can be used in conjunction with other tasty ingredients.
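For instance, since Benchmark is just a synonym for tasty's TestTree, standard tasty combinators apply to benchmarks directly. Below is a minimal sketch (not from the original documentation): localOption and mkTimeout come from Test.Tasty, and the one-second timeout is an arbitrary value chosen for illustration.

import Test.Tasty (localOption, mkTimeout)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- Apply a standard tasty option to a single benchmark:
    -- abort it if it does not finish within one second (1000000 μs).
    localOption (mkTimeout 1000000) $
      bench "sum of a long list" $ nf sum [1 .. 1000000 :: Int]
  ]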
Unlike criterion and gauge we use a very simple statistical model described below.
This is arguably a questionable choice, but it works pretty well in practice.
A rare developer is sufficiently well-versed in probability theory
to make sense of, let alone use, all the numbers generated by criterion.
How to switch?
Cabal mixins allow you to taste tasty-bench instead of criterion or gauge
without changing a single line of code:
cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion)
This works vice versa as well: if you use tasty-bench, but at some point
need a more comprehensive statistical analysis,
it is easy to switch temporarily back to criterion.
How to write a benchmark?
Benchmarks are declared in a separate section of cabal file:
cabal-version: 2.0
name:          bench-fibo
version:       0.0
build-type:    Simple
synopsis:      Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
And here is BenchFibo.hs:
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]
Since tasty-bench provides an API compatible with criterion,
one can refer to its documentation for more examples.
How to read results?
Running the example above (cabal bench or stack bench)
results in the following output:
All
  fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ±  73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)
The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall time), its mean time was 63 nanoseconds and, assuming ideal precision of a system clock, execution time does not often diverge from the mean further than ±3.4 nanoseconds (double standard deviation, which for normal distributions corresponds to 95% probability). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.
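For example, under that normality assumption roughly 95% of individual measurements of the first benchmark are expected to fall within \( 63\,\text{ns} \pm 3.4\,\text{ns} \), i.e. between 59.6 ns and 66.4 ns.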
Note that this data is not directly comparable with criterion output:
benchmarking fibonacci numbers/fifth
time 62.78 ns (61.99 ns .. 63.41 ns)
0.999 R² (0.999 R² .. 1.000 R²)
mean 62.39 ns (61.93 ns .. 62.94 ns)
std dev 1.753 ns (1.427 ns .. 2.258 ns)
One might interpret the second line as saying that
95% of measurements fell into the 61.99–63.41 ns interval, but this is wrong.
It states that the OLS regression
of execution time (which is not exactly the mean time) is most probably
somewhere between 61.99 ns and 63.41 ns,
but does not say a thing about individual measurements.
To understand how far away a typical measurement deviates
you need to add/subtract double standard deviation yourself
(which gives 62.78 ns ± 3.506 ns, similar to tasty-bench above).
To add to the confusion, gauge in --small mode outputs
not the second line of criterion report as one might expect,
but a mean value from the penultimate line and a standard deviation:
fibonacci numbers/fifth mean 62.39 ns ( +- 1.753 ns )
The interval ±1.753 ns covers only 68% of samples (a single standard deviation); double it to estimate the behavior in 95% of cases.
Statistical model
Here is a procedure used by tasty-bench to measure execution time:
1. Set \( n \leftarrow 1 \).
2. Measure execution time \( t_n \) of \( n \) iterations and execution time \( t_{2n} \) of \( 2n \) iterations.
3. Find \( t \) which minimizes the deviation of \( (nt, 2nt) \) from \( (t_n, t_{2n}) \).
4. If the deviation is small enough (see --stdev below), return \( t \) as the mean execution time.
5. Otherwise set \( n \leftarrow 2n \) and jump back to Step 2.
This is roughly similar to the linear regression approach which criterion takes,
but we fit only the last two points. This allows us to simplify away all heavy-weight
statistical analysis. More importantly, earlier measurements,
which are presumably shorter and noisier, do not affect overall result.
This is in contrast to criterion, which fits all measurements and
is biased to use more data points corresponding to shorter runs
(it employs \( n \leftarrow 1.05n \) progression).
An alert reader could object that we measure standard deviation for samples with \( n \) and \( 2n \) iterations, but report it scaled to a single iteration. Strictly speaking, this is justified only if we assume that deviating factors are either roughly periodic (e. g., coarseness of a system clock, garbage collection) or are likely to affect several successive iterations in the same way (e. g., slow down by another concurrent process).
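To make the procedure above concrete, here is a minimal, self-contained sketch of the doubling loop in plain Haskell. It is not the actual tasty-bench implementation: measure, benchLoop and the exact stopping criterion are simplifying assumptions for illustration, and getMonotonicTime comes from GHC.Clock.

import Control.Monad (replicateM_)
import GHC.Clock (getMonotonicTime)

-- Wall-clock time (in seconds) of running an action n times.
measure :: Int -> IO () -> IO Double
measure n act = do
  before <- getMonotonicTime
  replicateM_ n act
  after <- getMonotonicTime
  pure (after - before)

-- Double the iteration count until the fitted per-iteration time t
-- reproduces both measurements within the target relative deviation.
benchLoop :: Double -> IO () -> IO Double
benchLoop targetRelStdev act = go 1
  where
    go n = do
      tn  <- measure n       act            -- time of n iterations
      t2n <- measure (2 * n) act            -- time of 2n iterations
      let n'  = fromIntegral n
          -- least-squares fit of (n * t, 2n * t) against (tn, t2n)
          t   = (tn + 2 * t2n) / (5 * n')
          dev = sqrt ((tn - n' * t) ^ (2 :: Int) + (t2n - 2 * n' * t) ^ (2 :: Int))
      if dev < targetRelStdev * 2 * n' * t  -- "small enough" (simplified check)
        then pure t                         -- mean time per iteration
        else go (2 * n)

For instance, benchLoop 0.05 action targets the default 5% relative standard deviation.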
Obligatory disclaimer: statistics is a tricky matter, there is no
one-size-fits-all approach.
In the absence of a good theory
simplistic approaches are as (un)sound as obscure ones.
Those who seek statistical soundness should rather collect raw data
and process it themselves in R/Python. Data reported by tasty-bench
is only of indicative and comparative significance.
Tip
Passing +RTS -T (via cabal bench --benchmark-options '+RTS -T'
or stack bench --ba '+RTS -T') enables tasty-bench to estimate and report
memory usage such as allocated and copied bytes.
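Alternatively, the RTS option can be baked into the benchmark executable itself, so that no extra command-line arguments are needed. This relies on GHC's generic -with-rtsopts flag rather than on anything tasty-bench-specific:

benchmark bench-fibo
  ...
  ghc-options: "-with-rtsopts=-T"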
Command-line options
Use --help to list command-line options.
- -p, --pattern: a standard tasty option, which allows filtering benchmarks by a pattern or awk expression. Please refer to tasty documentation for details.
- --csv: a file to write results to in CSV format. If specified, suppresses console output.
- -t, --timeout: a standard tasty option, setting a timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long: tasty-bench will make an effort to report results (even if of subpar quality) before the timeout. Setting the timeout too tight (insufficient for at least three iterations of a benchmark) will result in a benchmark failure. Do not use --timeout without a reason: it forks an additional thread and thus affects the reliability of measurements.
- --stdev: target relative standard deviation of measurements in percent (5% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise. If it takes far too long, consider setting --timeout, which will interrupt benchmarks, potentially before reaching the target deviation. (See the example invocation below.)
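For example, all of the options above can be combined in a single run; the flags are exactly those documented in this list, and results.csv is just a placeholder file name:

cabal bench --benchmark-options '--stdev 2 --timeout 100 --csv results.csv -p fibonacci'

or, with stack:

stack bench --ba '--stdev 2 --timeout 100 --csv results.csv -p fibonacci'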
Synopsis
- defaultMain :: [Benchmark] -> IO ()
- type Benchmark = TestTree
- bench :: String -> Benchmarkable -> Benchmark
- bgroup :: String -> [Benchmark] -> Benchmark
- data Benchmarkable
- nf :: NFData b => (a -> b) -> a -> Benchmarkable
- whnf :: (a -> b) -> a -> Benchmarkable
- nfIO :: NFData a => IO a -> Benchmarkable
- whnfIO :: IO a -> Benchmarkable
- nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable
- whnfAppIO :: (a -> IO b) -> a -> Benchmarkable
- csvReporter :: Ingredient
Running Benchmark
defaultMain :: [Benchmark] -> IO () Source #
Run benchmarks and report results.
Wrapper around tasty's defaultMain (with csvReporter added)
to provide an interface compatible with
criterion's and gauge's defaultMain.
bench :: String -> Benchmarkable -> Benchmark Source #
Attach a name to Benchmarkable.
This is actually a synonym of singleTest
to provide an interface compatible with criterion's and gauge's bench.
Creating Benchmarkable
data Benchmarkable Source #
Something that can be benchmarked.
Drop-in replacement for criterion's and gauge's Benchmarkable.
Instances
| IsTest Benchmarkable | Defined in Test.Tasty.Bench |
nf :: NFData b => (a -> b) -> a -> Benchmarkable Source #
nf f x measures time to compute
a normal form (by means of rnf) of f x.
Note that forcing a normal form requires an additional
traversal of the structure. In certain scenarios (imagine benchmarking tail),
especially when NFData instance is badly written,
this traversal may take non-negligible time and affect results.
whnf :: (a -> b) -> a -> Benchmarkable Source #
whnf f x measures time to compute
the weak head normal form of f x.
Computing only a weak head normal form is rarely what is intuitively meant by "evaluation";
unless you understand precisely what is measured, it is recommended to use nf instead.
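As a hypothetical illustration of the difference (the benchmark names and the input list are made up for this example): whnf below forces only the outermost (:) constructor of the result, so most of the work of map is never performed, while nf forces the entire list.

import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- Forces only the first cons cell of the result: almost none of
    -- the mapping work is actually measured.
    bench "map, whnf" $ whnf (map (+ 1)) [1 .. 10000 :: Int]
    -- Forces the whole result list: usually what is actually intended.
  , bench "map, nf"   $ nf   (map (+ 1)) [1 .. 10000 :: Int]
  ]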
nfIO :: NFData a => IO a -> Benchmarkable Source #
nfIO x measures time to evaluate side-effects of x
and compute its normal form (by means of rnf).
Pure subexpression of an effectful computation x
may be evaluated only once and get cached; use nfAppIO
to avoid this.
Note that forcing a normal form requires an additional
traversal of the structure. In certain scenarios,
especially when NFData instance is badly written,
this traversal may take non-negligible time and affect results.
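As a hypothetical illustration of this caching effect (fibo is the same toy function as in the example above, and evaluate comes from Control.Exception): with nfIO the pure thunk fibo 25 may be shared between iterations and evaluated only once, while nfAppIO re-applies the function on every iteration.

import Control.Exception (evaluate)
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ -- The pure expression (fibo 25) inside the IO action may be
    -- evaluated once and cached, so later iterations measure
    -- almost nothing.
    bench "cached"     $ nfIO (evaluate (fibo 25))
    -- The function is re-applied to 25 on every iteration, so the
    -- computation really happens each time.
  , bench "recomputed" $ nfAppIO (evaluate . fibo) 25
  ]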
whnfIO :: IO a -> Benchmarkable Source #
whnfIO x measures time to evaluate side-effects of x
and compute its weak head normal form.
Pure subexpression of an effectful computation x
may be evaluated only once and get cached; use whnfAppIO
to avoid this.
Computing only a weak head normal form is
rarely what is intuitively meant by "evaluation".
Unless you understand precisely what is measured,
it is recommended to use nfIO instead.
nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable Source #
nfAppIO f x measures time to evaluate side-effects of f x
and compute its normal form (by means of rnf).
Note that forcing a normal form requires an additional
traversal of the structure. In certain scenarios,
especially when NFData instance is badly written,
this traversal may take non-negligible time and affect results.
whnfAppIO :: (a -> IO b) -> a -> Benchmarkable Source #
whnfAppIO f x measures time to evaluate side-effects of f x
and compute its weak head normal form.
Unless you understand precisely what is measured, it is recommended to use nfAppIO instead.
CSV ingredient
csvReporter :: Ingredient Source #
Add this ingredient to run benchmarks and save results in CSV format.
It activates when --csv FILE command line option is specified.
defaultMainWithIngredients [listingTests, csvReporter, consoleTestReporter] benchmarks
Remember that successful activation of an ingredient suppresses all subsequent
ingredients. If you wish to produce CSV in addition to other reports,
use composeReporters:
defaultMainWithIngredients [listingTests, composeReporters csvReporter consoleTestReporter] benchmarks