Copyright | (c) 2021 Andrew Lelechenko
License | MIT
Safe Haskell | None
Language | Haskell2010
Featherlight benchmark framework (only one file!) for performance measurement with an API mimicking criterion and gauge. A prominent feature is built-in comparison against a baseline.
How lightweight is it?
There is only one source file Test.Tasty.Bench and no external dependencies except tasty. So if you already depend on tasty for a test suite, there is nothing else to install. Compare this to criterion (10+ modules, 50+ dependencies) and gauge (40+ modules, depends on basement and vector).
How is it possible?
Our benchmarks are literally regular tasty tests, so we can leverage all the existing machinery for command-line options, resource management, structuring, listing and filtering benchmarks, running and reporting results. It also means that tasty-bench can be used in conjunction with other tasty ingredients.
Unlike criterion and gauge, we use a very simple statistical model, described below. This is arguably a questionable choice, but it works pretty well in practice. Few developers are sufficiently well-versed in probability theory to make sense and use of all the numbers generated by criterion.
How to switch?
Cabal mixins allow you to taste tasty-bench instead of criterion or gauge without changing a single line of code:
cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion)
This works vice versa as well: if you use tasty-bench, but at some point need a more comprehensive statistical analysis, it is easy to switch temporarily back to criterion.
How to write a benchmark?
Benchmarks are declared in a separate section of the cabal file:
cabal-version: 2.0
name:          bench-fibo
version:       0.0
build-type:    Simple
synopsis:      Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
And here is BenchFibo.hs:
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]
Since tasty-bench provides an API compatible with criterion, one can refer to its documentation for more examples.
How to read results?
Running the example above (cabal bench or stack bench) results in the following output:
All
  fibonacci numbers
    fifth:     OK (2.13s)
      63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ± 73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)
The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall time), its mean time was 63 nanoseconds and, assuming ideal precision of a system clock, execution time does not often diverge from the mean further than ±3.4 nanoseconds (double standard deviation, which for normal distributions corresponds to 95% probability). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.
Note that this data is not directly comparable with criterion output:
benchmarking fibonacci numbers/fifth
time                 62.78 ns   (61.99 ns .. 63.41 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 62.39 ns   (61.93 ns .. 62.94 ns)
std dev              1.753 ns   (1.427 ns .. 2.258 ns)
One might interpret the second line as saying that 95% of measurements fell into the 61.99–63.41 ns interval, but this is wrong. It states that the OLS regression of execution time (which is not exactly the mean time) most probably lies somewhere between 61.99 ns and 63.41 ns, but it does not say a thing about individual measurements. To understand how far a typical measurement deviates, you need to add/subtract double the standard deviation yourself (which gives 62.78 ns ± 3.506 ns, similar to tasty-bench above).
To add to the confusion, gauge in --small mode outputs not the second line of the criterion report, as one might expect, but the mean value from the penultimate line together with a standard deviation:
fibonacci numbers/fifth mean 62.39 ns ( +- 1.753 ns )
The interval ±1.753 ns covers only 68% of samples; double it to estimate the behavior in 95% of cases.
Statistical model
Here is the procedure used by tasty-bench to measure execution time:

1. Set \( n \leftarrow 1 \).
2. Measure execution time \( t_n \) of \( n \) iterations and execution time \( t_{2n} \) of \( 2n \) iterations.
3. Find \( t \) which minimizes the deviation of \( (nt, 2nt) \) from \( (t_n, t_{2n}) \).
4. If the deviation is small enough (see --stdev below), return \( t \) as the mean execution time.
5. Otherwise set \( n \leftarrow 2n \) and jump back to Step 2.
This is roughly similar to the linear regression approach which criterion takes, but we fit only the last two points. This allows us to simplify away all the heavyweight statistical analysis. More importantly, earlier measurements, which are presumably shorter and noisier, do not affect the overall result. This is in contrast to criterion, which fits all measurements and is biased to use more data points corresponding to shorter runs (it employs an \( n \leftarrow 1.05n \) progression).
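The two-point fit in Step 3 admits a closed form. Below is a hedged sketch, not tasty-bench's actual code, of what such a least-squares fit looks like; the function name is invented for illustration.

```haskell
-- Hypothetical helper (not tasty-bench's implementation): given wall
-- times tN and t2N measured for n and 2n iterations, pick the
-- per-iteration time t minimising
--   (n*t - tN)^2 + (2*n*t - t2N)^2.
-- Setting the derivative to zero yields t = (tN + 2*t2N) / (5*n).
fitIterationTime :: Double -> Double -> Double -> Double
fitIterationTime n tN t2N = (tN + 2 * t2N) / (5 * n)
```

For noiseless measurements (t2N = 2 * tN) this recovers exactly tN / n; otherwise noise in both samples is averaged into the estimate, with the longer run weighted more heavily.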
An alert reader could object that we measure standard deviation for samples with \( n \) and \( 2n \) iterations, but report it scaled to a single iteration. Strictly speaking, this is justified only if we assume that deviating factors are either roughly periodic (e. g., coarseness of a system clock, garbage collection) or are likely to affect several successive iterations in the same way (e. g., slow down by another concurrent process).
Obligatory disclaimer: statistics is a tricky matter, and there is no one-size-fits-all approach. In the absence of a good theory, simplistic approaches are as (un)sound as obscure ones. Those who seek statistical soundness should rather collect raw data and process it themselves using a proper statistical toolbox. Data reported by tasty-bench is of indicative and comparative significance only.
Memory usage
Passing +RTS -T (via cabal bench --benchmark-options '+RTS -T' or stack bench --ba '+RTS -T') enables tasty-bench to estimate and report memory usage, such as allocated and copied bytes:
All
  fibonacci numbers
    fifth:     OK (2.13s)
      63 ns ± 3.4 ns, 223 B  allocated, 0 B  copied
    tenth:     OK (1.71s)
      809 ns ± 73 ns, 2.3 KB allocated, 0 B  copied
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs, 277 KB allocated, 59 B  copied

All 3 tests passed (7.25s)
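As a quick sanity check that the RTS flag took effect, one can query GHC.Stats from base. The helper below is a hypothetical add-on for illustration, not part of tasty-bench:

```haskell
import GHC.Stats (getRTSStatsEnabled)

-- Hypothetical helper: report whether the program was started with
-- RTS statistics enabled (e.g. via +RTS -T), which tasty-bench
-- relies on for the memory columns above.
rtsStatsNote :: IO String
rtsStatsNote = do
  enabled <- getRTSStatsEnabled
  pure $ if enabled
    then "RTS stats enabled; memory usage will be reported"
    else "run with +RTS -T to enable memory reporting"
```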
Combining tests and benchmarks
When optimizing an existing function, it is important to check that its observable behavior remains unchanged. One can rebuild both tests and benchmarks after each change, but it would be more convenient to run sanity checks within the benchmark itself. Since our benchmarks are compatible with tasty tests, we can easily do so.
Imagine you come up with a faster function myFibo to generate Fibonacci numbers:
import Test.Tasty.Bench
import Test.Tasty.QuickCheck -- from tasty-quickcheck package

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

myFibo :: Int -> Integer
myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2)

main :: IO ()
main = Test.Tasty.Bench.defaultMain -- not Test.Tasty.defaultMain
  [ bench "fibo 20"   $ nf fibo   20
  , bench "myFibo 20" $ nf myFibo 20
  , testProperty "myFibo = fibo" $ \n -> fibo n === myFibo n
  ]
This outputs:
All
  fibo 20:       OK (3.02s)
    104 μs ± 4.9 μs
  myFibo 20:     OK (1.99s)
     71 μs ± 5.3 μs
  myFibo = fibo: FAIL
    *** Failed! Falsified (after 5 tests and 1 shrink):
    2
    1 /= 2
    Use --quickcheck-replay=927711 to reproduce.

1 out of 3 tests failed (5.03s)
We see that myFibo is indeed significantly faster than fibo, but unfortunately it does not do the same thing. One should probably look for another way to speed up the generation of Fibonacci numbers.
Troubleshooting
If benchmark results look malformed, as below, make sure that you are invoking Test.Tasty.Bench.defaultMain and not Test.Tasty.defaultMain (the difference is consoleBenchReporter vs. consoleTestReporter):
All
  fibo 20: OK (1.46s)
    Response {respEstimate = Estimate {estMean = Measurement {measTime = 87496728, measAllocs = 0, measCopied = 0}, estSigma = 694487}, respIfSlower = FailIfSlower {unFailIfSlower = Infinity}, respIfFaster = FailIfFaster {unFailIfFaster = Infinity}}
Comparison against baseline
One can compare benchmark results against an earlier baseline in an automatic way. To use this feature, first run tasty-bench with the --csv FILE key to dump results to FILE in CSV format:
Name,Mean (ps),2*Stdev (ps)
All.fibonacci numbers.fifth,48453,4060
All.fibonacci numbers.tenth,637152,46744
All.fibonacci numbers.twentieth,81369531,3342646
Note that the columns do not match the CSV reports of criterion and gauge. If desired, missing columns can be faked with

awk 'BEGIN {FS=",";OFS=","}; {print $1,$2,$2,$2,$3/2,$3/2,$3/2}'

or similar.
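The same column-faking can be done in Haskell. Below is a hedged sketch (function names are invented for illustration) that mirrors the awk one-liner, assuming the three-column tasty-bench CSV shown above:

```haskell
import Data.List (intercalate)

-- Split a CSV line on commas (naive: no quoting support).
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (field, _ : rest) -> field : splitOnComma rest
  (field, [])       -> [field]

-- Expand a (Name, Mean, 2*Stdev) row into seven criterion-style
-- columns, duplicating the mean and halving the doubled standard
-- deviation; headers and malformed rows pass through unchanged.
fakeCriterionRow :: String -> String
fakeCriterionRow row = case splitOnComma row of
  [name, mean, stdev2]
    | [(d, "")] <- reads stdev2 :: [(Double, String)] ->
        let half = show (d / 2)
        in intercalate "," [name, mean, mean, mean, half, half, half]
  _ -> row
```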
Now modify the implementation and rerun the benchmarks with the --baseline FILE key. This produces a report as follows:
All
  fibonacci numbers
    fifth:     OK (0.44s)
      53 ns ± 2.7 ns, 8% slower than baseline
    tenth:     OK (0.33s)
      641 ns ± 59 ns
    twentieth: OK (0.36s)
      77 μs ± 6.4 μs, 5% faster than baseline

All 3 tests passed (1.50s)
You can also fail benchmarks which deviate too far from the baseline, using the --fail-if-slower and --fail-if-faster options. For example, setting both of them to 6 will fail the first benchmark above (because it is more than 6% slower), while the last one still succeeds (even though it is measurably faster than the baseline, the deviation is less than 6%). Consider also using --hide-successes to show only problematic benchmarks, or even the tasty-rerun package to focus on rerunning failing items only.
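These thresholds are simple percentage comparisons. Here is a hedged sketch of the arithmetic (not tasty-bench's internals; the function name is invented):

```haskell
-- Relative deviation of a new mean time from the baseline, in
-- percents; positive means slower, negative means faster.
percentChange :: Double -> Double -> Double
percentChange baseline new = (new / baseline - 1) * 100

-- A benchmark taking 100 units at baseline and 108 now is 8% slower,
-- so --fail-if-slower 6 would mark it as failed, while a run at 95
-- units (5% faster) would still pass --fail-if-faster 6.
```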
Command-line options
Use --help to list command-line options.
-p, --pattern
  This is a standard tasty option, which allows filtering benchmarks by a pattern or awk expression. Please refer to tasty documentation for details.

-t, --timeout
  This is a standard tasty option, setting a timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long: tasty-bench will make an effort to report results (even if of subpar quality) before the timeout. Setting the timeout too tight (insufficient for at least three iterations) will result in a benchmark failure.

--stdev
  Target relative standard deviation of measurements in percents (1% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise. If it takes far too long, consider setting --timeout, which will interrupt benchmarks, potentially before reaching the target deviation.

--csv
  File to write results to in CSV format.

--baseline
  File to read baseline results from in CSV format (as produced by --csv).

--fail-if-slower, --fail-if-faster
  Upper bounds of acceptable slowdown / speedup in percents. If a benchmark is unacceptably slower / faster than the baseline (see --baseline), it will be reported as failed. Can be used in conjunction with a standard tasty option --hide-successes to show only problematic benchmarks.
Synopsis
- defaultMain :: [Benchmark] -> IO ()
- type Benchmark = TestTree
- bench :: String -> Benchmarkable -> Benchmark
- bgroup :: String -> [Benchmark] -> Benchmark
- env :: NFData env => IO env -> (env -> Benchmark) -> Benchmark
- envWithCleanup :: NFData env => IO env -> (env -> IO a) -> (env -> Benchmark) -> Benchmark
- data Benchmarkable
- nf :: NFData b => (a -> b) -> a -> Benchmarkable
- whnf :: (a -> b) -> a -> Benchmarkable
- nfIO :: NFData a => IO a -> Benchmarkable
- whnfIO :: NFData a => IO a -> Benchmarkable
- nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable
- whnfAppIO :: (a -> IO b) -> a -> Benchmarkable
- benchIngredients :: [Ingredient]
- consoleBenchReporter :: Ingredient
- csvReporter :: Ingredient
- newtype RelStDev = RelStDev Double
- newtype FailIfSlower = FailIfSlower Double
- newtype FailIfFaster = FailIfFaster Double
Running Benchmark
defaultMain :: [Benchmark] -> IO () Source #
Run benchmarks and report results, providing an interface compatible with defaultMain from criterion and gauge.
bench :: String -> Benchmarkable -> Benchmark Source #
Attach a name to a Benchmarkable. This is actually a synonym of singleTest, provided for an interface compatible with bench from criterion and gauge.
env :: NFData env => IO env -> (env -> Benchmark) -> Benchmark Source #
Run benchmarks in the given environment, usually reading large input data from a file.

One might wonder why env is needed, when we can simply read all input data before calling defaultMain. The reason is that large data dangling in the heap causes longer garbage collection and slows down all benchmarks, even those which do not use it at all.

Provided only for the sake of compatibility with env from criterion and gauge, and involves unsafePerformIO. Consider using withResource instead.
envWithCleanup :: NFData env => IO env -> (env -> IO a) -> (env -> Benchmark) -> Benchmark Source #
Similar to env, but includes an additional argument to clean up the created environment.

Provided only for the sake of compatibility with envWithCleanup from criterion and gauge, and involves unsafePerformIO. Consider using withResource instead.
Creating Benchmarkable
data Benchmarkable Source #
Something that can be benchmarked, produced by nf, whnf, nfIO, whnfIO, nfAppIO, whnfAppIO below.

Drop-in replacement for Benchmarkable from criterion and gauge.

Instances

IsTest Benchmarkable (Defined in Test.Tasty.Bench)
nf :: NFData b => (a -> b) -> a -> Benchmarkable Source #
nf f x measures the time to compute a normal form (by means of rnf) of an application of f to x. This does not include the time to evaluate f or x themselves.

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios (imagine benchmarking tail), especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.
whnf :: (a -> b) -> a -> Benchmarkable Source #
whnf f x measures the time to compute a weak head normal form of an application of f to x. This does not include the time to evaluate f or x themselves.

Computing only a weak head normal form is rarely what is intuitively meant by "evaluation". Unless you understand precisely what is measured, it is recommended to use nf instead.
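To build intuition for how little WHNF-style forcing may evaluate, here is a hedged, base-only illustration (it does not use tasty-bench; the function is invented for demonstration): computing the length of a list forces only its spine, so diverging elements are never touched. A whnf benchmark can under-measure in exactly this way.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- length traverses the spine of the list but leaves every element
-- unevaluated, so the embedded errors are never triggered.
spineOnly :: IO Bool
spineOnly = do
  let xs = [error "never forced", error "never forced"] :: [Int]
  r <- try (evaluate (length xs)) :: IO (Either SomeException Int)
  pure $ case r of
    Right 2 -> True  -- spine fully evaluated; elements untouched
    _       -> False
```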
nfIO :: NFData a => IO a -> Benchmarkable Source #
nfIO x measures the time to evaluate side-effects of x and compute its normal form (by means of rnf).

A pure subexpression of an effectful computation x may be evaluated only once and get cached; use nfAppIO to avoid this.

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios, especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.
whnfIO :: NFData a => IO a -> Benchmarkable Source #
whnfIO x measures the time to evaluate side-effects of x and compute its weak head normal form.

A pure subexpression of an effectful computation x may be evaluated only once and get cached; use whnfAppIO to avoid this.

Computing only a weak head normal form is rarely what is intuitively meant by "evaluation". Unless you understand precisely what is measured, it is recommended to use nfIO instead.
nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable Source #
nfAppIO f x measures the time to evaluate side-effects of an application of f to x and compute its normal form (by means of rnf). This does not include the time to evaluate f or x themselves.

Note that forcing a normal form requires an additional traversal of the structure. In certain scenarios, especially when the NFData instance is badly written, this traversal may take non-negligible time and affect results.
whnfAppIO :: (a -> IO b) -> a -> Benchmarkable Source #
whnfAppIO f x measures the time to evaluate side-effects of an application of f to x and compute its weak head normal form. This does not include the time to evaluate f or x themselves.

Computing only a weak head normal form is rarely what is intuitively meant by "evaluation". Unless you understand precisely what is measured, it is recommended to use nfAppIO instead.
Ingredients
benchIngredients :: [Ingredient] Source #
List of default benchmark ingredients. This is what defaultMain runs.
consoleBenchReporter :: Ingredient Source #
Run benchmarks and report results in a manner similar to consoleTestReporter.

If the --baseline FILE command-line option is specified, compare results against an earlier run and mark too slow / too fast benchmarks as failed, in accordance with the bounds specified by --fail-if-slower PERCENT and --fail-if-faster PERCENT.
csvReporter :: Ingredient Source #
Run benchmarks and save results in CSV format. It activates when the --csv FILE command-line option is specified.
newtype RelStDev Source #
In addition to the --stdev command-line option, one can adjust the target relative standard deviation for individual benchmarks and groups of benchmarks using adjustOption and localOption.

E. g., set the target relative standard deviation to 2% as follows:

localOption (RelStDev 0.02) (bgroup [...])
newtype FailIfSlower Source #
In addition to the --fail-if-slower command-line option, one can adjust the upper bound of acceptable slowdown in comparison to the baseline for individual benchmarks and groups of benchmarks using adjustOption and localOption.

E. g., set the upper bound of acceptable slowdown to 10% as follows:

localOption (FailIfSlower 0.10) (bgroup [...])
Instances
Read FailIfSlower (Defined in Test.Tasty.Bench)
Show FailIfSlower (Defined in Test.Tasty.Bench)
IsOption FailIfSlower (Defined in Test.Tasty.Bench)
newtype FailIfFaster Source #
In addition to the --fail-if-faster command-line option, one can adjust the upper bound of acceptable speedup in comparison to the baseline for individual benchmarks and groups of benchmarks using adjustOption and localOption.

E. g., set the upper bound of acceptable speedup to 10% as follows:

localOption (FailIfFaster 0.10) (bgroup [...])
Instances
Read FailIfFaster (Defined in Test.Tasty.Bench)
Show FailIfFaster (Defined in Test.Tasty.Bench)
IsOption FailIfFaster (Defined in Test.Tasty.Bench)