golds-gym: Golden testing framework for performance benchmarks


This version is deprecated.

A Haskell framework for golden testing of timing benchmarks. Benchmark results are saved to golden files on the first run and compared against those baselines on subsequent runs. Golden files are architecture-specific to account for hardware differences. Based on hspec and benchpress.


Modules


Test
- Hspec
  - Test.Hspec.BenchGolden

Downloads

golds-gym-0.4.0.0.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

ocramz


Versions [RSS]	0.1.0.0, 0.2.0.0, 0.3.0.0, 0.4.0.0, 0.5.0.0 (info)
Change log	CHANGELOG.md
Dependencies	aeson (>=2.0 && <3), base (>=4.14 && <5), benchpress (>=0.2 && <0.3), boxes (>=0.1 && <0.2), bytestring (>=0.10 && <0.13), deepseq (>=1.4 && <2), directory (>=1.3 && <2), filepath (>=1.4 && <2), hspec-core (>=2.10 && <3), microlens (>=0.4 && <0.6), process (>=1.6 && <2), statistics (>=0.16 && <0.17), text (>=1.2 && <3), time (>=1.9 && <2), vector (>=0.12 && <0.14) [details]
License	MIT
Author	Marco Zocca
Maintainer	@ocramz
Uploaded	by ocramz at 2026-02-01T15:45:03Z
Category	Testing
Distributions
Downloads	6 total (6 in the last 30 days)
Status	Docs available [build log] Last success reported on 2026-02-01 [all 1 reports]

Readme for golds-gym-0.4.0.0


golds-gym 🏋️

Golden testing for performance benchmarks. Save timing baselines on first run, compare against them on subsequent runs.

Key Features:

Architecture-specific baselines (different hardware = different golden files)
Hybrid tolerance (handles both fast <1ms and slow operations)
Robust statistics mode (outlier detection, trimmed mean)
Lens-based custom expectations (assert "must be faster", compare by median, etc.)

Quick Start

import Test.Hspec
import Test.Hspec.BenchGolden
import Data.List (sort)

main :: IO ()
main = hspec $ do
  describe "Performance" $ do
    -- Pure function with normal form evaluation (deep, full evaluation)
    benchGolden "list append" $
      nf (\xs -> xs ++ xs) [1..1000]

    -- Weak head normal form (shallow, outermost constructor only)
    benchGolden "replicate" $
      whnf (replicate 1000) 42

    -- Custom configuration
    benchGoldenWith defaultBenchConfig
      { iterations = 500
      , tolerancePercent = 10.0
      }
      "sorting" $
      nf sort [1000, 999..1]

Evaluation strategies (required - specify how values are forced):

nf f x - Force result of f x to normal form (deep, full evaluation)
whnf f x - Force result of f x to weak head normal form (shallow, outermost constructor only)
nfIO action - Execute IO action and force result to normal form
whnfIO action - Execute IO action and force result to WHNF
nfAppIO f x - Apply function, execute resulting IO, force result to normal form
whnfAppIO f x - Apply function, execute resulting IO, force result to WHNF
io action - Plain IO action without additional forcing

Why evaluation strategies matter: Without forcing, GHC may optimize away computations or share results across iterations, making benchmarks meaningless. Use nf in most cases; choose whnf only when you specifically want to measure evaluation up to the outermost constructor.
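The difference between the two forcing depths can be seen without the benchmarking library at all. The sketch below uses only `Control.Exception` and `Control.DeepSeq`; the names `partialList`, `survivesWhnf`, and `survivesNf` are illustrative and are not part of golds-gym.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (SomeException, evaluate, try)

-- | A list whose first cell is defined but whose tail is an error thunk.
partialList :: [Int]
partialList = 1 : error "tail was forced"

-- | True when the evaluation finished without throwing.
succeeded :: Either SomeException a -> Bool
succeeded = either (const False) (const True)

-- | Evaluate to weak head normal form: only the outermost (:) is forced.
survivesWhnf :: a -> IO Bool
survivesWhnf x = succeeded <$> try (evaluate x)

-- | Evaluate to normal form: the whole spine and every element are forced.
survivesNf :: NFData a => a -> IO Bool
survivesNf x = succeeded <$> try (evaluate (force x))

demo :: IO ()
demo = do
  shallow <- survivesWhnf partialList
  deep <- survivesNf partialList
  putStrLn ("WHNF evaluation succeeded: " ++ show shallow)
  putStrLn ("NF evaluation succeeded:   " ++ show deep)
```

WHNF evaluation never touches the tail, so it succeeds; NF evaluation reaches the error thunk and fails. A benchmark forced only to WHNF would likewise skip whatever work is hidden in the unevaluated parts.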

The first run creates .golden/<arch>/list-append.golden with baseline stats.
Subsequent runs compare against that baseline. With the default hybrid tolerance, a test fails only if the mean time differs from the baseline by more than 15% and by more than 0.01 ms.
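As a rough illustration of the file layout, the sketch below builds a golden file path from an architecture string and a benchmark name. The space-to-hyphen rule is an assumption inferred from the "list append" example; `slug` and `goldenPath` are hypothetical helpers, not library functions, and the library may sanitize names differently.

```haskell
import System.FilePath.Posix ((<.>), (</>))

-- | Replace spaces with hyphens (assumed naming rule, inferred from the
-- "list append" -> list-append.golden example).
slug :: String -> String
slug = map (\c -> if c == ' ' then '-' else c)

-- | Path of the golden file for a benchmark on a given architecture.
goldenPath :: String -> String -> FilePath
goldenPath arch name = ".golden" </> arch </> slug name <.> "golden"

demo :: IO ()
demo = putStrLn (goldenPath "aarch64-darwin-Apple_M1" "list append")
```

Keeping the architecture in the directory name means baselines recorded on one machine type are never compared against timings from another.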

Output format:

Metric  Baseline  Actual    Diff
------  --------  ------    ----
Mean    0.150 ms  0.170 ms  +13.3%

Update baselines after intentional changes:

GOLDS_GYM_ACCEPT=1 stack test

How It Works

Golden files store timing statistics per architecture (e.g., .golden/aarch64-darwin-Apple_M1/):

{
  "mean": 1.234,
  "stddev": 0.056,
  "median": 1.201,
  "architecture": "aarch64-darwin-Apple_M1",
  "timestamp": "2026-01-30T12:00:00Z"
}

Hybrid tolerance (default) prevents false failures: benchmarks pass if within ±15% OR ±0.01ms. This handles measurement noise for fast operations (<1ms) while catching real regressions for slower code.
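The hybrid rule is simple enough to state as a pure function. This is a minimal sketch of the arithmetic as described above, not the library's implementation; `percentDiff` and `withinHybrid` are illustrative names, and a positive baseline is assumed for the percentage check.

```haskell
-- | Signed change of the actual time relative to the baseline, in percent.
percentDiff :: Double -> Double -> Double
percentDiff baseline actual = (actual - baseline) / baseline * 100

-- | Hybrid check: pass if EITHER the absolute or the percentage tolerance holds.
-- Arguments: percent tolerance, absolute tolerance (ms), baseline (ms), actual (ms).
withinHybrid :: Double -> Double -> Double -> Double -> Bool
withinHybrid pctTol absTolMs baseline actual =
  absDiff <= absTolMs
    || (baseline > 0 && abs (percentDiff baseline actual) <= pctTol)
  where
    absDiff = abs (actual - baseline)

examples :: [Bool]
examples =
  [ withinHybrid 15 0.01 0.150 0.170 -- 0.02 ms off, but about 13.3%: passes on percent
  , withinHybrid 15 0.01 0.002 0.004 -- +100%, but only 0.002 ms off: passes on absolute
  , withinHybrid 15 0.01 10.0 12.0   -- +20% and 2 ms off: fails both
  ]
```

The second case shows why the absolute term matters: a microsecond-scale benchmark can double through noise alone, and a percentage-only check would flag it.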

Configuration

Key BenchConfig options:

Field	Default	Description
`iterations`	100	Number of benchmark iterations
`tolerancePercent`	15.0	Allowed mean time deviation (%)
`absoluteToleranceMs`	Just 0.01	Absolute tolerance (ms) - enables hybrid mode
`useRobustStatistics`	False	Use trimmed mean/MAD instead of mean/stddev
`warmupIterations`	5	Warm-up runs before measurement

See BenchConfig type for all options.
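To make the roles of `iterations` and `warmupIterations` concrete, here is a minimal timing loop using only base. It is a sketch of the general warm-up-then-measure pattern and makes no claim about how golds-gym or benchpress take their measurements; `timeOnceMs` and `measure` are hypothetical names.

```haskell
import Control.Exception (evaluate)
import Control.Monad (replicateM, replicateM_)
import GHC.Clock (getMonotonicTimeNSec)

-- | Wall-clock duration of one run of the action, in milliseconds.
-- The action's result is discarded, not forced.
timeOnceMs :: IO a -> IO Double
timeOnceMs action = do
  start <- getMonotonicTimeNSec
  _ <- action
  end <- getMonotonicTimeNSec
  pure (fromIntegral (end - start) / 1.0e6)

-- | Run unmeasured warm-up iterations, then collect one sample per measured iteration.
measure :: Int -> Int -> IO a -> IO [Double]
measure warmups iters action = do
  replicateM_ warmups action
  replicateM iters (timeOnceMs action)

demo :: IO ()
demo = do
  -- GHC may compute this constant sum once and share it across iterations,
  -- which is the kind of distortion the evaluation strategies guard against.
  samples <- measure 5 100 (evaluate (sum [1 .. 10000 :: Int]))
  putStrLn ("mean ms: " ++ show (sum samples / fromIntegral (length samples)))
```

Warm-up runs give caches, the allocator, and lazily initialized values a chance to settle before any sample is recorded.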

Environment variables:

GOLDS_GYM_ACCEPT=1 - Regenerate all golden files
GOLDS_GYM_SKIP=1 - Skip benchmarks entirely (useful in CI)
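A test harness could read these variables along the following lines. This is a sketch under two assumptions that the text above does not confirm: only the exact value "1" enables a flag, and skipping takes precedence when both are set. `RunMode`, `decideMode`, and `currentMode` are illustrative names.

```haskell
import System.Environment (lookupEnv)

-- | What a benchmark run should do with its golden files.
data RunMode = Skip | Accept | Compare
  deriving (Eq, Show)

-- | Pure decision from the raw values of GOLDS_GYM_SKIP and GOLDS_GYM_ACCEPT.
-- Assumptions: only "1" enables a flag, and Skip wins over Accept.
decideMode :: Maybe String -> Maybe String -> RunMode
decideMode skip accept
  | skip == Just "1" = Skip
  | accept == Just "1" = Accept
  | otherwise = Compare

-- | Read both variables from the process environment.
currentMode :: IO RunMode
currentMode =
  decideMode <$> lookupEnv "GOLDS_GYM_SKIP" <*> lookupEnv "GOLDS_GYM_ACCEPT"
```

Separating the pure decision from the environment lookup keeps the precedence rule easy to test.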

Advanced: Robust Statistics

Standard mean/stddev are sensitive to outliers (GC pauses, OS scheduling). Robust statistics provide outlier-resistant comparisons:

benchGoldenWith defaultBenchConfig
  { useRobustStatistics = True  -- Use trimmed mean + MAD
  , trimPercent = 10.0          -- Remove top/bottom 10%
  , outlierThreshold = 3.0      -- Flag outliers >3 MADs from median
  }
  "noisy benchmark" $
  nf computation input
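For readers unfamiliar with the terms in the configuration above, the following base-only sketch computes a trimmed mean, the median absolute deviation (MAD), and MAD-based outliers. These are textbook definitions written for illustration; the library's exact formulas (for example any scaling constant applied to the MAD) may differ. All inputs are assumed to be non-empty.

```haskell
import Data.List (sort)

mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- | Median of a non-empty list.
median :: [Double] -> Double
median xs
  | even n = (s !! (h - 1) + s !! h) / 2
  | otherwise = s !! h
  where
    s = sort xs
    n = length s
    h = n `div` 2

-- | Mean after dropping the given percentage of samples from each end.
trimmedMean :: Double -> [Double] -> Double
trimmedMean pct xs = mean (take (n - 2 * k) (drop k (sort xs)))
  where
    n = length xs
    k = floor (fromIntegral n * pct / 100) :: Int

-- | Median absolute deviation from the median.
mad :: [Double] -> Double
mad xs = median [abs (x - m) | x <- xs]
  where
    m = median xs

-- | Samples lying more than `threshold` MADs away from the median.
outliers :: Double -> [Double] -> [Double]
outliers threshold xs = [x | x <- xs, abs (x - m) > threshold * d]
  where
    m = median xs
    d = mad xs

-- | Nine ordinary timings (ms) and one GC-pause-like spike.
sample :: [Double]
sample = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 50.0]
```

On `sample`, the single 50.0 ms spike pulls the plain mean up to about 5.9 ms, while the 10% trimmed mean stays near 1.0 ms and the spike is the only value flagged at a threshold of 3 MADs.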

When to use:

Benchmarking in noisy environments (shared CI, development machines)
Operations with occasional GC pauses or system interruptions
Fast operations (<1ms) with high variance
Test output that contains outlier warnings

Advanced: Lens-Based Expectations

For fine-grained control, use lens-based expectations to assert custom performance requirements:

import Test.Hspec.BenchGolden.Lenses

-- Compare by median instead of mean (more robust)
benchGoldenWithExpectation "median comparison" defaultBenchConfig
  [expect _statsMedian (Percent 10.0)]
  (nf myAlgorithm input)

-- Compose multiple requirements (both must pass)
benchGoldenWithExpectation "strict requirements" defaultBenchConfig
  [ expect _statsMean (Percent 15.0) &&~
    expect _statsIQR (Absolute 0.1)     -- Low variance required
  ]
  (nf criticalFunction inputData)  -- `data` is a reserved word, so the input needs another name

Available lenses: _statsMean, _statsMedian, _statsTrimmedMean, _statsStddev, _statsMAD, _statsIQR, _statsMin, _statsMax

Tolerance types:

Percent 15.0 - Within ±15%
Absolute 0.01 - Within ±0.01ms
Hybrid 15.0 0.01 - Within ±15% OR ±0.01ms
MustImprove 10.0 - Must be ≥10% faster (for testing optimizations)
MustRegress 5.0 - Must be ≥5% slower (for accepting controlled regressions)

Composition: (&&~) for AND, (||~) for OR
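The semantics of the tolerance types and combinators can be modelled in a few lines. The sketch below is a standalone stand-in, not the library's code: it reuses the constructor and operator names listed above for readability, but uses plain record accessors instead of lenses, a local `Stats` record with only two fields, and an assumed reading of MustImprove and MustRegress as signed percentage thresholds. Baselines are assumed to be positive.

```haskell
-- | Local stand-in for the statistics record (only two fields modelled).
data Stats = Stats
  { statsMean :: Double
  , statsMedian :: Double
  }

data Tolerance
  = Percent Double       -- within +/- p percent
  | Absolute Double      -- within +/- a ms
  | Hybrid Double Double -- within +/- p percent OR +/- a ms
  | MustImprove Double   -- at least p percent faster
  | MustRegress Double   -- at least p percent slower

-- | Does the actual value satisfy the tolerance relative to the baseline?
satisfies :: Tolerance -> Double -> Double -> Bool
satisfies tol baseline actual = case tol of
  Percent p -> abs pct <= p
  Absolute a -> abs diff <= a
  Hybrid p a -> abs pct <= p || abs diff <= a
  MustImprove p -> pct <= negate p
  MustRegress p -> pct >= p
  where
    diff = actual - baseline
    pct = diff / baseline * 100

-- | An expectation compares baseline stats (first) with actual stats (second).
type Expectation = Stats -> Stats -> Bool

expect :: (Stats -> Double) -> Tolerance -> Expectation
expect field tol baseline actual = satisfies tol (field baseline) (field actual)

infixr 3 &&~
(&&~) :: Expectation -> Expectation -> Expectation
(e1 &&~ e2) b a = e1 b a && e2 b a

infixr 2 ||~
(||~) :: Expectation -> Expectation -> Expectation
(e1 ||~ e2) b a = e1 b a || e2 b a

-- | Mean improved by 15%, median by 10%.
baselineStats, fasterStats :: Stats
baselineStats = Stats 1.0 1.0
fasterStats = Stats 0.85 0.9
```

With these values, `expect statsMean (MustImprove 10.0)` holds while `expect statsMedian (Percent 5.0)` does not, so combining them with `(&&~)` fails and combining them with `(||~)` passes.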

Documentation

API documentation - Full Haddock docs
Example benchmarks - Comprehensive usage examples
CHANGELOG - Version history and migration guides

tasty-bench

License

MIT