golds-gym: Golden testing framework for performance benchmarks

[ library, mit, testing ]

This version is deprecated.

A Haskell framework for golden testing of timing benchmarks. Benchmarks are saved to golden files on first run and compared against on subsequent runs. Golden files are architecture-specific to account for hardware differences. Based on hspec and benchpress.

Modules

Test
- Hspec
  - Test.Hspec.BenchGolden

Downloads

golds-gym-0.3.0.0.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

ocramz

Versions [RSS]	0.1.0.0, 0.2.0.0, 0.3.0.0, 0.4.0.0, 0.5.0.0 (info)
Change log	CHANGELOG.md
Dependencies	aeson (>=2.0 && <3), base (>=4.14 && <5), benchpress (>=0.2 && <0.3), boxes (>=0.1 && <0.2), bytestring (>=0.10 && <0.13), directory (>=1.3 && <2), filepath (>=1.4 && <2), hspec-core (>=2.10 && <3), microlens (>=0.4 && <0.6), process (>=1.6 && <2), statistics (>=0.16 && <0.17), text (>=1.2 && <3), time (>=1.9 && <2), vector (>=0.12 && <0.14) [details]
License	MIT
Author	Marco Zocca
Maintainer	@ocramz
Uploaded	by ocramz at 2026-02-01T11:55:10Z
Category	Testing
Distributions
Downloads	6 total (6 in the last 30 days)
Rating	(no votes yet) [estimated by Bayesian average]
Status	Docs available [build log] Last success reported on 2026-02-01 [all 1 reports]

Readme for golds-gym-0.3.0.0

golds-gym 🏋️

Golden testing for performance benchmarks. Save timing baselines on first run, compare against them on subsequent runs.

Key Features:

Architecture-specific baselines (different hardware = different golden files)
Hybrid tolerance (handles both fast <1ms and slow operations)
Robust statistics mode (outlier detection, trimmed mean)
Lens-based custom expectations (assert "must be faster", compare by median, etc.)

Quick Start

import Data.List (sort)  -- needed for the "sorting" benchmark below
import Test.Hspec
import Test.Hspec.BenchGolden

main :: IO ()
main = hspec $ do
  describe "Performance" $ do
    -- Simple benchmark (100 iterations, ±15% tolerance)
    benchGolden "list append" $
      return $ [1..1000] ++ [1..1000]

    -- Custom configuration
    benchGoldenWith defaultBenchConfig
      { iterations = 500
      , tolerancePercent = 10.0
      }
      "sorting" $
      return $ sort [1000, 999..1]

The first run creates .golden/<arch>/list-append.golden with baseline stats.
Subsequent runs compare against that baseline. A test fails if the mean time changes by more than 15% (configurable).

Update baselines after intentional changes:

GOLDS_GYM_ACCEPT=1 stack test

How It Works

Golden files store timing statistics per architecture (e.g., .golden/aarch64-darwin-Apple_M1/):

{
  "mean": 1.234,
  "stddev": 0.056,
  "median": 1.201,
  "architecture": "aarch64-darwin-Apple_M1",
  "timestamp": "2026-01-30T12:00:00Z"
}

Hybrid tolerance (default) prevents false failures: benchmarks pass if within ±15% OR ±0.01ms. This handles measurement noise for fast operations (<1ms) while catching real regressions for slower code.
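
The hybrid rule can be sketched as a small pure function. This is only an illustration of the "±15% OR ±0.01ms" logic described above: `withinHybrid` is a hypothetical helper, not part of the golds-gym API, and it assumes all times are in milliseconds.

```haskell
-- Hypothetical helper illustrating the hybrid rule; not the library's code.
-- All times are assumed to be in milliseconds.
withinHybrid
  :: Double  -- percentage tolerance, e.g. 15.0
  -> Double  -- absolute tolerance in ms, e.g. 0.01
  -> Double  -- baseline mean (ms)
  -> Double  -- measured mean (ms)
  -> Bool
withinHybrid pct absMs baseline actual =
  withinPercent || withinAbsolute
  where
    diff           = abs (actual - baseline)
    -- relative check: the difference is at most pct% of the baseline
    withinPercent  = diff <= abs baseline * pct / 100
    -- absolute check: the difference is at most absMs milliseconds
    withinAbsolute = diff <= absMs
```

Under these assumptions, `withinHybrid 15 0.01 0.005 0.009` is `True`: the change is 80% in relative terms, but only 0.004ms in absolute terms, so noise on a very fast operation does not fail the test. `withinHybrid 15 0.01 10 12` is `False`, because a 2ms change on a 10ms baseline exceeds both bounds.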

Configuration

Key BenchConfig options:

Field	Default	Description
`iterations`	100	Number of benchmark iterations
`tolerancePercent`	15.0	Allowed mean time deviation (%)
`absoluteToleranceMs`	Just 0.01	Absolute tolerance (ms) - enables hybrid mode
`useRobustStatistics`	False	Use trimmed mean/MAD instead of mean/stddev
`warmupIterations`	5	Warm-up runs before measurement

See BenchConfig type for all options.

Environment variables:

GOLDS_GYM_ACCEPT=1 - Regenerate all golden files
GOLDS_GYM_SKIP=1 - Skip benchmarks entirely (useful in CI)

Advanced: Robust Statistics

Standard mean/stddev are sensitive to outliers (GC pauses, OS scheduling). Robust statistics provide outlier-resistant comparisons:

benchGoldenWith defaultBenchConfig
  { useRobustStatistics = True  -- Use trimmed mean + MAD
  , trimPercent = 10.0          -- Remove top/bottom 10%
  , outlierThreshold = 3.0      -- Flag outliers >3 MADs from median
  }
  "noisy benchmark" $
  return $ computation input

When to use:

Benchmarking in noisy environments (shared CI, development machines)
Operations with occasional GC pauses or system interruptions
Fast operations (<1ms) with high variance
Test output that warns about outliers
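
To make the terms concrete, here is a minimal sketch of a trimmed mean, the median absolute deviation (MAD), and MAD-based outlier flagging. All function names are hypothetical, and the details are assumptions rather than the library's implementation: the input is non-empty, the trim percentage is below 50, and the MAD is left unscaled.

```haskell
import Data.List (sort)

-- Plain arithmetic mean (assumes a non-empty list).
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Median of a non-empty list.
median :: [Double] -> Double
median xs
  | odd n     = s !! mid
  | otherwise = (s !! (mid - 1) + s !! mid) / 2
  where
    s   = sort xs
    n   = length s
    mid = n `div` 2

-- Mean after dropping pct% of the samples from each end (assumes pct < 50).
trimmedMean :: Double -> [Double] -> Double
trimmedMean pct xs = mean kept
  where
    sorted = sort xs
    n      = length sorted
    k      = floor (pct * fromIntegral n / 100) :: Int
    kept   = take (n - 2 * k) (drop k sorted)

-- Median absolute deviation from the median (unscaled).
mad :: [Double] -> Double
mad xs = median [abs (x - m) | x <- xs]
  where m = median xs

-- Samples lying more than `threshold` MADs away from the median.
outliers :: Double -> [Double] -> [Double]
outliers threshold xs = [x | x <- xs, abs (x - m) > threshold * d]
  where
    m = median xs
    d = mad xs
```

For the ten samples `[1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 9.0]`, the plain mean is about 1.8 because of the single 9.0 spike, while `trimmedMean 10` gives about 1.025 and `outliers 3.0` flags only `9.0`.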

Advanced: Lens-Based Expectations

For fine-grained control, use lens-based expectations to assert custom performance requirements:

import Test.Hspec.BenchGolden.Lenses

-- Compare by median instead of mean (more robust)
benchGoldenWithExpectation "median comparison" defaultBenchConfig
  [expect _statsMedian (Percent 10.0)]
  myAction

-- Compose multiple requirements (both must pass)
benchGoldenWithExpectation "strict requirements" defaultBenchConfig
  [ expect _statsMean (Percent 15.0) &&~
    expect _statsIQR (Absolute 0.1)     -- Low variance required
  ]
  myAction

Available lenses: _statsMean, _statsMedian, _statsTrimmedMean, _statsStddev, _statsMAD, _statsIQR, _statsMin, _statsMax

Tolerance types:

Percent 15.0 - Within ±15%
Absolute 0.01 - Within ±0.01ms
Hybrid 15.0 0.01 - Within ±15% OR ±0.01ms
MustImprove 10.0 - Must be ≥10% faster (for testing optimizations)
MustRegress 5.0 - Must be ≥5% slower (for accepting controlled regressions)

Composition: (&&~) for AND, (||~) for OR
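
One way to read the tolerance types and their composition is as predicates over a baseline value and a measured value. The sketch below is a simplified model written for illustration only: `Tol`, `satisfies`, `both`, and `anyOf` are hypothetical names, not the types or operators exported by golds-gym, and the model assumes a positive baseline and that "faster" means a smaller time.

```haskell
-- Simplified, hypothetical model of the tolerance types listed above.
data Tol
  = PercentT Double         -- within ±p%
  | AbsoluteT Double        -- within ±a ms
  | HybridT Double Double   -- within ±p% OR ±a ms
  | MustImproveT Double     -- at least p% faster
  | MustRegressT Double     -- at least p% slower

-- Relative change in percent; positive means slower than the baseline.
pctChange :: Double -> Double -> Double
pctChange baseline actual = (actual - baseline) / baseline * 100

-- Does the measured value satisfy the tolerance against the baseline?
satisfies :: Tol -> Double -> Double -> Bool
satisfies tol baseline actual = case tol of
  PercentT p     -> abs change <= p
  AbsoluteT a    -> diff <= a
  HybridT p a    -> abs change <= p || diff <= a
  MustImproveT p -> change <= negate p
  MustRegressT p -> change >= p
  where
    change = pctChange baseline actual
    diff   = abs (actual - baseline)

-- A check compares a baseline value with a measured value.
type Check = Double -> Double -> Bool

-- AND-composition: both checks must pass.
both :: Check -> Check -> Check
both f g b a = f b a && g b a

-- OR-composition: at least one check must pass.
anyOf :: Check -> Check -> Check
anyOf f g b a = f b a || g b a
```

In this model, `satisfies (MustImproveT 10) 2.0 1.7` holds because 1.7ms is about 15% faster than 2.0ms, while `satisfies (MustImproveT 10) 2.0 1.9` does not, because a 5% improvement is below the required 10%.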

Documentation

API documentation - Full Haddock docs
Example benchmarks - Comprehensive usage examples
CHANGELOG - Version history and migration guides

License

MIT