golds-gym: Golden testing framework for performance benchmarks

[ library, mit, testing ]

A Haskell framework for golden testing of timing benchmarks. Benchmarks are saved to golden files on first run and compared against on subsequent runs. Golden files are architecture-specific to account for hardware differences. Based on hspec and benchpress.


Versions [RSS] 0.1.0.0, 0.2.0.0
Change log CHANGELOG.md
Dependencies aeson (>=2.0 && <3), base (>=4.14 && <5), benchpress (>=0.2 && <0.3), boxes (>=0.1 && <0.2), bytestring (>=0.10 && <0.13), directory (>=1.3 && <2), filepath (>=1.4 && <2), hspec-core (>=2.10 && <3), process (>=1.6 && <2), statistics (>=0.16 && <0.17), text (>=1.2 && <3), time (>=1.9 && <2), vector (>=0.12 && <0.14) [details]
License MIT
Author Marco Zocca
Maintainer @ocramz
Uploaded by ocramz at 2026-01-30T15:14:49Z
Category Testing
Distributions
Downloads 2 total (2 in the last 30 days)
Rating (no votes yet) [estimated by Bayesian average]
Status Docs available [build log]
Last success reported on 2026-01-30 [all 1 reports]

Readme for golds-gym-0.2.0.0


golds-gym 🏋️


A Haskell golden testing framework for performance benchmarks.

Overview

golds-gym allows you to define timing benchmarks that are saved to golden files the first time they run. On subsequent runs, new benchmark results are compared against the golden baselines using configurable tolerance thresholds.

Key Features:

  • Architecture-specific golden files (different baselines per CPU/OS)
  • Configurable tolerance for mean time comparison
  • Robust statistics mode (trimmed mean, MAD, outlier detection)
  • Optional variance (stddev) warnings
  • Configurable warm-up iterations
  • JSON-based golden files for easy inspection
  • Seamless integration with hspec

Usage

import Data.List (sort)
import Test.Hspec
import Test.Hspec.BenchGolden

main :: IO ()
main = hspec $ do
  describe "Performance" $ do
    -- Simple benchmark with defaults (100 iterations, 15% tolerance)
    benchGolden "list append" $
      return $ [1..1000] ++ [1..1000]

    -- Benchmark with custom configuration
    benchGoldenWith defaultBenchConfig
      { iterations = 500
      , tolerancePercent = 10.0
      , warmupIterations = 10
      , warnOnVarianceChange = True
      }
      "sorting" $
      return $ sort [1000, 999..1]

    -- Robust statistics mode (outlier detection, trimmed mean)
    benchGoldenWith defaultBenchConfig
      { useRobustStatistics = True
      , trimPercent = 10.0
      , outlierThreshold = 3.0
      , tolerancePercent = 10.0
      }
      "robust benchmark" $
      return $ expensiveComputation input

    -- IO benchmark
    benchGolden "file operations" $ do
      writeFile "/tmp/test" "hello"
      readFile "/tmp/test"

Golden Files

Golden files are stored in .golden/<architecture>/ with the following structure:

.golden/
├── aarch64-darwin-Apple_M1/
│   ├── list-append.golden
│   ├── list-append.actual
│   └── sorting.golden
└── x86_64-linux-Intel_Core_i7/
    └── list-append.golden

Each .golden file contains JSON with timing statistics:

{
  "mean": 1.234,
  "stddev": 0.056,
  "median": 1.201,
  "min": 1.100,
  "max": 1.456,
  "percentiles": [[50, 1.201], [90, 1.350], [99, 1.440]],
  "architecture": "aarch64-darwin-Apple_M1",
  "timestamp": "2026-01-30T12:00:00Z",
  "trimmedMean": 1.220,
  "mad": 0.042,
  "iqr": 0.085,
  "outliers": [1.456]
}
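
Because golden files are plain JSON, they are easy to inspect or post-process. The sketch below is not part of the library (inspectGolden is a hypothetical helper); it simply decodes a file into aeson's generic Value type, using the example path from the tree above.

import qualified Data.Aeson as A
import qualified Data.ByteString.Lazy as BL

-- Decode a golden file into aeson's generic Value and print it.
inspectGolden :: FilePath -> IO ()
inspectGolden path = do
  bytes <- BL.readFile path
  case A.eitherDecode bytes :: Either String A.Value of
    Left err  -> putStrLn ("parse error: " ++ err)
    Right val -> print val

main :: IO ()
main = inspectGolden ".golden/aarch64-darwin-Apple_M1/list-append.golden"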

Updating Baselines

To regenerate golden files (after intentional performance changes):

GOLDS_GYM_ACCEPT=1 cabal test
# Or with stack:
GOLDS_GYM_ACCEPT=1 stack test

Configuration

BenchConfig Options

Field                      Default     Description
iterations                 100         Number of benchmark iterations
warmupIterations           5           Warm-up runs (discarded)
tolerancePercent           15.0        Allowed mean time deviation (%)
absoluteToleranceMs        Just 0.01   Minimum absolute tolerance in milliseconds (hybrid tolerance)
warnOnVarianceChange       True        Warn if stddev changes significantly
varianceTolerancePercent   50.0        Allowed stddev deviation (%)
outputDir                  ".golden"   Directory for golden files
failOnFirstRun             False       Fail if no baseline exists
useRobustStatistics        False       Use robust statistics (trimmed mean, MAD)
trimPercent                10.0        Percentage to trim from each tail (%)
outlierThreshold           3.0         MAD multiplier for outlier detection
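
These fields can be combined into a project-wide configuration that is reused across benchmarks. A sketch, where only the field names come from the table above and the chosen values are illustrative:

-- Illustrative project-wide defaults; adjust values to your own workload.
projectConfig :: BenchConfig
projectConfig = defaultBenchConfig
  { iterations          = 200
  , warmupIterations    = 10
  , tolerancePercent    = 12.0
  , absoluteToleranceMs = Just 0.02
  , useRobustStatistics = True
  }

-- Then: benchGoldenWith projectConfig "my benchmark" $ ...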

Hybrid Tolerance Strategy

New in v0.2.0: Hybrid tolerance prevents false failures from measurement noise.

The framework uses BOTH percentage and absolute tolerance by default:

Benchmark passes if:
  (|mean change| <= 15%) OR (|absolute time difference| <= 0.01 ms)

Why Hybrid Tolerance?

For extremely fast operations (< 1ms), tiny measurement noise causes huge percentage variations:

  • Baseline: 0.001 ms
  • Actual: 0.0015 ms
  • Percentage difference: +50% ❌ (fails with 15% tolerance)
  • Absolute difference: +0.0005 ms ✅ (negligible, within 0.01ms tolerance)

The hybrid approach automatically handles this:

  • Fast operations (< 1ms): Absolute tolerance dominates → noise ignored
  • Slow operations (> 1ms): Percentage tolerance dominates → regressions caught
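
Concretely, the rule above can be expressed as a small predicate. The sketch below is illustrative only (the name passesHybrid and its signature are not part of the library's API); times are in milliseconds, as elsewhere in this README.

-- Illustrative sketch of the hybrid pass/fail rule, not the library's internals.
passesHybrid
  :: Double        -- baseline mean (ms)
  -> Double        -- actual mean (ms)
  -> Double        -- tolerancePercent
  -> Maybe Double  -- absoluteToleranceMs
  -> Bool
passesHybrid baseline actual tolPct absTol =
  withinPercent || withinAbsolute
  where
    diff           = abs (actual - baseline)
    withinPercent  = diff / baseline * 100 <= tolPct
    withinAbsolute = maybe False (diff <=) absTol

-- From the example above: passesHybrid 0.001 0.0015 15 (Just 0.01) == True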

Configuration Examples

Default (hybrid tolerance):

benchGolden "fast operation" $ do
  return $ sum [1..100]

Passes if within ±15% or ±0.01ms (10 microseconds).

Percentage-only (disable absolute tolerance):

benchGoldenWith defaultBenchConfig
  { absoluteToleranceMs = Nothing
  , tolerancePercent = 20.0
  }
  "long operation" $ do
  return $ expensiveComputation input

Traditional percentage-only comparison.

Strict absolute tolerance:

benchGoldenWith defaultBenchConfig
  { absoluteToleranceMs = Just 0.001  -- 1 microsecond
  , tolerancePercent = 10.0
  }
  "performance-critical" $ do
  return $ criticalPath input

Very strict for performance-critical code.

Relaxed tolerance for noisy CI:

benchGoldenWith defaultBenchConfig
  { absoluteToleranceMs = Just 0.1  -- 100 microseconds
  , tolerancePercent = 25.0
  }
  "ci benchmark" $ do
  return $ computation input

More forgiving for shared CI runners.

Architecture Detection

The framework automatically detects:

  • CPU architecture (x86_64, aarch64)
  • Operating system (darwin, linux, windows)
  • CPU model (Apple M1, Intel Core i7, etc.)

This ensures benchmarks are only compared against baselines from equivalent hardware.
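
As a rough sketch of how such an identifier could be assembled (the library's actual detection may differ, in particular in how the CPU model string is obtained on each OS):

import System.Info (arch, os)   -- e.g. "aarch64" and "darwin", from base
import Data.Char (isAlphaNum)

-- Build an identifier such as "aarch64-darwin-Apple_M1" from a CPU model
-- string (e.g. read from sysctl on macOS or /proc/cpuinfo on Linux).
archId :: String -> String
archId cpuModel = arch ++ "-" ++ os ++ "-" ++ sanitize cpuModel
  where
    sanitize = map (\c -> if isAlphaNum c then c else '_')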

Robust Statistics

New in v0.1.0: Robust statistical methods for more reliable benchmark comparisons.

Why Use Robust Statistics?

Standard mean and standard deviation are sensitive to outliers. A single anomalous timing (e.g., from GC, OS scheduling) can skew results. Robust statistics provide:

  • Trimmed Mean: Removes extreme values before averaging
  • MAD (Median Absolute Deviation): Outlier-resistant measure of variance
  • Outlier Detection: Identifies and reports anomalous timings
  • IQR (Interquartile Range): Spread of the middle 50% of data

Enabling Robust Mode

benchGoldenWith defaultBenchConfig
  { useRobustStatistics = True  -- Enable robust statistics
  , trimPercent = 10.0          -- Trim 10% from each tail
  , outlierThreshold = 3.0      -- Outliers are 3+ MADs from median
  , tolerancePercent = 10.0     -- Compare trimmed means
  }
  "my benchmark" $ do
  -- your code here

How It Works

  1. Trimmed Mean: Sorts all timing measurements, removes the top and bottom trimPercent, then computes the mean of remaining values.

  2. MAD Calculation: Computes median(|x - median(x)|), which is more robust than the standard deviation.

  3. Outlier Detection: Any measurement where |x - median| > outlierThreshold * MAD is flagged as an outlier.

  4. Comparison: When enabled, uses trimmed mean instead of mean for regression detection, and MAD instead of stddev for variance checks.
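
The sketches below illustrate steps 1-3. They are simplified reference implementations, not the library's code (for example, the median uses no interpolation and the list operations are naive):

import Data.List (sort)

-- Mean of the values remaining after dropping trimPct percent from each tail.
trimmedMean :: Double -> [Double] -> Double
trimmedMean trimPct xs = mean kept
  where
    sorted  = sort xs
    k       = floor (fromIntegral (length xs) * trimPct / 100)
    kept    = take (length sorted - 2 * k) (drop k sorted)
    mean ys = sum ys / fromIntegral (length ys)

-- Median absolute deviation: median of |x - median xs|.
mad :: [Double] -> Double
mad xs = median [abs (x - m) | x <- xs]
  where m = median xs

median :: [Double] -> Double
median xs
  | odd n     = s !! mid
  | otherwise = (s !! (mid - 1) + s !! mid) / 2
  where
    s   = sort xs
    n   = length s
    mid = n `div` 2

-- Measurements further than threshold * MAD from the median.
outliers :: Double -> [Double] -> [Double]
outliers threshold xs = [x | x <- xs, abs (x - m) > threshold * d]
  where
    m = median xs
    d = mad xs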

Outlier Warnings

When outliers are detected, you'll see warnings in test output:

Warnings:
  ⚠ 3 outlier(s) detected: 2.1ms 2.3ms 2.5ms

Outliers are reported but not removed; they are preserved in golden files for analysis.

When to Use Robust Statistics

Use robust statistics when:

  • Benchmarking in noisy environments (shared CI runners)
  • Operations subject to GC pauses or OS scheduling variability
  • Fast operations (< 1ms) with high relative variance (CV > 50%)
  • Sorting already-sorted data or other operations with occasional slowdowns
  • You see large max/stddev values with small mean times
  • You need more stable baselines across runs

Standard statistics may be better when:

  • Benchmarking isolated, long-running operations
  • You have dedicated benchmark hardware
  • Outliers are legitimate and should be tracked

Integration with CI

In CI environments, you may want to:

  1. Skip benchmarks (if CI is too noisy):

    GOLDS_GYM_SKIP=1 cabal test
    
  2. Use relaxed tolerance (for shared CI runners):

    benchGoldenWith defaultBenchConfig
      { tolerancePercent = 25.0
      , absoluteToleranceMs = Just 0.1  -- 100 microseconds
      }
      "benchmark" $ ...
    
  3. Enable robust statistics (outlier detection):

    benchGoldenWith defaultBenchConfig
      { useRobustStatistics = True
      , tolerancePercent = 20.0
      }
      "benchmark" $ ...
    

Troubleshooting

Random Test Failures Due to Measurement Noise

Symptom: Tests fail intermittently with small percentage increases despite negligible absolute time differences:

Mean time increased by 35.5% (tolerance: 15.0%)

Metric    Actual  Baseline    Diff
------    ------  --------    ----
Mean    0.001 ms  0.000 ms  +35.5%

Root Cause: Operations taking < 1ms have high relative measurement noise. A 0.0005ms difference is negligible but represents 50% variation.

Solutions:

  1. Use hybrid tolerance (default since v0.2.0):

    benchGolden "fast operation" $ ...
    

    The default absoluteToleranceMs = Just 0.01 prevents these failures.

  2. Adjust absolute tolerance threshold:

    benchGoldenWith defaultBenchConfig
      { absoluteToleranceMs = Just 0.001  -- Stricter: 1 microsecond
      }
      "very fast operation" $ ...
    
  3. Increase iterations for stability:

    benchGoldenWith defaultBenchConfig
      { iterations = 500  -- More samples reduce noise
      }
      "noisy operation" $ ...
    
  4. Use robust statistics:

    benchGoldenWith defaultBenchConfig
      { useRobustStatistics = True  -- Outlier-resistant
      , trimPercent = 10.0
      }
      "operation with outliers" $ ...
    

High Variance Warnings

Symptom: Warnings about variance changes despite passing benchmarks:

Warnings:
  ⚠ Variance increased by 65.2% (0.001 ms -> 0.002 ms, tolerance: 50.0%)

Solutions:

  1. Disable variance warnings (if not critical):

    benchGoldenWith defaultBenchConfig
      { warnOnVarianceChange = False
      }
      "benchmark" $ ...
    
  2. Increase variance tolerance:

    benchGoldenWith defaultBenchConfig
      { varianceTolerancePercent = 100.0  -- Allow ±100% stddev change
      }
      "benchmark" $ ...
    
  3. Use robust statistics (MAD instead of stddev):

    benchGoldenWith defaultBenchConfig
      { useRobustStatistics = True  -- Uses MAD, more stable
      }
      "benchmark" $ ...
    

Outlier Warnings

Symptom: Outliers detected in benchmark runs:

Warnings:
  ⚠ 3 outlier(s) detected: 2.1ms 2.3ms 2.5ms

Causes:

  • Garbage collection pauses
  • OS scheduling interruptions
  • CPU thermal throttling
  • Background processes

Solutions:

  1. Increase outlier threshold (less sensitive):

    benchGoldenWith defaultBenchConfig
      { useRobustStatistics = True
      , outlierThreshold = 5.0  -- More forgiving (default: 3.0)
      }
      "benchmark" $ ...
    
  2. Increase warm-up iterations:

    benchGoldenWith defaultBenchConfig
      { warmupIterations = 20  -- Stabilize before measurement
      }
      "benchmark" $ ...
    
  3. Minimize system load:

    • Close background applications
    • Disable system services during benchmarking
    • Use dedicated benchmark hardware

Benchmarks Pass Locally But Fail in CI

Cause: Different architecture or noisier environment.

Solutions:

  1. Architecture-specific baselines: Golden files are already per-architecture. Check that your CI architecture ID matches:

    GOLDS_GYM_ARCH=custom-ci-id cabal test
    
  2. Relaxed CI configuration:

    -- Requires the CPP extension ({-# LANGUAGE CPP #-}), with CI_BUILD
    -- defined via GHC options, e.g. ghc-options: -DCI_BUILD
    #ifdef CI_BUILD
    ciConfig :: BenchConfig
    ciConfig = defaultBenchConfig
      { tolerancePercent = 30.0
      , absoluteToleranceMs = Just 0.2
      , useRobustStatistics = True
      }
    #endif
    
  3. Skip benchmarks in CI:

    # .github/workflows/ci.yml
    - name: Run tests
      run: GOLDS_GYM_SKIP=1 stack test
    

Regenerating Golden Files

When to regenerate:

  • Intentional performance improvements/changes
  • Compiler upgrades affecting code generation
  • Architecture changes

How:

GOLDS_GYM_ACCEPT=1 stack test

Warning: Only regenerate when you've verified the performance change is expected!

License

MIT