Validation Evidence

This document provides verifiable evidence of temporalcv’s correctness for users who need to audit statistical computations. All evidence is reproducible via the test suite.


Executive Summary

Validation Layer

Tests

Purpose

Golden Reference

5 cases

Verify against R’s forecast package

Monte Carlo Calibration

20+ tests

Type I/II error rates

Property-Based

30+ tests

Statistical invariants (Hypothesis)

Anti-Pattern Detection

10+ tests

Leakage scenarios caught

Benchmark Suite

M4 Competition

4,773 series validation

Reproduction: All tests run via pytest tests/ -v --cov=temporalcv


1. Golden Reference Tests

Location: tests/test_golden_reference.py, tests/fixtures/golden_reference.json

Pre-computed values from R’s forecast::dm.test(). Any deviation from these frozen values fails CI.

DM Test Cases

Case

Scenario

Expected Behavior

case_001

Equal forecasters (null true)

p-value in [0.4, 0.6] range

case_002

Model 1 clearly better

Statistic < -2, low p-value

case_003

Multi-step with HAC

h=3 correction applied

Wild Bootstrap Cases

Case

Scenario

Expected Behavior

case_001

Positive mean fold stats

CI excludes zero → reject

case_002

Zero-mean fold stats

CI includes zero → fail to reject

Regeneration: tests/cross_validation/r_reference/generate_reference.R


2. Monte Carlo Calibration

Location: tests/monte_carlo/

These tests run 400-500 simulations each to verify statistical properties.

DM Test Type I Error Control

Test

Simulations

Target

Acceptance Range

test_dm_null_fail_to_reject_rate

500

90-98%

Must not reject true null

test_dm_type_i_error_control

500

~5%

3-7% Type I error

test_dm_type_i_with_autocorrelated_errors

500

85-99%

HAC correction works

DM Test Power Under Alternative

Test

Effect Size

Expected Power

test_dm_power_moderate_difference

0.8/1.2 MAE ratio

> 40%

test_dm_power_small_difference

0.9/1.1 MAE ratio

> 10%

test_dm_power_large_sample

n=300

> 70%

Conformal Prediction Coverage

Test

Target Coverage

Acceptance Range

test_coverage_95_homoscedastic

95%

93-99%

test_coverage_90_homoscedastic

90%

85-99%

test_small_calibration_set

95% (n=30)

88-99%

test_large_calibration_set

95% (n=150)

92-99%

Wild Bootstrap Coverage

Test

Folds

Acceptance Range

test_type_i_error_5_folds

5

1-15%

test_type_i_error_10_folds

10

2-12%

test_type_i_error_20_folds

20

2-10%


3. Property-Based Tests (Hypothesis)

Location: tests/property/

Uses Hypothesis for exhaustive property testing.

Gate Composition Invariants

# These properties hold for ALL valid gate inputs:
- HALT dominates WARN dominates PASS
- Composed report status = max(individual statuses)
- Report contains all input gates (no loss)
- Gate names preserved in report

Suspicious Improvement Gate

# These properties are verified:
- Zero or negative improvement → never HALT
- >90% improvement → always HALT
- Threshold boundary respected exactly

CV Split Invariants

# Verified across n_samples ∈ [50, 200], n_splits ∈ [3, 10]:
- No train/test overlap
- Temporal order preserved (max(train) < min(test))
- Gap parameter respected

4. Anti-Pattern Detection Tests

Location: tests/anti_patterns/

Tests that intentionally introduce leakage to verify gates catch it.

Bug Category

Test Method

Expected Result

Lag leakage

Compute lag on full series

Gate HALT

Threshold leakage

Percentiles on future data

Gate HALT

Missing gap

train_end == test_start

Gate HALT

Feature selection on target

Correlation with y[full]

Gate HALT

Regime lookahead

Use future regime labels

Gate HALT

These tests prove the gates work by verifying they catch known bugs.


5. Synthetic AR(1) Bounds

Location: tests/validation/test_synthetic_ar1.py

For AR(1) with known parameters (φ, σ), the theoretical minimum MAE is:

MAE_optimal = σ × √(2/π)

Test

Predictor

Expected Behavior

test_mean_predictor_passes

Unconditional mean

MAE >> optimal → PASS

test_optimal_predictor_passes

AR(1) with true φ

MAE ≈ optimal → PASS

test_different_phi_values

φ ∈ {0.1, 0.5, 0.9, 0.99}

All pass bounds

Implication: If your model beats the theoretical bound, it’s overfitting or has leakage.


6. Benchmark Results

Location: docs/benchmarks.md, benchmarks/results/

M4 Competition Validation

Metric

Value

Series

4,773

Frequencies

Yearly, Quarterly, Monthly, Weekly, Daily, Hourly

Models Compared

9 (Naive, AutoARIMA, AutoETS, AutoTheta, etc.)

Runtime

14.3 minutes (128-core AMD EPYC)

Key Findings

Finding

Evidence

AutoETS most robust

Wins 3/6 frequencies

AutoARIMA best mean MAE

-12.9% vs Naive

Daily is hardest

Naive wins (complex models overfit)

Hourly benefits from ARIMA

-31% vs Naive


7. Academic Citations

All statistical tests cite foundational papers:

Test

Primary Reference

Correction

Diebold-Mariano

Diebold & Mariano (1995)

Harvey (1997) small-sample

Pesaran-Timmermann

Pesaran & Timmermann (1992)

Giacomini-White

Giacomini & White (2006)

Conditional ability

Clark-West

Clark & West (2007)

Nested models

HAC Variance

Newey-West (1987)

Autocorrelation-robust

Block Bootstrap

Künsch (1989), Politis & Romano (1994)

Preserve dependence


8. Known Caveats

These are documented limitations, not bugs:

Caveat

Location

Mitigation

3-class PT test variance is heuristic

statistical_tests.py

Runtime warning emitted

Conformal coverage is approximate for time series

conformal.py

Documented in docstring

gate_signal_verification HALT is expected for valid models

gates.py

Interpretation guide in docstring


9. Reproduction Instructions

Run Full Test Suite

# All tests (fast)
pytest tests/ -v

# With coverage report
pytest tests/ --cov=temporalcv --cov-report=html

# Monte Carlo tests only (slow, ~10 min)
pytest tests/monte_carlo/ -v -m monte_carlo

Regenerate R Reference Values

cd tests/cross_validation/r_reference/
Rscript generate_reference.R

Run Benchmarks

python -m temporalcv.benchmarks.run --dataset m4_subset --output results/

10. Audit Checklist

For users auditing this library:

  • Run pytest tests/test_golden_reference.py -v — Verify R agreement

  • Run pytest tests/monte_carlo/ -v -m monte_carlo — Verify calibration

  • Check tests/fixtures/golden_reference.json — Review frozen values

  • Read docs/benchmarks.md — Verify benchmark claims

  • Search for [T3] tags — Review heuristic components


Last Updated: 2026-04-29 • Tests: 1,943 passing, 15 skipped • Coverage: 86% (5,898 statements, 1,956 branches) • Runtime: ~80 s