Validation Evidence¶
This document provides verifiable evidence of temporalcv’s correctness for users who need to audit statistical computations. All evidence is reproducible via the test suite.
Executive Summary¶
Validation Layer |
Tests |
Purpose |
|---|---|---|
Golden Reference |
5 cases |
Verify against R’s |
Monte Carlo Calibration |
20+ tests |
Type I/II error rates |
Property-Based |
30+ tests |
Statistical invariants (Hypothesis) |
Anti-Pattern Detection |
10+ tests |
Leakage scenarios caught |
Benchmark Suite |
M4 Competition |
4,773 series validation |
Reproduction: All tests run via pytest tests/ -v --cov=temporalcv
1. Golden Reference Tests¶
Location: tests/test_golden_reference.py, tests/fixtures/golden_reference.json
Pre-computed values from R’s forecast::dm.test(). Any deviation from these frozen values fails CI.
DM Test Cases¶
Case |
Scenario |
Expected Behavior |
|---|---|---|
|
Equal forecasters (null true) |
p-value in [0.4, 0.6] range |
|
Model 1 clearly better |
Statistic < -2, low p-value |
|
Multi-step with HAC |
h=3 correction applied |
Wild Bootstrap Cases¶
Case |
Scenario |
Expected Behavior |
|---|---|---|
|
Positive mean fold stats |
CI excludes zero → reject |
|
Zero-mean fold stats |
CI includes zero → fail to reject |
Regeneration: tests/cross_validation/r_reference/generate_reference.R
2. Monte Carlo Calibration¶
Location: tests/monte_carlo/
These tests run 400-500 simulations each to verify statistical properties.
DM Test Type I Error Control¶
Test |
Simulations |
Target |
Acceptance Range |
|---|---|---|---|
|
500 |
90-98% |
Must not reject true null |
|
500 |
~5% |
3-7% Type I error |
|
500 |
85-99% |
HAC correction works |
DM Test Power Under Alternative¶
Test |
Effect Size |
Expected Power |
|---|---|---|
|
0.8/1.2 MAE ratio |
> 40% |
|
0.9/1.1 MAE ratio |
> 10% |
|
n=300 |
> 70% |
Conformal Prediction Coverage¶
Test |
Target Coverage |
Acceptance Range |
|---|---|---|
|
95% |
93-99% |
|
90% |
85-99% |
|
95% (n=30) |
88-99% |
|
95% (n=150) |
92-99% |
Wild Bootstrap Coverage¶
Test |
Folds |
Acceptance Range |
|---|---|---|
|
5 |
1-15% |
|
10 |
2-12% |
|
20 |
2-10% |
3. Property-Based Tests (Hypothesis)¶
Location: tests/property/
Uses Hypothesis for exhaustive property testing.
Gate Composition Invariants¶
# These properties hold for ALL valid gate inputs:
- HALT dominates WARN dominates PASS
- Composed report status = max(individual statuses)
- Report contains all input gates (no loss)
- Gate names preserved in report
Suspicious Improvement Gate¶
# These properties are verified:
- Zero or negative improvement → never HALT
- >90% improvement → always HALT
- Threshold boundary respected exactly
CV Split Invariants¶
# Verified across n_samples ∈ [50, 200], n_splits ∈ [3, 10]:
- No train/test overlap
- Temporal order preserved (max(train) < min(test))
- Gap parameter respected
4. Anti-Pattern Detection Tests¶
Location: tests/anti_patterns/
Tests that intentionally introduce leakage to verify gates catch it.
Bug Category |
Test Method |
Expected Result |
|---|---|---|
Lag leakage |
Compute lag on full series |
Gate HALT |
Threshold leakage |
Percentiles on future data |
Gate HALT |
Missing gap |
train_end == test_start |
Gate HALT |
Feature selection on target |
Correlation with y[full] |
Gate HALT |
Regime lookahead |
Use future regime labels |
Gate HALT |
These tests prove the gates work by verifying they catch known bugs.
5. Synthetic AR(1) Bounds¶
Location: tests/validation/test_synthetic_ar1.py
For AR(1) with known parameters (φ, σ), the theoretical minimum MAE is:
MAE_optimal = σ × √(2/π)
Test |
Predictor |
Expected Behavior |
|---|---|---|
|
Unconditional mean |
MAE >> optimal → PASS |
|
AR(1) with true φ |
MAE ≈ optimal → PASS |
|
φ ∈ {0.1, 0.5, 0.9, 0.99} |
All pass bounds |
Implication: If your model beats the theoretical bound, it’s overfitting or has leakage.
6. Benchmark Results¶
Location: docs/benchmarks.md, benchmarks/results/
M4 Competition Validation¶
Metric |
Value |
|---|---|
Series |
4,773 |
Frequencies |
Yearly, Quarterly, Monthly, Weekly, Daily, Hourly |
Models Compared |
9 (Naive, AutoARIMA, AutoETS, AutoTheta, etc.) |
Runtime |
14.3 minutes (128-core AMD EPYC) |
Key Findings¶
Finding |
Evidence |
|---|---|
AutoETS most robust |
Wins 3/6 frequencies |
AutoARIMA best mean MAE |
-12.9% vs Naive |
Daily is hardest |
Naive wins (complex models overfit) |
Hourly benefits from ARIMA |
-31% vs Naive |
7. Academic Citations¶
All statistical tests cite foundational papers:
Test |
Primary Reference |
Correction |
|---|---|---|
Diebold-Mariano |
Diebold & Mariano (1995) |
Harvey (1997) small-sample |
Pesaran-Timmermann |
Pesaran & Timmermann (1992) |
— |
Giacomini-White |
Giacomini & White (2006) |
Conditional ability |
Clark-West |
Clark & West (2007) |
Nested models |
HAC Variance |
Newey-West (1987) |
Autocorrelation-robust |
Block Bootstrap |
Künsch (1989), Politis & Romano (1994) |
Preserve dependence |
8. Known Caveats¶
These are documented limitations, not bugs:
Caveat |
Location |
Mitigation |
|---|---|---|
3-class PT test variance is heuristic |
|
Runtime warning emitted |
Conformal coverage is approximate for time series |
|
Documented in docstring |
|
|
Interpretation guide in docstring |
9. Reproduction Instructions¶
Run Full Test Suite¶
# All tests (fast)
pytest tests/ -v
# With coverage report
pytest tests/ --cov=temporalcv --cov-report=html
# Monte Carlo tests only (slow, ~10 min)
pytest tests/monte_carlo/ -v -m monte_carlo
Regenerate R Reference Values¶
cd tests/cross_validation/r_reference/
Rscript generate_reference.R
Run Benchmarks¶
python -m temporalcv.benchmarks.run --dataset m4_subset --output results/
10. Audit Checklist¶
For users auditing this library:
Run
pytest tests/test_golden_reference.py -v— Verify R agreementRun
pytest tests/monte_carlo/ -v -m monte_carlo— Verify calibrationCheck
tests/fixtures/golden_reference.json— Review frozen valuesRead
docs/benchmarks.md— Verify benchmark claimsSearch for
[T3]tags — Review heuristic components
Last Updated: 2026-04-29 • Tests: 1,943 passing, 15 skipped • Coverage: 86% (5,898 statements, 1,956 branches) • Runtime: ~80 s