Algorithm Decision Tree
Quick Reference
This guide helps you decide:
• Which cross-validation method suits your data
• Which performance metric suits your problem
• Which statistical test to use when comparing models
• What to do when a gate returns HALT
Part 1: Choosing Your Cross-Validation Method
Decision Flowchart
START: Do you have time series data?
│
├─ NO → Use standard sklearn KFold
│
└─ YES → Is your training data limited?
    │
    ├─ YES (small data) → Use WalkForwardCV (expanding window)
    │     • Training set grows with each fold
    │     • Maximizes training data usage
    │     • Best for: < 500 observations
    │
    └─ NO (large data) → Is the data generating process stable?
        │
        ├─ YES (stable) → Use SlidingWindowCV (fixed window)
        │     • Fixed-size training window
        │     • Drops observations older than the window
        │     • Best for: stable relationships
        │
        └─ NO (regime changes) → Does overlap matter?
            │
            ├─ YES (financial) → Use PurgedKFold + embargo
            │     • Removes label overlap
            │     • Gap prevents info leakage
            │     • Best for: finance, trading
            │
            └─ NO → Use BlockingTimeSeriesSplit
                  • No overlap between folds
                  • Simpler than purging
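To make the expanding-versus-fixed distinction in the flowchart concrete, here is a minimal pure-Python sketch of how the two kinds of split can be generated. It is an illustration only and not temporalcv's implementation; the function name `walk_forward_splits` and its parameters are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

Split = Tuple[List[int], List[int]]


def walk_forward_splits(
    n_samples: int,
    n_splits: int,
    test_size: int,
    gap: int = 0,
    max_train_size: Optional[int] = None,
) -> Iterator[Split]:
    """Yield (train_idx, test_idx) pairs in time order (hypothetical helper).

    max_train_size=None gives an expanding window; an integer caps the
    training window, which turns it into a sliding window.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_end = test_start - gap  # exclusive; `gap` rows are left unused
        train_start = 0
        if max_train_size is not None:
            train_start = max(0, train_end - max_train_size)
        yield (
            list(range(train_start, train_end)),
            list(range(test_start, test_start + test_size)),
        )


expanding = list(walk_forward_splits(20, n_splits=3, test_size=2, gap=1))
sliding = list(
    walk_forward_splits(20, n_splits=3, test_size=2, gap=1, max_train_size=5)
)
for train, test in expanding:
    print("expanding: train", train[0], "to", train[-1], "test", test)
for train, test in sliding:
    print("sliding:   train", train[0], "to", train[-1], "test", test)
```

In the expanding case the training set grows from fold to fold, while in the sliding case it keeps a constant length and its start moves forward with the test block.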
Quick Selection Table
| Scenario | CV Method | Key Parameters |
|---|---|---|
| Standard forecasting | | |
| Non-stationary data | | |
| Financial returns | | |
| Irregular timestamps | | |
| Multi-horizon | | |
Code Examples
from temporalcv import WalkForwardCV, SlidingWindowCV
from temporalcv.cv_financial import PurgedKFold

# Expanding window (default choice)
cv = WalkForwardCV(
    n_splits=5,
    min_train_periods=100,
    gap=1,  # 1-step ahead forecast
)

# Fixed window (for regime-changing data)
cv = SlidingWindowCV(
    train_size=252,  # 1 year of daily data
    test_size=21,    # 1 month test
    gap=5,           # 5-day forecast horizon
)

# Financial data (label overlap concerns)
cv = PurgedKFold(
    n_splits=5,
    embargo_days=5,
    pct_embargo=0.01,
)
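The idea behind purging and an embargo can be sketched without any library. The example below is a simplified illustration under stated assumptions, not the logic of `PurgedKFold`: the helper `purged_train_indices` and the `label_spans` representation are hypothetical. A training sample is purged when the time range that determines its label overlaps the range covered by the test fold, and it is embargoed when it starts within a few steps after that range.

```python
from typing import List, Sequence, Tuple


def purged_train_indices(
    label_spans: Sequence[Tuple[int, int]],
    test_idx: Sequence[int],
    embargo: int = 0,
) -> List[int]:
    """Return training indices that are safe for one test fold.

    label_spans[i] = (start, end) is the inclusive time range whose data
    determines the label of sample i (hypothetical representation).
    """
    test_set = set(test_idx)
    test_start = min(label_spans[i][0] for i in test_idx)
    test_end = max(label_spans[i][1] for i in test_idx)
    keep = []
    for i, (start, end) in enumerate(label_spans):
        if i in test_set:
            continue
        # Purge: the label range overlaps the range used by the test fold.
        overlaps = start <= test_end and end >= test_start
        # Embargo: the sample starts shortly after the test fold ends.
        embargoed = test_end < start <= test_end + embargo
        if not (overlaps or embargoed):
            keep.append(i)
    return keep


# Made-up data: sample i is labeled with a 3-step-ahead return,
# so its label depends on times i through i + 3.
spans = [(i, i + 3) for i in range(20)]
test_idx = list(range(8, 12))
train_idx = purged_train_indices(spans, test_idx, embargo=2)
print(train_idx)  # [0, 1, 2, 3, 4, 17, 18, 19]
```

Samples 5 to 7 and 12 to 14 are purged because their label ranges touch the test range of 8 to 14, and samples 15 and 16 fall inside the two-step embargo.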
Part 2: Choosing Your Performance Metric
Decision Flowchart
START: What are you predicting?
│
├─ Point forecast → Is the series highly autocorrelated?
│   │
│   ├─ YES (persistence > 0.9) → Use MASE or Skill Score
│   │     • Compares to naive (random walk) baseline
│   │     • MASE < 1 means better than naive
│   │
│   └─ NO → Use RMSE or MAE
│         • Consider asymmetric loss if errors have different costs
│         • RMSE penalizes large errors more heavily than MAE
│
├─ Direction/Sign → Use Directional Accuracy
│   │
│   └─ Also run PT test to verify statistical significance
│
├─ Quantile/Interval → Use Pinball Loss or Coverage
│   │
│   ├─ Point-in-interval: Use coverage (should be ~95% for a 95% prediction interval)
│   └─ Quantile accuracy: Use pinball loss (asymmetric)
│
└─ Probability → Use Log Loss or Brier Score
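As a reference for what the first two branches measure, here is a minimal sketch of MASE and directional accuracy in plain Python. The functions carry a `_sketch` suffix to make clear they are illustrations and not the temporalcv implementations, whose exact conventions (for example, how zeros are treated) may differ. MASE is scaled here by the in-sample error of the one-step naive forecast.

```python
from typing import Sequence


def mase_sketch(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    y_train: Sequence[float],
) -> float:
    """Forecast MAE divided by the in-sample MAE of the naive forecast."""
    mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)
    naive_mae = sum(
        abs(y_train[t] - y_train[t - 1]) for t in range(1, len(y_train))
    ) / (len(y_train) - 1)
    return mae / naive_mae


def directional_accuracy_sketch(
    true_changes: Sequence[float],
    pred_changes: Sequence[float],
) -> float:
    """Share of periods where predicted and realized changes share a sign.

    Zero changes are counted as misses in this sketch.
    """
    hits = sum(1 for a, p in zip(true_changes, pred_changes) if a * p > 0)
    return hits / len(true_changes)


# Made-up numbers for illustration.
y_train = [10.0, 12.0, 11.0, 13.0, 12.0]
y_test = [13.0, 14.0, 13.0]
y_hat = [12.5, 13.0, 13.5]
print(f"MASE: {mase_sketch(y_test, y_hat, y_train):.3f}")

true_changes = [1.0, -0.5, 2.0, -1.0]
pred_changes = [0.5, 0.2, 1.0, -0.3]
print(f"DA: {directional_accuracy_sketch(true_changes, pred_changes):.0%}")
```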
Metric Selection Table
| Prediction Type | Primary Metric | Why |
|---|---|---|
| Price level | MASE | Naive-adjusted, scale-free |
| Returns | Sharpe-weighted MSE | Penalizes volatility-adjusted errors |
| Direction (up/down) | Directional Accuracy | + PT test for significance |
| Volatility | QLIKE | Penalizes underestimation |
| Quantiles | Pinball Loss | Proper scoring rule |
| Intervals | Coverage + Width | Both calibration and precision |
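The last two rows of the table can be illustrated with a short sketch of pinball loss and of coverage with average width. These helpers are hypothetical and written for clarity; they are not the library's functions.

```python
from typing import Sequence, Tuple


def pinball_loss_sketch(
    y_true: Sequence[float], q_pred: Sequence[float], tau: float
) -> float:
    """Average pinball (quantile) loss for quantile level tau."""
    total = 0.0
    for y, q in zip(y_true, q_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by 1 - tau.
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def coverage_and_width(
    y_true: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> Tuple[float, float]:
    """Return (share of points inside the interval, mean interval width)."""
    n = len(y_true)
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return inside / n, width


# Made-up numbers for illustration.
y = [1.0, 2.0, 3.0, 4.0]
q90 = [1.5, 1.5, 3.5, 3.0]
loss = pinball_loss_sketch(y, q90, tau=0.9)

lower = [0.5, 2.5, 2.0, 3.0]
upper = [1.5, 3.5, 4.0, 5.0]
coverage, width = coverage_and_width(y, lower, upper)
print(f"pinball={loss:.4f} coverage={coverage:.0%} width={width:.2f}")
```

Reporting width next to coverage matters because an interval can always reach its nominal coverage by being made wider.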
Code Examples
from temporalcv.persistence import compute_persistence_metrics
from temporalcv.metrics import mase, directional_accuracy, pinball_loss
from temporalcv.statistical_tests import pt_test

# For persistent series (stock prices, etc.)
metrics = compute_persistence_metrics(y_test, predictions)
print(f"MASE: {metrics['mase']:.3f}")  # < 1 is better than naive

# For direction prediction
da = directional_accuracy(y_test, predictions)
pt_result = pt_test(y_test, predictions)
print(f"Directional Accuracy: {da:.1%}")
print(f"PT test p-value: {pt_result.pvalue:.4f}")

# For quantile forecasts
loss = pinball_loss(y_test, quantile_preds, tau=0.95)
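For readers who want to see what a PT test computes, the sketch below follows the Pesaran-Timmermann (1992) statistic: it compares the observed hit rate with the hit rate expected if predicted and realized signs were independent. It is an illustration written under that assumption, not the code behind `pt_test`; the function name is hypothetical and the p-value uses a one-sided normal approximation.

```python
import math
from typing import Sequence, Tuple


def pt_statistic_sketch(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Tuple[float, float]:
    """Return (PT statistic, one-sided p-value) for sign predictions."""
    n = len(y_true)
    p_y = sum(1 for y in y_true if y > 0) / n  # share of realized ups
    p_x = sum(1 for x in y_pred if x > 0) / n  # share of predicted ups
    p_hat = sum(
        1 for y, x in zip(y_true, y_pred) if (y > 0) == (x > 0)
    ) / n
    # Expected hit rate if predictions and outcomes were independent.
    p_star = p_y * p_x + (1 - p_y) * (1 - p_x)
    var_hat = p_star * (1 - p_star) / n
    var_star = (
        (2 * p_y - 1) ** 2 * p_x * (1 - p_x) / n
        + (2 * p_x - 1) ** 2 * p_y * (1 - p_y) / n
        + 4 * p_y * p_x * (1 - p_y) * (1 - p_x) / n**2
    )
    stat = (p_hat - p_star) / math.sqrt(var_hat - var_star)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2))  # upper-tail normal
    return stat, p_value


# Made-up data: 6 of 8 signs are predicted correctly.
realized = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0]
predicted = [1.0, -1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
stat, p_value = pt_statistic_sketch(realized, predicted)
print(f"PT statistic: {stat:.3f}, p-value: {p_value:.4f}")
```

The statistic is undefined when all realized or all predicted signs are the same, since the variance term is then zero.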
Part 3: Choosing Your Statistical Test
Decision Flowchart
START: Comparing forecast accuracy?
│
├─ Equal complexity models → Use DM test
│   │
│   └─ Is h > 1 (multi-step)? → Apply Harvey adjustment
│         result = dm_test(e1, e2, h=h, harvey_correction=True)
│
├─ Nested models (Model B = Model A + features)
│   │
│   └─ Use Clark-West test (CW)
│         • DM test is biased for nested models
│         • CW corrects for the noise added by estimating the larger model's extra parameters under the null
│
├─ Multiple models (>2) → Use Model Confidence Set (MCS)
│   │
│   └─ Returns set of "best" models (statistically equivalent)
│
└─ Conditional accuracy → Use Giacomini-White test
    │
    └─ Tests if accuracy varies by conditioning variable
          • E.g., "Is model A better during recessions?"
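The DM branch with the Harvey adjustment can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of `dm_test`: it uses squared-error loss, a rectangular kernel with lags up to h - 1 for the long-run variance, and a normal approximation for the p-value (Harvey, Leybourne and Newbold recommend comparing the corrected statistic with a Student-t distribution with n - 1 degrees of freedom). The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def dm_statistic_sketch(
    e1: Sequence[float],
    e2: Sequence[float],
    h: int = 1,
    harvey_correction: bool = True,
) -> Tuple[float, float]:
    """Return (DM statistic, two-sided p-value) under squared-error loss."""
    n = len(e1)
    d = [a * a - b * b for a, b in zip(e1, e2)]  # loss differential
    d_bar = sum(d) / n

    def autocov(lag: int) -> float:
        return sum(
            (d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)
        ) / n

    # Rectangular kernel: h-step errors are autocorrelated up to lag h - 1.
    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, h))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")
    stat = d_bar / math.sqrt(long_run_var / n)
    if harvey_correction:
        # Small-sample correction factor of Harvey, Leybourne and Newbold.
        stat *= math.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # normal approximation
    return stat, p_value


# Made-up forecast errors; model 2 has the smaller errors.
errors_a = [1.0, -2.0, 1.5, -1.0, 2.0, -1.5, 1.0, -2.0]
errors_b = [0.5, -1.0, 1.0, -0.5, 1.0, -1.0, 0.5, -1.5]
raw, _ = dm_statistic_sketch(errors_a, errors_b, h=1, harvey_correction=False)
adj, p_value = dm_statistic_sketch(errors_a, errors_b, h=1)
print(f"DM: {raw:.3f}, corrected: {adj:.3f}, p-value: {p_value:.4f}")
```

A positive statistic here means the second model has the lower squared-error loss; the correction shrinks the statistic toward zero, which matters most in short samples.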
Test Selection Table
| Comparison | Test | When to Use |
|---|---|---|
| Model A vs B (equal) | DM test | Standard pairwise comparison |
| Model A vs B (nested) | CW test | B = A + extra features |
| Models A, B, C, … | MCS | Multiple comparison |
| Conditional | GW test | Accuracy varies by state |
| Direction only | PT test | Sign prediction |
Code Examples
from temporalcv.statistical_tests import dm_test, cw_test, pt_test

# Standard comparison (non-nested models)
result = dm_test(
    errors_model_a,
    errors_model_b,
    h=5,  # 5-step ahead
    harvey_correction=True,
)
print(f"DM statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")

# Nested models (Model B adds features to Model A)
result = cw_test(
    y_true,
    predictions_model_a,
    predictions_model_b,
    h=1,
)
print(f"CW statistic: {result.statistic:.3f}")

# Directional accuracy
result = pt_test(y_true, predictions)
print(f"PT statistic: {result.statistic:.3f}")
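To show what the nested-model adjustment does, here is a minimal sketch of the Clark-West (2007) statistic for one-step forecasts. It is an illustration, not the code behind `cw_test`: the function name is hypothetical, the standard error ignores autocorrelation (reasonable only for h = 1), and the p-value is a one-sided normal approximation.

```python
import math
from typing import Sequence, Tuple


def cw_statistic_sketch(
    y_true: Sequence[float],
    pred_small: Sequence[float],
    pred_large: Sequence[float],
) -> Tuple[float, float]:
    """Return (CW statistic, one-sided p-value) for nested models."""
    n = len(y_true)
    # Adjusted loss differential: the (a - b)^2 term removes the noise the
    # larger model picks up from estimating parameters that are zero
    # under the null.
    f = [
        (y - a) ** 2 - ((y - b) ** 2 - (a - b) ** 2)
        for y, a, b in zip(y_true, pred_small, pred_large)
    ]
    f_bar = sum(f) / n
    var = sum((v - f_bar) ** 2 for v in f) / (n - 1)
    stat = f_bar / math.sqrt(var / n)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2))  # upper-tail normal
    return stat, p_value


# Made-up data: the small model predicts a constant, the large one adapts.
y_obs = [1.0, 2.0, 1.5, 2.5, 2.0, 3.0]
small = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
large = [1.2, 1.8, 1.6, 2.3, 2.1, 2.7]
stat, p_value = cw_statistic_sketch(y_obs, small, large)
print(f"CW statistic: {stat:.3f}, p-value: {p_value:.4f}")
```

The test is one-sided: a large positive statistic is evidence that the extra features in the larger model improve forecasts.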
Part 4: When Gates Return HALT
HALT Decision Tree
Gate returns HALT
│
├─ gate_temporal_boundary → Train/test overlap
│   │
│   └─ FIX: Ensure max(train_time) < min(test_time) - gap
│         • Check your CV split logic
│         • Verify date indexing is correct
│         • Add gap parameter if h-step forecast
│
├─ gate_signal_verification → Target looks randomly shuffled
│   │
│   └─ FIX: Your features contain the target
│         • Check rolling calculations (need .shift(1))
│         • Verify no direct target leakage
│         • Review feature engineering pipeline
│
├─ gate_regime_leakage → Future regime info in features
│   │
│   └─ FIX: Regime computed using future data
│         • Use expanding quantiles with .shift()
│         • Don't use full-sample thresholds
│         • Compute regime on train only
│
└─ gate_feature_correlation → Suspicious train/test correlation
    │
    └─ INVESTIGATE: May be legitimate or leakage
          • Check for normalizing on full data
          • Verify group statistics computed on train only
          • Review any feature transformations
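Two of the fixes in the tree can be illustrated in plain Python: the temporal boundary condition, and the difference between a rolling feature that includes the current value and one that is lagged by one step. The helpers below are hypothetical and use integer time stamps for simplicity; with pandas, the lagged version corresponds to calling `.rolling(window).mean().shift(1)` on a Series.

```python
from typing import List, Optional, Sequence


def boundary_ok(
    train_times: Sequence[int], test_times: Sequence[int], gap: int = 0
) -> bool:
    """Check max(train_time) < min(test_time) - gap."""
    return max(train_times) < min(test_times) - gap


def rolling_mean(
    values: Sequence[float], window: int, lag: int
) -> List[Optional[float]]:
    """Mean of the `window` values ending `lag` steps before time t.

    lag=0 includes the value at t itself (leaky when `values` is the
    target); lag=1 uses only values strictly before t.
    """
    out: List[Optional[float]] = []
    for t in range(len(values)):
        end = t - lag + 1  # exclusive end of the window
        start = end - window
        out.append(None if start < 0 else sum(values[start:end]) / window)
    return out


print(boundary_ok(range(0, 10), range(10, 15), gap=0))  # True
print(boundary_ok(range(0, 10), range(10, 15), gap=1))  # False

target = [1.0, 2.0, 3.0, 4.0, 5.0]
leaky = rolling_mean(target, window=2, lag=0)
safe = rolling_mean(target, window=2, lag=1)
print(leaky)  # [None, 1.5, 2.5, 3.5, 4.5]
print(safe)   # [None, None, 1.5, 2.5, 3.5]
```

At each time step the leaky feature already contains the value being predicted, while the lagged feature only sees earlier values.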
HALT Response Protocol
from temporalcv.gates import run_gates

result = run_gates(X_train, y_train, X_test, y_test)

if result.status == "HALT":
    # 1. Identify the failing gate
    failing_gate = result.failed_gates[0]
    print(f"HALT triggered by: {failing_gate.name}")
    print(f"Reason: {failing_gate.reason}")
    # 2. Do NOT proceed with model training
    raise ValueError(f"Data leakage detected: {failing_gate.reason}")
elif result.status == "WARN":
    # Investigate but may proceed with caution
    print(f"Warning: {result.warnings}")
    # Log warning for review
else:  # PASS
    # Safe to proceed
    model.fit(X_train, y_train)
Common HALT Causes and Fixes
| Gate | Common Cause | Fix |
|---|---|---|
| | Dates shuffled during preprocessing | Sort by date before splitting |
| | | Add |
| | | Use |
| | | Fit scaler on |
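The normalization issue flagged under gate_feature_correlation ("normalizing on full data") can be avoided by estimating scaling statistics on the training window only and reusing them for the test window. A minimal sketch with hypothetical helper names:

```python
import statistics
from typing import List, Sequence, Tuple


def fit_standardizer(train_values: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean and standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        raise ValueError("training values are constant")
    return mean, std


def transform(values: Sequence[float], mean: float, std: float) -> List[float]:
    """Apply previously fitted statistics; never refit on test data."""
    return [(v - mean) / std for v in values]


# Made-up data: the test window sits at a different level than training.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [10.0, 12.0]

mean, std = fit_standardizer(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
print(test_scaled)
```

Fitting the scaler on train and test together would let the level of the test window influence the training features, which is the kind of leakage the gate is meant to surface.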
Part 5: Complete Decision Workflow
1. PREPARE DATA
   │
   └─ Run gates early → Fix any HALT issues

2. CHOOSE CV METHOD
   │
   ├─ Small data? → WalkForwardCV (expanding)
   ├─ Regime changes? → SlidingWindowCV (fixed)
   └─ Financial? → PurgedKFold (with embargo)

3. CHOOSE METRIC
   │
   ├─ Persistent series? → MASE (vs naive)
   ├─ Direction matters? → Directional Accuracy
   └─ Standard? → MAE/RMSE

4. EVALUATE
   │
   └─ Run CV → Get fold-wise scores

5. COMPARE MODELS
   │
   ├─ Two equal models? → DM test
   ├─ Nested models? → CW test
   └─ Many models? → MCS

6. DEPLOY
   │
   └─ Monitor for regime changes → Consider adaptive retraining
Quick Reference Card
| Decision | Default Choice | Alternative When |
|---|---|---|
| CV Method | | |
| Gap Parameter | | |
| Primary Metric | MASE | Directional Accuracy for sign prediction |
| Statistical Test | DM test + Harvey | CW test for nested models |
| On HALT | Stop, investigate, fix | Never proceed with HALT |
Next Steps
Why Time Series Is Different: Conceptual foundation
Common Pitfalls: Avoid these 8 mistakes
API Reference: Full class documentation