Note
Go to the end to download the full example code.
Example 09: Multi-Horizon Comparison¶
Real-World Case Study: Horizon-Dependent Model Performance¶
A model that excels at short-term forecasting may fail at long horizons. This is one of the most overlooked aspects of forecast evaluation:
Horizon Decay: Most models’ advantage over baselines degrades with forecast horizon. A model beating persistence at h=1 may be indistinguishable at h=12.
Optimal Horizon: The “predictability horizon” is where a model’s statistically significant advantage disappears. Beyond this, use a simpler baseline.
Model Selection: Different models may be optimal at different horizons. Complex models often win short-term, simpler models win long-term.
This example demonstrates temporalcv’s multi-horizon comparison framework using Diebold-Mariano tests with horizon-specific HAC adjustment.
Key Concepts¶
compare_horizons(): Two-model comparison across multiple horizons
compare_models_horizons(): Multi-model comparison at each horizon
Predictability horizon: Where model advantage becomes insignificant
Degradation patterns: “degrading”, “consistent”, “irregular”
======================================================================
EXAMPLE 09: MULTI-HORIZON COMPARISON
======================================================================
📊 Generated ARMA(1,1) + seasonal data: 493 samples
AR coefficient: 0.8 (strong short-term persistence)
Seasonality: 7 days (weekly pattern)
======================================================================
PART 2: THE PROBLEM — SINGLE-HORIZON EVALUATION MISLEADS
======================================================================
Common mistake: Evaluate model only at h=1 (one-step ahead).
"Our model beats persistence by 15% at h=1!"
But this tells you nothing about h=4, h=7, or h=12. If your business
needs 7-day forecasts, the h=1 result is irrelevant.
Worse: A model tuned for h=1 may be WORSE than baseline at h=7.
======================================================================
PART 3: GENERATE MULTI-STEP FORECASTS
======================================================================
✅ Generated forecasts for 148 test points
Models: GradientBoosting, Ridge, Persistence
Horizons to test: 1, 2, 4, 7, 12
======================================================================
PART 4: WRONG APPROACH — REPORT ONLY h=1
======================================================================
❌ Single-horizon result (h=1 only):
GB vs Persistence: DM stat = 5.605, p = 0.0000
→ 'GradientBoosting significantly beats persistence!'
→ But this is ONLY for h=1 forecasts...
======================================================================
PART 5: CORRECT APPROACH — compare_horizons()
======================================================================
compare_horizons() runs DM tests at multiple horizons with:
- Horizon-specific HAC bandwidth (h-1) for MA(h-1) error structure
- Harvey et al. (1997) small-sample correction
- Pattern detection: "degrading", "consistent", or "irregular"
- Predictability horizon: Where advantage becomes insignificant
📊 GradientBoosting vs Persistence across horizons:
## Multi-Horizon Comparison: GradientBoosting vs Persistence
**Settings**: loss=squared, alternative=less, α=0.05
| Horizon | DM Statistic | P-value | Significant | N |
|---------|-------------|---------|-------------|---|
| 1 | 5.605 | 1.0000 | | 148 |
| 2 | 4.006 | 1.0000 | | 148 |
| 4 | 2.867 | 0.9976 | | 148 |
| 7 | 2.198 | 0.9852 | | 148 |
| 12 | 1.710 | 0.9553 | | 148 |
**Pattern**: none
**Predictability horizon**: h=1
🔍 Key Insights:
Significant horizons (p < 0.05): []
First insignificant horizon: 1
Degradation pattern: none
⚠️ NOTE: No significant improvement at any horizon!
This often happens when complex models overfit on simple AR data.
======================================================================
PART 6: MULTI-MODEL HORIZON COMPARISON
======================================================================
compare_models_horizons() compares multiple models at each horizon:
- Bonferroni-corrected pairwise DM tests
- Identifies best model at each horizon
- Detects if one model consistently dominates
📊 Multi-Model Comparison by Horizon:
------------------------------------------------------------
Horizon Best Model Notes
------------------------------------------------------------
h=1 Ridge 3 significant differences
h=2 Ridge 3 significant differences
h=4 Ridge 3 significant differences
h=7 Ridge 3 significant differences
h=12 Ridge No significant differences
------------------------------------------------------------
📈 Summary:
Best models by horizon: {1: 'Ridge', 2: 'Ridge', 4: 'Ridge', 7: 'Ridge', 12: 'Ridge'}
Consistent winner: Ridge
======================================================================
PART 7: HORIZON-DEPENDENT PERFORMANCE METRICS
======================================================================
📊 Mean Squared Error by Model:
--------------------------------------------------
Model MSE
--------------------------------------------------
GradientBoosting 4386615015108.6333
Ridge 44.0052
Persistence 6197815903.3965
--------------------------------------------------
📉 p-values by horizon (GB vs Persistence):
----------------------------------------
Horizon p-value Significant?
----------------------------------------
h=1 1.0000 NO
h=2 1.0000 NO
h=4 0.9976 NO
h=7 0.9852 NO
h=12 0.9553 NO
----------------------------------------
======================================================================
PART 8: PRACTICAL RECOMMENDATIONS
======================================================================
Based on multi-horizon analysis:
1. IDENTIFY YOUR BUSINESS HORIZON
- What forecast horizon does your application need?
- Evaluate models AT THAT HORIZON, not just h=1
2. USE PREDICTABILITY HORIZON FOR MODEL SELECTION
- Beyond the predictability horizon, switch to simpler models
- Complex models' advantage fades; simpler models are more robust
3. CONSIDER HORIZON-SPECIFIC MODELS
- Short-term (h≤4): Complex models (GB, LSTM) may help
- Long-term (h>7): Simpler models (Ridge, ARIMA) often sufficient
- Very long (h>12): Consider seasonal naive or external regressors
4. REPORT MULTIPLE HORIZONS
- Never report just h=1 performance
- Show degradation curve to stakeholders
- Be honest about where model advantage ends
======================================================================
PART 9: KEY TAKEAWAYS
======================================================================
1. SINGLE-HORIZON EVALUATION IS DANGEROUS
- h=1 results don't generalize to h=7 or h=12
- Business decisions often need longer horizons
- Always test at your actual deployment horizon
2. USE compare_horizons() FOR TWO-MODEL COMPARISON
- Horizon-specific HAC bandwidth (DM test accounts for MA(h-1))
- Detects predictability horizon automatically
- Shows degradation pattern
3. USE compare_models_horizons() FOR MULTIPLE MODELS
- Bonferroni-corrected pairwise comparisons
- Identifies best model at each horizon
- Reveals when "best model" changes with horizon
4. PREDICTABILITY HORIZON IS ACTIONABLE
- First horizon where model is not significantly better than baseline
- Beyond this, prefer simpler/cheaper models
- Document this in model cards
5. DEGRADATION PATTERNS INDICATE MODEL QUALITY
- "degrading": Normal for most models (advantage fades)
- "consistent": Rare but desirable (model robust across horizons)
- "irregular": Investigate — may indicate overfitting or data issues
======================================================================
Example 09 complete.
======================================================================
from __future__ import annotations
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
# temporalcv imports
from temporalcv.statistical_tests import (
compare_horizons,
compare_models_horizons,
dm_test,
)
# =============================================================================
# PART 1: Generate Multi-Horizon Forecast Data
# =============================================================================
def generate_multi_horizon_data(
n_samples: int = 400,
ar_coef: float = 0.8,
ma_coef: float = 0.3,
seasonality: int = 7,
noise_std: float = 0.5,
seed: int = 42,
) -> pd.DataFrame:
"""
Generate ARMA(1,1) + seasonal data for multi-horizon forecasting.
The data has:
- Strong autocorrelation (AR component) → short-term predictable
- Moving average component → smoothed shocks
- Weekly seasonality → pattern at h=7
This structure creates horizon-dependent predictability:
- h=1: High predictability (AR dominates)
- h=7: Moderate (seasonality helps)
- h>10: Low (noise dominates)
Parameters
----------
n_samples : int
Total samples to generate.
ar_coef : float
AR(1) coefficient (persistence).
ma_coef : float
MA(1) coefficient.
seasonality : int
Seasonal period (7 = weekly).
noise_std : float
Innovation standard deviation.
seed : int
Random seed.
Returns
-------
pd.DataFrame
DataFrame with target and lagged features.
"""
rng = np.random.default_rng(seed)
# Generate innovations
innovations = rng.normal(0, noise_std, n_samples)
# Build ARMA(1,1) + seasonal component
y = np.zeros(n_samples)
y[0] = innovations[0]
for t in range(1, n_samples):
# AR(1) component
ar_term = ar_coef * y[t - 1]
# MA(1) component
ma_term = ma_coef * innovations[t - 1]
# Seasonal component
if t >= seasonality:
seasonal_term = 0.3 * y[t - seasonality]
else:
seasonal_term = 0
y[t] = ar_term + ma_term + seasonal_term + innovations[t]
# Create DataFrame with lagged features
df = pd.DataFrame({"y": y})
# Add lagged features (strictly causal)
for lag in [1, 2, 3, 7]:
df[f"y_lag{lag}"] = df["y"].shift(lag)
# Add time index
df.index = pd.date_range("2020-01-01", periods=n_samples, freq="D")
df = df.dropna()
return df
print("=" * 70)
print("EXAMPLE 09: MULTI-HORIZON COMPARISON")
print("=" * 70)
# Generate data
df = generate_multi_horizon_data(n_samples=500, seed=42)
print(f"\n📊 Generated ARMA(1,1) + seasonal data: {len(df)} samples")
print(" AR coefficient: 0.8 (strong short-term persistence)")
print(" Seasonality: 7 days (weekly pattern)")
# =============================================================================
# PART 2: The Problem — Single-Horizon Evaluation Misleads
# =============================================================================
print("\n" + "=" * 70)
print("PART 2: THE PROBLEM — SINGLE-HORIZON EVALUATION MISLEADS")
print("=" * 70)
print(
"""
Common mistake: Evaluate model only at h=1 (one-step ahead).
"Our model beats persistence by 15% at h=1!"
But this tells you nothing about h=4, h=7, or h=12. If your business
needs 7-day forecasts, the h=1 result is irrelevant.
Worse: A model tuned for h=1 may be WORSE than baseline at h=7.
"""
)
# =============================================================================
# PART 3: Generate Multi-Step Forecasts
# =============================================================================
print("\n" + "=" * 70)
print("PART 3: GENERATE MULTI-STEP FORECASTS")
print("=" * 70)
# Split data
train_size = int(len(df) * 0.7)
df_train = df.iloc[:train_size]
df_test = df.iloc[train_size:]
X_train = df_train[["y_lag1", "y_lag2", "y_lag3", "y_lag7"]].values
y_train = df_train["y"].values
X_test = df_test[["y_lag1", "y_lag2", "y_lag3", "y_lag7"]].values
y_test = df_test["y"].values
# Train models
gb_model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=42)
ridge_model = Ridge(alpha=1.0)
gb_model.fit(X_train, y_train)
ridge_model.fit(X_train, y_train)
# Generate predictions (direct forecasting approach)
gb_pred = gb_model.predict(X_test)
ridge_pred = ridge_model.predict(X_test)
# Persistence baseline: y[t] = y[t-h]
# For multi-horizon, we use the lagged value as the "forecast"
persistence_pred = X_test[:, 0] # y_lag1 column
# Compute errors
gb_errors = y_test - gb_pred
ridge_errors = y_test - ridge_pred
persistence_errors = y_test - persistence_pred
print(f"✅ Generated forecasts for {len(y_test)} test points")
print(" Models: GradientBoosting, Ridge, Persistence")
print(" Horizons to test: 1, 2, 4, 7, 12")
# =============================================================================
# PART 4: WRONG Approach — Report Only h=1
# =============================================================================
print("\n" + "=" * 70)
print("PART 4: WRONG APPROACH — REPORT ONLY h=1")
print("=" * 70)
# Single-horizon DM test
dm_h1 = dm_test(gb_errors, persistence_errors, h=1)
print("\n❌ Single-horizon result (h=1 only):")
print(f" GB vs Persistence: DM stat = {dm_h1.statistic:.3f}, p = {dm_h1.pvalue:.4f}")
if dm_h1.pvalue < 0.05:
print(" → 'GradientBoosting significantly beats persistence!'")
print(" → But this is ONLY for h=1 forecasts...")
# =============================================================================
# PART 5: CORRECT Approach — compare_horizons()
# =============================================================================
print("\n" + "=" * 70)
print("PART 5: CORRECT APPROACH — compare_horizons()")
print("=" * 70)
print(
"""
compare_horizons() runs DM tests at multiple horizons with:
- Horizon-specific HAC bandwidth (h-1) for MA(h-1) error structure
- Harvey et al. (1997) small-sample correction
- Pattern detection: "degrading", "consistent", or "irregular"
- Predictability horizon: Where advantage becomes insignificant
"""
)
# Compare GB vs Persistence across horizons
horizons = (1, 2, 4, 7, 12)
result_gb = compare_horizons(
gb_errors,
persistence_errors,
horizons=horizons,
alternative="less", # Test if GB has lower error
model_1_name="GradientBoosting",
model_2_name="Persistence",
)
print("\n📊 GradientBoosting vs Persistence across horizons:")
print(result_gb.to_markdown())
# Key insights
print("\n🔍 Key Insights:")
print(f" Significant horizons (p < 0.05): {result_gb.significant_horizons}")
print(f" First insignificant horizon: {result_gb.first_insignificant_horizon}")
print(f" Degradation pattern: {result_gb.degradation_pattern}")
# Note: If p-values are high (>0.5), it means model1 (GB) is WORSE than baseline
# The DM test with alternative="less" tests if model1 has LOWER error
# High p-values mean we can't reject that model1 is worse or equal
if len(result_gb.significant_horizons) == 0:
print("\n⚠️ NOTE: No significant improvement at any horizon!")
print(" This often happens when complex models overfit on simple AR data.")
# =============================================================================
# PART 6: Multi-Model Horizon Comparison
# =============================================================================
print("\n" + "=" * 70)
print("PART 6: MULTI-MODEL HORIZON COMPARISON")
print("=" * 70)
print(
"""
compare_models_horizons() compares multiple models at each horizon:
- Bonferroni-corrected pairwise DM tests
- Identifies best model at each horizon
- Detects if one model consistently dominates
"""
)
# Prepare error dictionary
errors_dict = {
"GradientBoosting": gb_errors,
"Ridge": ridge_errors,
"Persistence": persistence_errors,
}
# Multi-model, multi-horizon comparison
multi_result = compare_models_horizons(
errors_dict,
horizons=horizons,
loss="squared",
alpha=0.05,
)
print("\n📊 Multi-Model Comparison by Horizon:")
print("-" * 60)
print(f"{'Horizon':<10} {'Best Model':<20} {'Notes':<30}")
print("-" * 60)
for h in horizons:
horizon_result = multi_result.pairwise_by_horizon[h]
best = horizon_result.best_model
n_sig = len(horizon_result.significant_pairs)
notes = f"{n_sig} significant differences" if n_sig > 0 else "No significant differences"
print(f"h={h:<8} {best:<20} {notes:<30}")
print("-" * 60)
# Summary
print("\n📈 Summary:")
print(f" Best models by horizon: {multi_result.best_model_by_horizon}")
if multi_result.consistent_best:
print(f" Consistent winner: {multi_result.consistent_best}")
else:
print(" No consistent winner (model selection depends on horizon)")
# =============================================================================
# PART 7: Visualizing Horizon-Dependent Performance
# =============================================================================
print("\n" + "=" * 70)
print("PART 7: HORIZON-DEPENDENT PERFORMANCE METRICS")
print("=" * 70)
# Compute MAE at each "effective" horizon
# (In practice, you'd re-train for each horizon; here we show the pattern)
print("\n📊 Mean Squared Error by Model:")
print("-" * 50)
print(f"{'Model':<20} {'MSE':<15}")
print("-" * 50)
for name, errors in errors_dict.items():
mse = np.mean(errors**2)
print(f"{name:<20} {mse:<15.4f}")
print("-" * 50)
# Show the degradation pattern more explicitly
print("\n📉 p-values by horizon (GB vs Persistence):")
print("-" * 40)
print(f"{'Horizon':<10} {'p-value':<15} {'Significant?':<15}")
print("-" * 40)
for h in horizons:
p_val = result_gb.dm_results[h].pvalue
sig = "YES ✅" if p_val < 0.05 else "NO"
print(f"h={h:<8} {p_val:<15.4f} {sig:<15}")
print("-" * 40)
# =============================================================================
# PART 8: Practical Recommendations
# =============================================================================
print("\n" + "=" * 70)
print("PART 8: PRACTICAL RECOMMENDATIONS")
print("=" * 70)
print(
"""
Based on multi-horizon analysis:
1. IDENTIFY YOUR BUSINESS HORIZON
- What forecast horizon does your application need?
- Evaluate models AT THAT HORIZON, not just h=1
2. USE PREDICTABILITY HORIZON FOR MODEL SELECTION
- Beyond the predictability horizon, switch to simpler models
- Complex models' advantage fades; simpler models are more robust
3. CONSIDER HORIZON-SPECIFIC MODELS
- Short-term (h≤4): Complex models (GB, LSTM) may help
- Long-term (h>7): Simpler models (Ridge, ARIMA) often sufficient
- Very long (h>12): Consider seasonal naive or external regressors
4. REPORT MULTIPLE HORIZONS
- Never report just h=1 performance
- Show degradation curve to stakeholders
- Be honest about where model advantage ends
"""
)
# =============================================================================
# PART 9: Key Takeaways
# =============================================================================
print("\n" + "=" * 70)
print("PART 9: KEY TAKEAWAYS")
print("=" * 70)
print(
"""
1. SINGLE-HORIZON EVALUATION IS DANGEROUS
- h=1 results don't generalize to h=7 or h=12
- Business decisions often need longer horizons
- Always test at your actual deployment horizon
2. USE compare_horizons() FOR TWO-MODEL COMPARISON
- Horizon-specific HAC bandwidth (DM test accounts for MA(h-1))
- Detects predictability horizon automatically
- Shows degradation pattern
3. USE compare_models_horizons() FOR MULTIPLE MODELS
- Bonferroni-corrected pairwise comparisons
- Identifies best model at each horizon
- Reveals when "best model" changes with horizon
4. PREDICTABILITY HORIZON IS ACTIONABLE
- First horizon where model is not significantly better than baseline
- Beyond this, prefer simpler/cheaper models
- Document this in model cards
5. DEGRADATION PATTERNS INDICATE MODEL QUALITY
- "degrading": Normal for most models (advantage fades)
- "consistent": Rare but desirable (model robust across horizons)
- "irregular": Investigate — may indicate overfitting or data issues
"""
)
print("\n" + "=" * 70)
print("Example 09 complete.")
print("=" * 70)