API Reference: Statistical Tests¶
Statistical tests for forecast evaluation with proper corrections.
When to Use¶
graph TD
A[Compare forecasts] --> B{Model relationship?}
B -->|Non-nested| C[dm_test]
B -->|Nested| D[cw_test]
B -->|Unknown| E[Use cw_test - conservative]
C --> F{Question type?}
D --> F
F -->|Which is better overall?| G[DM or CW test]
F -->|Can I predict which wins?| H[gw_test]
F -->|Is direction accuracy real?| I[pt_test]
G --> J{Multi-horizon?}
J -->|Yes| K[compare_horizons]
J -->|No| L[Single dm_test/cw_test]
Test Selection Guide¶
Situation |
Test |
Reason |
|---|---|---|
RF vs XGBoost |
|
Non-nested models |
AR(2) vs AR(1) |
|
Nested models |
Full model vs subset |
|
Nested models |
Model switching strategy |
|
Conditional predictive ability |
Direction forecasting |
|
Tests sign/direction accuracy |
Common Mistakes¶
Using DM test for nested models
DM is biased when one model nests another
Example 18 shows Monte Carlo evidence of bias
Ignoring h in multi-step forecasts
dm_test(errors1, errors2, h=5)for 5-step forecastsHAC variance needs correct horizon for bandwidth
Comparing raw p-values across horizons
Use
compare_horizons()for consistent comparisonAccounts for sample size differences
See Also: Example 03, Example 13, Example 18
Data Classes¶
DMTestResult¶
Result from Diebold-Mariano test.
@dataclass
class DMTestResult:
statistic: float # DM test statistic
pvalue: float # P-value for the test
h: int # Forecast horizon used
n: int # Number of observations
loss: str # Loss function ("squared" or "absolute")
alternative: str # Alternative hypothesis
harvey_adjusted: bool # Whether small-sample adjustment applied
mean_loss_diff: float # Mean loss differential
variance_method: str # Variance method ("hac" or "self_normalized")
Note: For variance_method="hac", the statistic is asymptotically N(0,1). For variance_method="self_normalized", the statistic has a non-standard distribution with critical values from simulation.
Properties:
Property |
Type |
Description |
|---|---|---|
|
|
Is p-value < 0.05? |
|
|
Is p-value < 0.01? |
String representation: DM(h): statistic (p=pvalue) with significance stars
PTTestResult¶
Result from Pesaran-Timmermann test.
@dataclass
class PTTestResult:
statistic: float # PT test statistic (z-score)
pvalue: float # P-value (one-sided)
accuracy: float # Observed directional accuracy
expected: float # Expected accuracy under null
n: int # Number of observations
n_classes: int # Number of direction classes (2 or 3)
Properties:
Property |
Type |
Description |
|---|---|---|
|
|
Is p-value < 0.05? |
|
|
accuracy - expected |
String representation: PT: accuracy vs expected (z=statistic, p=pvalue)
GWTestResult¶
Result from Giacomini-White test for conditional predictive ability.
@dataclass
class GWTestResult:
statistic: float # GW test statistic (T × R²)
pvalue: float # P-value from χ²(q)
r_squared: float # R² from auxiliary regression
n: int # Effective sample size
n_lags: int # Number of lags (τ)
q: int # Degrees of freedom (1 + n_lags)
loss: str # Loss function ("squared" or "absolute")
alternative: str # Alternative hypothesis
mean_loss_diff: float # Mean loss differential
Properties:
Property |
Type |
Description |
|---|---|---|
|
|
Is p-value < 0.05? |
|
|
Is p-value < 0.01? |
|
|
Alias for |
String representation: GW(τ): statistic (p=pvalue, R²=value) with significance stars
Key Distinction from DM Test:
Test |
Tests |
Question |
|---|---|---|
DM |
Unconditional predictive ability |
“Which model is better on average?” |
GW |
Conditional predictive ability |
“Can we predict which model will win given recent performance?” |
CWTestResult¶
Result from Clark-West test for nested model comparison.
The CW test adjusts the DM test for the bias caused by estimating extra parameters in the unrestricted model that have true value zero. This bias makes the unrestricted model appear worse than it truly is.
Attributes:
Attribute |
Type |
Description |
|---|---|---|
|
|
CW test statistic (asymptotically N(0,1)) |
|
|
P-value for the test |
|
|
Forecast horizon used |
|
|
Number of observations |
|
|
Loss function (“squared” or “absolute”) |
|
|
Alternative hypothesis |
|
|
Whether Harvey correction was applied |
|
|
Mean unadjusted loss differential E[d_t] |
|
|
Mean adjusted loss differential E[d*_t] |
|
|
Mean of (ŷ_r - ŷ_u)² — noise removed |
|
|
“hac” or “self_normalized” |
Properties:
Property |
Returns |
Description |
|---|---|---|
|
|
P-value < 0.05 |
|
|
P-value < 0.01 |
|
|
Ratio of adjustment to unadjusted loss diff |
Example:
>>> result = cw_test(ar2_errors, ar1_errors, ar2_preds, ar1_preds)
>>> print(f"CW: {result.statistic:.3f} (p={result.pvalue:.4f})")
>>> print(f"Adjustment magnitude: {result.adjustment_magnitude:.4f}")
>>> print(f"Adjustment ratio: {result.adjustment_ratio:.2%}")
When to use CW vs DM:
Situation |
Test |
|---|---|
Non-nested models (ARIMA vs Random Forest) |
|
Nested models (AR(2) vs AR(1), Full vs Reduced) |
|
Functions¶
dm_test¶
Diebold-Mariano test for equal predictive accuracy.
def dm_test(
errors_1: np.ndarray,
errors_2: np.ndarray,
h: int = 1,
loss: Literal["squared", "absolute"] = "squared",
alternative: Literal["two-sided", "less", "greater"] = "two-sided",
harvey_correction: bool = True,
variance_method: Literal["hac", "self_normalized"] = "hac",
) -> DMTestResult
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Errors from model 1 ( |
|
|
required |
Errors from model 2 (baseline) |
|
|
|
Forecast horizon |
|
|
|
Loss function: |
|
|
|
|
|
|
|
Apply small-sample adjustment (HAC only) |
|
|
|
|
Returns: DMTestResult
Raises: ValueError if fewer than 30 samples
Alternative hypotheses:
"two-sided": Models have different accuracy"less": Model 1 more accurate (lower loss)"greater": Model 2 more accurate
Variance methods:
Method |
Pros |
Cons |
When to Use |
|---|---|---|---|
|
Higher power |
Requires bandwidth selection, can be negative |
Default, well-tuned bandwidth |
|
No bandwidth, always positive, robust |
Slightly lower power |
Uncertain bandwidth, small samples |
Example:
from temporalcv import dm_test
# Compute errors: actual - predicted
model_errors = actuals - model_preds
persistence_errors = actuals - 0 # Persistence predicts 0
result = dm_test(
errors_1=model_errors,
errors_2=persistence_errors,
h=2,
loss="absolute",
alternative="less" # Test if model 1 is better
)
print(f"DM statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")
print(f"Significant: {result.significant_at_05}")
# Using self-normalized variance for robustness
result_sn = dm_test(
errors_1=model_errors,
errors_2=persistence_errors,
variance_method="self_normalized"
)
print(f"Self-normalized: {result_sn.pvalue:.4f}")
pt_test¶
Pesaran-Timmermann test for directional accuracy.
def pt_test(
actual: np.ndarray,
predicted: np.ndarray,
move_threshold: Optional[float] = None,
) -> PTTestResult
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Actual values (changes) |
|
|
required |
Predicted values (changes) |
|
|
|
Threshold for 3-class mode |
Returns: PTTestResult
Raises: ValueError if fewer than 20 samples
Modes:
Mode |
Condition |
Classes |
|---|---|---|
2-class |
|
Positive/Negative sign |
3-class |
|
UP/DOWN/FLAT |
Example:
from temporalcv import pt_test, compute_move_threshold
# Compute threshold from training data
threshold = compute_move_threshold(train_actuals, percentile=70)
# Test directional accuracy with 3-class
result = pt_test(
actual=test_actuals,
predicted=test_preds,
move_threshold=threshold
)
print(f"Direction accuracy: {result.accuracy:.1%}")
print(f"Expected (random): {result.expected:.1%}")
print(f"Skill: {result.skill:.1%}")
print(f"Significant: {result.significant_at_05}")
Warning: For h>1 step forecasts, p-values may be optimistic (no HAC correction). Use DM test for rigorous multi-step testing.
gw_test¶
Giacomini-White test for conditional predictive ability.
def gw_test(
errors_1: np.ndarray,
errors_2: np.ndarray,
n_lags: int = 1,
loss: Literal["squared", "absolute"] = "squared",
alternative: Literal["two-sided", "less", "greater"] = "two-sided",
) -> GWTestResult
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Errors from model 1 ( |
|
|
required |
Errors from model 2 (baseline) |
|
|
|
Lags in conditioning set (1-10) |
|
|
|
Loss function: |
|
|
|
|
Returns: GWTestResult
Raises: ValueError if fewer than 50 effective samples (after lag adjustment)
Alternative hypotheses:
"two-sided": Conditional predictive ability differs"less": Model 1 conditionally better (lower loss given past)"greater": Model 2 conditionally better
Interpretation Table:
DM Result |
GW Result |
Interpretation |
|---|---|---|
Not sig |
Not sig |
No difference in predictive ability |
Sig |
Sig |
Model unconditionally and conditionally better |
Sig |
Not sig |
Better on average, but not predictably |
Not sig |
Sig |
Equal average, but performance is predictable! |
Example:
from temporalcv import dm_test, gw_test
# Run both tests for comprehensive comparison
dm_result = dm_test(model_errors, baseline_errors)
gw_result = gw_test(model_errors, baseline_errors, n_lags=1)
# Key insight: check if conditional predictability exists
if gw_result.conditional_predictability:
print(f"Past performance predicts future superiority!")
print(f"R-squared: {gw_result.r_squared:.3f}")
# Models may have equal average accuracy but...
if not dm_result.significant_at_05:
print("Equal average accuracy, but switching could help!")
# Using multiple lags
gw_result_3lag = gw_test(model_errors, baseline_errors, n_lags=3)
print(f"3-lag R²: {gw_result_3lag.r_squared:.3f}")
When to Use GW vs DM:
Use Case |
Recommended Test |
|---|---|
Which model is better overall? |
DM test |
Should I switch between models? |
GW test |
Comprehensive comparison |
Both |
Model selection for production |
DM test |
Dynamic model switching strategy |
GW test |
cw_test¶
Clark-West test for nested model comparison.
The CW test adjusts the DM test for the bias caused by estimating extra parameters with true value zero. Use this when comparing nested models (e.g., AR(2) vs AR(1), Full vs Reduced).
def cw_test(
errors_unrestricted: np.ndarray,
errors_restricted: np.ndarray,
predictions_unrestricted: np.ndarray,
predictions_restricted: np.ndarray,
h: int = 1,
loss: Literal["squared", "absolute"] = "squared",
alternative: Literal["two-sided", "less", "greater"] = "two-sided",
harvey_correction: bool = True,
variance_method: Literal["hac", "self_normalized"] = "hac",
) -> CWTestResult
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Errors from unrestricted model (more params) |
|
|
required |
Errors from restricted model (nested) |
|
|
required |
Forecasts from unrestricted model |
|
|
required |
Forecasts from restricted model |
|
|
|
Forecast horizon |
|
|
|
Loss function |
|
|
|
Alternative hypothesis |
|
|
|
Small-sample adjustment |
|
|
|
Variance estimation method |
Returns: CWTestResult
The CW Adjustment:
d*_t = d_t - (ŷ_restricted - ŷ_unrestricted)²
The term (ŷ_r - ŷ_u)² captures the noise cost of estimating extra parameters with true value zero.
Example: Comparing AR(2) vs AR(1):
from temporalcv import cw_test, dm_test
import numpy as np
# AR(2) is unrestricted, AR(1) is restricted (nested)
result = cw_test(
errors_unrestricted=ar2_errors,
errors_restricted=ar1_errors,
predictions_unrestricted=ar2_preds,
predictions_restricted=ar1_preds,
alternative="less", # Test if AR(2) is better
)
print(f"CW statistic: {result.statistic:.3f}")
print(f"P-value: {result.pvalue:.4f}")
print(f"Adjustment: {result.adjustment_magnitude:.4f}")
# Compare with DM test to see bias
dm_result = dm_test(ar2_errors, ar1_errors, alternative="less")
print(f"DM p-value: {dm_result.pvalue:.4f} (may be biased for nested models)")
print(f"CW p-value: {result.pvalue:.4f} (corrected)")
Key Insight: When predictions are identical (zero adjustment), CW equals DM. For nested models with extra parameters, CW removes the bias that makes the unrestricted model appear worse.
Reference: Clark, T.E. & West, K.D. (2007). Approximately normal tests for equal predictive accuracy in nested models. Journal of Econometrics 138(1), 291-311.
compute_hac_variance¶
Compute HAC (Heteroskedasticity and Autocorrelation Consistent) variance.
def compute_hac_variance(
d: np.ndarray,
bandwidth: Optional[int] = None,
) -> float
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Series (typically loss differential) |
|
|
|
Kernel bandwidth. If None, auto-selected. |
Returns: HAC variance estimate
Notes:
Uses Newey-West estimator with Bartlett kernel
For h-step forecasts:
bandwidth = h - 1is appropriateAutomatic bandwidth:
floor(4 * (n/100)^(2/9))
compute_self_normalized_variance¶
Compute self-normalized variance using partial sums [T1].
def compute_self_normalized_variance(d: np.ndarray) -> float
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Series (typically loss differential) |
Returns: Self-normalized variance estimate (always >= 0)
Notes:
Uses partial sums:
S_k = cumsum(d - mean(d))Formula:
V_n = (1/n²) * sum(S_k²)Cannot produce negative variance (unlike HAC)
No bandwidth selection required
Advantages over HAC:
✓ No bandwidth selection required
✓ Cannot produce negative variance
✓ Better size control in small samples
✗ Slightly lower power than well-tuned HAC
Multi-Horizon Comparison¶
MultiHorizonResult¶
Result from comparing two models across multiple forecast horizons.
@dataclass(frozen=True)
class MultiHorizonResult:
horizons: Tuple[int, ...] # Forecast horizons tested
dm_results: Dict[int, DMTestResult] # h -> DM test result
model_1_name: str # Name of model 1
model_2_name: str # Name of model 2
n_per_horizon: Dict[int, int] # Sample size per horizon
loss: str # Loss function used
alternative: str # Alternative hypothesis
alpha: float # Significance level
Properties:
Property |
Type |
Description |
|---|---|---|
|
|
Horizons where p < alpha |
|
|
First h where significance is lost |
|
|
Horizon with smallest p-value |
|
|
“consistent” | “degrading” | “none” | “irregular” |
Methods:
Method |
Returns |
Description |
|---|---|---|
|
|
P-values for each horizon |
|
|
DM statistics for each horizon |
|
|
Human-readable summary |
|
|
Markdown table format |
compare_horizons¶
Compare two models across multiple forecast horizons using DM tests.
def compare_horizons(
errors_1: np.ndarray,
errors_2: np.ndarray,
horizons: Sequence[int] = (1, 2, 3, 4),
*,
loss: Literal["squared", "absolute"] = "squared",
alternative: Literal["two-sided", "less", "greater"] = "two-sided",
alpha: float = 0.05,
harvey_correction: bool = True,
variance_method: Literal["hac", "self_normalized"] = "hac",
model_1_name: str = "Model",
model_2_name: str = "Baseline",
) -> MultiHorizonResult
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Errors from model 1 (candidate) |
|
|
required |
Errors from model 2 (baseline) |
|
|
|
Forecast horizons to test |
|
|
|
Loss function for DM test |
|
|
|
Alternative hypothesis |
|
|
|
Significance level |
|
|
|
Apply small-sample adjustment |
|
|
|
Variance estimation method |
|
|
|
Name for model 1 |
|
|
|
Name for model 2 |
Returns: MultiHorizonResult
Example:
from temporalcv import compare_horizons
# Compare model to baseline across horizons
result = compare_horizons(
model_errors, baseline_errors,
horizons=(1, 2, 4, 8, 12),
alternative="less", # Test if model is better
)
print(result.significant_horizons) # [1, 2, 4]
print(result.first_insignificant_horizon) # 8
print(result.degradation_pattern) # "degrading"
# Print results table
print(result.to_markdown())
Degradation Patterns:
Pattern |
Definition |
|---|---|
|
All horizons significant OR all insignificant |
|
P-values increase with horizon (advantage fades) |
|
No significant horizons |
|
Non-monotonic pattern (>1 violation) |
MultiModelHorizonResult¶
Result from comparing multiple models across multiple horizons.
@dataclass(frozen=True)
class MultiModelHorizonResult:
horizons: Tuple[int, ...] # Horizons tested
model_names: Tuple[str, ...] # Model names
pairwise_by_horizon: Dict[int, MultiModelComparisonResult] # h -> comparison
alpha: float # Significance level
Properties:
Property |
Type |
Description |
|---|---|---|
|
|
Best model at each horizon |
|
|
Model winning all horizons, or None |
compare_models_horizons¶
Compare multiple models across multiple horizons.
def compare_models_horizons(
errors_dict: Dict[str, np.ndarray],
horizons: Sequence[int] = (1, 2, 3, 4),
*,
loss: Literal["squared", "absolute"] = "squared",
alpha: float = 0.05,
harvey_correction: bool = True,
) -> MultiModelHorizonResult
Parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
|
required |
Model name → error array |
|
|
|
Forecast horizons to test |
|
|
|
Loss function |
|
|
|
Significance level |
|
|
|
Apply small-sample adjustment |
Returns: MultiModelHorizonResult
Example:
from temporalcv import compare_models_horizons
# Compare multiple models across horizons
result = compare_models_horizons(
{"ARIMA": arima_errors, "RF": rf_errors, "Naive": naive_errors},
horizons=(1, 4, 12),
)
# Check if one model consistently wins
if result.consistent_best:
print(f"Consistent winner: {result.consistent_best}")
else:
print("Best model varies by horizon:")
for h, model in result.best_model_by_horizon.items():
print(f" h={h}: {model}")
References¶
[T1] Core Statistical Methods:
Diebold, F.X. & Mariano, R.S. (1995). “Comparing Predictive Accuracy.” Journal of Business & Economic Statistics, 13(3), 253-263. DOI: 10.1080/07350015.1995.10524599
Harvey, D., Leybourne, S., & Newbold, P. (1997). “Testing the Equality of Prediction Mean Squared Errors.” International Journal of Forecasting, 13(2), 281-291. DOI: 10.1016/S0169-2070(96)00719-4
Pesaran, M.H. & Timmermann, A. (1992). “A Simple Nonparametric Test of Predictive Performance.” Journal of Business & Economic Statistics, 10(4), 461-465. DOI: 10.1080/07350015.1992.10509922
Newey, W.K. & West, K.D. (1987). “A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica, 55(3), 703-708. DOI: 10.2307/1913610
Shao, X. (2010). “A Self-Normalized Approach to Confidence Interval Construction in Time Series.” Journal of the Royal Statistical Society: Series B, 72(3), 343-366. DOI: 10.1111/j.1467-9868.2009.00737.x
Lobato, I.N. (2001). “Testing That a Dependent Process Is Uncorrelated.” Journal of the American Statistical Association, 96(455), 1066-1076. DOI: 10.1198/016214501753208726
Giacomini, R. & White, H. (2006). “Tests of Conditional Predictive Ability.” Econometrica, 74(6), 1545-1578. DOI: 10.1111/j.1468-0262.2006.00718.x