Benchmark Methodology
This document describes how temporalcv benchmark results are generated.
Datasets
M4 Competition
The M4 Competition (Makridakis et al., 2020) provides 100,000 time series across 6 frequencies:
| Frequency | Series | Horizon | Min Length |
|---|---|---|---|
| Yearly | 23,000 | 6 | 13 |
| Quarterly | 24,000 | 8 | 16 |
| Monthly | 48,000 | 18 | 42 |
| Weekly | 359 | 13 | 80 |
| Daily | 4,227 | 14 | 93 |
| Hourly | 414 | 48 | 700 |
Benchmark sampling: To reduce compute time, we sample a subset of series per frequency:
Quick mode: 100 series/frequency
Full mode: 1000 series/frequency
Official train/test splits from the competition are preserved.
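A minimal sketch of how per-frequency sampling could be made reproducible. The function name, the `seed` parameter, and the rule of returning every series when fewer than the requested number exist (M4 Weekly has only 359) are assumptions for illustration, not the behavior of `run_benchmark.py`.

```python
import random


def sample_series_ids(series_ids, n_sample, seed=0):
    """Return a reproducible random subset of series identifiers.

    Hypothetical helper: if fewer than n_sample series exist,
    all of them are returned (an assumed capping rule).
    """
    ids = sorted(series_ids)      # fixed order so the seed fully determines the draw
    if n_sample >= len(ids):
        return ids
    rng = random.Random(seed)     # local RNG; the global random state is untouched
    return sorted(rng.sample(ids, n_sample))


weekly_ids = [f"W{i}" for i in range(1, 360)]      # 359 weekly series
monthly_ids = [f"M{i}" for i in range(1, 48001)]   # 48,000 monthly series

quick_monthly = sample_series_ids(monthly_ids, 100)   # quick mode
full_weekly = sample_series_ids(weekly_ids, 1000)     # full mode, capped at 359
print(len(quick_monthly), len(full_weekly))           # 100 359
```

Fixing the seed means two runs in the same mode evaluate the same series, so their results stay comparable.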
M5 Competition
The M5 Competition (Makridakis et al., 2022) provides 30,490 hierarchical time series from Walmart:
Horizon: 28 days
Features: Rich exogenous variables (price, promotions, calendar)
Characteristics: Intermittent demand, hierarchical structure
Data access: Because of Kaggle's Terms of Service, the M5 data cannot be bundled with temporalcv. Users must download it manually:
# From Kaggle
kaggle competitions download -c m5-forecasting-accuracy
Models
Baseline Models
| Model | Description | Parameters |
|---|---|---|
| Naive | Repeats last observed value | None |
| SeasonalNaive | Repeats value from same seasonal period | |
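The two baselines can be sketched in a few lines of plain Python. This is an illustrative sketch, not the temporalcv implementation; the function names and the `season_length` parameter are assumptions.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every forecast step."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the value observed at the same position in the last full season."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    # Cycle through the last season as many times as the horizon requires.
    return [last_season[h % season_length] for h in range(horizon)]


history = [10, 20, 30, 40, 12, 22, 32, 42]       # two years of quarterly data
print(naive_forecast(history, 3))                # [42, 42, 42]
print(seasonal_naive_forecast(history, 6, 4))    # [12, 22, 32, 42, 12, 22]
```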
Statsforecast Models
Requires: pip install statsforecast
| Model | Description | Use Case |
|---|---|---|
| AutoARIMA | Automatic ARIMA selection | General purpose |
| AutoETS | Automatic exponential smoothing | Trended/seasonal data |
| AutoTheta | Theta method with automatic tuning | Competition-winning |
| CrostonClassic | Intermittent demand model | Sparse demand |
| ADIDA | Aggregate-Disaggregate Intermittent Demand Approach | Intermittent demand |
| IMAPA | Intermittent Multiple Aggregation Prediction Algorithm | Intermittent demand |
| HistoricAverage | Mean of all historical values | Stable series |
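Metrics
Error Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | `mean(\|y - ŷ\|)` | Average error magnitude, in the units of the data |
| RMSE | `sqrt(mean((y - ŷ)²))` | Penalizes large errors |
| MAPE | `mean(\|y - ŷ\| / \|y\|)` | Scale-independent relative error; undefined when y = 0 |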
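The three error metrics can be sketched with the standard library alone. This follows the textbook definitions rather than temporalcv's own code; returning MAPE as a fraction instead of a percentage, and raising an error on zero actuals, are assumptions.

```python
import math


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean squared error; squaring weights large errors more heavily."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y, y_hat)) / len(y))


def mape(y, y_hat):
    """Mean absolute percentage error, returned as a fraction (assumption)."""
    if any(a == 0 for a in y):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs(a - f) / abs(a) for a, f in zip(y, y_hat)) / len(y)


y = [100.0, 200.0, 400.0]
y_hat = [110.0, 190.0, 360.0]      # absolute errors: 10, 10, 40

print(mae(y, y_hat))               # 20.0
print(round(rmse(y, y_hat), 4))    # 24.4949
print(round(mape(y, y_hat), 4))    # 0.0833
```

The single 40-unit miss pulls RMSE above MAE, which is the sense in which RMSE penalizes large errors.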
Direction Metrics
| Metric | Description |
|---|---|
| Direction Accuracy | Proportion of correct direction predictions |
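One way to compute direction accuracy is sketched below. It assumes that "direction" means the sign of the change relative to the previous actual value; the source does not state the convention temporalcv uses, and the function name and `last_observed` parameter are hypothetical.

```python
def direction_accuracy(y, y_hat, last_observed):
    """Share of steps where the predicted and actual changes have the same sign.

    Assumption: each change is measured against the previous actual value,
    starting from the last training observation.
    """
    prev = last_observed
    hits = 0
    for actual, forecast in zip(y, y_hat):
        actual_dir = (actual > prev) - (actual < prev)        # -1, 0, or +1
        forecast_dir = (forecast > prev) - (forecast < prev)
        hits += actual_dir == forecast_dir
        prev = actual
    return hits / len(y)


y = [11.0, 10.0, 12.0, 12.5]
y_hat = [10.5, 10.8, 11.0, 11.9]
print(direction_accuracy(y, y_hat, last_observed=10.0))   # 0.75
```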
Statistical Tests
Diebold-Mariano Test
Tests whether the difference in forecast accuracy between two models is statistically significant.
Null hypothesis: The two models have equal predictive accuracy
Implementation:
Uses a HAC (heteroskedasticity and autocorrelation consistent) variance estimator
Accounts for the forecast horizon via the Newey-West bandwidth
p < 0.05 indicates a significant difference at the 5% level
Reference: Diebold, F.X. & Mariano, R.S. (1995). “Comparing Predictive Accuracy.” Journal of Business & Economic Statistics 13(3): 253-263.
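The test can be sketched as follows. This is not the temporalcv implementation. It assumes squared-error loss, Bartlett (Newey-West) weights with `horizon - 1` lags, and a normal approximation with no small-sample correction; the function name and signature are hypothetical.

```python
import math
from statistics import NormalDist


def diebold_mariano(errors_a, errors_b, horizon=1):
    """Return (statistic, two-sided p-value) for equal predictive accuracy.

    A positive statistic means model A has the larger squared-error loss.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("error series must have equal length")
    # Loss differential under squared-error loss (assumption).
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    # HAC long-run variance with Bartlett weights over horizon - 1 lags.
    long_run_var = autocov(0)
    for k in range(1, horizon):
        long_run_var += 2 * (1 - k / horizon) * autocov(k)
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate; the test is undefined")

    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


errors_a = [1.0, -2.0, 1.5, -1.0, 2.5, -1.5, 2.0, -2.5]
errors_b = [0.5, -0.5, 0.4, -0.6, 0.5, -0.4, 0.6, -0.5]
stat, p_value = diebold_mariano(errors_a, errors_b)
print(stat > 0, p_value < 0.05)   # True True: model B is significantly more accurate
```

With `horizon=1` the lag loop is empty and the estimator reduces to the sample variance of the loss differential. Longer horizons add weighted autocovariances because multi-step forecast errors overlap and are serially correlated.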
Reproducibility
Running Benchmarks
# Quick validation (~1 hour)
python scripts/run_benchmark.py --quick
# Full benchmark (~8-10 hours with 8 cores)
python scripts/run_benchmark.py --full
# Single frequency
python scripts/run_benchmark.py --dataset m4_monthly --sample 1000
# Resume interrupted run
python scripts/run_benchmark.py --resume benchmarks/results/run_abc123
Output Files
Results are saved to benchmarks/results/run_<uuid>/:
| File | Content |
|---|---|
| | Structured benchmark results |
| | Markdown-formatted tables |
| | Execution log |
| | Per-dataset checkpoints |
Environment
Benchmark metadata includes:
temporalcv version
Python version
Platform
CPU count
Runtime
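The metadata fields listed above could be gathered with the standard library, roughly as sketched here. The dictionary keys and the function name are hypothetical; the actual schema written by the benchmark script is not shown in this document.

```python
import os
import platform
import time
from importlib import metadata


def collect_environment(start_time):
    """Gather run metadata; start_time is a time.perf_counter() reading."""
    try:
        temporalcv_version = metadata.version("temporalcv")
    except metadata.PackageNotFoundError:
        temporalcv_version = "unknown"   # package not installed in this environment
    return {
        "temporalcv_version": temporalcv_version,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "runtime_seconds": time.perf_counter() - start_time,
    }


started = time.perf_counter()
env = collect_environment(started)
print(sorted(env))
```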
Limitations
Sampling bias: Subsampling may not represent full dataset characteristics
Model configurations: Default parameters used; tuning may improve results
Single train/test split: No cross-validation uncertainty estimates
Compute constraints: Full M4 (100k series) requires significant resources
References
Makridakis, S., et al. (2020). “The M4 Competition: 100,000 time series and 61 forecasting methods.” International Journal of Forecasting 36(1): 54-74.
Makridakis, S., et al. (2022). “M5 accuracy competition: Results, findings, and conclusions.” International Journal of Forecasting 38(4): 1346-1364.
Hyndman, R.J. & Athanasopoulos, G. (2021). “Forecasting: Principles and Practice.” 3rd ed. OTexts.