Benchmark Methodology

This document describes how temporalcv benchmark results are generated.

Datasets

M4 Competition

The M4 Competition (Makridakis et al., 2020) provides 100,000 time series across six frequencies:

Frequency	Series	Horizon (steps)	Min Length (observations)
Yearly	23,000	6	13
Quarterly	24,000	8	16
Monthly	48,000	18	42
Weekly	359	13	80
Daily	4,227	14	93
Hourly	414	48	700

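Benchmark sampling: To reduce compute time, each run uses a subset of series from every frequency:

Quick mode: 100 series per frequency
Full mode: 1,000 series per frequency

Weekly (359 series) and Hourly (414 series) contain fewer than 1,000 series, so full mode cannot draw a full 1,000 from them.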

Official train/test splits from the competition are preserved.
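The per-frequency sampling described above could be sketched as follows. This is an illustration only, not the temporalcv sampler: the function name `sample_series_ids`, the fixed seed, and the choice to return every series when a frequency has fewer than the requested number are all assumptions.

```python
import random


def sample_series_ids(series_ids, n_samples, seed=42):
    """Return a reproducible random subset of series identifiers.

    Hypothetical helper: the seed and the capping behaviour are assumptions.
    """
    ids = sorted(series_ids)  # sort first so the result ignores input order
    if n_samples >= len(ids):
        return ids  # fewer series than requested: keep them all
    rng = random.Random(seed)  # local generator, global random state untouched
    return sorted(rng.sample(ids, n_samples))


# Illustrative identifiers only; the counts match the M4 table above.
monthly_ids = [f"M{i}" for i in range(1, 48001)]
weekly_ids = [f"W{i}" for i in range(1, 360)]

quick_monthly = sample_series_ids(monthly_ids, 100)   # quick mode
full_weekly = sample_series_ids(weekly_ids, 1000)     # full mode, capped at 359
print(len(quick_monthly), len(full_weekly))
```

Fixing the seed means two runs compare models on the same series, which matters when results from different runs are placed side by side.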

M5 Competition

The M5 Competition (Makridakis et al., 2022) provides 30,490 hierarchical time series from Walmart:

Horizon: 28 days
Features: Rich exogenous variables (price, promotions, calendar)
Characteristics: Intermittent demand, hierarchical structure

Data access: Due to Kaggle Terms of Service, M5 data cannot be bundled. Users must download manually:

# From Kaggle
kaggle competitions download -c m5-forecasting-accuracy

Models

Baseline Models

Model	Description	Parameters
Naive	Repeats last observed value	None
SeasonalNaive	Repeats the value from the same position in the previous season	`season_length`
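Both baselines are simple enough to write out in a few lines. The sketch below uses plain Python lists and hypothetical function names; it illustrates the two forecasting rules and is not the implementation used by the benchmark.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every step of the horizon."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the most recent full season, cycling if the horizon is longer."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    return [last_season[step % season_length] for step in range(horizon)]


history = [10, 20, 30, 40, 12, 22, 32, 42]  # two seasons of length 4
print(naive_forecast(history, 3))              # [42, 42, 42]
print(seasonal_naive_forecast(history, 6, 4))  # [12, 22, 32, 42, 12, 22]
```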

Statsforecast Models

Requires: pip install statsforecast

Model	Description	Use Case
AutoARIMA	Automatic ARIMA selection	General purpose
AutoETS	Automatic exponential smoothing	Trended/seasonal data
AutoTheta	Theta method with automatic tuning	Strong general baseline (the Theta method won the M3 competition)
CrostonClassic	Intermittent demand model	Sparse demand
ADIDA	Aggregate-Disaggregate Intermittent Demand Approach	Intermittent demand
IMAPA	Intermittent Multiple Aggregation Prediction Algorithm	Intermittent demand
HistoricAverage	Mean of all historical values	Stable series

Metrics

Error Metrics

Metric	Formula	Interpretation
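MAE	`mean(|y - ŷ|)`	Average absolute error, in the units of the series
RMSE	`sqrt(mean((y - ŷ)²))`	Penalizes large errors
MAPE	`mean(|y - ŷ| / |y|)`	Scale-independent relative error; undefined when y = 0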
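The three error metrics above can be computed as in the following sketch. The function names are hypothetical, and two details are assumptions rather than documented temporalcv behaviour: MAPE is returned as a fraction (multiply by 100 for a percentage), and a zero actual value raises an error instead of being skipped.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - f) for y, f in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs(y - f) / abs(y) for y, f in zip(y_true, y_pred)) / len(y_true)


y_true = [100, 200, 400]
y_pred = [110, 180, 400]
print(mae(y_true, y_pred))   # 10.0
print(rmse(y_true, y_pred))  # sqrt(500 / 3), about 12.91
print(mape(y_true, y_pred))  # about 0.0667, i.e. 6.67%
```

The zero-actual check is the reason MAPE is a poor fit for intermittent-demand data such as M5, where many observations are zero.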

Direction Metrics

Metric	Description
Direction Accuracy	Proportion of correct direction predictions
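One possible reading of direction accuracy is sketched below. The definition is an assumption: each step's direction is the sign of the change relative to the previous actual value, starting from the last training observation, and a step counts as correct when the forecast and the actual value move the same way. The function name and arguments are hypothetical, and temporalcv may define the reference point differently.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def direction_accuracy(actuals, forecasts, last_observed):
    """Share of steps where forecast and actual move in the same direction.

    Assumption: changes are measured against the previous actual value.
    """
    previous = [last_observed] + list(actuals[:-1])
    hits = sum(
        sign(y - p) == sign(f - p)
        for y, f, p in zip(actuals, forecasts, previous)
    )
    return hits / len(actuals)


actuals = [11, 10, 12, 15]
forecasts = [12, 12, 11, 14]
print(direction_accuracy(actuals, forecasts, last_observed=10))  # 0.75
```

In this example the second step is the only miss: the series fell from 11 to 10 while the forecast of 12 implied a rise.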

Statistical Tests

Diebold-Mariano Test

Tests whether the difference in forecast accuracy between two models is statistically significant.

Null hypothesis: No difference in predictive accuracy

Implementation:

Uses a HAC (Heteroskedasticity and Autocorrelation Consistent) variance estimator
Sets the Newey-West bandwidth from the forecast horizon, because multi-step forecast errors are serially correlated
p < 0.05 indicates a significant difference at the 5% level

Reference: Diebold, F.X. & Mariano, R.S. (1995). “Comparing Predictive Accuracy.” JBES 13(3): 253-263.
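The test can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the temporalcv implementation: it uses squared-error loss, Bartlett (Newey-West) weights with a bandwidth of `horizon - 1`, and a two-sided p-value from the standard normal distribution. The function name and signature are hypothetical.

```python
import math
import random


def diebold_mariano(errors_a, errors_b, horizon=1):
    """Return (DM statistic, two-sided p-value) under squared-error loss."""
    if len(errors_a) != len(errors_b):
        raise ValueError("error series must have equal length")
    n = len(errors_a)
    # Loss differential: negative values favour model A.
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    d_bar = sum(d) / n
    centered = [x - d_bar for x in d]

    def autocov(k):
        return sum(centered[t] * centered[t - k] for t in range(k, n)) / n

    # Newey-West long-run variance with bandwidth horizon - 1 (assumption).
    lag = max(horizon - 1, 0)
    long_run_var = autocov(0) + 2 * sum(
        (1 - k / (lag + 1)) * autocov(k) for k in range(1, lag + 1)
    )
    if long_run_var <= 0:
        raise ValueError("loss differential has no variation")
    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # equals 2 * (1 - Phi(|stat|))
    return stat, p_value


# Synthetic forecast errors: model A is more accurate than model B.
rng = random.Random(0)
errors_a = [rng.gauss(0, 1) for _ in range(200)]
errors_b = [rng.gauss(0, 2) for _ in range(200)]
dm_stat, p_value = diebold_mariano(errors_a, errors_b, horizon=1)
print(f"DM = {dm_stat:.2f}, p = {p_value:.4f}")
```

A negative statistic with a small p-value indicates that the first model has significantly lower squared error. Swapping the two arguments flips the sign of the statistic and leaves the p-value unchanged.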

Reproducibility

Running Benchmarks

# Quick validation (~1 hour)
python scripts/run_benchmark.py --quick

# Full benchmark (~8-10 hours with 8 cores)
python scripts/run_benchmark.py --full

# Single frequency
python scripts/run_benchmark.py --dataset m4_monthly --sample 1000

# Resume interrupted run
python scripts/run_benchmark.py --resume benchmarks/results/run_abc123

Output Files

Results are saved to benchmarks/results/run_<uuid>/:

File	Content
`results.json`	Structured benchmark results
`results.md`	Markdown-formatted tables
`benchmark.log`	Execution log
`checkpoints/`	Per-dataset checkpoints

Environment

Benchmark metadata includes:

temporalcv version
Python version
Platform
CPU count
Runtime
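Metadata of this kind can be gathered with the standard library alone. The sketch below is illustrative: the dictionary keys are hypothetical and need not match the fields in `results.json`, and it assumes the package is installed under the distribution name `temporalcv`, falling back to a placeholder when it is not.

```python
import os
import platform
import time
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_environment(start_time):
    """Collect the environment fields listed above (hypothetical key names)."""
    return {
        "temporalcv_version": package_version("temporalcv"),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "runtime_seconds": round(time.perf_counter() - start_time, 3),
    }


start = time.perf_counter()
# ... the benchmark itself would run here ...
env = collect_environment(start)
print(env)
```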

Limitations

Sampling bias: Subsampling may not represent full dataset characteristics
Model configurations: Default parameters used; tuning may improve results
Single train/test split: No cross-validation uncertainty estimates
Compute constraints: Running the full M4 dataset (100,000 series) requires significant compute resources

References

Makridakis, S., et al. (2020). “The M4 Competition: 100,000 time series and 61 forecasting methods.” International Journal of Forecasting 36(1): 54-74.
Makridakis, S., et al. (2022). “M5 accuracy competition: Results, findings, and conclusions.” International Journal of Forecasting 38(4): 1346-1364.
Hyndman, R.J. & Athanasopoulos, G. (2021). “Forecasting: Principles and Practice.” 3rd ed. OTexts.