Benchmark Results
Last updated: 2025-12-31
temporalcv version: 1.0.0-rc1
Run ID: run_c700bfd9 (full benchmark)
Overview
This document presents benchmark results comparing forecasting models on the M4 Competition dataset using temporalcv’s model comparison framework. The benchmarks evaluate baseline models against statsforecast’s automatic model selection algorithms.
Summary

| Metric | Value |
|---|---|
| Datasets | 6 (M4 Competition, all frequencies) |
| Total series | 4,773 (1000 yearly + 1000 quarterly + 1000 monthly + 359 weekly + 1000 daily + 414 hourly) |
| Models compared | 9 |
| Total runtime | 14.3 minutes |
Model Wins by Frequency

| Model | Wins | Winning Frequencies |
|---|---|---|
| AutoETS | 3 | quarterly, monthly, weekly |
| AutoTheta | 1 | yearly |
| Naive | 1 | daily |
| AutoARIMA | 1 | hourly |
Mean MAE Across All Frequencies

| Rank | Model | Mean MAE | vs Naive |
|---|---|---|---|
| 1 | AutoARIMA | 475.9 | -12.9% |
| 2 | AutoETS | 518.7 | -5.0% |
| 3 | AutoTheta | 521.8 | -4.5% |
| 4 | Naive | 546.2 | baseline |
| 5 | ADIDA | 609.5 | +11.6% |
| 6 | IMAPA | 609.5 | +11.6% |
| 7 | CrostonClassic | 698.8 | +27.9% |
| 8 | HistoricAverage | 784.4 | +43.6% |
| 9 | SeasonalNaive_12 | 800.4 | +46.5% |
Note: SeasonalNaive_12 uses season_length=12 for all frequencies, which is only appropriate for monthly data. This mismatch likely explains its poor performance on yearly, quarterly, weekly, daily, and hourly data. Even on monthly data it ranks eighth, behind Naive (587.9 vs 537.5).
Key insight: AutoARIMA has the best mean MAE overall (-12.9% vs Naive), but AutoETS wins the most individual frequencies (3/6). This suggests AutoETS is more robust across frequency types, while AutoARIMA's lead in the mean comes largely from hourly data, where it beats every other model by a wide margin. On daily data AutoARIMA ranks fourth.
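The summary figures above can be re-derived from the per-frequency tables in this document. The sketch below assumes the "Mean MAE" column is an unweighted average of the six per-frequency MAE values, which is consistent with the reported numbers. The dictionary and function names are illustrative and are not part of temporalcv.

```python
from statistics import mean

# Per-frequency MAE copied from the tables in this document, ordered
# yearly, quarterly, monthly, weekly, daily, hourly.
MAE_BY_FREQUENCY = {
    "AutoARIMA": [703.3, 444.3, 486.8, 266.6, 114.2, 840.2],
    "AutoETS": [628.8, 426.2, 479.7, 247.2, 110.8, 1219.8],
    "AutoTheta": [625.6, 435.0, 492.7, 251.8, 111.5, 1214.3],
    "Naive": [704.5, 455.6, 537.5, 249.1, 109.6, 1220.7],
}


def mean_mae(model: str) -> float:
    """Unweighted mean of the six per-frequency MAE values."""
    return mean(MAE_BY_FREQUENCY[model])


def pct_vs_naive(model: str) -> float:
    """Relative difference in mean MAE against Naive, in percent."""
    baseline = mean_mae("Naive")
    return 100.0 * (mean_mae(model) - baseline) / baseline


for name in MAE_BY_FREQUENCY:
    print(f"{name:10s} mean MAE {mean_mae(name):6.1f}  vs Naive {pct_vs_naive(name):+5.1f}%")
```

Because MAE is scale-dependent, frequencies with larger absolute errors (yearly and hourly here) weigh more heavily in this average than daily or weekly.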
Per-Frequency Results
M4 Yearly (1000 series, horizon=6)

| Rank | Model | MAE |
|---|---|---|
| 1 | AutoTheta | 625.6 |
| 2 | AutoETS | 628.8 |
| 3 | AutoARIMA | 703.3 |
| 4 | Naive | 704.5 |
| 5 | ADIDA | 896.3 |
| 6 | IMAPA | 896.3 |
| 7 | HistoricAverage | 1256.1 |
| 8 | CrostonClassic | 1277.5 |
| 9 | SeasonalNaive_12 | 1548.4 |
M4 Quarterly (1000 series, horizon=8)

| Rank | Model | MAE |
|---|---|---|
| 1 | AutoETS | 426.2 |
| 2 | AutoTheta | 435.0 |
| 3 | AutoARIMA | 444.3 |
| 4 | Naive | 455.6 |
| 5 | ADIDA | 482.6 |
| 6 | IMAPA | 482.6 |
| 7 | CrostonClassic | 629.8 |
| 8 | SeasonalNaive_12 | 692.4 |
| 9 | HistoricAverage | 700.2 |
M4 Monthly (1000 series, horizon=18)

| Rank | Model | MAE |
|---|---|---|
| 1 | AutoETS | 479.7 |
| 2 | AutoARIMA | 486.8 |
| 3 | AutoTheta | 492.7 |
| 4 | ADIDA | 516.1 |
| 5 | IMAPA | 516.1 |
| 6 | Naive | 537.5 |
| 7 | CrostonClassic | 573.7 |
| 8 | SeasonalNaive_12 | 587.9 |
| 9 | HistoricAverage | 868.6 |
M4 Weekly (359 series, horizon=13)

| Rank | Model | MAE |
|---|---|---|
| 1 | AutoETS | 247.2 |
| 2 | Naive | 249.1 |
| 3 | AutoTheta | 251.8 |
| 4 | AutoARIMA | 266.6 |
| 5 | ADIDA | 276.3 |
| 6 | IMAPA | 276.3 |
| 7 | CrostonClassic | 318.1 |
| 8 | SeasonalNaive_12 | 366.1 |
| 9 | HistoricAverage | 408.2 |
M4 Daily (1000 series, horizon=14)

| Rank | Model | MAE |
|---|---|---|
| 1 | Naive | 109.6 |
| 2 | AutoETS | 110.8 |
| 3 | AutoTheta | 111.5 |
| 4 | AutoARIMA | 114.2 |
| 5 | ADIDA | 120.4 |
| 6 | IMAPA | 120.4 |
| 7 | CrostonClassic | 148.0 |
| 8 | SeasonalNaive_12 | 156.3 |
| 9 | HistoricAverage | 287.2 |
Note: On daily data, the simple Naive baseline wins, though narrowly (109.6 vs 110.8 for AutoETS, about 1%). For short-horizon daily forecasting, complex models may overfit.
M4 Hourly (414 series, horizon=48)

| Rank | Model | MAE |
|---|---|---|
| 1 | AutoARIMA | 840.2 |
| 2 | HistoricAverage | 1186.2 |
| 3 | AutoTheta | 1214.3 |
| 4 | AutoETS | 1219.8 |
| 5 | Naive | 1220.7 |
| 6 | CrostonClassic | 1245.5 |
| 7 | ADIDA | 1365.2 |
| 8 | IMAPA | 1365.2 |
| 9 | SeasonalNaive_12 | 1451.5 |
Note: AutoARIMA outperforms all other models on hourly data by a wide margin (-31% vs Naive), suggesting that ARIMA captures hourly patterns better than exponential smoothing methods.
Key Findings
1. AutoETS Most Robust Across Frequencies
AutoETS wins 3/6 frequencies (quarterly, monthly, weekly) and ranks no lower than fourth at any frequency, making it the most robust choice for general-purpose forecasting. However, AutoARIMA has the best mean MAE overall due to its exceptional performance on hourly data.
2. Frequency-Specific Winners

| Frequency | Best Model | Key Insight |
|---|---|---|
| Yearly | AutoTheta | Captures long-term trends with damping |
| Quarterly | AutoETS | Smooth exponential patterns |
| Monthly | AutoETS | Handles seasonal variation well |
| Weekly | AutoETS | Short-term smoothing effective |
| Daily | Naive | Simple baseline wins; complex models overfit |
| Hourly | AutoARIMA | ARIMA excels at high-frequency patterns |
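The winners listed above follow directly from the per-frequency MAE tables. As a minimal sketch, the snippet below picks the lowest-MAE model at each frequency and counts wins. Only the four models that win at least one frequency are included, and the names are illustrative rather than temporalcv API.

```python
from collections import Counter

FREQUENCIES = ["yearly", "quarterly", "monthly", "weekly", "daily", "hourly"]

# MAE per model, in the same order as FREQUENCIES (values from this document).
MAE = {
    "AutoARIMA": [703.3, 444.3, 486.8, 266.6, 114.2, 840.2],
    "AutoETS": [628.8, 426.2, 479.7, 247.2, 110.8, 1219.8],
    "AutoTheta": [625.6, 435.0, 492.7, 251.8, 111.5, 1214.3],
    "Naive": [704.5, 455.6, 537.5, 249.1, 109.6, 1220.7],
}


def best_model(freq_index: int) -> str:
    """Model with the lowest MAE at the given frequency index."""
    return min(MAE, key=lambda model: MAE[model][freq_index])


winners = {freq: best_model(i) for i, freq in enumerate(FREQUENCIES)}
win_counts = Counter(winners.values())

for freq, model in winners.items():
    print(f"{freq:10s} {model}")
print(dict(win_counts))
```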
3. Naive Baseline Surprisingly Strong on Daily Data
The Naive baseline won on M4 daily data, ahead of AutoETS by about 1% (109.6 vs 110.8). For short-horizon daily forecasting, complex models may introduce unnecessary variance.
4. AutoARIMA Dominates High-Frequency Data
AutoARIMA achieved a 31% MAE reduction vs Naive on hourly data, by far the largest improvement at any frequency. ARIMA’s autoregressive structure may capture hourly patterns that exponential smoothing misses. The advantage does not extend to daily data, where AutoARIMA ranks fourth.
5. Intermittent Demand Models Underperform
CrostonClassic, ADIDA, and IMAPA are designed for intermittent demand (many zeros). On M4 data, which contains continuous demand patterns, they trail all three automatic models at every frequency.
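Whether a series is intermittent can be checked before choosing a model family. The sketch below computes two simple indicators: the share of zero observations and the average demand interval (periods per non-zero observation). Both helpers and the example series are hypothetical illustrations, not part of the benchmark code.

```python
from typing import Sequence


def zero_fraction(series: Sequence[float]) -> float:
    """Share of observations that are exactly zero."""
    return sum(1 for x in series if x == 0) / len(series)


def average_demand_interval(series: Sequence[float]) -> float:
    """Number of periods per non-zero observation (1.0 means demand every period)."""
    nonzero = sum(1 for x in series if x != 0)
    if nonzero == 0:
        raise ValueError("series has no non-zero observations")
    return len(series) / nonzero


# Made-up series for illustration only.
intermittent = [0, 0, 5, 0, 3, 0, 0, 0, 2, 0]
continuous = [120.0, 125.5, 119.8, 130.2]

for name, values in [("intermittent", intermittent), ("continuous", continuous)]:
    print(name, zero_fraction(values), round(average_demand_interval(values), 2))
```

A series with no zeros has an average demand interval of 1.0, which is the regime where Croston-type methods have little to offer over standard smoothing.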
6. Seasonality Mismatch Hurts Performance
SeasonalNaive_12 performs poorly because it uses a fixed 12-period seasonal lag regardless of the actual data frequency. Proper seasonal period selection is critical.
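To illustrate the effect of the seasonal lag, here is a minimal seasonal naive forecaster in plain Python. The `SEASON_LENGTH` mapping is an assumption of this sketch (one common choice for M4 frequencies), not the configuration used by the benchmark, and `seasonal_naive` is a hypothetical helper rather than the temporalcv implementation.

```python
from typing import Sequence

# Assumed seasonal periods per frequency; other choices (e.g. 7 for daily) are plausible.
SEASON_LENGTH = {
    "yearly": 1,
    "quarterly": 4,
    "monthly": 12,
    "weekly": 1,
    "daily": 1,
    "hourly": 24,
}


def seasonal_naive(history: Sequence[float], horizon: int, season_length: int) -> list[float]:
    """Repeat the last observed season over the forecast horizon.

    With season_length=1 this reduces to the plain Naive forecast.
    """
    if season_length < 1 or len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = list(history[-season_length:])
    return [last_season[i % season_length] for i in range(horizon)]


# Synthetic hourly-style series with an exact 24-step cycle (illustration only).
history = [t % 24 for t in range(72)]
matched = seasonal_naive(history, horizon=4, season_length=SEASON_LENGTH["hourly"])
mismatched = seasonal_naive(history, horizon=4, season_length=12)
print(matched)     # the true continuation is 0, 1, 2, 3
print(mismatched)  # a 12-step lag repeats the wrong half of the cycle
```

On this synthetic series the 24-step lag reproduces the cycle exactly, while the 12-step lag is offset by half a period, which is the kind of mismatch described above.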
Methodology
Data
- Source: M4 Competition (Makridakis et al., 2018)
- Frequencies: Yearly, Quarterly, Monthly, Weekly, Daily, Hourly
- Series: 1000/1000/1000/359/1000/414 per frequency (4,773 total)
- Split: Official M4 train/test splits
- Sampling: Up to 1000 series per frequency, random seed 42 (M4 weekly and hourly have fewer than 1000 series in total, so all of them are used)
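The series counts above are what a capped, seeded sample would produce. The following sketch shows one way to do that with the standard library. The function `sample_series_ids` is hypothetical and the benchmark script may sample differently, so the specific series selected here will not necessarily match the run reported in this document.

```python
import random
from typing import Sequence


def sample_series_ids(
    series_ids: Sequence[str], max_series: int = 1000, seed: int = 42
) -> list[str]:
    """Return all ids if there are at most max_series, else a seeded random subset."""
    ids = list(series_ids)
    if len(ids) <= max_series:
        return ids
    return random.Random(seed).sample(ids, max_series)


# M4 weekly has 359 series in total and M4 daily has 4,227.
weekly_ids = [f"W{i}" for i in range(1, 360)]
daily_ids = [f"D{i}" for i in range(1, 4228)]

print(len(sample_series_ids(weekly_ids)), len(sample_series_ids(daily_ids)))  # prints: 359 1000
```

Using a dedicated `random.Random(seed)` instance keeps the selection reproducible without touching the global random state.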
Models

| Model | Package | Type |
|---|---|---|
| Naive | temporalcv | Baseline |
| SeasonalNaive | temporalcv | Baseline |
| AutoARIMA | statsforecast | Automatic |
| AutoETS | statsforecast | Automatic |
| AutoTheta | statsforecast | Automatic |
| CrostonClassic | statsforecast | Intermittent |
| ADIDA | statsforecast | Intermittent |
| IMAPA | statsforecast | Intermittent |
| HistoricAverage | statsforecast | Simple |
Metrics
- MAE: Mean Absolute Error (primary metric)
- RMSE: Root Mean Squared Error
- MAPE: Mean Absolute Percentage Error
- Direction Accuracy: Fraction of correct direction predictions
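For reference, the four metrics can be written in a few lines of plain Python. This is a sketch of the standard definitions, not temporalcv's implementation. In particular, the direction-accuracy rule used here (compare the sign of the actual and forecast change relative to the previous actual value) is an assumption, since the document does not spell out the exact definition.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; undefined if any actual is zero."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def direction_accuracy(
    actual: Sequence[float], forecast: Sequence[float], last_observed: float
) -> float:
    """Fraction of steps where forecast and actual move the same way from the previous actual."""
    previous = [last_observed, *actual[:-1]]
    hits = sum(
        _sign(a - p) == _sign(f - p) for a, f, p in zip(actual, forecast, previous)
    )
    return hits / len(actual)


# Made-up values for illustration only.
actual = [10.0, 12.0, 11.0, 15.0]
forecast = [11.0, 11.0, 12.0, 14.0]
print(mae(actual, forecast), rmse(actual, forecast))
print(round(mape(actual, forecast), 2), direction_accuracy(actual, forecast, last_observed=9.0))
```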
Statistical Testing
Diebold-Mariano (DM) test with HAC variance estimation compares each model against the best model per frequency. Significance level: p < 0.05 (marked with *).
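As a rough illustration of the test described above, the sketch below computes a Diebold-Mariano statistic under absolute-error loss with a HAC long-run variance. Several details are assumptions of this sketch and may differ from what temporalcv does: Bartlett (Newey-West) weights, a truncation lag of `horizon - 1`, and a normal approximation for the p-value. The function name and the example errors are hypothetical.

```python
import math
from typing import Sequence


def diebold_mariano(
    errors_a: Sequence[float],
    errors_b: Sequence[float],
    horizon: int = 1,
) -> tuple[float, float]:
    """Return (DM statistic, two-sided p-value) under absolute-error loss.

    A negative statistic means model A has the lower average loss.
    """
    n = len(errors_a)
    if n != len(errors_b) or n < 2:
        raise ValueError("need two equally long error series with at least 2 points")

    # Loss differential d_t = |e_a,t| - |e_b,t| (absolute loss, matching MAE).
    d = [abs(a) - abs(b) for a, b in zip(errors_a, errors_b)]
    d_bar = sum(d) / n
    centered = [x - d_bar for x in d]

    def autocov(lag: int) -> float:
        return sum(centered[t] * centered[t - lag] for t in range(lag, n)) / n

    # HAC long-run variance with Bartlett (Newey-West) weights.
    # The truncation lag horizon - 1 is an assumption of this sketch.
    max_lag = min(horizon - 1, n - 1)
    long_run_var = autocov(0)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1)
        long_run_var += 2.0 * weight * autocov(lag)
    if long_run_var <= 0.0:
        raise ValueError("non-positive variance estimate; the test is undefined")

    statistic = d_bar / math.sqrt(long_run_var / n)
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(statistic) / math.sqrt(2.0))
    return statistic, p_value


# Illustrative errors only (not benchmark output): model B errs three times as much.
errors_a = [0.1, -0.2, 0.15, -0.1, 0.05, -0.12, 0.2, -0.05, 0.1, -0.15]
errors_b = [3 * e for e in errors_a]
dm_stat, dm_p = diebold_mariano(errors_a, errors_b, horizon=1)
print(f"DM = {dm_stat:.2f}, p = {dm_p:.3g}")
```

With only a handful of observations the normal approximation is rough, and small-sample corrections exist. They are omitted here to keep the sketch short.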
Reproduction

```bash
# Install dependencies (the quotes keep shells such as zsh from expanding the brackets)
pip install "temporalcv[compare]" datasetsforecast statsforecast

# Run quick benchmark (100 series/freq, ~4 minutes)
python scripts/run_benchmark.py --quick --models all

# Run full benchmark (1000 series/freq, ~15 minutes)
python scripts/run_benchmark.py --full --models all

# Results saved to benchmarks/results/run_<id>/
```

Hardware: 128-core AMD EPYC (8 jobs parallel via joblib)
Runtime: 14.3 minutes for full benchmark
See docs/benchmarks/reproduce.md for detailed instructions.
References
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4), 802-808.
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253-263.
Nixtla. (2023). statsforecast: Lightning fast forecasting with statistical and econometric models. https://github.com/Nixtla/statsforecast