Reproducing Benchmark Results

Step-by-step guide to reproducing temporalcv benchmark results.

Prerequisites

Required

# Install temporalcv with benchmark dependencies
pip install temporalcv[benchmarks]

# Or from source
pip install -e ".[dev]"

Optional

# For statsforecast models
pip install statsforecast

# For M5 dataset
pip install kaggle

Quick Start

Validation Run (~1 hour)

# 100 series per frequency, baseline + statsforecast models
python scripts/run_benchmark.py --quick

Full Benchmark (~8-10 hours)

# 1000 series per frequency
python scripts/run_benchmark.py --full

Dataset-Specific Runs

M4 Competition

# All M4 frequencies
python scripts/run_benchmark.py --dataset m4_yearly --sample 1000
python scripts/run_benchmark.py --dataset m4_quarterly --sample 1000
python scripts/run_benchmark.py --dataset m4_monthly --sample 1000
python scripts/run_benchmark.py --dataset m4_weekly --sample 1000
python scripts/run_benchmark.py --dataset m4_daily --sample 1000
python scripts/run_benchmark.py --dataset m4_hourly --sample 1000
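
The six commands above differ only in the frequency name, so they can be generated in a loop. The following is a minimal sketch, not part of temporalcv: `build_m4_commands` and `run_m4_benchmarks` are hypothetical helpers that only assemble the `--dataset` and `--sample` flags shown above, and they assume the script is launched from the repository root.

```python
import subprocess
import sys

# Frequency names taken from the commands above.
M4_FREQUENCIES = ["yearly", "quarterly", "monthly", "weekly", "daily", "hourly"]


def build_m4_commands(sample=1000, script="scripts/run_benchmark.py"):
    """Return one argv list per M4 frequency (hypothetical helper)."""
    return [
        [sys.executable, script, "--dataset", f"m4_{freq}", "--sample", str(sample)]
        for freq in M4_FREQUENCIES
    ]


def run_m4_benchmarks(sample=1000, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_m4_commands(sample):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing run.
            subprocess.run(cmd, check=True)
```

Calling `run_m4_benchmarks()` only prints the commands; pass `dry_run=False` to run the frequencies one after another.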

M5 Competition

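M5 must be downloaded manually because of Kaggle's terms of service. The kaggle CLI needs an API token, and you must accept the competition rules on the Kaggle website before the download will succeed: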

# 1. Download from Kaggle
kaggle competitions download -c m5-forecasting-accuracy

# 2. Extract to a directory
unzip m5-forecasting-accuracy.zip -d ~/data/m5/

# 3. Run benchmark
python scripts/run_benchmark.py --dataset m5 --m5-path ~/data/m5/ --sample 1000
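
Before starting a long run it can help to confirm that the archive was extracted where `--m5-path` points. The sketch below is an illustration only: `missing_m5_files` is a hypothetical helper, and while the file names are those shipped in the Kaggle archive, which of them the benchmark actually reads is an assumption.

```python
from pathlib import Path

# File names from the Kaggle m5-forecasting-accuracy archive. Which of them
# the benchmark loader actually reads is an assumption; trim the list as needed.
EXPECTED_M5_FILES = (
    "calendar.csv",
    "sell_prices.csv",
    "sales_train_validation.csv",
    "sales_train_evaluation.csv",
)


def missing_m5_files(m5_dir, expected=EXPECTED_M5_FILES):
    """Return the expected file names that are absent from m5_dir."""
    root = Path(m5_dir).expanduser()  # expands a leading "~" as in ~/data/m5/
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from `missing_m5_files("~/data/m5/")` means every listed file was found.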

Model Selection

Baseline Only (Fast)

python scripts/run_benchmark.py --quick --models baseline

Statsforecast Only

python scripts/run_benchmark.py --quick --models statsforecast

All Models (Default)

python scripts/run_benchmark.py --quick --models all

Resuming Interrupted Runs

Benchmarks save checkpoints after each dataset. To resume:

# Find your run directory
ls benchmarks/results/

# Resume from checkpoint
python scripts/run_benchmark.py --resume benchmarks/results/run_abc123
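
If several runs have accumulated, the most recent one is usually the one to resume. A minimal sketch for locating it, assuming each run lives in its own subdirectory of `benchmarks/results/` and that modification time is a reasonable proxy for recency; `latest_run_dir` is a hypothetical helper, not a temporalcv function.

```python
from pathlib import Path


def latest_run_dir(results_root="benchmarks/results"):
    """Return the most recently modified subdirectory, or None if there is none."""
    root = Path(results_root)
    if not root.is_dir():
        return None
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    # max() with a default avoids a ValueError on an empty results directory.
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)
```

The returned path can be passed to `--resume` as shown above.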

Output Interpretation

results.json

{
  "metadata": {
    "run_id": "abc12345",
    "timestamp": "2025-01-15T10:30:00Z",
    "temporalcv_version": "1.0.0",
    "total_runtime_seconds": 3600.0
  },
  "report": {
    "results": [
      {
        "dataset_name": "M4_monthly",
        "best_model": "AutoETS",
        "models": [...]
      }
    ],
    "summary": {
      "wins_by_model": {"AutoETS": 3, "AutoARIMA": 2, ...}
    }
  }
}
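
Because the file is plain JSON, it can also be inspected without temporalcv. The sketch below uses only the keys visible in the example above (`report`, `results`, `dataset_name`, `best_model`); the helper names are hypothetical, and any fields elided from the example are not touched.

```python
import json
from pathlib import Path


def best_models(results):
    """Map each dataset name to its best model, given parsed results.json content."""
    entries = results.get("report", {}).get("results", [])
    return {entry["dataset_name"]: entry["best_model"] for entry in entries}


def load_best_models(path):
    """Read results.json from disk and return the dataset -> best model mapping."""
    return best_models(json.loads(Path(path).read_text()))
```

For example, `load_best_models("benchmarks/results/run_abc123/results.json")` returns a dictionary keyed by dataset name.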

results.md

Markdown tables ready for documentation:

## Summary

| Dataset | Best Model | MAE | vs Naive |
|---------|------------|-----|----------|
| M4_yearly | AutoETS | 0.1234 | -15.2% |
...
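
The table does not define the "vs Naive" column. One plausible reading, stated here as an assumption rather than as what temporalcv computes, is the relative change in MAE against the naive baseline, so that negative values mean the model has lower error. Under that assumption the figure could be sketched as follows (`pct_vs_naive` is a hypothetical helper):

```python
def pct_vs_naive(model_mae, naive_mae):
    """Relative MAE change against the naive baseline, in percent.

    Assumed definition: negative values mean the model beats the baseline.
    """
    if naive_mae == 0:
        raise ValueError("naive_mae must be non-zero")
    return 100.0 * (model_mae - naive_mae) / naive_mae
```

With this definition, a model MAE of 0.8 against a naive MAE of 1.0 is a 20% reduction in error.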

Generating Documentation

After running benchmarks:

from temporalcv.compare import (
    load_benchmark_results,
    generate_benchmark_docs,
)
from pathlib import Path

# Load results
report, metadata = load_benchmark_results(
    Path("benchmarks/results/run_abc123/results.json")
)

# Generate comprehensive documentation
docs = generate_benchmark_docs(report, metadata)

# Save to docs
Path("docs/benchmarks.md").write_text(docs)

Troubleshooting

ImportError: statsforecast not found

pip install statsforecast

M5 DatasetNotFoundError

Download the M5 data from Kaggle (see M5 Competition above) and pass its location with --m5-path:

python scripts/run_benchmark.py --dataset m5 --m5-path /path/to/m5/

Out of Memory

Reduce sample size:

python scripts/run_benchmark.py --sample 100

Or run single frequency:

python scripts/run_benchmark.py --dataset m4_monthly --sample 500

Slow Performance

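Enable parallel execution (requires joblib):

from temporalcv.compare.adapters.multi_series import MultiSeriesAdapter

# Wrap an existing adapter for parallel execution.
# `base_adapter` is the adapter instance you already use; n_jobs is the number of parallel workers.
adapter = MultiSeriesAdapter(base_adapter, n_jobs=8)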
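
The value `n_jobs=8` suits an 8-core machine. A small sketch for deriving the worker count from the host instead, using only the standard library; `pick_n_jobs` is a hypothetical helper, not a temporalcv function.

```python
import os


def pick_n_jobs(reserve=1):
    """Return a worker count that leaves `reserve` cores free, never below 1."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, cores - reserve)
```

The result can then be passed as the `n_jobs` argument. Leaving one core free keeps the machine responsive during long runs.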

Hardware Requirements

| Mode | RAM | Time (8-core) | Time (1-core) |
|------|-----|---------------|---------------|
| Quick | 4 GB | ~1 hour | ~6-8 hours |
| Full | 8 GB | ~8-10 hours | ~60+ hours |

Verifying Results

Compare your results to the published benchmarks:

from pathlib import Path

from temporalcv.compare import load_benchmark_results

# Load your results
my_report, _ = load_benchmark_results(Path("benchmarks/results/run_xyz/results.json"))

# Load reference results (if available)
ref_report, _ = load_benchmark_results(Path("benchmarks/reference/results.json"))

# Compare position by position; zip() assumes both runs list the same datasets in the same order
for my_result, ref_result in zip(my_report.results, ref_report.results):
    print(f"{my_result.dataset_name}:")
    print(f"  Your best: {my_result.best_model}")
    print(f"  Reference: {ref_result.best_model}")
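
The loop above pairs results by position, which silently misaligns if the two runs cover different datasets or list them in a different order. A sketch that matches on dataset name instead is shown below. It relies only on the `results`, `dataset_name`, and `best_model` attributes used above; `compare_best_models` itself is a hypothetical helper.

```python
def compare_best_models(my_report, ref_report):
    """Yield (dataset_name, my_best, ref_best) matched by dataset name.

    A dataset missing from either report yields None for that side.
    """
    mine = {r.dataset_name: r.best_model for r in my_report.results}
    ref = {r.dataset_name: r.best_model for r in ref_report.results}
    # Union of names, sorted for stable output.
    for name in sorted(mine.keys() | ref.keys()):
        yield name, mine.get(name), ref.get(name)
```

Iterating over `compare_best_models(my_report, ref_report)` and flagging rows where the two models differ, or where one side is `None`, gives a quick check of how closely a run matches the reference.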