Reproducing Benchmark Results

Step-by-step guide to reproducing temporalcv benchmark results.

Prerequisites

Required

# Install temporalcv with benchmark dependencies
pip install "temporalcv[benchmarks]"

# Or from source
pip install -e ".[dev]"

Optional

# For statsforecast models
pip install statsforecast

# For M5 dataset
pip install kaggle

Quick Start

Validation Run (~1 hour)

# 100 series per frequency, baseline + statsforecast models
python scripts/run_benchmark.py --quick

Full Benchmark (~8-10 hours)

# 1000 series per frequency
python scripts/run_benchmark.py --full

Dataset-Specific Runs

M4 Competition

# All M4 frequencies
python scripts/run_benchmark.py --dataset m4_yearly --sample 1000
python scripts/run_benchmark.py --dataset m4_quarterly --sample 1000
python scripts/run_benchmark.py --dataset m4_monthly --sample 1000
python scripts/run_benchmark.py --dataset m4_weekly --sample 1000
python scripts/run_benchmark.py --dataset m4_daily --sample 1000
python scripts/run_benchmark.py --dataset m4_hourly --sample 1000
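The six commands above differ only in the frequency name, so they can be generated in a loop. The following is a minimal sketch, not part of temporalcv: the helper name `build_m4_commands` is hypothetical, and the script path and flags are copied from the commands above. It only prints the commands (a dry run).

```python
import shlex
import sys

# Frequencies taken from the commands listed above.
M4_FREQUENCIES = ["yearly", "quarterly", "monthly", "weekly", "daily", "hourly"]


def build_m4_commands(sample: int = 1000) -> list[list[str]]:
    """Return one run_benchmark.py invocation per M4 frequency (hypothetical helper)."""
    return [
        [sys.executable, "scripts/run_benchmark.py",
         "--dataset", f"m4_{freq}", "--sample", str(sample)]
        for freq in M4_FREQUENCIES
    ]


commands = build_m4_commands()
for cmd in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(cmd))
```

To actually launch the runs, each `cmd` can be passed to `subprocess.run(cmd, check=True)` from the repository root.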

M5 Competition

The M5 data must be downloaded manually because of Kaggle's terms of service:

# 1. Download from Kaggle (requires Kaggle API credentials and accepted competition rules)
kaggle competitions download -c m5-forecasting-accuracy

# 2. Extract to a directory
unzip m5-forecasting-accuracy.zip -d ~/data/m5/

# 3. Run benchmark
python scripts/run_benchmark.py --dataset m5 --m5-path ~/data/m5/ --sample 1000
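Before starting a long run, it can help to confirm that the archive was extracted where `--m5-path` points. The sketch below is an illustration under assumptions: the file names are those distributed in the Kaggle archive, but which of them the temporalcv loader actually reads is not stated here, and `missing_m5_files` is a hypothetical helper.

```python
from pathlib import Path

# File names as distributed in the Kaggle M5 Forecasting - Accuracy archive.
# Assumption: which of these the temporalcv loader needs is not confirmed here.
EXPECTED_FILES = ("calendar.csv", "sell_prices.csv", "sales_train_validation.csv")


def missing_m5_files(m5_dir: Path) -> list[str]:
    """Return the expected M5 files that are absent from m5_dir."""
    return [name for name in EXPECTED_FILES if not (m5_dir / name).is_file()]


m5_dir = Path("~/data/m5").expanduser()
missing = missing_m5_files(m5_dir)
if missing:
    print(f"Missing from {m5_dir}: {', '.join(missing)}")
else:
    print(f"{m5_dir} contains the expected files")
```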

Model Selection

Baseline Only (Fast)

python scripts/run_benchmark.py --quick --models baseline

Statsforecast Only

python scripts/run_benchmark.py --quick --models statsforecast

All Models (Default)

python scripts/run_benchmark.py --quick --models all

Resuming Interrupted Runs

Benchmarks save checkpoints after each dataset. To resume:

# Find your run directory
ls benchmarks/results/

# Resume from checkpoint
python scripts/run_benchmark.py --resume benchmarks/results/run_abc123
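If several runs exist, the most recent one is usually the one to resume. This is a small sketch that assumes run directories are named `run_*` (as in the example above) and that modification time reflects recency; `latest_run_dir` is a hypothetical helper, not a temporalcv function.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(results_root: Path) -> Optional[Path]:
    """Return the most recently modified run_* directory, or None if there is none."""
    if not results_root.is_dir():
        return None
    runs = [p for p in results_root.glob("run_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


run_dir = latest_run_dir(Path("benchmarks/results"))
if run_dir is None:
    print("No run directories found")
else:
    # Print the resume command rather than executing it.
    print(f"python scripts/run_benchmark.py --resume {run_dir}")
```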

Output Interpretation

results.json

{
  "metadata": {
    "run_id": "abc12345",
    "timestamp": "2025-01-15T10:30:00Z",
    "temporalcv_version": "1.0.0",
    "total_runtime_seconds": 3600.0
  },
  "report": {
    "results": [
      {
        "dataset_name": "M4_monthly",
        "best_model": "AutoETS",
        "models": [...]
      }
    ],
    "summary": {
      "wins_by_model": {"AutoETS": 3, "AutoARIMA": 2, ...}
    }
  }
}
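The file is plain JSON, so it can also be inspected with the standard library alone. The sketch below uses an inline sample that mirrors the structure above (the elided fields are left out) instead of a real results file; only the keys shown above are assumed to exist.

```python
import json

# Inline sample mirroring the structure shown above; elided fields are omitted.
sample = """
{
  "metadata": {"run_id": "abc12345", "temporalcv_version": "1.0.0",
               "total_runtime_seconds": 3600.0},
  "report": {
    "results": [{"dataset_name": "M4_monthly", "best_model": "AutoETS"}],
    "summary": {"wins_by_model": {"AutoETS": 3, "AutoARIMA": 2}}
  }
}
"""

# For a real run, read the text of results.json from the run directory instead.
data = json.loads(sample)

runtime_min = data["metadata"]["total_runtime_seconds"] / 60
print(f"Run {data['metadata']['run_id']} took {runtime_min:.0f} min")

# One line per dataset with its winning model.
for result in data["report"]["results"]:
    print(f"{result['dataset_name']}: best model = {result['best_model']}")

# Model with the most dataset wins.
wins = data["report"]["summary"]["wins_by_model"]
top_model = max(wins, key=wins.get)
print(f"Most wins: {top_model} ({wins[top_model]})")
```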

results.md

Markdown tables ready for documentation:

## Summary

| Dataset | Best Model | MAE | vs Naive |
|---------|------------|-----|----------|
| M4_yearly | AutoETS | 0.1234 | -15.2% |
...
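The "vs Naive" column is not defined in this guide. A plausible reading, and it is an assumption, is the relative change in MAE against the naive baseline, where negative values mean a lower error than naive. The sketch below illustrates that reading; `pct_vs_naive` and the naive MAE of 0.1455 are hypothetical.

```python
def pct_vs_naive(model_mae: float, naive_mae: float) -> float:
    """Relative MAE change versus the naive baseline, in percent.

    Assumption: this mirrors how the "vs Naive" column is computed.
    """
    if naive_mae <= 0:
        raise ValueError("naive_mae must be positive")
    return (model_mae - naive_mae) / naive_mae * 100.0


# Hypothetical naive MAE chosen for illustration only.
print(f"{pct_vs_naive(0.1234, 0.1455):+.1f}%")  # -15.2%
```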

Generating Documentation

After running benchmarks:

from temporalcv.compare import (
    load_benchmark_results,
    generate_benchmark_docs,
)
from pathlib import Path

# Load results
report, metadata = load_benchmark_results(
    Path("benchmarks/results/run_abc123/results.json")
)

# Generate comprehensive documentation
docs = generate_benchmark_docs(report, metadata)

# Save to docs
Path("docs/benchmarks.md").write_text(docs)

Troubleshooting

ImportError: statsforecast not found

pip install statsforecast

M5 DatasetNotFoundError

Download the M5 data from Kaggle and pass its location with --m5-path:

python scripts/run_benchmark.py --dataset m5 --m5-path /path/to/m5/

Out of Memory

Reduce sample size:

python scripts/run_benchmark.py --sample 100

Or run single frequency:

python scripts/run_benchmark.py --dataset m4_monthly --sample 500

Slow Performance

Enable parallel execution (requires joblib):

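from temporalcv.compare.adapters.multi_series import MultiSeriesAdapter

# Wrap an existing adapter for parallel execution.
# base_adapter is the adapter instance you already use; n_jobs sets the number of parallel workers.
adapter = MultiSeriesAdapter(base_adapter, n_jobs=8)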
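Rather than hard-coding `n_jobs=8`, the worker count can be derived from the machine. This is a minimal sketch using only the standard library; `pick_n_jobs` is a hypothetical helper, and leaving one core free is a judgment call, not a temporalcv recommendation.

```python
import os


def pick_n_jobs(reserve: int = 1) -> int:
    """Return a worker count that leaves `reserve` cores free, never below 1."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpus - reserve)


n_jobs = pick_n_jobs()
print(f"Using n_jobs={n_jobs}")
```

The resulting value can then be passed as `n_jobs` when wrapping the adapter.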

Hardware Requirements

| Mode | RAM | Time (8-core) | Time (1-core) |
|------|-----|---------------|---------------|
| Quick | 4 GB | ~1 hour | ~6-8 hours |
| Full | 8 GB | ~8-10 hours | ~60+ hours |

Verifying Results

Compare your results to published benchmarks:

from pathlib import Path

from temporalcv.compare import load_benchmark_results

# Load your results
my_report, _ = load_benchmark_results(Path("benchmarks/results/run_xyz/results.json"))

# Load reference results (if available)
ref_report, _ = load_benchmark_results(Path("benchmarks/reference/results.json"))

# Index reference results by dataset so the comparison does not depend on ordering
ref_by_dataset = {r.dataset_name: r for r in ref_report.results}

# Compare
for my_result in my_report.results:
    ref_result = ref_by_dataset.get(my_result.dataset_name)
    print(f"{my_result.dataset_name}:")
    print(f"  Your best: {my_result.best_model}")
    print(f"  Reference: {ref_result.best_model if ref_result else 'n/a'}")