Testing Strategy¶

Version: 1.0.0 | Last Updated: 2025-01-05

This document defines the testing approach for temporalcv, a library focused on temporal validation with leakage protection.

Core Testing Principles¶

1. Real Tests Only¶

NO stubs, TODOs, or placeholder tests. Every test must:

Actually execute the code being tested
Assert meaningful correctness properties
Use real data (not just “doesn’t crash”)

# ❌ BAD - Stub test
def test_walk_forward_cv():
    pass  # TODO: implement

# ✅ GOOD - Real test
def test_walk_forward_cv():
    X = np.random.randn(100, 5)
    cv = WalkForwardCV(n_splits=3, test_size=10)
    splits = list(cv.split(X))
    assert len(splits) == 3
    # Verify temporal ordering
    for train_idx, test_idx in splits:
        assert train_idx[-1] < test_idx[0]

2. Coverage Targets¶

Code Type	Target Coverage	Rationale
Modules	80%+	Core functionality must be well-tested
Scripts	60%+	Scripts have more I/O, lower bar
Core systems (gates, CV)	90%+	Critical path for leakage prevention

3. Test Categories¶

temporalcv uses a 6-layer validation architecture adapted from ~/Claude/lever_of_archimedes/patterns/testing.md:

Layer 1: Unit tests (pure functions, isolated logic)
Layer 2: Integration tests (CV + metrics, gates + validators)
Layer 3: Anti-pattern tests (detect leakage bugs)
Layer 4: Property tests (statistical properties hold)
Layer 5: Benchmarks (performance + accuracy on M4/M5)
Layer 6: End-to-end validation (full workflow)

Testing Layers¶

Layer 1: Unit Tests¶

Purpose: Test individual functions in isolation.

Location: tests/test_*.py (one file per module)

Coverage:

All public functions in temporalcv.*
Edge cases (empty arrays, single sample, etc.)
Parameter validation (raises on invalid input)

Example:

def test_compute_mae():
    predictions = np.array([1.0, 2.0, 3.0])
    actuals = np.array([1.5, 2.5, 2.5])
    mae = compute_mae(predictions, actuals)
    np.testing.assert_almost_equal(mae, 0.5)

Layer 2: Integration Tests¶

Purpose: Test interactions between components.

Location: tests/integration/test_*.py

Coverage:

CV splitters + model fitting
Gates + metrics
Statistical tests + HAC variance

Example:

def test_walk_forward_with_gates(self):
    """Verify gates pass on valid walk-forward CV."""
    cv = WalkForwardCV(n_splits=3, horizon=2, extra_gap=0, test_size=10)
    for train_idx, test_idx in cv.split(X):
        gate = gate_temporal_boundary(
            train_end_idx=train_idx[-1],
            test_start_idx=test_idx[0],
            horizon=2,
            extra_gap=0
        )
        assert gate.status == GateStatus.PASS

Layer 3: Anti-Pattern Tests¶

Purpose: Verify that leakage bugs are caught by gates.

Location: tests/anti_patterns/test_*.py

Coverage:

10 bug categories from data_leakage_prevention.md
Each anti-pattern must HALT validation

Example:

def test_lag_leakage_detected():
    """Gate should HALT when lag computed on full series."""
    # Intentional leakage: compute lag-1 on FULL series (train + test)
    X_full = np.random.randn(100, 1)
    X_leaky = np.column_stack([X_full, np.roll(X_full[:, 0], 1)])

    # Shuffled target gate should catch this
    result = gate_signal_verification(DummyModel(), X_leaky, y, n_shuffles=100)
    assert result.status == GateStatus.HALT

Bug categories tested (from docs/knowledge/leakage_audit_trail.md):

Lag leakage (full series computation)
Temporal boundary violations (no gap)
Threshold leakage (full series percentiles)
Feature selection on target
Regime computation lookahead

Layer 4: Property Tests¶

Purpose: Verify statistical properties hold over many random inputs.

Location: tests/property/test_*.py (uses hypothesis or manual property checks)

Coverage:

Gates always respect temporal boundaries
CV splits are non-overlapping
Statistical test p-values are uniform under H0

Example:

@pytest.mark.parametrize("n_samples", [50, 100, 200])
@pytest.mark.parametrize("n_splits", [3, 5, 10])
def test_walk_forward_no_overlap(n_samples, n_splits):
    """Verify no train/test overlap across random configurations."""
    X = np.random.randn(n_samples, 5)
    cv = WalkForwardCV(n_splits=n_splits, test_size=10)

    for train_idx, test_idx in cv.split(X):
        # Property: train and test indices must be disjoint
        assert len(set(train_idx) & set(test_idx)) == 0

Layer 5: Benchmarks¶

Purpose: Validate accuracy and performance on real datasets.

Location: benchmarks/ (separate from unit tests)

Coverage:

M4 Competition (6 frequencies, 100k series)
M5 Competition (30k hierarchical series)
Monash Time Series Repository

Metrics:

Forecast accuracy (MAE, RMSE)
Runtime (seconds per series)
Memory usage

See: docs/benchmarks/methodology.md for full details.

Layer 6: End-to-End Validation¶

Purpose: Test complete user workflows from data loading to deployment.

Location: tests/integration/test_e2e_*.py

Coverage:

Load data → Run gates → CV → Metrics → Statistical tests → Deploy
Notebook execution (jupyter nbconvert --execute)
Example scripts run without errors

Example:

def test_full_validation_workflow():
    """End-to-end: gates → CV → metrics → DM test."""
    # 1. Run gates
    gates = [
        gate_signal_verification(model, X, y, n_shuffles=100),
        gate_suspicious_improvement(model_mae=0.10, baseline_mae=0.15),
    ]
    report = run_gates(gates)
    assert report.status == GateStatus.PASS

    # 2. Walk-forward CV
    cv = WalkForwardCV(n_splits=5, horizon=1, extra_gap=0, test_size=20)
    results = walk_forward_evaluate(model, X, y, cv)

    # 3. Compute metrics
    mae = compute_mae(results.predictions, results.actuals)

    # 4. Statistical test vs baseline
    baseline_errors = compute_naive_error(y)
    model_errors = results.predictions - results.actuals
    dm = dm_test(model_errors, baseline_errors)

    assert dm.pvalue < 0.05  # Model significantly better

Test Organization¶

tests/
├── test_cv.py                      # Layer 1: CV unit tests
├── test_gates.py                   # Layer 1: Gate unit tests
├── test_statistical_tests.py       # Layer 1: Statistical test unit tests
├── test_metrics.py                 # Layer 1: Metrics unit tests
├── test_conformal.py               # Layer 1: Conformal prediction unit tests
├── integration/
│   ├── test_full_workflow.py       # Layer 2 & 6: Integration + E2E
│   └── test_cv_gates.py            # Layer 2: CV + gates integration
├── anti_patterns/
│   ├── test_lag_leakage.py         # Layer 3: Lag computation leakage
│   ├── test_threshold_leakage.py   # Layer 3: Threshold leakage
│   └── test_feature_selection.py   # Layer 3: Feature selection leakage
├── property/
│   └── test_cv_properties.py       # Layer 4: Statistical properties
└── validation/
    └── test_shuffled_target.py     # Layer 4: Shuffled target gate properties

Running Tests¶

Quick Validation (CI)¶

# Run all tests with coverage
pytest tests/ -v --cov=temporalcv --cov-report=term-missing

# Fail if coverage < 80%
pytest tests/ --cov=temporalcv --cov-fail-under=80

Selective Testing¶

# Unit tests only (fast)
pytest tests/test_*.py -v

# Integration tests
pytest tests/integration/ -v

# Anti-pattern tests (leakage detection)
pytest tests/anti_patterns/ -v

# Property tests (may be slow)
pytest tests/property/ -v -m property

# End-to-end validation
pytest tests/integration/test_e2e_*.py -v

Benchmarks (separate from CI)¶

# Quick benchmark (~4 minutes)
python scripts/run_benchmark.py --quick

# Full benchmark (~15 minutes)
python scripts/run_benchmark.py --full

Continuous Integration (CI)¶

GitHub Actions Workflow¶

Triggers:

Every PR
Push to main
Nightly (for full benchmarks)

Matrix:

OS: Ubuntu, macOS, Windows
Python: 3.9, 3.10, 3.11, 3.12
Dependencies: Minimum versions, latest versions

Jobs:

Unit tests: Layer 1-2 (fast, <5 min)
Anti-pattern tests: Layer 3 (critical path)
Integration tests: Layer 6 (full workflow)
Notebook validation: Execute all notebooks
Documentation build: Sphinx with -W (warnings as errors)
Type checking: mypy src/temporalcv
Linting: ruff check src/ tests/

Test Data Strategy¶

Synthetic Data (Default)¶

Most tests use synthetic data for reproducibility:

np.random.seed(42)
X = np.random.randn(100, 5)
y = np.random.randn(100)

Pros:

Fast generation
Deterministic (reproducible)
No external dependencies

Cons:

May miss real-world edge cases

Real Data (Benchmarks Only)¶

Benchmarks use real datasets:

M4 Competition (via datasetsforecast)
M5 Competition (manual download)
Monash Repository (via temporalcv.benchmarks)

Pros:

Validates on production-like data
Catches real-world issues

Cons:

Slower (not for CI unit tests)
May have licensing restrictions

Assertions Style¶

Preferred Patterns¶

# ✅ Use numpy.testing for float comparisons
np.testing.assert_almost_equal(result, expected, decimal=6)
np.testing.assert_allclose(result, expected, rtol=1e-5)

# ✅ Use pytest.raises for error checking
with pytest.raises(ValueError, match="n_splits must be >= 2"):
    WalkForwardCV(n_splits=1)

# ✅ Use descriptive assertion messages
assert len(splits) == 3, f"Expected 3 splits, got {len(splits)}"

Anti-Patterns¶

# ❌ Direct float comparison (floating point errors)
assert result == 0.333333  # May fail due to precision

# ❌ Vague assertions
assert result  # What property are we testing?

# ❌ Multiple unrelated assertions in one test
def test_everything():
    assert cv.n_splits == 5
    assert gate.status == "PASS"
    assert mae < 0.1
    # Too many concerns → split into separate tests

Common Testing Patterns¶

Testing Gates¶

def test_gate_pass():
    """Gate passes on valid configuration."""
    gate = gate_temporal_boundary(
        train_end_idx=79,
        test_start_idx=85,
        horizon=5,
        extra_gap=0
    )
    assert gate.status == GateStatus.PASS

def test_gate_halt():
    """Gate halts on invalid configuration."""
    gate = gate_temporal_boundary(
        train_end_idx=79,
        test_start_idx=82,  # Gap of 2, but horizon=5
        horizon=5,
        extra_gap=0
    )
    assert gate.status == GateStatus.HALT
    assert "actual_gap (2) < required_gap (5)" in gate.message

Testing Statistical Tests¶

def test_dm_test_identical_models():
    """DM test p-value = 1.0 for identical predictions."""
    errors1 = np.array([0.1, -0.2, 0.3, -0.1])
    errors2 = errors1.copy()  # Identical

    result = dm_test(errors1, errors2, horizon=1)

    assert result.statistic == 0.0
    assert result.pvalue == 1.0

def test_dm_test_significantly_different():
    """DM test detects significant difference."""
    errors1 = np.zeros(100)  # Perfect model
    errors2 = np.random.randn(100)  # Random errors

    result = dm_test(errors1, errors2, horizon=1)

    assert result.pvalue < 0.01  # Highly significant

Test-Driven Development (TDD)¶

For new features, follow the Red-Green-Refactor cycle:

Red: Write failing test
Green: Implement minimal code to pass
Refactor: Clean up while keeping tests passing

Example: Adding a new gate

# Step 1: RED - Write test first
def test_gate_residual_autocorrelation_pass():
    """Gate passes when residuals are white noise."""
    residuals = np.random.randn(100)  # White noise
    gate = gate_residual_diagnostics(residuals)
    assert gate.status == GateStatus.PASS

# Step 2: GREEN - Implement
def gate_residual_diagnostics(residuals):
    from scipy.stats import acorr_ljungbox
    lb_stat, lb_pvalue = acorr_ljungbox(residuals, lags=10)
    status = GateStatus.PASS if lb_pvalue[-1] > 0.05 else GateStatus.HALT
    return GateResult(status=status, message=f"Ljung-Box p={lb_pvalue[-1]:.3f}")

# Step 3: REFACTOR - Add edge cases, improve implementation

Randomness and Reproducibility¶

Seeds in Tests¶

ALL tests must be deterministic. Use fixed random seeds:

def test_with_random_data():
    np.random.seed(42)  # Fixed seed for reproducibility
    X = np.random.randn(100, 5)
    # ... test logic

Seeds in Production Code¶

Functions that use randomness accept random_state parameter:

def gate_signal_verification(model, X, y, n_shuffles=100, random_state=None):
    """
    Parameters
    ----------
    random_state : int or None
        Random seed for reproducibility. If None, use system randomness.
    """
    rng = np.random.default_rng(random_state)
    # ... use rng for all random operations

Future Enhancements¶

Planned Additions (v1.1+)¶

Property-based testing with hypothesis for exhaustive edge case coverage
Mutation testing to verify test suite quality
Performance regression tests to catch slowdowns
Notebook execution CI for tutorial validation
Visual regression tests for plots in examples

References¶

6-Layer Validation Architecture: ~/Claude/lever_of_archimedes/patterns/testing.md
Data Leakage Prevention: ~/Claude/lever_of_archimedes/patterns/data_leakage_prevention.md
pytest Documentation: https://docs.pytest.org
numpy.testing Guide: https://numpy.org/doc/stable/reference/routines.testing.html
hypothesis Property Testing: https://hypothesis.readthedocs.io