Tutorial: Leakage Detection with Validation Gates¶

Data leakage is the #1 cause of inflated ML results. This tutorial shows how to catch it before it corrupts your work.

The Problem¶

Time-series leakage is insidious:

# WRONG: Lag features from full series
X = np.column_stack([np.roll(y, i) for i in range(1, 6)])  # Leaks future!

# WRONG: Thresholds from full data
threshold = np.percentile(np.abs(y), 70)  # Future info in classification!

# WRONG: No gap between train and test
train, test = y[:100], y[100:]  # y[99] leaks into test via features!

If your model shows >20% improvement over persistence baseline, be suspicious.

Gate 1: Shuffled Target Test (Definitive)¶

The shuffled target test is the definitive leakage detector:

from temporalcv.gates import gate_signal_verification
from sklearn.linear_model import Ridge

model = Ridge()

result = gate_signal_verification(
    model=model,
    X=X,
    y=y,
    n_shuffles=5,      # Average over 5 shuffles
    threshold=0.05,    # Max 5% improvement allowed
    random_state=42
)

print(result)

How It Works¶

Fit model on real (X, y), compute MAE
Shuffle y randomly (destroys temporal relationship)
Fit model on (X, y_shuffled), compute MAE
If real MAE < shuffled MAE * (1 - threshold), something is wrong

If your model beats a shuffled target, features contain temporal information that shouldn’t exist.

Interpreting Results¶

Status	Meaning	Action
`PASS`	Model doesn’t beat shuffled baseline	Safe to proceed
`HALT`	Model significantly beats shuffled	Investigate immediately

if result.status.name == "HALT":
    print(f"LEAKAGE DETECTED!")
    print(f"Real MAE: {result.details['mae_real']:.4f}")
    print(f"Shuffled MAE: {result.details['mae_shuffled_avg']:.4f}")
    print(f"Improvement: {result.details['improvement_ratio']:.1%}")

Gate 2: Synthetic AR(1) Bounds¶

Verify your model on data with known theoretical properties:

from temporalcv.gates import gate_synthetic_ar1

result = gate_synthetic_ar1(
    model=model,
    phi=0.95,          # AR(1) coefficient
    sigma=1.0,         # Innovation std
    n_samples=500,
    n_lags=5,          # Number of lag features
    tolerance=1.5,     # Allow 50% above theoretical MAE
    random_state=42
)

print(result)

How It Works¶

For AR(1) with coefficient φ, the theoretical optimal 1-step MAE is:

MAE_optimal = σ * sqrt(2/π) ≈ 0.798 * σ

If your model achieves MAE significantly below this, something is wrong.

When to Use¶

Model development sanity check
Validating new feature engineering
Regression testing after code changes

Gate 3: Suspicious Improvement¶

Automatically flag too-good results:

from temporalcv.gates import gate_suspicious_improvement

# After computing metrics
model_mae = 0.05
baseline_mae = 0.10  # 50% improvement - suspicious!

result = gate_suspicious_improvement(
    model_metric=model_mae,
    baseline_metric=baseline_mae,
    threshold=0.20,      # >20% improvement = HALT
    warn_threshold=0.10, # >10% improvement = WARN
    metric_name="MAE"
)

print(result)

Why 20%?¶

From empirical studies on financial time series:

5-10% improvement: Plausible with good features
10-20%: Unusual, warrants investigation
20%: Almost certainly a bug or leakage

Gate 4: Temporal Boundary¶

Verify proper separation between train and test:

from temporalcv.gates import gate_temporal_boundary

result = gate_temporal_boundary(
    train_end_idx=99,
    test_start_idx=100,
    horizon=2,
    gap=0  # Current gap between train and test
)

if result.status.name == "HALT":
    print(f"Gap too small! Need {result.details['required_gap']}, have {gap}")

Gap Calculation¶

For h-step forecasting, the last training observation must be at least h steps before the first test observation:

train_end_idx + gap >= test_start_idx - horizon + 1

Combining Gates¶

Run multiple gates and aggregate results:

from temporalcv import run_gates
from temporalcv.gates import (
    gate_signal_verification,
    gate_suspicious_improvement,
    gate_temporal_boundary,
)

# Collect gate results
gates = [
    gate_signal_verification(model, X, y, random_state=42),
    gate_suspicious_improvement(model_mae, baseline_mae),
    gate_temporal_boundary(train_end, test_start, horizon=2),
]

# Aggregate
report = run_gates(gates)

print(report.summary())

# Check overall status
if report.status == "HALT":
    print("\n❌ VALIDATION FAILED")
    for failure in report.failures:
        print(f"  - {failure.name}: {failure.message}")
elif report.status == "WARN":
    print("\n⚠️ WARNINGS PRESENT")
    for warning in report.warnings:
        print(f"  - {warning.name}: {warning.message}")
else:
    print("\n✓ All gates passed")

Real-World Example: Detecting Lag Leakage¶

import numpy as np
from sklearn.linear_model import Ridge
from temporalcv.gates import gate_signal_verification

# Generate high-persistence series
np.random.seed(42)
n = 300
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.98 * y[t-1] + np.random.randn() * 0.05

# WRONG: Create lag features from full series (LEAKAGE!)
X_leaky = np.column_stack([np.roll(y, i) for i in range(1, 4)])[3:]
y_leaky = y[3:]

# Test for leakage
model = Ridge(alpha=1.0)
result = gate_signal_verification(model, X_leaky, y_leaky, random_state=42)

print(f"Status: {result.status.name}")
print(f"Real MAE: {result.details['mae_real']:.4f}")
print(f"Shuffled MAE: {result.details['mae_shuffled_avg']:.4f}")
# Status: HALT (leakage detected!)

The Fix¶

Compute lag features within each CV split:

from temporalcv import WalkForwardCV

cv = WalkForwardCV(n_splits=5, window_type="sliding", window_size=100, horizon=2, extra_gap=0)

for train_idx, test_idx in cv.split(y):
    # Create features ONLY from training portion
    y_train = y[train_idx]

    # Now use these features...

Best Practices¶

Always run shuffled target test before trusting results
Set threshold to 20% for suspicious improvement (lower for financial data)
Use AR(1) synthetic test during development
Enforce gap >= horizon in all CV splits
Compute thresholds from training data only

Tutorial: Leakage Detection with Validation Gates¶

The Problem¶

Gate 1: Shuffled Target Test (Definitive)¶

How It Works¶

Interpreting Results¶

Gate 2: Synthetic AR(1) Bounds¶

How It Works¶

When to Use¶

Gate 3: Suspicious Improvement¶

Why 20%?¶

Gate 4: Temporal Boundary¶

Gap Calculation¶

Combining Gates¶

Real-World Example: Detecting Lag Leakage¶

The Fix¶

Best Practices¶

API Reference¶