Tutorial: Leakage Detection with Validation Gates¶
Data leakage is the #1 cause of inflated ML results. This tutorial shows how to catch it before it corrupts your work.
The Problem¶
Time-series leakage is insidious:
# WRONG: Lag features from full series
X = np.column_stack([np.roll(y, i) for i in range(1, 6)]) # Leaks future!
# WRONG: Thresholds from full data
threshold = np.percentile(np.abs(y), 70) # Future info in classification!
# WRONG: No gap between train and test
train, test = y[:100], y[100:] # y[99] leaks into test via features!
If your model shows >20% improvement over persistence baseline, be suspicious.
Gate 1: Shuffled Target Test (Definitive)¶
The shuffled target test is the definitive leakage detector:
from temporalcv.gates import gate_signal_verification
from sklearn.linear_model import Ridge
model = Ridge()
result = gate_signal_verification(
model=model,
X=X,
y=y,
n_shuffles=5, # Average over 5 shuffles
threshold=0.05, # Max 5% improvement allowed
random_state=42
)
print(result)
How It Works¶
Fit model on real
(X, y), compute MAEShuffle
yrandomly (destroys temporal relationship)Fit model on
(X, y_shuffled), compute MAEIf real MAE < shuffled MAE * (1 - threshold), something is wrong
If your model beats a shuffled target, features contain temporal information that shouldn’t exist.
Interpreting Results¶
Status |
Meaning |
Action |
|---|---|---|
|
Model doesn’t beat shuffled baseline |
Safe to proceed |
|
Model significantly beats shuffled |
Investigate immediately |
if result.status.name == "HALT":
print(f"LEAKAGE DETECTED!")
print(f"Real MAE: {result.details['mae_real']:.4f}")
print(f"Shuffled MAE: {result.details['mae_shuffled_avg']:.4f}")
print(f"Improvement: {result.details['improvement_ratio']:.1%}")
Gate 2: Synthetic AR(1) Bounds¶
Verify your model on data with known theoretical properties:
from temporalcv.gates import gate_synthetic_ar1
result = gate_synthetic_ar1(
model=model,
phi=0.95, # AR(1) coefficient
sigma=1.0, # Innovation std
n_samples=500,
n_lags=5, # Number of lag features
tolerance=1.5, # Allow 50% above theoretical MAE
random_state=42
)
print(result)
How It Works¶
For AR(1) with coefficient φ, the theoretical optimal 1-step MAE is:
MAE_optimal = σ * sqrt(2/π) ≈ 0.798 * σ
If your model achieves MAE significantly below this, something is wrong.
When to Use¶
Model development sanity check
Validating new feature engineering
Regression testing after code changes
Gate 3: Suspicious Improvement¶
Automatically flag too-good results:
from temporalcv.gates import gate_suspicious_improvement
# After computing metrics
model_mae = 0.05
baseline_mae = 0.10 # 50% improvement - suspicious!
result = gate_suspicious_improvement(
model_metric=model_mae,
baseline_metric=baseline_mae,
threshold=0.20, # >20% improvement = HALT
warn_threshold=0.10, # >10% improvement = WARN
metric_name="MAE"
)
print(result)
Why 20%?¶
From empirical studies on financial time series:
5-10% improvement: Plausible with good features
10-20%: Unusual, warrants investigation
20%: Almost certainly a bug or leakage
Gate 4: Temporal Boundary¶
Verify proper separation between train and test:
from temporalcv.gates import gate_temporal_boundary
result = gate_temporal_boundary(
train_end_idx=99,
test_start_idx=100,
horizon=2,
gap=0 # Current gap between train and test
)
if result.status.name == "HALT":
print(f"Gap too small! Need {result.details['required_gap']}, have {gap}")
Gap Calculation¶
For h-step forecasting, the last training observation must be at least h steps before the first test observation:
train_end_idx + gap >= test_start_idx - horizon + 1
Combining Gates¶
Run multiple gates and aggregate results:
from temporalcv import run_gates
from temporalcv.gates import (
gate_signal_verification,
gate_suspicious_improvement,
gate_temporal_boundary,
)
# Collect gate results
gates = [
gate_signal_verification(model, X, y, random_state=42),
gate_suspicious_improvement(model_mae, baseline_mae),
gate_temporal_boundary(train_end, test_start, horizon=2),
]
# Aggregate
report = run_gates(gates)
print(report.summary())
# Check overall status
if report.status == "HALT":
print("\n❌ VALIDATION FAILED")
for failure in report.failures:
print(f" - {failure.name}: {failure.message}")
elif report.status == "WARN":
print("\n⚠️ WARNINGS PRESENT")
for warning in report.warnings:
print(f" - {warning.name}: {warning.message}")
else:
print("\n✓ All gates passed")
Real-World Example: Detecting Lag Leakage¶
import numpy as np
from sklearn.linear_model import Ridge
from temporalcv.gates import gate_signal_verification
# Generate high-persistence series
np.random.seed(42)
n = 300
y = np.zeros(n)
for t in range(1, n):
y[t] = 0.98 * y[t-1] + np.random.randn() * 0.05
# WRONG: Create lag features from full series (LEAKAGE!)
X_leaky = np.column_stack([np.roll(y, i) for i in range(1, 4)])[3:]
y_leaky = y[3:]
# Test for leakage
model = Ridge(alpha=1.0)
result = gate_signal_verification(model, X_leaky, y_leaky, random_state=42)
print(f"Status: {result.status.name}")
print(f"Real MAE: {result.details['mae_real']:.4f}")
print(f"Shuffled MAE: {result.details['mae_shuffled_avg']:.4f}")
# Status: HALT (leakage detected!)
The Fix¶
Compute lag features within each CV split:
from temporalcv import WalkForwardCV
cv = WalkForwardCV(n_splits=5, window_type="sliding", window_size=100, horizon=2, extra_gap=0)
for train_idx, test_idx in cv.split(y):
# Create features ONLY from training portion
y_train = y[train_idx]
# Now use these features...
Best Practices¶
Always run shuffled target test before trusting results
Set threshold to 20% for suspicious improvement (lower for financial data)
Use AR(1) synthetic test during development
Enforce gap >= horizon in all CV splits
Compute thresholds from training data only