Example 01: Detecting Data Leakage with the Shuffled Target Test
This example demonstrates temporalcv’s most powerful validation gate: the shuffled target test. We show how it catches subtle leakage bugs that would otherwise inflate apparent model performance.
Real-World Case Study: Interest Rate Forecasting
We forecast 10-Year Treasury rates using lagged features. A common bug is computing rolling statistics (like 13-week volatility) on the FULL series before train/test split, leaking future information into features.
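To make the bug concrete, here is a minimal NumPy sketch (not part of the example script) that contrasts a centered 13-week volatility window with a trailing one. The helper names `centered_volatility` and `trailing_volatility` and the random-walk series are illustrative assumptions. The check is simple: a feature is safe for predicting y_t only if changing a later observation leaves the feature at t unchanged.

```python
import numpy as np

WINDOW = 13  # weeks, matching the 13-week volatility mentioned above


def centered_volatility(series: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """LEAKY: the window is centered on t, so it spans series[t-half .. t+half]."""
    half = window // 2
    out = np.full(len(series), np.nan)
    for t in range(half, len(series) - half):
        out[t] = np.std(series[t - half : t + half + 1])
    return out


def trailing_volatility(series: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """CLEAN: the window covers series[t-window .. t-1], strictly before t."""
    out = np.full(len(series), np.nan)
    for t in range(window, len(series)):
        out[t] = np.std(series[t - window : t])
    return out


# Hypothetical rate-like series (random walk around 2.5)
rng = np.random.default_rng(0)
rates = 2.5 + np.cumsum(rng.normal(scale=0.08, size=100))

# Perturb one observation in the future of t and see which feature at t moves
t = 50
bumped = rates.copy()
bumped[t + 1] += 1.0

leaky_moved = not np.isclose(centered_volatility(bumped)[t], centered_volatility(rates)[t])
clean_moved = not np.isclose(trailing_volatility(bumped)[t], trailing_volatility(rates)[t])
print(f"centered feature at t changed: {leaky_moved}")
print(f"trailing feature at t changed: {clean_moved}")
```

The centered feature reacts to the future change while the trailing one does not, which is the property that separates a leaky rolling statistic from a safe one.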
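The shuffled target test is designed to flag this: if your model beats a baseline trained on randomized targets, the features themselves contain target information. That is a leakage signal worth investigating, although on a highly persistent series legitimate lag features produce the same signal.
Key Insight
A model that significantly outperforms one trained on shuffled (randomized) targets has features that encode target information. The source may be legitimate temporal structure, temporal position, or future information, so compare against a known-clean feature set: in the output below, the leaky features improve on the shuffled baseline by 82.8% versus 76.2% for the clean ones.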
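As a rough illustration of the idea, the sketch below runs a permutation-style shuffled target test by hand. It is not temporalcv's implementation: ordinary least squares stands in for the model, the MAE is computed in-sample, and the helper names (`fit_predict_ols`, `shuffled_target_pvalue`) and the p-value formula `(k + 1) / (n_shuffles + 1)` are assumptions chosen for illustration.

```python
import numpy as np


def fit_predict_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit ordinary least squares with an intercept; return in-sample predictions."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef


def shuffled_target_pvalue(
    X: np.ndarray, y: np.ndarray, n_shuffles: int = 100, seed: int = 0
) -> tuple[float, float, float]:
    """Return (p_value, mae_real, mae_shuffled_avg) for a shuffled target test."""
    rng = np.random.default_rng(seed)
    mae_real = float(np.mean(np.abs(y - fit_predict_ols(X, y))))

    mae_shuffled = np.empty(n_shuffles)
    for i in range(n_shuffles):
        # Shuffle the targets only; the features keep their original order
        y_perm = rng.permutation(y)
        mae_shuffled[i] = np.mean(np.abs(y_perm - fit_predict_ols(X, y_perm)))

    # k = number of shuffled fits that do at least as well as the real fit
    k = int(np.sum(mae_shuffled <= mae_real))
    p_value = (k + 1) / (n_shuffles + 1)
    return p_value, mae_real, float(mae_shuffled.mean())


# Persistent AR(1) series with one clean lag feature (no leakage)
rng = np.random.default_rng(42)
series = np.zeros(300)
for t in range(1, len(series)):
    series[t] = 0.95 * series[t - 1] + rng.normal(scale=0.1)
X, y = series[:-1].reshape(-1, 1), series[1:]

p_value, mae_real, mae_shuffled_avg = shuffled_target_pvalue(X, y)
print(f"p={p_value:.4f}, MAE real={mae_real:.4f}, MAE shuffled avg={mae_shuffled_avg:.4f}")
```

On a persistent series like this one, even a clean lag feature is expected to produce a small p-value, which is why a significant result needs interpreting and cannot be read as leakage by itself.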
Usage
# With FRED API key (recommended):
export FRED_API_KEY=your_key_here
python 01_leakage_detection.py

# Without API key (uses synthetic data):
python 01_leakage_detection.py
Requirements
pip install temporalcv[fred]  # For FRED data
# or
pip install temporalcv scikit-learn  # Minimum requirements
======================================================================
TEMPORALCV: Detecting Data Leakage with Shuffled Target Test
======================================================================
Note: Using synthetic data (set FRED_API_KEY for real data)
Data source: Synthetic (Treasury-like)
Observations: 600
Mean rate: 2.32%
Std dev: 0.32%
ACF(1): 0.971 (high persistence)
======================================================================
SCENARIO 1: Clean Features (only lagged values)
======================================================================
Feature shape: (595, 5)
Features: y_{t-1}, y_{t-2}, ..., y_{t-5}
Shuffled Target Test Result: [HALT] signal_verification: Permutation test: p=0.0099 < α=0.05 (model has signal)
- MAE (real target): 0.0684
- MAE (shuffled avg): 0.2867
- P-value: 0.0099
- Improvement ratio: 76.2%
======================================================================
SCENARIO 2: Leaky Features (includes future info)
======================================================================
Feature shape: (592, 6)
BUG: 'Smoothed' feature uses centered window (includes y_t, y_{t+1}, y_{t+2})
Shuffled Target Test Result: [HALT] signal_verification: Permutation test: p=0.0099 < α=0.05 (model has signal)
- MAE (real target): 0.0493
- MAE (shuffled avg): 0.2867
- P-value: 0.0099
- Improvement ratio: 82.8%
======================================================================
COMPARISON: Clean vs Leaky Features
======================================================================
Clean features improvement: 76.2%
Leaky features improvement: 82.8%
Difference (leaky - clean): 6.6%
LEAKAGE DETECTED!
The leaky features show significantly higher improvement,
indicating they contain information about the target's position.
======================================================================
PRACTICAL GATE USAGE
======================================================================
Running gates with production thresholds...
============================================================
VALIDATION REPORT
============================================================
[HALT] signal_verification: Permutation test: p=0.0099 < α=0.05 (model has signal)
[PASS] suspicious_improvement: Improvement -1.3% is reasonable
============================================================
OVERALL STATUS: HALT
============================================================
HALTED GATES (require investigation):
- signal_verification: Model has predictive signal. Investigate source: legitimate temporal patterns (expected for AR models) OR data leakage (check feature engineering).
======================================================================
KEY TAKEAWAYS
======================================================================
1. The SHUFFLED TARGET TEST is the definitive leakage detector.
- If your model beats randomized targets, features encode target info.
- This catches rolling stats computed on full series, lookahead bias, etc.
2. Common leakage sources in time-series:
- Rolling statistics computed before train/test split
- Normalization (mean/std) computed on full dataset
- Feature selection using future data
- Information from the test period in feature engineering
3. Run the shuffled test BEFORE trusting impressive results.
- 40%+ improvement over persistence? Probably leakage.
- Run shuffled test, then investigate if it HALTs.
4. temporalcv gates follow HALT > WARN > PASS priority:
- HALT: Stop and investigate (critical failure)
- WARN: Proceed with caution (verify externally)
- PASS: Validation passed
Final status: HALT
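The reported figures can be cross-checked by hand. The sketch below assumes the improvement ratio is `1 - MAE_real / MAE_shuffled` and that the permutation p-value is `(k + 1) / (n_shuffles + 1)`, where `k` counts shuffled fits at least as good as the real one. Both formulas are inferred from the printed numbers, not taken from the library source.

```python
# Values copied from the output above (already rounded to 4 decimals)
MAE_SHUFFLED = 0.2867
MAE_REAL = {"clean": 0.0684, "leaky": 0.0493}
N_SHUFFLES = 100


def improvement_ratio(mae_real: float, mae_shuffled: float) -> float:
    """Assumed definition: fractional MAE reduction relative to the shuffled baseline."""
    return 1.0 - mae_real / mae_shuffled


ratios = {name: improvement_ratio(mae, MAE_SHUFFLED) for name, mae in MAE_REAL.items()}
for name, ratio in ratios.items():
    print(f"{name}: improvement ratio ~ {ratio:.3f}")

# Smallest attainable p-value when no shuffled fit matches the real one (k = 0)
min_p_value = (0 + 1) / (N_SHUFFLES + 1)
print(f"minimum p-value with {N_SHUFFLES} shuffles: {min_p_value:.4f}")
```

Under these assumptions the recomputed ratios agree with the printed 76.2% and 82.8% to within the rounding of the displayed MAEs, and 1/101 rounds to the printed p-value of 0.0099. A p-value at this floor is identical for both feature sets, which is why the comparison above relies on the improvement ratio to separate them.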
from __future__ import annotations

import warnings

import numpy as np
from sklearn.linear_model import Ridge

from temporalcv.gates import (
    gate_signal_verification,
    gate_suspicious_improvement,
    run_gates,
)

# Suppress sklearn convergence warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)
# =============================================================================
# Data Loading: FRED or Synthetic Fallback
# =============================================================================
def load_treasury_data() -> tuple[np.ndarray, str]:
    """
    Load 10-Year Treasury rate data from FRED, or generate realistic synthetic.

    Returns
    -------
    rates : np.ndarray
        Weekly rate observations (500+ points)
    source : str
        "FRED (10-Year Treasury)" or "Synthetic (Treasury-like)"
    """
    try:
        from temporalcv.benchmarks import load_fred_rates

        dataset = load_fred_rates(
            series="DGS10",
            start="2010-01-01",
            frequency="W",
        )
        return dataset.values, "FRED (10-Year Treasury)"
    except Exception:
        # Generate realistic synthetic data mimicking Treasury characteristics
        print("Note: Using synthetic data (set FRED_API_KEY for real data)")
        return _generate_synthetic_rates(), "Synthetic (Treasury-like)"


def _generate_synthetic_rates(
    n_samples: int = 600,
    initial_rate: float = 2.5,
    phi: float = 0.995,  # High persistence (typical for rates)
    sigma: float = 0.08,  # Weekly volatility
    seed: int = 42,
) -> np.ndarray:
    """
    Generate synthetic interest rate data with realistic characteristics.

    Mimics Treasury rate dynamics:
    - High persistence (AR(1) coefficient ~0.995)
    - Mean-reverting around long-run level
    - Realistic volatility regime
    """
    rng = np.random.default_rng(seed)
    rates = np.zeros(n_samples)
    rates[0] = initial_rate

    # AR(1) with mean reversion
    long_run_mean = 2.5
    for t in range(1, n_samples):
        innovation = sigma * rng.normal()
        rates[t] = phi * rates[t - 1] + (1 - phi) * long_run_mean + innovation

    return rates
# =============================================================================
# Feature Engineering: Clean vs Leaky
# =============================================================================
def create_clean_features(rates: np.ndarray, n_lags: int = 5) -> tuple[np.ndarray, np.ndarray]:
    """
    Create features WITHOUT leakage: the correct way.

    Features are computed using ONLY past data relative to each observation.
    For walk-forward validation, we split THEN compute features.
    """
    n = len(rates)
    features = []

    # Lag features (no leakage, just past values)
    for lag in range(1, n_lags + 1):
        lagged = np.full(n, np.nan)
        lagged[lag:] = rates[:-lag]
        features.append(lagged)

    # Stack and remove NaN rows
    X = np.column_stack(features)
    y = rates.copy()

    # Keep only rows with complete features
    valid_mask = ~np.isnan(X).any(axis=1)
    return X[valid_mask], y[valid_mask]


def create_leaky_features(rates: np.ndarray, n_lags: int = 5) -> tuple[np.ndarray, np.ndarray]:
    """
    Create features WITH leakage: the WRONG way (intentionally buggy).

    This mimics a catastrophic bug: including future target values as features.
    In real codebases, this happens through:
    - Off-by-one errors in lag computation
    - Target encoding computed on full dataset
    - Improperly aligned rolling windows
    """
    n = len(rates)
    features = []

    # Lag features (same as clean)
    for lag in range(1, n_lags + 1):
        lagged = np.full(n, np.nan)
        lagged[lag:] = rates[:-lag]
        features.append(lagged)

    # BUG: "Smoothed target" feature that accidentally includes current value
    # This mimics an off-by-one error in rolling window computation
    # Real example: using pandas rolling(..., center=True) by mistake
    smoothed = np.full(n, np.nan)
    window = 3
    for t in range(window, n - window):
        # BUG: the slice spans rates[t-3] .. rates[t+3], so it includes
        # rates[t] and three future values
        smoothed[t] = np.mean(rates[t - window : t + window + 1])
    features.append(smoothed)

    # Stack and remove NaN rows
    X = np.column_stack(features)
    y = rates.copy()

    valid_mask = ~np.isnan(X).any(axis=1)
    return X[valid_mask], y[valid_mask]
# =============================================================================
# Demonstration: Shuffled Target Test Catches Leakage
# =============================================================================
def demonstrate_leakage_detection():
    """
    Demonstrate how the shuffled target test catches data leakage.

    We compare two scenarios:
    1. Clean features (no leakage): lag features carry real signal on a
       persistent series, so the permutation gate still HALTs
    2. Leaky features (future info): the gate HALTs with a larger
       improvement ratio than the clean case
    """
    print("=" * 70)
    print("TEMPORALCV: Detecting Data Leakage with Shuffled Target Test")
    print("=" * 70)

    # Load data
    rates, source = load_treasury_data()
    print(f"\nData source: {source}")
    print(f"Observations: {len(rates)}")
    print(f"Mean rate: {np.mean(rates):.2f}%")
    print(f"Std dev: {np.std(rates):.2f}%")
    print(f"ACF(1): {np.corrcoef(rates[1:], rates[:-1])[0, 1]:.3f} (high persistence)")
    # =========================================================================
    # Scenario 1: Clean Features (Baseline Comparison)
    # =========================================================================
    print("\n" + "=" * 70)
    print("SCENARIO 1: Clean Features (only lagged values)")
    print("=" * 70)

    X_clean, y_clean = create_clean_features(rates)
    print(f"Feature shape: {X_clean.shape}")
    print("Features: y_{t-1}, y_{t-2}, ..., y_{t-5}")

    # Use Ridge regression (less prone to overfitting)
    model_clean = Ridge(alpha=1.0)

    # Run shuffled target test (permutation mode - default)
    # For high-persistence data, models WILL beat shuffled significantly
    # because lag features genuinely predict the target
    # Note: With permutation mode, metric_value is the p-value
    result_clean = gate_signal_verification(
        model=model_clean,
        X=X_clean,
        y=y_clean,
        n_shuffles=100,  # Need >=100 for statistical power in permutation mode
        random_state=42,
    )

    pvalue_clean = result_clean.metric_value  # p-value in permutation mode
    improvement_clean = result_clean.details.get("improvement_ratio", 0.0)
    print(f"\nShuffled Target Test Result: {result_clean}")
    print(f" - MAE (real target): {result_clean.details['mae_real']:.4f}")
    print(f" - MAE (shuffled avg): {result_clean.details['mae_shuffled_avg']:.4f}")
    print(f" - P-value: {pvalue_clean:.4f}")
    print(f" - Improvement ratio: {improvement_clean:.1%}")
    # =========================================================================
    # Scenario 2: Leaky Features (Should show MUCH higher improvement)
    # =========================================================================
    print("\n" + "=" * 70)
    print("SCENARIO 2: Leaky Features (includes future info)")
    print("=" * 70)

    X_leaky, y_leaky = create_leaky_features(rates)
    print(f"Feature shape: {X_leaky.shape}")
    print("BUG: 'Smoothed' feature uses centered window (includes y_t, y_{t+1}, y_{t+2})")

    # Use same model for fair comparison
    model_leaky = Ridge(alpha=1.0)
    result_leaky = gate_signal_verification(
        model=model_leaky,
        X=X_leaky,
        y=y_leaky,
        n_shuffles=100,  # Need >=100 for statistical power in permutation mode
        random_state=42,
    )

    pvalue_leaky = result_leaky.metric_value  # p-value in permutation mode
    improvement_leaky = result_leaky.details.get("improvement_ratio", 0.0)
    print(f"\nShuffled Target Test Result: {result_leaky}")
    print(f" - MAE (real target): {result_leaky.details['mae_real']:.4f}")
    print(f" - MAE (shuffled avg): {result_leaky.details['mae_shuffled_avg']:.4f}")
    print(f" - P-value: {pvalue_leaky:.4f}")
    print(f" - Improvement ratio: {improvement_leaky:.1%}")
    # =========================================================================
    # Compare the two scenarios
    # =========================================================================
    print("\n" + "=" * 70)
    print("COMPARISON: Clean vs Leaky Features")
    print("=" * 70)
    print(f"\n Clean features improvement: {improvement_clean:.1%}")
    print(f" Leaky features improvement: {improvement_leaky:.1%}")
    print(f" Difference (leaky - clean): {(improvement_leaky - improvement_clean):.1%}")

    if improvement_leaky > improvement_clean + 0.05:
        print("\n LEAKAGE DETECTED!")
        print(" The leaky features show significantly higher improvement,")
        print(" indicating they contain information about the target's position.")
    else:
        print("\n Note: Both scenarios show similar improvement patterns.")

    # =========================================================================
    # Practical Gate Usage
    # =========================================================================
    print("\n" + "=" * 70)
    print("PRACTICAL GATE USAGE")
    print("=" * 70)

    # For production use, set a threshold based on domain knowledge
    # Typical guideline: if improvement over shuffled > 95%, suspect leakage
    print("\nRunning gates with production thresholds...")

    # Compute baseline for suspicious improvement gate
    # Use train-test split for OUT-OF-SAMPLE evaluation
    split_idx = int(len(y_clean) * 0.8)
    X_train, X_test = X_clean[:split_idx], X_clean[split_idx:]
    y_train, y_test = y_clean[:split_idx], y_clean[split_idx:]

    # Persistence baseline on TEST data
    persistence_preds = X_test[:, 0]  # First lag is y[t-1]
    persistence_mae = np.mean(np.abs(y_test - persistence_preds))

    # Model predictions on TEST data (out-of-sample)
    model_clean.fit(X_train, y_train)
    model_preds = model_clean.predict(X_test)
    model_mae = np.mean(np.abs(y_test - model_preds))

    # Run multiple gates with production thresholds
    # In permutation mode (default), gate HALTs if p-value < alpha (0.05)
    result_shuffled = gate_signal_verification(
        model=Ridge(alpha=1.0),
        X=X_leaky,  # Test the LEAKY features
        y=y_leaky,
        n_shuffles=100,  # Need >=100 for statistical power in permutation mode
        random_state=42,
    )

    gates = [
        result_shuffled,
        gate_suspicious_improvement(
            model_metric=model_mae,
            baseline_metric=persistence_mae,
            threshold=0.20,
            warn_threshold=0.10,
        ),
    ]

    report = run_gates(gates)
    print(report.summary())
    # =========================================================================
    # Key Takeaways
    # =========================================================================
    print("\n" + "=" * 70)
    print("KEY TAKEAWAYS")
    print("=" * 70)
    print(
        """
1. The SHUFFLED TARGET TEST is the definitive leakage detector.
- If your model beats randomized targets, features encode target info.
- This catches rolling stats computed on full series, lookahead bias, etc.
2. Common leakage sources in time-series:
- Rolling statistics computed before train/test split
- Normalization (mean/std) computed on full dataset
- Feature selection using future data
- Information from the test period in feature engineering
3. Run the shuffled test BEFORE trusting impressive results.
- 40%+ improvement over persistence? Probably leakage.
- Run shuffled test, then investigate if it HALTs.
4. temporalcv gates follow HALT > WARN > PASS priority:
- HALT: Stop and investigate (critical failure)
- WARN: Proceed with caution (verify externally)
- PASS: Validation passed
"""
    )

    return report


if __name__ == "__main__":
    report = demonstrate_leakage_detection()
    print(f"\nFinal status: {report.status}")
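The HALT > WARN > PASS priority from takeaway 4 can be pictured as taking the most severe status across all gates. The sketch below is a hypothetical illustration of that rule only; the `Status` enum and the `overall_status` helper are invented for this example and are not temporalcv's API.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical gate statuses, ordered by severity."""

    PASS = 0
    WARN = 1
    HALT = 2


def overall_status(statuses: list[Status]) -> Status:
    """Return the most severe status; an empty list counts as PASS."""
    return max(statuses, default=Status.PASS)


# Mirrors the validation report above: one HALT gate and one PASS gate
gate_statuses = [Status.HALT, Status.PASS]
final = overall_status(gate_statuses)
print(f"Overall status: {final.name}")
```

With one gate halting and one passing, the combined status is HALT, matching the overall status shown in the validation report.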