API Reference: Validation Gates¶
Three-stage validation framework with HALT/PASS/WARN/SKIP decisions for leakage detection.
When to Use¶
graph TD
A[New Pipeline] --> B{Which leakage type?}
B -->|Features encode target| C[gate_signal_verification]
B -->|Train/test overlap| D[gate_temporal_boundary]
B -->|Too-good results| E[gate_suspicious_improvement]
B -->|Model beats theory| F[gate_synthetic_ar1]
B -->|Unsure| G[Run all gates]
C -->|HALT| H[Fix: Check .shift on rolling features]
D -->|HALT| I[Fix: Set horizon parameter]
E -->|HALT| J[Fix: Investigate data pipeline]
Common Mistakes¶
| Mistake | Why it fails | Fix |
|---|---|---|
| Using `n_shuffles=10` for permutation testing | Minimum p-value is 1/11 ≈ 0.09, so the test cannot detect leakage at alpha = 0.05 | Use `n_shuffles >= 100` for statistical power |
| Ignoring WARN status | WARNs often precede HALTs in production | Always investigate WARNs before deployment |
| Running gates after training | Gates should run before investing time in model development | Pattern: validate → develop → validate again |
See Also: Guardrails Tutorial, Examples 16-20
Enums¶
GateStatus¶
Validation gate status enumeration.
| Value | Meaning |
|---|---|
| `HALT` | Critical failure - stop and investigate |
| `WARN` | Caution - continue but verify |
| `PASS` | Validation passed |
| `SKIP` | Insufficient data to run gate |
Data Classes¶
GateResult¶
Result from a validation gate.
@dataclass
class GateResult:
    name: str                      # Gate identifier
    status: GateStatus             # HALT, WARN, PASS, or SKIP
    message: str                   # Human-readable description
    metric_value: Optional[float]  # Primary metric for this gate
    threshold: Optional[float]     # Threshold used for decision
    details: dict[str, Any]        # Additional metrics and diagnostics
    recommendation: str            # What to do if not PASS
String representation: [STATUS] name: message
ValidationReport¶
Complete validation report across all gates.
@dataclass
class ValidationReport:
    gates: List[GateResult]
Properties:
| Property | Type | Description |
|---|---|---|
| `status` | `str` | Overall status: “HALT” if any HALT, “WARN” if any WARN, else “PASS” |
| `failures` | `List[GateResult]` | Gates that HALTed |
| | `List[GateResult]` | Gates that WARNed |
Methods:
summary() -> str: Human-readable report summary
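The aggregation rule described for the overall status can be sketched as follows. This is a minimal illustration, not the library's implementation: `Status`, `Result`, and `overall_status` are hypothetical stand-ins for `GateStatus`, `GateResult`, and the report's status property, and treating SKIP like PASS simply follows the "else PASS" wording above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):  # hypothetical stand-in for GateStatus
    HALT = "HALT"
    WARN = "WARN"
    PASS = "PASS"
    SKIP = "SKIP"


@dataclass
class Result:  # hypothetical, trimmed-down stand-in for GateResult
    name: str
    status: Status


def overall_status(gates: List[Result]) -> str:
    """Return "HALT" if any gate HALTed, else "WARN" if any WARNed, else "PASS"."""
    statuses = {gate.status for gate in gates}
    if Status.HALT in statuses:
        return "HALT"
    if Status.WARN in statuses:
        return "WARN"
    # Assumption: SKIP does not change the outcome, per the "else PASS" rule.
    return "PASS"


gates = [Result("boundary", Status.PASS), Result("improvement", Status.WARN)]
print(overall_status(gates))
```

A single HALT dominates every other result, which is why the report checks for it first.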
Gate Functions¶
gate_signal_verification¶
Definitive leakage detection. If a model beats a shuffled target, features contain temporal information that shouldn’t exist.
def gate_signal_verification(
    model: FitPredictModel,
    X: ArrayLike,
    y: ArrayLike,
    n_shuffles: Optional[int] = None,
    threshold: float = 0.05,
    n_cv_splits: int = 3,
    permutation: Literal["iid", "block"] = "block",
    block_size: Union[int, Literal["auto"]] = "auto",
    random_state: Optional[int] = None,
    *,
    method: Literal["effect_size", "permutation"] = "permutation",
    alpha: float = 0.05,
    strict: bool = False,
    # Bootstrap CI parameters
    bootstrap_ci: bool = False,
    n_bootstrap: int = 100,
    bootstrap_alpha: float = 0.05,
    bootstrap_block_length: Union[int, Literal["auto"]] = "auto",
) -> GateResult
Parameters:
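| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `FitPredictModel` | required | Model with `fit` and `predict` methods |
| `X` | `ArrayLike` | required | Feature matrix (n_samples, n_features) |
| `y` | `ArrayLike` | required | Target vector (n_samples,) |
| `n_shuffles` | `Optional[int]` | `None` | Number of shuffles. Defaults to 100 for permutation mode, 5 for effect_size mode |
| `threshold` | `float` | `0.05` | Maximum allowed improvement ratio (effect_size mode only) |
| `n_cv_splits` | `int` | `3` | Number of CV splits for evaluation |
| `permutation` | `Literal["iid", "block"]` | `"block"` | Shuffle type: “iid” (random) or “block” (preserves autocorrelation) |
| `block_size` | `Union[int, Literal["auto"]]` | `"auto"` | Block size for block permutation |
| `random_state` | `Optional[int]` | `None` | Random seed for reproducibility |
| `method` | `Literal["effect_size", "permutation"]` | `"permutation"` | Decision method: “permutation” (p-value) or “effect_size” (improvement ratio) |
| `alpha` | `float` | `0.05` | Significance level for permutation mode (HALT if p < alpha) |
| `strict` | `bool` | `False` | If True, use strict inequality for p-value comparison |
| `bootstrap_ci` | `bool` | `False` | Enable block bootstrap confidence intervals |
| `n_bootstrap` | `int` | `100` | Number of bootstrap replications |
| `bootstrap_alpha` | `float` | `0.05` | Significance level for CI (0.05 = 95% CI) |
| `bootstrap_block_length` | `Union[int, Literal["auto"]]` | `"auto"` | Block length for MBB; “auto” uses n^(1/3) |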
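To illustrate what a “block” shuffle preserves compared with an “iid” shuffle, here is a minimal sketch. It assumes non-overlapping contiguous blocks whose order is shuffled; `block_permute` is a hypothetical helper, and the library's own block permutation may be implemented differently.

```python
import random
from typing import List, Sequence


def block_permute(y: Sequence[float], block_size: int, seed: int = 0) -> List[float]:
    """Shuffle the order of contiguous blocks, keeping values inside each block in order."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    rng = random.Random(seed)
    # Split the series into contiguous blocks (the last one may be shorter).
    blocks = [list(y[i:i + block_size]) for i in range(0, len(y), block_size)]
    rng.shuffle(blocks)
    # Flatten the shuffled blocks back into a single series.
    return [value for block in blocks for value in block]


series = list(range(10))
print(block_permute(series, block_size=3, seed=1))
```

Within each block the original ordering survives, so short-range autocorrelation in the target is kept while its alignment with the features is broken.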
Methods:
| Method | Metric | HALT condition |
|---|---|---|
| `permutation` | p-value | p < alpha |
| `effect_size` | improvement ratio | improvement ratio > threshold |
Important: For permutation mode, use n_shuffles >= 100 for adequate statistical power. With small n_shuffles, the minimum achievable p-value is 1/(n_shuffles+1), which may prevent HALT even for blatant leakage.
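The floor on the p-value follows from the (b + 1) / (m + 1) estimator of Phipson & Smyth (2010), listed in the references. The sketch below assumes that estimator and counts a shuffled run as "at least as good" when its MAE is no larger than the real MAE; both helper names are hypothetical and the gate's internals may differ.

```python
from typing import Sequence


def permutation_pvalue(real_mae: float, shuffled_maes: Sequence[float]) -> float:
    """(b + 1) / (m + 1), where b counts shuffled runs with MAE <= the real MAE."""
    b = sum(1 for mae in shuffled_maes if mae <= real_mae)
    return (b + 1) / (len(shuffled_maes) + 1)


def min_pvalue(n_shuffles: int) -> float:
    """Smallest p-value reachable with n_shuffles permutations (b = 0)."""
    return 1 / (n_shuffles + 1)


for n in (10, 19, 100):
    # Compare the floor against the default alpha of 0.05.
    print(n, round(min_pvalue(n), 4), min_pvalue(n) < 0.05)
```

With 10 shuffles the floor is about 0.09, so p < 0.05 can never be satisfied no matter how strongly the model beats the shuffled baseline.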
Returns: GateResult with status HALT if model beats shuffled baseline
Details dict:
mae_real: MAE on real target
mae_shuffled_avg: Mean MAE on shuffled targets
mae_shuffled_all: List of all shuffled MAEs
n_shuffles: Number of shuffles performed
improvement_ratio: 1 - mae_real/mae_shuffled_avg
pvalue: p-value from permutation test (permutation mode only)
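As a worked example of the improvement_ratio formula above, the following sketch uses made-up MAE values purely for illustration; `improvement_ratio` here is a hypothetical helper, not a library function.

```python
from statistics import mean
from typing import Sequence


def improvement_ratio(mae_real: float, mae_shuffled_all: Sequence[float]) -> float:
    """1 - mae_real / mae_shuffled_avg, as defined in the details dict."""
    mae_shuffled_avg = mean(mae_shuffled_all)
    return 1 - mae_real / mae_shuffled_avg


# Illustrative numbers only: real MAE of 0.80 against shuffled MAEs near 1.00.
ratio = improvement_ratio(0.80, [1.02, 0.98, 1.00, 0.99, 1.01])
print(f"improvement_ratio = {ratio:.3f}")
```

A ratio near 0 means the model does no better on the real target than on shuffled ones; a ratio of about 0.20 means its MAE is roughly 20% lower on the real target.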
Additional fields when bootstrap_ci=True:
ci_lower: Lower bound of confidence interval for MAE
ci_upper: Upper bound of confidence interval for MAE
ci_alpha: Significance level used (e.g., 0.05)
bootstrap_std: Bootstrap standard error
n_bootstrap: Number of bootstrap samples
bootstrap_block_length: Block length used
gate_synthetic_ar1¶
Test model on synthetic AR(1) where theoretical optimum is known.
def gate_synthetic_ar1(
    model: FitPredictModel,
    phi: float = 0.95,
    sigma: float = 1.0,
    n_samples: int = 500,
    n_lags: int = 5,
    tolerance: float = 1.5,
    random_state: Optional[int] = None,
    # Bootstrap CI parameters
    bootstrap_ci: bool = False,
    n_bootstrap: int = 100,
    bootstrap_alpha: float = 0.05,
    bootstrap_block_length: Union[int, Literal["auto"]] = "auto",
) -> GateResult
Parameters:
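| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `FitPredictModel` | required | Model to test |
| `phi` | `float` | `0.95` | AR(1) coefficient |
| `sigma` | `float` | `1.0` | Innovation standard deviation |
| `n_samples` | `int` | `500` | Samples to generate |
| `n_lags` | `int` | `5` | Lag features to create |
| `tolerance` | `float` | `1.5` | How much better model can be than theoretical |
| `random_state` | `Optional[int]` | `None` | Random seed |
| `bootstrap_ci` | `bool` | `False` | Enable block bootstrap confidence intervals |
| `n_bootstrap` | `int` | `100` | Number of bootstrap replications |
| `bootstrap_alpha` | `float` | `0.05` | Significance level for CI (0.05 = 95% CI) |
| `bootstrap_block_length` | `Union[int, Literal["auto"]]` | `"auto"` | Block length for MBB; “auto” uses n^(1/3) |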
Returns: GateResult with status HALT if model beats theoretical by too much
Details dict when bootstrap_ci=True:
ci_lower: Lower bound of CI for model MAE
ci_upper: Upper bound of CI for model MAE
ci_alpha: Significance level used
bootstrap_std: Bootstrap standard error
n_bootstrap: Number of bootstrap samples
bootstrap_block_length: Block length used
Theoretical bound: MAE_optimal = σ × √(2/π) ≈ 0.798 × σ
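The bound can be checked numerically with a small simulation. This is a standalone sketch using only the standard library, not the gate's own code: `simulate_ar1` and `mae_of_optimal_forecast` are hypothetical helpers, and the forecast uses the true phi, which is the best any model could do.

```python
import math
import random
from typing import List


def simulate_ar1(phi: float, sigma: float, n: int, seed: int) -> List[float]:
    """Generate y_t = phi * y_{t-1} + sigma * eps_t with eps_t ~ N(0, 1)."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n):
        y.append(phi * y[-1] + rng.gauss(0.0, sigma))
    return y


def mae_of_optimal_forecast(y: List[float], phi: float) -> float:
    """MAE of the 1-step forecast phi * y_{t-1}, which leaves only the innovation."""
    errors = [abs(y[t] - phi * y[t - 1]) for t in range(1, len(y))]
    return sum(errors) / len(errors)


sigma = 1.0
series = simulate_ar1(phi=0.95, sigma=sigma, n=20000, seed=0)
empirical = mae_of_optimal_forecast(series, phi=0.95)
theoretical = sigma * math.sqrt(2 / math.pi)
print(f"empirical={empirical:.3f} theoretical={theoretical:.3f}")
```

A model whose MAE on such a series is far below the theoretical value is using information it should not have, which is what this gate is designed to catch.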
gate_suspicious_improvement¶
Flag too-good-to-be-true improvements over baseline.
def gate_suspicious_improvement(
    model_metric: float,
    baseline_metric: float,
    threshold: float = 0.20,
    warn_threshold: float = 0.10,
    metric_name: str = "MAE",
) -> GateResult
Parameters:
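| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_metric` | `float` | required | Model’s error metric (lower = better) |
| `baseline_metric` | `float` | required | Baseline error metric |
| `threshold` | `float` | `0.20` | Improvement that triggers HALT (20%) |
| `warn_threshold` | `float` | `0.10` | Improvement that triggers WARN (10%) |
| `metric_name` | `str` | `"MAE"` | Metric name for messages |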
Returns: GateResult with appropriate status based on improvement
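The decision logic can be sketched as below. Several details are assumptions rather than documented behavior: improvement is taken as the relative reduction (baseline - model) / baseline, the comparisons are strict, and a non-positive baseline raises an error. `classify_improvement` is a hypothetical helper that returns a plain string instead of a GateResult.

```python
def classify_improvement(
    model_metric: float,
    baseline_metric: float,
    threshold: float = 0.20,
    warn_threshold: float = 0.10,
) -> str:
    """Classify the relative error reduction over the baseline."""
    if baseline_metric <= 0:
        # Assumption: the ratio is undefined here; the real gate may handle this differently.
        raise ValueError("baseline_metric must be positive")
    improvement = (baseline_metric - model_metric) / baseline_metric
    if improvement > threshold:
        return "HALT"
    if improvement > warn_threshold:
        return "WARN"
    return "PASS"


# Illustrative values: a 30%, a 15%, and a 5% reduction in MAE.
for model_mae in (0.70, 0.85, 0.95):
    print(model_mae, classify_improvement(model_mae, baseline_metric=1.0))
```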
gate_temporal_boundary¶
Verify proper gap between training and test for h-step forecasts.
def gate_temporal_boundary(
    train_end_idx: int,
    test_start_idx: int,
    horizon: int,
    gap: int = 0,
) -> GateResult
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `train_end_idx` | `int` | required | Last training index (inclusive) |
| `test_start_idx` | `int` | required | First test index |
| `horizon` | `int` | required | Forecast horizon (h) |
| `gap` | `int` | `0` | Additional gap beyond horizon |
Returns: GateResult with status HALT if boundary violated
Requirement: test_start_idx >= train_end_idx + horizon + gap
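The requirement translates directly into a one-line check. The helpers below are hypothetical and only restate the inequality; they do not build a GateResult.

```python
def boundary_ok(train_end_idx: int, test_start_idx: int, horizon: int, gap: int = 0) -> bool:
    """True when test_start_idx >= train_end_idx + horizon + gap."""
    return test_start_idx >= train_end_idx + horizon + gap


def earliest_test_start(train_end_idx: int, horizon: int, gap: int = 0) -> int:
    """Smallest test start index that satisfies the requirement."""
    return train_end_idx + horizon + gap


# Training ends at index 99 (inclusive) and the forecast horizon is 5 steps.
print(boundary_ok(99, 100, horizon=5))      # adjacent split, too close for h=5
print(earliest_test_start(99, horizon=5))   # first admissible test index
```

For a 1-step forecast, an adjacent split (train ends at 99, test starts at 100) already satisfies the requirement; longer horizons need a correspondingly larger separation.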
Runner¶
run_gates¶
Aggregate gate results into a validation report.
def run_gates(gates: List[GateResult]) -> ValidationReport
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `gates` | `List[GateResult]` | Pre-computed gate results |
Returns: ValidationReport with overall status and summary
Example:
from temporalcv.gates import (
    run_gates,
    gate_signal_verification,
    gate_suspicious_improvement,
)

# model, X, y, model_mae, and persistence_mae are assumed to be defined earlier
results = [
    gate_signal_verification(model, X, y, random_state=42),
    gate_suspicious_improvement(model_mae, persistence_mae),
]
report = run_gates(results)

# On HALT, print the summary and each failing gate's recommendation
if report.status == "HALT":
    print(report.summary())
    for failure in report.failures:
        print(f" - {failure.name}: {failure.recommendation}")
Bootstrap Confidence Intervals¶
Gates that support bootstrap_ci=True quantify the uncertainty of their metrics using the Moving Block Bootstrap (Künsch 1989), which resamples contiguous blocks so that temporal dependence is preserved when computing confidence intervals.
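The idea behind the Moving Block Bootstrap can be sketched in a few lines. This is a simplified percentile-interval version written for illustration, with `mbb_mean_ci` as a hypothetical helper; the library's resampling scheme and interval construction may differ.

```python
import math
import random
from typing import List, Tuple


def mbb_mean_ci(
    values: List[float],
    block_length: int,
    n_bootstrap: int = 200,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the mean, resampling overlapping blocks of consecutive values."""
    n = len(values)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(values)")
    rng = random.Random(seed)
    n_blocks = math.ceil(n / block_length)
    last_start = n - block_length
    means = []
    for _ in range(n_bootstrap):
        sample: List[float] = []
        for _ in range(n_blocks):
            start = rng.randint(0, last_start)  # inclusive on both ends
            sample.extend(values[start:start + block_length])
        sample = sample[:n]  # trim to the original length
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_bootstrap)]
    upper = means[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    return lower, upper


# Absolute errors with some serial structure, for illustration only.
abs_errors = [abs(math.sin(i / 3.0)) for i in range(120)]
print(mbb_mean_ci(abs_errors, block_length=5))
```

Applied to a series of absolute errors, the interval is a CI for the MAE; sampling whole blocks rather than single points keeps nearby errors together.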
Usage Example¶
from temporalcv.gates import gate_signal_verification

result = gate_signal_verification(
    model=my_model,
    X=X_train,
    y=y_train,
    bootstrap_ci=True,
    n_bootstrap=200,
    bootstrap_alpha=0.05,  # 95% CI
    random_state=42,
)

# Access CI from details
print(f"MAE: {result.details['mae_real']:.3f}")
print(f"95% CI: [{result.details['ci_lower']:.3f}, {result.details['ci_upper']:.3f}]")
print(f"SE: {result.details['bootstrap_std']:.3f}")
When to Use¶
Reporting results: Provides uncertainty bounds for publication
Comparing models: Check whether CIs overlap before declaring a winner
Small samples: When asymptotic inference is unreliable
Block Length Selection¶
The "auto" setting uses the asymptotically optimal n^(1/3) rule. Override when:
Multi-step forecasting: block_length = max(horizon, n^(1/3))
Known autocorrelation structure: Match to decorrelation time
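A small helper makes the multi-step rule concrete. `choose_block_length` is hypothetical, and rounding n^(1/3) up to the next integer is an assumption; the library may round differently.

```python
import math


def choose_block_length(n_samples: int, horizon: int = 1) -> int:
    """max(horizon, n^(1/3)), with the cube root rounded up to an integer."""
    auto = math.ceil(n_samples ** (1 / 3))
    return max(horizon, auto)


print(choose_block_length(500))              # driven by the n^(1/3) rule
print(choose_block_length(500, horizon=12))  # horizon exceeds n^(1/3)
```

The result can be passed as bootstrap_block_length in place of "auto" when forecasting more than one step ahead.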
References¶
[T1] Permutation Testing:
Phipson, B. & Smyth, G.K. (2010). “Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn.” Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39. DOI: 10.2202/1544-6115.1585
[T1] Block Bootstrap for Time Series:
Künsch, H.R. (1989). “The Jackknife and the Bootstrap for General Stationary Observations.” The Annals of Statistics, 17(3), 1217-1241. DOI: 10.1214/aos/1176347265
[T1] Optimal Block Length Selection:
Politis, D.N. & Romano, J.P. (1994). “The Stationary Bootstrap.” Journal of the American Statistical Association, 89(428), 1303-1313.
[T1] Theoretical AR(1) Bounds:
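MAE of N(0, σ²) = σ√(2/π) ≈ 0.798σ: Standard result for the mean of the half-normal distribution (the expected absolute value of a zero-mean normal variable). For AR(1) with innovation variance σ², the 1-step forecast error of the optimal predictor is σ·ε_t, hence MAE = E[|σ·ε|] = σ·√(2/π) when ε ~ N(0,1).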