API Reference: Validation Gates¶

Three-stage validation framework with HALT/PASS/WARN/SKIP decisions for leakage detection.

When to Use¶

graph TD
    A[New Pipeline] --> B{Which leakage type?}

    B -->|Features encode target| C[gate_signal_verification]
    B -->|Train/test overlap| D[gate_temporal_boundary]
    B -->|Too-good results| E[gate_suspicious_improvement]
    B -->|Model beats theory| F[gate_synthetic_ar1]
    B -->|Unsure| G[Run all gates]

    C -->|HALT| H[Fix: Check .shift on rolling features]
    D -->|HALT| I[Fix: Set horizon parameter]
    E -->|HALT| J[Fix: Investigate data pipeline]

Common Mistakes¶

Using n_shuffles=10 for permutation testing
- Minimum achievable p-value is 1/11 ≈ 0.09, which exceeds the default alpha of 0.05, so the gate cannot HALT even on blatant leakage
- Use n_shuffles >= 100 for statistical power
Ignoring WARN status
- WARNs often precede HALTs in production
- Always investigate WARNs before deployment
Running gates after training
- Gates should run before investing time in model development
- Pattern: validate → develop → validate again

See Also: Guardrails Tutorial, Examples 16-20

Enums¶

`GateStatus`¶

Validation gate status enumeration.

Value	Meaning
`HALT`	Critical failure - stop and investigate
`WARN`	Caution - continue but verify
`PASS`	Validation passed
`SKIP`	Insufficient data to run gate

Data Classes¶

`GateResult`¶

Result from a validation gate.

@dataclass
class GateResult:
    name: str                          # Gate identifier
    status: GateStatus                 # HALT, WARN, PASS, or SKIP
    message: str                       # Human-readable description
    metric_value: Optional[float]      # Primary metric for this gate
    threshold: Optional[float]         # Threshold used for decision
    details: dict[str, Any]            # Additional metrics and diagnostics
    recommendation: str                # What to do if not PASS

String representation: [STATUS] name: message
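To make the string format concrete, here is a minimal stand-in for the two types. It is a sketch for illustration, not the library's own definitions: the enum string values, the field defaults, and the `__str__` body are assumptions chosen to reproduce the `[STATUS] name: message` shape described above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class GateStatus(Enum):
    # Stand-in enum; the string values are assumed for illustration.
    HALT = "HALT"
    WARN = "WARN"
    PASS = "PASS"
    SKIP = "SKIP"


@dataclass
class GateResult:
    # Same field names as documented; the defaults are assumptions.
    name: str
    status: GateStatus
    message: str
    metric_value: Optional[float] = None
    threshold: Optional[float] = None
    details: dict[str, Any] = field(default_factory=dict)
    recommendation: str = ""

    def __str__(self) -> str:
        # Assumed rendering: "[STATUS] name: message"
        return f"[{self.status.name}] {self.name}: {self.message}"


result = GateResult(
    name="temporal_boundary",
    status=GateStatus.HALT,
    message="gap too small",
)
rendered = str(result)
print(rendered)
```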

`ValidationReport`¶

Complete validation report across all gates.

@dataclass
class ValidationReport:
    gates: List[GateResult]

Properties:

Property	Type	Description
`status`	`str`	Overall status: “HALT” if any HALT, “WARN” if any WARN, else “PASS”
`failures`	`List[GateResult]`	Gates that HALTed
`warnings`	`List[GateResult]`	Gates that WARNed

Methods:

summary() -> str: Human-readable report summary
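The precedence rule behind `status` can be sketched in a few lines. `overall_status` is a hypothetical helper written for this illustration, not part of the library; it only mirrors the rule as documented in the table above.

```python
from typing import List


def overall_status(statuses: List[str]) -> str:
    """Hypothetical helper: HALT outranks WARN, which outranks everything else."""
    if "HALT" in statuses:
        return "HALT"
    if "WARN" in statuses:
        return "WARN"
    return "PASS"


print(overall_status(["PASS", "WARN", "HALT"]))
print(overall_status(["PASS", "WARN"]))
print(overall_status(["PASS", "SKIP"]))
```

Under the rule as written, a SKIP result does not change the overall outcome, so it is worth checking individual gate results when data is scarce.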

Gate Functions¶

`gate_signal_verification`¶

Definitive leakage detection. If the model predicts the real target better than it predicts shuffled versions of that target, the features contain temporal information that shouldn’t exist.

def gate_signal_verification(
    model: FitPredictModel,
    X: ArrayLike,
    y: ArrayLike,
    n_shuffles: Optional[int] = None,
    threshold: float = 0.05,
    n_cv_splits: int = 3,
    permutation: Literal["iid", "block"] = "block",
    block_size: Union[int, Literal["auto"]] = "auto",
    random_state: Optional[int] = None,
    *,
    method: Literal["effect_size", "permutation"] = "permutation",
    alpha: float = 0.05,
    strict: bool = False,
    # Bootstrap CI parameters
    bootstrap_ci: bool = False,
    n_bootstrap: int = 100,
    bootstrap_alpha: float = 0.05,
    bootstrap_block_length: Union[int, Literal["auto"]] = "auto",
) -> GateResult

Parameters:

Parameter	Type	Default	Description
`model`	`FitPredictModel`	required	Model with `fit(X, y)` and `predict(X)` methods
`X`	`ArrayLike`	required	Feature matrix (n_samples, n_features)
`y`	`ArrayLike`	required	Target vector (n_samples,)
`n_shuffles`	`int`	`None`	Number of shuffles. Defaults to 100 for permutation mode, 5 for effect_size mode
`threshold`	`float`	`0.05`	Maximum allowed improvement ratio (effect_size mode only)
`n_cv_splits`	`int`	`3`	Number of CV splits for evaluation
`permutation`	`str`	`"block"`	Shuffle type: “iid” (random) or “block” (preserves autocorrelation)
`block_size`	`int|"auto"`	`"auto"`	Block size for block permutation
`random_state`	`int`	`None`	Random seed for reproducibility
`method`	`str`	`"permutation"`	Decision method: “permutation” (p-value) or “effect_size” (improvement ratio)
`alpha`	`float`	`0.05`	Significance level for permutation mode (HALT if p < alpha)
`strict`	`bool`	`False`	If True, use strict inequality for p-value comparison
`bootstrap_ci`	`bool`	`False`	Enable block bootstrap confidence intervals
`n_bootstrap`	`int`	`100`	Number of bootstrap replications
`bootstrap_alpha`	`float`	`0.05`	Significance level for CI (0.05 = 95% CI)
`bootstrap_block_length`	`int|"auto"`	`"auto"`	Block length for MBB; “auto” uses n^(1/3)
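The difference between the two `permutation` modes can be illustrated with a small standalone sketch. `block_shuffle` is a hypothetical helper, not the library's implementation: it cuts the series into contiguous blocks and shuffles the blocks, so the ordering (and hence short-range autocorrelation) inside each block survives. With `block_size=1` it reduces to an ordinary iid shuffle.

```python
import random
from typing import List, Sequence


def block_shuffle(values: Sequence[float], block_size: int, seed: int = 0) -> List[float]:
    """Shuffle contiguous blocks while keeping the order inside each block."""
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    # Cut the series into consecutive blocks (the last one may be shorter).
    blocks = [list(values[i:i + block_size]) for i in range(0, len(values), block_size)]
    rng = random.Random(seed)
    rng.shuffle(blocks)
    # Flatten the shuffled blocks back into one series.
    return [v for block in blocks for v in block]


y = list(range(12))
shuffled = block_shuffle(y, block_size=4, seed=1)
print(shuffled)
```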

Methods:

Method	`metric_value`	HALT condition
`"permutation"`	p-value	`pvalue < alpha`
`"effect_size"`	improvement ratio	`improvement > threshold`

Important: For permutation mode, use n_shuffles >= 100 for adequate statistical power. With small n_shuffles, the minimum achievable p-value is 1/(n_shuffles+1), which may prevent HALT even for blatant leakage.
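The arithmetic behind this warning is easy to check. The sketch below uses a hypothetical helper, `min_pvalue`, to compute the floor 1/(n_shuffles+1) and compare it with alpha = 0.05.

```python
def min_pvalue(n_shuffles: int) -> float:
    """Smallest p-value a permutation test with n_shuffles draws can report."""
    return 1.0 / (n_shuffles + 1)


alpha = 0.05
for n in (10, 19, 20, 100):
    p_min = min_pvalue(n)
    # The last column shows whether a HALT (p < alpha) is even reachable.
    print(n, round(p_min, 4), p_min < alpha)
```

At alpha = 0.05 a HALT first becomes reachable at 20 shuffles, but only if every single shuffle scores worse than the real target. The recommended 100 or more leaves room for a few shuffles to do well by chance.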

Returns: GateResult with status HALT if model beats shuffled baseline

Details dict:

mae_real: MAE on real target
mae_shuffled_avg: Mean MAE on shuffled targets
mae_shuffled_all: List of all shuffled MAEs
n_shuffles: Number of shuffles performed
improvement_ratio: 1 - mae_real/mae_shuffled_avg
pvalue: p-value from permutation test (permutation mode only)
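As a rough guide to how these fields relate, the sketch below recomputes `improvement_ratio` and a permutation p-value from a list of MAEs. The p-value estimator `(b + 1) / (m + 1)` follows Phipson & Smyth (2010) and is consistent with the 1/(n_shuffles+1) floor noted above, but whether the library computes it exactly this way is an assumption. Both function names are hypothetical.

```python
from typing import List


def improvement_ratio(mae_real: float, mae_shuffled: List[float]) -> float:
    """1 - mae_real / mae_shuffled_avg, as listed in the details dict."""
    mae_shuffled_avg = sum(mae_shuffled) / len(mae_shuffled)
    return 1.0 - mae_real / mae_shuffled_avg


def permutation_pvalue(mae_real: float, mae_shuffled: List[float]) -> float:
    """Assumed estimator: (b + 1) / (m + 1), where b counts shuffles that
    score at least as well (MAE <= mae_real) as the real target."""
    as_good = sum(1 for m in mae_shuffled if m <= mae_real)
    return (as_good + 1) / (len(mae_shuffled) + 1)


mae_shuffled_all = [1.0, 1.1, 0.9, 1.2, 0.8]  # made-up values for illustration
ratio = improvement_ratio(0.5, mae_shuffled_all)
pvalue = permutation_pvalue(0.5, mae_shuffled_all)
print(ratio, pvalue)
```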

Additional fields when bootstrap_ci=True:

ci_lower: Lower bound of confidence interval for MAE
ci_upper: Upper bound of confidence interval for MAE
ci_alpha: Significance level used (e.g., 0.05)
bootstrap_std: Bootstrap standard error
n_bootstrap: Number of bootstrap samples
bootstrap_block_length: Block length used

`gate_synthetic_ar1`¶

Test model on synthetic AR(1) where theoretical optimum is known.

def gate_synthetic_ar1(
    model: FitPredictModel,
    phi: float = 0.95,
    sigma: float = 1.0,
    n_samples: int = 500,
    n_lags: int = 5,
    tolerance: float = 1.5,
    random_state: Optional[int] = None,
    # Bootstrap CI parameters
    bootstrap_ci: bool = False,
    n_bootstrap: int = 100,
    bootstrap_alpha: float = 0.05,
    bootstrap_block_length: Union[int, Literal["auto"]] = "auto",
) -> GateResult

Parameters:

Parameter	Type	Default	Description
`model`	`FitPredictModel`	required	Model to test
`phi`	`float`	`0.95`	AR(1) coefficient
`sigma`	`float`	`1.0`	Innovation standard deviation
`n_samples`	`int`	`500`	Samples to generate
`n_lags`	`int`	`5`	Lag features to create
`tolerance`	`float`	`1.5`	Tolerance controlling how much better than the theoretical optimum the model may score before the gate HALTs
`random_state`	`int`	`None`	Random seed
`bootstrap_ci`	`bool`	`False`	Enable block bootstrap confidence intervals
`n_bootstrap`	`int`	`100`	Number of bootstrap replications
`bootstrap_alpha`	`float`	`0.05`	Significance level for CI (0.05 = 95% CI)
`bootstrap_block_length`	`int|"auto"`	`"auto"`	Block length for MBB; “auto” uses n^(1/3)

Returns: GateResult with status HALT if model beats theoretical by too much

Details dict when bootstrap_ci=True:

ci_lower: Lower bound of CI for model MAE
ci_upper: Upper bound of CI for model MAE
ci_alpha: Significance level used
bootstrap_std: Bootstrap standard error
n_bootstrap: Number of bootstrap samples
bootstrap_block_length: Block length used

Theoretical bound: MAE_optimal = σ × √(2/π) ≈ 0.798 × σ
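The bound can be checked numerically with a standalone simulation. This sketch does not call the library: `simulate_ar1` is a hypothetical helper, and the sample size is far larger than the gate's default of 500 so that sampling noise stays small. It applies the optimal 1-step forecast `phi * y[t-1]`, whose error is exactly the innovation.

```python
import math
import random
from typing import List


def simulate_ar1(phi: float, sigma: float, n_samples: int, seed: int) -> List[float]:
    """Generate a stationary AR(1) series y[t] = phi * y[t-1] + sigma * eps[t]."""
    rng = random.Random(seed)
    # Draw the first value from the stationary distribution.
    y = [rng.gauss(0.0, sigma / math.sqrt(1 - phi ** 2))]
    for _ in range(n_samples - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, sigma))
    return y


phi, sigma = 0.95, 1.0
y = simulate_ar1(phi, sigma, n_samples=100_000, seed=0)

# Absolute errors of the optimal 1-step forecast phi * y[t-1].
errors = [abs(y[t] - phi * y[t - 1]) for t in range(1, len(y))]
empirical_mae = sum(errors) / len(errors)
theoretical_mae = sigma * math.sqrt(2 / math.pi)

print(f"empirical MAE:   {empirical_mae:.4f}")
print(f"theoretical MAE: {theoretical_mae:.4f}")
```

A model that scores well below this value on such data is doing something the data-generating process does not allow, which is what the gate is designed to flag.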

`gate_suspicious_improvement`¶

Flag too-good-to-be-true improvements over baseline.

def gate_suspicious_improvement(
    model_metric: float,
    baseline_metric: float,
    threshold: float = 0.20,
    warn_threshold: float = 0.10,
    metric_name: str = "MAE",
) -> GateResult

Parameters:

Parameter	Type	Default	Description
`model_metric`	`float`	required	Model’s error metric (lower = better)
`baseline_metric`	`float`	required	Baseline error metric
`threshold`	`float`	`0.20`	Improvement that triggers HALT (20%)
`warn_threshold`	`float`	`0.10`	Improvement that triggers WARN (10%)
`metric_name`	`str`	`"MAE"`	Metric name for messages

Returns: GateResult with appropriate status based on improvement
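A minimal sketch of the decision logic, under two assumptions that the reference does not spell out: improvement is the relative error reduction `1 - model_metric / baseline_metric`, and the thresholds are compared with strict `>`. `classify_improvement` is a hypothetical name used only here.

```python
def classify_improvement(
    model_metric: float,
    baseline_metric: float,
    threshold: float = 0.20,
    warn_threshold: float = 0.10,
) -> str:
    """Map a relative improvement over baseline to HALT / WARN / PASS."""
    if baseline_metric <= 0:
        raise ValueError("baseline_metric must be positive for a ratio")
    # Assumed definition: fraction by which the model's error undercuts the baseline.
    improvement = 1.0 - model_metric / baseline_metric
    if improvement > threshold:
        return "HALT"
    if improvement > warn_threshold:
        return "WARN"
    return "PASS"


print(classify_improvement(0.70, 1.00))  # 30% better than baseline
print(classify_improvement(0.85, 1.00))  # 15% better than baseline
print(classify_improvement(0.95, 1.00))  # 5% better than baseline
```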

`gate_temporal_boundary`¶

Verify proper gap between training and test for h-step forecasts.

def gate_temporal_boundary(
    train_end_idx: int,
    test_start_idx: int,
    horizon: int,
    gap: int = 0,
) -> GateResult

Parameters:

Parameter	Type	Default	Description
`train_end_idx`	`int`	required	Last training index (inclusive)
`test_start_idx`	`int`	required	First test index
`horizon`	`int`	required	Forecast horizon (h)
`gap`	`int`	`0`	Additional gap beyond horizon

Returns: GateResult with status HALT if boundary violated

Requirement: test_start_idx >= train_end_idx + horizon + gap
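The requirement translates directly into a one-line check. `boundary_ok` below is a hypothetical helper that evaluates only the inequality; it does not build a `GateResult`.

```python
def boundary_ok(train_end_idx: int, test_start_idx: int, horizon: int, gap: int = 0) -> bool:
    """True when test_start_idx >= train_end_idx + horizon + gap."""
    return test_start_idx >= train_end_idx + horizon + gap


print(boundary_ok(train_end_idx=99, test_start_idx=100, horizon=1))         # 100 >= 100
print(boundary_ok(train_end_idx=99, test_start_idx=100, horizon=2))         # 100 >= 101
print(boundary_ok(train_end_idx=99, test_start_idx=105, horizon=2, gap=3))  # 105 >= 104
```

With `horizon=2`, a test set that starts immediately after training fails the check, because the target for the last training row overlaps the first test row.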

Runner¶

`run_gates`¶

Aggregate gate results into a validation report.

def run_gates(gates: List[GateResult]) -> ValidationReport

Parameters:

Parameter	Type	Description
`gates`	`List[GateResult]`	Pre-computed gate results

Returns: ValidationReport with overall status and summary

Example:

from temporalcv.gates import (
    run_gates,
    gate_signal_verification,
    gate_suspicious_improvement,
)

results = [
    gate_signal_verification(model, X, y, random_state=42),
    gate_suspicious_improvement(model_mae, persistence_mae),
]

report = run_gates(results)

if report.status == "HALT":
    print(report.summary())
    for failure in report.failures:
        print(f"  - {failure.name}: {failure.recommendation}")

Bootstrap Confidence Intervals¶

Gates that support bootstrap_ci=True provide uncertainty quantification for their metrics using the Moving Block Bootstrap (Künsch 1989). This preserves temporal dependence while computing confidence intervals.
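For readers unfamiliar with the technique, the following is a bare-bones Moving Block Bootstrap for the MAE of a series of forecast errors. It is a sketch under stated assumptions, not the library's code: `mbb_mae_ci` is a hypothetical function, the interval is a simple percentile interval picked by index, and the input errors are synthetic.

```python
import math
import random
from typing import List, Tuple


def mbb_mae_ci(
    errors: List[float],
    block_length: int,
    n_bootstrap: int = 200,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the MAE using overlapping (moving) blocks."""
    n = len(errors)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(errors)")
    rng = random.Random(seed)
    n_blocks = math.ceil(n / block_length)
    max_start = n - block_length  # last valid block start (inclusive)
    stats = []
    for _ in range(n_bootstrap):
        sample: List[float] = []
        for _ in range(n_blocks):
            start = rng.randint(0, max_start)
            sample.extend(errors[start:start + block_length])
        sample = sample[:n]  # trim to the original length
        stats.append(sum(abs(e) for e in sample) / n)
    stats.sort()
    # Crude percentile lookup by index; no interpolation.
    lower = stats[int((alpha / 2) * (n_bootstrap - 1))]
    upper = stats[int((1 - alpha / 2) * (n_bootstrap - 1))]
    return lower, upper


data_rng = random.Random(1)
errors = [data_rng.gauss(0.0, 1.0) for _ in range(300)]  # synthetic forecast errors
ci_lower, ci_upper = mbb_mae_ci(errors, block_length=7)
print(f"95% CI for MAE: [{ci_lower:.3f}, {ci_upper:.3f}]")
```

Resampling whole blocks rather than single observations keeps neighbouring errors together, which is what allows the interval to reflect serial dependence.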

Usage Example¶

from temporalcv.gates import gate_signal_verification

result = gate_signal_verification(
    model=my_model,
    X=X_train,
    y=y_train,
    bootstrap_ci=True,
    n_bootstrap=200,
    bootstrap_alpha=0.05,  # 95% CI
    random_state=42,
)

# Access CI from details
print(f"MAE: {result.details['mae_real']:.3f}")
print(f"95% CI: [{result.details['ci_lower']:.3f}, {result.details['ci_upper']:.3f}]")
print(f"SE: {result.details['bootstrap_std']:.3f}")

When to Use¶

Reporting results: Provides uncertainty bounds for publication
Comparing models: Check if CIs overlap before declaring winner
Small samples: When asymptotic inference is unreliable

Block Length Selection¶

The "auto" setting uses the asymptotically optimal n^(1/3) rule. Override when:

Multi-step forecasting: block_length = max(horizon, n^(1/3))
Known autocorrelation structure: Match to decorrelation time
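The two rules above can be written out as small helpers. Both function names are hypothetical, and rounding n^(1/3) to the nearest integer is an assumption, since the reference does not say how the library converts the value to a whole number.

```python
def auto_block_length(n: int) -> int:
    """n^(1/3) rule; rounding to the nearest integer is an assumption."""
    return max(1, round(n ** (1 / 3)))


def block_length_for_horizon(n: int, horizon: int) -> int:
    """Multi-step override: never use blocks shorter than the forecast horizon."""
    return max(horizon, auto_block_length(n))


print(auto_block_length(1000))
print(auto_block_length(500))
print(block_length_for_horizon(500, horizon=12))
print(block_length_for_horizon(500, horizon=3))
```

The result of `block_length_for_horizon` could then be passed as `bootstrap_block_length` in place of `"auto"`.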

References¶

[T1] Permutation Testing:

Phipson, B. & Smyth, G.K. (2010). “Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn.” Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39. DOI: 10.2202/1544-6115.1585

[T1] Block Bootstrap for Time Series:

Künsch, H.R. (1989). “The Jackknife and the Bootstrap for General Stationary Observations.” The Annals of Statistics, 17(3), 1217-1241. DOI: 10.1214/aos/1176347265

[T1] Optimal Block Length Selection:

Politis, D.N. & Romano, J.P. (1994). “The Stationary Bootstrap.” Journal of the American Statistical Association, 89(428), 1303-1313.

[T1] Theoretical AR(1) Bounds:

Mean absolute value of N(0, σ²) = σ√(2/π) ≈ 0.798σ: this is the mean of the half-normal distribution. For AR(1) with innovation variance σ², the optimal 1-step forecast error is σ·ε_t, hence MAE = E[|σ·ε|] = σ·√(2/π) when ε ~ N(0,1).
