Inference¶
Statistical inference tools for cross-validation results.
Overview¶
Provides bootstrap-based inference for test statistics computed across CV folds, particularly useful when standard asymptotic inference is unreliable due to few folds.
Knowledge Tier: [T2] - Wild bootstrap is established, but CV fold independence assumption requires domain-specific validation.
Data Classes¶
WildBootstrapResult¶
@dataclass
class WildBootstrapResult:
statistic: float # Original test statistic
p_value: float # Bootstrap p-value
ci_lower: float # Lower confidence bound
ci_upper: float # Upper confidence bound
n_bootstrap: int # Number of bootstrap samples
bootstrap_dist: np.ndarray # Bootstrap distribution
Functions¶
wild_cluster_bootstrap¶
Wild cluster bootstrap for dependent data:
from temporalcv.inference import wild_cluster_bootstrap
# Bootstrap inference on fold statistics
result = wild_cluster_bootstrap(
fold_statistics=fold_maes,
n_bootstrap=1000,
confidence_level=0.95,
)
print(f"Statistic: {result.statistic:.4f}")
print(f"95% CI: [{result.ci_lower:.4f}, {result.ci_upper:.4f}]")
print(f"p-value: {result.p_value:.4f}")
Usage Example¶
from temporalcv import WalkForwardCV
from temporalcv.inference import wild_cluster_bootstrap
import numpy as np
# Collect fold statistics
cv = WalkForwardCV(n_splits=10)
fold_maes = []
for train_idx, test_idx in cv.split(X):
model.fit(X[train_idx], y[train_idx])
preds = model.predict(X[test_idx])
fold_maes.append(np.mean(np.abs(y[test_idx] - preds)))
fold_maes = np.array(fold_maes)
# Bootstrap inference
result = wild_cluster_bootstrap(fold_maes, n_bootstrap=1000)
print(f"Mean MAE: {result.statistic:.4f}")
print(f"95% CI: [{result.ci_lower:.4f}, {result.ci_upper:.4f}]")
Block Bootstrap Confidence Intervals¶
For time series data, use block bootstrap to preserve temporal dependence:
from temporalcv.inference import block_bootstrap_ci
import numpy as np
# Time series data with autocorrelation
np.random.seed(42)
n = 200
errors = np.zeros(n)
for t in range(1, n):
errors[t] = 0.7 * errors[t-1] + np.random.randn()
# Compute block bootstrap CI for the mean
result = block_bootstrap_ci(
data=errors,
statistic_func=np.mean,
n_bootstrap=1000,
block_length='auto', # Uses n^(1/3) rule
confidence_level=0.95,
)
print(f"Point estimate: {result.statistic:.4f}")
print(f"95% CI: [{result.ci_lower:.4f}, {result.ci_upper:.4f}]")
print(f"Block length used: {result.block_length}")
Comparing Two Models with Bootstrap¶
from temporalcv.inference import wild_cluster_bootstrap
import numpy as np
# MAE from two models across 10 CV folds
model_a_maes = np.array([0.45, 0.52, 0.48, 0.51, 0.47, 0.49, 0.53, 0.46, 0.50, 0.48])
model_b_maes = np.array([0.42, 0.48, 0.44, 0.46, 0.43, 0.45, 0.49, 0.42, 0.46, 0.44])
# Test if Model B is significantly better
differences = model_a_maes - model_b_maes # Positive = B is better
result = wild_cluster_bootstrap(differences, n_bootstrap=1000)
print(f"Mean improvement: {result.statistic:.4f}")
print(f"95% CI: [{result.ci_lower:.4f}, {result.ci_upper:.4f}]")
print(f"p-value (one-sided): {result.p_value:.4f}")
if result.ci_lower > 0:
print("Model B is significantly better (CI excludes zero)")
When to Use¶
Few CV folds (< 20): Asymptotic inference unreliable
Dependent folds: Standard errors underestimate uncertainty
Confidence intervals: When point estimates alone are insufficient
See Also¶
Statistical Tests - DM test for model comparison
Wild Cluster Bootstrap - Cluster-robust inference
References¶
Cameron, Gelbach & Miller (2008). “Bootstrap-Based Improvements for Inference with Clustered Errors.” Review of Economics and Statistics.
MacKinnon & Webb (2017). “Wild Bootstrap Inference for Wildly Different Cluster Sizes.” JASA.