Common Pitfalls and Best Practices¶
About This Guide
This page documents the most common mistakes when applying machine learning to time series data. Each pitfall includes:
What goes wrong and why
Don’t/Do code examples
Which validation gate catches it
Pitfall #1: Using KFold on Time Series¶
The Problem¶
Standard k-fold cross-validation shuffles data randomly, mixing future observations into the training set. This creates temporal leakage—the model learns patterns it won’t have access to in production.
Don’t¶
from sklearn.model_selection import KFold, cross_val_score
# WRONG: Shuffles time series data
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=cv)
Do¶
from temporalcv import WalkForwardCV
# RIGHT: Respects temporal order
cv = WalkForwardCV(n_splits=5, gap=horizon)
scores = cross_val_score(model, X, y, cv=cv)
Gate Detection¶
from temporalcv.gates import gate_signal_verification
# Detects whether the model has any signal beyond a shuffled-target baseline.
# HALT means signal exists — investigate whether legitimate temporal pattern or leakage.
result = gate_signal_verification(model, X, y, n_shuffles=100)
Pitfall #2: Rolling Features Without Shift¶
The Problem¶
Rolling statistics (mean, std, max) include the current observation by default.
When predicting y[t], using rolling_mean[t] includes y[t] in the calculation—
this is look-ahead bias.
Don’t¶
# WRONG: rolling_mean[t] includes price[t]
df['rolling_mean'] = df['price'].rolling(5).mean()
df['rolling_std'] = df['price'].rolling(5).std()
Do¶
# RIGHT: Shift to exclude current observation
df['rolling_mean'] = df['price'].shift(1).rolling(5).mean()
df['rolling_std'] = df['price'].shift(1).rolling(5).std()
Gate Detection¶
from temporalcv.gates import gate_signal_verification
# Detects rolling-feature leakage: features built on the full series will let the
# model beat a shuffled-target baseline at p < 0.05, triggering HALT.
result = gate_signal_verification(model, X, y, n_shuffles=100)
Pitfall #3: Normalizing on Full Dataset¶
The Problem¶
Fitting a scaler on the entire dataset (train + test) leaks test set statistics into training. The model learns the scale of future data it shouldn’t know about.
Don’t¶
from sklearn.preprocessing import StandardScaler
# WRONG: Scaler sees test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Fit on ALL data
X_train, X_test = X_scaled[:split], X_scaled[split:]
Do¶
from sklearn.preprocessing import StandardScaler
# RIGHT: Fit only on training data
scaler = StandardScaler()
X_train = scaler.fit_transform(X[:split]) # Fit on train only
X_test = scaler.transform(X[split:]) # Transform test with train params
Important: Do This Per Fold¶
from temporalcv import WalkForwardCV
from sklearn.pipeline import Pipeline
# Use a pipeline to ensure scaler fits per fold
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor())
])
cv = WalkForwardCV(n_splits=5)
scores = cross_val_score(pipeline, X, y, cv=cv) # Scaler refits each fold
Pitfall #4: Computing Thresholds on All Data¶
The Problem¶
Regime indicators (high volatility, bull/bear market, recession) are often defined using quantiles of the full dataset. This means future extreme values inform historical regime classifications.
Don’t¶
# WRONG: Threshold computed from ALL data (including future)
vol_threshold = df['volatility'].quantile(0.90)
df['high_vol'] = df['volatility'] > vol_threshold
Do¶
# RIGHT: Expanding threshold uses only past data
df['high_vol'] = df['volatility'] > (
df['volatility'].expanding().quantile(0.90).shift(1)
)
Gate Detection¶
from temporalcv.gates import gate_regime_leakage
result = gate_regime_leakage(
regime_indicator=df['high_vol'],
target=df['returns'],
train_mask=train_mask
)
# Returns HALT if regime indicator shows impossible predictive power
Pitfall #5: Insufficient Gap for Multi-Step Forecasting¶
The Problem¶
When forecasting h steps ahead, features at time t should only use data through
time t-h. Without a gap, features implicitly contain information about the
prediction target.
Don’t¶
# WRONG: No gap for 5-day forecast
cv = WalkForwardCV(n_splits=5) # gap defaults to 0
# Features computed at t might use data through t
# But predicting y[t+5] shouldn't see data after t
Do¶
# RIGHT: Gap matches forecast horizon
cv = WalkForwardCV(n_splits=5, gap=5) # 5-day gap
# Or explicitly with SlidingWindowCV
cv = SlidingWindowCV(
train_size=252,
test_size=5,
gap=5 # Matches h-step ahead forecast
)
Gate Detection¶
from temporalcv.gates import gate_temporal_boundary
result = gate_temporal_boundary(
train_times=train_dates,
test_times=test_dates,
gap_required=5
)
# Returns HALT if gap is insufficient
Pitfall #6: Trusting MAE on High-Persistence Series¶
The Problem¶
For highly autocorrelated series (e.g., stock prices, GDP), a naive “predict last value” model achieves low MAE. Your sophisticated model may look good but add no value.
Don’t¶
# WRONG: Raw MAE without baseline comparison
mae = mean_absolute_error(y_test, predictions)
print(f"MAE: {mae:.2f}") # Looks good, but is it better than naive?
Do¶
from temporalcv.persistence import compute_persistence_metrics, compare_to_naive
# RIGHT: Compare to naive baseline
results = compare_to_naive(
y_test,
predictions,
naive_predictions=y_test_shifted, # Previous value
metric='mase' # Mean Absolute Scaled Error
)
print(f"MASE: {results['mase']:.3f}") # <1 means better than naive
print(f"Skill Score: {results['skill_score']:.1%}") # % improvement over naive
Statistical Test¶
from temporalcv.statistical_tests import dm_test
# Is the improvement statistically significant?
result = dm_test(
errors_naive,
errors_model,
h=forecast_horizon
)
print(f"p-value: {result.pvalue:.4f}")
Pitfall #7: Using center=True for Rolling Windows¶
The Problem¶
Pandas rolling(..., center=True) centers the window, using both past AND future
values. This is useful for smoothing visualizations but creates look-ahead bias
in features.
Don’t¶
# WRONG: center=True uses future values
df['smooth_price'] = df['price'].rolling(5, center=True).mean()
Do¶
# RIGHT: Default center=False uses only past values
df['smooth_price'] = df['price'].rolling(5).mean() # center=False is default
# Explicitly for clarity
df['smooth_price'] = df['price'].rolling(5, center=False).mean()
Pitfall #8: GroupBy Transform Including Test Data¶
The Problem¶
When computing group statistics (e.g., sector mean returns), using .transform()
on the full DataFrame includes test data in the calculation.
Don’t¶
# WRONG: Group mean computed on ALL data
df['sector_mean'] = df.groupby('sector')['returns'].transform('mean')
# Then split
train = df[:split]
test = df[split:] # test sector_mean includes test data!
Do¶
# RIGHT: Compute group statistics on training data only
train = df[:split]
test = df[split:]
sector_means = train.groupby('sector')['returns'].mean()
train['sector_mean'] = train['sector'].map(sector_means)
test['sector_mean'] = test['sector'].map(sector_means)
For Expanding Windows¶
# RIGHT: Expanding group mean (uses only past data)
df['sector_mean'] = (
df.groupby('sector')['returns']
.transform(lambda x: x.expanding().mean().shift(1))
)
Summary: Validation Gate Coverage¶
Pitfall |
Gate |
Status When Violated |
|---|---|---|
#1 KFold on time series |
|
HALT |
#2 Rolling without shift |
|
HALT |
#3 Normalizing on full data |
|
WARN |
#4 Thresholds on all data |
|
HALT |
#5 Insufficient gap |
|
HALT |
#6 Trusting raw MAE |
(use |
— |
#7 center=True rolling |
|
HALT |
#8 GroupBy with test data |
|
WARN |
Running All Gates¶
from temporalcv.gates import run_gates
result = run_gates(
X_train, y_train,
X_test, y_test,
dates_train=train_dates,
dates_test=test_dates
)
if result.status == "HALT":
raise ValueError(f"Critical leakage: {result.reason}")
elif result.status == "WARN":
print(f"Warning: {result.reason}")
else:
print("All gates passed!")
Next Steps¶
Why Time Series Is Different: Conceptual foundation
Algorithm Decision Tree: Choose the right CV and metrics
API: Validation Gates: Full gate reference documentation