Common Pitfalls and Best Practices

About This Guide

This page documents the most common mistakes when applying machine learning to time series data. Each pitfall includes:

What goes wrong and why
Don’t/Do code examples
Which validation gate catches it, where one applies

Pitfall #1: Using KFold on Time Series

The Problem

K-fold cross-validation ignores time order: each fold trains on observations that come after the test fold, and with shuffle=True future observations are scattered throughout the training set. This creates temporal leakage: the model learns from patterns it won’t have access to in production.

Don’t

from sklearn.model_selection import KFold, cross_val_score

# WRONG: Shuffles time series data
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=cv)

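Do

from sklearn.model_selection import cross_val_score
from temporalcv import WalkForwardCV

# RIGHT: Respects temporal order
# horizon = number of steps ahead being forecast
cv = WalkForwardCV(n_splits=5, gap=horizon)
scores = cross_val_score(model, X, y, cv=cv)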
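To make the split geometry concrete, here is a minimal sketch that generates walk-forward indices with a gap in plain Python. The helper walk_forward_splits and its parameters are hypothetical and exist only for illustration; this is not temporalcv's implementation, and WalkForwardCV may lay out its folds differently.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_splits: int, test_size: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Hypothetical helper for illustration; not temporalcv's implementation.
    """
    for k in range(n_splits):
        # Test blocks are laid out back to back at the end of the series.
        test_start = n_samples - (n_splits - k) * test_size
        # Leave `gap` unused rows between the training window and the test block.
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("not enough samples for the requested splits and gap")
        yield list(range(train_end)), list(range(test_start, test_start + test_size))


splits = list(walk_forward_splits(n_samples=20, n_splits=3, test_size=4, gap=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Every training index precedes every test index, and the gap rows between them are never used, which is the property shuffled k-fold cannot provide.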

Gate Detection

from temporalcv.gates import gate_signal_verification

# Detects whether the model has any signal beyond a shuffled-target baseline.
# HALT means signal exists: investigate whether it is a legitimate temporal pattern or leakage.
result = gate_signal_verification(model, X, y, n_shuffles=100)

Pitfall #2: Rolling Features Without Shift

The Problem

Rolling statistics (mean, std, max) include the current observation by default. When the target y[t] is derived from the same series, rolling_mean[t] includes y[t] in its window, so the feature already contains the value being predicted. This is look-ahead bias.

Don’t

# WRONG: rolling_mean[t] includes price[t]
df['rolling_mean'] = df['price'].rolling(5).mean()
df['rolling_std'] = df['price'].rolling(5).std()

Do

# RIGHT: Shift to exclude current observation
df['rolling_mean'] = df['price'].shift(1).rolling(5).mean()
df['rolling_std'] = df['price'].shift(1).rolling(5).std()
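A small worked example may help show what the shift changes. The toy series below is invented for illustration; the last value is a jump that the unshifted feature absorbs immediately.

```python
import pandas as pd

# Toy series with a jump in the last observation
price = pd.Series([10.0, 11.0, 12.0, 13.0, 100.0])

unshifted = price.rolling(3).mean()         # window at t covers t-2..t
shifted = price.shift(1).rolling(3).mean()  # window at t covers t-3..t-1

comparison = pd.DataFrame(
    {"price": price, "unshifted": unshifted, "shifted": shifted}
)
print(comparison)
```

At the last row the unshifted mean already reflects the jump to 100, while the shifted mean is built only from 11, 12, and 13. The shift also costs one extra warm-up row of NaN at the start.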

Gate Detection

from temporalcv.gates import gate_signal_verification

# Detects rolling-feature leakage: features built on the full series will let the
# model beat a shuffled-target baseline at p < 0.05, triggering HALT.
result = gate_signal_verification(model, X, y, n_shuffles=100)

Pitfall #3: Normalizing on Full Dataset

The Problem

Fitting a scaler on the entire dataset (train + test) leaks test set statistics into training. The model learns the scale of future data it shouldn’t know about.

Don’t

from sklearn.preprocessing import StandardScaler

# WRONG: Scaler sees test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on ALL data
X_train, X_test = X_scaled[:split], X_scaled[split:]

Do

from sklearn.preprocessing import StandardScaler

# RIGHT: Fit only on training data
scaler = StandardScaler()
X_train = scaler.fit_transform(X[:split])  # Fit on train only
X_test = scaler.transform(X[split:])       # Transform test with train params
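The effect is easiest to see on a trending series. The sketch below uses plain Python rather than StandardScaler, with an invented toy series; it uses the population standard deviation, which is also what StandardScaler computes.

```python
from statistics import fmean, pstdev

series = [float(v) for v in range(1, 11)]  # upward-trending toy series 1..10
split = 8
train, test = series[:split], series[split:]

# Leaky: statistics computed from the full series (train + test).
full_mean, full_std = fmean(series), pstdev(series)
# Leak-free: statistics computed from the training slice only.
train_mean, train_std = fmean(train), pstdev(train)

leaky_train_scaled = [(v - full_mean) / full_std for v in train]
clean_train_scaled = [(v - train_mean) / train_std for v in train]

print(f"full-data mean {full_mean:.2f} vs train-only mean {train_mean:.2f}")
print(f"mean of scaled train, leaky: {fmean(leaky_train_scaled):+.3f}")
print(f"mean of scaled train, clean: {fmean(clean_train_scaled):+.3f}")
```

With full-data statistics the scaled training set is centered below zero, because the scaler already knows that larger values arrive later. That shift is information from the test period.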

Important: Do This Per Fold

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from temporalcv import WalkForwardCV

# Use a pipeline to ensure scaler fits per fold
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor())
])

cv = WalkForwardCV(n_splits=5)
scores = cross_val_score(pipeline, X, y, cv=cv)  # Scaler refits each fold

Pitfall #4: Computing Thresholds on All Data

The Problem

Regime indicators (high volatility, bull/bear market, recession) are often defined using quantiles of the full dataset. This means future extreme values inform historical regime classifications.

Don’t

# WRONG: Threshold computed from ALL data (including future)
vol_threshold = df['volatility'].quantile(0.90)
df['high_vol'] = df['volatility'] > vol_threshold

Do

# RIGHT: Expanding threshold uses only past data
df['high_vol'] = df['volatility'] > (
    df['volatility'].expanding().quantile(0.90).shift(1)
)
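As an illustration of how a future extreme rewrites history, the sketch below compares the two thresholds on an invented volatility series with a moderate spike followed by a much larger one.

```python
import pandas as pd

# Toy volatility series: a moderate spike (5) followed later by an extreme one (50)
vol = pd.Series([1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 50.0])

# Leaky: one threshold computed from the whole series, including the future spike
full_threshold = vol.quantile(0.90)
full_flags = vol > full_threshold

# Leak-free: threshold at t uses only observations before t
expanding_threshold = vol.expanding().quantile(0.90).shift(1)
expanding_flags = vol > expanding_threshold

print(pd.DataFrame({"vol": vol, "full": full_flags, "expanding": expanding_flags}))
```

The spike of 5 is a high-volatility event given what was known at the time, and the expanding threshold flags it. The full-data threshold does not, because the later value of 50 has raised the bar retroactively. The first row has no history, so its expanding threshold is NaN and the comparison evaluates to False.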

Gate Detection

from temporalcv.gates import gate_regime_leakage

result = gate_regime_leakage(
    regime_indicator=df['high_vol'],
    target=df['returns'],
    train_mask=train_mask
)
# Returns HALT if regime indicator shows impossible predictive power

Pitfall #5: Insufficient Gap for Multi-Step Forecasting

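The Problem

When forecasting h steps ahead, a prediction for y[t+h] may only use data available through time t; equivalently, the features paired with a target at time t may only use data through t-h. Without a gap of at least h between the training and test windows, the targets of the last training rows fall inside the test period, so training implicitly contains information about the values being predicted.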
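The overlap can be counted directly. The following sketch assumes each row t pairs features known at time t with the target y[t + horizon]; the helper overlapping_targets is hypothetical and written only for this illustration.

```python
def overlapping_targets(train_end: int, horizon: int, test_start: int) -> list:
    """Target times of training rows 0..train_end-1 that land at or after test_start.

    Row t pairs features known at time t with the target y[t + horizon].
    """
    return [t + horizon for t in range(train_end) if t + horizon >= test_start]


horizon = 5
test_start = 100  # first timestamp of the test block

# No gap: training rows run right up to the test block.
no_gap = overlapping_targets(
    train_end=test_start, horizon=horizon, test_start=test_start
)

# Gap equal to the horizon: the last `horizon` rows before the test block are dropped.
with_gap = overlapping_targets(
    train_end=test_start - horizon, horizon=horizon, test_start=test_start
)

print("overlapping targets without gap:", no_gap)
print("overlapping targets with gap:", with_gap)
```

Without a gap, the last five training rows have targets at times 100 through 104, which belong to the test block. Dropping those rows removes the overlap entirely.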

Don’t

from temporalcv import WalkForwardCV

# WRONG: No gap for 5-day forecast
cv = WalkForwardCV(n_splits=5)  # gap defaults to 0

# Features computed at t may use data through t, and the target is y[t+5].
# With no gap, the last training targets overlap the test period.

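Do

# SlidingWindowCV's import path is assumed to match WalkForwardCV's
from temporalcv import SlidingWindowCV, WalkForwardCV

# RIGHT: Gap matches forecast horizon
cv = WalkForwardCV(n_splits=5, gap=5)  # 5-day gap

# Or explicitly with SlidingWindowCV
cv = SlidingWindowCV(
    train_size=252,
    test_size=5,
    gap=5  # Matches h-step ahead forecast
)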

Gate Detection

from temporalcv.gates import gate_temporal_boundary

result = gate_temporal_boundary(
    train_times=train_dates,
    test_times=test_dates,
    gap_required=5
)
# Returns HALT if gap is insufficient

Pitfall #6: Trusting MAE on High-Persistence Series

The Problem

For highly autocorrelated series (e.g., stock prices, GDP), a naive “predict last value” model achieves low MAE. Your sophisticated model may look good but add no value.

Don’t

from sklearn.metrics import mean_absolute_error

# WRONG: Raw MAE without baseline comparison
mae = mean_absolute_error(y_test, predictions)
print(f"MAE: {mae:.2f}")  # Looks good, but is it better than naive?

Do

from temporalcv.persistence import compute_persistence_metrics, compare_to_naive

# RIGHT: Compare to naive baseline
results = compare_to_naive(
    y_test,
    predictions,
    naive_predictions=y_test_shifted,  # Previous value
    metric='mase'  # Mean Absolute Scaled Error
)

print(f"MASE: {results['mase']:.3f}")  # <1 means better than naive
print(f"Skill Score: {results['skill_score']:.1%}")  # % improvement over naive
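For intuition about what a naive-relative score measures, here is a minimal plain-Python sketch with invented numbers. It divides the model's MAE by the MAE of a "predict the previous value" forecast on the same test points. This is an assumption about the general idea, not a description of how compare_to_naive computes its results; the textbook MASE, for instance, scales by the in-sample naive error instead.

```python
def mean_abs_error(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


y = [100.0, 101.0, 103.0, 102.0, 104.0, 107.0]  # toy persistent series
actual = y[1:]                                   # values being forecast
naive_pred = y[:-1]                              # naive forecast: previous value
model_pred = [100.5, 101.5, 102.5, 102.5, 104.5]  # hypothetical model output

mae_naive = mean_abs_error(actual, naive_pred)
mae_model = mean_abs_error(actual, model_pred)

relative_mae = mae_model / mae_naive  # below 1 means better than naive
skill_score = 1.0 - relative_mae      # fractional improvement over naive

print(f"naive MAE {mae_naive:.2f}, model MAE {mae_model:.2f}")
print(f"relative MAE {relative_mae:.3f}, skill {skill_score:.1%}")
```

A model MAE that looks small in absolute terms is only meaningful once it is set against the naive MAE on the same points.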

Statistical Test

from temporalcv.statistical_tests import dm_test

# Is the improvement statistically significant?
result = dm_test(
    errors_naive,
    errors_model,
    h=forecast_horizon
)
print(f"p-value: {result.pvalue:.4f}")

Pitfall #7: Using center=True for Rolling Windows

The Problem

Pandas rolling(..., center=True) centers the window, using both past AND future values. This is useful for smoothing visualizations but creates look-ahead bias in features.

Don’t

# WRONG: center=True uses future values
df['smooth_price'] = df['price'].rolling(5, center=True).mean()

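Do

# RIGHT: Default center=False gives a trailing window (current and past values only)
df['smooth_price'] = df['price'].rolling(5).mean()  # center=False is default

# Explicitly for clarity
df['smooth_price'] = df['price'].rolling(5, center=False).mean()

# If the current value is (or determines) the target, also apply shift(1) as in Pitfall #2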
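The difference between the two window alignments shows up clearly on a toy series with a jump at the end. The data below is invented for illustration.

```python
import pandas as pd

# Toy series with a jump in the last observation
price = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0])

centered = price.rolling(3, center=True).mean()  # window at t covers t-1..t+1
trailing = price.rolling(3).mean()               # window at t covers t-2..t

print(pd.DataFrame({"price": price, "centered": centered, "trailing": trailing}))
```

At index 5 the centered mean already includes the jump that occurs at index 6, one step before it happens. The trailing mean at the same index is built from indices 3 through 5 only.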

Pitfall #8: GroupBy Transform Including Test Data

The Problem

When computing group statistics (e.g., sector mean returns), using .transform() on the full DataFrame includes test data in the calculation.

Don’t

# WRONG: Group mean computed on ALL data
df['sector_mean'] = df.groupby('sector')['returns'].transform('mean')

# Then split
train = df[:split]
test = df[split:]  # test sector_mean includes test data!

Do

# RIGHT: Compute group statistics on training data only
train = df[:split].copy()  # copy so column assignment does not hit a view of df
test = df[split:].copy()

sector_means = train.groupby('sector')['returns'].mean()
train['sector_mean'] = train['sector'].map(sector_means)
test['sector_mean'] = test['sector'].map(sector_means)  # NaN for sectors absent from train

For Expanding Windows

# RIGHT: Expanding group mean (uses only past data)
# Assumes df is sorted chronologically within each sector
df['sector_mean'] = (
    df.groupby('sector')['returns']
    .transform(lambda x: x.expanding().mean().shift(1))
)
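A short worked example can confirm what the expanding version produces. The frame below is invented and assumed to be in chronological order.

```python
import pandas as pd

# Toy frame; rows are assumed to be in chronological order
df = pd.DataFrame({
    "sector": ["A", "A", "B", "A", "B"],
    "returns": [1.0, 3.0, 10.0, 5.0, 20.0],
})

# Leaky: every row sees the mean over its sector's entire history, past and future
full_mean = df.groupby("sector")["returns"].transform("mean")

# Leak-free: each row sees the mean of earlier rows in its own sector only
past_mean = df.groupby("sector")["returns"].transform(
    lambda x: x.expanding().mean().shift(1)
)

print(df.assign(full_mean=full_mean, past_mean=past_mean))
```

The first row of each sector has no history, so its expanding mean is NaN. Downstream code needs a policy for those rows, such as dropping them or filling with a prior.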

Summary: Validation Gate Coverage

Pitfall	Gate	Status When Violated
#1 KFold on time series	`gate_signal_verification`	HALT
#2 Rolling without shift	`gate_signal_verification`	HALT
#3 Normalizing on full data	`gate_feature_correlation`	WARN
#4 Thresholds on all data	`gate_regime_leakage`	HALT
#5 Insufficient gap	`gate_temporal_boundary`	HALT
#6 Trusting raw MAE	(use `compare_to_naive`)	—
#7 center=True rolling	`gate_signal_verification`	HALT
#8 GroupBy with test data	`gate_feature_correlation`	WARN

Running All Gates

from temporalcv.gates import run_gates

result = run_gates(
    X_train, y_train,
    X_test, y_test,
    dates_train=train_dates,
    dates_test=test_dates
)

if result.status == "HALT":
    raise ValueError(f"Critical leakage: {result.reason}")
elif result.status == "WARN":
    print(f"Warning: {result.reason}")
else:
    print("All gates passed!")

Next Steps

Why Time Series Is Different: Conceptual foundation
Algorithm Decision Tree: Choose the right CV and metrics
API: Validation Gates: Full gate reference documentation