Feature Engineering Safety Guide
Audience: ML practitioners who know sklearn but are new to time-series feature engineering.
Purpose: Teach you to identify safe vs dangerous features without memorization — understand the principle, apply it anywhere.
The One Rule
Every feature at time t must be computable using ONLY data from times ≤ t-gap.
That’s it. If you internalize this rule, you’ll never accidentally leak information.
Why gap? If you’re predicting h steps ahead, your features at time t cannot use any information from [t, t+h). The gap ensures temporal separation.
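As a minimal sketch of the rule, the helper below builds lag features whose most recent input is y[t-gap]. The name `make_lag_features` and its parameters are hypothetical and chosen for illustration only; rows are indexed by the time of the value being predicted.

```python
import pandas as pd


def make_lag_features(y: pd.Series, n_lags: int, gap: int) -> pd.DataFrame:
    """Build n_lags lag features; the newest value used at row t is y[t - gap]."""
    if gap < 1:
        raise ValueError("gap must be >= 1, otherwise y[t] itself leaks in")
    # shift(gap + k) moves values forward in time, so row t sees y[t - gap - k]
    return pd.DataFrame(
        {f"lag_{gap + k}": y.shift(gap + k) for k in range(n_lags)}
    )


y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
features = make_lag_features(y, n_lags=2, gap=3)
# Row t=5 holds y[2] and y[1]; nothing newer than t - gap is visible.
```

The first `gap` rows are NaN by construction; dropping them is usually preferable to filling them, since any fill value would have to come from somewhere.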
Quick Decision Tree
Question 1: Does this feature use future data (y[t+1], x[t+1], ...)?
  YES → 🚫 LEAKAGE - Never use
  NO → Continue
Question 2: Does this feature use the target variable y?
  YES → Question 2a: Is it computed on training data only?
    YES → ⚠️ DANGEROUS - Requires careful handling
    NO → 🚫 LEAKAGE - Target-derived features on full series
  NO → Continue
Question 3: Does this feature use centered windows (data from both sides of t)?
  YES → 🚫 LEAKAGE - Bidirectional operations
  NO → ✅ SAFE - Backward-looking only
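The three questions can be written down as a small function, which is sometimes handy for documenting feature reviews. This is only an illustrative sketch; `classify_feature` and its boolean arguments are hypothetical names, and the answers still have to come from reading the feature code.

```python
def classify_feature(
    uses_future: bool,
    uses_target: bool,
    fit_on_train_only: bool,
    centered_window: bool,
) -> str:
    """Walk the three questions in order and return the verdict."""
    # Question 1: any future value is leakage, no matter what else is true
    if uses_future:
        return "LEAKAGE"
    # Question 2 / 2a: target-derived features are acceptable only if
    # their statistics come from the training portion
    if uses_target:
        return "DANGEROUS" if fit_on_train_only else "LEAKAGE"
    # Question 3: centered windows read data on both sides of t
    if centered_window:
        return "LEAKAGE"
    return "SAFE"


verdict = classify_feature(
    uses_future=False,
    uses_target=True,
    fit_on_train_only=True,
    centered_window=False,
)
```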
Feature Categories
✅ SAFE: Backward-Looking Only
These features use only past data and are always safe when properly lagged.
| Feature Type | Example | Why Safe |
|---|---|---|
| Lag features | | Explicitly past |
| Expanding statistics | | Only uses [0, t) |
| Rolling (left-aligned) | | Only uses past window |
| Cumulative sums | | Only uses [0, t) |
| Calendar features | | Deterministic, no data needed |
| External regressors | | Past values only |
Code Example — Safe Rolling Mean:
import numpy as np

# ✅ SAFE: Rolling mean using only past values
def safe_rolling_mean(series, window=5):
    """Compute rolling mean using only past values (t-window to t-1)."""
    values = np.asarray(series, dtype=float)
    # Float output so the NaN placeholders also work for integer input
    result = np.full(len(values), np.nan)
    for t in range(window, len(values)):
        result[t] = values[t-window:t].mean()  # excludes t!
    return result

# Equivalent pandas (note: shift BEFORE rolling!)
df['safe_rolling'] = df['y'].shift(1).rolling(window=5).mean()
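The other safe feature types follow the same pattern: shift first, then aggregate. The sketch below uses a small made-up daily series (the data and column names are assumptions for illustration) to show a lag, an expanding mean, a cumulative sum, and a calendar feature.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Lag feature: value observed one step earlier
df["lag_1"] = df["y"].shift(1)

# Expanding mean over [0, t): shift(1) drops y[t] before aggregating
df["expanding_mean"] = df["y"].shift(1).expanding().mean()

# Cumulative sum over [0, t)
df["cumsum_past"] = df["y"].shift(1).cumsum()

# Calendar feature: derived from the timestamp alone, no shift needed
df["day_of_week"] = df.index.dayofweek
```

The first row of every shifted column is NaN because no past data exists yet for it.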
⚠️ DANGEROUS: Requires Careful Handling
These features are legitimate but easy to implement incorrectly.
| Feature Type | Danger | Safe Implementation |
|---|---|---|
| Rolling statistics | Default | Always |
| Percentile ranks | Full-series percentiles | Compute on training only, apply to test |
| Standardization | Fit on all data | Fit on training only |
| Target encoding | Include test targets | Use only training targets |
| Regime indicators | Computed on full series | Use changepoint detection with lag |
Code Example — Dangerous vs Safe Percentile:
# 🚫 WRONG: Percentile uses full series (future information!)
def leaky_percentile_rank(series, value):
    return (series < value).sum() / len(series)

# ✅ SAFE: Percentile uses only training data
def safe_percentile_rank(training_series, value):
    return (training_series < value).sum() / len(training_series)

# In practice:
train_percentiles = np.percentile(y_train, [25, 50, 75])
# Apply these thresholds to test data — never recompute on test
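A vectorized version of the training-only idea might look like the sketch below. `fit_percentile_ranker` is a hypothetical helper, not part of any library; it freezes the sorted training values once and reuses them for every later lookup, and the toy arrays are made up for illustration.

```python
import numpy as np


def fit_percentile_ranker(train_values):
    """Return a function that ranks new values against training data only."""
    sorted_train = np.sort(np.asarray(train_values, dtype=float))

    def rank(values):
        values = np.asarray(values, dtype=float)
        # side="left" counts training values strictly below each input
        counts = np.searchsorted(sorted_train, values, side="left")
        return counts / len(sorted_train)

    return rank


train_values = np.array([1.0, 2.0, 3.0, 4.0])
test_values = np.array([0.5, 2.5, 10.0])

rank = fit_percentile_ranker(train_values)
test_ranks = rank(test_values)

# Quartile thresholds are also frozen on the training data
thresholds = np.percentile(train_values, [25, 50, 75])
test_buckets = np.digitize(test_values, thresholds)
```

Test values outside the training range simply saturate at rank 0 or 1, which is the behaviour a production system would see as well.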
Code Example — Dangerous vs Safe Rolling:
# 🚫 WRONG: Center=True uses future values
df['leaky'] = df['y'].rolling(window=5, center=True).mean()
# 🚫 WRONG: No shift means y[t] is used to predict y[t]
df['also_leaky'] = df['y'].rolling(window=5, center=False).mean()
# ✅ SAFE: Shift first, then roll
df['safe'] = df['y'].shift(1).rolling(window=5, center=False).mean()
🚫 LEAKAGE: Never Use These
These features inherently use future information and cannot be fixed.
| Feature Type | Why It Leaks | What Happens |
|---|---|---|
| Centered rolling | Uses | Model “sees” future |
| Full-series normalization | Mean/std include test | Test distribution leaked |
| Target encoding (full) | Test targets in encoding | Direct target leakage |
| Cross-validation leakage | KFold shuffles time | Future in training |
| Forward-looking indicators | Any | Crystal ball |
How Leakage Manifests:
import numpy as np
import pandas as pd

# Generate high-persistence AR(1) data
np.random.seed(42)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.95 * y[t-1] + np.random.normal()

# 🚫 LEAKY FEATURE: Rolling mean includes current value
X_leaky = pd.DataFrame({
    'rolling_mean': pd.Series(y).rolling(5).mean()  # Uses y[t]!
})

# Train/test split
X_train, y_train = X_leaky.iloc[:400], y[:400]
X_test, y_test = X_leaky.iloc[400:], y[400:]

# A model trained on this feature will score better on the test set
# than it could in production, because each X_test row already
# contains part of the y_test value it is supposed to predict
Common Mistakes by Category
1. Pandas Rolling Window Traps
# Trap 1: Default includes current value
df['bad1'] = df['y'].rolling(5).mean() # y[t-4:t+1].mean(), includes t
# Trap 2: center=True uses both sides
df['bad2'] = df['y'].rolling(5, center=True).mean() # y[t-2:t+3]
# Trap 3: min_periods allows partial windows at start (ok), but still includes t
df['bad3'] = df['y'].rolling(5, min_periods=1).mean() # still includes t
# ✅ CORRECT: shift(1) BEFORE rolling
df['good'] = df['y'].shift(1).rolling(5, min_periods=1).mean()
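These traps can also be caught mechanically. One possible approach, sketched below, is a perturbation check: change y from time t onward and see whether the feature value at t moves. The function name `depends_on_present_or_future` is hypothetical, and the check assumes the feature is a deterministic function of the series.

```python
import numpy as np
import pandas as pd


def depends_on_present_or_future(feature_fn, y: pd.Series, t: int) -> bool:
    """Return True if feature[t] changes when y[t:] is perturbed."""
    baseline = feature_fn(y).iloc[t]
    perturbed = y.copy()
    perturbed.iloc[t:] += 1000.0  # large shift to y[t], y[t+1], ...
    changed = feature_fn(perturbed).iloc[t]
    return not np.isclose(baseline, changed, equal_nan=True)


y = pd.Series(np.arange(20, dtype=float))

leaky = lambda s: s.rolling(5).mean()
centered = lambda s: s.rolling(5, center=True).mean()
safe = lambda s: s.shift(1).rolling(5).mean()

results = {
    "leaky": depends_on_present_or_future(leaky, y, t=10),
    "centered": depends_on_present_or_future(centered, y, t=10),
    "safe": depends_on_present_or_future(safe, y, t=10),
}
```

A check like this fits naturally into a unit test that runs over every feature function in a pipeline.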
2. Sklearn Transformer Pitfalls
from sklearn.preprocessing import StandardScaler
# 🚫 WRONG: Fit on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Uses mean/std of entire dataset
X_train_scaled = X_scaled[:split_idx] # Test info leaked into training!
# ✅ CORRECT: Fit on training only
scaler = StandardScaler()
scaler.fit(X_train) # Only training statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # Apply training statistics
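The same discipline carries over to cross-validation. A minimal sketch, assuming a scikit-learn version whose `TimeSeriesSplit` accepts the `gap` argument (0.24 or later): putting the scaler inside a pipeline means it is refit on each training fold, and the time-ordered splitter keeps every training index before the test indices. The data and model choice here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)

gap = 2  # number of samples dropped between each train and test fold
cv = TimeSeriesSplit(n_splits=5, gap=gap)

# The scaler is fit inside each fold, on that fold's training rows only
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")

# Keep the index pairs around to verify the temporal ordering
splits = list(cv.split(X))
```

Unlike a shuffled KFold, every fold here trains strictly on the past and evaluates on what comes after the gap.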
3. GroupBy Aggregation Leakage
# 🚫 WRONG: Group statistics include all data
df['sector_mean'] = df.groupby('sector')['returns'].transform('mean')
# If sector appears in both train and test, test values contaminate training
# ✅ CORRECT: Compute on training, merge to full
train_sector_means = df.loc[train_idx].groupby('sector')['returns'].mean()
df['sector_mean'] = df['sector'].map(train_sector_means)
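When a fixed training snapshot is too coarse, an alternative is a per-group expanding mean that only looks backward. The sketch below assumes the rows are already sorted by time; the toy data and the column name `sector_mean_past` are illustrative.

```python
import pandas as pd

# Rows are assumed to be in chronological order
df = pd.DataFrame({
    "sector": ["A", "B", "A", "B", "A", "B"],
    "returns": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Within each sector: drop the current row with shift(1),
# then average everything seen so far
df["sector_mean_past"] = (
    df.groupby("sector")["returns"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
```

The first row of each sector is NaN because that sector has no history yet at that point.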
4. Technical Indicators Leakage
# Many TA-Lib indicators are fine, but some use future data
# ⚠️ CHECK: Does the indicator use future values?
# - MACD, RSI, Bollinger: Safe (backward-looking)
# - Pivot Points: Often calculated for "today" using "today's" HLOC
# → Safe if using previous day's HLOC
# ✅ SAFE RSI (standard implementation is backward-looking)
import talib
df['RSI'] = talib.RSI(df['close'].shift(1), timeperiod=14) # shift for safety
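For pivot points, the safe variant mentioned above can be sketched with plain pandas, using the classic (high + low + close) / 3 formula on the previous bar. The toy prices and the `pivot` column name are assumptions for illustration.

```python
import pandas as pd

ohlc = pd.DataFrame({
    "high": [11.0, 12.0, 13.0],
    "low": [9.0, 10.0, 11.0],
    "close": [10.0, 11.0, 12.0],
})

# Use the PREVIOUS bar's high/low/close so today's pivot is known at the open
prev = ohlc[["high", "low", "close"]].shift(1)
ohlc["pivot"] = (prev["high"] + prev["low"] + prev["close"]) / 3.0
```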
Validation: How to Detect Leakage
Method 1: Shuffled Target Test
The gold standard. If your model beats a shuffled target, you have leakage.
from temporalcv import gate_signal_verification

result = gate_signal_verification(
    model=your_model,
    X=X_train,
    y=y_train,
    n_shuffles=5,
    random_state=42
)

if result.status == "HALT":
    print("LEAKAGE DETECTED!")
    print(f"Real MAE: {result.real_mae:.4f}")
    print(f"Shuffled MAE: {result.shuffled_mae:.4f}")
    # Model performs nearly as well on shuffled target
    # → Features encode target position, not predictive signal
Method 2: Too-Good-to-Be-True Check
If you’re beating persistence by >20% on high-persistence data, investigate.
from temporalcv import compute_mase, compute_acf

# Check persistence level
acf1 = compute_acf(y_train, max_lag=1)[1]  # ACF at lag 1
if acf1 > 0.9:
    print(f"High persistence series (ACF(1)={acf1:.2f})")

# If MASE < 0.8 on high-persistence data, be suspicious
mase = compute_mase(predictions, actuals, y_train)
if mase < 0.8 and acf1 > 0.9:
    print("WARNING: Suspiciously good performance")
    print("Run gate_signal_verification() before proceeding")
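For readers who want to see what these two quantities measure, here is a library-free sketch using the textbook definitions: MASE as the forecast MAE divided by the in-sample MAE of the one-step naive forecast, and the lag-1 sample autocorrelation. The function names are hypothetical, and this is not claimed to match how temporalcv computes them.

```python
import numpy as np


def lag1_autocorr(y) -> float:
    """Sample autocorrelation at lag 1."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))


def mase_sketch(predictions, actuals, y_train) -> float:
    """Forecast MAE scaled by the in-sample one-step naive MAE."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Naive forecast on the training data: y[t-1] predicts y[t]
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return float(np.mean(np.abs(actuals - predictions)) / naive_mae)


y_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = mase_sketch(predictions=[6.0, 7.0], actuals=[7.0, 7.0], y_train=y_train)
acf1 = lag1_autocorr(y_train)
```

A MASE below 1 means the model beats the naive forecast on average; on a highly persistent series the naive forecast is already strong, which is why a large margin deserves a second look.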
Method 3: Horizon Consistency Check
If h=1 is dramatically better than h=2,3,4, you may have gap issues.
# If gap=0, the h=1 prediction can "see" y[t] in features
# Compare performance across horizons
mase_h1 = compute_mase(preds_h1, actuals_h1, y_train)
mase_h4 = compute_mase(preds_h4, actuals_h4, y_train)
if mase_h1 < 0.5 * mase_h4:
    print("WARNING: h=1 >> h=4 suggests gap enforcement issue")
Safe Feature Engineering Checklist
Before using any feature, verify:
No future data: Feature at time t uses only data from [0, t-gap]
Shift before rolling: df['x'].shift(1).rolling(...) not df['x'].rolling(...)
Training-only statistics: Percentiles, means, encodings fit on training only
Explicit lag: If using lag, is it y[t-lag] or accidentally y[t]?
No centered windows: center=False for all rolling operations
Gap respected: If predicting h steps ahead, gap >= h in all features
Quick Reference Card
| Operation | Leaky Version | Safe Version |
|---|---|---|
| Rolling mean | | |
| Percentile | | |
| Standardization | | |
| Group encoding | | Fit on train, map to full |
| Expanding mean | | |
| Diff | | |
Summary
The One Rule: Features at t use only data from ≤ t-gap
Shift before rolling: The most common pandas mistake
Training-only statistics: Percentiles, normalization, encoding
Validate with shuffled target: The definitive leakage test
Suspicious improvement = investigate: >20% over persistence = verify
When in doubt, ask: “Could I compute this feature in real-time production, knowing only the past?”
If the answer is “no,” it’s leakage.
See Also
Notebook 00: Time Series Fundamentals — Why time series differs from regular ML
Walk-Forward CV Tutorial — Proper temporal validation
Diagnostic Flowchart — What to do when validation fails
Metric Selection Guide — Which metric for which problem