Feature Engineering Safety Guide

Audience: ML practitioners who know sklearn but are new to time-series feature engineering.

Purpose: Teach you to identify safe vs dangerous features without memorization — understand the principle, apply it anywhere.

The One Rule

Every feature at time t must be computable using ONLY data from times ≤ t-gap.

That’s it. If you internalize this rule, you’ll never accidentally leak information.

Why gap? If you’re predicting h steps ahead, the forecast for time t is issued at time t-h, so the features for t cannot use any information from (t-h, t]. Setting gap ≥ h enforces that temporal separation.
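
As a minimal sketch of the rule, the helper below builds lag features that start at `t-gap` and only reach further back. The function name `make_lag_features` and its parameters are illustrative assumptions, not part of any library.

```python
import pandas as pd

def make_lag_features(y: pd.Series, n_lags: int, gap: int) -> pd.DataFrame:
    """Build features for row t from y[t-gap], y[t-gap-1], ... only."""
    if gap < 1:
        # gap=0 would put y[t] itself into the features for row t
        raise ValueError("gap must be >= 1")
    return pd.DataFrame(
        {f"lag_{gap + k}": y.shift(gap + k) for k in range(n_lags)}
    )

y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
features = make_lag_features(y, n_lags=2, gap=2)
print(features)
```

With `gap=2`, the row at t=4 sees `y[2]` and `y[1]`, never `y[3]` or `y[4]`, which is what a 2-step-ahead forecast requires.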

Quick Decision Tree

Question 1: Does this feature use future data (y[t+1], x[t+1], ...)?
    YES → 🚫 LEAKAGE - Never use
    NO  → Continue

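Question 2: Does this feature depend on statistics derived from the target variable y (encodings, percentiles, scaling)?
    YES → Question 2a: Are those statistics computed on training data only?
        YES → ⚠️ DANGEROUS - Requires careful handling
        NO  → 🚫 LEAKAGE - Target-derived features on full series
    NO  → Continue (plain lags such as y[t-1] go to Question 3)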

Question 3: Does this feature use centered windows (data from both sides of t)?
    YES → 🚫 LEAKAGE - Bidirectional operations
    NO  → ✅ SAFE - Backward-looking only
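
The tree above can be written down as a small function, which is handy for annotating a feature inventory. This is only a sketch: `classify_feature` and its flag names are hypothetical, and Question 2 is interpreted as being about statistics derived from the target.

```python
def classify_feature(uses_future: bool,
                     uses_target_stats: bool = False,
                     fit_on_train_only: bool = False,
                     centered_window: bool = False) -> str:
    """Walk the three questions in order and return the verdict."""
    if uses_future:                      # Question 1
        return "LEAKAGE"
    if uses_target_stats:                # Question 2 / 2a
        return "DANGEROUS" if fit_on_train_only else "LEAKAGE"
    if centered_window:                  # Question 3
        return "LEAKAGE"
    return "SAFE"

print(classify_feature(uses_future=False))                        # a plain lag
print(classify_feature(False, uses_target_stats=True,
                       fit_on_train_only=True))                   # target encoding on train
print(classify_feature(False, centered_window=True))              # centered rolling mean
```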

Feature Categories

✅ SAFE: Backward-Looking Only

These features use only past data and are always safe when properly lagged.

Feature Type	Example	Why Safe
Lag features	`y[t-1]`, `y[t-5]`	Explicitly past
Expanding statistics	`mean(y[0:t])`	Only uses [0, t)
Rolling (trailing window)	`y[t-window:t].mean()`	Only uses past window
Cumulative sums	`sum(y[0:t])`	Only uses [0, t)
Calendar features	`day_of_week(t)`	Deterministic, no data needed
External regressors	`x[t-1]` (lagged)	Past values only

Code Example — Safe Rolling Mean:

import numpy as np

# ✅ SAFE: Rolling mean using only past values
def safe_rolling_mean(series, window=5):
    """Compute rolling mean using only past values (t-window to t-1)."""
    values = np.asarray(series, dtype=float)  # float so NaN can be stored
    result = np.full(len(values), np.nan)     # first `window` rows stay NaN
    for t in range(window, len(values)):
        result[t] = values[t-window:t].mean()  # excludes t!
    return result

# Equivalent pandas (note: shift BEFORE rolling!)
df['safe_rolling'] = df['y'].shift(1).rolling(window=5).mean()
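
The other rows of the table follow the same pattern. The sketch below shows an expanding mean, a cumulative sum, and a calendar feature on a small made-up frame (the data and column names are illustrative assumptions).

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Expanding mean over [0, t): shift first so row t never sees y[t]
df["expanding_mean"] = df["y"].shift(1).expanding().mean()

# Cumulative sum over [0, t)
df["cumsum_past"] = df["y"].shift(1).cumsum()

# Calendar feature: depends only on the timestamp, not on any observed value
df["day_of_week"] = df.index.dayofweek

print(df)
```

The first row is NaN for both statistics because no past data exists yet; that is expected and should be dropped or imputed using training data only.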

⚠️ DANGEROUS: Requires Careful Handling

These features are legitimate but easy to implement incorrectly.

Feature Type	Danger	Safe Implementation
Rolling statistics	`center=True`, or no shift so the window includes `y[t]`	Keep the default `center=False` + `shift(1)`
Percentile ranks	Full-series percentiles	Compute on training only, apply to test
Standardization	Fit on all data	Fit on training only
Target encoding	Include test targets	Use only training targets
Regime indicators	Computed on full series	Use changepoint detection with lag

Code Example — Dangerous vs Safe Percentile:

# 🚫 WRONG: Percentile uses full series (future information!)
def leaky_percentile_rank(series, value):
    return (series < value).sum() / len(series)

# ✅ SAFE: Percentile uses only training data
def safe_percentile_rank(training_series, value):
    return (training_series < value).sum() / len(training_series)

# In practice:
train_percentiles = np.percentile(y_train, [25, 50, 75])
# Apply these thresholds to test data — never recompute on test
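
One way to apply training-only thresholds is sketched below, using `np.digitize` for quartile buckets and `np.searchsorted` for a percentile rank. The synthetic data and the helper name `percentile_rank` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.normal(size=400)
y_test = rng.normal(size=100)

# Thresholds come from training data only
thresholds = np.percentile(y_train, [25, 50, 75])

# Bucket 0..3 for every value, using the SAME thresholds on train and test
train_bucket = np.digitize(y_train, thresholds)
test_bucket = np.digitize(y_test, thresholds)

def percentile_rank(training_sorted, values):
    """Fraction of training values strictly below each value."""
    return np.searchsorted(training_sorted, values, side="left") / len(training_sorted)

train_sorted = np.sort(y_train)
test_rank = percentile_rank(train_sorted, y_test)
print(thresholds, test_bucket[:5], test_rank[:5])
```

The vectorized rank matches `safe_percentile_rank` applied value by value, but sorts the training data once instead of rescanning it.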

Code Example — Dangerous vs Safe Rolling:

# 🚫 WRONG: center=True uses future values
df['leaky'] = df['y'].rolling(window=5, center=True).mean()

# 🚫 WRONG: No shift means y[t] is used to predict y[t]
df['also_leaky'] = df['y'].rolling(window=5, center=False).mean()

# ✅ SAFE: Shift first, then roll
df['safe'] = df['y'].shift(1).rolling(window=5, center=False).mean()
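
A quick empirical check of the three variants above: perturb `y[t:]` and see whether the feature at row `t` moves. The helper `uses_current_or_future` is a hypothetical name, and this is a sketch of one possible check rather than an exhaustive test.

```python
import numpy as np
import pandas as pd

def uses_current_or_future(feature_fn, y: pd.Series, t: int) -> bool:
    """True if changing y[t:] alters the feature value at row t."""
    baseline = feature_fn(y).iloc[t]
    perturbed = y.copy()
    perturbed.iloc[t:] += 1000.0  # change only the present and the future
    changed = feature_fn(perturbed).iloc[t]
    return not np.isclose(baseline, changed, equal_nan=True)

y = pd.Series(np.arange(20, dtype=float))

leaky_centered = lambda s: s.rolling(5, center=True).mean()
leaky_unshifted = lambda s: s.rolling(5).mean()
safe = lambda s: s.shift(1).rolling(5).mean()

for name, fn in [("centered", leaky_centered),
                 ("unshifted", leaky_unshifted),
                 ("shifted", safe)]:
    print(name, uses_current_or_future(fn, y, t=10))
```

Only the shifted version is unaffected by the perturbation, because its window at row 10 covers rows 5 through 9.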

🚫 LEAKAGE: Never Use These

These constructions use future information by design. No parameter tweak makes them safe as written; replace them with a backward-looking or training-only alternative.

Feature Type	Why It Leaks	What Happens
Centered rolling	Uses `y[t+1], y[t+2]...`	Model “sees” future
Full-series normalization	Mean/std include test	Test distribution leaked
Target encoding (full)	Test targets in encoding	Direct target leakage
Cross-validation leakage	KFold trains on folds that come after the validation fold	Future in training
Forward-looking indicators	Any `y[t+k]` for k>0	Crystal ball
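
For the cross-validation row, a walk-forward split keeps every training index before the validation block. The generator below is a dependency-free sketch; `walk_forward_splits` and its parameters are hypothetical names. scikit-learn's `TimeSeriesSplit` provides the same idea, including a `gap` parameter.

```python
def walk_forward_splits(n_samples, n_splits, test_size, gap=1):
    """Yield (train_idx, test_idx); training always ends `gap` rows before test starts."""
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap  # exclusive end of the training range
        if train_end <= 0:
            raise ValueError("not enough samples for this many splits")
        yield list(range(train_end)), list(range(test_start, test_start + test_size))

splits = list(walk_forward_splits(n_samples=20, n_splits=3, test_size=4, gap=1))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```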

How Leakage Manifests:

import numpy as np
import pandas as pd

# Generate high-persistence AR(1) data
np.random.seed(42)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.95 * y[t-1] + np.random.normal()

# 🚫 LEAKY FEATURE: Rolling mean includes current value
X_leaky = pd.DataFrame({
    'rolling_mean': pd.Series(y).rolling(5).mean()  # Uses y[t]!
})

# Train/test split
X_train, y_train = X_leaky.iloc[:400], y[:400]
X_test, y_test = X_leaky.iloc[400:], y[400:]

# A model fit on X_train will score better on the test set than it could in
# production, because each X_test row already contains y_test[t] in its window

Common Mistakes by Category

1. Pandas Rolling Window Traps

# Trap 1: Default includes current value
df['bad1'] = df['y'].rolling(5).mean()  # y[t-4:t+1].mean(), includes t

# Trap 2: center=True uses both sides
df['bad2'] = df['y'].rolling(5, center=True).mean()  # y[t-2:t+3]

# Trap 3: min_periods allows partial windows at start (ok), but still includes t
df['bad3'] = df['y'].rolling(5, min_periods=1).mean()  # still includes t

# ✅ CORRECT: shift(1) BEFORE rolling
df['good'] = df['y'].shift(1).rolling(5, min_periods=1).mean()

2. Sklearn Transformer Pitfalls

from sklearn.preprocessing import StandardScaler

# 🚫 WRONG: Fit on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses mean/std of entire dataset
X_train_scaled = X_scaled[:split_idx]  # Test info leaked into training!

# ✅ CORRECT: Fit on training only
scaler = StandardScaler()
scaler.fit(X_train)  # Only training statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply training statistics
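
Inside cross-validation, wrapping the scaler in a pipeline makes the fit-on-training-only step automatic for every fold. This is a sketch on synthetic data; the choice of `Ridge`, the fold count, and `gap=1` are assumptions, and the `gap` argument requires scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

# The scaler is refit on each fold's training rows only
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = TimeSeriesSplit(n_splits=5, gap=1)  # training folds always precede validation
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print(-scores)
```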

3. GroupBy Aggregation Leakage

# 🚫 WRONG: Group statistics include all data
df['sector_mean'] = df.groupby('sector')['returns'].transform('mean')
# If sector appears in both train and test, test values contaminate training

# ✅ CORRECT: Compute on training, merge to full
train_sector_means = df.loc[train_idx].groupby('sector')['returns'].mean()
df['sector_mean'] = df['sector'].map(train_sector_means)
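
When rows are sorted by time, another option is a per-group expanding mean over past rows only, so each row sees just the history of its own group. The toy frame below is an illustrative assumption.

```python
import pandas as pd

# Rows are assumed to be in time order
df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "tech", "energy", "energy"],
    "returns": [0.10, 0.20, -0.10, 0.30, 0.10, 0.30],
})

# Shift within each group first, then expand: row t uses only earlier rows of its sector
df["sector_mean_past"] = (
    df.groupby("sector")["returns"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df)
```

The first row of each sector is NaN because that sector has no history yet.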

4. Technical Indicators Leakage

# Many TA-Lib indicators are fine, but some use future data

# ⚠️ CHECK: Does the indicator use future values?
# - MACD, RSI, Bollinger: backward-looking, but the value at t includes bar t
#   → Safe once shifted by one bar
# - Pivot Points: Often calculated for "today" using "today's" OHLC
#   → Safe if using previous day's OHLC

# ✅ SAFE RSI (standard implementation is backward-looking)
import talib
df['RSI'] = talib.RSI(df['close'].shift(1), timeperiod=14)  # shift for safety
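
For pivot points, the same shift applies. The sketch below uses the classic pivot formula (high + low + close) / 3 on made-up bars; the frame and column names are assumptions.

```python
import pandas as pd

bars = pd.DataFrame({
    "high":  [11.0, 12.0, 13.0],
    "low":   [9.0, 10.0, 11.0],
    "close": [10.0, 11.0, 12.0],
})

# Computed from the bar's own high/low/close: not known until that bar closes
pivot_same_day = (bars["high"] + bars["low"] + bars["close"]) / 3

# Shift by one bar so each row holds the pivot from the PREVIOUS bar
bars["pivot"] = pivot_same_day.shift(1)
print(bars)
```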

Validation: How to Detect Leakage

Method 1: Shuffled Target Test

The gold standard. If your model beats a shuffled target, you have leakage.

from temporalcv import gate_signal_verification

result = gate_signal_verification(
    model=your_model,
    X=X_train,
    y=y_train,
    n_shuffles=5,
    random_state=42
)

if result.status == "HALT":
    print("LEAKAGE DETECTED!")
    print(f"Real MAE: {result.real_mae:.4f}")
    print(f"Shuffled MAE: {result.shuffled_mae:.4f}")
    # Model performs nearly as well on shuffled target
    # → Features encode target position, not predictive signal

Method 2: Too-Good-to-Be-True Check

If you’re beating persistence by >20% on high-persistence data, investigate.

from temporalcv import compute_mase, compute_acf

# Check persistence level
acf1 = compute_acf(y_train, max_lag=1)[1]  # ACF at lag 1
if acf1 > 0.9:
    print(f"High persistence series (ACF(1)={acf1:.2f})")

# If MASE < 0.8 on high-persistence data, be suspicious
mase = compute_mase(predictions, actuals, y_train)
if mase < 0.8 and acf1 > 0.9:
    print("WARNING: Suspiciously good performance")
    print("Run gate_signal_verification() before proceeding")
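
If the library is not at hand, both quantities are easy to approximate with NumPy. The functions below are a sketch, not the `temporalcv` implementation: `acf_lag1` and `mase` are hypothetical names, and MASE is taken here as model MAE divided by the in-sample MAE of the one-step persistence forecast.

```python
import numpy as np

def acf_lag1(y):
    """Sample autocorrelation at lag 1."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

def mase(predictions, actuals, y_train):
    """Model MAE scaled by the in-sample MAE of the naive forecast y[t-1]."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    naive_mae = np.mean(np.abs(np.diff(np.asarray(y_train, dtype=float))))
    return float(np.mean(np.abs(actuals - predictions)) / naive_mae)

print(acf_lag1([1, 2, 3, 4, 5]))
print(mase(predictions=[8, 9], actuals=[9, 11], y_train=[1, 2, 4, 7]))
```

A MASE below 1 means the model beats persistence; the 0.8 threshold above corresponds to a 20% improvement.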

Method 3: Horizon Consistency Check

If h=1 is dramatically better than h=2,3,4, you may have gap issues.

# If gap=0, the h=1 prediction can "see" y[t] in features
# Compare performance across horizons
mase_h1 = compute_mase(preds_h1, actuals_h1, y_train)
mase_h4 = compute_mase(preds_h4, actuals_h4, y_train)

if mase_h1 < 0.5 * mase_h4:
    print("WARNING: h=1 error is less than half of h=4 error; check gap enforcement")

Safe Feature Engineering Checklist

Before using any feature, verify:

No future data: Feature at time t uses only data from [0, t-gap]
Shift before rolling: df['x'].shift(1).rolling(...) not df['x'].rolling(...)
Training-only statistics: Percentiles, means, encodings fit on training only
Explicit lag: If using lag, is it y[t-lag] or accidentally y[t]?
No centered windows: center=False for all rolling operations
Gap respected: If predicting h steps ahead, gap >= h in all features

Quick Reference Card

Operation	Leaky Version	Safe Version
Rolling mean	`df['y'].rolling(5).mean()`	`df['y'].shift(1).rolling(5).mean()`
Percentile	`np.percentile(full_series, 75)`	`np.percentile(train_only, 75)`
Standardization	`scaler.fit_transform(X)`	`scaler.fit(X_train).transform(X)`
Group encoding	`df.groupby('g')['y'].transform('mean')`	Fit on train, map to full
Expanding mean	`df['y'].expanding().mean()`	`df['y'].shift(1).expanding().mean()`
Diff	`df['y'].diff()` (uses `y[t]`)	`df['y'].shift(1).diff()` (uses t-1 and t-2)

Summary

The One Rule: Features at t use only data from ≤ t-gap
Shift before rolling: The most common pandas mistake
Training-only statistics: Percentiles, normalization, encoding
Validate with shuffled target: The definitive leakage test
Suspicious improvement = investigate: >20% over persistence = verify

When in doubt, ask: “Could I compute this feature in real-time production, knowing only the past?”

If the answer is “no,” it’s leakage.