
Cross-Validation

Overview

The EPF system uses scikit-learn’s TimeSeriesSplit with 5 folds for cross-validation. Unlike random k-fold, TimeSeriesSplit respects temporal ordering — models are always validated on future data relative to their training set, providing realistic performance estimates for a time-series forecasting application.

Why Not Random K-Fold?

Random k-fold cross-validation randomly assigns samples to folds, ignoring temporal order. This creates two problems for time series:

Data Leakage

A model trained on data from September might be validated on data from June. Since the model has “seen” future information during training, it appears more accurate than it actually is in production.

Unrealistic Error Estimates

Real forecasting is always forward-looking: you train on the past and predict the future. Random k-fold breaks this constraint, producing optimistic error estimates that don’t reflect deployment accuracy.

TimeSeriesSplit: How It Works

Fold 1: Train [========] Val [==]
Fold 2: Train [==========] Val [==]
Fold 3: Train [============] Val [==]
Fold 4: Train [==============] Val [==]
Fold 5: Train [================] Val [==]
← Past Future →

Each fold uses an expanding training window:

Fold   Training Period   Validation Period
1      Months 1–6        Months 7–8
2      Months 1–8        Months 9–10
3      Months 1–10       Months 11–12
4      Months 1–12       Months 13–14
5      Months 1–14       Months 15–16

(Exact splits depend on dataset size)

Key properties:

  • Training always precedes validation: No future data leaks into training
  • Expanding window: Later folds have more training data (mirrors production)
  • Multiple estimates: 5 independent error measurements → robust mean and variance
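The expanding-window behaviour can be verified directly with scikit-learn. A minimal sketch, using 16 samples to stand in for the 16 months in the table above (the array sizes are illustrative, not the actual EPF dataset):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 16 samples standing in for 16 months of data
X = np.arange(16).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    # Training indices always end before validation indices begin
    print(f"Fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
          f"val {val_idx.min()}-{val_idx.max()}")
```

With 16 samples and 5 splits, scikit-learn produces exactly the pattern in the table: fold 1 trains on samples 0–5 and validates on 6–7, and each later fold extends the training window by one validation-sized step.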

Per-Fold Metrics

Each fold produces metrics for the validation period:

from math import sqrt

from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold_idx, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    mae = mean_absolute_error(y_val, y_pred)
    rmse = sqrt(mean_squared_error(y_val, y_pred))

These per-fold metrics reveal:

  • Mean MAE: Expected production accuracy
  • MAE variance across folds: Stability of performance across time periods
  • Trend across folds: Improving (model gets better with more data) or degrading (recent market conditions are harder)
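Aggregating the per-fold scores into these three summaries might look as follows. A sketch, with hypothetical MAE values standing in for the ones collected in the loop above:

```python
import statistics

# Hypothetical per-fold MAE values, one per CV fold (illustrative numbers)
fold_maes = [4.2, 3.9, 4.5, 4.1, 5.0]

mean_mae = statistics.mean(fold_maes)    # expected production accuracy
std_mae = statistics.stdev(fold_maes)    # stability across time periods
trend = fold_maes[-1] - fold_maes[0]     # >0 suggests later periods are harder

print(f"MAE {mean_mae:.2f} ± {std_mae:.2f}, fold-1 → fold-5 change {trend:+.2f}")
```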

Residual Collection for Conformal Prediction

A critical secondary purpose of cross-validation: collecting out-of-fold residuals for conformal prediction intervals.

Each validation fold produces residuals:

residual = actual_price - predicted_price

These residuals are grouped by horizon bucket and stored as the calibration set for conformal prediction. Using out-of-fold (not in-sample) residuals is essential — in-sample residuals are optimistically small and would produce intervals that are too narrow.
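Turning a calibration set of out-of-fold residuals into a prediction interval can be sketched as below. This is a simplified split-conformal example with made-up residuals and an illustrative bucket name ("DA1"); the real system's bucketing and quantile handling may differ:

```python
import numpy as np

# Hypothetical out-of-fold residuals (actual - predicted), keyed by
# horizon bucket; values and bucket name are illustrative only
residuals = {
    "DA1": np.array([-3.1, 2.4, -1.8, 4.0, -2.2, 1.5, 3.3, -4.7]),
}

alpha = 0.1  # target 90% coverage
# Interval half-width: empirical quantile of absolute calibration residuals
q = np.quantile(np.abs(residuals["DA1"]), 1 - alpha)

point_forecast = 85.0  # hypothetical model prediction
lower, upper = point_forecast - q, point_forecast + q
```

Had the residuals been in-sample instead of out-of-fold, their absolute values would be systematically smaller, shrinking q and producing intervals that undercover in production.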

5-Fold Choice

Five folds balance estimation accuracy against computational cost:

Folds   Training Data per Fold   Validation Data per Fold   Total CV Time
3       More                     More                       Faster
5       Moderate                 Moderate                   Standard
10      Less initial data        Less                       Slower, volatile

With 5 folds, each validation period covers approximately 2–3 months, providing enough data for reliable metric estimation while keeping training sets reasonably large.

Interaction with Horizon Groups

Cross-validation is performed independently for each horizon group:

For DA1 (hours 14-25):
Fold 1: Train DA1 on period 1, validate DA1 on period 2
Fold 2: Train DA1 on period 1-2, validate DA1 on period 3
...
For S1 (hours 33-56):
Fold 1: Train S1 on period 1, validate S1 on period 2
...

This ensures each model is evaluated on data from its specific horizon, and conformal residuals are collected per horizon bucket.
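The per-group loop above can be sketched end to end. A toy version with synthetic data and a stand-in linear model (the real feature matrices, targets, and model are assumptions here); each group gets its own CV run and its own residual pool:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Hypothetical per-group datasets; in the real system each group holds
# only the rows for its own horizon window (e.g. DA1 = hours 14-25)
groups = {
    "DA1": (rng.normal(size=(120, 3)), rng.normal(size=120)),
    "S1": (rng.normal(size=(120, 3)), rng.normal(size=120)),
}

tscv = TimeSeriesSplit(n_splits=5)
oof_residuals = {}
for name, (X, y) in groups.items():
    res = []
    for train_idx, val_idx in tscv.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        res.extend(y[val_idx] - model.predict(X[val_idx]))
    # Out-of-fold residuals become this bucket's conformal calibration set
    oof_residuals[name] = np.array(res)
```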

Production Training

After cross-validation establishes metrics and collects conformal residuals, the final production model is trained on the full dataset (all folds combined). This maximizes training data for the deployed model while the CV metrics provide the accuracy estimate.
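Continuing the sketch above, the final refit is a single call on the full dataset. The data here is synthetic and the model is a stand-in; the point is only that no rows are held out at this stage:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical full dataset for one horizon group (all folds combined)
rng = np.random.default_rng(1)
X_full, y_full = rng.normal(size=(240, 3)), rng.normal(size=240)

# Production model: trained on every available sample; the earlier
# CV fold metrics serve as its accuracy estimate
production_model = LinearRegression().fit(X_full, y_full)
```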