# Cross-Validation

## Overview
The EPF system uses scikit-learn’s TimeSeriesSplit with 5 folds for cross-validation. Unlike random k-fold, TimeSeriesSplit respects temporal ordering — models are always validated on future data relative to their training set, providing realistic performance estimates for a time-series forecasting application.
## Why Not Random K-Fold?
Random k-fold cross-validation randomly assigns samples to folds, ignoring temporal order. This creates two problems for time series:
### Data Leakage
A model trained on data from September might be validated on data from June. Since the model has “seen” future information during training, it appears more accurate than it actually is in production.
### Unrealistic Error Estimates
Real forecasting is always forward-looking: you train on the past and predict the future. Random k-fold breaks this constraint, producing optimistic error estimates that don’t reflect deployment accuracy.
## TimeSeriesSplit: How It Works
```text
Fold 1: Train [========]         Val [==]
Fold 2: Train [==========]       Val [==]
Fold 3: Train [============]     Val [==]
Fold 4: Train [==============]   Val [==]
Fold 5: Train [================] Val [==]

← Past                           Future →
```

Each fold uses an expanding training window:
| Fold | Training Period | Validation Period |
|---|---|---|
| 1 | Months 1–6 | Months 7–8 |
| 2 | Months 1–8 | Months 9–10 |
| 3 | Months 1–10 | Months 11–12 |
| 4 | Months 1–12 | Months 13–14 |
| 5 | Months 1–14 | Months 15–16 |
(Exact splits depend on dataset size)
Key properties:
- Training always precedes validation: No future data leaks into training
- Expanding window: Later folds have more training data (mirrors production)
- Multiple estimates: 5 independent error measurements → robust mean and variance
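The fold layout in the table above can be reproduced with a toy series (a sketch: 16 samples stand in for 16 months; exact boundaries on the real dataset depend on its length):

```python
# Sketch: TimeSeriesSplit's expanding window on a 16-sample toy series.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(16).reshape(-1, 1)  # 16 samples standing in for months 1-16
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train months 1-{train_idx[-1] + 1}, "
          f"validate months {val_idx[0] + 1}-{val_idx[-1] + 1}")
# Fold 1: train months 1-6, validate months 7-8
# ...
# Fold 5: train months 1-14, validate months 15-16
```

Note that the training window always ends where the validation window begins, so no future samples enter training.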
## Per-Fold Metrics
Each fold produces metrics for the validation period:
```python
from math import sqrt

from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for fold_idx, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Positional indexing keeps temporal order intact
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mae = mean_absolute_error(y_val, y_pred)
    rmse = sqrt(mean_squared_error(y_val, y_pred))
```

These per-fold metrics reveal:
- Mean MAE: Expected production accuracy
- MAE variance across folds: Stability of performance across time periods
- Trend across folds: Improving (model gets better with more data) or degrading (recent market conditions are harder)
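A minimal sketch of how these three summaries might be computed from per-fold MAE values (the numbers are illustrative, not real EPF metrics):

```python
# Sketch: summarising per-fold MAE. Values are hypothetical.
import numpy as np

fold_maes = np.array([4.2, 3.9, 4.5, 3.7, 3.6])  # one MAE per fold

mean_mae = fold_maes.mean()                  # expected production accuracy
std_mae = fold_maes.std(ddof=1)              # stability across time periods
slope = np.polyfit(np.arange(len(fold_maes)), fold_maes, deg=1)[0]

print(f"MAE {mean_mae:.2f} +/- {std_mae:.2f}; trend {slope:+.3f} per fold")
# A negative slope suggests improvement as training data grows;
# a positive slope suggests recent periods are harder to predict.
```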
## Residual Collection for Conformal Prediction
A critical secondary purpose of cross-validation: collecting out-of-fold residuals for conformal prediction intervals.
Each validation fold produces residuals:
```python
residual = actual_price - predicted_price
```

These residuals are grouped by horizon bucket and stored as the calibration set for conformal prediction. Using out-of-fold (not in-sample) residuals is essential: in-sample residuals are optimistically small and would produce intervals that are too narrow.
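As a sketch of how a calibration set is typically used, the split-conformal quantile rule below is an assumption about the implementation, and the residual values and point forecast are illustrative:

```python
# Sketch: split-conformal interval from out-of-fold residuals for one
# horizon bucket. Residuals and forecast are illustrative stand-ins.
import numpy as np

residuals = np.array([-3.1, 1.2, 0.4, -0.8, 2.5, -1.9, 0.7, 3.3])
alpha = 0.1  # target 90% coverage

# Quantile of absolute residuals, with a finite-sample correction.
n = len(residuals)
level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
half_width = np.quantile(np.abs(residuals), level)

forecast = 52.0  # hypothetical point prediction, EUR/MWh
lower, upper = forecast - half_width, forecast + half_width
```

Because `half_width` comes from held-out residuals, the interval reflects real out-of-sample error rather than the model's optimistic in-sample fit.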
## 5-Fold Choice
Five folds balance estimation accuracy against computational cost:
| Folds | Training Data per Fold | Validation Data per Fold | Total CV Time |
|---|---|---|---|
| 3 | More | More | Faster |
| 5 | Moderate | Moderate | Standard |
| 10 | Less initial data | Less | Slower, volatile |
With 5 folds, each validation period covers approximately 2–3 months, providing enough data for reliable metric estimation while keeping training sets reasonably large.
## Interaction with Horizon Groups
Cross-validation is performed independently for each horizon group:
```text
For DA1 (hours 14-25):
  Fold 1: Train DA1 on period 1,    validate DA1 on period 2
  Fold 2: Train DA1 on periods 1-2, validate DA1 on period 3
  ...

For S1 (hours 33-56):
  Fold 1: Train S1 on period 1, validate S1 on period 2
  ...
```

This ensures each model is evaluated on data from its specific horizon, and conformal residuals are collected per horizon bucket.
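A sketch of the per-group loop, using synthetic data and a linear model as stand-ins (the real feature sets and estimator are not shown in this section):

```python
# Sketch: independent CV per horizon group, collecting out-of-fold
# residuals per bucket. Data and model are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
groups = {"DA1": (14, 25), "S1": (33, 56)}  # hour ranges from the text
residuals_by_group = {}

for name, _hours in groups.items():
    X = rng.normal(size=(120, 3))  # stand-in features for this group
    y = X[:, 0] + rng.normal(scale=0.2, size=120)
    res = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
        m = LinearRegression().fit(X[train_idx], y[train_idx])
        res.extend(y[val_idx] - m.predict(X[val_idx]))  # out-of-fold only
    residuals_by_group[name] = np.array(res)
```

Each group ends up with its own calibration set, so interval widths can differ between near-term and longer-horizon forecasts.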
## Production Training
After cross-validation establishes metrics and collects conformal residuals, the final production model is trained on the full dataset (all folds combined). This maximizes training data for the deployed model while the CV metrics provide the accuracy estimate.
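A minimal sketch of this final step (Ridge and the synthetic data are stand-ins; the production estimator is whatever was cross-validated above):

```python
# Sketch: after CV, refit one model on the full dataset for deployment.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in feature matrix
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.1, size=200)

model = Ridge(alpha=1.0)
model.fit(X, y)  # all folds combined; no data held out at this stage
# Accuracy is reported from the earlier CV metrics, not from this
# in-sample fit, which would be optimistic.
```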