# Cross-Validation

## Overview
The EPF system uses scikit-learn’s TimeSeriesSplit with 5 folds for cross-validation. Unlike random k-fold, TimeSeriesSplit respects temporal ordering — models are always validated on future data relative to their training set, providing realistic performance estimates for a time-series forecasting application.
## Why Not Random K-Fold?
Random k-fold cross-validation randomly assigns samples to folds, ignoring temporal order. This creates two problems for time series:
### Data Leakage
A model trained on data from September might be validated on data from June. Since the model has “seen” future information during training, it appears more accurate than it actually is in production.
### Unrealistic Error Estimates
Real forecasting is always forward-looking: you train on the past and predict the future. Random k-fold breaks this constraint, producing optimistic error estimates that don’t reflect deployment accuracy.
## TimeSeriesSplit: How It Works
```text
Fold 1: Train [========]         Val [==]
Fold 2: Train [==========]       Val [==]
Fold 3: Train [============]     Val [==]
Fold 4: Train [==============]   Val [==]
Fold 5: Train [================] Val [==]

← Past                           Future →
```

Each fold uses an expanding training window:
| Fold | Training Period | Validation Period |
|---|---|---|
| 1 | Months 1–6 | Months 7–8 |
| 2 | Months 1–8 | Months 9–10 |
| 3 | Months 1–10 | Months 11–12 |
| 4 | Months 1–12 | Months 13–14 |
| 5 | Months 1–14 | Months 15–16 |
(Exact splits depend on dataset size)
Key properties:
- Training always precedes validation: No future data leaks into training
- Expanding window: Later folds have more training data (mirrors production)
- Multiple estimates: 5 independent error measurements → robust mean and variance
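The fold layout in the table above can be reproduced with a toy series (a sketch: 16 samples stand in for 16 months; exact boundaries on the real dataset depend on its length):

```python
# Sketch: TimeSeriesSplit's expanding window on a 16-sample toy series.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(16).reshape(-1, 1)  # 16 samples standing in for months 1-16
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train months 1-{train_idx[-1] + 1}, "
          f"validate months {val_idx[0] + 1}-{val_idx[-1] + 1}")
# Fold 1: train months 1-6, validate months 7-8
# ...
# Fold 5: train months 1-14, validate months 15-16
```

Note that the training window always ends where the validation window begins, so no future samples enter training.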
## Per-Fold Metrics
Each fold produces metrics for the validation period:
```python
from math import sqrt

from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for fold_idx, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Positional indexing keeps temporal order intact
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mae = mean_absolute_error(y_val, y_pred)
    rmse = sqrt(mean_squared_error(y_val, y_pred))
```

These per-fold metrics reveal:
- Mean MAE: Expected production accuracy
- MAE variance across folds: Stability of performance across time periods
- Trend across folds: Improving (model gets better with more data) or degrading (recent market conditions are harder)
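A minimal sketch of how these three summaries might be computed from per-fold MAE values (the numbers are illustrative, not real EPF metrics):

```python
# Sketch: summarising per-fold MAE. Values are hypothetical.
import numpy as np

fold_maes = np.array([4.2, 3.9, 4.5, 3.7, 3.6])  # one MAE per fold

mean_mae = fold_maes.mean()                  # expected production accuracy
std_mae = fold_maes.std(ddof=1)              # stability across time periods
slope = np.polyfit(np.arange(len(fold_maes)), fold_maes, deg=1)[0]

print(f"MAE {mean_mae:.2f} +/- {std_mae:.2f}; trend {slope:+.3f} per fold")
# A negative slope suggests improvement as training data grows;
# a positive slope suggests recent periods are harder to predict.
```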
## Residual Collection for Conformal Prediction
A critical secondary purpose of cross-validation: collecting out-of-fold residuals for conformal prediction intervals.
Each validation fold produces residuals:
```python
residual = actual_price - predicted_price
```

These residuals are grouped by horizon bucket and stored as the calibration set for conformal prediction. Using out-of-fold (not in-sample) residuals is essential: in-sample residuals are optimistically small and would produce intervals that are too narrow.
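As a sketch of how a calibration set is typically used, the split-conformal quantile rule below is an assumption about the implementation, and the residual values and point forecast are illustrative:

```python
# Sketch: split-conformal interval from out-of-fold residuals for one
# horizon bucket. Residuals and forecast are illustrative stand-ins.
import numpy as np

residuals = np.array([-3.1, 1.2, 0.4, -0.8, 2.5, -1.9, 0.7, 3.3])
alpha = 0.1  # target 90% coverage

# Quantile of absolute residuals, with a finite-sample correction.
n = len(residuals)
level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
half_width = np.quantile(np.abs(residuals), level)

forecast = 52.0  # hypothetical point prediction, EUR/MWh
lower, upper = forecast - half_width, forecast + half_width
```

Because `half_width` comes from held-out residuals, the interval reflects real out-of-sample error rather than the model's optimistic in-sample fit.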
## 5-Fold Choice
Five folds balance estimation accuracy against computational cost:
| Folds | Training Data per Fold | Validation Data per Fold | Total CV Time |
|---|---|---|---|
| 3 | More | More | Faster |
| 5 | Moderate | Moderate | Standard |
| 10 | Less initial data | Less | Slower, volatile |
With 5 folds, each validation period covers approximately 2–3 months, providing enough data for reliable metric estimation while keeping training sets reasonably large.
## Interaction with Horizon Groups
Cross-validation is performed independently for each horizon group:
```text
For DA1 (hours 14-25):
  Fold 1: Train DA1 on period 1,    validate DA1 on period 2
  Fold 2: Train DA1 on periods 1-2, validate DA1 on period 3
  ...

For S1 (hours 33-56):
  Fold 1: Train S1 on period 1, validate S1 on period 2
  ...
```

This ensures each model is evaluated on data from its specific horizon, and conformal residuals are collected per horizon bucket.
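A sketch of the per-group loop, using synthetic data and a linear model as stand-ins (the real feature sets and estimator are not shown in this section):

```python
# Sketch: independent CV per horizon group, collecting out-of-fold
# residuals per bucket. Data and model are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
groups = {"DA1": (14, 25), "S1": (33, 56)}  # hour ranges from the text
residuals_by_group = {}

for name, _hours in groups.items():
    X = rng.normal(size=(120, 3))  # stand-in features for this group
    y = X[:, 0] + rng.normal(scale=0.2, size=120)
    res = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
        m = LinearRegression().fit(X[train_idx], y[train_idx])
        res.extend(y[val_idx] - m.predict(X[val_idx]))  # out-of-fold only
    residuals_by_group[name] = np.array(res)
```

Each group ends up with its own calibration set, so interval widths can differ between near-term and longer-horizon forecasts.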
## Production Training
After cross-validation establishes metrics and collects conformal residuals, the final production model is trained on the full dataset (all folds combined). This maximizes training data for the deployed model while the CV metrics provide the accuracy estimate.
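A minimal sketch of this final step (Ridge and the synthetic data are stand-ins; the production estimator is whatever was cross-validated above):

```python
# Sketch: after CV, refit one model on the full dataset for deployment.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in feature matrix
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.1, size=200)

model = Ridge(alpha=1.0)
model.fit(X, y)  # all folds combined; no data held out at this stage
# Accuracy is reported from the earlier CV metrics, not from this
# in-sample fit, which would be optimistic.
```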