Walk-Forward Backtesting
Overview
Walk-forward backtesting simulates production conditions by stepping through historical dates, generating forecasts using only data that would have been available at each origin, and comparing predictions against realized prices. This produces realistic out-of-sample accuracy metrics that reflect actual deployment performance.
Why Not Standard Train/Test Splits?
Standard machine learning evaluation (random train/test split) violates temporal causality: the model may train on future data and predict past data. For time series forecasting, this creates overly optimistic accuracy estimates.
Walk-forward backtesting respects the arrow of time:
- Standard split: Random 80/20 → data leakage risk
- Walk-forward: Always train on past, predict future → no leakage
How It Works
Step 1: Define Backtest Range
- Start date: 2025-06-01
- End date: 2026-02-01
- Step size: 1 day
Step 2: For Each Origin Date
Date: 2025-09-15, 10:00 UTC (day-ahead run)
1. Training data: All data up to 2025-09-15 10:00 (expanding window — all available history)
2. Train models: DA1, DA2 horizon groups on this data
3. Predict: D+1 prices (2025-09-16 00:00–23:00)
4. Store: 24 predictions with model_name suffix "_backtest"
Step 3: Compare with Actuals
After the backtest completes, actual prices are backfilled and metrics computed:
For each prediction:
- error = predicted_price - actual_price
- |error| → contributes to MAE
- error² → contributes to RMSE
Expanding vs Rolling Window
Expanding Window (Default)
All available history is used for training:
- Origin 2025-06-01: train on 2024-01-01 to 2025-06-01
- Origin 2025-09-15: train on 2024-01-01 to 2025-09-15
- Origin 2026-01-01: train on 2024-01-01 to 2026-01-01
Training set grows over time, giving later origins more data. This matches production behavior, where models are retrained on all available history.
Rolling Window
Only the most recent N months are used:
- Window: 6 months
- Origin 2025-06-01: train on 2024-12-01 to 2025-06-01
- Origin 2025-09-15: train on 2025-03-15 to 2025-09-15
- Origin 2026-01-01: train on 2025-07-01 to 2026-01-01
Training set size is fixed, prioritizing recent patterns over older data. This is useful when market structure has changed (e.g., after a regulatory shift).
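The window selection above can be sketched as a small helper. The helper name, the `history_start` of 2024-01-01, and the calendar-month arithmetic are illustrative, taken from the example origins; day-of-month clamping for short months is omitted.

```python
from datetime import date
from typing import Optional, Tuple

def months_back(d: date, n: int) -> date:
    # Calendar-month arithmetic: same day-of-month, n months earlier.
    # (Clamping for short months, e.g. Mar 31 -> Feb, is omitted in this sketch.)
    y, m = divmod((d.year * 12 + d.month - 1) - n, 12)
    return d.replace(year=y, month=m + 1)

def training_window(origin: date,
                    history_start: date = date(2024, 1, 1),
                    rolling_months: Optional[int] = None) -> Tuple[date, date]:
    """Return (train_start, train_end) for one backtest origin.

    rolling_months=None selects the expanding window (all history);
    otherwise only the most recent N calendar months are used.
    """
    if rolling_months is None:
        return history_start, origin
    return max(history_start, months_back(origin, rolling_months)), origin

# Expanding window: all history up to the origin.
assert training_window(date(2025, 9, 15)) == (date(2024, 1, 1), date(2025, 9, 15))
# Rolling 6-month window: matches the example origins above.
assert training_window(date(2025, 9, 15), rolling_months=6) == \
    (date(2025, 3, 15), date(2025, 9, 15))
```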
Retrain Frequency
Full model retraining is expensive, so backtests support configurable retrain schedules:
| Frequency | Description | Trade-off |
|---|---|---|
| Daily | Retrain every origin | Most realistic, slowest |
| Weekly | Retrain once per week | Good balance |
| Monthly | Retrain once per month | Fast but less adaptive |
Between retraining dates, the most recently trained model is reused.
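The walk-forward loop with a retrain schedule can be sketched as follows. `train` and `predict` are hypothetical stand-ins for the real model interface, and the daily step and D+1 target mirror the day-ahead example above.

```python
from datetime import date, timedelta

def run_backtest(start: date, end: date, retrain_every_days: int,
                 train, predict):
    """Walk forward one origin per day, retraining every N days.

    `train(origin)` fits models on data available up to the origin and
    returns a model; `predict(model, origin)` returns the 24 hourly D+1
    predictions. Both are hypothetical callables.
    """
    model = None
    last_trained = None
    origin = start
    results = []
    while origin <= end:
        # Retrain on the first origin and whenever the schedule says so;
        # otherwise reuse the most recently trained model.
        if model is None or (origin - last_trained).days >= retrain_every_days:
            model = train(origin)
            last_trained = origin
        results.append((origin, predict(model, origin)))
        origin += timedelta(days=1)
    return results
```

With `retrain_every_days=7` this corresponds to the "Weekly" row in the table: one training run per week of origins, predictions every day.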
Backtest Outputs
Prediction Records
Each backtest prediction is stored with model_name suffix _backtest:
- model_name: "ensemble_backtest"
- run_mode: "dayahead"
This separates backtest results from live predictions in the database.
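A stored record might look like the sketch below. Only the model_name suffix and run_mode are documented above; every other field name is an assumption about the schema.

```python
# Hypothetical record shape; only model_name and run_mode are documented.
record = {
    "model_name": "ensemble_backtest",   # "_backtest" suffix marks backtest rows
    "run_mode": "dayahead",
    "origin_time": "2025-09-15T10:00:00Z",  # assumed field name
    "target_time": "2025-09-16T00:00:00Z",  # assumed field name
    "predicted_price": 87.4,                # assumed field name
}

# Backtest rows can then be separated from live rows by the suffix:
is_backtest = record["model_name"].endswith("_backtest")
```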
Aggregate Metrics
After completion, the backtest reports:
- Overall MAE, RMSE, MAPE
- MAE by horizon group (DA1, DA2, S1–S5)
- MAE by hour of day
- MAE by day of week
- Skill scores vs naive baselines
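A minimal sketch of the overall metrics and the skill score, assuming predictions and actuals arrive as (predicted, actual) pairs; the function names are illustrative.

```python
import math

def backtest_metrics(pairs):
    """Compute MAE, RMSE and MAPE (%) from (predicted, actual) price pairs."""
    errors = [p - a for p, a in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mape = sum(abs(e / a) for e, (_, a) in zip(errors, pairs)) / len(errors) * 100
    return mae, rmse, mape

def skill_score(model_mae, naive_mae):
    """Skill vs a naive baseline: 1 is perfect, 0 is no better than naive."""
    return 1.0 - model_mae / naive_mae

# Two toy predictions with errors +2 and -1:
mae, rmse, mape = backtest_metrics([(10.0, 8.0), (9.0, 10.0)])
assert mae == 1.5
```

The per-horizon, per-hour, and per-weekday breakdowns are the same computation applied to subsets of the pairs grouped by that key.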
Running a Backtest
A backtest is configured with a date range, run mode, and retrain frequency. The run mode controls which product is backtested:
| Mode | Origin | Horizons |
|---|---|---|
| dayahead | 10:00 UTC | DA1, DA2 (D+1) |
| strategic | 15:00 UTC | S1–S5 (D+2–D+7) |
| all | Both | All horizon groups |
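A configuration carrying the three settings above might look like this sketch; the class and field names are illustrative, not the tool's actual API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical configuration object mirroring the options described above.
@dataclass
class BacktestConfig:
    start: date                # first origin date
    end: date                  # last origin date
    run_mode: str              # "dayahead", "strategic", or "all"
    retrain_frequency: str     # "daily", "weekly", or "monthly"
    rolling_window_months: Optional[int] = None  # None = expanding window

config = BacktestConfig(
    start=date(2025, 6, 1),
    end=date(2026, 2, 1),
    run_mode="dayahead",
    retrain_frequency="weekly",
)
```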
Interpreting Results
Backtest MAE represents expected production accuracy under historical conditions. Key caveats:
- Hindsight data quality: Historical data has been cleaned and backfilled. Production data may have more gaps and delays.
- Feature availability lag: In production, some features (weather forecasts, commodity prices) may not be available at the exact origin time.
- Non-stationarity: Future market conditions may differ from backtest period conditions (new regulations, plant closures, demand shifts).
Treat backtest results as a lower bound on expected production error: live error metrics are typically 5–15% higher than their backtest counterparts.