Walk-Forward Backtesting
Overview
Walk-forward backtesting simulates production conditions by stepping through historical dates, generating forecasts using only data that would have been available at each origin, and comparing predictions against realized prices. This produces realistic out-of-sample accuracy metrics that reflect actual deployment performance.
Why Not Standard Train/Test Splits?
Standard machine learning evaluation (random train/test split) violates temporal causality: the model may train on future data and predict past data. For time series forecasting, this creates overly optimistic accuracy estimates.
Walk-forward backtesting respects the arrow of time:
- Standard split: Random 80/20 → data leakage risk
- Walk-forward: Always train on past, predict future → no leakage
How It Works
Step 1: Define Backtest Range
- Start date: 2025-06-01
- End date: 2026-02-01
- Step size: 1 day
Step 2: For Each Origin Date
Date: 2025-09-15, 10:00 UTC (day-ahead run)
1. Training data: All data up to 2025-09-15 10:00 (expanding window — all available history)
2. Train models: DA1, DA2 horizon groups on this data
3. Predict: D+1 prices (2025-09-16 00:00–23:00)
4. Store: 24 predictions with model_name suffix "_backtest"
Step 3: Compare with Actuals
After the backtest completes, actual prices are backfilled and metrics computed:
For each prediction:
- error = predicted_price - actual_price
- |error| → contributes to MAE
- error² → contributes to RMSE
Expanding vs Rolling Window
Expanding Window (Default)
All available history is used for training:
- Origin 2025-06-01: train on 2024-01-01 to 2025-06-01
- Origin 2025-09-15: train on 2024-01-01 to 2025-09-15
- Origin 2026-01-01: train on 2024-01-01 to 2026-01-01
Training set grows over time, giving later origins more data. This matches production behavior, where models are retrained on all available history.
Rolling Window
Only the most recent N months are used:
- Window: 6 months
- Origin 2025-06-01: train on 2024-12-01 to 2025-06-01
- Origin 2025-09-15: train on 2025-03-15 to 2025-09-15
- Origin 2026-01-01: train on 2025-07-01 to 2026-01-01
Training set size is fixed, prioritizing recent patterns over older data. This is useful when market structure has changed (e.g., after a regulatory shift).
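The window selection above can be sketched as a small helper. The helper name, the `history_start` of 2024-01-01, and the calendar-month arithmetic are illustrative, taken from the example origins; day-of-month clamping for short months is omitted.

```python
from datetime import date
from typing import Optional, Tuple

def months_back(d: date, n: int) -> date:
    # Calendar-month arithmetic: same day-of-month, n months earlier.
    # (Clamping for short months, e.g. Mar 31 -> Feb, is omitted in this sketch.)
    y, m = divmod((d.year * 12 + d.month - 1) - n, 12)
    return d.replace(year=y, month=m + 1)

def training_window(origin: date,
                    history_start: date = date(2024, 1, 1),
                    rolling_months: Optional[int] = None) -> Tuple[date, date]:
    """Return (train_start, train_end) for one backtest origin.

    rolling_months=None selects the expanding window (all history);
    otherwise only the most recent N calendar months are used.
    """
    if rolling_months is None:
        return history_start, origin
    return max(history_start, months_back(origin, rolling_months)), origin

# Expanding window: all history up to the origin.
assert training_window(date(2025, 9, 15)) == (date(2024, 1, 1), date(2025, 9, 15))
# Rolling 6-month window: matches the example origins above.
assert training_window(date(2025, 9, 15), rolling_months=6) == \
    (date(2025, 3, 15), date(2025, 9, 15))
```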
Retrain Frequency
Full model retraining is expensive, so backtests support configurable retrain schedules:
| Frequency | Description | Trade-off |
|---|---|---|
| Daily | Retrain every origin | Most realistic, slowest |
| Weekly | Retrain once per week | Good balance |
| Monthly | Retrain once per month | Fast but less adaptive |
Between retraining dates, the most recently trained model is reused.
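The walk-forward loop with a retrain schedule can be sketched as follows. `train` and `predict` are hypothetical stand-ins for the real model interface, and the daily step and D+1 target mirror the day-ahead example above.

```python
from datetime import date, timedelta

def run_backtest(start: date, end: date, retrain_every_days: int,
                 train, predict):
    """Walk forward one origin per day, retraining every N days.

    `train(origin)` fits models on data available up to the origin and
    returns a model; `predict(model, origin)` returns the 24 hourly D+1
    predictions. Both are hypothetical callables.
    """
    model = None
    last_trained = None
    origin = start
    results = []
    while origin <= end:
        # Retrain on the first origin and whenever the schedule says so;
        # otherwise reuse the most recently trained model.
        if model is None or (origin - last_trained).days >= retrain_every_days:
            model = train(origin)
            last_trained = origin
        results.append((origin, predict(model, origin)))
        origin += timedelta(days=1)
    return results
```

With `retrain_every_days=7` this corresponds to the "Weekly" row in the table: one training run per week of origins, predictions every day.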
Backtest Outputs
Prediction Records
Each backtest prediction is stored with model_name suffix _backtest:
- model_name: "ensemble_backtest"
- run_mode: "dayahead"
This separates backtest results from live predictions in the database.
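A stored record might look like the sketch below. Only the model_name suffix and run_mode are documented above; every other field name is an assumption about the schema.

```python
# Hypothetical record shape; only model_name and run_mode are documented.
record = {
    "model_name": "ensemble_backtest",   # "_backtest" suffix marks backtest rows
    "run_mode": "dayahead",
    "origin_time": "2025-09-15T10:00:00Z",  # assumed field name
    "target_time": "2025-09-16T00:00:00Z",  # assumed field name
    "predicted_price": 87.4,                # assumed field name
}

# Backtest rows can then be separated from live rows by the suffix:
is_backtest = record["model_name"].endswith("_backtest")
```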
Aggregate Metrics
After completion, the backtest reports:
- Overall MAE, RMSE, MAPE
- MAE by horizon group (DA1, DA2, S1–S5)
- MAE by hour of day
- MAE by day of week
- Skill scores vs naive baselines
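A minimal sketch of the overall metrics and the skill score, assuming predictions and actuals arrive as (predicted, actual) pairs; the function names are illustrative.

```python
import math

def backtest_metrics(pairs):
    """Compute MAE, RMSE and MAPE (%) from (predicted, actual) price pairs."""
    errors = [p - a for p, a in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mape = sum(abs(e / a) for e, (_, a) in zip(errors, pairs)) / len(errors) * 100
    return mae, rmse, mape

def skill_score(model_mae, naive_mae):
    """Skill vs a naive baseline: 1 is perfect, 0 is no better than naive."""
    return 1.0 - model_mae / naive_mae

# Two toy predictions with errors +2 and -1:
mae, rmse, mape = backtest_metrics([(10.0, 8.0), (9.0, 10.0)])
assert mae == 1.5
```

The per-horizon, per-hour, and per-weekday breakdowns are the same computation applied to subsets of the pairs grouped by that key.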
Running a Backtest
A backtest is configured with a date range, run mode, and retrain frequency. The run mode controls which product is backtested:
| Mode | Origin | Horizons |
|---|---|---|
| dayahead | 10:00 UTC | DA1, DA2 (D+1) |
| strategic | 15:00 UTC | S1–S5 (D+2–D+7) |
| all | Both | All horizon groups |
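A configuration carrying the three settings above might look like this sketch; the class and field names are illustrative, not the tool's actual API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical configuration object mirroring the options described above.
@dataclass
class BacktestConfig:
    start: date                # first origin date
    end: date                  # last origin date
    run_mode: str              # "dayahead", "strategic", or "all"
    retrain_frequency: str     # "daily", "weekly", or "monthly"
    rolling_window_months: Optional[int] = None  # None = expanding window

config = BacktestConfig(
    start=date(2025, 6, 1),
    end=date(2026, 2, 1),
    run_mode="dayahead",
    retrain_frequency="weekly",
)
```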
Interpreting Results
Backtest MAE represents expected production accuracy under historical conditions. Key caveats:
- Hindsight data quality: Historical data has been cleaned and backfilled. Production data may have more gaps and delays.
- Feature availability lag: In production, some features (weather forecasts, commodity prices) may not be available at the exact origin time.
- Non-stationarity: Future market conditions may differ from backtest period conditions (new regulations, plant closures, demand shifts).
Treat backtest results as a lower bound on expected production error: live error metrics are typically 5–15% higher than their backtest counterparts.