v10.1 — LSTM Encoder Optimisation + Ensemble Validation
Date: March 22, 2026 | Status: Scouted — Promotion Candidate Identified
Background
v10.0 established the LSTM-XGBoost hybrid architecture and found the best 90-day scout result (MAE 13.12) using a generic LSTM encoder. The 90-day window (Dec–Mar) was too calm to be conclusive: it excluded the volatile Oct–Nov 2025 gas crisis that drove prices to 170–247 EUR/MWh.
v10.1 runs all experiments on a 150-day window (2025-10-24 to 2026-03-21) — the same standard used to promote v7.0 — and tests three hypotheses:
- Task-aligned encoder: An LSTM trained to predict the full D+1 price curve should embed richer, task-relevant information than one trained to predict just the next hour
- Encoder variants: Wider (128-dim), longer-context (2-week window), exogenous inputs (demand + temperature alongside price)
- Ensemble strategies: Combining XGBoost and LSTM predictions with various weights to get the best of both
Important Finding: The 90-day Window Was Misleading
On the 90-day scout window, xgb-ref (the v7.0 config) achieved MAE 13.08; on the 150-day window it scores 15.76. The LSTM task-aligned encoder was 16.13 on 90 days but 15.73 on 150 days, where it beats xgb-ref on every single metric.
This is because the Oct–Nov period includes extreme price volatility that disproportionately challenges XGBoost's structural limitation of leaf averaging: a tree predicts the mean of a training leaf, so it cannot extrapolate beyond the price levels it has seen. The LSTM, precisely because it learns price trajectory patterns, handles these regime shifts better.
Phase A: Task-Aligned Encoder Variants (90-day scouts)
Hypothesis: encoder pre-trained to predict D+1 full daily price curve (96 quarter-hours) should learn more relevant embeddings than the generic “next-hour” pre-training objective.
| Experiment | Encoder | MAE | Bias | Slope | MaxPred | SpkRec |
|---|---|---|---|---|---|---|
| xgb-ref | None (v7.0 config) | 13.08 | -6.29 | 0.707 | 143 | 71.4% |
| task-aligned | Task-aligned 64-dim | 16.13 | +0.85 | 0.665 | 210 | 79.7% |
| ta-2week | 2-week window 64-dim | 16.35 | +0.49 | 0.660 | 188 | 78.5% |
| ta-128dim | Task-aligned 128-dim | 16.81 | +1.49 | 0.654 | 205 | 76.6% |
| ta-pw3x | Task-aligned + pw3x | 17.91 | +2.82 | 0.615 | 166 | 79.6% |
Key observation on the 90-day window: xgb-ref wins on MAE, but every LSTM variant reaches MaxPred 188–210 vs 143 for xgb-ref. Price weighting is confirmed incompatible with the LSTM (third time).
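The task-aligned pre-training objective from Phase A can be sketched as follows. This is a minimal illustration, assuming PyTorch and a one-week hourly input window (168 prices); the 2-layer, 64-dim encoder and the 96-quarter-hour D+1 target come from the text, everything else is an assumption.

```python
import torch
import torch.nn as nn

class TaskAlignedEncoder(nn.Module):
    """Pre-trained to predict the full D+1 price curve, not just the next hour."""

    def __init__(self, hidden_dim: int = 64, horizon: int = 96):
        super().__init__()
        # 2-layer LSTM over the univariate hourly price history
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim,
                            num_layers=2, batch_first=True)
        # Head maps the final hidden state to all 96 quarter-hours of D+1 at once
        self.head = nn.Linear(hidden_dim, horizon)

    def forward(self, prices: torch.Tensor) -> torch.Tensor:
        # prices: (batch, window, 1) -> predicted D+1 curve: (batch, 96)
        _, (h_n, _) = self.lstm(prices)
        return self.head(h_n[-1])

    def embed(self, prices: torch.Tensor) -> torch.Tensor:
        # After pre-training, this 64-dim state is what gets handed
        # to XGBoost as embedding features.
        _, (h_n, _) = self.lstm(prices)
        return h_n[-1]
```

The point of the design is that the pre-training loss already matches the downstream task shape (a full daily curve), so the embedding has to carry curve-level information rather than only next-step dynamics.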
Phase B: 150-day Full Validation — The Reversal
| Experiment | MAE | Bias | Slope | MaxPred | SpkRec | Crisis MAE |
|---|---|---|---|---|---|---|
| xgb-ref (v7.0) | 15.76 | -0.82 | 0.659 | 188 | 16.2% | 27.95 |
| LSTM task-aligned | 15.73 | -0.65 | 0.669 | 209 | 24.1% | 27.16 |
| LSTM 2-week | 16.04 | -0.42 | 0.661 | 188 | 21.4% | 27.95 |
LSTM task-aligned beats v7.0 on every metric at 150-day scale: MAE (-0.03), bias (-0.65 vs -0.82), slope (+0.010), MaxPred (+21 EUR), spike recall (+7.9pp), crisis MAE (-0.79).
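For reference, the scorecard columns used throughout these tables could be computed roughly as below. This is a sketch: the 150 EUR spike threshold and the definition of Slope (linear regression of predictions on actuals) are assumptions, not stated in the source.

```python
import numpy as np

def scorecard(actual: np.ndarray, pred: np.ndarray, spike_thr: float = 150.0):
    mae = np.mean(np.abs(pred - actual))
    bias = np.mean(pred - actual)            # negative = systematic underprediction
    # Slope of pred regressed on actual; 1.0 means the full dynamic range is captured
    slope = np.polyfit(actual, pred, 1)[0]
    max_pred = pred.max()                    # highest price the model will emit
    spikes = actual >= spike_thr
    # Fraction of actual spike hours where the prediction also clears the threshold
    spike_recall = (np.mean(pred[spikes] >= spike_thr)
                    if spikes.any() else float("nan"))
    return {"MAE": mae, "Bias": bias, "Slope": slope,
            "MaxPred": max_pred, "SpkRec": spike_recall}
```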
Phase C: Additional Base Models + Ensembles
Training 4 additional single models and 12 ensemble strategies revealed the full trade-off landscape:
| Model | MAE | Bias | Slope | MaxPred | SpkRec | Notes |
|---|---|---|---|---|---|---|
| base-xgb-nopw | 13.95 | -6.70 | 0.684 | 153 | 8.7% | Best MAE, worst structure |
| generic-res1w | 14.41 | -4.00 | 0.690 | 199 | 11.0% | Best promotion candidate |
| generic-plain | 14.44 | -6.55 | 0.669 | 139 | 3.3% | — |
| generic-d180 | 14.72 | -5.59 | 0.690 | 199 | 11.6% | — |
| ens-all-7 | 14.96 | -1.60 | 0.653 | 180 | 14.3% | Best bias among low-MAE |
| lstm-res1w-d180 | 15.24 | -1.40 | 0.664 | 192 | 17.7% | — |
| ens-20-80 (XGB+LSTM) | 15.44 | -1.03 | 0.665 | 205 | 18.9% | — |
| lstm-task-aligned | 15.73 | -0.65 | 0.669 | 209 | 24.1% | Best spike recall |
| xgb-ref | 15.76 | -0.82 | 0.659 | 188 | 16.2% | Current production |
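The ensemble rows above reduce to a weighted average of base-model predictions. A minimal sketch, assuming the ens-20-80 name means a fixed 20/80 weighting of the XGBoost and LSTM predictions (the source does not spell out the naming convention):

```python
import numpy as np

def blend(pred_xgb: np.ndarray, pred_lstm: np.ndarray,
          w_xgb: float = 0.2) -> np.ndarray:
    """Fixed-weight ensemble: ens-20-80 would be w_xgb=0.2."""
    return w_xgb * pred_xgb + (1.0 - w_xgb) * pred_lstm
```

Sweeping w_xgb over a grid against the 150-day window is one plausible way the 12 ensemble strategies were enumerated.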
The Structural Trade-off
These experiments confirm a fundamental trade-off in the Spanish DA electricity market:
| Dimension | Winner | Runner-up |
|---|---|---|
| Best overall MAE | XGB no price-weight (13.95) | Generic LSTM + res1w (14.41) |
| Best bias (near zero) | lstm-task-aligned (-0.65) | lstm-res1w-d180 (-1.40) |
| Best spike recall | LSTM task-aligned (24.1%) | ens-20-80 (18.9%) |
| Best MaxPred | LSTM task-aligned (209) | ens-20-80 (205) |
| Best balanced | Generic LSTM + res1w | ens-all-7 |
generic-res1w wins on “balance” because it improves MAE by 8.5% over v7.0 while raising MaxPred to 199 EUR (vs v7.0’s 188 EUR). It gives up spike recall (11.0% vs 16.2%) and bias (-4.00 vs -0.82) relative to v7.0, but these are accepted trade-offs for a production model that serves general-purpose accuracy.
Promotion Decision: generic-res1w
The model to promote is generic-res1w: the v10.0 generic LSTM encoder (2-layer, 64-dim, next-hour pre-training) combined with the residual_1w target transform (predict deviation from weekly same-hour median) and the deep-tree config (depth=12, min_child=5).
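The residual_1w transform named in the promoted config could be sketched as below, assuming an hourly pandas series: the model learns the deviation from the trailing-week same-hour median, and predictions are recovered by adding the baseline back. Column handling and the leakage-avoiding shift are illustrative assumptions.

```python
import pandas as pd

def to_residual_1w(prices: pd.Series) -> tuple[pd.Series, pd.Series]:
    """prices: hourly series with a DatetimeIndex.

    Returns (residual target, baseline); prediction = model(resid) + baseline.
    """
    # Same-hour median over the previous 7 days, shifted by one day so the
    # current value never leaks into its own baseline
    baseline = prices.groupby(prices.index.hour).transform(
        lambda s: s.shift(1).rolling(7, min_periods=1).median())
    return prices - baseline, baseline
```

Predicting the residual lets the trees spend their capacity on deviations from the weekly pattern instead of re-learning the daily shape, which is consistent with the MAE gain reported for the res1w variants.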
vs v7.0 production (150-day window):
| Metric | v7.0 production | generic-res1w | Change |
|---|---|---|---|
| MAE | 15.76 | 14.41 | -8.5% |
| Bias | -0.82 | -4.00 | Worse |
| Slope | 0.659 | 0.690 | +4.7% |
| MaxPred | 188 | 199 | +11 EUR |
| Spike recall | 16.2% | 11.0% | Worse |
Gains: MAE -8.5%, slope +4.7%, MaxPred +11 EUR. The model can reach higher prices and is more accurate on average.
Losses: Bias worsens from -0.82 to -4.00 (more systematic underprediction), spike recall drops from 16.2% to 11.0%.
Verdict: A meaningful step forward in accuracy and price range, at the cost of increased systematic bias. For a production model serving price forecasts to traders and analysts, the MAE improvement is primary. The bias can be partially compensated by users knowing the model tends to underpredict by ~4 EUR.
LSTM Ceiling Reached for Embeddings-as-Features
After v10.0 and v10.1 (18+ LSTM experiments), the LSTM-as-embeddings approach has been exhausted:
- Generic encoder: validated, ~8.5% MAE improvement, production candidate
- Task-aligned encoder: better structural metrics, worse MAE — not suitable for general forecasting
- 128-dim, 2-week window, exogenous: all worse than 64-dim generic
- Price weighting: always incompatible with LSTM (confirmed 3 times)
- Ensembles: marginal improvement (15.73 → 15.44 best)
The next frontier is not encoder architecture — it is what the model is asked to learn:
- Asymmetric loss — penalise underprediction more than overprediction, directly targeting the -4 to -10 EUR bias
- Zero/negative price features — Spanish prices frequently hit 0 EUR (solar saturation). The 0 EUR barrier and negative territory are market-structural features not encoded anywhere
- Spike classifier hybrid — a binary “is this a spike day?” model blended with a price-optimised regressor
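The asymmetric-loss idea could start from a custom gradient/hessian pair like the one below, which penalises underprediction more heavily in a squared-error objective. A sketch only: the 3.0 under-weight is an illustrative assumption to be tuned against the observed -4 to -10 EUR bias, and the exact callback signature differs between XGBoost's native and sklearn APIs.

```python
import numpy as np

def asymmetric_l2(under_weight: float = 3.0):
    """Squared error with heavier penalty when pred < actual (underprediction)."""
    def objective(pred: np.ndarray, y: np.ndarray):
        resid = pred - y                       # negative = underprediction
        w = np.where(resid < 0, under_weight, 1.0)
        grad = 2.0 * w * resid                 # d/dpred of w * resid**2
        hess = 2.0 * w                         # second derivative, piecewise
        return grad, hess
    return objective
```

Because the hessian stays positive, the objective remains well-behaved for tree boosting; the model is simply pushed harder to avoid landing below the actual price.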