
v10.1 — LSTM Encoder Optimisation + Ensemble Validation

Date: March 22, 2026 | Status: Scouted — Promotion Candidate Identified

Background

v10.0 established the LSTM-XGBoost hybrid architecture and found the best 90-day scout result (MAE 13.12) using a generic LSTM encoder. The 90-day window (Dec–Mar) was too calm to be conclusive: it excluded the volatile Oct–Nov 2025 gas crisis, which drove prices to 170–247 EUR/MWh.

v10.1 runs all experiments on a 150-day window (2025-10-24 to 2026-03-21) — the same standard used to promote v7.0 — and tests three hypotheses:

  1. Task-aligned encoder: An LSTM trained to predict the full D+1 price curve should embed richer, task-relevant information than one trained to predict just the next hour
  2. Encoder variants: Wider (128-dim), longer-context (2-week window), exogenous inputs (demand + temperature alongside price)
  3. Ensemble strategies: Combining XGBoost and LSTM predictions with various weights to get the best of both

Important Finding: The 90-day Window Was Misleading

On the 90-day scout window, xgb-ref (v7.0 config) achieved MAE 13.08. On the 150-day window it is 15.76. The LSTM task-aligned encoder was 16.13 on 90 days but 15.73 on 150 days — it beats xgb-ref on every single metric at the full scale.

This is because the Oct–Nov period includes extreme price volatility that disproportionately challenges XGBoost’s structural limitations (leaf averaging). The LSTM, precisely because it learns price trajectory patterns, handles these regime shifts better.
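The "leaf averaging" limitation can be made concrete with a toy sketch (not the production pipeline): a regression tree predicts the mean of the training targets in each leaf, so no prediction can ever exceed the maximum price seen in training, which is exactly what a crisis regime breaks.

```python
import numpy as np

# Toy regression "tree": one split; each leaf predicts the mean of its
# training targets -- the leaf-averaging mechanism behind XGBoost's ceiling.
def fit_stump(x, y, threshold):
    left_mean = y[x <= threshold].mean()
    right_mean = y[x > threshold].mean()
    return lambda q: np.where(q <= threshold, left_mean, right_mean)

# Training prices never exceed 120 EUR/MWh.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_train = np.array([40.0, 50.0, 60.0, 90.0, 110.0, 120.0])

stump = fit_stump(x_train, y_train, threshold=3.0)

# Even for an extreme unseen input, the output is a leaf mean, so it can
# never exceed max(y_train) -- a 247 EUR crisis price is unreachable.
print(stump(np.array([100.0])))  # right-leaf mean: (90 + 110 + 120) / 3
```

Deeper trees and boosting refine the leaves but do not change the bound: the ensemble of leaf means still cannot extrapolate above the training range, whereas a sequence model fitting trajectories can.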

Phase A: Task-Aligned Encoder Variants (90-day scouts)

Hypothesis: encoder pre-trained to predict D+1 full daily price curve (96 quarter-hours) should learn more relevant embeddings than the generic “next-hour” pre-training objective.

| Experiment | Encoder | MAE | Bias | Slope | MaxPred (EUR) | SpkRec |
|---|---|---|---|---|---|---|
| xgb-ref | None (v7.0 config) | 13.08 | -6.29 | 0.707 | 143 | 71.4% |
| task-aligned | Task-aligned 64-dim | 16.13 | +0.85 | 0.665 | 210 | 79.7% |
| ta-2week | 2-week window 64-dim | 16.35 | +0.49 | 0.660 | 188 | 78.5% |
| ta-128dim | Task-aligned 128-dim | 16.81 | +1.49 | 0.654 | 205 | 76.6% |
| ta-pw3x | Task-aligned + pw3x | 17.91 | +2.82 | 0.615 | 166 | 79.6% |

Key observation at 90 days: xgb-ref wins on MAE, but every LSTM variant reaches a MaxPred of 188–210 EUR vs xgb-ref's 143 EUR. Price weighting is confirmed incompatible with the LSTM (third confirmation).
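The task-aligned pre-training target above can be sketched as follows. This is a hypothetical illustration of the data layout only (`make_task_aligned_pairs` is not a name from the pipeline): instead of a single next-hour scalar, each training pair maps a history window to the full 96-quarter-hour D+1 curve.

```python
import numpy as np

# Quarter-hourly Spanish DA data: 96 steps per day.
STEPS_PER_DAY = 96

def make_task_aligned_pairs(prices, history_days=7):
    """Build (history window -> next-day curve) pairs for task-aligned
    pre-training. prices: 1-D quarter-hourly array, day-aligned."""
    n_days = len(prices) // STEPS_PER_DAY
    X, Y = [], []
    for d in range(history_days, n_days):
        lo = (d - history_days) * STEPS_PER_DAY
        mid = d * STEPS_PER_DAY
        X.append(prices[lo:mid])                   # encoder input: trailing week
        Y.append(prices[mid:mid + STEPS_PER_DAY])  # target: full D+1 curve
    return np.stack(X), np.stack(Y)

prices = np.arange(30 * STEPS_PER_DAY, dtype=float)  # 30 synthetic days
X, Y = make_task_aligned_pairs(prices)
print(X.shape, Y.shape)  # (23, 672) (23, 96) -- targets are 96-dim curves
```

The generic objective differs only in the target: `Y` collapses to the single next step rather than the 96-dim curve, which is why the task-aligned encoder must embed whole-day shape information.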

Phase B: 150-day Full Validation — The Reversal

| Experiment | MAE | Bias | Slope | MaxPred (EUR) | SpkRec | Crisis MAE |
|---|---|---|---|---|---|---|
| xgb-ref (v7.0) | 15.76 | -0.82 | 0.659 | 188 | 16.2% | 27.95 |
| LSTM task-aligned | 15.73 | -0.65 | 0.669 | 209 | 24.1% | 27.16 |
| LSTM 2-week | 16.04 | -0.42 | 0.661 | 188 | 21.4% | 27.95 |

LSTM task-aligned beats v7.0 on every metric at 150-day scale: MAE (-0.03), bias (-0.65 vs -0.82), slope (+0.010), MaxPred (+21 EUR), spike recall (+7.9pp), crisis MAE (-0.79).
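For readers reconstructing the scorecard, a minimal sketch of plausible metric definitions follows. These definitions are assumptions inferred from how the metrics are used in this log (bias negative means underprediction, slope 1.0 is ideal, spike recall is the fraction of actual spikes the model also flags), not the pipeline's actual implementation; the 150 EUR spike threshold is likewise an assumption.

```python
import numpy as np

def scorecard(y_true, y_pred, spike_threshold=150.0):
    """Hedged sketch of the scorecard metrics (definitions assumed)."""
    err = y_pred - y_true
    mae = np.mean(np.abs(err))               # mean absolute error, EUR/MWh
    bias = np.mean(err)                      # negative = systematic underprediction
    slope = np.polyfit(y_true, y_pred, 1)[0] # pred-vs-actual slope; 1.0 is ideal
    max_pred = y_pred.max()                  # highest price the model will emit
    spikes = y_true >= spike_threshold
    spike_recall = (np.mean(y_pred[spikes] >= spike_threshold)
                    if spikes.any() else np.nan)
    return {"MAE": mae, "Bias": bias, "Slope": slope,
            "MaxPred": max_pred, "SpkRec": spike_recall}

y_true = np.array([60.0, 80.0, 100.0, 200.0])
y_pred = np.array([65.0, 78.0, 95.0, 160.0])
print(scorecard(y_true, y_pred))
```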

Phase C: Additional Base Models + Ensembles

Training 4 additional single models and 12 ensemble strategies revealed the full trade-off landscape:

| Model | MAE | Bias | Slope | MaxPred (EUR) | SpkRec | Notes |
|---|---|---|---|---|---|---|
| base-xgb-nopw | 13.95 | -6.70 | 0.684 | 153 | 8.7% | Best MAE, worst structure |
| generic-res1w | 14.41 | -4.00 | 0.690 | 199 | 11.0% | Best promotion candidate |
| generic-plain | 14.44 | -6.55 | 0.669 | 139 | 3.3% | |
| generic-d180 | 14.72 | -5.59 | 0.690 | 199 | 11.6% | |
| ens-all-7 | 14.96 | -1.60 | 0.653 | 180 | 14.3% | Best bias among low-MAE |
| lstm-res1w-d180 | 15.24 | -1.40 | 0.664 | 192 | 17.7% | |
| ens-20-80 (XGB+LSTM) | 15.44 | -1.03 | 0.665 | 205 | 18.9% | |
| xgb-ref | 15.76 | -0.82 | 0.659 | 188 | 16.2% | Current production |
| lstm-task-aligned | 15.73 | -0.65 | 0.669 | 209 | 24.1% | Best spike recall |
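The weighted ensembles in the table reduce to a convex combination of base-model outputs; a minimal sketch of the ens-20-80 style blend (weights taken from the experiment name; the `blend` helper itself is hypothetical):

```python
import numpy as np

def blend(pred_xgb, pred_lstm, w_xgb=0.2):
    """Convex combination of two base-model predictions (ens-20-80 style)."""
    return w_xgb * pred_xgb + (1.0 - w_xgb) * pred_lstm

pred_xgb = np.array([80.0, 95.0, 140.0])   # illustrative quarter-hour prices
pred_lstm = np.array([85.0, 100.0, 180.0])
print(blend(pred_xgb, pred_lstm))  # values: 84.0, 99.0, 172.0
```

Because the blend is a weighted average, its MaxPred and spike recall sit between the two bases, which is why no ensemble in the table beats lstm-task-aligned on structure or base-xgb-nopw on MAE.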

The Structural Trade-off

These experiments confirm a fundamental trade-off in the Spanish DA electricity market:

| Dimension | Winner | Runner-up |
|---|---|---|
| Best overall MAE | XGB no price-weight (13.95) | Generic LSTM + res1w (14.41) |
| Best bias (near zero) | LSTM task-aligned (-0.65) | LSTM+res1w-d180 (-1.40) |
| Best spike recall | LSTM task-aligned (24.1%) | ens-20-80 (18.9%) |
| Best MaxPred | LSTM task-aligned (209) | ens-20-80 (205) |
| Best balanced | Generic LSTM + res1w | ens-all-7 |

generic-res1w wins on "balance" because it improves MAE by 8.5% over v7.0 while raising MaxPred to 199 EUR (vs v7.0's 188 EUR). It gives up spike recall (11.0% vs 16.2%) and bias (-4.00 vs -0.82) relative to v7.0 production, but these are accepted trade-offs for a production model that serves general-purpose accuracy.

Promotion Decision: generic-res1w

The model to promote is generic-res1w: the v10.0 generic LSTM encoder (2-layer, 64-dim, next-hour pre-training) combined with the residual_1w target transform (predict deviation from weekly same-hour median) and the deep-tree config (depth=12, min_child=5).
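The residual_1w transform as described can be sketched in a few lines. This is a hedged illustration under assumptions: hourly resolution for brevity (the production data is quarter-hourly) and a trailing 7-day window; the function name is hypothetical.

```python
import numpy as np

HOURS_PER_DAY = 24

def residual_1w(prices):
    """residual_1w sketch: target = price minus weekly same-hour median.

    prices: 1-D hourly array, day-aligned. Returns (residuals, baselines);
    the model trains on residuals and the baseline is added back at
    prediction time. The first 7 days lack a full trailing week (NaN).
    """
    n = len(prices)
    res = np.full(n, np.nan)
    base = np.full(n, np.nan)
    for t in range(7 * HOURS_PER_DAY, n):
        # Same hour of day over the trailing 7 days: 7 values, stride one day.
        same_hour = prices[t - 7 * HOURS_PER_DAY : t : HOURS_PER_DAY]
        base[t] = np.median(same_hour)
        res[t] = prices[t] - base[t]
    return res, base

prices = np.tile(np.arange(24.0), 14) + 100.0  # perfectly repeating daily shape
res, base = residual_1w(prices)
print(np.nanmax(np.abs(res)))  # 0.0 -- the median absorbs the repeating shape
```

The transform lets the tree model spend its capacity on deviations from the weekly pattern rather than re-learning the daily shape, which is consistent with the observed MaxPred gain: a large residual on top of a high baseline can exceed any single training price.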

vs v7.0 production (150-day window):

| Metric | v7.0 production | generic-res1w | Change |
|---|---|---|---|
| MAE | 15.76 | 14.41 | -8.5% |
| Bias | -0.82 | -4.00 | Worse |
| Slope | 0.659 | 0.690 | +4.7% |
| MaxPred (EUR) | 188 | 199 | +11 EUR |
| Spike recall | 16.2% | 11.0% | Worse |

Gains: MAE -8.5%, slope +4.7%, MaxPred +11 EUR. The model can reach higher prices and is more accurate on average.

Losses: Bias worsens from -0.82 to -4.00 (more systematic underprediction), spike recall drops from 16.2% to 11.0%.

Verdict: A meaningful step forward in accuracy and price range, at the cost of increased systematic bias. For a production model serving price forecasts to traders and analysts, the MAE improvement is primary. The bias can be partially compensated by users knowing the model tends to underpredict by ~4 EUR.

LSTM Ceiling Reached for Embeddings-as-Features

After v10.0 and v10.1 (18+ LSTM experiments), the LSTM-as-embeddings approach has been exhausted:

  • Generic encoder: validated, ~8.5% MAE improvement, production candidate
  • Task-aligned encoder: better structural metrics, worse MAE — not suitable for general forecasting
  • 128-dim, 2-week window, exogenous: all worse than 64-dim generic
  • Price weighting: always incompatible with LSTM (confirmed 3 times)
  • Ensembles: marginal improvement (15.73 → 15.44 best)

The next frontier is not encoder architecture — it is what the model is asked to learn:

  1. Asymmetric loss — penalise underprediction more than overprediction, directly targeting the -4 to -10 EUR bias
  2. Zero/negative price features — Spanish prices frequently hit 0 EUR (solar saturation). The 0 EUR barrier and negative territory are market-structural features not encoded anywhere
  3. Spike classifier hybrid — a binary “is this a spike day?” model blended with a price-optimised regressor
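The asymmetric-loss direction can be sketched concretely. This is a hypothetical example, not the chosen objective: an asymmetric squared error where underprediction costs a multiple of overprediction, in the gradient/hessian form a boosted-tree library's custom objective would need.

```python
import numpy as np

def asymmetric_sq_loss(y_true, y_pred, under_weight=3.0):
    """Asymmetric squared error: underprediction (pred < actual) costs
    under_weight times more than overprediction. under_weight=3.0 is an
    illustrative choice, not a tuned value."""
    err = y_pred - y_true
    w = np.where(err < 0, under_weight, 1.0)
    return np.mean(w * err ** 2)

def asymmetric_sq_objective(y_pred, y_true, under_weight=3.0):
    """Per-sample gradient and hessian of the loss above, the pair a
    gradient-boosting custom objective returns."""
    err = y_pred - y_true
    w = np.where(err < 0, under_weight, 1.0)
    grad = 2.0 * w * err
    hess = 2.0 * w
    return grad, hess

y_true = np.array([100.0, 100.0])
# Underpredicting by 10 now hurts 3x more than overpredicting by 10.
print(asymmetric_sq_loss(y_true, np.array([90.0, 110.0])))  # 200.0
```

Minimising this pushes predictions upward until the expected marginal cost of over- and underprediction balance, directly attacking the -4 to -10 EUR bias the single models show.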