v10.1 — LSTM Encoder Optimisation + Ensemble Validation
Date: March 22, 2026 | Status: Scouted — Promotion Candidate Identified
Background
v10.0 established the LSTM-XGBoost hybrid architecture and found the best 90-day scout result (MAE 13.12) using a generic LSTM encoder. The 90-day window (Dec–Mar) was too calm to be conclusive: it excluded the volatile Oct–Nov 2025 gas crisis that drove prices to 170–247 EUR/MWh.
v10.1 runs all experiments on a 150-day window (2025-10-24 to 2026-03-21) — the same standard used to promote v7.0 — and tests three hypotheses:
- Task-aligned encoder: An LSTM trained to predict the full D+1 price curve should embed richer, task-relevant information than one trained to predict just the next hour
- Encoder variants: Wider (128-dim), longer-context (2-week window), exogenous inputs (demand + temperature alongside price)
- Ensemble strategies: Combining XGBoost and LSTM predictions with various weights to get the best of both
Important Finding: The 90-day Window Was Misleading
On the 90-day scout window, xgb-ref (the v7.0 config) achieved MAE 13.08; on the 150-day window it scores 15.76. The LSTM task-aligned encoder was 16.13 on 90 days but 15.73 on 150 days, where it beats xgb-ref on every single metric.
This is because the Oct–Nov period includes extreme price volatility that disproportionately challenges XGBoost's structural limitation of leaf averaging: a tree predicts the mean of a training leaf, so it cannot extrapolate beyond the price levels it has seen. The LSTM, precisely because it learns price trajectory patterns, handles these regime shifts better.
Phase A: Task-Aligned Encoder Variants (90-day scouts)
Hypothesis: encoder pre-trained to predict D+1 full daily price curve (96 quarter-hours) should learn more relevant embeddings than the generic “next-hour” pre-training objective.
| Experiment | Encoder | MAE | Bias | Slope | MaxPred | SpkRec |
|---|---|---|---|---|---|---|
| xgb-ref | None (v7.0 config) | 13.08 | -6.29 | 0.707 | 143 | 71.4% |
| task-aligned | Task-aligned 64-dim | 16.13 | +0.85 | 0.665 | 210 | 79.7% |
| ta-2week | 2-week window 64-dim | 16.35 | +0.49 | 0.660 | 188 | 78.5% |
| ta-128dim | Task-aligned 128-dim | 16.81 | +1.49 | 0.654 | 205 | 76.6% |
| ta-pw3x | Task-aligned + pw3x | 17.91 | +2.82 | 0.615 | 166 | 79.6% |
Key observation on the 90-day window: xgb-ref wins on MAE, but every LSTM variant reaches MaxPred 188–210 vs 143 for xgb-ref. Price weighting is confirmed incompatible with the LSTM (third time).
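The task-aligned pre-training objective from Phase A can be sketched as follows. This is a minimal illustration, assuming PyTorch and a one-week hourly input window (168 prices); the 2-layer, 64-dim encoder and the 96-quarter-hour D+1 target come from the text, everything else is an assumption.

```python
import torch
import torch.nn as nn

class TaskAlignedEncoder(nn.Module):
    """Pre-trained to predict the full D+1 price curve, not just the next hour."""

    def __init__(self, hidden_dim: int = 64, horizon: int = 96):
        super().__init__()
        # 2-layer LSTM over the univariate hourly price history
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim,
                            num_layers=2, batch_first=True)
        # Head maps the final hidden state to all 96 quarter-hours of D+1 at once
        self.head = nn.Linear(hidden_dim, horizon)

    def forward(self, prices: torch.Tensor) -> torch.Tensor:
        # prices: (batch, window, 1) -> predicted D+1 curve: (batch, 96)
        _, (h_n, _) = self.lstm(prices)
        return self.head(h_n[-1])

    def embed(self, prices: torch.Tensor) -> torch.Tensor:
        # After pre-training, this 64-dim state is what gets handed
        # to XGBoost as embedding features.
        _, (h_n, _) = self.lstm(prices)
        return h_n[-1]
```

The point of the design is that the pre-training loss already matches the downstream task shape (a full daily curve), so the embedding has to carry curve-level information rather than only next-step dynamics.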
Phase B: 150-day Full Validation — The Reversal
| Experiment | MAE | Bias | Slope | MaxPred | SpkRec | Crisis MAE |
|---|---|---|---|---|---|---|
| xgb-ref (v7.0) | 15.76 | -0.82 | 0.659 | 188 | 16.2% | 27.95 |
| LSTM task-aligned | 15.73 | -0.65 | 0.669 | 209 | 24.1% | 27.16 |
| LSTM 2-week | 16.04 | -0.42 | 0.661 | 188 | 21.4% | 27.95 |
LSTM task-aligned beats v7.0 on every metric at 150-day scale: MAE (-0.03), bias (-0.65 vs -0.82), slope (+0.010), MaxPred (+21 EUR), spike recall (+7.9pp), crisis MAE (-0.79).
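For reference, the scorecard columns used throughout these tables could be computed roughly as below. This is a sketch: the 150 EUR spike threshold and the definition of Slope (linear regression of predictions on actuals) are assumptions, not stated in the source.

```python
import numpy as np

def scorecard(actual: np.ndarray, pred: np.ndarray, spike_thr: float = 150.0):
    mae = np.mean(np.abs(pred - actual))
    bias = np.mean(pred - actual)            # negative = systematic underprediction
    # Slope of pred regressed on actual; 1.0 means the full dynamic range is captured
    slope = np.polyfit(actual, pred, 1)[0]
    max_pred = pred.max()                    # highest price the model will emit
    spikes = actual >= spike_thr
    # Fraction of actual spike hours where the prediction also clears the threshold
    spike_recall = (np.mean(pred[spikes] >= spike_thr)
                    if spikes.any() else float("nan"))
    return {"MAE": mae, "Bias": bias, "Slope": slope,
            "MaxPred": max_pred, "SpkRec": spike_recall}
```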
Phase C: Additional Base Models + Ensembles
Training 4 additional single models and 12 ensemble strategies revealed the full trade-off landscape:
| Model | MAE | Bias | Slope | MaxPred | SpkRec | Notes |
|---|---|---|---|---|---|---|
| base-xgb-nopw | 13.95 | -6.70 | 0.684 | 153 | 8.7% | Best MAE, worst structure |
| generic-res1w | 14.41 | -4.00 | 0.690 | 199 | 11.0% | Best promotion candidate |
| generic-plain | 14.44 | -6.55 | 0.669 | 139 | 3.3% | — |
| generic-d180 | 14.72 | -5.59 | 0.690 | 199 | 11.6% | — |
| ens-all-7 | 14.96 | -1.60 | 0.653 | 180 | 14.3% | Best bias among low-MAE |
| lstm-res1w-d180 | 15.24 | -1.40 | 0.664 | 192 | 17.7% | — |
| ens-20-80 (XGB+LSTM) | 15.44 | -1.03 | 0.665 | 205 | 18.9% | — |
| lstm-task-aligned | 15.73 | -0.65 | 0.669 | 209 | 24.1% | Best spike recall |
| xgb-ref | 15.76 | -0.82 | 0.659 | 188 | 16.2% | Current production |
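The ensemble rows above reduce to a weighted average of base-model predictions. A minimal sketch, assuming the ens-20-80 name means a fixed 20/80 weighting of the XGBoost and LSTM predictions (the source does not spell out the naming convention):

```python
import numpy as np

def blend(pred_xgb: np.ndarray, pred_lstm: np.ndarray,
          w_xgb: float = 0.2) -> np.ndarray:
    """Fixed-weight ensemble: ens-20-80 would be w_xgb=0.2."""
    return w_xgb * pred_xgb + (1.0 - w_xgb) * pred_lstm
```

Sweeping w_xgb over a grid against the 150-day window is one plausible way the 12 ensemble strategies were enumerated.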
The Structural Trade-off
These experiments confirm a fundamental trade-off in the Spanish DA electricity market:
| Dimension | Winner | Runner-up |
|---|---|---|
| Best overall MAE | XGB no price-weight (13.95) | Generic LSTM + res1w (14.41) |
| Best bias (near zero) | lstm-task-aligned (-0.65) | lstm-res1w-d180 (-1.40) |
| Best spike recall | LSTM task-aligned (24.1%) | ens-20-80 (18.9%) |
| Best MaxPred | LSTM task-aligned (209) | ens-20-80 (205) |
| Best balanced | Generic LSTM + res1w | ens-all-7 |
generic-res1w wins on “balance” because it improves MAE by 8.5% over v7.0 while raising MaxPred to 199 EUR (vs v7.0’s 188 EUR). It gives up spike recall (11.0% vs 16.2%) and bias (-4.00 vs -0.82) relative to v7.0, but these are accepted trade-offs for a production model that serves general-purpose accuracy.
Promotion Decision: generic-res1w
The model to promote is generic-res1w: the v10.0 generic LSTM encoder (2-layer, 64-dim, next-hour pre-training) combined with the residual_1w target transform (predict deviation from weekly same-hour median) and the deep-tree config (depth=12, min_child=5).
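The residual_1w transform named in the promoted config could be sketched as below, assuming an hourly pandas series: the model learns the deviation from the trailing-week same-hour median, and predictions are recovered by adding the baseline back. Column handling and the leakage-avoiding shift are illustrative assumptions.

```python
import pandas as pd

def to_residual_1w(prices: pd.Series) -> tuple[pd.Series, pd.Series]:
    """prices: hourly series with a DatetimeIndex.

    Returns (residual target, baseline); prediction = model(resid) + baseline.
    """
    # Same-hour median over the previous 7 days, shifted by one day so the
    # current value never leaks into its own baseline
    baseline = prices.groupby(prices.index.hour).transform(
        lambda s: s.shift(1).rolling(7, min_periods=1).median())
    return prices - baseline, baseline
```

Predicting the residual lets the trees spend their capacity on deviations from the weekly pattern instead of re-learning the daily shape, which is consistent with the MAE gain reported for the res1w variants.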
vs v7.0 production (150-day window):
| Metric | v7.0 production | generic-res1w | Change |
|---|---|---|---|
| MAE | 15.76 | 14.41 | -8.5% |
| Bias | -0.82 | -4.00 | Worse |
| Slope | 0.659 | 0.690 | +4.7% |
| MaxPred | 188 | 199 | +11 EUR |
| Spike recall | 16.2% | 11.0% | Worse |
Gains: MAE -8.5%, slope +4.7%, MaxPred +11 EUR. The model can reach higher prices and is more accurate on average.
Losses: Bias worsens from -0.82 to -4.00 (more systematic underprediction), spike recall drops from 16.2% to 11.0%.
Verdict: A meaningful step forward in accuracy and price range, at the cost of increased systematic bias. For a production model serving price forecasts to traders and analysts, the MAE improvement is primary. The bias can be partially compensated by users knowing the model tends to underpredict by ~4 EUR.
LSTM Ceiling Reached for Embeddings-as-Features
After v10.0 and v10.1 (18+ LSTM experiments), the LSTM-as-embeddings approach has been exhausted:
- Generic encoder: validated, ~8.5% MAE improvement, production candidate
- Task-aligned encoder: better structural metrics, worse MAE — not suitable for general forecasting
- 128-dim, 2-week window, exogenous: all worse than 64-dim generic
- Price weighting: always incompatible with LSTM (confirmed 3 times)
- Ensembles: marginal improvement (15.73 → 15.44 best)
The next frontier is not encoder architecture — it is what the model is asked to learn:
- Asymmetric loss — penalise underprediction more than overprediction, directly targeting the -4 to -10 EUR bias
- Zero/negative price features — Spanish prices frequently hit 0 EUR (solar saturation). The 0 EUR barrier and negative territory are market-structural features not encoded anywhere
- Spike classifier hybrid — a binary “is this a spike day?” model blended with a price-optimised regressor
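The asymmetric-loss idea could start from a custom gradient/hessian pair like the one below, which penalises underprediction more heavily in a squared-error objective. A sketch only: the 3.0 under-weight is an illustrative assumption to be tuned against the observed -4 to -10 EUR bias, and the exact callback signature differs between XGBoost's native and sklearn APIs.

```python
import numpy as np

def asymmetric_l2(under_weight: float = 3.0):
    """Squared error with heavier penalty when pred < actual (underprediction)."""
    def objective(pred: np.ndarray, y: np.ndarray):
        resid = pred - y                       # negative = underprediction
        w = np.where(resid < 0, under_weight, 1.0)
        grad = 2.0 * w * resid                 # d/dpred of w * resid**2
        hess = 2.0 * w                         # second derivative, piecewise
        return grad, hess
    return objective
```

Because the hessian stays positive, the objective remains well-behaved for tree boosting; the model is simply pushed harder to avoid landing below the actual price.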