EPF PROJECT · MARCH 2026 · CRISIS ANALYSIS

When Our Model Met a Crisis: Electricity Price Forecasting During the March 2026 Oil Shock

Prices hit 250 EUR/MWh. At the worst single hour, we predicted 78. Here's exactly why each generation of our architecture failed differently — and what it tells us about supervised learning when markets leave the training distribution entirely.

Jorge Lopez Lan · ~14 min read · Mar 25, 2026 · model-failure · distribution-shift · energy-markets

The Setup

In early March 2026, Iran escalated tensions in the Strait of Hormuz. Brent crude jumped 12% in four days. European TTF natural gas futures followed. Spanish electricity prices, which are often set by the marginal cost of gas-fired generation, spiked from a typical 20–50 EUR/MWh into territory the market hadn't seen since the 2022 energy crisis.

On March 4, the first day of the shock, the daily average hit 89.93 EUR/MWh. By March 9, hourly prices touched 250 EUR/MWh, a level the OMIE market hadn't cleared at in over a year. For eight days straight, prices were 2–5× their recent normal.

Our production model at the time was v4.3, a gradient boosting ensemble (HistGB + LightGBM + XGBoost) trained on 58 features including REE grid data, weather, and commodity prices. It had a rolling 150-day MAE of 14.47 EUR/MWh. On March 4, it predicted an average of 65.68 EUR/MWh against an actual of 89.93, a miss of 24 EUR on the daily average. On March 8, the worst single day, the average actual price was 120.44 EUR/MWh and the average prediction was 56.54 EUR/MWh: a 64 EUR/MWh miss on the daily average.

Figure: Daily avg actual price vs production forecast, Feb 21 – Mar 17, 2026. Series: actual price, v4.3 production forecast, v10.1 LSTM forecast. Crisis window: Mar 4–13.

Warning

On March 9 at 19:00 UTC — the worst single prediction of the entire crisis — actual price was 197 EUR/MWh. The model predicted 78 EUR. A miss of 119 EUR on a single hour. That's not a model failing gracefully. That's a structural ceiling being hit.

What the Model Saw

The 150-day rolling MAE of 14.47 EUR/MWh tells almost nothing useful about how the model behaves across different market conditions. The actual variance is extreme:

Normal MAE: 9.70 EUR/MWh

Crisis MAE: 38.11 EUR/MWh

Recovery MAE: 50.90 EUR/MWh

Normal markets: 9.70 EUR/MWh. Crisis: 38.11 EUR/MWh — nearly 4× worse. And recovery (Mar 14–17), when prices crashed back to 6–46 EUR/MWh, was the worst period of all at 50.90 EUR. The model kept predicting elevated prices for days after they'd collapsed, anchored to a crisis that was already over. The residual-1w baseline, a mechanism in later model versions, fixes this by adapting in 7 days — but v4.3 had no such mechanism.

v4.3 production model — MAE by period (EUR/MWh)

The 150-day rolling average (14.47 EUR) hides an extreme variance: about 1.5× better during calm weeks, 2.6× worse during the crisis, and 3.5× worse during recovery.

Insight

The 150-day backtest MAE (14.47) is real: it's the honest long-run average. But it hides a roughly 4× swing between calm and crisis conditions, and more than 5× between calm and recovery. A model that scores 9.70 in calm conditions and 50.90 during recovery is not one model; it's three models with the same weights behaving very differently.
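To make the regime split concrete, here is a minimal sketch of how a per-period MAE could be computed, followed by the ratio arithmetic on the figures reported above. The helper names (`mae`, `mae_by_period`) are hypothetical and not taken from the project code. The two sample rows are the daily averages quoted earlier for March 4 and March 8, so the number it prints is an MAE of daily averages, not the hourly crisis MAE.

```python
from datetime import date

def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(a - p) for a, p in pairs) / len(pairs)

def mae_by_period(rows, periods):
    """rows: (day, actual, predicted); periods: {name: (start, end)}, inclusive."""
    out = {}
    for name, (start, end) in periods.items():
        pairs = [(a, p) for d, a, p in rows if start <= d <= end]
        if pairs:  # skip periods with no data instead of dividing by zero
            out[name] = mae(pairs)
    return out

# Period boundaries as used in this article.
PERIODS = {
    "normal": (date(2026, 2, 21), date(2026, 3, 3)),
    "crisis": (date(2026, 3, 4), date(2026, 3, 13)),
    "recovery": (date(2026, 3, 14), date(2026, 3, 17)),
}

# Only the two daily averages quoted in the text; a real run would use hourly rows.
rows = [
    (date(2026, 3, 4), 89.93, 65.68),
    (date(2026, 3, 8), 120.44, 56.54),
]
result = mae_by_period(rows, PERIODS)
print(result)

# Ratio of each reported period MAE to the 150-day rolling MAE.
reported = {"normal": 9.70, "crisis": 38.11, "recovery": 50.90}
rolling = 14.47
ratios = {name: value / rolling for name, value in reported.items()}
print({name: round(r, 2) for name, r in ratios.items()})
```

Reporting the per-period numbers next to the rolling average is the cheap part; the useful habit is to never quote one without the other.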

How Each Model Generation Failed Differently

We didn't just run v4.3 against the crisis. We have full backtest data for every major architecture we developed, including the final v10.1 LSTM hybrid that became the new production model. Each one failed in a structurally distinct way.

Figure: Normal vs crisis MAE by model architecture (EUR/MWh, same dates). Series: Normal MAE (Feb 21–Mar 3), Crisis MAE (Mar 4–13).

Production (v4.3): trees only, no transform

Normal: 9.70 EUR · Crisis: 38.11 EUR · MaxPred: 171 EUR · Spike recall: 39.3%

Structural ceiling: leaf averaging could not output values above training range for this period. Recovery MAE explodes as model keeps predicting high prices after crash.

Price-Weighted (v7.0): 3× weight on >60 EUR hours

Normal: 7.42 EUR · Crisis: 32.19 EUR · MaxPred: 159 EUR · Spike recall: 46.4%

Weighting improved normal-market predictions but made crisis worse in aggregate. When OOD inputs arrive, a model that's been trained to 'care more' about expensive hours is confidently wrong rather than cautiously wrong.

Residual Transform (v7.2): predict deviation from 1w median

Normal: 7.95 EUR · Crisis: 13.78 EUR · MaxPred: 123 EUR

Transform shrinks the prediction target: model predicts deviation from a rising 1w baseline instead of raw prices. During crisis, the baseline itself rises rapidly, so the residuals to predict remain small. Dramatically lower crisis MAE — but MaxPred is still only 123 EUR. Every hour above that is structurally impossible.

LSTM + XGBoost (v10.1): task-aligned LSTM encoder

Normal: 8.01 EUR · Crisis: 25.86 EUR · MaxPred: 210 EUR · Spike recall: 57.1%

LSTM encoder learned the shape of full D+1 crisis curves from historical data. Its hidden state encodes 'sustained high-price regime' — something flat lag features can't represent. MaxPred jumps to 210 EUR. But the LSTM was trained on 2022–2025 crises, not this specific geopolitical event. 43% of >150 EUR hours still missed.

Maximum prediction ceiling per architecture (EUR/MWh)

Only v10.1 approaches the actual crisis price range. Every other architecture has a structural ceiling below 175 EUR.

Result

v10.1 (task-aligned LSTM + XGBoost) cut crisis MAE from v4.3's 38.11 EUR to 25.86 EUR and is the only architecture capable of predicting above 200 EUR/MWh. Spike recall improved from 39.3% (v4.3) to 57.1%: catching more than half of the worst hours instead of about two in five. Still not good enough to rely on during a crisis. But structurally closer to the truth. (v7.2 spike recall not shown: this metric was added to the evaluation framework after v7.2 was retired.)
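For readers who want to reproduce a spike-recall number on their own forecasts, here is one plausible definition as a sketch. The 150 EUR threshold comes from the text; counting an hour as "caught" when the forecast also exceeds the threshold is an assumption, and the project's evaluation framework may define it differently. The sample values are made up.

```python
def spike_recall(actual, predicted, threshold=150.0):
    """Share of hours with actual > threshold where the forecast is also > threshold.

    Assumed definition; returns None when there are no spike hours at all.
    """
    spike_idx = [i for i, a in enumerate(actual) if a > threshold]
    if not spike_idx:
        return None
    caught = sum(1 for i in spike_idx if predicted[i] > threshold)
    return caught / len(spike_idx)

# Made-up hourly values (EUR/MWh), for illustration only.
actual = [95.0, 160.0, 197.0, 250.0, 140.0]
predicted = [88.0, 155.0, 78.0, 171.0, 120.0]

recall = spike_recall(actual, predicted)
print(f"spike recall: {recall:.1%}")  # 2 of the 3 spike hours are caught
```

Note that a model whose maximum prediction sits below the threshold scores zero on this metric by construction, which is why MaxPred and spike recall move together in the comparison above.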

Why It Failed Structurally

There are three distinct failure modes at work — and they stack:

1. Out-of-Distribution Inputs

Every supervised model is implicitly bounded by its training distribution. When the model was trained, the highest-ever Spanish day-ahead price it saw was around 170 EUR/MWh (the 2022 crisis). When March 2026 pushed prices to 250 EUR/MWh, the feature values describing that state — gas prices, demand, generation mix — were combinations the model had never seen together at that magnitude.

Figure: The out-of-distribution gap, training range vs crisis prices. Training range: -4 EUR to ~170 EUR (leaves learn averages here). OOD gap, where a tree can't extrapolate: ~170 EUR up to the 250 EUR crisis peak. Max prediction: 171 EUR (v4.3), 210 EUR (v10.1).

2. Tree Leaf Averaging Creates a Hard Ceiling

Gradient boosting builds its prediction from regression trees. Each tree assigns an input to a leaf and returns a constant fitted to the training examples in that leaf; for a single tree fit on raw prices, that constant is the leaf's average target. When a leaf contains training examples with prices [80, 110, 130, 170 EUR], the prediction is ~122 EUR. Never 170. Never 200. Boosting sums many such trees, but every one of them is flat beyond its outermost split, so the ensemble is still structurally incapable of predicting values that are extreme relative to its training distribution: not because of the features, but because of how trees aggregate predictions.

The v4.3 model's maximum prediction during the crisis was 171 EUR. The actual maximum was 250 EUR. That 79 EUR gap isn't a tuning problem. It's the architecture's hard limit — which is why the only fix was adding an LSTM encoder that gives the trees a dense latent representation of the current price regime, expanding the effective prediction range to 210 EUR.
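The ceiling is easy to see on a toy example. The sketch below fits a single one-split regression tree (a stump) in plain Python; it is a deliberately simplified stand-in, not the production HistGB/LightGBM/XGBoost ensemble, and the gas and power numbers are made up. The top leaf ends up holding the same [80, 110, 130, 170] prices used in the example above.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature by minimizing squared error."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        thr = (lo + hi) / 2  # candidate split halfway between neighbors
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, thr, sum(left) / len(left), sum(right) / len(right))
    _, thr, left_mean, right_mean = best
    # Prediction is the mean target of whichever leaf the input falls into.
    return lambda x: left_mean if x <= thr else right_mean

# Toy data: gas price vs electricity price, both in EUR/MWh. Illustrative only.
gas = [20, 25, 30, 35, 60, 70, 80, 90]
power = [10, 15, 20, 25, 80, 110, 130, 170]

predict = fit_stump(gas, power)
print(predict(90))   # 122.5, the mean of the top leaf
print(predict(200))  # 122.5 again: far outside the training range, same leaf
```

Doubling the gas price changes nothing, because the input still lands in the same leaf. A boosted ensemble adds many more leaves and finer steps, but the flat region beyond the last split never goes away.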

3. No Exogenous Shock Signal

v4.3 included TTF spot gas prices as a feature. But spot TTF lags behind market expectations of a supply shock by days. The signal that prices were about to spike — forward curve widening, Brent futures jumping 12% — wasn't in any feature the model could see at 10:00 UTC the day before. By the time spot TTF reflected the crisis, the damage was already done.

Insight

The 2022 crisis improved every subsequent model because it expanded the training distribution. The March 2026 shock did the same: it added new high-price examples to the training data that future models will learn from. Crisis periods are not just hard evaluation cases — they are the most valuable training data the model will ever see.

What Would Have Helped

Being honest about this: none of the following would have solved the crisis-period problem. But they would have reduced it.

TTF Front-Month Forward Curve

The TTF 1-month forward price captures market expectations of future supply tightness days before spot TTF reflects it. A widening forward spread (forward more than 10% above spot) is an early warning signal for gas-driven price spikes. Adding this feature is the highest-priority improvement currently in the queue. On the specific Iran 2026 event, it likely would have cut crisis MAE over the first two days by 20–40%.
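As a rough sketch of what that feature could look like, the snippet below computes the relative forward spread and a warning flag using the 10% rule mentioned above. The function name and the sample TTF prices are hypothetical; this feature is not yet in the pipeline.

```python
def forward_spread_flags(spot, forward, threshold=0.10):
    """Per day: (relative spread, flag), flag True when forward > spot * (1 + threshold)."""
    out = []
    for s, f in zip(spot, forward):
        spread = (f - s) / s
        out.append((spread, spread > threshold))
    return out

# Made-up TTF prices in EUR/MWh, for illustration only.
spot = [30.0, 30.5, 31.0, 36.0]
front_month = [30.6, 32.0, 35.5, 38.0]

flags = forward_spread_flags(spot, front_month)
for day, (spread, flag) in enumerate(flags):
    print(f"day {day}: spread {spread:+.1%} {'WARN' if flag else 'ok'}")
```

In this toy series the flag fires on day 2, one day before spot itself jumps. Both the raw spread and the boolean flag could be fed to the model; which one helps more is an open question until it is backtested.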

Regime Detection Layer

A volatility gate, flagging when the 7-day price standard deviation exceeds a threshold, could trigger a fallback to a more conservative prediction strategy, or blend in a model variant tuned to favor spike recall. The fundamental challenge: volatility is still low at crisis onset. By the time the gate fires, you've already missed the first two days.
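The lag is visible even in a toy version of the gate. The sketch below flags a day when the trailing 7-day standard deviation of daily prices exceeds a threshold; the 20 EUR threshold and the price series are assumptions chosen for illustration, not tuned values.

```python
from statistics import pstdev

def volatility_gate(daily_prices, window=7, threshold=20.0):
    """Flag each day whose trailing `window`-day price std exceeds `threshold`."""
    flags = []
    for i in range(len(daily_prices)):
        hist = daily_prices[max(0, i - window + 1): i + 1]
        # Only evaluate once a full window of history is available.
        flags.append(len(hist) == window and pstdev(hist) > threshold)
    return flags

# Made-up daily averages (EUR/MWh): seven calm days, then a shock starting on day 7.
prices = [35, 40, 30, 38, 42, 36, 33, 90, 110, 120]
gate = volatility_gate(prices)
print(gate)
```

Here the shock starts on day 7 but the gate first fires on day 8, once two crisis days are inside the window. Because a day-ahead forecast only sees prices up to the day it is made, the first forecast that can react is the one for day 9, which matches the two-day miss described above.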

Online Adaptation

Fine-tuning model weights during a crisis — using the last 48 hours of actual prices as additional training signal — would help the model adapt in near-real-time. This requires infrastructure changes (continuous retraining pipeline) and careful regularization to prevent the adapted model from forgetting its base behavior. Currently not implemented; would take weeks to build safely.
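Full fine-tuning is out of scope for a snippet, but a much lighter stand-in shows the idea of feeding the last 48 hours back in. The sketch below is not weight fine-tuning: it is a simple additive bias correction computed from recent residuals, with a hypothetical `strength` parameter playing the role of regularization. All names and numbers are illustrative.

```python
def recent_bias(actual, predicted, lookback=48):
    """Mean signed error (actual - predicted) over the last `lookback` hours."""
    errs = [a - p for a, p in zip(actual[-lookback:], predicted[-lookback:])]
    return sum(errs) / len(errs)

def adjust(forecast, bias, strength=0.5):
    """Shift each forecast hour by a fraction of the recent bias.

    strength=0 leaves the base forecast untouched; strength=1 applies the full bias.
    """
    return [f + strength * bias for f in forecast]

# Made-up last 48 hours: the model under-predicts by 60 EUR/MWh every hour.
actual_48h = [120.0] * 48
predicted_48h = [60.0] * 48

bias = recent_bias(actual_48h, predicted_48h)
print(adjust([70.0, 80.0], bias))  # base forecast shifted up by half the bias
```

The same mechanism would also over-correct during the recovery, so a production version would need the careful regularization described above rather than a fixed `strength`.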

Warning

The academic EPF literature has the same problem. Crisis robustness is an open research question across every supervised short-term electricity forecasting method. Our failure during the Iran shock is not unique to gradient boosting or to this system — it is the current state of the art.

Where We Are Now

By March 14, Iranian diplomatic channels reopened and oil futures reversed. Spanish prices collapsed from 120+ EUR to single digits within 72 hours. A new problem emerged: both v4.3 and v10.1 kept predicting high prices for several more days, anchored to a crisis that was already over.

Crisis tail and recovery — prices collapse while model over-predicts (Mar 6–17)

v4.3 recovery MAE: 50.90 EUR — worst of any period. Both models over-predicted for days after prices crashed; v10.1's LSTM "remembered" the crisis for longer (recovery MAE: 32.19 EUR).

v4.3's recovery MAE (50.90 EUR) was the worst period of the entire analysis — worse even than the crisis peak. The model had no mechanism to "forget" the elevated price environment. v10.1's LSTM encoder had the opposite problem: it had learned the crisis regime too well. Its hidden state held onto elevated crisis representations for days after the underlying cause had disappeared. Recovery MAE: 32.19 EUR.

The standout recovery result is actually v7.2: recovery MAE of just 17.77 EUR, the best of any architecture and well below v10.1's 32.19. The residual-1w transform in v7.2 is what makes the difference: the 1-week price baseline adapts in 7 days, so when prices collapse the baseline catches up quickly, and the residual the model needs to predict snaps back toward normal. The LSTM, which learns longer-range regime representations, adapts more slowly.

This is a meaningful finding: for recovery periods specifically, the simpler residual transform outperforms the LSTM hybrid. The production stack uses both (v10.1 = LSTM + residual transform), but the two mechanisms pull in different directions during rapid reversals.
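The residual-1w mechanism discussed here can be sketched in a few lines. This assumes the baseline is the median of the trailing 168 hourly prices; the exact baseline definition used in the project may differ (for example, a per-hour-of-day median), and the prices are made up.

```python
from statistics import median

def weekly_baseline(hourly_prices, hours=168):
    """Median of the trailing week of known hourly prices."""
    return median(hourly_prices[-hours:])

def to_residual(target_price, baseline):
    """Training target: deviation of the price from the baseline."""
    return target_price - baseline

def from_residual(predicted_residual, baseline):
    """Final forecast: baseline plus the model's predicted deviation."""
    return baseline + predicted_residual

# Made-up regimes (EUR/MWh): a calm week, a crisis week, then a collapse.
calm_week = [35.0] * 168
crisis_week = [120.0] * 168
collapsed_week = [10.0] * 168

history = calm_week + crisis_week
crisis_baseline = weekly_baseline(history)
print(from_residual(15.0, crisis_baseline))     # same +15 residual, crisis level

history += collapsed_week
recovered_baseline = weekly_baseline(history)
print(from_residual(15.0, recovered_baseline))  # same +15 residual, collapsed level
```

The model's job stays the same in both regimes, predicting a modest deviation, while the baseline does the heavy lifting of tracking the price level. Once a full week of post-crisis prices is in the window, no trace of the crisis remains in the baseline, which is the fast forgetting the LSTM's hidden state lacks.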

Live predictions are updating daily as the market normalizes.

What This Means for Electricity Price Forecasting

Supervised models trained on historical electricity prices learn a compressed, averaged representation of past market states. They are exceptional at interpolating within that distribution — predicting prices in conditions they've seen before. They are structurally limited at extrapolating into new regimes.

The March 2026 Iran oil shock was a new regime. Not because gas markets had never spiked before, but because the specific combination of geopolitical context, market timing, and price velocity was outside anything in three years of training data. No amount of feature engineering fixes this. More features don't help when the inputs themselves are OOD.

The trajectory from v4.3 → v7.0 → v7.2 → v10.1 shows real structural progress: MaxPred ceiling broke from 171 to 210 EUR. Spike recall improved from 39% to 57%. Crisis MAE dropped from 38 to 26 EUR. These are genuine improvements. But they also show the floor: even the best architecture we've built caught fewer than 3 in 5 spike hours during a genuine crisis.

The honest ceiling, from our analysis: on a window that includes a gas crisis, our estimated MAE floor for any ML approach built on publicly available features is approximately 10–15 EUR/MWh. Going below that requires either proprietary data (forward curves, order book, REMIT outage data) or information that simply doesn't exist in any public source at 10:00 UTC the day before a geopolitical event.

The Full Story

This article focused on what happens when things go wrong. The full story — 121 experiments, the structural diagnosis that revealed why trees compress predictions, the 48-experiment campaign to break through, and the LSTM architecture that changed everything — is in the companion piece.

I Tried to Predict Electricity Prices 7 Days Ahead. Here's How Close I Got. →

The methodology and full experiment history are open: github.com/JorgeLopezLan/epf-methodology. Live predictions at epf.productjorge.com.