When Our Model Met a Crisis: Electricity Price Forecasting During the March 2026 Oil Shock
Prices hit 250 EUR/MWh. At the worst single hour, we predicted 78. Here's exactly why each generation of our architecture failed differently — and what it tells us about supervised learning when markets leave the training distribution entirely.
The Setup
In early March 2026, Iran escalated tensions in the Strait of Hormuz. Brent crude jumped 12% in four days. European TTF natural gas futures followed. Spanish electricity prices — which are set each day by the marginal cost of gas-fired generation — spiked from a typical 20–50 EUR/MWh into territory the market hadn't seen since the 2022 energy crisis.
On March 4, the first day of the shock, the daily average hit 89.93 EUR/MWh. By March 9, hourly prices touched 250 EUR/MWh, a level the OMIE market hadn't cleared in over a year. For eight days straight, prices were 2–5× their recent normal.
Our production model at the time was v4.3, a gradient boosting ensemble (HistGB + LightGBM + XGBoost) trained on 58 features including REE grid data, weather, and commodity prices. It had a rolling 150-day MAE of 14.47 EUR/MWh. On March 4, it predicted an average of 65.68 EUR/MWh against an actual of 89.93, a miss of about 24 EUR/MWh on the daily average. On March 8, the worst single day, the average actual price was 120.44 EUR/MWh and the average prediction was 56.54 EUR/MWh: an average miss of about 64 EUR/MWh across the day's hours.
What the Model Saw
The 150-day rolling MAE of 14.47 EUR/MWh tells almost nothing useful about how the model behaves across different market conditions. The actual variance is extreme:
Normal markets: 9.70 EUR/MWh. Crisis: 38.11 EUR/MWh — nearly 4× worse. And recovery (Mar 14–17), when prices crashed back to 6–46 EUR/MWh, was the worst period of all at 50.90 EUR. The model kept predicting elevated prices for days after they'd collapsed, anchored to a crisis that was already over. The residual-1w baseline, a mechanism in later model versions, fixes this by adapting in 7 days — but v4.3 had no such mechanism.
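To make the point about aggregate MAE concrete, here is a minimal sketch of scoring a backtest per regime instead of over the whole window. The regime labels, the `mae_by_regime` helper, and the sample rows are illustrative assumptions, not the real backtest data or pipeline.

```python
from collections import defaultdict

def mae_by_regime(rows):
    """Return {regime: mean absolute error} for (regime, actual, predicted) rows."""
    abs_errors = defaultdict(list)
    for regime, actual, predicted in rows:
        abs_errors[regime].append(abs(actual - predicted))
    return {regime: sum(errs) / len(errs) for regime, errs in abs_errors.items()}

# Illustrative values only (EUR/MWh), not the real backtest.
rows = [
    ("normal", 40.0, 35.0),
    ("normal", 30.0, 38.0),
    ("crisis", 200.0, 150.0),
    ("crisis", 120.0, 90.0),
    ("recovery", 10.0, 60.0),
]

scores = mae_by_regime(rows)
overall = sum(abs(a - p) for _, a, p in rows) / len(rows)
print(scores)   # per-regime MAE
print(overall)  # single blended number that hides the spread
```

In this toy data the blended MAE sits between the regimes and describes none of them well, which is the same effect the 14.47 figure has on the real window.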
How Each Model Generation Failed Differently
We didn't just run v4.3 against the crisis. We have full backtest data for every major architecture we developed, including the final v10.1 LSTM hybrid that became the new production model. Each one failed in a structurally distinct way.
Why It Failed Structurally
There are three distinct failure modes at work — and they stack:
1. Out-of-Distribution Inputs
Every supervised model is implicitly bounded by its training distribution. When the model was trained, the highest-ever Spanish day-ahead price it saw was around 170 EUR/MWh (the 2022 crisis). When March 2026 pushed prices to 250 EUR/MWh, the feature values describing that state — gas prices, demand, generation mix — were combinations the model had never seen together at that magnitude.
2. Tree Leaf Averaging Creates a Hard Ceiling
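A regression tree makes a prediction by assigning each input to a leaf, then returning the average of the training targets in that leaf. When a leaf contains training examples with prices [80, 110, 130, 170 EUR], the prediction is 122.5 EUR. Never 170. Never 200. Gradient boosting sums many such trees, each fitted to the residuals of the ones before it, but every leaf value is still an average over training examples, so in practice the ensemble cannot reach far beyond the targets it was trained on. Trees are structurally unable to predict values that are extreme relative to their training distribution, not because of the features, but because of how they aggregate predictions.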
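The leaf-averaging effect can be sketched with a single one-split tree. This is a toy illustration, not the production ensemble: the `stump_predict` function, the gas-price split, and the low-gas leaf values are hypothetical, and only the high-gas leaf reuses the example prices from the text.

```python
from statistics import mean

# Training prices (EUR/MWh) that fell into each leaf of a one-split tree.
# The high-gas leaf reuses the example values from the text; the low-gas
# leaf and the split threshold are hypothetical.
LOW_GAS_LEAF = [25.0, 35.0, 45.0]
HIGH_GAS_LEAF = [80.0, 110.0, 130.0, 170.0]
GAS_THRESHOLD = 40.0  # hypothetical split point on a gas-price feature

def stump_predict(gas_price):
    """Route the input to a leaf and return the mean of that leaf's targets."""
    leaf = LOW_GAS_LEAF if gas_price <= GAS_THRESHOLD else HIGH_GAS_LEAF
    return mean(leaf)

print(stump_predict(60.0))   # 122.5
print(stump_predict(600.0))  # 122.5: a far more extreme input, same leaf, same output
```

However extreme the input, the output is capped by the average of the leaf it lands in, which is always at or below the largest training target in that leaf.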
The v4.3 model's maximum prediction during the crisis was 171 EUR. The actual maximum was 250 EUR. That 79 EUR gap isn't a tuning problem. It's the architecture's hard limit — which is why the only fix was adding an LSTM encoder that gives the trees a dense latent representation of the current price regime, expanding the effective prediction range to 210 EUR.
3. No Exogenous Shock Signal
v4.3 included TTF spot gas prices as a feature. But spot TTF lags behind market expectations of a supply shock by days. The signal that prices were about to spike — forward curve widening, Brent futures jumping 12% — wasn't in any feature the model could see at 10:00 UTC the day before. By the time spot TTF reflected the crisis, the damage was already done.
What Would Have Helped
Being honest about this: none of the following would have solved the crisis-period problem. But they would have reduced it.
TTF Front-Month Forward Curve
The TTF 1-month forward price captures market expectations of future supply tightness days before spot TTF reflects it. A widening forward spread (the forward trading more than 10% above spot) is an early warning signal for gas-driven price spikes. Adding this feature is the highest-priority improvement currently in the queue. On the specific Iran 2026 event, it likely would have cut the first two days of crisis MAE by 20–40%.
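A minimal sketch of the forward-spread warning as a binary feature, assuming spot and front-month prices are available in the same units. The function name `forward_spread_flag`, the `rel_threshold` parameter, and the sample prices are hypothetical; only the 10% rule comes from the text.

```python
def forward_spread_flag(spot, forward, rel_threshold=0.10):
    """True when the forward trades more than rel_threshold above spot."""
    if spot <= 0:
        raise ValueError("spot must be positive")
    return (forward - spot) / spot > rel_threshold

# Hypothetical TTF prices in EUR/MWh.
calm = forward_spread_flag(spot=30.0, forward=32.0)      # spread of about 6.7%
stressed = forward_spread_flag(spot=30.0, forward=34.0)  # spread of about 13.3%
print(calm, stressed)
```

A relative spread is used here rather than an absolute one so the same threshold behaves consistently across low-price and high-price gas regimes.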
Regime Detection Layer
A volatility gate — flagging when 7-day price standard deviation exceeds a threshold — could trigger a fallback to a more conservative prediction strategy, or blend in a model variant that's more conservative about spike recall. The fundamental challenge: volatility is still low at crisis onset. By the time the gate fires, you've already missed the first two days.
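The onset lag can be seen in a small sketch of such a gate. It assumes the gate for a given day may only use the previous seven daily average prices, which is one reading of the day-ahead constraint. The `volatility_gate` function, the 25 EUR/MWh threshold, and the price series are hypothetical.

```python
from statistics import pstdev

def volatility_gate(daily_prices, window=7, threshold=25.0):
    """For each day, flag whether the std of the previous `window` days exceeds threshold.

    Days without a full trailing window are never flagged.
    """
    flags = []
    for i in range(len(daily_prices)):
        if i < window:
            flags.append(False)
            continue
        trailing = daily_prices[i - window:i]  # excludes day i itself
        flags.append(pstdev(trailing) > threshold)
    return flags

# Hypothetical daily averages (EUR/MWh); the shock starts at index 7.
prices = [35, 40, 32, 38, 36, 41, 34, 90, 110, 120, 115]
flags = volatility_gate(prices)
first_fire = flags.index(True)
print(first_fire)  # 9: the gate fires two days after the shock begins
```

In this toy series the trailing window needs two crisis days in it before the standard deviation clears the threshold, so the first two shock days are forecast with the gate still closed.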
Online Adaptation
Fine-tuning model weights during a crisis — using the last 48 hours of actual prices as additional training signal — would help the model adapt in near-real-time. This requires infrastructure changes (continuous retraining pipeline) and careful regularization to prevent the adapted model from forgetting its base behavior. Currently not implemented; would take weeks to build safely.
Where We Are Now
By March 14, Iranian diplomatic channels reopened and oil futures reversed. Spanish prices collapsed from 120+ EUR to single digits within 72 hours. A new problem emerged: both v4.3 and v10.1 kept predicting high prices for several more days, anchored to a crisis that was already over.
v4.3's recovery MAE (50.90 EUR) was the worst period of the entire analysis — worse even than the crisis peak. The model had no mechanism to "forget" the elevated price environment. v10.1's LSTM encoder had the opposite problem: it had learned the crisis regime too well. Its hidden state held onto elevated crisis representations for days after the underlying cause had disappeared. Recovery MAE: 32.19 EUR.
The standout recovery result is actually v7.0: recovery MAE of just 17.77 EUR — the best of any architecture, well below v10.1's 32.19. The residual-1w transform in v7.0 is what makes the difference: the 1-week price baseline adapts in 7 days, so when prices collapse the baseline catches up quickly, and the residual the model needs to predict snaps back toward normal. The LSTM, which learns longer-range regime representations, adapts more slowly.
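A rough sketch of what a residual transform of this kind could look like. The exact definition used in v7.0 is not given here, so this assumes the baseline is simply the price seven steps earlier on a daily series; the function names and the price series are hypothetical.

```python
def to_residual(prices, lag=7):
    """Residual target: each price minus the price `lag` steps earlier (assumed baseline)."""
    return [prices[i] - prices[i - lag] for i in range(lag, len(prices))]

def from_residual(baseline, predicted_residual):
    """Map a predicted residual back to a price forecast."""
    return baseline + predicted_residual

# Hypothetical daily averages (EUR/MWh): calm week, crisis, then collapse.
prices = [40, 42, 38, 41, 39, 43, 40, 90, 110, 120, 115, 100, 20, 10]
residuals = to_residual(prices)
print(residuals)  # [50, 68, 82, 74, 61, -23, -30]

# Round trip: baseline plus residual recovers the original price.
rebuilt = [from_residual(prices[i], r) for i, r in enumerate(residuals)]
print(rebuilt == prices[7:])
```

Under this assumption the model learns deviations from a moving reference instead of absolute levels, and the reference itself contains no data older than one week.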
This is a meaningful finding: for recovery periods specifically, the simpler residual transform outperforms the LSTM hybrid. The production stack uses both (v10.1 = LSTM + residual transform), but the two mechanisms pull in different directions during rapid reversals.
What This Means for Electricity Price Forecasting
Supervised models trained on historical electricity prices learn a compressed, averaged representation of past market states. They are exceptional at interpolating within that distribution — predicting prices in conditions they've seen before. They are structurally limited at extrapolating into new regimes.
The March 2026 Iran oil shock was a new regime. Not because gas markets had never spiked before, but because the specific combination of geopolitical context, market timing, and price velocity was outside anything in three years of training data. No amount of feature engineering fixes this. More features don't help when the inputs themselves are OOD.
The trajectory from v4.3 → v7.0 → v7.2 → v10.1 shows real structural progress: MaxPred ceiling broke from 171 to 210 EUR. Spike recall improved from 39% to 57%. Crisis MAE dropped from 38 to 26 EUR. These are genuine improvements. But they also show the floor: even the best architecture we've built caught fewer than 3 in 5 spike hours during a genuine crisis.
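For readers who want to reproduce a metric like spike recall, here is one possible definition as a sketch. The definition used in the backtests is not spelled out here, so this assumes a spike hour is any hour whose actual price is at or above a fixed threshold, and that it counts as caught when the prediction also reaches that threshold. The 100 EUR/MWh threshold and the sample values are hypothetical.

```python
def spike_recall(actual, predicted, threshold=100.0):
    """Share of actual spike hours where the prediction also reached the threshold.

    Returns None when the window contains no spike hours.
    """
    spike_hours = [i for i, a in enumerate(actual) if a >= threshold]
    if not spike_hours:
        return None
    caught = sum(1 for i in spike_hours if predicted[i] >= threshold)
    return caught / len(spike_hours)

# Hypothetical hourly prices (EUR/MWh).
actual = [120.0, 150.0, 90.0, 200.0, 110.0]
predicted = [105.0, 95.0, 80.0, 130.0, 99.0]

recall = spike_recall(actual, predicted)
max_pred = max(predicted)
print(recall, max_pred)  # 0.5 130.0
```

Recall and the maximum prediction measure different things: a model can flag a spike hour while still badly underestimating its height, so both are worth tracking together.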
The honest ceiling, from our analysis: on a window that includes a gas crisis, the theoretical MAE floor for any ML approach built only on publicly available features is approximately 10–15 EUR/MWh. Going below that requires either proprietary data (forward curves, order book, REMIT outage data) or information that simply doesn't exist in any public source at 10:00 UTC the day before a geopolitical event.
This article focused on what happens when things go wrong. The full story — 121 experiments, the structural diagnosis that revealed why trees compress predictions, the 48-experiment campaign to break through, and the LSTM architecture that changed everything — is in the companion piece.
I Tried to Predict Electricity Prices 7 Days Ahead. Here's How Close I Got. →
The methodology and full experiment history are open: github.com/JorgeLopezLan/epf-methodology. Live predictions at epf.productjorge.com.