EPF PROJECT · MARCH 2026

I Tried to Predict Electricity Prices 7 Days Ahead. Here's How Close I Got.

121 experiments. 10 major versions. From a single tree with 20 features to a task-aligned LSTM encoder that predicts above 200 EUR/MWh with less than 1 EUR of systematic bias — including the dead ends, the structural diagnosis, and the architecture shift that changed everything.

Jorge Lopez Lan · ~13 min read · Updated Mar 23, 2026 · machine-learning · electricity-prices · gradient-boosting
121 experiments across 10 major versions: v1.0 Single Model → v2.0 Ensemble → v3.0 Two Products → v4.1 Quantile + Weather → v4.3 Feature Selection → v5 Failed Experiments → v6 Deep Trees → v7.0 Price Weighting → v7.1 Neural Networks → v7.2 Residual Transform → v8–v9 PyTorch + Quantile Ens. → v10.0 LSTM Hybrid → v10.1 Task-Aligned LSTM ★

Electricity prices in Spain can swing from negative values to over 200 EUR/MWh in a single week. Every day, the OMIE market operator publishes tomorrow's hourly prices, and energy traders, renewable producers, and grid operators make decisions worth millions based on where those prices will land.

I wanted to know: as one person, using only publicly available data and open-source tools, how close could I get to predicting those prices — not just for tomorrow, but for the entire week ahead?

This article covers the full journey — 121 experiments across 6 weeks, from an initial 18 EUR/MWh error down to a new production model with best-ever bias and spike recall. Along the way I discovered why gradient boosting trees have a structural ceiling, ran a 48-experiment campaign to break through it, changed what the model predicts entirely, trained an LSTM encoder on the exact task it needed to solve, and ended up with a model that can predict prices above 200 EUR/MWh with less than 1 EUR of systematic bias.

121 total experiments: v1–v4 core versions (6) · v5 failed experiments (2) · v6 deep tree scouts (48) · v7.0 price weighting (20) · v7.1 neural networks (3) · v7.2 residual transform (4) · v8–v9 PyTorch + quantile ens. (12) · v10.0 LSTM hybrid scouts (5) · v10.1 task-aligned LSTM (18) · v10.2 verification (3)

Data, Features, and Products

Spain's grid operator (REE) publishes real-time electricity indicators through the ESIOS API — generation by source, demand, forecasts, and cross-border flows. Open-Meteo adds hourly weather for any coordinates at no cost. The result: three years of history across 16 electricity indicators, weather from 3 stations, and commodity futures — entirely free.

The Hard Way
Plausible-looking wrong data is more dangerous than a crash. When Spain completed its transition to 15-minute market resolution, the ESIOS API summed sub-hourly values instead of averaging them — silently reporting ~200 EUR/MWh prices that were really ~50. A NaN stops your pipeline immediately. A 4× inflated price within historical range corrupts training data for weeks without triggering any alarm.
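A cheap defense, sketched below with illustrative numbers and names (not the project's actual pipeline): re-aggregate the sub-hourly series yourself and compare it against the hourly value the API returns, so a sum-instead-of-average bug fails loudly instead of silently.

```python
import numpy as np

# One delivery hour at 15-minute resolution (illustrative values)
quarter_hour = np.array([48.0, 50.0, 52.0, 50.0])

wrong = quarter_hour.sum()    # 200.0: what the API briefly reported
right = quarter_hour.mean()   #  50.0: the true hourly price

def check_hourly(api_hourly, sub_hourly, tol=0.01):
    """Raise if the API's hourly value disagrees with our own aggregation."""
    expected = float(np.mean(sub_hourly))
    if abs(api_hourly - expected) > tol * max(abs(expected), 1.0):
        raise ValueError(
            f"hourly value {api_hourly} != mean of sub-hourly values {expected}"
        )

check_hourly(right, quarter_hour)   # passes silently
```

The check costs one extra API call per day and would have caught the 4× inflation on the first corrupted hour.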
v1.0 → v4.3
Feature Engineering · Feb 12 → Mar 5

From 20 initial features I expanded to 100 across 11 domain categories: price lags, generation mix, weather interactions, commodity dynamics. Two surprises stood out. Residual demand (total demand minus renewables minus nuclear) directly encodes what gas plants need to produce — almost a direct proxy for the marginal price. Temperature deviation squared captures the U-shaped heating and cooling demand curve better than the linear term alone.
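Both features are one-liners. A sketch with illustrative column names (not the project's actual schema); the 18 °C comfort point is also an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "demand_mw":  [28000.0, 33000.0, 36000.0],
    "wind_mw":    [9000.0, 4000.0, 2000.0],
    "solar_mw":   [0.0, 6000.0, 1000.0],
    "nuclear_mw": [7000.0, 7000.0, 7000.0],
    "temp_c":     [5.0, 18.0, 33.0],
})

# Residual demand: what dispatchable (mostly gas) plants must cover,
# a near-direct proxy for the marginal price
df["residual_demand"] = (df["demand_mw"] - df["wind_mw"]
                         - df["solar_mw"] - df["nuclear_mw"])

# Squared deviation from a comfort temperature captures the U-shaped
# heating + cooling demand curve better than the linear term alone
df["temp_dev_sq"] = (df["temp_c"] - 18.0) ** 2
```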

Feature count across versions

Feature selection in v4.3 cut 100 features down to 58 per horizon group. The selection revealed something unexpected: strategic horizons (D+2 to D+7) dropped every price lag feature entirely. At 2–7 day distances, yesterday's price at the same hour is noise, not signal.

Insight
Feature engineering for short-horizon and long-horizon forecasting are fundamentally different problems. Day-ahead kept 67 features; strategic kept only 50.
v3.0
Two Forecast Products · Feb 25

Predicting tomorrow (D+1) and next week (D+2–D+7) are different tasks. For D+1 you have yesterday's published prices — the single most important predictor. For strategic horizons, you don't. Two separate products, matching how the market actually works:

Day-Ahead (D+1) — generated ~10:00 UTC, before OMIE gate closure.
Strategic (D+2–D+7) — generated ~15:00 UTC, uses D+1 published prices.

The Models: Three Trees, One Average

v1.0 → v2.0
From One Model to an Ensemble · Feb 12–17

From day one, I used gradient boosting — a family of machine learning algorithms purpose-built for tabular data. I deliberately avoided deep learning. With ~30K rows and dozens of structured features, gradient boosting consistently outperforms neural networks on this kind of problem, while being interpretable, fast to train on a single CPU, and requiring no GPU.

The core idea is simple: start with a naive guess (the average price), then train a sequence of small decision trees where each one specifically focuses on correcting the mistakes of everything before it. The first tree learns the biggest patterns (night is cheap, day is expensive). The second tree learns what the first one got wrong (peak ramps, weekends). By tree 200 or 500, you're catching subtle residual patterns.

How gradient boosting works — each tree fixes the previous one's mistakes:

1. Start: predict the average price for every hour. Errors everywhere.
2. Tree 1: learns the biggest patterns (night vs. day). Smaller errors remain.
3. Tree 2: corrects Tree 1's mistakes (peak hours). Even smaller.
4. Tree N: catches residual patterns (weather, holidays). Minimal residual.

Final prediction = average + Tree 1 correction + Tree 2 correction + ... + Tree N correction
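The loop is short enough to demonstrate directly: fit each small tree on the residuals of everything before it. A toy sketch of the principle, not the production setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
hour = rng.integers(0, 24, 500)
# night cheap, day expensive, plus noise
price = 40.0 + 30.0 * (hour >= 8) + rng.normal(0, 5, 500)
X = hour.reshape(-1, 1).astype(float)

pred = np.full_like(price, price.mean())      # start: the naive average
for _ in range(50):
    residual = price - pred                   # what everything so far got wrong
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += 0.1 * stump.predict(X)            # small correction per tree

naive_mae = np.abs(price - price.mean()).mean()
boosted_mae = np.abs(price - pred).mean()     # far below the naive error
```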
HistGradientBoosting (scikit-learn) — histogram-based splits. Pro: native NaN handling, no external deps. Con: degrades with deep trees.

LightGBM (Microsoft) — leaf-wise tree growth. Pro: fastest training, best on large data. Con: can't exploit depth=12.

XGBoost (DMLC) — level-wise tree growth. Pro: best at depth=12, best spike recall. Con: slowest, most memory.

Ensemble = simple average. Equal weight across all three. Each model fails differently — XGBoost catches spikes that HistGB misses, LightGBM stays stable where XGBoost overreacts. Averaging smooths the edges.

In v1.0, I started with just HistGradientBoosting. By v2.0, all three were running as an ensemble — their simple average. XGBoost had the best spike recall, HistGB the most stable predictions, and LightGBM sat between them. Averaging smooths individual weaknesses. This three-model ensemble remained the production setup through v4.3 — until the deep trees campaign revealed something surprising about how these models differ at depth.
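Why equal-weight averaging helps, in miniature: when the models' errors are roughly independent, the average's error is smaller than any individual's. A synthetic sketch (the three "models" are noisy stand-ins, not real fitted ensembles):

```python
import numpy as np

rng = np.random.default_rng(1)
truth = rng.uniform(30, 120, 1000)
# three models with independent errors of similar size
preds = [truth + rng.normal(0, 10, truth.size) for _ in range(3)]
ensemble = np.mean(preds, axis=0)

def mae(p):
    return np.abs(p - truth).mean()

# the simple average beats every single model
individual = [mae(p) for p in preds]
combined = mae(ensemble)
```

With independent errors of standard deviation σ, the three-model average has error σ/√3 — the same arithmetic that made the v2.0 ensemble worthwhile while the models were evenly matched.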

Loss Function: A Small Shift That Changed Everything

v3.1 → v4.1
From MSE to Quantile Loss · Feb 26 → Mar 2

The default choices for regression are MSE or MAE loss, which target the conditional mean and the conditional median of the price distribution, respectively. But electricity prices are right-skewed — many moderate hours, a long tail of expensive spikes. The median sits structurally below the mean.

First, in v3.1, switching from MSE to MAE loss fixed the worst of the strategic bias:

Strategic bias: −6.94 EUR/MWh with MSE loss → −0.3 EUR/MWh with MAE loss — a 96% reduction.

Then in v4.1, I went further — switching to quantile loss at q=0.55, targeting the 55th percentile instead of the 50th. Just 5 percentage points of shift.

Right-skewed price distribution
Targeting the 55th percentile instead of the median corrects for the long right tail of price spikes

The result: day-ahead MAE improved by 18.2% (from 16.40 to 13.42 EUR/MWh), and bias dropped by 37%. This was the single biggest accuracy improvement in the entire project — not from a better model or more data, but from choosing the right loss function.

In the later scout campaigns, I systematically tested other quantile values (0.45, 0.50, 0.52, 0.60, 0.65). None beat 0.55. It's the sweet spot for this dataset's skew — enough to correct the structural underprediction without overshooting.


Bugs, Baseline, and Dead Ends

v4.3 → v5.1
Establishing the Baseline, Then Hitting Every Wall · Mar 5–17

The most instructive bug: during strategic backtesting, my data slicing excluded published D+1 prices — the number one feature (26.5% of total importance) was NaN in some runs. Each of the three tree models handled the missing value differently, so the ensemble was broken in a non-obvious way. After fixing the data slicing, strategic MAE dropped 49% (XGBoost: 40.25 → 20.40 EUR/MWh). Tree models don't fail gracefully on missing data — they silently route samples down their default NaN branch, producing inconsistent behavior rather than an error.
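A guard worth putting in front of any tree-model training or backtest run. A sketch, with a hypothetical column name standing in for the published D+1 price feature:

```python
import numpy as np
import pandas as pd

def assert_critical_features_present(df, critical):
    """Fail loudly if a critical feature is missing anywhere. Tree models
    would otherwise route NaNs down default branches and degrade silently."""
    for col in critical:
        n_missing = int(df[col].isna().sum())
        if n_missing:
            raise ValueError(f"{col}: {n_missing} missing values in training slice")

# hypothetical feature name, illustrating the broken backtest slice
frame = pd.DataFrame({"price_d1_published": [50.0, np.nan, 62.0]})
try:
    assert_critical_features_present(frame, ["price_d1_published"])
except ValueError as err:
    print(err)
```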

After that fix, v4.3 became the first production deployment — automated daily predictions, fully live.

v4.3 production: day-ahead MAE down 11.8% vs v3.0 · strategic MAE down 3.1% (best ever) · shape correlation and spread capture tracked alongside.

The nagging signs: bias at -10.97 EUR/MWh, errors worst at the most expensive hours, and every subsequent experiment bouncing off an invisible 13–14 EUR/MWh MAE ceiling. Before understanding why, I tried five approaches that all failed:

Rejected
v5.0 — Peak/off-peak split: Halving training samples per model degraded both. MAE worsened 10.1%. Peaks need surrounding context — splitting removed it.

v5.0b — Log-transform target: Log compression amplified peak underprediction when transformed back. MAE worsened 14.5%.

v5.1 — Crisis data weighting: Winsorizing 2022 crisis prices had zero effect (±0.04 MAE). The problem is structural, not historical.

v8.0 — PyTorch deep MLP (after v7.2, see below): MAE 28.13 — 2.2× worse than XGBoost. A flattened lag array destroys temporal structure that trees handle naturally as conditional splits.

v9.0 — Quantile ensemble (after v7.2, see below): Blending q=0.25/0.55/0.75 models. Best combination: MAE 14.48 — worse than v7.0's 12.69. Averaging compressed outputs produces compressed averages.

The Structural Diagnosis: Why Trees Compress Prices

v6.1 → v6.2
48 Experiments to Find the Root Cause · Mar 18

After the three failed v5 experiments, I stepped back and asked a different question. Instead of "what feature or config will fix this," I asked: why does every gradient boosting configuration I try converge to the same accuracy ceiling?

I ran a systematic scout campaign — 48 experiments in two days — designed not to improve accuracy, but to diagnose why it wasn't improving. The scouts tested everything: different decay rates, quantile values, new features (Fourier harmonics, renewable curtailment proxies), target transforms (residual from baseline, multiplicative ratios), and a completely different tree algorithm (CatBoost).

The answer was in the numbers. I computed the ratio of predicted price standard deviation to actual price standard deviation — a measure I called "range ratio." Across every single experiment, this number was almost identical:

The compression problem — predicted vs actual price range: actual prices run from −4 to 200 EUR (P99: 155 EUR), while predictions run from −22 to 127 EUR (P99: 99 EUR), about 73% of the actual range. The model never predicts in the top band at all.

Regardless of features, loss function, decay rate, or even the boosting algorithm, predictions captured only 73% of the actual price range. The model's maximum prediction was about 127 EUR/MWh, while actual prices regularly exceeded 200 EUR/MWh. The 99th percentile of predictions was 99 EUR/MWh; the 99th percentile of actuals was 155.

This is leaf averaging — a fundamental property of how decision trees work. When a leaf node contains training samples with prices [80, 100, 130, 170], the tree predicts their average (~120), never the extremes. Every sample in that leaf gets the same prediction. The more heterogeneous the leaf, the more compression happens. With depth=8 trees, the leaves aren't specialized enough to isolate truly extreme conditions.
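The compression is easy to reproduce with a single tree. Below, a depth-8 regression tree on synthetic skewed prices: its maximum prediction necessarily falls short of the maximum actual, and the range ratio (std of predictions over std of actuals) comes out below 1, both direct consequences of predicting leaf means.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
# skewed prices: a calm base plus occasional spikes driven by one feature
y = 60.0 + 25.0 * X[:, 0] + np.where(X[:, 0] > 1.5, 80.0 * (X[:, 0] - 1.5), 0.0)

tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=20).fit(X, y)
pred = tree.predict(X)

range_ratio = pred.std() / y.std()   # < 1: predictions are compressed
max_gap = y.max() - pred.max()       # > 0: the extremes are averaged away
```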

Where the Errors Live

The compression creates a distinctive error pattern that I could now see clearly in the data:

Error and bias by actual price level (EUR/MWh) — MAE and bias, where negative bias means underprediction.

The model overpredicts cheap hours (below 20 EUR/MWh) and severely underpredicts expensive hours. The critical range is 80–130 EUR/MWh, which makes up 34% of all hours but drives about 40% of total error. Above 130 EUR/MWh, the model essentially gives up — every prediction is too low by 45+ EUR/MWh.

The evening peak (hours 17–20) is where this hits hardest: bias of -15 to -22 EUR/MWh. These are exactly the hours that matter most for energy traders and battery operators. The model is least reliable precisely when accuracy matters most.

Breakthrough
The root cause is not missing features, not crisis data contamination, not the wrong loss function, and not post-processing errors. It is an inherent property of gradient boosting: tree leaf averaging is mean-regressive. When extremes share a leaf with moderate values, the average always pulls toward the center.

Breaking Through: Deep Trees

v6.2 → v6.3
The depth=12 Breakthrough · Mar 18–19

With the diagnosis in hand, the fix was conceptually simple: if leaves are too broad, make more leaves. More splits means more specialized leaf nodes. Instead of one leaf for "evening + high demand + winter" containing prices from 80 to 170, deeper trees create separate leaves for "evening + high demand + winter + high gas + low wind" (prices 130–170) and "evening + high demand + winter + moderate gas + some wind" (prices 80–110).

I increased tree depth from 8 to 12 and compensated with regularization to prevent overfitting: learning rate dropped from 0.05 to 0.03, L2 regularization (lambda) set to 0.3, and minimum child weight raised to 5 samples per leaf.
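Expressed as a parameter dict (keyword names follow XGBoost's standard parameters; a sketch of the configuration described above, not a copy of the production file):

```python
# v6.3: depth=12 with compensating regularization
deep_tree_params = {
    "max_depth": 12,          # up from 8: more, and more specialized, leaves
    "learning_rate": 0.03,    # down from 0.05 to offset the added capacity
    "reg_lambda": 0.3,        # L2 regularization on leaf weights
    "min_child_weight": 5,    # minimum hessian weight (~samples) per leaf
}
# used as: xgboost.XGBRegressor(**deep_tree_params)
```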

In the 90-day scout, deep trees alone dropped MAE to 12.99 — a 10.2% improvement over v4.3 production. When combined with time decay (d365, exponentially downweighting old data), it reached 12.55 MAE — a 13.3% improvement.

But here's the twist that made the v6 campaign valuable beyond just the MAE number: deep trees are XGBoost-specific. I ran the same depth=12 configuration on all three models. The results diverged dramatically:

Deep trees: only XGBoost benefits (150-day validation):

XGB d=12 + d365: 13.20 MAE (−8.8%)
XGB d=12, no decay: 13.83 (−4.4%)
LightGBM d=12 + d180: 14.27 (−1.4%)
v4.3 production: 14.47 (reference)
LightGBM d=12: 15.52
HistGB d=12 + d180: 16.53

XGBoost's level-wise growth strategy handles deep trees gracefully — it fills out each depth before going deeper, maintaining balanced structure. LightGBM's leaf-wise strategy already goes deep on difficult cases, so adding more depth gives diminishing returns and risks overfitting. HistGradientBoosting's histogram binning loses resolution at extreme depths — its 256-bucket discretization becomes a bottleneck when you need fine-grained splits.

This meant the three-model ensemble that had served me well through v4.3 was no longer optimal. A single XGBoost at depth=12 outperformed the three-model ensemble at depth=8. The ensemble was dragging down the best model.

The top configurations validated on a full 150-day backtest (October 2025 through February 2026). XGBoost depth=12 + decay365 + q=0.55 achieved 13.20 EUR/MWh MAE — an 8.8% improvement over production, capturing 70% of actual price variation. Notably, the 90-day scout had suggested decay180 was optimal; the 150-day window showed decay365 was better. Short backtests can mislead — always validate on longer windows.

The Compression-Breaking Campaign: v7.0

v7.0
Price-Weighted Training · Mar 19–20

Deep trees reduced compression but didn't eliminate it. The regression slope was stuck at 0.70 — for every 10 EUR of actual price variation, the model predicted only 7 EUR. I needed the model to pay more attention to expensive hours during training.

The idea: during training, multiply the sample weight of every hour where the actual price exceeds a threshold by a fixed multiplier. If the threshold is 60 EUR/MWh and the multiplier is 3x, then expensive hours count three times as much in the loss function. The model allocates more of its limited capacity to getting those hours right, at the cost of slightly worse accuracy on cheap hours where errors matter less.
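In code, price weighting composes with time decay as a single sample-weight vector. The decay parameterization below (a half-life form) is my assumption; the article names d365 without giving the exact formula:

```python
import numpy as np

def training_weights(prices, age_days, threshold=60.0, multiplier=3.0,
                     decay_days=365.0):
    """Sample weights: 3x on hours above the price threshold (v7.0),
    multiplied by an exponential time decay (assumed half-life form)."""
    w = np.where(np.asarray(prices) > threshold, multiplier, 1.0)
    return w * 0.5 ** (np.asarray(age_days) / decay_days)

# a cheap recent hour, an expensive year-old hour, an expensive 2-year-old hour
w = training_weights([35.0, 90.0, 150.0], [0.0, 365.0, 730.0])
```

The resulting vector is passed as `sample_weight` to each booster's `fit`.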

I ran 20 experiments across two waves, sweeping weight multipliers (2x, 3x, 4x, 5x), thresholds (60, 80, 100 EUR), and combinations with time decay.

v7.0 price weighting sweep — MAE vs spike recall tradeoff:

pw-3x + d365 + t60: 12.69 MAE, 70.7% spike recall
pw-3x + d365: 12.75, 69.4%
pw-3x + d365 + t100: 12.94, 70.6%
Monday features: 12.99, 68.9%
pw-2x: 13.12, 73.8%
pw-5x: 13.84, 76.0%
pw-3x: 14.09, 74.8%
v4.3 production: 14.47, ~40%

The best configuration — price weight 3x with threshold 60 EUR and decay365 — achieved 12.69 EUR/MWh MAE, a 12.3% improvement over v4.3 production. Spike recall jumped from ~40% to 70.7%. Critical-range (80–130 EUR) error dropped from 28.3 to 20.1 EUR/MWh.

The Non-Linear Interaction

The most surprising finding was that price weighting and time decay interact non-linearly. You'd expect their benefits to simply add up. They don't.

Non-linear interaction: price weight × time decay (MAE, EUR/MWh):

Weight 2x: 13.12 without decay → 13.27 with d365 (+0.15)
Weight 3x: 14.09 without decay → 12.75 with d365 (−1.34)
Weight 5x: 13.84 without decay → 13.71 with d365 (−0.13)

Only 3x benefits dramatically from time decay. At 2x, decay hurts. At 5x, aggressive weighting overwhelms it.

Weight 3x benefits dramatically from decay365 (MAE improves by 1.34 EUR/MWh). But weight 2x gets worse with decay (+0.15), and weight 5x barely benefits (-0.13). The explanation: at weight 3x, the model is allocating moderate extra capacity to peaks, and time decay helps by reducing the noise from old crisis data. At 2x, the peak allocation is too gentle and decay over-thins the training set. At 5x, aggressive peak weighting overwhelms everything else — the model is already ignoring cheap hours so much that decay can't help further.

This kind of non-linear interaction is why I ran 20 experiments instead of just trying the obvious combination. In machine learning optimization, you often can't predict which parameter combinations will interact well until you test them.

The Neural Net Detour (v7.1)

v7.1
MLP — Rejected · Mar 20

If tree leaf averaging is the structural limit, why not use a model that doesn't have leaves? I tested sklearn's MLPRegressor (256, 128, 64) on the same data. MAE was 37 EUR/MWh — 3× worse than XGBoost. But the MLP's maximum prediction was 207 EUR/MWh and its range ratio was 1.08 (it overshot the actual range). This proved that compression is model-specific, not data-specific. The features contain enough information to predict extreme prices; trees can't extract it due to leaf averaging; an MLP architecturally can — but sklearn's optimizer couldn't find the weights in 2–4 hour training runs with no mini-batching or batch normalization.

Rejected
The concept is validated — neural networks can break the compression barrier. The right architecture is LSTM, not MLP, and it needs to be trained on the right task. That's v10.

Changing What the Model Predicts (v7.2)

v7.2
Residual Target Transform · Mar 21

The structural diagnosis showed that tree leaf averaging compresses predictions because leaves average extreme training values. But there was another angle: what if the prediction target itself were smaller?

Instead of predicting a raw price (which can range from 0 to 250 EUR/MWh), predict the deviation from a rolling 1-week price baseline — then add the baseline back at inference time. The compression still happens, but the model is now averaging a target range of roughly ±50 EUR rather than ±250. The leaf containing [80, 100, 130, 170] is now targeting [-20, 0, +30, +70], and the compressed average of those deviations is far less damaging.
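A sketch of the transform on synthetic hourly data. Whether the 1-week baseline is a mean or a median isn't stated above, so the mean here is an assumption; the shift keeps the baseline strictly backward-looking:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2025-01-01", periods=24 * 28, freq="h")
price = pd.Series(
    60.0 + 15.0 * np.sin(np.arange(idx.size) / 24 * 2 * np.pi)
    + rng.normal(0, 5, idx.size),
    index=idx,
)

baseline = price.rolling(24 * 7).mean().shift(1)   # trailing 1-week baseline
target = price - baseline                          # what the model trains on
# at inference: predicted_price = model_output + baseline
reconstructed = target + baseline                  # exact inverse where defined
```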

The key design question was which baseline to use. A 4-week median seemed safer — smoother, more stable — but it degraded MAE to 15.13. The problem: a crisis that moves prices from 50 to 200 EUR in a week takes 3 more weeks to show up in the 4-week median. The 1-week baseline adapts in 7 days.

v7.2 structural metrics vs the raw-price model: regression slope up 5.5% (from 0.70), bias down 51%, maximum prediction up 18.5%.

The 80–130 EUR/MWh range — the critical bucket where most error lived — improved by 26%. The compression IS partly reversible by changing the prediction target.

The catch: overall MAE was 12.92 vs v7.0's 12.69. About 66% of hours are below 80 EUR where the raw model was already accurate. The transform trades low-price precision for high-price accuracy — a slightly negative trade on a calm window. It becomes a decisive positive when the market turns volatile. We'd find that out later.

Insight
Predicting a residual (deviation from baseline) instead of raw price is not just a target engineering trick — it's a diagnostic confirmation. The fact that it improved structural metrics while barely changing average MAE confirmed the compression was concentrated in the high-price regime.

Two deeper architecture attempts confirmed the same conclusion: a PyTorch residual MLP (v8.0, MAE 28.13) and a quantile ensemble (v9.0, MAE 14.48) both failed to break the slope ceiling. A flattened lag array destroys temporal structure. The compression problem is not about loss function, optimizer, or ensemble diversity — it's about how the model represents time. Every approach so far sees price history as a flat array, not as a sequence.

Teaching Trees to Remember (v10.0 + v10.1)

v10.0 → v10.1
LSTM-XGBoost Hybrid → Task-Aligned Production · Mar 21–22

v10.0 — The Architecture

The insight: don't replace the gradient boosting model with a recurrent network. Give the trees something they can't learn themselves — a compact summary of sequential price dynamics. Pre-train an LSTM encoder on 3 years of price history (given the last 168 hours, predict the next hour's price). After training, discard the output layer and keep the 64-dimensional hidden state. Feed those 64 numbers as additional input features to XGBoost alongside all hand-crafted domain features.
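A minimal PyTorch sketch of the encoder half of the hybrid. The 64-dim hidden state and 168-hour window come from the article; the single layer and univariate price input are assumptions:

```python
import torch
import torch.nn as nn

class PriceEncoder(nn.Module):
    """Pre-trained on a proxy task; afterwards the head is discarded and
    the 64-dim hidden state becomes extra input features for XGBoost."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # v10.0 target: next-hour price
        # (v10.1's task-aligned variant swaps this for a 24-output D+1 head)

    def forward(self, window):             # window: (batch, 168, 1)
        _, (h, _) = self.lstm(window)
        return self.head(h[-1])

    def embed(self, window):               # what the trees actually consume
        _, (h, _) = self.lstm(window)
        return h[-1].detach()

enc = PriceEncoder()
window = torch.randn(32, 168, 1)           # 32 samples of 168h price history
features = enc.embed(window)               # shape: (32, 64)
```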

On the 90-day calm scout, v10.0 achieved 13.12 MAE, -1.47 EUR bias, 81% spike recall — best structural metrics in the project's history. The residual_1w transform and LSTM embeddings complemented each other: the transform removes the regime baseline; the LSTM captures how prices have been deviating within that regime.

But on the full 150-day window (including Oct–Nov 2025 gas crisis), spike recall collapsed from 81% to 11%. The LSTM had been trained to predict the next-hour price — not what an entire day looks like during a structural market repricing.

Insight
The architecture was right. The training objective was wrong. A next-1h prediction task teaches the encoder what the market does on average. It doesn't teach it what to do when the gas market structurally reprices for weeks.

v10.1 — Task Alignment

v10.1
Task-Aligned LSTM — New Production · Mar 22

The generic LSTM learned "what usually happens in the next hour given recent history." What the model needed was an encoder that learned "what does the full D+1 price curve look like from here?"

Those two tasks look similar during calm trading. During crisis periods they diverge sharply. "What happens in the next hour" relies on short-range momentum and mean reversion. "What does the full next day look like" requires understanding regime state: is this a brief spike or a sustained crisis? A task-aligned encoder is pre-trained specifically to predict the full 24-hour D+1 price curve — and its internal representations encode exactly that.

A harder evaluation standard. Before comparing results, the evaluation window was switched from 90 days to 150 days — permanently. The shorter window covered December through March, a calm period. The 150-day window reaches back into October–November 2025, when Spanish day-ahead prices hit 170–247 EUR/MWh during the gas crisis. Any model that looks good only on calm months is not production-ready. On that harder window, v7.0 (the prior best) scores 15.76 MAE — not 12.69. All v10.1 comparisons are on this honest window.

v10.1 on the 150-day window: MAE 15.73 EUR/MWh · bias −0.65 EUR/MWh (−77% vs v7.0) · maximum prediction 209 EUR/MWh · spike recall 24.1%.

Compared to v7.0 on the same 150-day window (MAE 15.76, bias -2.82, MaxPred 188, spike recall 16.2%): essentially identical average error, but 77% less systematic bias, a model that can predict above 200 EUR, and 8 more percentage points of spike recall.

Insight
The 24.1% spike recall reflects performance on a 150-day window that includes many calm weeks. During an actual crisis — the March 2026 Iran oil shock — v10.1's spike recall was 57.1% on the crisis window specifically. Calm-window recall and crisis-window recall measure very different things; the 150-day average hides both. See the crisis analysis for the full breakdown.

The 90-day scout had shown the task-aligned model at 16.13 MAE vs XGBoost's 13.08 — apparently worse. The same lesson again, playing out in real time: scouts on calm windows mislead. The 150-day validation reversed the ranking on every structural metric.

Two final verification experiments (v10.2) confirmed the design boundaries. Feature selection on the LSTM's 64-dim embedding made results substantially worse: the dimensions are a dense latent space, not 64 independent features — removing any subset breaks the learned representation. And replacing the 1-week baseline with a 4-week baseline introduced +3–4 EUR systematic overprediction during post-crisis periods: the model "remembered" crisis prices for a full month after they passed. Both are clean rules, not ad-hoc patches.

Breakthrough
v10.1 achieves best-ever bias (−0.65 EUR/MWh), best-ever spike recall (24.1%), and a maximum prediction of 209 EUR/MWh — up from 127 EUR when compression was first diagnosed. The structural ceiling has moved significantly.

Results: Where Things Stand After 121 Experiments

The evaluation framework evolved alongside the models — from MAE/RMSE only (v1–v2), to adding bias and naive benchmarks (v3), to spike recall and shape correlation (v4), to regression slope and error-by-price-level decomposition (v6–v7). Each phase changed which problem to solve next. Tracking regression slope from v1.0 would have identified the tree ceiling weeks earlier.

The full MAE journey:

MAE evolution (EUR/MWh) — lower is better
Day-Ahead (D+1) Strategic (D+2–D+7)
Best day-ahead MAE: 12.69 EUR/MWh (−12.3% vs v4.3) · production MAE on the 150-day window: 15.73 EUR/MWh · spike recall: 70.7% (+76.8% vs v4.3).

The v4.2 regression is visible in the chart — adding 24 crisis features made things worse before feature selection recovered most of the damage in v4.3. The jump from v4.3 to v6.3 is the deep trees breakthrough. The final step from v6.3 to v7.0 is price-weighted training.

To put the final number in context: average Spanish electricity prices hover around 50–80 EUR/MWh, so 12.69 EUR/MWh error represents roughly 16–25% relative error. The model beats persistence (yesterday's price) by 53% and weekly seasonal baselines by 57%. It correctly identifies 70.7% of price spikes and captures the within-day price shape with 0.89 correlation — which is what battery storage operators actually need.

Two Numbers, One Explanation

This article references two MAE numbers: 12.69 (v7.0) and 15.73 (v10.1). These are not apples-to-apples. The 12.69 was measured on a 90-day calm window from December through March. The 15.73 is measured on a 150-day window that includes the Oct–Nov 2025 gas crisis, when prices hit 170–247 EUR/MWh. On that same harder window, v7.0 scores 15.76. The improvement in v10.1 is not in average accuracy — it is in structure: 77% less systematic underprediction, a model that can see above 200 EUR, and nearly 1.5x better spike recall.

Target transforms — done (v7.2)
Residual from 1-week baseline: slope +5.5%, bias -51%, MaxPred +18.5%. Foundation of the production stack.
PyTorch neural networks — tested (v8.0)
MAE 28.13, 2.2x worse. Tabular NNs lose to trees on flattened lag features. The concept is structural, not optimizer-dependent.
LSTM hybrid ensemble — done (v10.0, v10.1)
Task-aligned LSTM encoder + XGBoost is now the production stack. Best-ever bias and spike recall.
Full accuracy breakdowns by hour, horizon, and model — updated daily. See it live →

What 121 Experiments Taught Me

What you optimize matters more than how you optimize it. The two biggest structural improvements both came from asking "am I targeting the right thing?" — not from tuning the model. Switching from MAE to quantile loss at q=0.55 improved accuracy 18.2% (the single biggest gain in the project). Switching the prediction target from raw price to 1-week residual reduced bias by 51% and high-price MAE by 26%. Both required changing one line, not building a new architecture.

Diagnose first, and validate on the hard window. I spent weeks tweaking features and hyperparameters before running the 48-experiment scout campaign that identified leaf averaging as the structural limit. Once diagnosed, the fix was obvious. Short backtests also mislead — 90-day scouts pointed to decay180 as optimal; the 150-day window showed decay365 was better. Always validate on the window that includes the hard regime.

Non-obvious interactions dominate. Price weight 3x + time decay 365 is uniquely effective — a non-linear combination that outperforms either technique alone and outperforms nearby parameter values (2x or 5x weight, 180-day or no decay). You can't find these with theory. You find them by running systematic sweeps.

Failed experiments are informative. The peak/off-peak split taught me that peaks need surrounding context. The log transform taught me that compression in log-space amplifies underprediction. The MLP taught me the compression problem is model-specific, not data-specific. Each failure narrowed the search space for what would work.

One good model beats three average ones. The three-model ensemble was right through v4.3 — each model's weaknesses smoothed by the others. But once deep trees made XGBoost dramatically better than HistGB and LightGBM, the ensemble became a liability. Knowing when to stop ensembling is as important as knowing when to start.

Task alignment beats generic pre-training. The generic LSTM (trained to predict next-1h) and the task-aligned encoder (trained to predict the full D+1 curve) use identical architecture. Different training objective — 8 percentage points more spike recall on volatile periods. Any encoder borrowed from a proxy task should be re-aligned to the exact downstream problem.

How Close Can One Person Get?

Commercial electricity price forecasters achieve D+1 MAE of 5–10 EUR/MWh with proprietary data and dedicated teams. On a calm 90-day window I got to 12.69 with public data and open-source tools. On the harder 150-day window including a gas crisis, the production model sits at 15.73 — with less than 1 EUR of systematic bias. The structural problems that seemed intractable at the start — tree leaf averaging, extreme-price blindness, persistent underprediction — all have answers. Some are simple (change what you predict). Some are architectural (give trees temporal memory). None required proprietary data.

Most of the remaining gap to commercial forecasters likely comes from three things I don't have: proprietary weather forecast data (not historical observations), real-time market order book data, and years of domain knowledge about edge cases. All three are on the roadmap.

Since this post was published, the model was stress-tested by a real market crisis: the March 2026 Iran oil shock sent Spanish prices to 250 EUR/MWh. I wrote a detailed post-mortem on exactly how each model generation failed — and why. Read: When Our Model Met a Crisis →

The methodology, the code, and the live predictions are all open — updated to v10.1.

I got to 15.73 EUR/MWh on a volatile market. Can you do better?