I Tried to Predict Electricity Prices 7 Days Ahead. Here's How Close I Got.
121 experiments. 10 major versions. From a single tree with 20 features to a task-aligned LSTM encoder that predicts above 200 EUR/MWh with less than 1 EUR of systematic bias — including the dead ends, the structural diagnosis, and the architecture shift that changed everything.
Electricity prices in Spain can swing from negative values to over 200 EUR/MWh in a single week. Every day, the OMIE market operator publishes tomorrow's hourly prices, and energy traders, renewable producers, and grid operators make decisions worth millions based on where those prices will land.
I wanted to know: as one person, using only publicly available data and open-source tools, how close could I get to predicting those prices — not just for tomorrow, but for the entire week ahead?
This article covers the full journey — 121 experiments across 6 weeks, from an initial 18 EUR/MWh error down to a new production model with best-ever bias and spike recall. Along the way I discovered why gradient boosting trees have a structural ceiling, ran a 48-experiment campaign to break through it, changed what the model predicts entirely, trained an LSTM encoder on the exact task it needed to solve, and ended up with a model that can predict prices above 200 EUR/MWh with less than 1 EUR of systematic bias.
Data, Features, and Products
Spain's grid operator (REE) publishes 16 real-time electricity indicators through the ESIOS API — generation by source, demand, forecasts, and cross-border flows. Open-Meteo adds hourly weather for any coordinates at no cost. Three years of history across those indicators, weather from 3 stations, and commodity futures — entirely free.
From 20 initial features I expanded to 100 across 11 domain categories: price lags, generation mix, weather interactions, commodity dynamics. Two surprises stood out. Residual demand (total demand minus renewables minus nuclear) directly encodes what gas plants need to produce — almost a direct proxy for the marginal price. Temperature deviation squared captures the U-shaped heating and cooling demand curve better than the linear term alone.
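To make these two features concrete, here's a minimal pandas sketch; the column names are assumptions, not the project's actual schema:

```python
import pandas as pd

def add_demand_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of two features described in the text.
    Column names (demand_mw, wind_mw, solar_mw, nuclear_mw, temp_c)
    are assumed, not the project's real schema."""
    out = df.copy()
    # Residual demand: what thermal (mostly gas) plants must cover,
    # a rough proxy for the marginal price setter.
    out["residual_demand_mw"] = (
        out["demand_mw"] - out["wind_mw"] - out["solar_mw"] - out["nuclear_mw"]
    )
    # Squared deviation from a comfort temperature captures the U-shaped
    # heating/cooling demand curve better than the linear term alone.
    comfort_c = 18.0  # assumed comfort point
    out["temp_dev_sq"] = (out["temp_c"] - comfort_c) ** 2
    return out
```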
Feature selection in v4.3 cut 100 features down to 58 per horizon group. The selection revealed something unexpected: strategic horizons (D+2 to D+7) dropped every price lag feature entirely. At 2–7 day distances, yesterday's price at the same hour is noise, not signal.
Predicting tomorrow (D+1) and next week (D+2–D+7) are different tasks. For D+1 you have yesterday's published prices — the single most important predictor. For strategic horizons, you don't. So the system ships two separate products, matching how the market actually works.
The Models: Three Trees, One Average
From day one, I used gradient boosting — a family of machine learning algorithms purpose-built for tabular data. I deliberately avoided deep learning. With ~30K rows and dozens of structured features, gradient boosting consistently outperforms neural networks on this kind of problem, while being interpretable, fast to train on a single CPU, and requiring no GPU.
The core idea is simple: start with a naive guess (the average price), then train a sequence of small decision trees where each one specifically focuses on correcting the mistakes of everything before it. The first tree learns the biggest patterns (night is cheap, day is expensive). The second tree learns what the first one got wrong (peak ramps, weekends). By tree 200 or 500, you're catching subtle residual patterns.
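The loop above can be sketched in a few lines. This is a toy squared-error version for intuition, not the tuned production setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit_predict(X, y, n_trees=200, lr=0.1, depth=3):
    """Bare-bones gradient boosting for squared error: start from the
    naive guess (the mean), then let each small tree fit the current
    residuals of everything that came before it."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_trees):
        residual = y - pred                            # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        pred = pred + lr * tree.predict(X)             # small corrective step
    return pred
```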
In v1.0, I started with just HistGradientBoosting. By v2.0, I had all three running as an ensemble — their simple average. XGBoost had the best spike recall, HistGB the most stable covariance, LightGBM sat between them. Averaging smooths individual weaknesses. This three-model ensemble served as production through v4.3 — until the deep trees campaign revealed something surprising about how these models differ at depth.
Loss Function: A Small Shift That Changed Everything
The default choice for regression is MAE or MSE loss, targeting the conditional median and the conditional mean of the price distribution, respectively. But electricity prices are right-skewed — many moderate hours, a long tail of expensive spikes. The median sits structurally below the mean.
First, in v3.1, switching from MSE to MAE loss fixed the worst of the strategic bias.
Then in v4.1, I went further — switching to quantile loss at q=0.55, targeting the 55th percentile instead of the 50th. Just 5 percentage points of shift.
The result: day-ahead MAE improved by 18.2% (from 16.40 to 13.42 EUR/MWh), and bias dropped by 37%. This was the single biggest accuracy improvement in the entire project — not from a better model or more data, but from choosing the right loss function.
In the later scout campaigns, I systematically tested other quantile values (0.45, 0.50, 0.52, 0.60, 0.65). None beat 0.55. It's the sweet spot for this dataset's skew — enough to correct the structural underprediction without overshooting.
Bugs, Baseline, and Dead Ends
The most instructive bug: during strategic backtesting, my data slicing excluded published D+1 prices — the number one feature (26.5% of total importance) was NaN in some runs. Each of the three tree models handled the missing value differently, so the ensemble was broken in a non-obvious way. After fixing the data slicing, strategic MAE dropped 49% (XGBoost: 40.25 → 20.40 EUR/MWh). Tree models don't fail gracefully on missing data — they silently route samples down their default NaN branch, producing inconsistent behavior rather than an error.
After that fix, v4.3 became the first production deployment — automated daily predictions, fully live.
The nagging signs: bias at -10.97 EUR/MWh, errors worst at the most expensive hours, and every subsequent experiment bouncing off an invisible 13–14 EUR/MWh MAE ceiling. Before understanding why, I tried four approaches that all failed:
v5.0b — Log-transform target: Log compression amplified peak underprediction when transformed back. MAE worsened 14.5%.
v5.1 — Crisis data weighting: Winsorizing 2022 crisis prices had zero effect (±0.04 MAE). The problem is structural, not historical.
v8.0 — PyTorch deep MLP (after v7.2, see below): MAE 28.13 — 2.2× worse than XGBoost. A flattened lag array destroys temporal structure that trees handle naturally as conditional splits.
v9.0 — Quantile ensemble (after v7.2, see below): Blending q=0.25/0.55/0.75 models. Best combination: MAE 14.48 — worse than v7.0's 12.69. Averaging compressed outputs produces compressed averages.
The Structural Diagnosis: Why Trees Compress Prices
After these failed experiments, I stepped back and asked a different question. Instead of "what feature or config will fix this," I asked: why does every gradient boosting configuration I try converge to the same accuracy ceiling?
I ran a systematic scout campaign — 48 experiments in two days — designed not to improve accuracy, but to diagnose why it wasn't improving. The scouts tested everything: different decay rates, quantile values, new features (Fourier harmonics, renewable curtailment proxies), target transforms (residual from baseline, multiplicative ratios), and a completely different tree algorithm (CatBoost).
The answer was in the numbers. I computed the ratio of predicted price standard deviation to actual price standard deviation — a measure I called "range ratio." Across every single experiment, this number was almost identical:
Regardless of features, loss function, decay rate, or even the boosting algorithm, predictions captured only 73% of the actual price range. The model's maximum prediction was about 127 EUR/MWh, while actual prices regularly exceeded 200 EUR/MWh. The 99th percentile of predictions was 99 EUR/MWh; the 99th percentile of actuals was 155.
This is leaf averaging — a fundamental property of how decision trees work. When a leaf node contains training samples with prices [80, 100, 130, 170], the tree predicts their average (~120), never the extremes. Every sample in that leaf gets the same prediction. The more heterogeneous the leaf, the more compression happens. With depth=8 trees, the leaves aren't specialized enough to isolate truly extreme conditions.
Where the Errors Live
The compression creates a distinctive error pattern that I could now see clearly in the data:
The model overpredicts cheap hours (below 20 EUR/MWh) and severely underpredicts expensive hours. The critical range is 80–130 EUR/MWh, which makes up 34% of all hours but drives about 40% of total error. Above 130 EUR/MWh, the model essentially gives up — every prediction is too low by 45+ EUR/MWh.
The evening peak (hours 17–20) is where this hits hardest: bias of -15 to -22 EUR/MWh. These are exactly the hours that matter most for energy traders and battery operators. The model is least reliable precisely when accuracy matters most.
Breaking Through: Deep Trees
With the diagnosis in hand, the fix was conceptually simple: if leaves are too broad, make more leaves. More splits means more specialized leaf nodes. Instead of one leaf for "evening + high demand + winter" containing prices from 80 to 170, deeper trees create separate leaves for "evening + high demand + winter + high gas + low wind" (prices 130–170) and "evening + high demand + winter + moderate gas + some wind" (prices 80–110).
I increased tree depth from 8 to 12 and compensated with regularization to prevent overfitting: learning rate dropped from 0.05 to 0.03, L2 regularization (lambda) set to 0.3, and minimum child weight raised to 5 samples per leaf.
In the 90-day scout, deep trees alone dropped MAE to 12.99 — a 10.2% improvement over v4.3 production. When combined with time decay (d365, exponentially downweighting old data), it reached 12.55 MAE — a 13.3% improvement.
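As a sketch, the deep-tree configuration translates into XGBoost parameters roughly like this. The depth, learning rate, lambda, and child weight come from the text; everything else is an illustrative assumption:

```python
import xgboost as xgb

# Deep-tree configuration as described above. n_estimators and the
# exact objective string are assumptions, not the project's settings.
deep_model = xgb.XGBRegressor(
    max_depth=12,                      # up from 8: more, narrower leaves
    learning_rate=0.03,                # down from 0.05 to compensate
    reg_lambda=0.3,                    # L2 regularization on leaf weights
    min_child_weight=5,                # roughly 5 samples' worth per leaf
    objective="reg:quantileerror",     # keep the q=0.55 loss (naming assumed)
    quantile_alpha=0.55,
    n_estimators=800,                  # assumed
)
```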
But here's the twist that made the v6 campaign valuable beyond just the MAE number: deep trees are XGBoost-specific. I ran the same depth=12 configuration on all three models. The results diverged dramatically:
XGBoost's level-wise growth strategy handles deep trees gracefully — it fills out each depth before going deeper, maintaining balanced structure. LightGBM's leaf-wise strategy already goes deep on difficult cases, so adding more depth gives diminishing returns and risks overfitting. HistGradientBoosting's histogram binning loses resolution at extreme depths — its 256-bucket discretization becomes a bottleneck when you need fine-grained splits.
This meant the three-model ensemble that had served me well through v4.3 was no longer optimal. A single XGBoost at depth=12 outperformed the three-model ensemble at depth=8. The ensemble was dragging down the best model.
The top configurations were validated on a full 150-day backtest (October 2025 through February 2026). XGBoost depth=12 + decay365 + q=0.55 achieved 13.20 EUR/MWh MAE — an 8.8% improvement over production, capturing 70% of actual price variation. Notably, the 90-day scout had suggested decay180 was optimal; the 150-day window showed decay365 was better. Short backtests can mislead — always validate on longer windows.
The Compression-Breaking Campaign: v7.0
Deep trees reduced compression but didn't eliminate it. The regression slope was stuck at 0.70 — for every 10 EUR of actual price variation, the model predicted only 7 EUR. I needed the model to pay more attention to expensive hours during training.
The idea: during training, multiply the sample weight of every hour where the actual price exceeds a threshold by a fixed multiplier. If the threshold is 60 EUR/MWh and the multiplier is 3x, then expensive hours count three times as much in the loss function. The model allocates more of its limited capacity to getting those hours right, at the cost of slightly worse accuracy on cheap hours where errors matter less.
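The weighting scheme itself is one line of NumPy, sketched here assuming an sklearn-style `sample_weight` interface:

```python
import numpy as np

def price_weights(y, threshold=60.0, multiplier=3.0):
    """Price-weighted training as described above: hours whose actual
    price exceeds `threshold` EUR/MWh count `multiplier` times as much
    in the training loss. 60 EUR / 3x is the winning v7.0 setting."""
    return np.where(y > threshold, multiplier, 1.0)

# Sketch of use with any sklearn-style booster:
#   model.fit(X_train, y_train, sample_weight=price_weights(y_train))
```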
I ran 20 experiments across two waves, sweeping weight multipliers (2x, 3x, 4x, 5x), thresholds (60, 80, 100 EUR), and combinations with time decay.
The best configuration — price weight 3x with threshold 60 EUR and decay365 — achieved 12.69 EUR/MWh MAE, a 12.3% improvement over v4.3 production. Spike recall jumped from ~40% to 70.7%. Critical-range (80–130 EUR) error dropped from 28.3 to 20.1 EUR/MWh.
The Non-Linear Interaction
The most surprising finding was that price weighting and time decay interact non-linearly. You'd expect their benefits to simply add up. They don't.
Weight 3x benefits dramatically from decay365 (MAE improves by 1.34 EUR/MWh). But weight 2x gets worse with decay (+0.15), and weight 5x barely benefits (-0.13). The explanation: at weight 3x, the model is allocating moderate extra capacity to peaks, and time decay helps by reducing the noise from old crisis data. At 2x, the peak allocation is too gentle and decay over-thins the training set. At 5x, aggressive peak weighting overwhelms everything else — the model is already ignoring cheap hours so much that decay can't help further.
This kind of non-linear interaction is why I ran 20 experiments instead of just trying the obvious combination. In machine learning optimization, you often can't predict which parameter combinations will interact well until you test them.
The Neural Net Detour (v7.1)
If tree leaf averaging is the structural limit, why not use a model that doesn't have leaves? I tested sklearn's MLPRegressor (256, 128, 64) on the same data. MAE was 37 EUR/MWh — 3× worse than XGBoost. But the MLP's maximum prediction was 207 EUR/MWh and its range ratio was 1.08 (it overshot the actual range). This proved that compression is model-specific, not data-specific. The features contain enough information to predict extreme prices; trees can't extract it due to leaf averaging; an MLP architecturally can — but sklearn's optimizer couldn't find the weights in 2–4 hour training runs with no mini-batching or batch normalization.
Changing What the Model Predicts (v7.2)
The structural diagnosis showed that tree leaf averaging compresses predictions because leaves average extreme training values. But there was another angle: what if the prediction target itself were smaller?
Instead of predicting a raw price (which can range from 0 to 250 EUR/MWh), predict the deviation from a rolling 1-week price baseline — then add the baseline back at inference time. The compression still happens, but the model is now averaging a target range of roughly ±50 EUR rather than ±250. The leaf containing [80, 100, 130, 170] is now targeting [-20, 0, +30, +70], and the compressed average of those deviations is far less damaging.
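A minimal version of the target transform, assuming a shifted rolling mean as the 1-week baseline (the exact baseline statistic is my assumption):

```python
import pandas as pd

def to_residual_target(prices: pd.Series, window_hours: int = 168):
    """The v7.2 target transform: train on the deviation from a rolling
    1-week (168h) baseline, then add the baseline back at inference.
    The shift(1) keeps the baseline strictly backward-looking."""
    baseline = prices.rolling(window_hours, min_periods=1).mean().shift(1)
    residual = prices - baseline          # the new, much smaller target
    return residual, baseline

# At inference: predicted_price = model.predict(X) + baseline
```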
The key design question was which baseline to use. A 4-week median seemed safer — smoother, more stable — but it degraded MAE to 15.13. The problem: a crisis that moves prices from 50 to 200 EUR in a week takes 3 more weeks to show up in the 4-week median. The 1-week baseline adapts in 7 days.
The 80–130 EUR/MWh range — the critical bucket where most error lived — improved by 26%. The compression IS partly reversible by changing the prediction target.
The catch: overall MAE was 12.92 vs v7.0's 12.69. About 66% of hours are below 80 EUR where the raw model was already accurate. The transform trades low-price precision for high-price accuracy — a slightly negative trade on a calm window. It becomes a decisive positive when the market turns volatile. I'd find that out later.
Two deeper architecture attempts confirmed the same conclusion: a PyTorch residual MLP (v8.0, MAE 28.13) and a quantile ensemble (v9.0, MAE 14.48) both failed to break the slope ceiling. A flattened lag array destroys temporal structure. The compression problem is not about loss function, optimizer, or ensemble diversity — it's about how the model represents time. Every approach so far sees price history as a flat array, not as a sequence.
Teaching Trees to Remember (v10.0 + v10.1)
v10.0 — The Architecture
The insight: don't replace the gradient boosting model with a recurrent network. Give the trees something they can't learn themselves — a compact summary of sequential price dynamics. Pre-train an LSTM encoder on 3 years of price history (given the last 168 hours, predict the next hour's price). After training, discard the output layer and keep the 64-dimensional hidden state. Feed those 64 numbers as additional input features to XGBoost alongside all hand-crafted domain features.
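A minimal PyTorch sketch of the encoder idea; layer choices beyond the 168-hour window and the 64-dim state are assumptions:

```python
import torch
import torch.nn as nn

class PriceEncoder(nn.Module):
    """Sketch of the v10.0 idea: an LSTM pre-trained to predict the
    next hour's price from the last 168 hours. After pre-training,
    the head is discarded and the 64-dim hidden state is fed to
    XGBoost as extra tabular features."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # next-hour price, pre-training only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 168, 1), the last week of hourly prices
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])

    @torch.no_grad()
    def embed(self, x: torch.Tensor) -> torch.Tensor:
        # The 64 numbers per window that become XGBoost input features.
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]                     # (batch, 64)
```

v10.1's task alignment keeps this architecture and changes only the pre-training objective: swap the head for `nn.Linear(hidden, 24)` and train against the full 24-hour D+1 curve.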
On the 90-day calm scout, v10.0 achieved 13.12 MAE, -1.47 EUR bias, 81% spike recall — best structural metrics in the project's history. The residual_1w transform and LSTM embeddings complemented each other: the transform removes the regime baseline; the LSTM captures how prices have been deviating within that regime.
But on the full 150-day window (including Oct–Nov 2025 gas crisis), spike recall collapsed from 81% to 11%. The LSTM had been trained to predict the next-hour price — not what an entire day looks like during a structural market repricing.
v10.1 — Task Alignment
The generic LSTM learned "what usually happens in the next hour given recent history." What the model needed was an encoder that learned "what does the full D+1 price curve look like from here?"
Those two tasks look similar during calm trading. During crisis periods they diverge sharply. "What happens in the next hour" relies on short-range momentum and mean reversion. "What does the full next day look like" requires understanding regime state: is this a brief spike or a sustained crisis? A task-aligned encoder is pre-trained specifically to predict the full 24-hour D+1 price curve — and its internal representations encode exactly that.
A harder evaluation standard. Before comparing results, the evaluation window was switched from 90 days to 150 days — permanently. The shorter window covered December through March, a calm period. The 150-day window reaches back into October–November 2025, when Spanish day-ahead prices hit 170–247 EUR/MWh during the gas crisis. Any model that looks good only on calm months is not production-ready. On that harder window, v7.0 (the prior best) scores 15.76 MAE — not 12.69. All v10.1 comparisons are on this honest window.
Compared to v7.0 on the same 150-day window (MAE 15.76, bias -2.82, MaxPred 188, spike recall 16.2%): essentially identical average error, but 77% less systematic bias, a model that can predict above 200 EUR, and 8 more percentage points of spike recall.
The 90-day scout had shown the task-aligned model at 16.13 MAE vs XGBoost's 13.08 — apparently worse. The same lesson again, playing out in real time: scouts on calm windows mislead. The 150-day validation reversed the ranking on every structural metric.
Two final verification experiments (v10.2) confirmed the design boundaries. Feature selection on the LSTM's 64-dim embedding made results substantially worse: the dimensions are a dense latent space, not 64 independent features — removing any subset breaks the learned representation. And replacing the 1-week baseline with a 4-week baseline introduced +3–4 EUR systematic overprediction during post-crisis periods: the model "remembered" crisis prices for a full month after they passed. Both are clean rules, not ad-hoc patches.
Results: Where Things Stand After 121 Experiments
The evaluation framework evolved alongside the models — from MAE/RMSE only (v1–v2), to adding bias and naive benchmarks (v3), to spike recall and shape correlation (v4), to regression slope and error-by-price-level decomposition (v6–v7). Each phase changed which problem to solve next. Tracking regression slope from v1.0 would have identified the tree ceiling weeks earlier.
The full MAE journey:
The v4.2 regression is visible in the chart — adding 24 crisis features made things worse before feature selection recovered most of the damage in v4.3. The jump from v4.3 to v6.3 is the deep trees breakthrough. The final step from v6.3 to v7.0 is price-weighted training.
To put the final number in context: average Spanish electricity prices hover around 50–80 EUR/MWh, so 12.69 EUR/MWh error represents roughly 16–25% relative error. The model beats persistence (yesterday's price) by 53% and weekly seasonal baselines by 57%. It correctly identifies 70.7% of price spikes and captures the within-day price shape with 0.89 correlation — which is what battery storage operators actually need.
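The naive benchmarks can be stated precisely in a few lines; an illustrative sketch, not the project's evaluation harness:

```python
import numpy as np

def naive_mae(prices: np.ndarray, lag_hours: int) -> float:
    """MAE of the naive benchmarks above: persistence (lag=24,
    yesterday's price at the same hour) or weekly seasonal (lag=168).
    `prices` is a contiguous hourly series."""
    return float(np.mean(np.abs(prices[lag_hours:] - prices[:-lag_hours])))

# naive_mae(hourly_prices, 24)   -> persistence baseline
# naive_mae(hourly_prices, 168)  -> weekly seasonal baseline
```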
Two Numbers, One Explanation
This article references two MAE numbers: 12.69 (v7.0) and 15.73 (v10.1). These are not apples-to-apples. The 12.69 was measured on a 90-day calm window from December through March. The 15.73 is measured on a 150-day window that includes the Oct–Nov 2025 gas crisis, when prices hit 170–247 EUR/MWh. On that same harder window, v7.0 scores 15.76. The improvement in v10.1 is not in average accuracy — it is in structure: 77% less systematic underprediction, a model that can see above 200 EUR, and nearly 1.5x better spike recall.
What 121 Experiments Taught Me
What you optimize matters more than how you optimize it. The two biggest structural improvements both came from asking "am I targeting the right thing?" — not from tuning the model. Switching from MAE to quantile loss at q=0.55 improved accuracy 18.2% (the single biggest gain in the project). Switching the prediction target from raw price to 1-week residual reduced bias by 51% and high-price MAE by 26%. Both required changing one line, not building a new architecture.
Diagnose first, and validate on the hard window. I spent weeks tweaking features and hyperparameters before running the 48-experiment scout campaign that identified leaf averaging as the structural limit. Once diagnosed, the fix was obvious. Short backtests also mislead — 90-day scouts pointed to decay180 as optimal; the 150-day window showed decay365 was better. Always validate on the window that includes the hard regime.
Non-obvious interactions dominate. Price weight 3x + time decay 365 is uniquely effective — a non-linear combination that outperforms either technique alone and outperforms nearby parameter values (2x or 5x weight, 180-day or no decay). You can't find these with theory. You find them by running systematic sweeps.
Failed experiments are informative. The peak/off-peak split taught me that peaks need surrounding context. The log transform taught me that compression in log-space amplifies underprediction. The MLP taught me the compression problem is model-specific, not data-specific. Each failure narrowed the search space for what would work.
One good model beats three average ones. The three-model ensemble was right through v4.3 — each model's weaknesses smoothed by the others. But once deep trees made XGBoost dramatically better than HistGB and LightGBM, the ensemble became a liability. Knowing when to stop ensembling is as important as knowing when to start.
Task alignment beats generic pre-training. The generic LSTM (trained to predict next-1h) and the task-aligned encoder (trained to predict the full D+1 curve) use identical architecture. Different training objective — 8 percentage points more spike recall on volatile periods. Any encoder borrowed from a proxy task should be re-aligned to the exact downstream problem.
How Close Can One Person Get?
Commercial electricity price forecasters achieve D+1 MAE of 5–10 EUR/MWh with proprietary data and dedicated teams. On a calm 90-day window I got to 12.69 with public data and open-source tools. On the harder 150-day window including a gas crisis, the production model sits at 15.73 — with less than 1 EUR of systematic bias. The structural problems that seemed intractable at the start — tree leaf averaging, extreme-price blindness, persistent underprediction — all have answers. Some are simple (change what you predict). Some are architectural (give trees temporal memory). None required proprietary data.
Most of the remaining gap to commercial forecasters likely comes from three things I don't have: proprietary weather forecast data (not historical observations), real-time market order book data, and years of domain knowledge about edge cases. All three are on the roadmap.
Since this post was published, the model was stress-tested by a real market crisis: the March 2026 Iran oil shock sent Spanish prices to 250 EUR/MWh. I wrote a detailed post-mortem on exactly how each model generation failed — and why. Read: When Our Model Met a Crisis →
I got to 15.73 EUR/MWh on a volatile market. Can you do better?