I Tried to Predict Electricity Prices 7 Days Ahead. Here's How Close I Got.
100+ experiments. 11 major versions. From a single tree with 20 features to a clean single-XGBoost production model that beats every prior best on the structural metrics — including the dead ends, the structural diagnosis, the LSTM detour that turned out to be measuring broken code, and the v11.0 retrain that fixed it.
Electricity prices in Spain can swing from negative values to over 200 EUR/MWh in a single week. Every day, the OMIE market operator publishes tomorrow's hourly prices, and energy traders, renewable producers, and grid operators make decisions worth millions based on where those prices will land.
I wanted to know: as one person, using only publicly available data and open-source tools, how close could I get to predicting those prices — not just for tomorrow, but for the entire week ahead?
This article covers the full journey — 100+ experiments across 6 weeks, from an initial 18 EUR/MWh error down to a production model that beats every previous best on the structural metrics. Along the way I discovered why gradient boosting trees have a structural ceiling, ran a 48-experiment campaign to break through it, changed what the model predicts entirely, ran into a 3-week LSTM detour that turned out to be measuring broken code, and ended up with a clean single-XGBoost model that strictly dominates everything that came before.
The LSTM detour is not in the original version of this article. I'm leaving it in because the discovery — that two layered code-level bugs in our LSTM pipeline meant the entire v10.x experiment series measured zero-signal noise — is more useful as a methodology lesson than the original "task-aligned LSTM is the new architecture" framing that I shipped before catching the bugs. The story below tells what actually happened.
Data, Features, and Products
Spain's grid operator (REE) publishes 16 real-time electricity indicators through the ESIOS API — generation by source, demand, forecasts, and cross-border flows. Open-Meteo adds hourly weather from any coordinates at no cost. Three years of history across 16 electricity indicators, weather from 3 stations, and commodity futures — entirely free.
From 20 initial features I expanded to 100 across 11 domain categories: price lags, generation mix, weather interactions, commodity dynamics. Two surprises stood out. Residual demand (total demand minus renewables minus nuclear) directly encodes what gas plants need to produce — almost a direct proxy for the marginal price. Temperature deviation squared captures the U-shaped heating and cooling demand curve better than the linear term alone.
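As a minimal sketch of the two features called out above: the function names, the column semantics, and the 18 °C reference temperature are assumptions for illustration, not values taken from the project's code.

```python
import numpy as np

def residual_demand(demand_mw, renewables_mw, nuclear_mw):
    """Load left for dispatchable (mostly gas) plants after renewables and nuclear."""
    demand = np.asarray(demand_mw, dtype=float)
    return demand - np.asarray(renewables_mw, dtype=float) - np.asarray(nuclear_mw, dtype=float)

def temp_deviation_squared(temp_c, reference_c=18.0):
    """Squared distance from a reference temperature (18 C is an assumed value)."""
    deviation = np.asarray(temp_c, dtype=float) - reference_c
    return deviation ** 2

# Toy numbers, not real ESIOS or Open-Meteo data.
toy_residual = residual_demand([30000.0], [12000.0], [7000.0])
toy_temp_feature = temp_deviation_squared([8.0, 28.0])
print(toy_residual, toy_temp_feature)
```

A cold hour and a hot hour that sit equally far from the reference get the same squared value, which is how one feature can represent both heating and cooling demand.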
Feature selection in v4.3 cut 100 features down to 58 per horizon group. The selection revealed something unexpected: strategic horizons (D+2 to D+7) dropped every price lag feature entirely. At 2–7 day distances, yesterday's price at the same hour is noise, not signal.
Predicting tomorrow (D+1) and next week (D+2–D+7) are different tasks. For D+1 you have yesterday's published prices — the single most important predictor. For strategic horizons, you don't. Two separate products, matching how the market actually works:
The Models: Three Trees, One Average
From day one, I used gradient boosting — a family of machine learning algorithms purpose-built for tabular data. I deliberately avoided deep learning. With ~30K rows and dozens of structured features, gradient boosting consistently outperforms neural networks on this kind of problem, while being interpretable, fast to train on a single CPU, and requiring no GPU.
The core idea is simple: start with a naive guess (the average price), then train a sequence of small decision trees where each one specifically focuses on correcting the mistakes of everything before it. The first tree learns the biggest patterns (night is cheap, day is expensive). The second tree learns what the first one got wrong (peak ramps, weekends). By tree 200 or 500, you're catching subtle residual patterns.
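The loop described above can be sketched in a few lines. This is a toy illustration under simplifying assumptions (a single feature, depth-1 trees, squared-error residuals), not how XGBoost, LightGBM, or HistGradientBoosting are implemented; all names are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then let each stump correct what is still wrong."""
    base = float(y.mean())
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)          # fit the current mistakes
        pred += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, pred

# Toy data: cheap nights, expensive days.
hours = np.tile(np.arange(24.0), 5)
prices = np.where((hours >= 8) & (hours <= 20), 80.0, 30.0)
base, stumps, pred = boost(hours, prices)
print(np.mean((prices - base) ** 2), np.mean((prices - pred) ** 2))
```

The second printed error is lower than the first because every stump removes part of the residual left by the trees before it.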
In v1.0, I started with just HistGradientBoosting. By v2.0, I had three libraries running as an ensemble (HistGradientBoosting, XGBoost, and LightGBM), combined by simple averaging. XGBoost had the best spike recall, HistGB the most stable covariance, and LightGBM sat between them. Averaging smooths individual weaknesses. This three-model ensemble served as production through v4.3, until the deep trees campaign revealed something surprising about how these models differ at depth.
Loss Function: A Small Shift That Changed Everything
The default choice for regression is MSE or MAE loss. MSE targets the conditional mean of the price distribution; MAE targets the conditional median. But electricity prices are right-skewed: many moderate hours, a long tail of expensive spikes. The median sits structurally below the mean, so a median-targeting model underpredicts on average.
First, in v3.1, switching from MSE to MAE loss fixed the worst of the strategic bias:
Then in v4.1, I went further — switching to quantile loss at q=0.55, targeting the 55th percentile instead of the 50th. Just 5 percentage points of shift.
The result: day-ahead MAE improved by 18.2% (from 16.40 to 13.42 EUR/MWh), and bias dropped by 37%. This was the single biggest accuracy improvement in the entire project — not from a better model or more data, but from choosing the right loss function.
In the later scout campaigns, I systematically tested other quantile values (0.45, 0.50, 0.52, 0.60, 0.65). None beat 0.55. It's the sweet spot for this dataset's skew — enough to correct the structural underprediction without overshooting.
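To make the quantile shift concrete, here is a small sketch of the pinball (quantile) loss. The ten prices are invented for illustration; the point is only that on a right-skewed sample the best constant prediction at q=0.55 sits at or above the best one at q=0.50.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile loss: underprediction costs q per unit, overprediction costs 1 - q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Invented right-skewed sample: many moderate prices, a few spikes.
prices = np.array([20, 30, 40, 50, 60, 70, 90, 120, 160, 220], dtype=float)
candidates = np.arange(0.0, 251.0)

def best_constant(q):
    """Constant prediction (on a 1 EUR grid) that minimizes the pinball loss."""
    losses = [pinball_loss(prices, c, q) for c in candidates]
    return float(candidates[int(np.argmin(losses))])

best_q50 = best_constant(0.50)
best_q55 = best_constant(0.55)
print(best_q50, best_q55)
```

At q=0.55 an underprediction of 10 EUR costs 5.5 while an overprediction of 10 EUR costs 4.5, so the fitted value is nudged upward.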
Bugs, Baseline, and Dead Ends
The most instructive bug: during strategic backtesting, my data slicing excluded published D+1 prices — the number one feature (26.5% of total importance) was NaN in some runs. Each of the three tree models handled the missing value differently, so the ensemble was broken in a non-obvious way. After fixing the data slicing, strategic MAE dropped 49% (XGBoost: 40.25 → 20.40 EUR/MWh). Tree models don't fail gracefully on missing data — they silently route samples down their default NaN branch, producing inconsistent behavior rather than an error.
After that fix, v4.3 became the first production deployment — automated daily predictions, fully live.
The nagging signs: bias at -10.97 EUR/MWh, errors worst at the most expensive hours, and every subsequent experiment bouncing off an invisible 13–14 EUR/MWh MAE ceiling. Before understanding why, I tried five approaches that all failed:
v5.0b — Log-transform target: Log compression amplified peak underprediction when transformed back. MAE worsened 14.5%.
v5.1 — Crisis data weighting: Winsorizing 2022 crisis prices had zero effect (±0.04 MAE). The problem is structural, not historical.
v8.0 — PyTorch deep MLP (after v7.2, see below): MAE 28.13 — 2.2× worse than XGBoost. A flattened lag array destroys temporal structure that trees handle naturally as conditional splits.
v9.0 — Quantile ensemble (after v7.2, see below): Blending q=0.25/0.55/0.75 models. Best combination: MAE 14.48 — worse than v7.0's 12.69. Averaging compressed outputs produces compressed averages.
The Structural Diagnosis: Why Trees Compress Prices
After three failed experiments, I stepped back and asked a different question. Instead of "what feature or config will fix this," I asked: why does every gradient boosting configuration I try converge to the same accuracy ceiling?
I ran a systematic scout campaign — 48 experiments in two days — designed not to improve accuracy, but to diagnose why it wasn't improving. The scouts tested everything: different decay rates, quantile values, new features (Fourier harmonics, renewable curtailment proxies), target transforms (residual from baseline, multiplicative ratios), and a completely different tree algorithm (CatBoost).
The answer was in the numbers. I computed the ratio of predicted price standard deviation to actual price standard deviation — a measure I called "range ratio." Across every single experiment, this number was almost identical:
Regardless of features, loss function, decay rate, or even the boosting algorithm, predictions captured only 73% of the actual price range. The model's maximum prediction was about 127 EUR/MWh, while actual prices regularly exceeded 200 EUR/MWh. The 99th percentile of predictions was 99 EUR/MWh; the 99th percentile of actuals was 155.
This is leaf averaging — a fundamental property of how decision trees work. When a leaf node contains training samples with prices [80, 100, 130, 170], the tree predicts their average (~120), never the extremes. Every sample in that leaf gets the same prediction. The more heterogeneous the leaf, the more compression happens. With depth=8 trees, the leaves aren't specialized enough to isolate truly extreme conditions.
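A small sketch of the range ratio and of leaf averaging itself. The leaf uses the four prices from the example above; the second toy dataset and the two-leaf grouping are invented for illustration.

```python
import numpy as np

def range_ratio(y_pred, y_true):
    """Std of predictions divided by std of actuals; 1.0 means no compression."""
    return float(np.std(y_pred) / np.std(y_true))

# One leaf holding four training prices predicts their mean for all of them.
leaf_prices = np.array([80.0, 100.0, 130.0, 170.0])
leaf_prediction = float(leaf_prices.mean())

# Toy two-leaf tree: the first two hours share a leaf, the last four share another.
actual = np.array([20.0, 40.0, 80.0, 100.0, 130.0, 170.0])
predicted = np.array([30.0, 30.0, 120.0, 120.0, 120.0, 120.0])
ratio = range_ratio(predicted, actual)
print(leaf_prediction, round(ratio, 3))
```

Replacing values with group means can only shrink the spread, so the ratio stays below 1 whenever a leaf mixes different prices.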
Where the Errors Live
The compression creates a distinctive error pattern that I could now see clearly in the data:
The model overpredicts cheap hours (below 20 EUR/MWh) and severely underpredicts expensive hours. The critical range is 80–130 EUR/MWh, which makes up 34% of all hours but drives about 40% of total error. Above 130 EUR/MWh, the model essentially gives up — every prediction is too low by 45+ EUR/MWh.
The evening peak (hours 17–20) is where this hits hardest: bias of -15 to -22 EUR/MWh. These are exactly the hours that matter most for energy traders and battery operators. The model is least reliable precisely when accuracy matters most.
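An error-by-price-level breakdown like the one described here can be sketched as follows. The bucket edges (20, 80, 130 EUR/MWh) come from the thresholds mentioned in the text; the function name, output layout, and toy numbers are assumptions.

```python
import numpy as np

EDGES = [20.0, 80.0, 130.0]
LABELS = ["<20", "20-80", "80-130", ">=130"]

def error_by_price_level(y_true, y_pred, edges=EDGES, labels=LABELS):
    """Per-bucket share of hours, bias, MAE, and share of total absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true                      # negative means underprediction
    bucket = np.digitize(y_true, edges)        # 0 for <20, 1 for 20-80, and so on
    total_abs = np.abs(err).sum()
    rows = {}
    for i, label in enumerate(labels):
        mask = bucket == i
        if not mask.any():
            continue
        rows[label] = {
            "share_of_hours": float(mask.mean()),
            "bias": float(err[mask].mean()),
            "mae": float(np.abs(err[mask]).mean()),
            "share_of_error": float(np.abs(err[mask]).sum() / total_abs),
        }
    return rows

# Toy hours: overpredict the cheap one, underpredict the expensive ones.
report = error_by_price_level([10.0, 50.0, 100.0, 150.0], [20.0, 50.0, 90.0, 105.0])
for label, row in report.items():
    print(label, row)
```

Comparing share_of_hours with share_of_error per bucket is what shows which price range contributes more error than its frequency would suggest.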
Breaking Through: Deep Trees
With the diagnosis in hand, the fix was conceptually simple: if leaves are too broad, make more leaves. More splits means more specialized leaf nodes. Instead of one leaf for "evening + high demand + winter" containing prices from 80 to 170, deeper trees create separate leaves for "evening + high demand + winter + high gas + low wind" (prices 130–170) and "evening + high demand + winter + moderate gas + some wind" (prices 80–110).
I increased tree depth from 8 to 12 and compensated with regularization to prevent overfitting: learning rate dropped from 0.05 to 0.03, L2 regularization (lambda) set to 0.3, and minimum child weight raised to 5 (in XGBoost, min_child_weight is a floor on the summed Hessian in each leaf rather than a raw sample count).
In the 90-day scout, deep trees alone dropped MAE to 12.99 — a 10.2% improvement over v4.3 production. When combined with time decay (d365, exponentially downweighting old data), it reached 12.55 MAE — a 13.3% improvement.
But here's the twist that made the v6 campaign valuable beyond just the MAE number: deep trees are XGBoost-specific. I ran the same depth=12 configuration on all three models. The results diverged dramatically:
XGBoost's level-wise growth strategy handles deep trees gracefully — it fills out each depth before going deeper, maintaining balanced structure. LightGBM's leaf-wise strategy already goes deep on difficult cases, so adding more depth gives diminishing returns and risks overfitting. HistGradientBoosting's histogram binning loses resolution at extreme depths — its 256-bucket discretization becomes a bottleneck when you need fine-grained splits.
This meant the three-model ensemble that had served me well through v4.3 was no longer optimal. A single XGBoost at depth=12 outperformed the three-model ensemble at depth=8. The ensemble was dragging down the best model.
The top configurations validated on a full 150-day backtest (October 2025 through February 2026). XGBoost depth=12 + decay365 + q=0.55 achieved 13.20 EUR/MWh MAE — an 8.8% improvement over production, capturing 70% of actual price variation. Notably, the 90-day scout had suggested decay180 was optimal; the 150-day window showed decay365 was better. Short backtests can mislead — always validate on longer windows.
The Compression-Breaking Campaign: v7.0
Deep trees reduced compression but didn't eliminate it. The regression slope was stuck at 0.70 — for every 10 EUR of actual price variation, the model predicted only 7 EUR. I needed the model to pay more attention to expensive hours during training.
The idea: during training, multiply the sample weight of every hour where the actual price exceeds a threshold by a fixed multiplier. If the threshold is 60 EUR/MWh and the multiplier is 3x, then expensive hours count three times as much in the loss function. The model allocates more of its limited capacity to getting those hours right, at the cost of slightly worse accuracy on cheap hours where errors matter less.
I ran 20 experiments across two waves, sweeping weight multipliers (2x, 3x, 4x, 5x), thresholds (60, 80, 100 EUR), and combinations with time decay.
The best configuration — price weight 3x with threshold 60 EUR and decay365 — achieved 12.69 EUR/MWh MAE, a 12.3% improvement over v4.3 production. Spike recall jumped from ~40% to 70.7%. Critical-range (80–130 EUR) error dropped from 28.3 to 20.1 EUR/MWh.
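A sketch of how the two weighting schemes might be combined into one sample-weight vector. Multiplying the price weight by the decay weight, and modelling decay as a half-life, are assumptions about the implementation; the threshold, multiplier, and 365-day figure are the ones quoted in this article.

```python
import numpy as np

def sample_weights(prices, age_days, threshold=60.0, multiplier=3.0, halflife_days=365.0):
    """Upweight expensive hours and exponentially downweight old ones."""
    prices = np.asarray(prices, dtype=float)
    age_days = np.asarray(age_days, dtype=float)
    price_weight = np.where(prices > threshold, multiplier, 1.0)
    decay_weight = 0.5 ** (age_days / halflife_days)   # halves every halflife_days
    return price_weight * decay_weight

# A cheap and an expensive hour from today, and an expensive hour from a year ago.
weights = sample_weights([50.0, 61.0, 100.0], [0.0, 0.0, 365.0])
print(weights)
```

A vector like this would typically be passed to the training call through its sample-weight argument, so the loss counts a recent 100 EUR hour more heavily than an old 40 EUR hour.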
The Non-Linear Interaction
The most surprising finding was that price weighting and time decay interact non-linearly. You'd expect their benefits to simply add up. They don't.
Weight 3x benefits dramatically from decay365 (MAE improves by 1.34 EUR/MWh). But weight 2x gets worse with decay (+0.15), and weight 5x barely benefits (-0.13). The explanation: at weight 3x, the model is allocating moderate extra capacity to peaks, and time decay helps by reducing the noise from old crisis data. At 2x, the peak allocation is too gentle and decay over-thins the training set. At 5x, aggressive peak weighting overwhelms everything else — the model is already ignoring cheap hours so much that decay can't help further.
This kind of non-linear interaction is why I ran 20 experiments instead of just trying the obvious combination. In machine learning optimization, you often can't predict which parameter combinations will interact well until you test them.
The Neural Net Detour (v7.1)
If tree leaf averaging is the structural limit, why not use a model that doesn't have leaves? I tested sklearn's MLPRegressor (256, 128, 64) on the same data. MAE was 37 EUR/MWh — 3× worse than XGBoost. But the MLP's maximum prediction was 207 EUR/MWh and its range ratio was 1.08 (it overshot the actual range). This proved that compression is model-specific, not data-specific. The features contain enough information to predict extreme prices; trees can't extract it due to leaf averaging; an MLP architecturally can — but sklearn's optimizer couldn't find the weights in 2–4 hour training runs with no mini-batching or batch normalization.
Changing What the Model Predicts (v7.2)
The structural diagnosis showed that tree leaf averaging compresses predictions because leaves average extreme training values. But there was another angle: what if the prediction target itself were smaller?
Instead of predicting a raw price (which can range from 0 to 250 EUR/MWh), predict the deviation from a rolling 1-week price baseline — then add the baseline back at inference time. The compression still happens, but the model is now averaging a target range of roughly ±50 EUR rather than ±250. The leaf containing [80, 100, 130, 170] is now targeting [-20, 0, +30, +70], and the compressed average of those deviations is far less damaging.
The key design question was which baseline to use. A 4-week median seemed safer — smoother, more stable — but it degraded MAE to 15.13. The problem: a crisis that moves prices from 50 to 200 EUR in a week takes 3 more weeks to show up in the 4-week median. The 1-week baseline adapts in 7 days.
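The target transform can be sketched as below. Using a trailing mean (rather than a median) over the previous 168 hours is an assumption, as are the function names. In a real forecast the baseline must also be built only from prices already published at prediction time, which this toy version does not enforce per horizon.

```python
import numpy as np

def trailing_baseline(prices, window=168):
    """Mean of the previous `window` hours, excluding the current hour."""
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    baseline = np.full(n, np.nan)              # no baseline until a full window exists
    if n > window:
        csum = np.concatenate(([0.0], np.cumsum(prices)))
        baseline[window:] = (csum[window:n] - csum[:n - window]) / window
    return baseline

def to_residual(prices, baseline):
    """Training target: deviation from the baseline."""
    return np.asarray(prices, dtype=float) - baseline

def from_residual(predicted_residual, baseline):
    """Inference: add the baseline back to get a price."""
    return np.asarray(predicted_residual, dtype=float) + baseline

# The leaf from the earlier example, seen against a baseline of 100 EUR/MWh.
leaf_residuals = to_residual([80.0, 100.0, 130.0, 170.0], 100.0)
print(leaf_residuals)
```

Changing `window` from 168 to 672 hours gives the slower 4-week variant, which is where the lag in reacting to a price regime change comes from.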
The 80–130 EUR/MWh range — the critical bucket where most error lived — improved by 26%. The compression IS partly reversible by changing the prediction target.
The catch: overall MAE was 12.92 vs v7.0's 12.69. About 66% of hours are below 80 EUR where the raw model was already accurate. The transform trades low-price precision for high-price accuracy — a slightly negative trade on a calm window. It becomes a decisive positive when the market turns volatile. We'd find that out later.
Two deeper architecture attempts confirmed the same conclusion: a PyTorch residual MLP (v8.0, MAE 28.13) and a quantile ensemble (v9.0, MAE 14.48) both failed to break the slope ceiling. A flattened lag array destroys temporal structure. The compression problem is not about loss function, optimizer, or ensemble diversity — it's about how the model represents time. Every approach so far sees price history as a flat array, not as a sequence.
The LSTM Detour That Wasn't (v10.0 → v10.2, retracted)
After v7.x, the next idea was to give XGBoost something it can't learn itself: a compact summary of sequential price dynamics from a pre-trained LSTM encoder. The design was clean — pre-train an LSTM on 168 hours of price history, take its 64-dimensional hidden state, feed those 64 numbers into XGBoost as additional features. XGBoost still makes the final prediction, but now with temporal context that flat lags can't capture.
I ran 18 experiments across v10.0 / v10.1 / v10.2 over three weeks. The headline result on the 150-day window looked great: v10.1 task-aligned LSTM + residual_1w scored DA MAE 15.73, bias -0.65 (best ever, near zero), spike recall 24.1% (best ever). I promoted it to production. I wrote the original version of this article making "task alignment beats generic pre-training" the climactic lesson. I cited LSTM spike recall improvements in the methodology blog and updated the dashboard footer to "Production v10.1".
What actually happened — Stage 1: "v10.1 isn't actually deployed"
In early April I went to write a follow-up piece on the LSTM architecture and discovered, by direct verification against the production VM, that torch was never installed. The EPF_LSTM_EMBEDDINGS=true env var had been set on March 22, but the torch install + LSTM artifact deploy steps that the promotion was supposed to include never actually happened. The predictor code path tried to instantiate LSTMEmbedder, the import torch failed silently, the inference fell through to a no-LSTM code path, and the cron continued writing the v4.3-era ensemble rows. The production "LSTM-XGBoost hybrid" that I'd been describing in this article had never run a single time.
Stage 2: "even the backtest was broken"
That was bad enough. It got worse. A local test that ran the LSTM end-to-end (the first time anyone in the project's history actually had torch + the LSTM artifact + the trained xgboost model all in the same place) produced clearly degraded output: mean predicted price 23 EUR/MWh against the 50–100+ Spanish reality, 90% intervals from -47 to +46 EUR. Investigation found two layered code-level bugs:
Bug 1: silent zero-fill at 15-min inference time. The 15-min feature builder never called LSTMEmbedder.compute_embedding(). Any lstm_emb_* feature in the trained model was filled with 0.0 at the row construction site. XGBoost handled the all-zero block silently: no exception, no warning, no log line. The 64-dim LSTM block contributed exactly zero to every prediction the system had ever made.
Bug 2: wrong-domain LSTM input at training time. The trainer fed 15-minute prices through an LSTM encoder pre-trained on hourly prices. The encoder expected 168 hourly elements (1 week of context) with hourly normalization stats. What it actually received was 168 quarter-hours (42 hours of context). Wrong time span, wrong domain, wrong normalization. The encoder's output during training was noise. XGBoost trained against that noise, and the feature_selector picked 6-10 of the noise dimensions per horizon group as if they were real features.
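A guard of the following kind would have rejected the Bug 2 input before it reached the encoder. This is a hypothetical sketch: the function name and the idea of passing timestamps alongside the price window are assumptions, not part of the project's LSTMEmbedder.

```python
import numpy as np

def check_encoder_window(timestamps, expected_steps=168):
    """Raise unless the window is exactly `expected_steps` points spaced one hour apart."""
    ts = np.asarray(timestamps, dtype="datetime64[m]")
    if len(ts) != expected_steps:
        raise ValueError(f"expected {expected_steps} steps, got {len(ts)}")
    steps = np.diff(ts)
    if not np.all(steps == np.timedelta64(1, "h")):
        raise ValueError("window is not hourly; the encoder was trained on hourly prices")
    return (ts[-1] - ts[0]) / np.timedelta64(1, "h") + 1.0   # hours of context covered

start = np.datetime64("2026-01-01T00:00")
hourly = start + np.arange(168) * np.timedelta64(60, "m")
quarter_hourly = start + np.arange(168) * np.timedelta64(15, "m")

print(check_encoder_window(hourly))
try:
    check_encoder_window(quarter_hourly)
except ValueError as exc:
    print("rejected:", exc)
```

Both arrays have 168 elements, so a length check alone would pass; it is the spacing check that separates a week of hourly data from 42 hours of quarter-hourly data.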
Empirical verification: toggling EPF_LSTM_EMBEDDINGS=true ↔ false on the same trained model produced bit-for-bit identical predictions. The LSTM block contributed zero useful signal, in production AND in every backtest that produced the v10.x headline numbers. The "best-ever bias", the "task alignment beats generic pre-training" finding, the "8 percentage points more spike recall" claim — every v10.x conclusion described broken code.
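The check above toggled an environment variable. A related and more general test, sketched here as a different technique, shuffles one feature block across rows and asks whether the predictions move at all. The function name and the toy linear "model" are hypothetical; any callable that maps a feature matrix to predictions would work.

```python
import numpy as np

def block_changes_predictions(predict, X, block_cols, seed=0, atol=1e-9):
    """Return True if shuffling the given columns across rows changes the output."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X_shuffled = X.copy()
    perm = rng.permutation(len(X))
    X_shuffled[:, block_cols] = X[perm][:, block_cols]   # break the row alignment
    return not np.allclose(predict(X), predict(X_shuffled), atol=atol)

# Toy model that uses columns 0 and 1 and ignores columns 2 and 3.
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]

X_demo = np.random.default_rng(1).normal(size=(50, 4))
print(block_changes_predictions(toy_predict, X_demo, [0]))
print(block_changes_predictions(toy_predict, X_demo, [2, 3]))

# A block that is zero-filled everywhere also reports no effect, even if the model reads it.
X_zeroed = X_demo.copy()
X_zeroed[:, 1] = 0.0
print(block_changes_predictions(toy_predict, X_zeroed, [1]))
```

A block that is ignored by the model, or constant in the data, leaves predictions unchanged, which is the signature the v10.x embeddings showed.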
Stage 3: "v8 already won, we just didn't notice"
The deeper finding came a day later. There was a pre-LSTM scout from earlier in March, v8-res-1w-pw3-d365, that I'd treated as a stepping stone. I went back and queried the persisted backtest data on the same 156-day window:
| Tag | DA MAE | DA bias | SpkR | DirAcc | CorrFr |
|---|---|---|---|---|---|
| v8-res-1w-pw3-d365 (March, the predecessor) | 12.98 | -3.60 | 71.08% | 76.74% | 0.904 |
| v10.1 (the broken LSTM "promotion") | 15.73 | -0.65 | 69.34% | 75.87% | 0.887 |
The pre-LSTM v8 configuration already beat v10.1 on every metric except bias. v10.1's "best-ever bias" was an artifact of the broken-LSTM zero-fill noise acting as accidental regularization: the model couldn't commit to extreme predictions because the noise features dampened the trees' confidence. Every other metric, including the structural ones (spike recall, directional accuracy, forecast correlation, spread capture) that the v10.x narrative cited as evidence for LSTM's value, actually favored v8.
I'd been wrong about which model was the real winner. The production "promotion" had moved away from a strictly better architecture and toward broken code, then dressed it up with metrics that described the broken code, then I'd written this very article making it the climactic lesson. None of the structural insights about residual_1w or deep trees or price weighting were wrong — those were independently validated against v6.3 and v7.0 controls. But the entire LSTM chapter had to be retracted.
The Real Production Model (v11.0)
v11.0 is what happens when you take the v8-res-1w-pw3-d365 architecture and retrain it from current code on data through April 7. No LSTM. No ensemble. Single XGBoost (depth=12, learning rate=0.03, min_child=5, reg_lambda=0.3, q=0.55) with the residual_1w target transform, 3× price weighting above 60 EUR/MWh, and a 365-day sample weight halflife. This is the model that's now in src/config.py as the default and will be the first thing the new Cloud Run prediction job writes to the dashboard once the cutover ships.
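For reference, the hyperparameters listed above could be written as an XGBoost parameter dictionary along these lines. The numeric values are the ones stated in this article; expressing the q=0.55 loss through XGBoost's built-in reg:quantileerror objective is an assumption about the implementation, and the dictionary name is hypothetical.

```python
# Hypothetical parameter dictionary mirroring the v11.0 description.
V11_XGB_PARAMS = {
    "objective": "reg:quantileerror",  # assumption: built-in quantile objective (XGBoost >= 2.0)
    "quantile_alpha": 0.55,            # q=0.55
    "max_depth": 12,                   # deep trees
    "learning_rate": 0.03,
    "min_child_weight": 5,
    "reg_lambda": 0.3,                 # L2 regularization
}

# Settings that live outside the booster parameters, as described in the text.
V11_TRAINING_SETTINGS = {
    "target_transform": "residual_1w",
    "price_weight_multiplier": 3.0,
    "price_weight_threshold_eur_mwh": 60.0,
    "time_decay_halflife_days": 365,
}

for name, value in {**V11_XGB_PARAMS, **V11_TRAINING_SETTINGS}.items():
    print(f"{name} = {value}")
```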
v11.0 beats v10.1 on every metric except bias. Strategic shows the bigger gain: 2.7 percentage points better directional accuracy and 2.7 percentage points better spike recall on the D+2 to D+7 horizons that traders actually trade against a week ahead. The strategic backtest is also the first strategic baseline for the pre-LSTM architecture in the project's history; the v10.x retraction also retracted everything we thought we knew about strategic performance, and v11.0 is what a clean strategic baseline actually looks like.
The 1.28 MAE gap to historical v8 (12.98) is explained by 16 more days of training data, code drift in post-processing, and one NaN row in ree_hourly from the autumn DST transition. These are data/code drift, not architectural regressions.
Results: Where Things Stand After 100+ Experiments
The evaluation framework evolved alongside the models — from MAE/RMSE only (v1–v2), to adding bias and naive benchmarks (v3), to spike recall and shape correlation (v4), to regression slope and error-by-price-level decomposition (v6–v7). Each phase changed which problem to solve next. Tracking regression slope from v1.0 would have identified the tree ceiling weeks earlier.
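The regression slope is cheap to track from the first version onward. A minimal sketch, assuming it is defined as the ordinary least squares slope of predictions regressed on actual prices (the toy numbers are invented):

```python
import numpy as np

def regression_slope(y_true, y_pred):
    """OLS slope of predictions on actuals; 1.0 means price swings are fully tracked."""
    slope, _intercept = np.polyfit(
        np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float), deg=1
    )
    return float(slope)

# Toy compressed forecast: only 70% of each price move is passed through.
actual = np.array([20.0, 50.0, 80.0, 120.0, 180.0])
compressed = 0.7 * actual + 20.0
slope_demo = regression_slope(actual, compressed)
print(round(slope_demo, 3))
```

MAE can look acceptable while this number stays well below 1, which is why it exposes compression that an average error hides.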
The full MAE journey:
The v4.2 regression is visible in the chart — adding 24 crisis features made things worse before feature selection recovered most of the damage in v4.3. The jump from v4.3 to v6.3 is the deep trees breakthrough. The final step from v6.3 to v7.0 is price-weighted training.
To put the final number in context: average Spanish electricity prices hover around 50–80 EUR/MWh, so 12.69 EUR/MWh error represents roughly 16–25% relative error. The model beats persistence (yesterday's price) by 53% and weekly seasonal baselines by 57%. It correctly identifies 70.7% of price spikes and captures the within-day price shape with 0.89 correlation — which is what battery storage operators actually need.
The Numbers in Context
This article references several MAE numbers measured on different windows. v7.0 scored 12.69 on a 90-day calm window from December through March. The same v7.0 config scores ~15 on the harder 156-day window that includes the Oct-Nov 2025 gas crisis. v8-res-1w-pw3-d365 (the predecessor that v11.0 retrains) scored 12.98 on the 156-day window in March. v11.0 scores 14.26 on the same window today (the 1.28 gap to historical v8 is explained by data/code drift, not architectural regression). The retracted v10.1 LSTM hybrid scored 15.73 — measurably worse than v11.0, which is the corrected production model. On strategic (D+2 to D+7), v11.0 sits at 17.35 against v10.1's 18.13.
What 100+ Experiments Actually Taught Me
What you optimize matters more than how you optimize it. The two biggest structural improvements both came from asking "am I targeting the right thing?" — not from tuning the model. Switching from MAE to quantile loss at q=0.55 improved accuracy 18.2% (the single biggest gain in the project). Switching the prediction target from raw price to 1-week residual reduced bias by 51% and high-price MAE by 26%. Both required changing one line, not building a new architecture.
Diagnose first, and validate on the hard window. I spent weeks tweaking features and hyperparameters before running the 48-experiment scout campaign that identified leaf averaging as the structural limit. Once diagnosed, the fix was obvious. Short backtests also mislead — 90-day scouts pointed to decay180 as optimal; the 150-day window showed decay365 was better. Always validate on the window that includes the hard regime.
Non-obvious interactions dominate. Price weight 3x + time decay 365 is uniquely effective — a non-linear combination that outperforms either technique alone and outperforms nearby parameter values (2x or 5x weight, 180-day or no decay). You can't find these with theory. You find them by running systematic sweeps.
Failed experiments are informative. The peak/off-peak split taught me that peaks need surrounding context. The log transform taught me that compression in log-space amplifies underprediction. The MLP taught me the compression problem is model-specific, not data-specific. Each failure narrowed the search space for what would work.
One good model beats three average ones. The three-model ensemble was right through v4.3 — each model's weaknesses smoothed by the others. But once deep trees made XGBoost dramatically better than HistGB and LightGBM, the ensemble became a liability. Knowing when to stop ensembling is as important as knowing when to start.
Task alignment beats generic pre-training. The generic LSTM (trained to predict next-1h) and the task-aligned encoder (trained to predict the full D+1 curve) use identical architecture. Different training objective — 8 percentage points more spike recall on volatile periods. Any encoder borrowed from a proxy task should be re-aligned to the exact downstream problem.
↑ RETRACTED 2026-04-09. The original version of this lesson was derived from the broken v10.x experiments. The "task-aligned LSTM encodes full-day-curve dynamics" claim cannot be supported by the persisted backtest data because the LSTM block was contributing zero useful signal at both training and inference time. There may still be a valid version of this lesson if someone properly fixes Bug 1 and Bug 2 and re-runs the comparison, but until then it should be treated as unsupported.
Verify your dependencies actually run. The most expensive bug in this project's history was an import torch that failed silently and a feature builder that didn't call the function it was supposed to call. Both were caught only by writing a test that exercised the full pipeline end-to-end with all dependencies present and comparing the output against an expected baseline. Three weeks of "structural improvement" experiments measured zero-signal noise. The fix is institutional: every model promotion now needs a startup self-test that refuses to run if the env vars and the trained model artifact disagree, plus a /api/v1/production-state endpoint that exposes the live state as a queryable source of truth, plus a documentation rule that requires citing the verification command in commit messages. None of this existed during the original v10.x promotion.
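A startup self-test of the kind described could look roughly like this. It is a sketch under assumptions: the function name and the way feature names are obtained are hypothetical, while EPF_LSTM_EMBEDDINGS and the lstm_emb_ prefix are the names used earlier in this article.

```python
import importlib.util
import os

def startup_self_test(model_feature_names, env=None):
    """Refuse to start if the env flag, the model artifact, and installed packages disagree."""
    env = os.environ if env is None else env
    lstm_enabled = env.get("EPF_LSTM_EMBEDDINGS", "false").lower() == "true"
    model_uses_lstm = any(name.startswith("lstm_emb_") for name in model_feature_names)

    if lstm_enabled != model_uses_lstm:
        raise RuntimeError(
            f"EPF_LSTM_EMBEDDINGS={lstm_enabled} but model_uses_lstm={model_uses_lstm}"
        )
    if lstm_enabled and importlib.util.find_spec("torch") is None:
        # Fail loudly instead of falling through to a no-LSTM code path.
        raise RuntimeError("LSTM embeddings enabled but torch is not installed")
    return True

# Consistent setup: flag off, no embedding features in the model.
print(startup_self_test(["price_lag_24", "residual_demand"], env={"EPF_LSTM_EMBEDDINGS": "false"}))

# Inconsistent setup: flag on, but the model has no embedding features.
try:
    startup_self_test(["price_lag_24"], env={"EPF_LSTM_EMBEDDINGS": "true"})
except RuntimeError as exc:
    print("refused:", exc)
```

The design choice is that a mismatch raises at startup rather than being logged, because the original failure mode was a silent fallback that kept writing predictions.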
When the math doesn't add up, dig. v10.1's "best-ever bias" of -0.65 looked surprisingly clean. Every other metric was slightly improved over v7.0. The MAE delta was -0.03. None of those individually was suspicious, but the combination (a model that's better on every structural metric without a meaningful MAE improvement) should have triggered a "what's the model actually doing differently?" investigation at promotion time. It didn't. It took a separate effort weeks later, in early April, to surface the actual answer: the structural metrics were artifacts of broken-LSTM noise, not architectural improvements. If a result feels too clean across the board, the explanation is more often "instrumentation problem" than "architectural breakthrough".
How Close Can One Person Get?
Commercial electricity price forecasters achieve D+1 MAE of 5–10 EUR/MWh with proprietary data and dedicated teams. On a calm 90-day window, v7.0 got to 12.69 with public data and open-source tools. On the harder 156-day window including the Iran gas crisis, the v11.0 production model sits at 14.26 with -3.91 bias on day-ahead and 17.35 / -11.62 on strategic. The structural problems that seemed intractable at the start (tree leaf averaging, extreme-price blindness, persistent underprediction) all have answers. Some are simple (change what you predict). Some are architectural (deep trees with regularized leaves). None required proprietary data, and none required an LSTM. The LSTM detour was a methodology lesson, not an architectural one.
Most of the remaining gap to commercial forecasters likely comes from three things I don't have: proprietary weather forecast data (not historical observations), real-time market order book data, and years of domain knowledge about edge cases. All three are on the roadmap.
Since this post was published, the model was stress-tested by a real market crisis: the March 2026 Iran oil shock sent Spanish prices to 250 EUR/MWh. I wrote a detailed post-mortem on exactly how each model generation failed — and why. Read: When Our Model Met a Crisis →
I got to 14.26 EUR/MWh on a volatile market. Can you do better?