v6.3 — Deep Trees 150-Day Validation (Validated)
Date: March 19, 2026 | Status: Validated — DA-only, pending strategic validation
Summary
Following the v6.2 scout campaign that identified deep trees (depth=12) as a breakthrough, we ran 17 experiments on the full 150-day window (Oct 2025 – Mar 2026) to validate the finding across model types, decay rates, and quantile targets. Combined with the 24 scout experiments (v6.1 + v6.2), this brings the campaign to 41 experiments in total — the most extensive hyperparameter exploration in the project’s history.
Best result: XGBoost depth=12 + decay365 + q=0.55 — MAE 13.20 (-8.8% vs production)
Bias Sign Correction
An important correction from the scout reports: the bias was reported with the wrong sign convention. All models underpredict (negative bias), consistent with the original structural diagnosis. The scout scripts computed mean(actual - predicted) instead of mean(predicted - actual). Deep trees reduced underprediction from -11.5 (baseline) to -7.0 (best), but did NOT flip it positive.
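To keep the convention unambiguous going forward, here is a minimal sketch of the bias metric as this report defines it (mean of predicted minus actual, so underprediction is negative). The function name `bias` is illustrative, not the actual scout-script code:

```python
import numpy as np

def bias(predicted, actual):
    """Bias as used in this report: mean(predicted - actual).
    Negative => the model underpredicts. The scout scripts computed
    mean(actual - predicted), which flipped the reported sign."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(predicted - actual))

# An underpredicting model (forecasts below actuals) yields negative bias:
print(bias([90.0, 110.0, 75.0], [100.0, 120.0, 80.0]))  # -8.33...
```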
Validated Results (150-day window)
XGBoost Deep Trees — Decay Sweep
| Decay | MAE | Bias | Eve MAE (h17-21) | Spike Recall | Dir Accuracy |
|---|---|---|---|---|---|
| 365d | 13.20 | -7.03 | 19.42 | 56.5% | 66.8% |
| 180d | 13.74 | -8.65 | 21.50 | 47.4% | 68.3% |
| none | 13.83 | -5.99 | 18.21 | 61.4% | 63.3% |
| 270d | 13.91 | -7.86 | 20.68 | 50.5% | 67.5% |
Finding: Decay365 wins on MAE (13.20), while no-decay wins on peaks (eve MAE 18.21, spike recall 61.4%). Peak capture degrades monotonically as the halflife shortens (eve MAE rises 18.21 → 19.42 → 20.68 → 21.50 from no decay to 180d), so the choice is a direct trade-off between overall accuracy and peak capture.
Important: On 90-day scouts, d180 beat d365. On 150-day, d365 wins. Scout results on decay halflife can be misleading — always validate on longer windows.
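For reference, a sketch of halflife-based time-decay sample weights. The exact production weighting is assumed; a common convention, used here, halves a sample's weight for every `halflife` days of age:

```python
import numpy as np

def decay_weights(age_days, halflife):
    """Exponential time-decay sample weights: a sample `halflife` days
    old gets half the weight of a fresh sample. Sketch of the
    sample_weight_halflife setting; exact production formula assumed."""
    age_days = np.asarray(age_days, dtype=float)
    return 0.5 ** (age_days / halflife)

# With halflife=365, a one-year-old sample carries half weight,
# a two-year-old sample a quarter:
print(decay_weights([0, 365, 730], 365))  # [1.0, 0.5, 0.25]
```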
XGBoost Deep Trees — Quantile Sweep
| Quantile | MAE | Bias |
|---|---|---|
| 0.55 | 13.83 | -5.99 |
| 0.52 | 14.10 | -7.71 |
| 0.50 (MAE) | 14.52 | -8.06 |
Finding: Quantile 0.55 is the best of the values tested: both lower values produce worse MAE and more negative bias. This was tested both with and without decay, with consistent results.
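The mechanism behind the quantile sweep is the pinball (quantile) loss: at q = 0.55 underprediction is penalized slightly more than overprediction, which nudges the fitted model upward and counters the underprediction bias. A minimal numpy sketch (not the training objective actually used in the pipeline):

```python
import numpy as np

def pinball_loss(predicted, actual, q):
    """Quantile (pinball) loss. For q > 0.5, underprediction
    (actual > predicted) costs more than overprediction of the
    same size, pushing the fitted model above the plain-MAE fit."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.where(err >= 0, q * err, (q - 1) * err)))

# At q=0.55, underpredicting by 10 costs more than overpredicting by 10:
print(pinball_loss([90], [100], 0.55))   # 5.5
print(pinball_loss([110], [100], 0.55))  # 4.5
```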
Cross-Model Comparison (all depth=12)
| Model | No Decay MAE | With Decay MAE | Conclusion |
|---|---|---|---|
| XGBoost | 13.83 | 13.20 (d365) | Clear winner — histogram splitting handles deep trees efficiently |
| LightGBM | 15.52 | 14.27 (d180) | Can’t exploit deep trees; worse than XGB baseline |
| HistGB | 17.91 | 16.53 (d180) | Incompatible — severe degradation with depth=12 |
Finding: Deep trees are an XGBoost-specific improvement. LightGBM and HistGB both degrade significantly with depth=12. The current 3-model ensemble (HistGB + LGB + XGB, all depth=8) averages a weak model with a strong one — a single XGB depth=12 outperforms the ensemble.
Other Findings
- 800 trees (vs 500): MAE 14.17 vs 13.83 — more trees didn’t help, current regularization is calibrated for 500
- Depth 10 (vs 12): MAE 14.47 vs 13.83 — intermediate depth is decent but 12 is clearly better
- Feature selection: Essential — removing it costs +2 EUR/MWh (overfitting with deep trees + many features)
Compression Analysis: Why MAE Can’t Break Below ~13
The most important finding from this campaign isn’t the winning configuration — it’s understanding the fundamental barrier.
Regression Slope ≈ 0.70
Every model, regardless of depth, decay, or quantile, has a regression slope of predictions on actuals between 0.64 and 0.72. This means predictions capture only 64-72% of actual price variation. When actual prices are 130 EUR, the best model predicts ~101 EUR.
Error by Price Level
| Price Range | % of Data | MAE (EUR/MWh) | Bias (EUR/MWh) | % Underpredicted |
|---|---|---|---|---|
| < 20 EUR | 26.9% | 5-10 | +5 to +9 | 20-26% ← overpredicts |
| 20-60 EUR | 21.3% | 7-8 | ±2-4 | 46-68% ← balanced |
| 60-100 EUR | 32.2% | 10-15 | -9 to -15 | 89-97% |
| 100-130 EUR | 16.1% | 22 | -22 | 99.7% |
| 130+ EUR | 3.5% | 45 | -45 | 100% |
The model compresses toward the center: overpredicts low prices, severely underpredicts high prices. The 80-130 EUR range (34% of data) alone drives ~40% of total error.
Error Concentration
- Hours 17-20 (evening peak): Bias -15 to -22 EUR/MWh. This 4-hour window dominates total error.
- Mondays: MAE 20.18 vs 12-13 for other weekdays. Weekend→weekday price jumps are poorly predicted.
- December-January: MAE 14.6-15.8 (high prices + volatility). February was easy (MAE 7.4, avg price just 16.5 EUR).
Naive Benchmarks
| Method | MAE | Model Improvement |
|---|---|---|
| Naive persistence | 27.15 | — |
| Naive weekly | 29.70 | — |
| Best model (d365) | 13.20 | -51% vs persistence |
| Academic EPF range | — | typically 35-55% vs persistence |
We’re at 51% improvement over persistence — near the upper end of published academic results. The remaining gap to MAE ~10 requires fundamentally different approaches.
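For reproducibility, a sketch of the persistence benchmark, assuming it forecasts each hour with the price at the same hour the previous day (lag 24 h; the weekly naive would use lag 168 h). Whether the report's benchmark uses exactly this lag is an assumption:

```python
import numpy as np

def persistence_mae(prices, lag_hours=24):
    """Naive persistence benchmark on an hourly price series:
    forecast = price lag_hours earlier (24 = daily, 168 = weekly).
    Assumed to match the benchmark definition used in this report."""
    prices = np.asarray(prices, dtype=float)
    forecast = prices[:-lag_hours]
    actual = prices[lag_hours:]
    return float(np.mean(np.abs(actual - forecast)))

# Tiny toy series with lag 2 for illustration:
prices = [50.0, 60.0, 55.0, 52.0, 70.0, 58.0]
print(persistence_mae(prices, lag_hours=2))  # 8.5
```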
Implications for Next Steps
To break below MAE ~13, we need approaches that specifically fix the 80-130 EUR prediction range where 34% of data has MAE 15-22:
- Asymmetric loss functions that penalize peak underprediction 3-5x more heavily than overprediction of low prices
- Post-hoc quantile recalibration — fit a monotone mapping on OOF predictions to expand the predicted range
- Peak-hour auxiliary model — a second model trained only for hours 17-20 with explicit high-price optimization
- Monday-specific features — capture the weekend→weekday price transition that causes MAE 20 vs 12
- Neural networks (TFT, N-BEATS) — don’t have the tree leaf-averaging compression problem
Incremental tree tuning (more depth, different decay, different quantile) has been exhaustively explored and won’t break through the ~13 floor.
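Of the proposed directions, post-hoc quantile recalibration is the cheapest to prototype: fit a monotone mapping from out-of-fold predictions to actuals, then apply it at inference to expand the compressed range. A self-contained sketch using the pool-adjacent-violators algorithm (PAVA) for the isotonic fit; this is a proposal, not code that exists in the pipeline:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: best non-decreasing fit to y (L2)."""
    blocks = []  # list of [mean, weight] blocks
    for v in map(float, y):
        blocks.append([v, 1.0])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for m, w in blocks:
        out.extend([m] * int(w))
    return np.array(out)

def recalibrate(oof_pred, oof_actual, new_pred):
    """Fit a monotone mapping pred -> actual on out-of-fold data,
    then apply it to new predictions by interpolation (np.interp
    clamps outside the OOF prediction range). Sketch of the proposed
    quantile-recalibration step, not production code."""
    order = np.argsort(oof_pred)
    x = np.asarray(oof_pred, dtype=float)[order]
    y_fit = pava(np.asarray(oof_actual, dtype=float)[order])
    return np.interp(new_pred, x, y_fit)

# Example: compressed OOF predictions mapped back to the actual scale.
oof_pred = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
oof_actual = np.array([10.0, 40.0, 70.0, 100.0, 130.0])
print(recalibrate(oof_pred, oof_actual, np.array([100.0])))  # [130.]
```

Because the mapping is fit on out-of-fold predictions, it corrects the systematic compression without leaking training data into the calibration.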
Best Configuration
Model: XGBoost (single model, not ensemble)

Hyperparameters:
- max_depth: 12
- min_child_weight: 5
- learning_rate: 0.03
- reg_lambda: 0.3
- n_estimators: 500
- quantile: 0.55

Training config:
- sample_weight_halflife: 365 (or none for better peak capture)
- winsorize_cap: 200
- feature_selection: enabled
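The same configuration as a plain Python dict, handy for diffing against run logs. Hyperparameter names follow XGBoost's scikit-learn API; the last four keys are pipeline-level settings named in this report, and how they are plumbed into training is assumed:

```python
# Winning v6.3 configuration. XGBoost keys use the xgboost sklearn API
# names; quantile / sample_weight_halflife / winsorize_cap /
# feature_selection are pipeline-level settings (exact plumbing assumed).
BEST_CONFIG = {
    "model": "xgboost",                # single model, not the 3-model ensemble
    "max_depth": 12,
    "min_child_weight": 5,
    "learning_rate": 0.03,
    "reg_lambda": 0.3,
    "n_estimators": 500,
    "quantile": 0.55,                  # pinball-loss target quantile
    "sample_weight_halflife": 365,     # or None for better peak capture
    "winsorize_cap": 200,              # EUR/MWh cap on training targets
    "feature_selection": True,         # skipping it costs ~+2 EUR/MWh
}
print(BEST_CONFIG["max_depth"], BEST_CONFIG["quantile"])  # 12 0.55
```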