
v6.3 — Deep Trees 150-Day Validation (Validated)

Date: March 19, 2026 | Status: Validated — DA-only, pending strategic validation

Summary

Following the v6.2 scout campaign that identified deep trees (depth=12) as a breakthrough, we ran 17 experiments on the full 150-day window (Oct 2025 – Mar 2026) to validate the finding across model types, decay rates, and quantile targets. Combined with the 24 scout experiments (v6.1 + v6.2), this brings the campaign to 41 experiments in total, the most extensive hyperparameter exploration in the project's history.

Best result: XGBoost depth=12 + decay365 + q=0.55 — MAE 13.20 (-8.8% vs production)

Bias Sign Correction

An important correction from the scout reports: the bias was reported with the wrong sign convention, because the scout scripts computed mean(actual - predicted) instead of mean(predicted - actual). All models underpredict (negative bias), consistent with the original structural diagnosis. Deep trees reduced underprediction from -11.5 (baseline) to -7.0 (best), but did NOT flip it positive.
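To pin the corrected convention down unambiguously, here is a minimal sketch (the helper name `bias` is illustrative, not from the codebase):

```python
import numpy as np

def bias(y_true, y_pred):
    """Signed bias under the corrected convention: mean(predicted - actual).
    Negative values mean the model underpredicts on average."""
    return float(np.mean(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)))
```

With this convention, a model that forecasts 90 and 95 against actuals of 100 and 100 has bias -7.5, i.e. underprediction.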

Validated Results (150-day window)

XGBoost Deep Trees — Decay Sweep

| Decay | MAE | Bias | Eve MAE (h17-21) | Spike Recall | Dir. Accuracy |
|-------|-----|------|------------------|--------------|---------------|
| 365d | 13.20 | -7.03 | 19.42 | 56.5% | 66.8% |
| 180d | 13.74 | -8.65 | 21.50 | 47.4% | 68.3% |
| none | 13.83 | -5.99 | 18.21 | 61.4% | 63.3% |
| 270d | 13.91 | -7.86 | 20.68 | 50.5% | 67.5% |

Finding: Decay365 wins on MAE (13.20), while no-decay wins on peaks (eve MAE 18.21, spike recall 61.4%). The broad trade-off: adding decay improves overall accuracy but reduces peak capture, though not strictly monotonically (d180 has the worst eve MAE of the four settings).

Important: on the 90-day scouts, d180 beat d365; on 150 days, d365 wins. Scout results for the decay half-life can be misleading; always validate on longer windows.
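The half-life values above correspond to exponential time-decay sample weights. A minimal sketch of how such weights are typically derived, assuming row age is measured in days from the most recent training day (the actual pipeline may differ):

```python
import numpy as np

def decay_weights(ages_days, halflife_days):
    """Exponential time-decay sample weights: a row's weight halves every
    `halflife_days`. Pass halflife_days=None for the no-decay case, where
    every row gets equal weight."""
    ages = np.asarray(ages_days, dtype=float)
    if halflife_days is None:
        return np.ones_like(ages)
    return 0.5 ** (ages / halflife_days)
```

Under decay365, a row that is one year old carries half the weight of today's row; under no decay, old and new rows count equally, which is why distant spiky periods still influence the fit (better peak capture, slightly worse overall MAE).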

XGBoost Deep Trees — Quantile Sweep

| Quantile | MAE | Bias |
|----------|-----|------|
| 0.55 | 13.83 | -5.99 |
| 0.52 | 14.10 | -7.71 |
| 0.50 (MAE) | 14.52 | -8.06 |

Finding: Quantile 0.55 is the best of the values tested. Both lower values produce worse MAE and worse (more negative) bias, and the ordering was consistent both with and without decay.

Cross-Model Comparison (all depth=12)

| Model | No-Decay MAE | With-Decay MAE | Conclusion |
|-------|--------------|----------------|------------|
| XGBoost | 13.83 | 13.20 (d365) | Clear winner — histogram splitting handles deep trees efficiently |
| LightGBM | 15.52 | 14.27 (d180) | Can't exploit deep trees; worse than XGB baseline |
| HistGB | 17.91 | 16.53 (d180) | Incompatible — severe degradation with depth=12 |

Finding: Deep trees are an XGBoost-specific improvement. LightGBM and HistGB both degrade significantly at depth=12. The current 3-model ensemble (HistGB + LGB + XGB, all depth=8) averages weaker models with a strong one; a single XGB at depth=12 outperforms the whole ensemble.

Other Findings

  • 800 trees (vs 500): MAE 14.17 vs 13.83 — more trees didn't help; the current regularization is calibrated for 500
  • Depth 10 (vs 12): MAE 14.47 vs 13.83 — intermediate depth is decent, but 12 is clearly better
  • Feature selection: essential — removing it costs +2 EUR/MWh (deep trees plus many features overfit without it)

Compression Analysis: Why MAE Can’t Break Below ~13

The most important finding from this campaign isn’t the winning configuration — it’s understanding the fundamental barrier.

Regression Slope = 0.70

Every model, regardless of depth, decay, or quantile, has a regression slope (predicted on actual) of 0.64-0.72: for every 1 EUR the actual price moves, the prediction moves only 0.64-0.72 EUR. When actual prices are 130 EUR, the best model predicts ~101 EUR.
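The slope itself is just an OLS fit of predictions on actuals; a hypothetical diagnostic helper (the function name is illustrative):

```python
import numpy as np

def prediction_slope(y_true, y_pred):
    """OLS slope and intercept of predicted on actual prices.
    A slope below 1 means the forecast compresses the true price range
    toward its center."""
    slope, intercept = np.polyfit(np.asarray(y_true, dtype=float),
                                  np.asarray(y_pred, dtype=float), 1)
    return float(slope), float(intercept)
```

On a synthetic series where predictions are exactly 0.7 * actual + 30, this recovers slope 0.7: the hallmark of a compressed forecast.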

Error by Price Level

| Price Range | % of Data | MAE | Bias | % Underpredicted |
|-------------|-----------|-----|------|------------------|
| < 20 EUR | 26.9% | 5-10 | +5 to +9 | 20-26% (overpredicts) |
| 20-60 EUR | 21.3% | 7-8 | ±2-4 | 46-68% (balanced) |
| 60-100 EUR | 32.2% | 10-15 | -9 to -15 | 89-97% |
| 100-130 EUR | 16.1% | 22 | -22 | 99.7% |
| 130+ EUR | 3.5% | 45 | -45 | 100% |

The model compresses toward the center: overpredicts low prices, severely underpredicts high prices. The 80-130 EUR range (34% of data) alone drives ~40% of total error.
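The breakdown above comes from bucketing errors by actual price level; a sketch of such a diagnostic, assuming hourly arrays of actuals and predictions (bucket edges match the table, helper name is illustrative):

```python
import numpy as np

def error_by_bucket(y_true, y_pred):
    """Per-price-bucket MAE and bias (predicted - actual), bucketed by the
    actual price. Edges are in EUR/MWh and mirror the table above."""
    edges = (20, 60, 100, 130)
    labels = ("<20", "20-60", "60-100", "100-130", "130+")
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    idx = np.digitize(y_true, edges)  # 0 = below first edge, len(edges) = above last
    out = {}
    for b, label in enumerate(labels):
        mask = idx == b
        if mask.any():
            err = y_pred[mask] - y_true[mask]
            out[label] = {"mae": float(np.abs(err).mean()),
                          "bias": float(err.mean())}
    return out
```

On real data this reproduces the compression signature: positive bias in the "<20" bucket, increasingly negative bias above 60 EUR.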

Error Concentration

  • Hours 17-20 (evening peak): Bias -15 to -22 EUR/MWh. This 4-hour window dominates total error.
  • Mondays: MAE 20.18 vs 12-13 for other weekdays. Weekend→weekday price jumps are poorly predicted.
  • December-January: MAE 14.6-15.8 (high prices + volatility). February was easy (MAE 7.4, avg price just 16.5 EUR).

Naive Benchmarks

| Method | MAE | Improvement vs Persistence |
|--------|-----|----------------------------|
| Naive persistence | 27.15 | — |
| Naive weekly | 29.70 | — |
| Best model (d365) | 13.20 | -51% |
| Academic EPF range | — | typically 35-55% |

We’re at 51% improvement over persistence — near the upper end of published academic results. The remaining gap to MAE ~10 requires fundamentally different approaches.
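For reference, persistence forecasts simply repeat the price from a fixed lag (24 hours = same hour yesterday; 168 = same hour last week). A sketch of the improvement metric, assuming hourly series (the 51% figure above comes from the project's own evaluation, not this helper):

```python
import numpy as np

def persistence_improvement(y_true, y_pred, lag=24):
    """Fractional MAE improvement of a model over a naive persistence
    forecast that repeats the actual price from `lag` hours earlier.
    Returns e.g. 0.51 for a 51% MAE reduction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    naive = y_true[:-lag]                 # forecast for hour t is price at t - lag
    mae_naive = np.abs(y_true[lag:] - naive).mean()
    mae_model = np.abs(y_true[lag:] - y_pred[lag:]).mean()
    return float(1.0 - mae_model / mae_naive)
```

A perfect model scores 1.0; a model no better than persistence scores 0.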

Implications for Next Steps

To break below MAE ~13, we need approaches that specifically fix the 80-130 EUR prediction range where 34% of data has MAE 15-22:

  1. Asymmetric loss functions that penalize peak underprediction 3-5x more heavily than overprediction of low prices
  2. Post-hoc quantile recalibration — fit a monotone mapping on OOF predictions to expand the predicted range
  3. Peak-hour auxiliary model — a second model trained only for hours 17-20 with explicit high-price optimization
  4. Monday-specific features — capture the weekend→weekday price transition that causes MAE 20 vs 12
  5. Neural networks (TFT, N-BEATS) — don’t have the tree leaf-averaging compression problem

Incremental tree tuning (more depth, different decay, different quantile) has been exhaustively explored and won’t break through the ~13 floor.
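Option 1 above can be prototyped as a custom gradient-boosting objective. A sketch of an asymmetric squared error in XGBoost's grad/hess convention (the weighting factor, function name, and argument order are illustrative assumptions, not the production loss):

```python
import numpy as np

def asymmetric_l2(alpha=4.0):
    """Build a custom objective: squared error whose weight is `alpha`
    when the model underpredicts (pred < actual) and 1 otherwise, so
    missed peaks cost alpha times more than overshooting low prices.
    Returns per-sample gradient and hessian of the loss w.r.t. y_pred."""
    def objective(y_pred, y_true):
        resid = np.asarray(y_pred, float) - np.asarray(y_true, float)
        w = np.where(resid < 0, alpha, 1.0)   # resid < 0 = underprediction
        grad = 2.0 * w * resid
        hess = 2.0 * w
        return grad, hess
    return objective
```

Whether the asymmetry factor should be 3x or 5x (and how it interacts with the q=0.55 target already in place) would need its own sweep.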

Best Configuration

Model: XGBoost (single model, not ensemble)

Hyperparameters:
  max_depth: 12
  min_child_weight: 5
  learning_rate: 0.03
  reg_lambda: 0.3
  n_estimators: 500
  quantile: 0.55

Training config:
  sample_weight_halflife: 365 (or none for better peak capture)
  winsorize_cap: 200
  feature_selection: enabled