v6.3 — Deep Trees 150-Day Validation (Validated)
Date: March 19, 2026 | Status: Validated — DA-only, pending strategic validation
Summary
Following the v6.2 scout campaign that identified deep trees (depth=12) as a breakthrough, we ran 17 experiments on the full 150-day window (Oct 2025 – Mar 2026) to validate the finding across model types, decay rates, and quantile targets. Combined with the 24 scout experiments (v6.1 + v6.2), this brings the campaign to 41 experiments in total — the most extensive hyperparameter exploration in the project’s history.
Best result: XGBoost depth=12 + decay365 + q=0.55 — MAE 13.20 (-8.8% vs production)
Bias Sign Correction
An important correction from the scout reports: the bias was reported with the wrong sign convention. All models underpredict (negative bias), consistent with the original structural diagnosis. The scout scripts computed mean(actual - predicted) instead of mean(predicted - actual). Deep trees reduced underprediction from -11.5 (baseline) to -7.0 (best), but did NOT flip it positive.
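To keep the convention unambiguous going forward, here is a minimal sketch of the bias metric as this report defines it (mean of predicted minus actual, so underprediction is negative). The function name `bias` is illustrative, not the actual scout-script code:

```python
import numpy as np

def bias(predicted, actual):
    """Bias as used in this report: mean(predicted - actual).
    Negative => the model underpredicts. The scout scripts computed
    mean(actual - predicted), which flipped the reported sign."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(predicted - actual))

# An underpredicting model (forecasts below actuals) yields negative bias:
print(bias([90.0, 110.0, 75.0], [100.0, 120.0, 80.0]))  # -8.33...
```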
Validated Results (150-day window)
XGBoost Deep Trees — Decay Sweep
| Decay | MAE | Bias | Eve MAE (h17-21) | Spike Recall | Dir Accuracy |
|---|---|---|---|---|---|
| 365d | 13.20 | -7.03 | 19.42 | 56.5% | 66.8% |
| 180d | 13.74 | -8.65 | 21.50 | 47.4% | 68.3% |
| none | 13.83 | -5.99 | 18.21 | 61.4% | 63.3% |
| 270d | 13.91 | -7.86 | 20.68 | 50.5% | 67.5% |
Finding: Decay365 wins on MAE (13.20), while no-decay wins on peaks (eve MAE 18.21, spike recall 61.4%). Peak capture degrades monotonically as the halflife shortens (eve MAE rises 18.21 → 19.42 → 20.68 → 21.50 from no decay to 180d), so the choice is a direct trade-off between overall accuracy and peak capture.
Important: On 90-day scouts, d180 beat d365. On 150-day, d365 wins. Scout results on decay halflife can be misleading — always validate on longer windows.
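For reference, a sketch of halflife-based time-decay sample weights. The exact production weighting is assumed; a common convention, used here, halves a sample's weight for every `halflife` days of age:

```python
import numpy as np

def decay_weights(age_days, halflife):
    """Exponential time-decay sample weights: a sample `halflife` days
    old gets half the weight of a fresh sample. Sketch of the
    sample_weight_halflife setting; exact production formula assumed."""
    age_days = np.asarray(age_days, dtype=float)
    return 0.5 ** (age_days / halflife)

# With halflife=365, a one-year-old sample carries half weight,
# a two-year-old sample a quarter:
print(decay_weights([0, 365, 730], 365))  # [1.0, 0.5, 0.25]
```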
XGBoost Deep Trees — Quantile Sweep
| Quantile | MAE | Bias |
|---|---|---|
| 0.55 | 13.83 | -5.99 |
| 0.52 | 14.10 | -7.71 |
| 0.50 (MAE) | 14.52 | -8.06 |
Finding: Quantile 0.55 is the best of the values tested: both lower values produce worse MAE and more negative bias. This was tested both with and without decay, with consistent results.
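The mechanism behind the quantile sweep is the pinball (quantile) loss: at q = 0.55 underprediction is penalized slightly more than overprediction, which nudges the fitted model upward and counters the underprediction bias. A minimal numpy sketch (not the training objective actually used in the pipeline):

```python
import numpy as np

def pinball_loss(predicted, actual, q):
    """Quantile (pinball) loss. For q > 0.5, underprediction
    (actual > predicted) costs more than overprediction of the
    same size, pushing the fitted model above the plain-MAE fit."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.where(err >= 0, q * err, (q - 1) * err)))

# At q=0.55, underpredicting by 10 costs more than overpredicting by 10:
print(pinball_loss([90], [100], 0.55))   # 5.5
print(pinball_loss([110], [100], 0.55))  # 4.5
```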
Cross-Model Comparison (all depth=12)
| Model | No Decay MAE | With Decay MAE | Conclusion |
|---|---|---|---|
| XGBoost | 13.83 | 13.20 (d365) | Clear winner — histogram splitting handles deep trees efficiently |
| LightGBM | 15.52 | 14.27 (d180) | Can’t exploit deep trees; worse than XGB baseline |
| HistGB | 17.91 | 16.53 (d180) | Incompatible — severe degradation with depth=12 |
Finding: Deep trees are an XGBoost-specific improvement. LightGBM and HistGB both degrade significantly with depth=12. The current 3-model ensemble (HistGB + LGB + XGB, all depth=8) averages a weak model with a strong one — a single XGB depth=12 outperforms the ensemble.
Other Findings
- 800 trees (vs 500): MAE 14.17 vs 13.83 — more trees didn’t help, current regularization is calibrated for 500
- Depth 10 (vs 12): MAE 14.47 vs 13.83 — intermediate depth is decent but 12 is clearly better
- Feature selection: Essential — removing it costs +2 EUR/MWh (overfitting with deep trees + many features)
Compression Analysis: Why MAE Can’t Break Below ~13
The most important finding from this campaign isn’t the winning configuration — it’s understanding the fundamental barrier.
Regression Slope ≈ 0.70
Every model, regardless of depth, decay, or quantile, has a regression slope of predictions on actuals between 0.64 and 0.72. This means predictions capture only 64-72% of actual price variation. When actual prices are 130 EUR, the best model predicts ~101 EUR.
Error by Price Level
| Price Range | % of Data | MAE (EUR/MWh) | Bias (EUR/MWh) | % Underpredicted |
|---|---|---|---|---|
| < 20 EUR | 26.9% | 5-10 | +5 to +9 | 20-26% ← overpredicts |
| 20-60 EUR | 21.3% | 7-8 | ±2-4 | 46-68% ← balanced |
| 60-100 EUR | 32.2% | 10-15 | -9 to -15 | 89-97% |
| 100-130 EUR | 16.1% | 22 | -22 | 99.7% |
| 130+ EUR | 3.5% | 45 | -45 | 100% |
The model compresses toward the center: overpredicts low prices, severely underpredicts high prices. The 80-130 EUR range (34% of data) alone drives ~40% of total error.
Error Concentration
- Hours 17-20 (evening peak): Bias -15 to -22 EUR/MWh. This 4-hour window dominates total error.
- Mondays: MAE 20.18 vs 12-13 for other weekdays. Weekend→weekday price jumps are poorly predicted.
- December-January: MAE 14.6-15.8 (high prices + volatility). February was easy (MAE 7.4, avg price just 16.5 EUR).
Naive Benchmarks
| Method | MAE | Model Improvement |
|---|---|---|
| Naive persistence | 27.15 | — |
| Naive weekly | 29.70 | — |
| Best model (d365) | 13.20 | -51% vs persistence |
| Academic EPF range | — | typically 35-55% vs persistence |
We’re at 51% improvement over persistence — near the upper end of published academic results. The remaining gap to MAE ~10 requires fundamentally different approaches.
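For reproducibility, a sketch of the persistence benchmark, assuming it forecasts each hour with the price at the same hour the previous day (lag 24 h; the weekly naive would use lag 168 h). Whether the report's benchmark uses exactly this lag is an assumption:

```python
import numpy as np

def persistence_mae(prices, lag_hours=24):
    """Naive persistence benchmark on an hourly price series:
    forecast = price lag_hours earlier (24 = daily, 168 = weekly).
    Assumed to match the benchmark definition used in this report."""
    prices = np.asarray(prices, dtype=float)
    forecast = prices[:-lag_hours]
    actual = prices[lag_hours:]
    return float(np.mean(np.abs(actual - forecast)))

# Tiny toy series with lag 2 for illustration:
prices = [50.0, 60.0, 55.0, 52.0, 70.0, 58.0]
print(persistence_mae(prices, lag_hours=2))  # 8.5
```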
Implications for Next Steps
To break below MAE ~13, we need approaches that specifically fix the 80-130 EUR prediction range where 34% of data has MAE 15-22:
- Asymmetric loss functions that penalize peak underprediction 3-5x more heavily than overprediction of low prices
- Post-hoc quantile recalibration — fit a monotone mapping on OOF predictions to expand the predicted range
- Peak-hour auxiliary model — a second model trained only for hours 17-20 with explicit high-price optimization
- Monday-specific features — capture the weekend→weekday price transition that causes MAE 20 vs 12
- Neural networks (TFT, N-BEATS) — don’t have the tree leaf-averaging compression problem
Incremental tree tuning (more depth, different decay, different quantile) has been exhaustively explored and won’t break through the ~13 floor.
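Of the proposed directions, post-hoc quantile recalibration is the cheapest to prototype: fit a monotone mapping from out-of-fold predictions to actuals, then apply it at inference to expand the compressed range. A self-contained sketch using the pool-adjacent-violators algorithm (PAVA) for the isotonic fit; this is a proposal, not code that exists in the pipeline:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: best non-decreasing fit to y (L2)."""
    blocks = []  # list of [mean, weight] blocks
    for v in map(float, y):
        blocks.append([v, 1.0])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for m, w in blocks:
        out.extend([m] * int(w))
    return np.array(out)

def recalibrate(oof_pred, oof_actual, new_pred):
    """Fit a monotone mapping pred -> actual on out-of-fold data,
    then apply it to new predictions by interpolation (np.interp
    clamps outside the OOF prediction range). Sketch of the proposed
    quantile-recalibration step, not production code."""
    order = np.argsort(oof_pred)
    x = np.asarray(oof_pred, dtype=float)[order]
    y_fit = pava(np.asarray(oof_actual, dtype=float)[order])
    return np.interp(new_pred, x, y_fit)

# Example: compressed OOF predictions mapped back to the actual scale.
oof_pred = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
oof_actual = np.array([10.0, 40.0, 70.0, 100.0, 130.0])
print(recalibrate(oof_pred, oof_actual, np.array([100.0])))  # [130.]
```

Because the mapping is fit on out-of-fold predictions, it corrects the systematic compression without leaking training data into the calibration.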
Best Configuration
Model: XGBoost (single model, not ensemble)

Hyperparameters:
- max_depth: 12
- min_child_weight: 5
- learning_rate: 0.03
- reg_lambda: 0.3
- n_estimators: 500
- quantile: 0.55

Training config:
- sample_weight_halflife: 365 (or none for better peak capture)
- winsorize_cap: 200
- feature_selection: enabled
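The same configuration as a plain Python dict, handy for diffing against run logs. Hyperparameter names follow XGBoost's scikit-learn API; the last four keys are pipeline-level settings named in this report, and how they are plumbed into training is assumed:

```python
# Winning v6.3 configuration. XGBoost keys use the xgboost sklearn API
# names; quantile / sample_weight_halflife / winsorize_cap /
# feature_selection are pipeline-level settings (exact plumbing assumed).
BEST_CONFIG = {
    "model": "xgboost",                # single model, not the 3-model ensemble
    "max_depth": 12,
    "min_child_weight": 5,
    "learning_rate": 0.03,
    "reg_lambda": 0.3,
    "n_estimators": 500,
    "quantile": 0.55,                  # pinball-loss target quantile
    "sample_weight_halflife": 365,     # or None for better peak capture
    "winsorize_cap": 200,              # EUR/MWh cap on training targets
    "feature_selection": True,         # skipping it costs ~+2 EUR/MWh
}
print(BEST_CONFIG["max_depth"], BEST_CONFIG["quantile"])  # 12 0.55
```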