v6.1 — Structural Diagnosis (Scout)
Date: March 18, 2026 | Status: Diagnosis complete — new approach planned
Background
After completing 11 improvement items across 5 versions (v3.0 → v4.3), the EPF model achieved a 13.1% reduction in day-ahead MAE and a 29.2% reduction in bias. However, a persistent pattern remained: every model variant underpredicts high prices by 20-35%, with a negative bias of -10 to -12 EUR/MWh concentrated at evening peak hours (17-21).
A fast scout pipeline (6 experiments, 90-day window, XGBoost-only) was designed to rapidly isolate which configuration changes help most. The results revealed something unexpected: the underprediction is not caused by missing features, wrong loss parameters, or stale training data. It’s a structural property of how gradient boosting trees make predictions.
Fast Scout Experiments
Six configuration variants were backtested against the v4.3 baseline, each isolating a single change:
| Experiment | MAE (EUR/MWh) | Bias (EUR/MWh) | What It Tests |
|---|---|---|---|
| decay180 | 13.41 | -10.65 | Winsorize 200 EUR + time decay 180d halflife |
| q065 | 13.79 | -6.89 | Quantile target raised to 0.65 |
| decay365 | 13.91 | -10.49 | Winsorize 200 EUR + time decay 365d halflife |
| baseline | 14.44 | -10.80 | v4.3 configuration unchanged |
| winsor-only | 14.48 | -9.75 | Winsorize at 200 EUR, no time decay |
| q060 | 14.80 | -9.28 | Quantile target 0.60 |
Key findings:
- Time decay is the best config lever — decay180 improves MAE by 7.1% vs baseline
- Winsorization alone barely moves MAE (14.48 vs 14.44 baseline), strong evidence that crisis data contamination is not the root cause
- Quantile 0.65 improves bias from -10.80 to -6.89 but doesn’t extend the prediction range
- A single XGBoost with decay180 (MAE 13.41) matches the full 3-model v4.1 ensemble (MAE 13.90)
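The decay180 configuration above combines two independent pieces: capping the training target at 200 EUR and weighting samples by an exponential time decay with a 180-day half-life. A minimal sketch of both, with an illustrative helper name (the report does not specify the implementation):

```python
def scout_weights_and_targets(prices_eur, ages_days, cap_eur=200.0, halflife_days=180.0):
    """Winsorize training targets at cap_eur and weight each sample by
    exponential time decay with the given half-life (the decay180 config)."""
    targets = [min(p, cap_eur) for p in prices_eur]
    weights = [0.5 ** (age / halflife_days) for age in ages_days]
    return targets, weights

# A price of 250 EUR is capped at 200; a sample exactly one half-life
# (180 days) old gets weight 0.5, today's sample gets weight 1.0.
targets, weights = scout_weights_and_targets([150.0, 250.0], [0.0, 180.0])
```

The winsor-only row isolates the cap, and decay365 swaps `halflife_days=365.0`; the table shows the half-life, not the cap, is doing the work.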
The Discovery: Range Compression
Analyzing the raw prediction distribution of the scout baseline revealed a striking pattern:
| Metric | Actual | Predicted | Gap |
|---|---|---|---|
| Range | -4 to 200 EUR | -22 to 127 EUR | 73% of actual |
| P75 | 87 EUR | 68 EUR | -22% |
| P95 | 122 EUR | 92 EUR | -24% |
| P99 | 155 EUR | 99 EUR | -36% |
The model has a hard effective ceiling around 125-127 EUR regardless of what actual prices do. At hour 19 (evening peak):
- 42.6% of actual prices exceed 100 EUR
- Only 12.5% of predictions exceed 100 EUR
- When actual prices are 150-200 EUR, the model predicts 101 EUR on average
- The model’s maximum prediction at this hour is 126 EUR — never higher
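The compression table above can be produced by a small diagnostic that compares percentiles of the actual and predicted distributions. A sketch using a simple nearest-rank percentile (in practice `numpy.percentile` would be used; the helper names are illustrative):

```python
def pct(data, q):
    """Nearest-rank percentile, q in [0, 100]."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(q / 100 * (len(s) - 1))))
    return s[k]

def compression_report(actual, predicted, qs=(75, 95, 99)):
    """Per-percentile gap between predicted and actual distributions,
    mirroring the P75/P95/P99 rows of the table above."""
    return {q: {"actual": pct(actual, q),
                "predicted": pct(predicted, q),
                "gap_pct": round(100 * (pct(predicted, q) / pct(actual, q) - 1), 1)}
            for q in qs}

# Synthetic example: a hard ceiling at 127 EUR on a 0-200 EUR actual range
# reproduces a ~-36% gap at P99, matching the pattern observed above.
actual = list(range(201))
predicted = [min(x, 127) for x in actual]
report = compression_report(actual, predicted)
```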
This pattern is identical across all model types (HistGBT, LightGBM, XGBoost) and all configuration variants tested.
Root Cause: Leaf Averaging
Gradient boosting trees work by partitioning training data into terminal leaf nodes, where each leaf predicts the (weighted) average of all training targets that fall into it. This mechanism is inherently mean-regressive at the tails:
- When a leaf contains prices of 80, 100, 130, and 170 EUR, the leaf prediction is ~120 EUR — never 170. The tree is structurally incapable of predicting values near the extremes of its training distribution.
- Ensemble averaging across three models (HistGBT, LightGBM, XGBoost) compounds the compression — each model’s compressed range gets averaged further.
- This is a fundamental algorithm property, not a tuning problem. More features, different loss functions, and post-processing all fail to extend the range because they don’t change the leaf averaging mechanism.
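The first point can be made concrete with a toy leaf. A regression tree leaf predicts the (possibly weighted) mean of the training targets routed to it, so no leaf can predict above the mean of its own contents:

```python
def leaf_prediction(leaf_targets):
    """A regression tree's terminal leaf predicts the mean of the
    training targets that fall into it."""
    return sum(leaf_targets) / len(leaf_targets)

# The example leaf from above: the prediction sits well below the 170 EUR
# maximum. The tree's largest possible output is the mean of its richest
# leaf, which is below the sample maximum unless that leaf is a singleton.
leaf = [80.0, 100.0, 130.0, 170.0]
pred = leaf_prediction(leaf)
```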
What This Explains
This diagnosis retroactively explains several puzzling results from earlier experiments:
| Experiment | Result | Explanation |
|---|---|---|
| +22 features (v4.2) | DA MAE regressed +11.4% | More features → more tree splits → same or worse leaf averaging with smaller leaf populations |
| Peak/off-peak split (v5.0) | DA MAE +10.1% worse | Halved training data → fewer samples per leaf → worse averaging |
| Log-transform (v5.0b) | DA MAE +14.5% worse | Compressed target range further (3.9-5.3) → even more compression on already-compressed trees |
| Isotonic calibration (Sprint A) | Amplified underprediction | Monotone mapping trained on compressed OOF predictions can’t extrapolate beyond training range |
| 3-model ensemble ≈ single XGBoost | Scout confirmed | Ensemble doesn’t add diversity — all three models compress similarly, averaging just adds more compression |
Proposed Solutions
The structural fix is to change what the model predicts rather than how it predicts or what inputs it uses:
1. Residual-from-Baseline Transform (Priority)
Train the model to predict deviation = price - weekly_same_hour_median instead of raw EUR. The weekly median captures ~60-70% of price variation. The residual distribution is roughly symmetric around zero (range ~-60 to +80 EUR) instead of right-skewed (0-200+). Leaf averaging still compresses the tails, but on a roughly symmetric target the compression is two-sided, so it no longer produces a systematic one-sided bias.
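A minimal sketch of the transform, assuming a simple (day, hour)-keyed price history; the data layout and helper names are illustrative, not the planned implementation:

```python
from statistics import median

def weekly_same_hour_baseline(prices_by_day_hour, day, hour, lookback_days=7):
    """Median price at the same hour over the previous lookback_days days
    (the weekly_same_hour_median baseline from item 1)."""
    window = [prices_by_day_hour[(d, hour)]
              for d in range(day - lookback_days, day)
              if (d, hour) in prices_by_day_hour]
    return median(window)

def to_residual(price, baseline):
    return price - baseline           # the model's training target, in EUR

def from_residual(pred_residual, baseline):
    return baseline + pred_residual   # reconstruct a EUR forecast at predict time

# One week of hour-19 prices, then a spike day: the model only has to
# predict "+60 EUR vs typical", not an absolute 160 EUR.
history = {(d, 19): p for d, p in enumerate([100, 110, 90, 105, 95, 120, 100])}
base = weekly_same_hour_baseline(history, day=7, hour=19)
```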
2. Multiplicative Ratio Transform
Predict ratio = price / weekly_same_hour_median. Produces a scale-invariant target centered at 1.0 (range ~0.5-2.0). The model predicts “50% above typical” rather than “140 EUR”.
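The ratio variant is the same idea in multiplicative form. One caveat worth encoding up front: day-ahead prices can be negative (the actual range above starts at -4 EUR), so a baseline near or below zero must be guarded against. A sketch, with an illustrative floor:

```python
def to_ratio(price, baseline, min_baseline=1.0):
    """Multiplicative target: 1.5 means '50% above typical for this hour'.
    The min_baseline floor is an assumption to keep the ratio stable when
    the weekly median approaches zero or goes negative."""
    return price / max(baseline, min_baseline)

def from_ratio(pred_ratio, baseline, min_baseline=1.0):
    return pred_ratio * max(baseline, min_baseline)
```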
3. Tree Depth/Leaf Tuning
Increase max_depth from 8 to 12, reduce min_samples_leaf from 20 to 5. Creates more specialized leaves for extreme conditions. Quick test that can combine with transforms.
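As a config fragment, the proposed deltas look like the following. Note the parameter names differ by library: scikit-learn's HistGradientBoostingRegressor uses `max_depth` / `min_samples_leaf`, LightGBM uses `max_depth` / `min_data_in_leaf`, and XGBoost uses `max_depth` / `min_child_weight`, so the second knob must be translated per model:

```python
# sklearn-style names as written in item 3; translate per library.
BASELINE_TREE = {"max_depth": 8, "min_samples_leaf": 20}
DEEP_LEAVES = {**BASELINE_TREE, "max_depth": 12, "min_samples_leaf": 5}
```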
4. Quantile Ensemble Reconstruction
Train models at q=0.10/0.50/0.90 and reconstruct the conditional mean. The q=0.90 model has explicit permission to predict high values (180+ EUR), providing tail capacity the current q=0.55 model lacks.
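The report does not specify the reconstruction formula; one simple assumed choice is a Tukey-trimean-style 1-2-1 weighting of the three quantile forecasts:

```python
def reconstruct_mean(q10, q50, q90):
    """Trimean-style estimate of the conditional mean from three quantile
    forecasts. The 1-2-1 weights are an illustrative assumption; exact for
    a symmetric distribution, and a common heuristic otherwise."""
    return (q10 + 2 * q50 + q90) / 4

# In a spiky evening hour the q=0.90 model can forecast 180+ EUR and pull
# the combined estimate upward, capacity a single q=0.55 model lacks.
combined = reconstruct_mean(60.0, 100.0, 180.0)
```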
Impact on Roadmap
This diagnosis triggered a major restructuring of the improvement plan:
- New Tier 7 (Target Transforms) added as top priority, with items 7.1-7.4
- Tiers 2-6 deprioritized — they were treating symptoms of the structural issue
- Previous Sprint C (crisis data weighting) reclassified from “root cause fix” to “useful config add-on” — decay180 helps but doesn’t fix the ceiling
- Config tuning ceiling estimated at MAE ~12.5-13.0 — breaking through requires structural changes