v6.1 — Structural Diagnosis (Scout)
Date: March 18, 2026 | Status: Diagnosis complete — new approach planned
Background
After completing 11 improvement items across 5 versions (v3.0 → v4.3), the EPF model achieved a 13.1% reduction in day-ahead MAE and a 29.2% reduction in bias. However, a persistent pattern remained: every model variant underpredicts high prices by 20-35%, with a negative bias of -10 to -12 EUR/MWh concentrated at evening peak hours (17-21).
A fast scout pipeline (6 experiments, 90-day window, XGBoost-only) was designed to rapidly isolate which configuration changes help most. The results revealed something unexpected: the underprediction is not caused by missing features, wrong loss parameters, or stale training data. It’s a structural property of how gradient boosting trees make predictions.
Fast Scout Experiments
Six configuration variants were backtested against the v4.3 baseline, each isolating a single change:
| Experiment | MAE (EUR/MWh) | Bias (EUR/MWh) | What It Tests |
|---|---|---|---|
| decay180 | 13.41 | -10.65 | Winsorize 200 EUR + time decay 180d halflife |
| q065 | 13.79 | -6.89 | Quantile target raised to 0.65 |
| decay365 | 13.91 | -10.49 | Winsorize 200 EUR + time decay 365d halflife |
| baseline | 14.44 | -10.80 | v4.3 configuration unchanged |
| winsor-only | 14.48 | -9.75 | Winsorize at 200 EUR, no time decay |
| q060 | 14.80 | -9.28 | Quantile target 0.60 |
Key findings:
- Time decay is the best config lever — decay180 improves MAE by 7.1% vs baseline
- Winsorization alone barely moves MAE (14.48 vs 14.44 baseline), strong evidence that crisis data contamination is not the root cause
- Quantile 0.65 improves bias from -10.80 to -6.89 but doesn’t extend the prediction range
- A single XGBoost with decay180 (MAE 13.41) matches the full 3-model v4.1 ensemble (MAE 13.90)
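The decay180 configuration above combines two independent pieces: capping the training target at 200 EUR and weighting samples by an exponential time decay with a 180-day half-life. A minimal sketch of both, with an illustrative helper name (the report does not specify the implementation):

```python
def scout_weights_and_targets(prices_eur, ages_days, cap_eur=200.0, halflife_days=180.0):
    """Winsorize training targets at cap_eur and weight each sample by
    exponential time decay with the given half-life (the decay180 config)."""
    targets = [min(p, cap_eur) for p in prices_eur]
    weights = [0.5 ** (age / halflife_days) for age in ages_days]
    return targets, weights

# A price of 250 EUR is capped at 200; a sample exactly one half-life
# (180 days) old gets weight 0.5, today's sample gets weight 1.0.
targets, weights = scout_weights_and_targets([150.0, 250.0], [0.0, 180.0])
```

The winsor-only row isolates the cap, and decay365 swaps `halflife_days=365.0`; the table shows the half-life, not the cap, is doing the work.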
The Discovery: Range Compression
Analyzing the raw prediction distribution of the scout baseline revealed a striking pattern:
| Metric | Actual | Predicted | Gap |
|---|---|---|---|
| Range | -4 to 200 EUR | -22 to 127 EUR | 73% of actual |
| P75 | 87 EUR | 68 EUR | -22% |
| P95 | 122 EUR | 92 EUR | -24% |
| P99 | 155 EUR | 99 EUR | -36% |
The model has a hard effective ceiling around 125-127 EUR regardless of what actual prices do. At hour 19 (evening peak):
- 42.6% of actual prices exceed 100 EUR
- Only 12.5% of predictions exceed 100 EUR
- When actual prices are 150-200 EUR, the model predicts 101 EUR on average
- The model’s maximum prediction at this hour is 126 EUR — never higher
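The compression table above can be produced by a small diagnostic that compares percentiles of the actual and predicted distributions. A sketch using a simple nearest-rank percentile (in practice `numpy.percentile` would be used; the helper names are illustrative):

```python
def pct(data, q):
    """Nearest-rank percentile, q in [0, 100]."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(q / 100 * (len(s) - 1))))
    return s[k]

def compression_report(actual, predicted, qs=(75, 95, 99)):
    """Per-percentile gap between predicted and actual distributions,
    mirroring the P75/P95/P99 rows of the table above."""
    return {q: {"actual": pct(actual, q),
                "predicted": pct(predicted, q),
                "gap_pct": round(100 * (pct(predicted, q) / pct(actual, q) - 1), 1)}
            for q in qs}

# Synthetic example: a hard ceiling at 127 EUR on a 0-200 EUR actual range
# reproduces a ~-36% gap at P99, matching the pattern observed above.
actual = list(range(201))
predicted = [min(x, 127) for x in actual]
report = compression_report(actual, predicted)
```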
This pattern is identical across all model types (HistGBT, LightGBM, XGBoost) and all configuration variants tested.
Root Cause: Leaf Averaging
Gradient boosting trees work by partitioning training data into terminal leaf nodes, where each leaf predicts the (weighted) average of all training targets that fall into it. This mechanism is inherently mean-regressive at the tails:
- When a leaf contains prices of 80, 100, 130, and 170 EUR, the leaf prediction is ~120 EUR — never 170. The tree is structurally incapable of predicting values near the extremes of its training distribution.
- Ensemble averaging across three models (HistGBT, LightGBM, XGBoost) compounds the compression — each model’s compressed range gets averaged further.
- This is a fundamental algorithm property, not a tuning problem. More features, different loss functions, and post-processing all fail to extend the range because they don’t change the leaf averaging mechanism.
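The first point can be made concrete with a toy leaf. A regression tree leaf predicts the (possibly weighted) mean of the training targets routed to it, so no leaf can predict above the mean of its own contents:

```python
def leaf_prediction(leaf_targets):
    """A regression tree's terminal leaf predicts the mean of the
    training targets that fall into it."""
    return sum(leaf_targets) / len(leaf_targets)

# The example leaf from above: the prediction sits well below the 170 EUR
# maximum. The tree's largest possible output is the mean of its richest
# leaf, which is below the sample maximum unless that leaf is a singleton.
leaf = [80.0, 100.0, 130.0, 170.0]
pred = leaf_prediction(leaf)
```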
What This Explains
This diagnosis retroactively explains several puzzling results from earlier experiments:
| Experiment | Result | Explanation |
|---|---|---|
| +22 features (v4.2) | DA MAE regressed +11.4% | More features → more tree splits → same or worse leaf averaging with smaller leaf populations |
| Peak/off-peak split (v5.0) | DA MAE +10.1% worse | Halved training data → fewer samples per leaf → worse averaging |
| Log-transform (v5.0b) | DA MAE +14.5% worse | Compressed target range further (3.9-5.3) → even more compression on already-compressed trees |
| Isotonic calibration (Sprint A) | Amplified underprediction | Monotone mapping trained on compressed OOF predictions can’t extrapolate beyond training range |
| 3-model ensemble ≈ single XGBoost | Scout confirmed | Ensemble doesn’t add diversity — all three models compress similarly, averaging just adds more compression |
Proposed Solutions
The structural fix is to change what the model predicts rather than how it predicts or what inputs it uses:
1. Residual-from-Baseline Transform (Priority)
Train the model to predict deviation = price - weekly_same_hour_median instead of raw EUR. The weekly median captures ~60-70% of price variation. The residual distribution is roughly symmetric around zero (range ~-60 to +80 EUR) instead of right-skewed (0-200+). Leaf averaging still compresses the tails, but on a roughly symmetric target the compression is two-sided, so it no longer produces a systematic one-sided bias.
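A minimal sketch of the transform, assuming a simple (day, hour)-keyed price history; the data layout and helper names are illustrative, not the planned implementation:

```python
from statistics import median

def weekly_same_hour_baseline(prices_by_day_hour, day, hour, lookback_days=7):
    """Median price at the same hour over the previous lookback_days days
    (the weekly_same_hour_median baseline from item 1)."""
    window = [prices_by_day_hour[(d, hour)]
              for d in range(day - lookback_days, day)
              if (d, hour) in prices_by_day_hour]
    return median(window)

def to_residual(price, baseline):
    return price - baseline           # the model's training target, in EUR

def from_residual(pred_residual, baseline):
    return baseline + pred_residual   # reconstruct a EUR forecast at predict time

# One week of hour-19 prices, then a spike day: the model only has to
# predict "+60 EUR vs typical", not an absolute 160 EUR.
history = {(d, 19): p for d, p in enumerate([100, 110, 90, 105, 95, 120, 100])}
base = weekly_same_hour_baseline(history, day=7, hour=19)
```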
2. Multiplicative Ratio Transform
Predict ratio = price / weekly_same_hour_median. Produces a scale-invariant target centered at 1.0 (range ~0.5-2.0). The model predicts “50% above typical” rather than “140 EUR”.
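The ratio variant is the same idea in multiplicative form. One caveat worth encoding up front: day-ahead prices can be negative (the actual range above starts at -4 EUR), so a baseline near or below zero must be guarded against. A sketch, with an illustrative floor:

```python
def to_ratio(price, baseline, min_baseline=1.0):
    """Multiplicative target: 1.5 means '50% above typical for this hour'.
    The min_baseline floor is an assumption to keep the ratio stable when
    the weekly median approaches zero or goes negative."""
    return price / max(baseline, min_baseline)

def from_ratio(pred_ratio, baseline, min_baseline=1.0):
    return pred_ratio * max(baseline, min_baseline)
```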
3. Tree Depth/Leaf Tuning
Increase max_depth from 8 to 12, reduce min_samples_leaf from 20 to 5. Creates more specialized leaves for extreme conditions. Quick test that can combine with transforms.
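As a config fragment, the proposed deltas look like the following. Note the parameter names differ by library: scikit-learn's HistGradientBoostingRegressor uses `max_depth` / `min_samples_leaf`, LightGBM uses `max_depth` / `min_data_in_leaf`, and XGBoost uses `max_depth` / `min_child_weight`, so the second knob must be translated per model:

```python
# sklearn-style names as written in item 3; translate per library.
BASELINE_TREE = {"max_depth": 8, "min_samples_leaf": 20}
DEEP_LEAVES = {**BASELINE_TREE, "max_depth": 12, "min_samples_leaf": 5}
```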
4. Quantile Ensemble Reconstruction
Train models at q=0.10/0.50/0.90 and reconstruct the conditional mean. The q=0.90 model has explicit permission to predict high values (180+ EUR), providing tail capacity the current q=0.55 model lacks.
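The report does not specify the reconstruction formula; one simple assumed choice is a Tukey-trimean-style 1-2-1 weighting of the three quantile forecasts:

```python
def reconstruct_mean(q10, q50, q90):
    """Trimean-style estimate of the conditional mean from three quantile
    forecasts. The 1-2-1 weights are an illustrative assumption; exact for
    a symmetric distribution, and a common heuristic otherwise."""
    return (q10 + 2 * q50 + q90) / 4

# In a spiky evening hour the q=0.90 model can forecast 180+ EUR and pull
# the combined estimate upward, capacity a single q=0.55 model lacks.
combined = reconstruct_mean(60.0, 100.0, 180.0)
```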
Impact on Roadmap
This diagnosis triggered a major restructuring of the improvement plan:
- New Tier 7 (Target Transforms) added as top priority, with items 7.1-7.4
- Tiers 2-6 deprioritized — they were treating symptoms of the structural issue
- Previous Sprint C (crisis data weighting) reclassified from “root cause fix” to “useful config add-on” — decay180 helps but doesn’t fix the ceiling
- Config tuning ceiling estimated at MAE ~12.5-13.0 — breaking through requires structural changes