v6.1 — Structural Diagnosis (Scout)

Date: March 18, 2026 | Status: Diagnosis complete — new approach planned

Background

After completing 11 improvement items across 5 versions (v3.0 → v4.3), the EPF model achieved a 13.1% reduction in day-ahead MAE and a 29.2% reduction in bias. However, a persistent pattern remained: every model variant underpredicts high prices by 20-35%, with a negative bias of -10 to -12 EUR/MWh concentrated at evening peak hours (17-21).

A fast scout pipeline (6 experiments, 90-day window, XGBoost-only) was designed to rapidly isolate which configuration changes help most. The results revealed something unexpected: the underprediction is not caused by missing features, wrong loss parameters, or stale training data. It’s a structural property of how gradient boosting trees make predictions.

Fast Scout Experiments

Six configuration variants were backtested against the v4.3 baseline, each isolating a single change:

| Experiment | MAE | Bias | What It Tests |
| --- | --- | --- | --- |
| decay180 | 13.41 | -10.65 | Winsorize 200 EUR + time decay, 180d halflife |
| q065 | 13.79 | -6.89 | Quantile target raised to 0.65 |
| decay365 | 13.91 | -10.49 | Winsorize 200 EUR + time decay, 365d halflife |
| baseline | 14.44 | -10.80 | v4.3 configuration unchanged |
| winsor-only | 14.48 | -9.75 | Winsorize at 200 EUR, no time decay |
| q060 | 14.80 | -9.28 | Quantile target 0.60 |

Key findings:

  • Time decay is the best config lever — decay180 improves MAE by 7.1% vs baseline
  • Winsorization alone has no effect — proves crisis data contamination is not the root cause
  • Quantile 0.65 improves bias from -10.80 to -6.89 but doesn’t extend the prediction range
  • A single XGBoost with decay180 (MAE 13.41) matches the full 3-model v4.1 ensemble (MAE 13.90)
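
The two levers in decay180 can be sketched as a sample-weighting step plus a target-clipping step. A minimal sketch; the function names and the exponential weighting form are assumptions, not the scout's exact implementation:

```python
import numpy as np

def decay_weights(days_old, halflife_days=180):
    """Exponential time-decay sample weights: an observation that is
    `halflife_days` old gets half the weight of a fresh one."""
    return 0.5 ** (np.asarray(days_old, dtype=float) / halflife_days)

def winsorize(prices, cap=200.0):
    """Clip the training target at `cap` EUR/MWh before fitting."""
    return np.clip(prices, None, cap)

# A 180-day-old sample gets weight 0.5, a 360-day-old sample 0.25.
weights = decay_weights([0, 180, 360])
```

The weights would be passed to the booster via its `sample_weight` fit argument; winsorization only touches the target, which is why it cannot move predictions on its own.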

The Discovery: Range Compression

Analyzing the raw prediction distribution of the scout baseline revealed a striking pattern:

| Metric | Actual | Predicted | Gap |
| --- | --- | --- | --- |
| Range | -4 to 200 EUR | -22 to 127 EUR | 73% of actual |
| P75 | 87 EUR | 68 EUR | -22% |
| P95 | 122 EUR | 92 EUR | -24% |
| P99 | 155 EUR | 99 EUR | -36% |

The model has a hard effective ceiling around 125-127 EUR regardless of what actual prices do. At hour 19 (evening peak):

  • 42.6% of actual prices exceed 100 EUR
  • Only 12.5% of predictions exceed 100 EUR
  • When actual prices are 150-200 EUR, the model predicts 101 EUR on average
  • The model’s maximum prediction at this hour is 126 EUR — never higher

This pattern is identical across all model types (HistGBT, LightGBM, XGBoost) and all configuration variants tested.

Root Cause: Leaf Averaging

Gradient boosting trees work by partitioning training data into terminal leaf nodes, where each leaf predicts the (weighted) average of all training targets that fall into it. This mechanism is inherently mean-regressive at the tails:

  1. When a leaf contains prices of 80, 100, 130, and 170 EUR, the leaf prediction is ~120 EUR — never 170. The tree is structurally incapable of predicting values near the extremes of its training distribution.

  2. Ensemble averaging across three models (HistGBT, LightGBM, XGBoost) compounds the compression — each model’s compressed range gets averaged further.

  3. This is a fundamental algorithm property, not a tuning problem. More features, different loss functions, and post-processing all fail to extend the range because they don’t change the leaf averaging mechanism.
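
The compression is reproducible in a few lines with a single regression tree on synthetic, price-like data (this is an illustration, not the project's pipeline): every leaf predicts the mean of at least 20 training targets, so the prediction range sits strictly inside the target range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Right-skewed synthetic target resembling hourly prices:
# mostly moderate values with occasional spikes.
X = rng.uniform(0.0, 1.0, size=(2000, 3))
y = 60 + 40 * X[:, 0] + rng.exponential(15.0, size=2000)

tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0).fit(X, y)
pred = tree.predict(X)

# Each leaf averages >= 20 targets, so the tree cannot emit values
# at the extremes of its own training distribution.
print(f"actual max {y.max():.0f} vs predicted max {pred.max():.0f}")
```

Boosting adds many such trees, but each round's leaves still average residuals, so the mechanism (and the ceiling) carries over.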

What This Explains

This diagnosis retroactively explains several puzzling results from earlier experiments:

| Experiment | Result | Explanation |
| --- | --- | --- |
| +22 features (v4.2) | DA MAE regressed +11.4% | More features → more tree splits → same or worse leaf averaging with smaller leaf populations |
| Peak/off-peak split (v5.0) | DA MAE +10.1% worse | Halved training data → fewer samples per leaf → worse averaging |
| Log-transform (v5.0b) | DA MAE +14.5% worse | Compressed target range further (3.9-5.3) → even more compression on already-compressed trees |
| Isotonic calibration (Sprint A) | Amplified underprediction | Monotone mapping trained on compressed OOF predictions can't extrapolate beyond training range |
| 3-model ensemble ≈ single XGBoost | Scout confirmed | Ensemble doesn't add diversity — all three models compress similarly; averaging just adds more compression |

Proposed Solutions

The structural fix is to change what the model predicts rather than how it predicts or what inputs it uses:

1. Residual-from-Baseline Transform (Priority)

Train the model to predict deviation = price - weekly_same_hour_median instead of raw EUR. The weekly median captures ~60-70% of price variation. The residual distribution is roughly symmetric around zero (range ~-60 to +80 EUR) instead of right-skewed (0-200+). Tree leaf averaging on a symmetric distribution doesn’t cause systematic bias.
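
A minimal pandas sketch of the transform, assuming a long-format frame with `timestamp` and `price` columns (both names assumed). The baseline here is a leakage-safe rolling median of the last four same-hour-of-week observations; the project's exact baseline definition may differ:

```python
import pandas as pd

def add_residual_target(df):
    """Add a leakage-safe weekly same-hour baseline and residual target.

    baseline = median of the previous 4 observations at the same
               (day-of-week, hour) slot, using past data only
    residual = price - baseline  (the new training target)
    """
    df = df.sort_values("timestamp").copy()
    df["hour"] = df["timestamp"].dt.hour
    grp = df.groupby([df["timestamp"].dt.dayofweek, "hour"])["price"]
    # shift(1) before rolling so the baseline never sees the current row.
    df["baseline"] = grp.transform(lambda s: s.shift(1).rolling(4).median())
    df["residual"] = df["price"] - df["baseline"]
    return df
```

At inference time the model predicts the residual and the price forecast is `baseline + residual`, so the baseline carries the level while the trees only need to cover a roughly symmetric band around zero.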

2. Multiplicative Ratio Transform

Predict ratio = price / weekly_same_hour_median. Produces a scale-invariant target centered at 1.0 (range ~0.5-2.0). The model predicts “50% above typical” rather than “140 EUR”.
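
The only moving part beyond the weekly-median baseline of option 1 is the forward/inverse pair (function names hypothetical):

```python
def to_ratio(price, baseline):
    """Scale-invariant target: 1.5 means '50% above typical'."""
    return price / baseline

def from_ratio(predicted_ratio, baseline):
    """Invert the transform at inference time."""
    return predicted_ratio * baseline

# A 210 EUR price against a 140 EUR weekly median is a ratio of 1.5.
r = to_ratio(210.0, 140.0)
```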

3. Tree Depth/Leaf Tuning

Increase max_depth from 8 to 12, reduce min_samples_leaf from 20 to 5. Creates more specialized leaves for extreme conditions. Quick test that can combine with transforms.

4. Quantile Ensemble Reconstruction

Train models at q=0.10/0.50/0.90 and reconstruct the conditional mean. The q=0.90 model has explicit permission to predict high values (180+ EUR), providing tail capacity the current q=0.55 model lacks.

Impact on Roadmap

This diagnosis triggered a major restructuring of the improvement plan:

  • New Tier 7 (Target Transforms) added as top priority, with items 7.1-7.4
  • Tiers 2-6 deprioritized — they were treating symptoms of the structural issue
  • Previous Sprint C (crisis data weighting) reclassified from “root cause fix” to “useful config add-on” — decay180 helps but doesn’t fix the ceiling
  • Config tuning ceiling estimated at MAE ~12.5-13.0 — breaking through requires structural changes