
v6.2 — Structural Scout Campaign (Scout)

Date: March 18, 2026 | Status: Scout complete — medium-scope validation pending

Background

Following the structural diagnosis that identified gradient boosting tree leaf averaging as the root cause of persistent underprediction, we ran 24 fast scout experiments in a single day to test structural approaches. Each scout used a single model type, day-ahead only, hybrid15 approach, and a 90-day backtest window (Dec 18, 2025 – Mar 18, 2026).

The experiments were organized into two phases:

  • Config scouts (11 experiments): Loss functions, quantile targets, time decay, winsorization, feature additions
  • Structural scouts (13 experiments): Target transforms, CatBoost, tree depth tuning, and combinations

Phase 1: Config Scouts

Testing configuration and parameter changes with default XGBoost (depth=8).

| # | Experiment | MAE (EUR/MWh) | Bias | vs Baseline | Finding |
|---|------------|---------------|------|-------------|---------|
| 1 | decay180 | 13.41 | -10.65 | -7.1% | Best config — winsor 200 + decay 180d halflife |
| 2 | decay-only | 13.74 | +10.01 | -4.8% | Decay without winsor — confirms decay is the lever |
| 3 | q065 | 13.79 | -6.89 | -4.5% | Best bias but worse at low prices |
| 4 | decay365 | 13.91 | -10.49 | -3.7% | Sprint C config, 365d halflife |
| 5 | baseline | 14.44 | -10.80 | ref | v4.3 production config, XGBoost only |
| 6 | fourier23 | 14.44 | +10.80 | 0% | 2nd/3rd Fourier harmonics — zero effect |
| 7 | curtailment | 14.44 | +10.80 | 0% | Renewable curtailment proxy — zero effect |
| 8 | winsor-only | 14.48 | -9.75 | +0.3% | Winsorization alone — no effect |
| 9 | q060 | 14.80 | -9.28 | +2.5% | Quantile 0.60 — worse than 0.55 |
| 10 | sqrt-transform | 15.55 | +12.55 | +7.7% | sqrt(price+10) — compresses range further |
| 11 | rolling18m | 15.72 | +8.62 | +8.9% | 18-month rolling window — loses useful history |

Config Scout Conclusions

  • Time decay (halflife 180d) is the only effective config lever, reducing MAE by 7.1%
  • Winsorization alone has essentially no effect (+0.3% MAE) — crisis-data contamination is not the root cause
  • Additional features (Fourier, curtailment) have no impact — the model already has sufficient features
  • Quantile 0.55 is correct — both 0.60 (worse) and 0.65 (trade-offs) don’t improve overall MAE
  • Config tuning ceiling estimated at ~13 EUR/MWh — breaking through requires structural changes
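The 180-day-halflife decay that drives these gains can be sketched as exponential sample weighting (the function name is illustrative, not the campaign's actual code):

```python
def decay_weights(ages_days, halflife_days=180.0):
    """Exponential time-decay sample weights: a sample loses half its
    weight for every `halflife_days` of age at the training cutoff."""
    return [0.5 ** (age / halflife_days) for age in ages_days]

# A 180-day-old sample gets half the weight of a fresh one,
# and a 360-day-old sample a quarter.
w = decay_weights([0, 180, 360])
```

The weights would then be passed to the booster's `fit(..., sample_weight=w)` so recent market regimes dominate without discarding older history outright (unlike the rolling18m scout, which did worse).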

Phase 2: Structural Scouts

Testing fundamentally different approaches to the prediction task.

Category 1: Target Transforms — Residual from Baseline

Train the model to predict deviation = price - baseline instead of raw EUR, then add the baseline back at prediction time.
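A minimal sketch of that round-trip (helper names are illustrative, not the campaign's actual code):

```python
def to_residual_target(prices, baselines):
    """Training target: deviation from the baseline instead of raw EUR."""
    return [p - b for p, b in zip(prices, baselines)]

def from_residual_pred(residual_preds, baselines):
    """Add the baseline back at prediction time to recover EUR/MWh."""
    return [r + b for r, b in zip(residual_preds, baselines)]
```

The transform is lossless by construction; whether it helps depends entirely on how well the baseline is aligned with the target hour, which is exactly where these scouts failed.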

| Experiment | Baseline | Loss | MAE | Bias | Range |
|------------|----------|------|-----|------|-------|
| resid-168h-mae | price_lag_168h (origin-time) | MAE | 15.84 | +7.94 | 0.720 |
| resid-168h-q055 | price_lag_168h (origin-time) | Q=0.55 | 15.39 | +9.37 | 0.738 |
| resid-4wavg-mae | price_same_weekday_4w_avg | MAE | 17.04 | +8.71 | 0.685 |

Verdict: Failed. All residual experiments performed worse than baseline. The core problem: price_lag_168h is the price at origin_dt - 7 days, not target_dt - 7 days. For day-ahead predictions (targets 14-37 hours ahead), this misalignment means the residual contains within-day price variation that the model must re-learn, adding noise rather than removing it. The 4-week average was even worse because it averages over a longer, more variable window.

A target_baseline_1w feature (price at target hour minus 7 days) was added to the codebase for future testing but wasn’t evaluated in this campaign due to time constraints.
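To make the misalignment concrete, here is a toy illustration (timestamps and prices are invented, and `lag_168h` is a hypothetical helper) of how an origin-aligned lag and a target-aligned baseline pick out different hours:

```python
from datetime import datetime, timedelta

def lag_168h(price_history, anchor_dt):
    """Price exactly 168 hours before the anchor timestamp."""
    return price_history[anchor_dt - timedelta(hours=168)]

origin_dt = datetime(2026, 3, 7, 12)   # forecast issued at noon
target_dt = datetime(2026, 3, 8, 18)   # predicting tomorrow's evening peak, 30 h ahead

# Toy price history: one noon price and one evening-peak price a week back
history = {
    origin_dt - timedelta(hours=168): 60.0,    # noon, a week before origin
    target_dt - timedelta(hours=168): 115.0,   # evening peak, a week before target
}

origin_aligned = lag_168h(history, origin_dt)  # what price_lag_168h sees
target_aligned = lag_168h(history, target_dt)  # what target_baseline_1w would see
```

The origin-aligned lag hands the model a noon price as the "baseline" for an evening-peak target, so the residual still contains the full within-day shape the transform was meant to remove.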

Category 2: CatBoost with Ordered Boosting

CatBoost uses a fundamentally different tree construction algorithm. Its “ordered boosting” prevents target leakage that causes standard GBT to regress toward the mean, and its symmetric trees have different bias properties.
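Assuming a standard `catboost.CatBoostRegressor` setup, the three scouts below would differ only in the loss string; this is a sketch, not the campaign's training code, and the iteration count and depth here are assumptions:

```python
def catboost_scout_params(loss):
    """Parameter dict for one CatBoost scout; only the loss varies.

    Intended for catboost.CatBoostRegressor(**params). The loss strings
    follow CatBoost's "Quantile:alpha=..." convention.
    """
    assert loss in {"MAE", "Quantile:alpha=0.55", "Quantile:alpha=0.45"}
    return {
        "loss_function": loss,
        "boosting_type": "Ordered",  # request ordered boosting explicitly
        "iterations": 500,           # assumption, mirroring the XGBoost setup
        "depth": 8,                  # assumption: match the config-scout depth
        "verbose": False,
    }
```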

| Experiment | Loss | MAE | Bias | Range | Max Pred |
|------------|------|-----|------|-------|----------|
| catboost-q055 | Quantile 0.55 | 14.92 | +9.41 | 0.782 | 137 |
| catboost-mae | MAE | 15.13 | +10.50 | 0.784 | 139 |
| catboost-q045 | Quantile 0.45 | 15.57 | +11.37 | 0.755 | 132 |

Verdict: Promising structurally, not competitive on MAE. CatBoost consistently produced better range ratios (0.78 vs XGBoost’s 0.73) and higher maximum predictions (137-139 vs 127). This confirms ordered boosting does reduce range compression. However, MAE was 3-8% worse than XGBoost baseline, with a persistent strong positive bias (+9 to +11). Lowering the quantile to 0.45 made things worse, not better.

CatBoost may be valuable as an ensemble member (different bias profile from XGBoost), but is not a standalone replacement.

Category 3: Multiplicative Ratio Transform

Train on price / max(baseline, 5.0), producing a scale-invariant ratio centered at ~1.0, then multiply back.
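The forward and inverse transform can be sketched as (helper names are illustrative):

```python
def to_ratio_target(prices, baselines, floor=5.0):
    """Scale-invariant training target centered near 1.0; the floor
    guards against division blow-ups when the baseline is near zero."""
    return [p / max(b, floor) for p, b in zip(prices, baselines)]

def from_ratio_pred(ratio_preds, baselines, floor=5.0):
    """Multiply the predicted ratio back onto the baseline at prediction time."""
    return [r * max(b, floor) for r, b in zip(ratio_preds, baselines)]
```

Note the multiplicative inverse is also why baseline errors are amplified: a 20% error in the baseline scales the whole prediction by 20%, which matches the poor MAE observed below.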

| Experiment | Loss | MAE | Bias | Range | Max Pred |
|------------|------|-----|------|-------|----------|
| ratio-168h-q055 | Q=0.55 | 16.60 | +5.15 | 0.798 | 145 |
| ratio-168h-mae | MAE | 17.43 | +10.37 | 0.764 | 137 |

Verdict: Best range extension but worst MAE. The ratio transform produced the best range ratio (0.798) and highest max prediction (145) of any experiment — the model CAN predict extreme values through multiplication. But overall MAE was 15-20% worse because the origin-time baseline misalignment amplifies errors multiplicatively. This approach deserves retesting with a target-aligned baseline.

Category 4: Deep Trees (The Breakthrough)

Increase XGBoost tree depth from 8 to 12 and reduce minimum leaf size from 20 to 5, with compensating regularization (lower learning rate 0.03, higher L2 penalty 0.3).

| Experiment | Config Additions | MAE | Bias | Range | Feat Sel. |
|------------|------------------|-----|------|-------|-----------|
| deep-trees-d180 | + decay180 + Q=0.55 | 12.55 | +7.83 | 0.738 | Yes |
| deep-trees | Q=0.55 only | 12.99 | +5.41 | 0.739 | Yes |
| deep-trees-mae | MAE loss only | 13.41 | +7.51 | 0.729 | Yes |
| deep-trees-q045 | Q=0.45 only | 14.02 | +7.99 | 0.706 | Yes |
| deep-trees-d180-nofs | + decay180 + Q=0.55 | 14.77 | +5.73 | 0.698 | No |
| deep-trees-nofs | Q=0.55 only | 15.11 | +2.81 | 0.704 | No |

Verdict: Clear winner. Deep trees produce the best MAE ever recorded for the EPF model:

  • MAE 12.55 with decay180 — a 13.3% improvement over production (14.47) and 6.4% better than the previous best config scout (13.41)
  • Deep trees alone (no decay) achieve MAE 12.99 — already 10.2% better than production
  • Quantile 0.55 is optimal — MAE loss (13.41) and q=0.45 (14.02) are both worse
  • Feature selection is essential — without it, MAE degrades by +2 EUR/MWh due to overfitting
  • Deep trees + decay180 improvements are orthogonal — they stack because they address different problems (leaf specialization vs data recency)

Why Deep Trees Work

With max_depth=8 and min_child_weight=20, peak-hour extreme prices share leaves with moderate prices and get averaged down. Deeper trees (max_depth=12) with smaller leaves (min_child_weight=5) can isolate extreme price conditions (“high demand + low wind + evening + high gas”) into specialized leaves that predict 130+ instead of being averaged down to 90.

The compensating regularization (lower learning rate 0.03 instead of 0.05, higher L2 penalty 0.3 instead of 0.1) prevents the deeper trees from overfitting to training data noise.
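As a sketch, the deep-trees setup maps onto standard XGBoost keyword arguments like this (the quantile objective string assumes XGBoost >= 2.0's built-in pinball loss; the scouts' actual wiring may differ):

```python
def deep_tree_params(quantile=0.55):
    """Deep-trees scout parameters as XGBoost keyword arguments."""
    return {
        "max_depth": 12,         # was 8: extreme-price regimes get their own leaves
        "min_child_weight": 5,   # was 20: permits small, specialized leaves
        "learning_rate": 0.03,   # was 0.05: compensating regularization
        "reg_lambda": 0.3,       # was 0.1: stronger L2 penalty on leaf weights
        "n_estimators": 500,     # unchanged
        "objective": "reg:quantileerror",  # pinball loss (XGBoost >= 2.0)
        "quantile_alpha": quantile,
    }

# These would be passed straight through, e.g.:
# model = xgboost.XGBRegressor(**deep_tree_params())
```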

Complete Leaderboard

All 24 experiments ranked by MAE, with reference points:

| Rank | Experiment | Type | MAE (EUR/MWh) | vs Production |
|------|------------|------|---------------|---------------|
| 1 | deep-trees-d180 | Structural | 12.55 | -13.3% |
| 2 | deep-trees | Structural | 12.99 | -10.2% |
| 3 | deep-trees-mae | Structural | 13.41 | -7.3% |
| 4 | config: decay180 | Config | 13.41 | -7.3% |
| 5 | config: decay-only | Config | 13.74 | -5.0% |
| 6 | config: q065 | Config | 13.79 | -4.7% |
| 7 | config: decay365 | Config | 13.91 | -3.9% |
| 8 | deep-trees-q045 | Structural | 14.02 | -3.1% |
| — | XGBoost baseline (v4.3 config) | Reference | 14.44 | ref |
| — | Production (v4.3 ensemble) | Reference | 14.47 | ref |
| 9–24 | CatBoost, residual, ratio, nofs, etc. | Various | 14.77–17.43 | worse |

Winning Configuration

Model: XGBoost
max_depth: 12 (was 8)
min_child_weight: 5 (was 20, via default)
learning_rate: 0.03 (was 0.05)
reg_lambda: 0.3 (was 0.1)
n_estimators: 500 (unchanged)
quantile: 0.55 (unchanged)
Training config:
winsorize_cap: 200 EUR
sample_weight_halflife: 180 days
feature_selection: enabled

Next Steps

  1. Medium-scope validation: Run the deep-trees config with all 3 model types (HistGB, LightGBM, XGBoost) on the full 150-day DA backtest window
  2. Investigate positive bias: the deep-tree scouts show roughly +3 to +8 positive bias — need to verify this on the longer backtest window
  3. Ensemble with deep trees: Test whether HistGB and LightGBM also benefit from deeper tree parameters
  4. Strategic run mode: Currently only DA validated — strategic predictions are a separate challenge