
v6.2 — Structural Scout Campaign (Scout)

Date: March 18, 2026 | Status: Scout complete — medium-scope validation pending

Background

Following the structural diagnosis that identified gradient boosting tree leaf averaging as the root cause of persistent underprediction, we ran 24 fast scout experiments in a single day to test structural approaches. Each scout used a single model type, day-ahead only, hybrid15 approach, and a 90-day backtest window (Dec 18, 2025 – Mar 18, 2026).

The experiments were organized into two phases:

  • Config scouts (11 experiments): Loss functions, quantile targets, time decay, winsorization, feature additions
  • Structural scouts (13 experiments): Target transforms, CatBoost, tree depth tuning, and combinations

Phase 1: Config Scouts

Testing configuration and parameter changes with default XGBoost (depth=8).

| # | Experiment | MAE (EUR/MWh) | Bias | vs Baseline | Finding |
|---|------------|---------------|------|-------------|---------|
| 1 | decay180 | 13.41 | -10.65 | -7.1% | Best config — winsor 200 + decay 180d halflife |
| 2 | decay-only | 13.74 | +10.01 | -4.8% | Decay without winsor — confirms decay is the lever |
| 3 | q065 | 13.79 | -6.89 | -4.5% | Best bias but worse at low prices |
| 4 | decay365 | 13.91 | -10.49 | -3.7% | Sprint C config, 365d halflife |
| 5 | baseline | 14.44 | -10.80 | ref | v4.3 production config, XGBoost only |
| 6 | fourier23 | 14.44 | +10.80 | 0% | 2nd/3rd Fourier harmonics — zero effect |
| 7 | curtailment | 14.44 | +10.80 | 0% | Renewable curtailment proxy — zero effect |
| 8 | winsor-only | 14.48 | -9.75 | +0.3% | Winsorization alone — no effect |
| 9 | q060 | 14.80 | -9.28 | +2.5% | Quantile 0.60 — worse than 0.55 |
| 10 | sqrt-transform | 15.55 | +12.55 | +7.7% | sqrt(price+10) — compresses range further |
| 11 | rolling18m | 15.72 | +8.62 | +8.9% | 18-month rolling window — loses useful history |

Config Scout Conclusions

  • Time decay (halflife 180d) is the only effective config lever, reducing MAE by 7.1%
  • Winsorization alone has essentially no effect (+0.3% MAE) — crisis-data contamination is not the root cause
  • Additional features (Fourier, curtailment) have no impact — the model already has sufficient features
  • Quantile 0.55 is correct — both 0.60 (worse) and 0.65 (trade-offs) don’t improve overall MAE
  • Config tuning ceiling estimated at ~13 EUR/MWh — breaking through requires structural changes
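The 180-day-halflife decay that drives these gains can be sketched as exponential sample weighting (the function name is illustrative, not the campaign's actual code):

```python
def decay_weights(ages_days, halflife_days=180.0):
    """Exponential time-decay sample weights: a sample loses half its
    weight for every `halflife_days` of age at the training cutoff."""
    return [0.5 ** (age / halflife_days) for age in ages_days]

# A 180-day-old sample gets half the weight of a fresh one,
# and a 360-day-old sample a quarter.
w = decay_weights([0, 180, 360])
```

The weights would then be passed to the booster's `fit(..., sample_weight=w)` so recent market regimes dominate without discarding older history outright (unlike the rolling18m scout, which did worse).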

Phase 2: Structural Scouts

Testing fundamentally different approaches to the prediction task.

Category 1: Target Transforms — Residual from Baseline

Train the model to predict deviation = price - baseline instead of raw EUR, then add the baseline back at prediction time.
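A minimal sketch of that round-trip (helper names are illustrative, not the campaign's actual code):

```python
def to_residual_target(prices, baselines):
    """Training target: deviation from the baseline instead of raw EUR."""
    return [p - b for p, b in zip(prices, baselines)]

def from_residual_pred(residual_preds, baselines):
    """Add the baseline back at prediction time to recover EUR/MWh."""
    return [r + b for r, b in zip(residual_preds, baselines)]
```

The transform is lossless by construction; whether it helps depends entirely on how well the baseline is aligned with the target hour, which is exactly where these scouts failed.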

| Experiment | Baseline | Loss | MAE | Bias | Range |
|------------|----------|------|-----|------|-------|
| resid-168h-mae | price_lag_168h (origin-time) | MAE | 15.84 | +7.94 | 0.720 |
| resid-168h-q055 | price_lag_168h (origin-time) | Q=0.55 | 15.39 | +9.37 | 0.738 |
| resid-4wavg-mae | price_same_weekday_4w_avg | MAE | 17.04 | +8.71 | 0.685 |

Verdict: Failed. All residual experiments performed worse than baseline. The core problem: price_lag_168h is the price at origin_dt - 7 days, not target_dt - 7 days. For day-ahead predictions (targets 14-37 hours ahead), this misalignment means the residual contains within-day price variation that the model must re-learn, adding noise rather than removing it. The 4-week average was even worse because it averages over a longer, more variable window.

A target_baseline_1w feature (price at target hour minus 7 days) was added to the codebase for future testing but wasn’t evaluated in this campaign due to time constraints.
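To make the misalignment concrete, here is a toy illustration (timestamps and prices are invented, and `lag_168h` is a hypothetical helper) of how an origin-aligned lag and a target-aligned baseline pick out different hours:

```python
from datetime import datetime, timedelta

def lag_168h(price_history, anchor_dt):
    """Price exactly 168 hours before the anchor timestamp."""
    return price_history[anchor_dt - timedelta(hours=168)]

origin_dt = datetime(2026, 3, 7, 12)   # forecast issued at noon
target_dt = datetime(2026, 3, 8, 18)   # predicting tomorrow's evening peak, 30 h ahead

# Toy price history: one noon price and one evening-peak price a week back
history = {
    origin_dt - timedelta(hours=168): 60.0,    # noon, a week before origin
    target_dt - timedelta(hours=168): 115.0,   # evening peak, a week before target
}

origin_aligned = lag_168h(history, origin_dt)  # what price_lag_168h sees
target_aligned = lag_168h(history, target_dt)  # what target_baseline_1w would see
```

The origin-aligned lag hands the model a noon price as the "baseline" for an evening-peak target, so the residual still contains the full within-day shape the transform was meant to remove.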

Category 2: CatBoost with Ordered Boosting

CatBoost uses a fundamentally different tree construction algorithm. Its “ordered boosting” prevents target leakage that causes standard GBT to regress toward the mean, and its symmetric trees have different bias properties.
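Assuming a standard `catboost.CatBoostRegressor` setup, the three scouts below would differ only in the loss string; this is a sketch, not the campaign's training code, and the iteration count and depth here are assumptions:

```python
def catboost_scout_params(loss):
    """Parameter dict for one CatBoost scout; only the loss varies.

    Intended for catboost.CatBoostRegressor(**params). The loss strings
    follow CatBoost's "Quantile:alpha=..." convention.
    """
    assert loss in {"MAE", "Quantile:alpha=0.55", "Quantile:alpha=0.45"}
    return {
        "loss_function": loss,
        "boosting_type": "Ordered",  # request ordered boosting explicitly
        "iterations": 500,           # assumption, mirroring the XGBoost setup
        "depth": 8,                  # assumption: match the config-scout depth
        "verbose": False,
    }
```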

| Experiment | Loss | MAE | Bias | Range | Max Pred |
|------------|------|-----|------|-------|----------|
| catboost-q055 | Quantile 0.55 | 14.92 | +9.41 | 0.782 | 137 |
| catboost-mae | MAE | 15.13 | +10.50 | 0.784 | 139 |
| catboost-q045 | Quantile 0.45 | 15.57 | +11.37 | 0.755 | 132 |

Verdict: Promising structurally, not competitive on MAE. CatBoost consistently produced better range ratios (0.78 vs XGBoost’s 0.73) and higher maximum predictions (137-139 vs 127). This confirms ordered boosting does reduce range compression. However, MAE was 3-8% worse than XGBoost baseline, with a persistent strong positive bias (+9 to +11). Lowering the quantile to 0.45 made things worse, not better.

CatBoost may be valuable as an ensemble member (different bias profile from XGBoost), but is not a standalone replacement.

Category 3: Multiplicative Ratio Transform

Train on price / max(baseline, 5.0), producing a scale-invariant ratio centered at ~1.0, then multiply back.
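The forward and inverse transform can be sketched as (helper names are illustrative):

```python
def to_ratio_target(prices, baselines, floor=5.0):
    """Scale-invariant training target centered near 1.0; the floor
    guards against division blow-ups when the baseline is near zero."""
    return [p / max(b, floor) for p, b in zip(prices, baselines)]

def from_ratio_pred(ratio_preds, baselines, floor=5.0):
    """Multiply the predicted ratio back onto the baseline at prediction time."""
    return [r * max(b, floor) for r, b in zip(ratio_preds, baselines)]
```

Note the multiplicative inverse is also why baseline errors are amplified: a 20% error in the baseline scales the whole prediction by 20%, which matches the poor MAE observed below.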

| Experiment | Loss | MAE | Bias | Range | Max Pred |
|------------|------|-----|------|-------|----------|
| ratio-168h-q055 | Q=0.55 | 16.60 | +5.15 | 0.798 | 145 |
| ratio-168h-mae | MAE | 17.43 | +10.37 | 0.764 | 137 |

Verdict: Best range extension but worst MAE. The ratio transform produced the best range ratio (0.798) and highest max prediction (145) of any experiment — the model CAN predict extreme values through multiplication. But overall MAE was 15-20% worse because the origin-time baseline misalignment amplifies errors multiplicatively. This approach deserves retesting with a target-aligned baseline.

Category 4: Deep Trees (The Breakthrough)

Increase XGBoost tree depth from 8 to 12 and reduce minimum leaf size from 20 to 5, with compensating regularization (lower learning rate 0.03, higher L2 penalty 0.3).

| Experiment | Config Additions | MAE | Bias | Range | Feat Sel. |
|------------|------------------|-----|------|-------|-----------|
| deep-trees-d180 | + decay180 + Q=0.55 | 12.55 | +7.83 | 0.738 | Yes |
| deep-trees | Q=0.55 only | 12.99 | +5.41 | 0.739 | Yes |
| deep-trees-mae | MAE loss only | 13.41 | +7.51 | 0.729 | Yes |
| deep-trees-q045 | Q=0.45 only | 14.02 | +7.99 | 0.706 | Yes |
| deep-trees-d180-nofs | + decay180 + Q=0.55 | 14.77 | +5.73 | 0.698 | No |
| deep-trees-nofs | Q=0.55 only | 15.11 | +2.81 | 0.704 | No |

Verdict: Clear winner. Deep trees produce the best MAE ever recorded for the EPF model:

  • MAE 12.55 with decay180 — a 13.3% improvement over production (14.47) and 6.4% better than the previous best config scout (13.41)
  • Deep trees alone (no decay) achieve MAE 12.99 — already 10.2% better than production
  • Quantile 0.55 is optimal — MAE loss (13.41) and q=0.45 (14.02) are both worse
  • Feature selection is essential — without it, MAE degrades by +2 EUR/MWh due to overfitting
  • Deep trees + decay180 improvements are orthogonal — they stack because they address different problems (leaf specialization vs data recency)

Why Deep Trees Work

With max_depth=8 and min_child_weight=20, peak-hour extreme prices share leaves with moderate prices and get averaged down. Deeper trees (max_depth=12) with smaller leaves (min_child_weight=5) can isolate extreme price conditions (“high demand + low wind + evening + high gas”) into specialized leaves that predict 130+ instead of being averaged down to 90.

The compensating regularization (lower learning rate 0.03 instead of 0.05, higher L2 penalty 0.3 instead of 0.1) prevents the deeper trees from overfitting to training data noise.
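As a sketch, the deep-trees setup maps onto standard XGBoost keyword arguments like this (the quantile objective string assumes XGBoost >= 2.0's built-in pinball loss; the scouts' actual wiring may differ):

```python
def deep_tree_params(quantile=0.55):
    """Deep-trees scout parameters as XGBoost keyword arguments."""
    return {
        "max_depth": 12,         # was 8: extreme-price regimes get their own leaves
        "min_child_weight": 5,   # was 20: permits small, specialized leaves
        "learning_rate": 0.03,   # was 0.05: compensating regularization
        "reg_lambda": 0.3,       # was 0.1: stronger L2 penalty on leaf weights
        "n_estimators": 500,     # unchanged
        "objective": "reg:quantileerror",  # pinball loss (XGBoost >= 2.0)
        "quantile_alpha": quantile,
    }

# These would be passed straight through, e.g.:
# model = xgboost.XGBRegressor(**deep_tree_params())
```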

Complete Leaderboard

All 24 experiments ranked by MAE, with reference points:

| Rank | Experiment | Type | MAE (EUR/MWh) | vs Production |
|------|------------|------|---------------|---------------|
| 1 | deep-trees-d180 | Structural | 12.55 | -13.3% |
| 2 | deep-trees | Structural | 12.99 | -10.2% |
| 3 | deep-trees-mae | Structural | 13.41 | -7.3% |
| 4 | config: decay180 | Config | 13.41 | -7.3% |
| 5 | config: decay-only | Config | 13.74 | -5.0% |
| 6 | config: q065 | Config | 13.79 | -4.7% |
| 7 | config: decay365 | Config | 13.91 | -3.9% |
| 8 | deep-trees-q045 | Structural | 14.02 | -3.1% |
| — | XGBoost baseline (v4.3 config) | Reference | 14.44 | ref |
| — | Production (v4.3 ensemble) | Reference | 14.47 | ref |
| 9–24 | CatBoost, residual, ratio, nofs, etc. | Various | 14.77–17.43 | worse |

Winning Configuration

Model: XGBoost
max_depth: 12 (was 8)
min_child_weight: 5 (was 20, via default)
learning_rate: 0.03 (was 0.05)
reg_lambda: 0.3 (was 0.1)
n_estimators: 500 (unchanged)
quantile: 0.55 (unchanged)
Training config:
winsorize_cap: 200 EUR
sample_weight_halflife: 180 days
feature_selection: enabled

Next Steps

  1. Medium-scope validation: Run the deep-trees config with all 3 model types (HistGB, LightGBM, XGBoost) on the full 150-day DA backtest window
  2. Investigate positive bias: the deep-tree scouts show roughly +3 to +8 positive bias — need to verify this on the longer backtest window
  3. Ensemble with deep trees: Test whether HistGB and LightGBM also benefit from deeper tree parameters
  4. Strategic run mode: Currently only DA validated — strategic predictions are a separate challenge