v7.0 — Breaking the Compression Barrier (Scout)
Date: March 19–20, 2026 | Status: Scout — 90-day, DA-only, single model
Why This Experiment Exists
The v6.3 validation campaign established that deep trees (depth=12) improve MAE by 8.8%, but also revealed a hard ceiling: regression slope stuck at 0.70, meaning predictions only capture 70% of actual price variation. The price-level analysis showed exactly where the model fails:
| Price Range | % of Data | MAE | Bias | % Underpredicted |
|---|---|---|---|---|
| < 40 EUR | 39% | 7.5 | +4.5 | 30% (overpredicts) |
| 40–80 EUR | 23% | 9.0 | -6.6 | 79% |
| 80–130 EUR | 34% | 18.4 | -18.2 | 98% |
| 130+ EUR | 3.5% | 45.3 | -45.3 | 100% |
The model overpredicts low prices and severely underpredicts high prices. The 80–130 EUR range (34% of data) alone accounts for ~40% of total error. No amount of tree tuning can fix this — it’s a structural property of leaf-node averaging.
v7 tests fundamentally different approaches to break through this barrier.
Experiment Categories
Category A: Price-Weighted Training
Theory: Standard training treats every data point equally. A 20 EUR error on a 130 EUR peak hour counts the same as a 20 EUR error on a 30 EUR off-peak hour. By giving higher sample weights to expensive hours during training, we force the model to allocate more of its capacity to getting peaks right.
Implementation: When actual_price > threshold during training, that sample’s weight is multiplied by N. The model sees each peak hour as if it appeared N times in the dataset — no data duplication needed, just sample weights passed to XGBoost’s fit().
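The sample-weight mechanism above can be sketched in a few lines. The threshold and multiplier constants mirror the pw-Nx configs; the function name is illustrative, not the pipeline's actual identifier:

```python
import numpy as np

THRESHOLD = 80.0   # EUR; samples above this get upweighted
MULTIPLIER = 3.0   # the "N" in pw-3x

def price_weights(y, threshold=THRESHOLD, multiplier=MULTIPLIER):
    """Per-sample weights: 1.0 below the threshold, `multiplier` above it."""
    y = np.asarray(y, dtype=float)
    return np.where(y > threshold, multiplier, 1.0)

# Usage: XGBoost's sklearn API accepts this directly, no data duplication needed
# model.fit(X_train, y_train, sample_weight=price_weights(y_train))
```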
| Experiment | Threshold | Weight | Decay | MAE | Bias | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|---|---|
| pw-2x | 80 EUR | 2x | none | 13.12 | -4.94 | 19.3 | 73.8% |
| pw-3x | 80 EUR | 3x | none | 14.09 | -3.00 | 18.4 | 74.8% |
| pw-5x | 80 EUR | 5x | none | 13.84 | -2.54 | 17.4 | 76.0% |
| pw-3x-d365 | 80 EUR | 3x | 365d | 12.75 | -6.46 | 20.1 | 69.4% |
Key findings:
- pw-2x is the sweet spot for balanced MAE — enough peak attention without hurting low prices
- pw-5x has the best peak metrics (spike recall 76%, 80-130 MAE 17.4) but sacrifices low-price accuracy
- pw-3x-d365 is the overall winner (MAE 12.75) — combining moderate peak weighting with gentle time decay gets the best of both worlds
- The relationship between weight multiplier and MAE is non-monotonic (in this no-decay sweep, 3x is actually the worst of the three), while peak metrics (spike recall, 80-130 MAE) improve monotonically with higher weights
Category B: Custom Asymmetric Loss
Theory: Instead of weighting samples, modify the loss function itself. The standard quantile loss penalizes over- and under-prediction asymmetrically but uniformly across all prices. An asymmetric loss adds extra penalty specifically for underpredicting high prices (the exact failure mode we see).
Implementation: Custom XGBoost objective function where the gradient for underprediction of prices > 80 EUR is multiplied by a factor N. This makes the model’s gradient descent pay N× more attention to fixing high-price underprediction.
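A minimal sketch of such an objective, following XGBoost's `obj(preds, dtrain) -> (grad, hess)` convention for a squared-error base loss. The defaults mirror asym-3x; the function names are assumptions:

```python
import numpy as np

def asymmetric_grad_hess(preds, labels, threshold=80.0, factor=3.0):
    """Squared-error gradient/hessian with extra penalty for
    UNDERpredicting prices above the threshold."""
    grad = preds - labels              # standard squared-error gradient
    hess = np.ones_like(preds)
    mask = (labels > threshold) & (preds < labels)
    grad[mask] *= factor               # pull harder on underpredicted peaks
    hess[mask] *= factor
    return grad, hess

def asymmetric_objective(preds, dtrain):
    # Pass as xgboost.train(params, dtrain, obj=asymmetric_objective)
    return asymmetric_grad_hess(preds, dtrain.get_label())
```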
| Experiment | Factor | Decay | Status |
|---|---|---|---|
| asym-3x | 3x | none | Failed — custom objective incompatible with backtest pipeline |
| asym-5x | 5x | none | Skipped |
| asym-3x-d365 | 3x | 365d | Skipped |
Status: The custom objective function is constructed correctly and trains in isolation, but fails when run through the full backtest pipeline. The likely cause is that the custom objective (a Python closure) isn’t serializable across the backtest’s model save/load cycle. This approach needs architectural refactoring before it can be evaluated.
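One plausible refactor (an assumption, not the project's chosen fix) is to replace the closure with a module-level callable class. Unlike a closure, such a class pickles cleanly, so it can survive a save/load cycle that relies on pickle:

```python
import pickle
import numpy as np

class AsymmetricObjective:
    """Picklable replacement for a closure-based custom objective."""

    def __init__(self, threshold=80.0, factor=3.0):
        self.threshold = threshold
        self.factor = factor

    def __call__(self, preds, dtrain):
        labels = dtrain.get_label()
        grad = preds - labels
        hess = np.ones_like(preds)
        mask = (labels > self.threshold) & (preds < labels)
        grad[mask] *= self.factor
        hess[mask] *= self.factor
        return grad, hess

# Round-trips through pickle, unlike a lambda or nested function
obj = pickle.loads(pickle.dumps(AsymmetricObjective(factor=5.0)))
```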
Category C: Post-hoc Recalibration
Theory: Instead of fixing the model’s predictions during training, apply a correction afterwards. Train isotonic regression or a linear model on out-of-fold predictions to learn the systematic compression pattern and stretch predictions back to the correct range.
Status: Planned but not yet implemented in v7. Requires collecting OOF predictions during training, which needs pipeline modification.
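A sketch of what the recalibration step could look like once OOF predictions are available, using sklearn's IsotonicRegression. The arrays here are synthetic stand-ins that reproduce the observed slope-0.70 compression; the real ones would come from the backtest pipeline:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
actual = rng.uniform(20, 150, size=500)
# Compressed OOF predictions: slope ~0.70 plus noise, mimicking the tree model
oof_pred = 0.70 * actual + 25 + rng.normal(0, 3, size=500)

# Learn the monotone mapping from compressed predictions back to actuals
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(oof_pred, actual)

# At inference time: stretch new predictions through the learned curve
recalibrated = calibrator.predict(oof_pred)
```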
Category D: Monday Features
Theory: Monday is our worst prediction day (MAE 20+ vs 12–13 for other days). The weekend-to-weekday price jump is poorly predicted because the model lacks explicit Monday awareness. Adding is_monday and monday × hour_of_day interaction features lets the model learn Monday-specific patterns.
Implementation: Two new features added permanently to direct_trainer.py:
- is_monday: binary indicator (1 if the target day is Monday, 0 otherwise)
- monday_hour: interaction term (is_monday × hour_of_day), letting the model learn that Monday hour 8 behaves differently from Tuesday hour 8
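A hedged sketch of how these two features could be computed with pandas; the frame layout and the timestamp column name are assumptions, the real direct_trainer.py may differ:

```python
import pandas as pd

def add_monday_features(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Append is_monday and monday_hour columns derived from a timestamp column."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["is_monday"] = (ts.dt.dayofweek == 0).astype(int)  # Monday == 0 in pandas
    out["monday_hour"] = out["is_monday"] * ts.dt.hour     # interaction term
    return out
```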
| Experiment | Config | MAE | Bias | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|
| monday-feat | Deep trees + Monday features | 12.99 | -5.41 | 19.5 | 68.9% |
| pw-3x-monday | Price weight 3x + Monday features | running | — | — | — |
Key finding: Monday features alone drop MAE from 13.83 to 12.99 (-6.1%). This is a zero-cost improvement — just two cheap features that capture a known market pattern.
Category E: MLP Neural Network
Theory: Tree-based models are fundamentally limited by leaf-node averaging. A neural network (even a simple MLP) can output any continuous value — there’s no structural compression. If the features contain enough signal to predict 140 EUR, an MLP can output 140 EUR directly, while a tree model would average it with other leaf samples down to ~100 EUR.
Implementation: sklearn.MLPRegressor with 3 hidden layers (256, 128, 64), added to the model factory alongside XGBoost/LightGBM/HistGB.
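A sketch of the mlp-baseline configuration. Only the hidden layer sizes come from the text; the remaining hyperparameters, and the scaling step that MLPs generally need (tree models do not), are assumptions:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(
    StandardScaler(),                      # MLPs are scale-sensitive, trees are not
    MLPRegressor(
        hidden_layer_sizes=(256, 128, 64), # 3 hidden layers per the config above
        activation="relu",
        early_stopping=True,               # holds out 10% for validation by default
        max_iter=500,
        random_state=42,
    ),
)

# Would plug into the model factory alongside XGBoost/LightGBM/HistGB:
# mlp.fit(X_train, y_train); preds = mlp.predict(X_test)
```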
| Experiment | Config | Status |
|---|---|---|
| mlp-baseline | Deep features + MLP | Pending — in pipeline |
Why this matters: If MLP shows even modest slope improvement (from 0.70 to 0.80+), it confirms the compression ceiling is model-specific, not data-specific, and opens the door to more sophisticated neural architectures (TFT, N-BEATS).
Category F: Combined Approaches
The most promising experiments combine techniques from multiple categories. Wave 2 focused on fine-tuning the best Wave 1 config (pw-3x-d365) by sweeping thresholds, weights, and decay rates.
Results Summary
Wave 1: Initial Exploration (5 experiments)
Tested price weighting at different multipliers and combinations with decay/Monday features.
| # | Experiment | Categories | MAE | vs Prod | Bias | Slope | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|---|---|---|
| 1 | pw-3x-d365 | A+decay | 12.75 | -11.9% | -6.46 | 0.712 | 20.1 | 69.4% |
| 2 | monday-feat | D | 12.99 | -10.2% | -5.41 | 0.694 | 19.5 | 68.9% |
| 3 | pw-2x | A | 13.12 | -9.3% | -4.94 | 0.707 | 19.3 | 73.8% |
| 4 | pw-5x | A | 13.84 | -4.4% | -2.54 | 0.718 | 17.4 | 76.0% |
| 5 | pw-3x | A | 14.09 | -2.6% | -3.00 | 0.693 | 18.4 | 74.8% |
Key insight: pw-3x-d365 was the clear winner — price weight 3x combined with time decay 365d gets the best of both worlds (overall accuracy from decay + peak attention from weighting).
Wave 2: Fine-Tuning the Winner (6 experiments)
Tested threshold variations (60/80/100 EUR), weight variations (2x/3x/4x/5x with d365), and decay variations (d270/d365).
Threshold Sweep (pw-3x + d365, varying threshold)
| Threshold | MAE | Bias | Spike Rec | Conclusion |
|---|---|---|---|---|
| 60 EUR | 12.69 | -6.48 | 70.7% | NEW BEST — catches 60-80 EUR mid-range underprediction |
| 80 EUR | 12.75 | -6.46 | 69.4% | Original sweet spot |
| 100 EUR | 12.94 | -5.72 | 70.6% | Too restrictive — misses the 80-100 range |
Finding: Lowering the threshold from 80 to 60 EUR improves MAE because the model was also underpredicting the 60-80 EUR range (68% underprediction rate at that level). By including these prices in the upweighted set, the model allocates more capacity to this large segment.
Weight Sweep (threshold 80, d365, varying weight)
| Weight | MAE | Bias | Spike Rec | Conclusion |
|---|---|---|---|---|
| 2x | 13.27 | -6.51 | 61.4% | Too gentle — barely shifts predictions |
| 3x | 12.75 | -6.46 | 69.4% | Optimal — best MAE-peak balance |
| 4x | 13.37 | -6.47 | 72.2% | Overshoot — hurts low prices more than it helps peaks |
| 5x | 13.71 | -5.23 | 70.7% | Too aggressive for MAE |
Finding: The weight-MAE relationship is U-shaped with a clear minimum at 3x. Higher weights improve spike recall (72-76%) but degrade overall MAE because the model sacrifices low-price accuracy. The 3x sweet spot gives enough peak attention without the trade-off.
Decay × Weight Interaction
| Config | No Decay | d365 | Delta |
|---|---|---|---|
| pw-2x | 13.12 | 13.27 | +0.15 (worse) |
| pw-3x | 14.09 | 12.75 | -1.34 (much better) |
| pw-5x | 13.84 | 13.71 | -0.13 (slightly better) |
Finding: The interaction between weight and decay is strongly non-linear. Only pw-3x benefits dramatically from d365 (-1.34 MAE improvement). At 2x, decay actually hurts (the gentle weighting + decay pushes predictions too far toward the compressed center). At 5x, the aggressive weighting overwhelms decay’s contribution. The 3x+d365 combination is uniquely effective because the moderate upweighting creates a balanced gradient landscape that time decay can optimize.
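The two weight sources discussed above might compose into a single sample_weight vector as follows; interpreting d365 as an exponential half-life (rather than some other decay schedule) is an assumption, as are the function and parameter names:

```python
import numpy as np

def combined_weights(y, age_days, threshold=80.0, multiplier=3.0,
                     half_life_days=365.0):
    """Multiply price weighting by exponential time decay into one vector."""
    price_w = np.where(np.asarray(y, dtype=float) > threshold, multiplier, 1.0)
    decay_w = 0.5 ** (np.asarray(age_days, dtype=float) / half_life_days)
    return price_w * decay_w

# A 90 EUR sample from today keeps full 3x weight; a 30 EUR sample from a
# year ago is halved by decay and gets no price boost.
```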
Complete Leaderboard (10 completed v7 experiments; the stale pw-3x-monday run is excluded)
| # | Experiment | MAE | vs Prod | Bias | Slope | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|---|---|
| 1 | pw-3x-d365-t60 | 12.69 | -12.3% | -6.48 | 0.707 | 20.1 | 70.7% |
| 2 | pw-3x-d365 | 12.75 | -11.9% | -6.46 | 0.712 | 20.1 | 69.4% |
| 3 | pw-3x-d365-t100 | 12.94 | -10.6% | -5.72 | 0.710 | 19.7 | 70.6% |
| 4 | monday-feat | 12.99 | -10.2% | -5.41 | 0.694 | 19.5 | 68.9% |
| 5 | pw-2x | 13.12 | -9.3% | -4.94 | 0.707 | 19.3 | 73.8% |
| 6 | pw-2x-d365 | 13.27 | -8.3% | -6.51 | 0.688 | 21.3 | 61.4% |
| 7 | pw-4x-d365 | 13.37 | -7.6% | -6.47 | 0.716 | 20.1 | 72.2% |
| 8 | pw-5x-d365 | 13.71 | -5.3% | -5.23 | 0.693 | 19.6 | 70.7% |
| 9 | pw-5x | 13.84 | -4.4% | -2.54 | 0.718 | 17.4 | 76.0% |
| 10 | pw-3x | 14.09 | -2.6% | -3.00 | 0.693 | 18.4 | 74.8% |
| — | Production v4.3 | 14.47 | ref | -10.97 | 0.710 | 21.8 | 56.5% |
What Improved vs v6.3
| Metric | v6.3 Best | v7 Best | Improvement |
|---|---|---|---|
| MAE | 13.20 | 12.69 (pw-3x-d365-t60) | -3.9% |
| Bias | -7.03 | -4.94 (pw-2x) | 30% less bias |
| Spike Recall | 56.5% | 76.0% (pw-5x) | +34% relative |
| 80-130 MAE | 21.8 | 17.4 (pw-5x) | -20% |
| Slope | 0.699 | 0.718 (pw-5x) | +2.7% |
What Didn’t Work
- Asymmetric loss — correct theory but custom XGBoost objective functions aren’t serializable across the backtest pipeline’s model save/load cycle. Needs architecture refactoring.
- Price weight 2x + d365 — the gentle 2x weighting combined with decay pushed predictions too far toward the compressed center (MAE 13.27 vs 13.12 without decay)
- Price weight 4x/5x + d365 — too aggressive, hurts low-price accuracy more than it helps peaks
- Monday features + price weighting — stale result due to feature cache bug (identical to pw-3x). Monday features work alone (MAE 12.99) but need proper testing with price weighting.
- Threshold 100 EUR — too restrictive, misses the 80-100 EUR range where significant underprediction occurs
Theoretical Conclusions
Why the ~12.69 floor exists
Even with price weighting and deep trees, the model’s regression slope stays at 0.70-0.72. This means for every 10 EUR of actual price variation, the model only predicts 7 EUR of variation. The compression is fundamental to how gradient boosted trees make predictions:
- Leaf averaging: Each tree leaf contains multiple training samples. The prediction is their weighted average, which always pulls toward the center.
- Gradient boosting correction: Each new tree corrects the previous ensemble’s errors, but the correction is bounded by the learning rate (0.03) and regularization (lambda=0.3).
- Feature limitations: The model’s features describe the conditions for high prices but can’t predict the magnitude. Knowing “it’s a cold Monday morning in December” tells the model prices will be high, but not whether they’ll be 100 or 150 EUR.
Price weighting shifts where the model allocates its limited capacity (toward peaks instead of uniform) but doesn’t increase the total capacity. The slope ceiling at 0.71-0.72 is structural.
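The leaf-averaging point above can be demonstrated with a toy depth-1 tree: the extreme 150 EUR sample is predicted as the mean of whatever samples share its leaf, which pulls it well below its actual value.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Six cheap hours and four expensive ones, including a 150 EUR peak
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([30, 31, 29, 32, 30, 31, 85, 95, 130, 150], dtype=float)

# One split, two leaves: the best split isolates the cheap hours, so the
# peak lands in a leaf with 85, 95, and 130 and is predicted as their mean
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
pred_for_peak = tree.predict([[9.0]])[0]  # actual value is 150
```

Boosting with more, deeper trees narrows the leaves, but each leaf still averages its members, which is why the slope ceiling persists across tree configurations.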
What we learned about price weighting
The v7 campaign revealed that price weighting is highly sensitive to the specific combination of weight multiplier, threshold, and time decay. The interaction is non-linear:
- Weight 3x is uniquely optimal — both lower (2x) and higher (4x, 5x) weights are worse
- Threshold 60 EUR is slightly better than 80 EUR — the 60-80 EUR range also suffers from underprediction
- Decay 365d helps 3x weighting dramatically (-1.34 MAE) but barely helps 5x (-0.13) and hurts 2x (+0.15)
- The best config (pw-3x-d365-t60) achieves -12.3% vs production while maintaining 70.7% spike recall
Paths forward
- Neural networks (Category E) — the only approach that can break the slope ceiling. MLP experiment still pending due to pipeline bugs.
- Post-hoc recalibration (Category C) — learn the compression curve and stretch predictions. Doesn’t fix the model but corrects its output.
- Ensemble with diversity — combine a compressed tree model with an uncompressed MLP. The tree provides stability, the MLP provides range.
- Full experiment (150-day, all models, both run modes) — promote pw-3x-d365-t60 to a full validation before production deployment.
- Monday features + price weighting — proper rerun needed after cache bug fix. Could potentially combine the -6.1% Monday improvement with the -12.3% price weighting improvement.
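The diversity ensemble idea above could start as simple as a convex blend of the two prediction vectors; the function name is hypothetical and alpha would be tuned on validation data:

```python
import numpy as np

def blend(tree_pred, mlp_pred, alpha=0.5):
    """Convex combination: tree stability at alpha->1, MLP range at alpha->0."""
    return alpha * np.asarray(tree_pred, dtype=float) + \
           (1.0 - alpha) * np.asarray(mlp_pred, dtype=float)
```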