v7.0 — Breaking the Compression Barrier (Scout)
Date: March 19–20, 2026 | Status: Scout — 90-day, DA-only, single model
Why This Experiment Exists
The v6.3 validation campaign established that deep trees (depth=12) improve MAE by 8.8%, but also revealed a hard ceiling: regression slope stuck at 0.70, meaning predictions only capture 70% of actual price variation. The price-level analysis showed exactly where the model fails:
| Price Range | % of Data | MAE | Bias | % Underpredicted |
|---|---|---|---|---|
| < 40 EUR | 39% | 7.5 | +4.5 | 30% (overpredicts) |
| 40–80 EUR | 23% | 9.0 | -6.6 | 79% |
| 80–130 EUR | 34% | 18.4 | -18.2 | 98% |
| 130+ EUR | 3.5% | 45.3 | -45.3 | 100% |
The model overpredicts low prices and severely underpredicts high prices. The 80–130 EUR range (34% of data) alone accounts for ~40% of total error. No amount of tree tuning can fix this — it’s a structural property of leaf-node averaging.
v7 tests fundamentally different approaches to break through this barrier.
Experiment Categories
Category A: Price-Weighted Training
Theory: Standard training treats every data point equally. A 20 EUR error on a 130 EUR peak hour counts the same as a 20 EUR error on a 30 EUR off-peak hour. By giving higher sample weights to expensive hours during training, we force the model to allocate more of its capacity to getting peaks right.
Implementation: When actual_price > threshold during training, that sample’s weight is multiplied by N. The model sees each peak hour as if it appeared N times in the dataset — no data duplication needed, just sample weights passed to XGBoost’s fit().
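The sample-weight mechanism above can be sketched in a few lines. The threshold and multiplier constants mirror the pw-Nx configs; the function name is illustrative, not the pipeline's actual identifier:

```python
import numpy as np

THRESHOLD = 80.0   # EUR; samples above this get upweighted
MULTIPLIER = 3.0   # the "N" in pw-3x

def price_weights(y, threshold=THRESHOLD, multiplier=MULTIPLIER):
    """Per-sample weights: 1.0 below the threshold, `multiplier` above it."""
    y = np.asarray(y, dtype=float)
    return np.where(y > threshold, multiplier, 1.0)

# Usage: XGBoost's sklearn API accepts this directly, no data duplication needed
# model.fit(X_train, y_train, sample_weight=price_weights(y_train))
```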
| Experiment | Threshold | Weight | Decay | MAE | Bias | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|---|---|
| pw-2x | 80 EUR | 2x | none | 13.12 | -4.94 | 19.3 | 73.8% |
| pw-3x | 80 EUR | 3x | none | 14.09 | -3.00 | 18.4 | 74.8% |
| pw-5x | 80 EUR | 5x | none | 13.84 | -2.54 | 17.4 | 76.0% |
| pw-3x-d365 | 80 EUR | 3x | 365d | 12.75 | -6.46 | 20.1 | 69.4% |
Key findings:
- pw-2x is the sweet spot for balanced MAE — enough peak attention without hurting low prices
- pw-5x has the best peak metrics (spike recall 76%, 80-130 MAE 17.4) but sacrifices low-price accuracy
- pw-3x-d365 is the overall winner (MAE 12.75) — combining moderate peak weighting with gentle time decay gets the best of both worlds
- The relationship between weight multiplier and MAE is non-monotonic (in this no-decay sweep, 3x is actually the worst of the three), while peak metrics (spike recall, 80-130 MAE) improve monotonically with higher weights
Category B: Custom Asymmetric Loss
Theory: Instead of weighting samples, modify the loss function itself. The standard quantile loss penalizes over- and under-prediction asymmetrically but uniformly across all prices. An asymmetric loss adds extra penalty specifically for underpredicting high prices (the exact failure mode we see).
Implementation: Custom XGBoost objective function where the gradient for underprediction of prices > 80 EUR is multiplied by a factor N. This makes the model’s gradient descent pay N× more attention to fixing high-price underprediction.
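A minimal sketch of such an objective, following XGBoost's `obj(preds, dtrain) -> (grad, hess)` convention for a squared-error base loss. The defaults mirror asym-3x; the function names are assumptions:

```python
import numpy as np

def asymmetric_grad_hess(preds, labels, threshold=80.0, factor=3.0):
    """Squared-error gradient/hessian with extra penalty for
    UNDERpredicting prices above the threshold."""
    grad = preds - labels              # standard squared-error gradient
    hess = np.ones_like(preds)
    mask = (labels > threshold) & (preds < labels)
    grad[mask] *= factor               # pull harder on underpredicted peaks
    hess[mask] *= factor
    return grad, hess

def asymmetric_objective(preds, dtrain):
    # Pass as xgboost.train(params, dtrain, obj=asymmetric_objective)
    return asymmetric_grad_hess(preds, dtrain.get_label())
```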
| Experiment | Factor | Decay | Status |
|---|---|---|---|
| asym-3x | 3x | none | Failed — custom objective incompatible with backtest pipeline |
| asym-5x | 5x | none | Skipped |
| asym-3x-d365 | 3x | 365d | Skipped |
Status: The custom objective function is constructed correctly and trains in isolation, but fails when run through the full backtest pipeline. The likely cause is that the custom objective (a Python closure) isn’t serializable across the backtest’s model save/load cycle. This approach needs architectural refactoring before it can be evaluated.
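One plausible refactor (an assumption, not the project's chosen fix) is to replace the closure with a module-level callable class. Unlike a closure, such a class pickles cleanly, so it can survive a save/load cycle that relies on pickle:

```python
import pickle
import numpy as np

class AsymmetricObjective:
    """Picklable replacement for a closure-based custom objective."""

    def __init__(self, threshold=80.0, factor=3.0):
        self.threshold = threshold
        self.factor = factor

    def __call__(self, preds, dtrain):
        labels = dtrain.get_label()
        grad = preds - labels
        hess = np.ones_like(preds)
        mask = (labels > self.threshold) & (preds < labels)
        grad[mask] *= self.factor
        hess[mask] *= self.factor
        return grad, hess

# Round-trips through pickle, unlike a lambda or nested function
obj = pickle.loads(pickle.dumps(AsymmetricObjective(factor=5.0)))
```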
Category C: Post-hoc Recalibration
Theory: Instead of fixing the model’s predictions during training, apply a correction afterwards. Train isotonic regression or a linear model on out-of-fold predictions to learn the systematic compression pattern and stretch predictions back to the correct range.
Status: Planned but not yet implemented in v7. Requires collecting OOF predictions during training, which needs pipeline modification.
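A sketch of what the recalibration step could look like once OOF predictions are available, using sklearn's IsotonicRegression. The arrays here are synthetic stand-ins that reproduce the observed slope-0.70 compression; the real ones would come from the backtest pipeline:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
actual = rng.uniform(20, 150, size=500)
# Compressed OOF predictions: slope ~0.70 plus noise, mimicking the tree model
oof_pred = 0.70 * actual + 25 + rng.normal(0, 3, size=500)

# Learn the monotone mapping from compressed predictions back to actuals
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(oof_pred, actual)

# At inference time: stretch new predictions through the learned curve
recalibrated = calibrator.predict(oof_pred)
```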
Category D: Monday Features
Theory: Monday is our worst prediction day (MAE 20+ vs 12–13 for other days). The weekend-to-weekday price jump is poorly predicted because the model lacks explicit Monday awareness. Adding is_monday and monday × hour_of_day interaction features lets the model learn Monday-specific patterns.
Implementation: Two new features added permanently to direct_trainer.py:
- is_monday: binary indicator (1 if the target day is Monday, 0 otherwise)
- monday_hour: interaction term (is_monday × hour_of_day), letting the model learn that Monday hour 8 behaves differently from Tuesday hour 8
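A hedged sketch of how these two features could be computed with pandas; the frame layout and the timestamp column name are assumptions, the real direct_trainer.py may differ:

```python
import pandas as pd

def add_monday_features(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Append is_monday and monday_hour columns derived from a timestamp column."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["is_monday"] = (ts.dt.dayofweek == 0).astype(int)  # Monday == 0 in pandas
    out["monday_hour"] = out["is_monday"] * ts.dt.hour     # interaction term
    return out
```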
| Experiment | Config | MAE | Bias | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|
| monday-feat | Deep trees + Monday features | 12.99 | -5.41 | 19.5 | 68.9% |
| pw-3x-monday | Price weight 3x + Monday features | running | — | — | — |
Key finding: Monday features alone drop MAE from 13.83 to 12.99 (-6.1%). This is a zero-cost improvement — just two cheap features that capture a known market pattern.
Category E: MLP Neural Network
Theory: Tree-based models are fundamentally limited by leaf-node averaging. A neural network (even a simple MLP) can output any continuous value — there’s no structural compression. If the features contain enough signal to predict 140 EUR, an MLP can output 140 EUR directly, while a tree model would average it with other leaf samples down to ~100 EUR.
Implementation: sklearn.MLPRegressor with 3 hidden layers (256, 128, 64), added to the model factory alongside XGBoost/LightGBM/HistGB.
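A sketch of the mlp-baseline configuration. Only the hidden layer sizes come from the text; the remaining hyperparameters, and the scaling step that MLPs generally need (tree models do not), are assumptions:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(
    StandardScaler(),                      # MLPs are scale-sensitive, trees are not
    MLPRegressor(
        hidden_layer_sizes=(256, 128, 64), # 3 hidden layers per the config above
        activation="relu",
        early_stopping=True,               # holds out 10% for validation by default
        max_iter=500,
        random_state=42,
    ),
)

# Would plug into the model factory alongside XGBoost/LightGBM/HistGB:
# mlp.fit(X_train, y_train); preds = mlp.predict(X_test)
```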
| Experiment | Config | Status |
|---|---|---|
| mlp-baseline | Deep features + MLP | Pending — in pipeline |
Why this matters: If MLP shows even modest slope improvement (from 0.70 to 0.80+), it confirms the compression ceiling is model-specific, not data-specific, and opens the door to more sophisticated neural architectures (TFT, N-BEATS).
Category F: Combined Approaches
The most promising experiments combine techniques from multiple categories. Wave 2 focused on fine-tuning the best Wave 1 config (pw-3x-d365) by sweeping thresholds, weights, and decay rates.
Results Summary
Wave 1: Initial Exploration (5 experiments)
Tested price weighting at different multipliers and combinations with decay/Monday features.
| # | Experiment | Categories | MAE | vs Prod | Bias | Slope | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|---|---|---|
| 1 | pw-3x-d365 | A+decay | 12.75 | -11.9% | -6.46 | 0.712 | 20.1 | 69.4% |
| 2 | monday-feat | D | 12.99 | -10.2% | -5.41 | 0.694 | 19.5 | 68.9% |
| 3 | pw-2x | A | 13.12 | -9.3% | -4.94 | 0.707 | 19.3 | 73.8% |
| 4 | pw-5x | A | 13.84 | -4.4% | -2.54 | 0.718 | 17.4 | 76.0% |
| 5 | pw-3x | A | 14.09 | -2.6% | -3.00 | 0.693 | 18.4 | 74.8% |
Key insight: pw-3x-d365 was the clear winner — price weight 3x combined with time decay 365d gets the best of both worlds (overall accuracy from decay + peak attention from weighting).
Wave 2: Fine-Tuning the Winner (6 experiments)
Tested threshold variations (60/80/100 EUR), weight variations (2x/3x/4x/5x with d365), and decay variations (d270/d365).
Threshold Sweep (pw-3x + d365, varying threshold)
| Threshold | MAE | Bias | Spike Rec | Conclusion |
|---|---|---|---|---|
| 60 EUR | 12.69 | -6.48 | 70.7% | NEW BEST — catches 60-80 EUR mid-range underprediction |
| 80 EUR | 12.75 | -6.46 | 69.4% | Original sweet spot |
| 100 EUR | 12.94 | -5.72 | 70.6% | Too restrictive — misses the 80-100 range |
Finding: Lowering the threshold from 80 to 60 EUR improves MAE because the model was also underpredicting the 60-80 EUR range (68% underprediction rate at that level). By including these prices in the upweighted set, the model allocates more capacity to this large segment.
Weight Sweep (threshold 80, d365, varying weight)
| Weight | MAE | Bias | Spike Rec | Conclusion |
|---|---|---|---|---|
| 2x | 13.27 | -6.51 | 61.4% | Too gentle — barely shifts predictions |
| 3x | 12.75 | -6.46 | 69.4% | Optimal — best MAE-peak balance |
| 4x | 13.37 | -6.47 | 72.2% | Overshoot — hurts low prices more than it helps peaks |
| 5x | 13.71 | -5.23 | 70.7% | Too aggressive for MAE |
Finding: The weight-MAE relationship is U-shaped with a clear minimum at 3x. Higher weights improve spike recall (72-76%) but degrade overall MAE because the model sacrifices low-price accuracy. The 3x sweet spot gives enough peak attention without the trade-off.
Decay × Weight Interaction
| Config | No Decay | d365 | Delta |
|---|---|---|---|
| pw-2x | 13.12 | 13.27 | +0.15 (worse) |
| pw-3x | 14.09 | 12.75 | -1.34 (much better) |
| pw-5x | 13.84 | 13.71 | -0.13 (slightly better) |
Finding: The interaction between weight and decay is strongly non-linear. Only pw-3x benefits dramatically from d365 (-1.34 MAE improvement). At 2x, decay actually hurts (the gentle weighting + decay pushes predictions too far toward the compressed center). At 5x, the aggressive weighting overwhelms decay’s contribution. The 3x+d365 combination is uniquely effective because the moderate upweighting creates a balanced gradient landscape that time decay can optimize.
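The two weight sources discussed above might compose into a single sample_weight vector as follows; interpreting d365 as an exponential half-life (rather than some other decay schedule) is an assumption, as are the function and parameter names:

```python
import numpy as np

def combined_weights(y, age_days, threshold=80.0, multiplier=3.0,
                     half_life_days=365.0):
    """Multiply price weighting by exponential time decay into one vector."""
    price_w = np.where(np.asarray(y, dtype=float) > threshold, multiplier, 1.0)
    decay_w = 0.5 ** (np.asarray(age_days, dtype=float) / half_life_days)
    return price_w * decay_w

# A 90 EUR sample from today keeps full 3x weight; a 30 EUR sample from a
# year ago is halved by decay and gets no price boost.
```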
Complete Leaderboard (10 completed v7 experiments; the stale pw-3x-monday run is excluded)
| # | Experiment | MAE | vs Prod | Bias | Slope | 80-130 MAE | Spike Rec |
|---|---|---|---|---|---|---|---|
| 1 | pw-3x-d365-t60 | 12.69 | -12.3% | -6.48 | 0.707 | 20.1 | 70.7% |
| 2 | pw-3x-d365 | 12.75 | -11.9% | -6.46 | 0.712 | 20.1 | 69.4% |
| 3 | pw-3x-d365-t100 | 12.94 | -10.6% | -5.72 | 0.710 | 19.7 | 70.6% |
| 4 | monday-feat | 12.99 | -10.2% | -5.41 | 0.694 | 19.5 | 68.9% |
| 5 | pw-2x | 13.12 | -9.3% | -4.94 | 0.707 | 19.3 | 73.8% |
| 6 | pw-2x-d365 | 13.27 | -8.3% | -6.51 | 0.688 | 21.3 | 61.4% |
| 7 | pw-4x-d365 | 13.37 | -7.6% | -6.47 | 0.716 | 20.1 | 72.2% |
| 8 | pw-5x-d365 | 13.71 | -5.3% | -5.23 | 0.693 | 19.6 | 70.7% |
| 9 | pw-5x | 13.84 | -4.4% | -2.54 | 0.718 | 17.4 | 76.0% |
| 10 | pw-3x | 14.09 | -2.6% | -3.00 | 0.693 | 18.4 | 74.8% |
| — | Production v4.3 | 14.47 | ref | -10.97 | 0.710 | 21.8 | 56.5% |
What Improved vs v6.3
| Metric | v6.3 Best | v7 Best | Improvement |
|---|---|---|---|
| MAE | 13.20 | 12.69 (pw-3x-d365-t60) | -3.9% |
| Bias | -7.03 | -4.94 (pw-2x) | 30% less bias |
| Spike Recall | 56.5% | 76.0% (pw-5x) | +34% relative |
| 80-130 MAE | 21.8 | 17.4 (pw-5x) | -20% |
| Slope | 0.699 | 0.718 (pw-5x) | +2.7% |
What Didn’t Work
- Asymmetric loss — correct theory but custom XGBoost objective functions aren’t serializable across the backtest pipeline’s model save/load cycle. Needs architecture refactoring.
- Price weight 2x + d365 — the gentle 2x weighting combined with decay pushed predictions too far toward the compressed center (MAE 13.27 vs 13.12 without decay)
- Price weight 4x/5x + d365 — too aggressive, hurts low-price accuracy more than it helps peaks
- Monday features + price weighting — stale result due to feature cache bug (identical to pw-3x). Monday features work alone (MAE 12.99) but need proper testing with price weighting.
- Threshold 100 EUR — too restrictive, misses the 80-100 EUR range where significant underprediction occurs
Theoretical Conclusions
Why the ~12.69 floor exists
Even with price weighting and deep trees, the model’s regression slope stays at 0.70-0.72. This means for every 10 EUR of actual price variation, the model only predicts 7 EUR of variation. The compression is fundamental to how gradient boosted trees make predictions:
- Leaf averaging: Each tree leaf contains multiple training samples. The prediction is their weighted average, which always pulls toward the center.
- Gradient boosting correction: Each new tree corrects the previous ensemble’s errors, but the correction is bounded by the learning rate (0.03) and regularization (lambda=0.3).
- Feature limitations: The model’s features describe the conditions for high prices but can’t predict the magnitude. Knowing “it’s a cold Monday morning in December” tells the model prices will be high, but not whether they’ll be 100 or 150 EUR.
Price weighting shifts where the model allocates its limited capacity (toward peaks instead of uniform) but doesn’t increase the total capacity. The slope ceiling at 0.71-0.72 is structural.
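The leaf-averaging point above can be demonstrated with a toy depth-1 tree: the extreme 150 EUR sample is predicted as the mean of whatever samples share its leaf, which pulls it well below its actual value.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Six cheap hours and four expensive ones, including a 150 EUR peak
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([30, 31, 29, 32, 30, 31, 85, 95, 130, 150], dtype=float)

# One split, two leaves: the best split isolates the cheap hours, so the
# peak lands in a leaf with 85, 95, and 130 and is predicted as their mean
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
pred_for_peak = tree.predict([[9.0]])[0]  # actual value is 150
```

Boosting with more, deeper trees narrows the leaves, but each leaf still averages its members, which is why the slope ceiling persists across tree configurations.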
What we learned about price weighting
The v7 campaign revealed that price weighting is highly sensitive to the specific combination of weight multiplier, threshold, and time decay. The interaction is non-linear:
- Weight 3x is uniquely optimal — both lower (2x) and higher (4x, 5x) weights are worse
- Threshold 60 EUR is slightly better than 80 EUR — the 60-80 EUR range also suffers from underprediction
- Decay 365d helps 3x weighting dramatically (-1.34 MAE) but barely helps 5x (-0.13) and hurts 2x (+0.15)
- The best config (pw-3x-d365-t60) achieves -12.3% vs production while maintaining 70.7% spike recall
Paths forward
- Neural networks (Category E) — the only approach that can break the slope ceiling. MLP experiment still pending due to pipeline bugs.
- Post-hoc recalibration (Category C) — learn the compression curve and stretch predictions. Doesn’t fix the model but corrects its output.
- Ensemble with diversity — combine a compressed tree model with an uncompressed MLP. The tree provides stability, the MLP provides range.
- Full experiment (150-day, all models, both run modes) — promote pw-3x-d365-t60 to a full validation before production deployment.
- Monday features + price weighting — proper rerun needed after cache bug fix. Could potentially combine the -6.1% Monday improvement with the -12.3% price weighting improvement.
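The diversity ensemble idea above could start as simple as a convex blend of the two prediction vectors; the function name is hypothetical and alpha would be tuned on validation data:

```python
import numpy as np

def blend(tree_pred, mlp_pred, alpha=0.5):
    """Convex combination: tree stability at alpha->1, MLP range at alpha->0."""
    return alpha * np.asarray(tree_pred, dtype=float) + \
           (1.0 - alpha) * np.asarray(mlp_pred, dtype=float)
```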