v10.2 — Feature Re-selection + Residual Baseline Variants

Date: March 22, 2026 | Status: Scouted — No Improvement. v10.1 task-aligned confirmed as promotion candidate.

Background

v10.1 identified the task-aligned LSTM encoder as the best model (MAE 15.73, bias −0.65, spike recall 24.1%) but noted a 1.32 MAE gap vs the generic-res1w model (14.41). The gap is concentrated in “boring” base-load hours where the generic encoder’s simpler representation is better calibrated.

v10.2 tests two hypotheses for closing this gap:

H1: Feature re-selection with LSTM included — The v4.3 feature selection ran without LSTM features. With 64 LSTM embedding dims added, some tabular features may now be redundant. Re-running selection should remove noise and reduce base-hour variance.
H2: Smoother residual baseline — residual_1w uses a single price point from 7 days ago. A 4-week average or exponentially-weighted mean should provide a smoother, less noisy baseline, helping the model predict stable base-load hours.

Results (150-day Window 2025-10-24 to 2026-03-21)

Experiment	MAE	Bias	Slope	MaxPred	SpkRec	BasMAE	PkMAE
v10.1-ref (task-aligned)	15.73	−0.65	0.669	209	24.1%	14.03	16.95
v102-ta-fs (H1: feature select)	16.90	−0.82	0.639	209	18.7%	16.34	17.30
v102-ta-4wewm (H2b: EWM 4w)	18.05	+3.87	0.592	175	14.6%	16.80	18.94
v102-ta-4w (H2a: 4-week mean)	18.18	+3.19	0.598	190	18.9%	16.20	19.58

BasMAE = MAE over off-peak hours (00:00–07:00, 22:00–23:00). PkMAE = MAE over peak hours (08:00–21:00). SpkRec = spike recall.
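As a minimal sketch of how the base/peak split in the table note could be computed (this is not the project's evaluation code): the row layout `(hour_of_day, actual, predicted)` and the function names are assumptions made for illustration, and bias is taken as mean(predicted − actual), which matches the report's use of positive bias for overprediction.

```python
from statistics import mean

PEAK_HOURS = set(range(8, 22))                 # 08:00-21:00, per the table note
OFF_PEAK_HOURS = set(range(24)) - PEAK_HOURS   # 00:00-07:00 and 22:00-23:00


def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return mean(abs(p - a) for a, p in pairs)


def split_metrics(rows):
    """rows: iterable of (hour_of_day, actual, predicted) tuples (assumed layout)."""
    rows = list(rows)
    all_pairs = [(a, p) for _, a, p in rows]
    base = [(a, p) for h, a, p in rows if h in OFF_PEAK_HOURS]
    peak = [(a, p) for h, a, p in rows if h in PEAK_HOURS]
    return {
        "MAE": mae(all_pairs),
        "Bias": mean(p - a for a, p in all_pairs),  # > 0 means overprediction
        "BasMAE": mae(base),
        "PkMAE": mae(peak),
    }


# Toy rows, invented purely to exercise the function.
toy = [(3, 50.0, 54.0), (12, 80.0, 70.0), (19, 120.0, 126.0), (23, 60.0, 58.0)]
metrics = split_metrics(toy)
print(metrics)
```

With 10 off-peak and 14 peak hours per day, the overall MAE is the hour-weighted mix of the two columns. The reference row is consistent with that: (10 × 14.03 + 14 × 16.95) / 24 ≈ 15.73.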

All three experiments are worse than v10.1 on every metric, except that v102-ta-fs ties it on MaxPred (209).

Finding 1: Feature Selection Destroys LSTM Signal

H1 reduced the feature set from 171 to ~76 (DA1 group). The ~95 features removed included ~20 individual LSTM embedding dimensions that permutation tests had flagged as low-importance.

Results vs v10.1: MAE +1.17, bias reverted from −0.65 to −0.82 (back to v7.0 level), spike recall −5.4pp (24.1% → 18.7%), slope −0.030. Both base-hour and peak-hour MAE worsened.

Why it failed: The 64 LSTM embedding dimensions encode a dense latent representation of price trajectory. No single dimension is independently powerful — they collectively describe the market regime, momentum, and shape. Permutation importance evaluates each feature by shuffling it individually while keeping others fixed, which measures isolated importance. For a correlated latent space, isolated importance ≠ collective importance. The “weak” dims that were pruned carry distributional information visible only in context of the full embedding.
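The gap between isolated and collective importance can be reproduced on a toy problem. The sketch below is not the v10.2 pipeline: it assumes a fixed averaging function as the "model", 16 synthetic correlated dimensions instead of the 64 LSTM dims, and MSE increase as the importance score. It compares shuffling one dimension at a time against shuffling the whole block together.

```python
import random
from statistics import mean

random.seed(0)
N, K = 2000, 16  # samples and toy embedding dims (the report uses 64)

# Each dim is a noisy view of one shared latent signal z; the target is z.
z = [random.gauss(0, 1) for _ in range(N)]
X = [[zi + random.gauss(0, 1) for _ in range(K)] for zi in z]
y = z


def model(row):
    """Stand-in model (assumption): average the dims to denoise the latent."""
    return sum(row) / len(row)


def mse(rows):
    return mean((model(r) - t) ** 2 for r, t in zip(rows, y))


base = mse(X)


def importance_single(k):
    """MSE increase when only dimension k is shuffled across samples."""
    col = [r[k] for r in X]
    random.shuffle(col)
    shuffled = [r[:k] + [c] + r[k + 1:] for r, c in zip(X, col)]
    return mse(shuffled) - base


def importance_block():
    """MSE increase when all dims are shuffled together (whole rows permuted)."""
    idx = list(range(N))
    random.shuffle(idx)
    return mse([X[i] for i in idx]) - base


single = [importance_single(k) for k in range(K)]
block = importance_block()
print(f"max single-dim importance:     {max(single):.4f}")
print(f"sum of single-dim importances: {sum(single):.4f}")
print(f"block importance:              {block:.4f}")
```

In this setup every dimension looks dispensable on its own, because the remaining dims still carry the latent signal, while the block as a whole is what the prediction depends on. Scoring the embedding as one grouped feature is one way to avoid pruning it piecemeal.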

This is analogous to compressing a JPEG to 40% quality by deleting the “least important” DCT coefficients individually: each one seems dispensable, but together they define the image's sharpness.

Rule established: LSTM embedding dims must be kept as an atomic 64-dim block. Feature selection should only run on the tabular features, with all LSTM dims forced-included.
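One way the forced-include rule could look in code, as a sketch only: the `lstm_emb_` column prefix, the tabular feature names, and the importance scores below are all hypothetical placeholders, not values from the v10.2 run.

```python
def select_with_forced_block(importance, forced_prefix="lstm_emb_", top_n=3):
    """Rank only non-LSTM features; always append the full LSTM block.

    importance: dict mapping feature name -> importance score.
    forced_prefix: hypothetical naming convention for LSTM embedding columns.
    """
    forced = sorted(f for f in importance if f.startswith(forced_prefix))
    candidates = [f for f in importance if not f.startswith(forced_prefix)]
    ranked = sorted(candidates, key=importance.get, reverse=True)
    return ranked[:top_n] + forced


# Made-up scores: five tabular features plus 64 low-scoring embedding dims.
importance = {
    "load_forecast": 0.31,
    "wind_forecast": 0.22,
    "gas_price": 0.18,
    "hour_sin": 0.04,
    "holiday_flag": 0.01,
}
importance.update({f"lstm_emb_{i:02d}": 0.001 * (i % 5) for i in range(64)})

selected = select_with_forced_block(importance, top_n=3)
print(len(selected), selected[:3])
```

The embedding dims never enter the ranking, so a low individual score cannot remove any of them.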

Finding 2: residual_1w is Optimal for Regime-Switching Markets

H2a (4-week mean) and H2b (EWM with w1=0.4, w2=0.3, w3=0.2, w4=0.1) both caused strong positive bias (+3.19 and +3.87 EUR respectively) and higher MAE.
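For reference, the three baselines can be written down as follows. This is a sketch under stated assumptions: one daily series per delivery hour, lags at exactly 7, 14, 21 and 28 days, and a residual defined as price minus baseline (the report does not spell out the sign convention). The function names are illustrative. Note that the listed weights sum to 1 and decay linearly.

```python
WEEK = 7                          # days; one daily series per delivery hour is assumed
WEIGHTS = (0.4, 0.3, 0.2, 0.1)    # w1..w4 from the text, most recent week first


def lagged(prices, d):
    """Prices 1, 2, 3 and 4 weeks before day index d, most recent first."""
    return [prices[d - k * WEEK] for k in range(1, 5)]


def baseline_1w(prices, d):
    """residual_1w: the single price from 7 days earlier."""
    return prices[d - WEEK]


def baseline_4w_mean(prices, d):
    """H2a: plain mean of the four weekly lags."""
    return sum(lagged(prices, d)) / 4


def baseline_4w_weighted(prices, d):
    """H2b: weighted mean of the four weekly lags."""
    return sum(w * p for w, p in zip(WEIGHTS, lagged(prices, d)))


def to_residual(price, baseline):
    """Training target under the assumed convention: price minus baseline."""
    return price - baseline


def from_residual(pred_residual, baseline):
    """Reconstruct a price forecast from a predicted residual."""
    return baseline + pred_residual


# Toy series falling by 10 each week: 100, 90, 80, 70, then 60 on day 28.
prices = [100.0 - 10 * (d // 7) for d in range(29)]
d = 28
for fn in (baseline_1w, baseline_4w_mean, baseline_4w_weighted):
    b = fn(prices, d)
    print(fn.__name__, round(b, 2), round(to_residual(prices[d], b), 2))
```

On a falling series, the longer the lookback, the higher the baseline sits above the current price, and the larger the correction the model has to supply.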

Why it failed: The evaluation window covers a sharp regime transition: extreme price volatility in Oct–Nov 2025 (170–247 EUR) followed by normal winter prices from Dec 2025 to Mar 2026 (~50–80 EUR). A 4-week mean baseline computed for December dates still includes 1–3 weeks of crisis-era prices in its lookback. This inflates the baseline, and unless the model's predicted residual fully offsets that inflation, the reconstructed forecast stays too high even on normal-price days: systematic overprediction.

The 1-week single-point baseline adapts to the current regime within 7 days. The 4-week baseline takes 28 days. In a market with abrupt crisis/recovery transitions, recency beats smoothing.

The EWM variant (w1=0.4 on the most recent week) was expected to reduce crisis contamination, but its bias (+3.87) is in fact higher than the plain 4-week mean's (+3.19): even a single crisis week carrying weight 0.3–0.4 is enough to distort the baseline for a full month.
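The adaptation lag can be illustrated with an idealized step change. The simulation below is a sketch, not a replay of the evaluation window: the crisis and normal price levels are illustrative picks inside the ranges quoted above, the transition is a clean one-day step, and the baselines are computed from weekly lags of a single daily series.

```python
CRISIS, NORMAL = 200.0, 65.0   # illustrative levels within the quoted ranges
T = 56                         # index of the first normal-price day
prices = [CRISIS if d < T else NORMAL for d in range(T + 56)]

# Weights on the prices 1, 2, 3, 4 weeks back (most recent first).
BASELINES = {
    "1w": (1.0,),
    "4w_mean": (0.25, 0.25, 0.25, 0.25),
    "4w_weighted": (0.4, 0.3, 0.2, 0.1),
}


def baseline(weights, d):
    """Weighted combination of weekly lags for day index d."""
    return sum(w * prices[d - 7 * (k + 1)] for k, w in enumerate(weights))


def inflation_by_week(weights, n_weeks=5):
    """Mean (baseline - NORMAL) for each of the first n_weeks after the step."""
    out = []
    for week in range(n_weeks):
        days = range(T + 7 * week, T + 7 * (week + 1))
        out.append(sum(baseline(weights, d) - NORMAL for d in days) / 7)
    return out


def days_to_adapt(weights):
    """Number of post-transition days on which the baseline is still inflated."""
    return sum(
        1 for d in range(T, len(prices))
        if abs(baseline(weights, d) - NORMAL) > 1e-9
    )


for name, w in BASELINES.items():
    print(name, days_to_adapt(w), [round(x, 2) for x in inflation_by_week(w)])
```

The 1-week baseline is clean after 7 days, while both 4-week variants stay inflated for 28. In this idealized step the weighted variant decays faster than the plain mean during weeks 2 to 4, so the simulation on its own does not account for its higher measured bias (+3.87 vs +3.19).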

Rule established: For the Spanish DA electricity market with episodic price crises, residual_1w is the best of the three residual baselines tested. Multi-week smoothing may help in stable markets, but it was harmful across the regime switch in this window.

Promotion Decision

v10.1 task-aligned encoder (MAE 15.73, bias −0.65, spike recall 24.1%) remains the promotion candidate. See the v10.1 changelog and promotion plan.

Next Experiments

With LSTM-inclusive feature re-selection and smoother residual baselines ruled out, the remaining approaches to close the base-hour MAE gap are:

H3: Conditional ensemble with volatility gate — blend task-aligned (crisis periods) with base-xgb-nopw (calm periods) using a lightweight volatility predictor
H4: Per-hour quantile schedule — q=0.50 for off-peak, q=0.55 for daytime, q=0.60 for peak risk hours
H5: French DA electricity price as input feature — EPEX D+1 prices known at origin time, strong predictor via Spain-France interconnection arbitrage
H6: Gas forward curve (1-month TTF) — captures market expectations of supply shocks before they appear in spot prices
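To make H4 concrete, a per-hour quantile schedule paired with the standard pinball (quantile) loss might look like the sketch below. The off-peak hours follow the table note in the Results section. The split between "daytime" and "peak risk" hours is not defined in this report, so the 18:00–21:00 peak-risk window is a placeholder assumption, as are the function names.

```python
def quantile_for_hour(hour, peak_risk_hours=frozenset(range(18, 22))):
    """Target quantile for a delivery hour (0-23).

    peak_risk_hours is a hypothetical placeholder; the report does not
    say which hours count as peak risk.
    """
    if hour < 8 or hour >= 22:      # off-peak: 00:00-07:00 and 22:00-23:00
        return 0.50
    if hour in peak_risk_hours:
        return 0.60
    return 0.55                     # remaining daytime hours


def pinball_loss(actual, predicted, q):
    """Quantile loss: underprediction costs q per unit, overprediction 1 - q."""
    diff = actual - predicted
    return q * diff if diff >= 0 else (q - 1) * diff


schedule = {h: quantile_for_hour(h) for h in range(24)}
print(schedule)
print(pinball_loss(100.0, 90.0, 0.60), pinball_loss(90.0, 100.0, 0.60))
```

With q above 0.50, underprediction costs more than overprediction, which pushes forecasts upward in exactly the hours where missed spikes are most expensive.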