v11.0 — Post-LSTM Correction (Production)
Date: April 9, 2026 | Status: ✅ Production (post-M0.6 Phase C, awaiting Phase F Cloud Run cutover for live deployment)
Why this version exists
This is a methodology correction, not a feature release. v11.0 retracts the entire v10.x LSTM-XGBoost hybrid line of experiments and replaces it with a clean retrain of the pre-LSTM v8-res-1w-pw3-d365 configuration that was already empirically better.
The investigation that led here happened in three stages:
Stage 1 (April 8) — “v10.1 isn’t actually deployed”
A read-only verification pass against the production VM found that the v10.1 promotion on March 22 had happened only at the env-var level. Three independent SSH checks against the VM:
| What we checked | Expected (per docs) | Reality |
|---|---|---|
| `pip show torch` in `/opt/epf/venv` | installed | "Package(s) not found" |
| `pip show torch` in `/usr/bin/python3` | installed (fallback) | "Package(s) not found" |
| `find /opt/epf -name 'lstm_encoder*.pt'` | `lstm_ta_64.pt` present | nothing |
| `predictions.model_name` for last 7 days | `xgboost_hybrid15` (with LSTM) | `ensemble_hybrid15` (3 base learners) |
Live production was actually the v4.3-era ensemble (XGBoost + LightGBM + HistGB) + residual_1w. The `EPF_LSTM_EMBEDDINGS=true` env var was set, but the predictor code path silently caught the `ImportError` and fell through. The "v10.1 LSTM-XGBoost hybrid" we'd been describing in the methodology blog post and frontend footer had never actually shipped — the env-var flip was promoted, but the torch install and LSTM artifact deploy steps were never completed.
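The failure mode generalizes: an optional dependency behind a broad `except ImportError` turns a config flag into a silent no-op. A minimal, dependency-free sketch of the anti-pattern and its fail-fast fix (function name and parameters are illustrative, not the project's actual code):

```python
def resolve_lstm_embedder(enabled: bool, torch_available: bool,
                          fail_fast: bool = True):
    """Return 'lstm' if the feature can run, None if it is disabled.

    With fail_fast=False this reproduces the v10.1 incident: the env-var
    flip is honoured only when the dependency happens to be installed,
    and nothing ever reports the downgrade.
    """
    if not enabled:
        return None
    if not torch_available:
        if fail_fast:
            # The fix: config demanded the feature, so a missing
            # dependency must be a hard error, not a quiet fallback.
            raise RuntimeError("EPF_LSTM_EMBEDDINGS=true but torch is missing")
        return None  # silent downgrade -- the bug that ran for 17 days
    return "lstm"
```

The fail-fast variant would have turned the March 22 promotion into an immediate crash instead of 17 days of an unnoticed fallback.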
Stage 2 (April 8 evening) — “even the backtest was broken”
The first fully end-to-end local v11.0 LSTM test (the first time anyone in the project’s history had run a real LSTM-enabled prediction) produced clearly degraded output: mean predicted price 23 EUR/MWh against the 50–100+ Spanish DA reality, 90% confidence intervals from −47 to +46 EUR. Investigation found two layered code bugs:
Bug 1 — silent zero-fill at 15-min inference: `_build_origin_features_15min()` in `src/models/direct_predictor.py` never called `LSTMEmbedder.compute_embedding()`. Any `lstm_emb_*` feature in the trained model's `feature_cols` was filled with 0.0 at the row construction site. XGBoost handled the all-zero block silently. No exception, no warning, no log line — degraded predictions only. The 64-dim LSTM embedding block was effectively a no-op at every prediction call, including in every backtest that produced the v10.x headline metrics.
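A cheap forensic check would have surfaced Bug 1 immediately: an inference row whose entire embedding block is exactly zero is worth raising on rather than predicting with. An illustrative sketch (helper name assumed, not the project's actual code):

```python
import numpy as np
import pandas as pd


def assert_embedding_block_live(row: pd.DataFrame, prefix: str = "lstm_emb_") -> None:
    """Forensic check for the Bug 1 signature.

    An all-zero lstm_emb_* block at inference almost certainly means the
    embedder was never called. XGBoost will consume the zeros without
    complaint, so the check has to raise here, before prediction.
    """
    emb_cols = [c for c in row.columns if c.startswith(prefix)]
    if emb_cols and np.allclose(row[emb_cols].to_numpy(dtype=float), 0.0):
        raise RuntimeError(
            f"all {len(emb_cols)} {prefix}* features are zero at inference; "
            "the embedder was likely never invoked"
        )
```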
Bug 2 — wrong-domain LSTM input at training time: `direct_trainer.py` fed 15-minute prices through an LSTM encoder pre-trained on hourly prices. The encoder expects 168 elements = 1 week of context with hourly normalization stats; what it actually received was 168 quarter-hours = 42 hours of context. Wrong shape, wrong domain, wrong normalization. The encoder output was noise. XGBoost trained against that noise, with `feature_selector` picking 6-10 of those noise dimensions per horizon group as if they were real features.
Empirical verification: toggling EPF_LSTM_EMBEDDINGS=true ↔ false at inference time on the v10.1 production joblib produced bit-for-bit identical predictions. The LSTM block contributed exactly zero useful signal at runtime.
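For reference, the domain mismatch in Bug 2 is one aggregation step away from being fixed; a sketch assuming pandas and a 15-minute-indexed price series (the helper name is illustrative, not the project's training code):

```python
import numpy as np
import pandas as pd


def weekly_hourly_context(prices_15min: pd.Series) -> np.ndarray:
    """Build the 168-element, one-week hourly context an hourly-trained
    encoder expects, starting from a 15-minute series.

    Skipping this aggregation is exactly Bug 2: 168 raw quarter-hours
    span only 42 hours, with the wrong normalization domain.
    """
    hourly = prices_15min.resample("1h").mean()  # 4 quarter-hours -> 1 hourly mean
    context = hourly.iloc[-168:].to_numpy(dtype=float)
    if context.shape != (168,):
        raise ValueError("need at least one week of hourly history")
    return context
```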
Stage 3 (April 9) — “v8 already won, we just didn’t notice”
A follow-up dig into the historical SQLite snapshot found the xgboost_hybrid15_v8-res-1w-pw3-d365_backtest tag from March 22 — a pre-LSTM scout configuration we’d been treating as a stepping stone toward v10.x. Same 156-day evaluation window. Same eval methodology. Different result:
| Tag | DA MAE | DA bias | DA SpkR | DA DirAcc | DA CorrFr | DA SprdCap |
|---|---|---|---|---|---|---|
| v8-res-1w-pw3-d365 (historical, March) | 12.98 | -3.60 | 71.08% | 76.74% | 0.904 | 92.31% |
| v10.1 (broken-LSTM, retracted) | 15.73 | -0.65 | 69.34% | 75.87% | 0.887 | 90.96% |
The pre-LSTM v8 configuration already strictly dominated v10.1 on every metric except bias — including the structural metrics (SpkR, DirAcc, CorrFr, SprdCap) that the v10.x narrative had cited as evidence for LSTM’s value. v10.1’s near-zero bias (−0.65) was an artifact of the broken-LSTM zero-fill noise acting as accidental regularization, not a genuine calibration win. The v10.x narrative had been wrong about which model was the real winner.
That left a clear path forward: retrain the v8 architecture from current code on current data, ship it as v11.0, retract v10.x.
Architecture
Single XGBoost (no ensemble). No LSTM. Same general feature pipeline as v10.x but without the broken LSTM block.
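The residual_1w target transform can be sketched as follows, assuming its semantics are "train on the delta against the same quarter-hour one week earlier, add the baseline back at inference" (helper names are illustrative, not the project's trainer code):

```python
import numpy as np
import pandas as pd


def to_residual_target(prices: pd.Series) -> pd.Series:
    """Training target: price minus the price 168h (one week) earlier.

    shift(freq=...) moves the index forward in time, so subtraction
    aligns each timestamp with its value one week before; the first
    week of history has no baseline and comes out NaN.
    """
    baseline_1w = prices.shift(freq="168h")
    return prices - baseline_1w


def from_residual_prediction(model_output: float, target_baseline_1w: float) -> float:
    """Inverse transform at inference: pred = model_output + target_baseline_1w."""
    return model_output + target_baseline_1w
```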
```
Input data
├─ REE 15-min day-ahead prices (hybrid: hourly pre-Oct-2025 expanded to
│  15-min, genuine 15-min post-Oct-2025)
├─ REE generation mix (wind, solar, hydro, nuclear, combined cycle, etc.)
├─ Open-Meteo weather (5 weather stations, hourly + 7-day forecasts)
└─ Commodities (TTF gas, ETS carbon, Brent — daily, forward-filled)
        ↓
Feature engineering (~110 columns per origin)
├─ price lags (15m, 30m, 45m, 1h, 2h, 3h, 24h, 48h, 72h, 168h, 336h, 504h)
├─ rolling stats (24h, 7d mean/std/min/max/iqr)
├─ momentum (1h/24h change, acceleration, peak-to-base ratio)
├─ cyclical time (hour_sin/cos, dow, month, week — UTC)
├─ calendar (holidays, weekend, christmas period, august vacation)
├─ generation mix shares (wind, solar, nuclear, residual demand, thermal)
├─ weather (temp, wind, GHI, HDD/CDD, cold×demand interaction)
├─ commodities (gas, carbon, marginal cost, spark spread)
└─ residual_1w baseline (target transform — added to prediction at inference)
        ↓
Feature selection: correlation filter → permutation importance
~110 → ~70-90 features per horizon group
        ↓
XGBoost direct multi-horizon predictor
depth=12, lr=0.03, min_child=5, reg_lambda=0.3, q=0.55
pw3 above 60 EUR/MWh, decay 365d
├─ DA1 (quarters 56-103, ~14h to ~25h ahead)
├─ DA2 (quarters 104-151, ~26h to ~38h ahead)
├─ S1 (quarters 132-227, strategic D+2)
├─ S2 (quarters 228-323, strategic D+3)
├─ S3 (quarters 324-419, strategic D+4)
├─ S4 (quarters 420-515, strategic D+5)
└─ S5 (quarters 516-707, strategic D+6-D+7)
        ↓
Inverse transform: pred = model_output + target_baseline_1w
Post-processing: isotonic calibration + residual corrector (by hour-of-day + dow)
Conformal calibrator: 50% + 90% prediction intervals
        ↓
predictions table (96 quarter-hours/day per horizon group)
```
Phase C measurements (156-day window 2025-10-09 → 2026-03-18, NaN-safe)
| Metric | Dayahead (n=159) | Strategic (n=837 daily eval rows) |
|---|---|---|
| MAE (EUR/MWh) | 14.26 | 17.35 |
| Bias (EUR/MWh) | −3.91 | −11.62 |
| RMSE | 17.40 | ~24 |
| Spike Recall (SpkR) | 69.28% | 55.19% |
| Directional Accuracy (DirAcc) | 76.59% | 58.18% |
| Forecast correlation (CorrFr) | 0.891 | 0.782 |
| Energy covariance (CovE) | −0.473 | −0.570 |
| Spread Capture (SprdCap) | 90.92% | 79.54% |
vs v10.1 (retracted broken-LSTM)
| | DA MAE | DA bias | DA SpkR | DA DirAcc | Strategic MAE | Strategic SpkR | Strategic DirAcc |
|---|---|---|---|---|---|---|---|
| v11.0 (current) | 14.26 | -3.91 | 69.28% | 76.59% | 17.35 | 55.19% | 58.18% |
| v10.1 (retracted) | 15.73 | -0.65 | 69.34% | 75.87% | 18.13 | 52.45% | 55.50% |
| delta | -1.47 (-9.3%) | worse | ~tied | +0.72pp | -0.78 (-4.3%) | +2.73pp | +2.68pp |
v11.0 beats v10.1 on every metric except bias (DA spike recall is effectively tied). The strategic improvements are the biggest business win — a model that's 2.7pp better on directional accuracy and 2.7pp better on spike recall at D+2 to D+7 horizons is meaningful for traders looking out a week ahead.
The bias gap (−3.91 vs −0.65) is real but explained: v10.1’s near-zero bias was an artifact of broken-LSTM zero-fill noise acting as accidental regularization, NOT a real calibration win. The “best ever bias” framing in the v10.1 changelog page was actually a sign that the model couldn’t commit to extreme predictions because the noise features were dampening the trees’ confidence.
vs v8 historical
The v8-res-1w-pw3-d365 scout from March scored DA MAE 12.98 — 1.28 below v11.0's 14.26. The gap is explained by:
- 16 more days of training data through April 7 (the v8 scout was March 22). Some of those days are after the Iran crisis and have different price dynamics.
- Code drift in `direct_trainer.py` since March 22 — post-processing (isotonic calibration, residual corrector) has been refined.
- One NaN row in `ree_hourly` at 2025-10-26 01:00 UTC (the autumn DST duplicate hour) that propagates through lag features for a week.
- Schema additions (the `weather_country` and `commodity_country` columns added for future multi-country support) — handled by a pre-flight `feature_cols` dtype filter shipped as part of Phase C.
These are all data/code drift, not architectural regressions. v11.0 is what the v8 architecture looks like on current data and current code.
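To see why a single NaN row matters for a whole week, consider how the lag features look back across it; a small pandas sketch (hourly series and helper name are illustrative, using the lag set from the feature list above):

```python
import numpy as np
import pandas as pd


def nan_contamination_window(prices: pd.Series,
                             lag_hours=(24, 48, 72, 168)) -> pd.DatetimeIndex:
    """Timestamps whose lag features are poisoned by an upstream NaN.

    Any origin within max(lag) hours after a NaN source row sees at
    least one NaN lag feature, so one bad hour contaminates a week
    of training/inference rows when a 168h lag is in play.
    """
    horizon = pd.Timedelta(hours=max(lag_hours))
    mask = pd.Series(False, index=prices.index)
    for t in prices.index[prices.isna()]:
        mask.loc[t:t + horizon] = True  # label slicing is inclusive at both ends
    return prices.index[mask]
```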
Strategic backtest is the FIRST EVER for a pre-LSTM tag
No historical pre-LSTM scout in the project’s database had strategic eval rows. The strategic side was added after the v8 era and only had v10.1 broken-LSTM rows for comparison. Phase C closes that gap — xgboost_hybrid15_v11.0_backtest is the first clean strategic baseline the project has ever had.
DST hour-feature fix experiment (parallel A/B, REVERTED)
Phase C also tested a hypothesis that the cyclical hour features (origin_hour_sin/cos, target_hour_sin/cos) should be encoded in Madrid local time instead of UTC, since the model is otherwise misaligned across DST transitions. New helper functions _local_hour() / _local_dow() were added to direct_trainer.py and applied at the 4 cyclical encoding sites. Two parallel training runs were compared.
| | DA MAE | Strategic MAE | Strategic SpkR |
|---|---|---|---|
| v11.0 UTC-hours (shipped) | 14.26 | 17.35 | 55.19% |
| v11.0 with DST fix | 14.21 | 17.71 | 54.59% |
| delta | DST -0.05 | DST +0.36 | DST -0.59pp |
The DST fix marginally improved DA MAE but regressed strategic MAE by 0.36 and strategic SpkR by 0.59pp. Per the M0.6 acceptance rule (revert if significantly worse on strategic MAE/SpkR/SprdCap), the DST fix was REVERTED at all 4 call sites.
Likely cause: the price lag features (price_lag_24h, price_lag_48h, price_lag_168h) already encode hour-of-day patterns implicitly. The explicit cyclical encoding is partially redundant — switching its semantics from UTC to local changes the noise pattern without adding signal. A fuller DST refactor (also touching the morning/evening hour-band aggregates and operational filters in the trainer) was NOT tested in Phase C and is a candidate for a future tier-7 experiment.
The helper functions _local_hour() / _local_dow() are kept in direct_trainer.py module scope as scaffolding for future M1+ multi-country (PT/FR/DE) work, where per-country tz handling will be needed.
What changed in the codebase
`src/config.py:48`:

```python
# v11.0 = XGBoost depth=12 + residual_1w + pw3x + d365 (single-XGBoost, NO LSTM,
# NO ensemble). Promoted in M0.6 Phase C (2026-04-09) after the v10.x LSTM
# debacle was discovered (LSTM contributed zero useful signal due to layered
# train/inference bugs).
PRODUCTION_TAG: str = os.getenv("EPF_PRODUCTION_TAG", "v11.0")
```

`src/models/direct_trainer.py`:
- Pre-flight bug fix: the `feature_cols` selector now filters object-dtype columns (the local PG schema added `weather_country` and `commodity_country` strings that broke `data[feature_cols].astype(float)`)
- Added `_local_hour()` / `_local_dow()` helpers (currently unused, kept for M1+)
- Inline comments at the 4 cyclical encoding sites documenting the experimental result
- The LSTM block at lines 1265-1274 still exists in the source but is gated behind `EPF_LSTM_EMBEDDINGS=true` (which is now `false` in production); the drift guard in `direct_predictor.py` raises if anyone re-enables it without first fixing the underlying bugs

`src/models/direct_predictor.py`:
- Drift guard at the 15-min inference call sites (commit `3e96c0b` from M0.6 Phase A.2): if `EPF_LSTM_EMBEDDINGS=true` and any `lstm_emb_*` feature is missing from the row, raises `RuntimeError` instead of silently zero-filling. This catches any future re-enablement of LSTM before silent corruption can recur.
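A hedged sketch of the guard's contract (the actual code in `direct_predictor.py` is not reproduced here; the helper name and signature are illustrative):

```python
import os


def guard_lstm_features(model_feature_cols, row_features) -> None:
    """Sketch of the drift guard's contract.

    When the LSTM flag is on and the trained model expects lstm_emb_*
    columns that the freshly built inference row does not carry, fail
    loudly instead of zero-filling. With the flag off (production today),
    the guard is a no-op.
    """
    if os.getenv("EPF_LSTM_EMBEDDINGS", "false") != "true":
        return  # LSTM disabled: nothing to guard
    missing = [c for c in model_feature_cols
               if c.startswith("lstm_emb_") and c not in row_features]
    if missing:
        raise RuntimeError(
            f"{len(missing)} lstm_emb_* features expected by the model are "
            "missing from the inference row; refusing to zero-fill"
        )
```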
What’s still pending
Phase F — Cloud Run Job cutover. Until Phase F ships, the VM cron is still writing v10.1-era ensemble_hybrid15 rows (the 3-base-learner ensemble + residual_1w + broken-LSTM zero-fill). v11.0 exists locally as joblibs and a backtest tag in eval_daily_metrics. The dashboard at productjorge.com still shows v10.1-era numbers until Phase F runs.
Phase F is straightforward and de-risked: build the predictor image (already shipped in M0.6 Phase B), upload v11.0 joblibs to a GCS bucket, deploy two Cloud Run Jobs triggered by Cloud Scheduler, run 7-day shadow validation under xgboost_hybrid15_shadow, then disable the VM cron prediction lines and backfill 30 days of historical predictions under xgboost_hybrid15_v11.0_backfill. Estimated time: half a day of work + 7-day shadow window.
What this episode teaches
- A single drift guard would have caught both bugs at the 30-second mark. The drift guard now committed in `direct_predictor.py` raises a `RuntimeError` if `EPF_LSTM_EMBEDDINGS=true` and the trained model expects `lstm_emb_*` features that aren't present in the inference row. It's opt-in via the same env var that enables LSTM, so production is unaffected. If anyone ever re-enables LSTM after a properly fixed LSTM implementation, the guard catches the silent-zero-fill class of bug before it can corrupt predictions.
- The new `/api/v1/production-state` endpoint and `scripts/verify_production_state.py` make every claim about "current production state" machine-verifiable. The new `DOCUMENTATION_GUIDE.md` §5a rule requires citing the verification command in commit messages before claiming any fact about production. None of this existed during the original v10.x promotion.
- The four "LSTM hard rules" that had been treated as load-bearing project knowledge ("never combine LSTM with price weighting", "always combine LSTM with residual_1w", "never feature-select LSTM dims", "residual_1w is the only sanctioned target transform") were derived from bug-affected experiments. The first three are retired as suspect. The fourth (residual_1w is the right transform) survives independent validation against v6.3/v7.0 controls and remains load-bearing for v11.0.
- The "promotion to production" workflow needs to verify deployment, not just env vars. The original v10.1 promotion on March 22 was a PR that updated `/opt/epf/app/.env` with `EPF_LSTM_EMBEDDINGS=true` + a few other vars. Nobody verified that torch was actually installed or that the LSTM artifact was on disk afterward. The whole class of "we promoted X but X never actually ran" silently passed CI, code review, and 17 days of production. The startup self-test in `scripts/cloud_run_predict.py` (M0.6 Phase B) and the production-state endpoint (Phase G) close this hole.
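In the spirit of machine-verifiable production state, a minimal sketch of what such a check can cross-verify (the real `scripts/verify_production_state.py` is not shown here; the function name, fields, and artifact glob are illustrative):

```python
import importlib.util
from pathlib import Path


def verify_production_state(expect_lstm: bool,
                            artifact_dir: Path = Path("/opt/epf")) -> dict:
    """Cross-check independent facts on the host instead of trusting env vars.

    The v10.1 incident was exactly a case where the env var said one thing
    while the installed packages and on-disk artifacts said another; the
    claim "LSTM is live" is only consistent if every prerequisite holds.
    """
    torch_installed = importlib.util.find_spec("torch") is not None
    artifacts = (list(artifact_dir.glob("**/lstm_encoder*.pt"))
                 if artifact_dir.exists() else [])
    return {
        "torch_installed": torch_installed,
        "lstm_artifacts": len(artifacts),
        "consistent": (not expect_lstm) or (torch_installed and bool(artifacts)),
    }
```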
Files for follow-up
- `data/m06_phase_c_results.md` (local-only, not in the public repo) — comprehensive Phase C report with all metric tables, the DST experiment data, the wall-time breakdown
- `docs/engineering/PROJECT_DOCUMENTATION.md` §17 (Methodology Retractions) — formal index of retracted methodology decisions
- `docs/engineering/ENGINEERING_LOG.md` [2026-04-09] M0.6 Phase C entry — full investigation narrative
- `CHANGELOG.md` [2026-04-09] entry — project changelog
- `docs/engineering/IMPROVEMENT_PLAN.md` — experiment tracking table with v11.0 row + v10.x retraction annotations
- The retraction notices on v10.0, v10.1, and v10.2