Methodology Changelog

Version History

Version Tagging Convention

Every model experiment gets a version number. Status tags track each version's lifecycle:

Tag	Meaning
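(Production)	Currently deployed to production (the VM, or Cloud Run Jobs for ES predictions since the 2026-04-09 cutover)
(Scout)	Fast exploratory experiment (90-day, single model, DA-only); not promotion-ready
(Validating)	Medium-scope validation (150-day, multiple models); promotion candidate
(Rejected)	Evaluated and not promoted
(Retracted)	Reported results withdrawn after the code that produced them was found to be broken (⚠️ Partially Retracted: only some findings withdrawn)
(Superseded)	Formerly in production, replaced by the later version named in the tag
(no tag)	Historical version, superseded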

Model Versions

Version	Date	Status	Description	DA MAE (EUR/MWh)	Strategic MAE (EUR/MWh)
Cloudflare CDN	2026-05-06	✅ Production	Edge termination at epf.productjorge.com: Cloudflare Free-plan proxy in front of the dashboard. Brotli and HTTP/3 enabled; EU PoPs cut TLS round-trip time from ~150 ms to ~20 ms. Manual cache rules for JSON responses are still pending.	—	—
Performance sprint (May 2026)	2026-05-05 → 2026-05-06	✅ Production	Lazy loading + batch endpoints: `MonthYearPicker`, `InfoPanel` and the recharts-importing components are now lazy-loaded. New `/forecast/combined/batch` and `/market/prices/batch` endpoints. -13 KB of gzipped critical-path JS and 3 fewer round-trips on the Multi-Country page.	—	—
Pipeline Monitoring Invariant	2026-04-25	✅ Production	Zero rows = failed by default: `PipelineRunTracker` marks any run that writes zero rows as `failed` unless `allow_zero_rows=True` is set. Intraday runs now fail loudly, which surfaced 14 days of previously silent intraday failures.	—	—
Evaluation Data Integrity Fix	2026-04-23	✅ Production	End-to-end data lineage cleanup — Three lineage fixes on `/api/v1/evaluation/benchmark-matrix`: country-filter dropouts in `backfill_actual_prices`, window-intersection alignment, dedupe invariants.	—	—
System Page Redesign	2026-04-23	✅ Production	Pass 1 + Pass 2A — Country health cards at top, plain-language labels, sorted failures, naive-baseline accuracy context on the Production Model card. No new endpoints.	—	—
Naive Benchmarks	2026-04-22	✅ Production (eval-only)	Per-country D+1 + strategic baselines — Three 15-min naive baselines (`persistence_d1`, `seasonal_weekly`, `similar_day`) for ES/PT/FR/DE on both `run_modes`. Surfaces in benchmark-matrix endpoint.	—	—
Phase 5 / v6.0 / Z3 ablation	2026-04-17	✅ Production (current)	Cross-price gating decision: 4-day multi-country feature sprint (Z1 holidays + Z2 gen-forecast + Z4 solar elevation) plus the Z3 ablation. Cross-country price features (Z3) improved only DE; ES, FR and PT MAE fell when Z3 was removed. Production: `EPF_CROSS_PRICE_COUNTRIES=DE`. ES v12.0-abl DA 13.99 (beats v11.0), PT v6.0-abl DA 21.94, FR v6.0-abl DA 24.52, DE v6.0 DA 27.64.	13.99–27.64	17.35–35.99
PT v6.0 hybrid15 fix	2026-04-16	✅ Production (PT)	PT resolution-correctness fix: plumbs `country` and `resolution` cleanly through the feature builder, trainer and predictor. PT DA MAE -9.2%, Strategic MAE -21% vs v2.0. Part of the broader Phase 5 sprint.	21.94 (PT)	24.67 (PT)
Multi-Country v2.0	2026-04-11	⬇️ Superseded by v6.0	PT/FR/DE v2.0 promoted — Fixed critical resolution mismatch + ES cross-price zero-fill in v1.0 models. Per-country tuning. PT MAE 8.43 (-80%), FR MAE 4.87 (-79%), DE MAE 4.14 (-80%). 27 total scouts across 3 countries. Cross-prices were applied universally; the Z3 ablation later showed this hurt 3 of 4 countries.	4.14–8.43	—
M0.6 Phase F — Cloud Run cutover	2026-04-09	✅ Production	ES predictions on Cloud Run Jobs: VM cron prediction lines disabled with the `#CUTOVER_2026_04_09#` marker. Container `predictor:v11.0`, model joblibs on GCS, Cloud Scheduler triggers at 10:10 and 15:10 UTC, VPC connector to PostgreSQL on the VM. Parity with the VM reference predictions verified to within 0.01 EUR.	—	—
v11.0	2026-04-09	⬇️ Superseded by v12.0-abl for ES DA; still canonical for ES strategic	Post-LSTM correction: ES single XGBoost + residual_1w + pw3x + d365, no LSTM, no ensemble. DA MAE 14.26, Strategic MAE 17.35.	14.26	17.35
v10.2	2026-03-22	⚠️ Partially Retracted 2026-04-09	Feature re-selection + residual baseline variants — H1 (LSTM-aware feature selection) retracted because the underlying LSTM was broken. H2 (4-week residual baselines) survives independent validation; the +3-4 EUR bias finding is real and is why v11.0 uses residual_1w.	16.90 (ta-fs)	—
v10.1	2026-03-22	❌ Retracted 2026-04-09	Task-aligned LSTM encoder + ensemble validation — the headline metrics (MAE 15.73, bias −0.65, SpkR 24.1%) were measured on broken code. The LSTM block contributed zero useful signal. v8-res-1w-pw3-d365 (the predecessor) actually dominated v10.1 on every metric except bias. See v11.0 for the corrected production model.	15.73 (broken)	18.13 (broken)
v10.0	2026-03-21	❌ Retracted 2026-04-09	LSTM-XGBoost hybrid scout: same two LSTM bugs as v10.1. The 13.12 MAE was credited to the LSTM but actually came from residual_1w (which works without the LSTM). The “best structural metrics” were likewise affected by the bugs.	13.12 (broken)	—
v9.0	2026-03-21	Scout	Quantile ensemble: 7 experiments with q10/q50/q90 and q25/q55/q75 under various weight combinations. Best MAE 14.48 (narrow spread), which did not beat the single q=0.55 model (12.69): averaging compressed predictions cannot break the compression ceiling.	14.48 (qens)	—
v8.0	2026-03-21	Rejected	PyTorch neural network — 5 experiments: deep residual MLP with BatchNorm, dropout, cosine LR. Best MAE 28.13 (2.2× worse than XGB). Slope 0.39-0.55 (worse). Tabular NNs can’t beat trees on flattened features.	28.13 (tft)	—
v7.2	2026-03-21	Scout	Residual target transform: predict the deviation from the 1-week baseline instead of raw EUR. Slope 0.71→0.75 (+5.5%), bias -51%, MAE in the 80–130 EUR band -26%, MaxPred 135→160. Structural thesis confirmed, but overall MAE 12.92 is worse than v7.0's 12.69.	12.92 (xgb)	—
v7.1	2026-03-20	Rejected	MLP neural network: sklearn MLPRegressor MAE 37+ (~3× worse than XGB). Slope 0.26–0.52 (worse, not better). Confirms that compression is model-specific, but sklearn MLP is not viable.	37.24 (mlp)	—
v7.0	2026-03-20	Scout	Compression-breaking experiments — 17 experiments: price weighting, threshold tuning, Monday features. pw-3x-d365-t60 MAE 12.69 (-12.3% vs prod). Spike recall 76%.	12.69 (xgb)	—
v6.3	2026-03-19	Validated	48-experiment deep trees validation — XGB d12+d365 MAE 13.20 (-8.8% vs prod). Compression analysis: slope 0.70, 80-130 EUR range drives 40% of error.	13.20 (xgb)	—
v6.2	2026-03-18	Scout	24 structural scout experiments — deep trees winner	12.55 (xgb)	—
v6.1	2026-03-18	Scout	Structural diagnosis — range compression analysis + config scouts	13.41 (xgb)	—
v5.1	2026-03-17	Rejected	Winsorize 200 EUR + decay 365d — bias flipped positive +11.92	15.34 (ens)	—
v5.0b	2026-03-17	Rejected	Log-transform target: compressed the range, DA MAE +14.5% worse	—	—
v5.0	2026-03-11	Rejected	Peak/off-peak split: halved the training data, DA MAE +10.1% worse	—	—
v4.3	2026-03-05	Production	Feature selection pipeline	14.47 (ens)	19.79 (ens)
v4.2	2026-03-04		+24 crisis features	14.95 (ens)	20.36
v4.1	2026-03-02		Quantile loss + weather interactions	13.42 (ens)	20.47
v3.1	2026-02-26		MAE loss + asymmetric conformal	14.85	—
v3.0	2026-02-25		Two-product system	—	—
v2.1	2026-02-24		15-minute resolution	—	—
v2.0	2026-02-17		Multi-model ensemble	—	—
v1.1	2026-02-14		Feature expansion	—	—
v1.0	2026-02-12		Initial release	—	—
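The zero-rows invariant from the Pipeline Monitoring Invariant row is easy to state in code. The changelog does not show `PipelineRunTracker`'s interface, so the sketch below is a hypothetical stand-alone version: `RunResult` and `finalize_run` are illustrative names, and only the `allow_zero_rows` flag and the `failed` status come from the entry itself.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    # Hypothetical record of one finished pipeline run (illustrative fields).
    pipeline: str
    rows_written: int
    status: str = "success"


def finalize_run(pipeline: str, rows_written: int,
                 allow_zero_rows: bool = False) -> RunResult:
    """Apply the 'zero rows = failed' invariant to a finished run.

    A run that wrote no rows is marked "failed" unless the caller opts out
    explicitly with allow_zero_rows=True.
    """
    if rows_written < 0:
        raise ValueError("rows_written cannot be negative")
    if rows_written == 0 and not allow_zero_rows:
        return RunResult(pipeline, rows_written, status="failed")
    return RunResult(pipeline, rows_written)


# An intraday run that silently wrote nothing now fails loudly.
print(finalize_run("intraday", 0).status)                        # failed
print(finalize_run("intraday", 0, allow_zero_rows=True).status)  # success
```

Making the opt-out explicit means a pipeline that legitimately produces nothing has to say so at the call site, while every other empty run is surfaced as a failure.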

Current production (post Z3 ablation, 2026-04-17): Per-country single-XGBoost winners with cross-price gating.

ES: v12.0 ablation for day-ahead (DA MAE 13.99, beats v11.0's 14.26) + v11.0 for strategic (17.35). Z3 off.
PT: v6.0 ablation for day-ahead (21.94) + v6.0 with Z3 for strategic (24.67). Mixed gating.
FR: v6.0 ablation for day-ahead (DA 24.52) + v5.0 for strategic (28.47; v6.0-abl was close behind at 28.61). Z3 off.
DE: v6.0 with Z3 for both horizons (DA 27.64 / strategic 35.99). The only country that benefits from cross-country prices.
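Cross-price gating is controlled by the `EPF_CROSS_PRICE_COUNTRIES` environment variable. Below is a minimal sketch of how a feature builder might read it, assuming the variable holds a comma-separated list of country codes (only the single value `DE` is documented; the list format and the helpers `cross_price_countries` / `use_cross_prices` are hypothetical). It gates per country only and does not model PT's mixed per-horizon setup.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def cross_price_countries(env: Mapping[str, str] | None = None) -> set[str]:
    """Parse EPF_CROSS_PRICE_COUNTRIES into upper-case country codes.

    ASSUMPTION: a comma-separated list such as "DE" or "DE,PT"; an unset or
    empty variable means no country gets cross-country price (Z3) features.
    """
    source = os.environ if env is None else env
    raw = source.get("EPF_CROSS_PRICE_COUNTRIES", "")
    return {code.strip().upper() for code in raw.split(",") if code.strip()}


def use_cross_prices(country: str, env: Mapping[str, str] | None = None) -> bool:
    """True if Z3 cross-country price features should be built for `country`."""
    return country.upper() in cross_price_countries(env)


# Production setting from this changelog: only DE uses cross-country prices.
prod_env = {"EPF_CROSS_PRICE_COUNTRIES": "DE"}
for country in ("ES", "PT", "FR", "DE"):
    print(country, use_cross_prices(country, prod_env))
```

Passing the environment as a mapping keeps the gating testable without mutating `os.environ`.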

Production env: `EPF_CROSS_PRICE_COUNTRIES=DE`. All models share a base recipe (XGBoost max_depth=12, quantile q=0.55, 3× sample weight above 60 EUR, 365-day sample decay) with the residual_1w target transform, except FR, which predicts raw EUR prices.
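Two parts of the shared recipe, the residual_1w target and the sample weights, can be sketched in plain Python. None of the following is specified in the changelog, so treat each as an assumption: 15-minute data (672 steps per week), a residual baseline equal to the same quarter-hour one week earlier, `pw3x` meaning a weight of 3 for samples priced strictly above 60 EUR, `d365` meaning exponential decay with a 365-day half-life, and the two weight factors being multiplied. All function names are hypothetical.

```python
from __future__ import annotations

STEPS_PER_WEEK = 7 * 96  # ASSUMPTION: 15-minute resolution, 96 steps per day


def residual_1w_target(prices: list[float],
                       steps_per_week: int = STEPS_PER_WEEK) -> list[float | None]:
    """Training target: price minus the price one week earlier.

    The first week has no baseline, so its targets are None.
    """
    return [
        None if i < steps_per_week else price - prices[i - steps_per_week]
        for i, price in enumerate(prices)
    ]


def invert_residual_1w(predicted_residual: float, baseline_price: float) -> float:
    """Convert a predicted residual back into a price by adding the 1-week baseline."""
    return baseline_price + predicted_residual


def sample_weights(prices: list[float], age_days: list[float],
                   threshold_eur: float = 60.0, high_price_weight: float = 3.0,
                   half_life_days: float = 365.0) -> list[float]:
    """3x weight above the price threshold, times an exponential recency decay.

    ASSUMPTIONS: "d365" is read as a 365-day half-life, and the price and
    recency factors are multiplied together.
    """
    weights = []
    for price, age in zip(prices, age_days):
        price_factor = high_price_weight if price > threshold_eur else 1.0
        recency_factor = 0.5 ** (age / half_life_days)
        weights.append(price_factor * recency_factor)
    return weights


print(residual_1w_target([50.0, 60.0, 55.0, 70.0], steps_per_week=2))  # [None, None, 5.0, 10.0]
print(sample_weights([45.0, 95.0, 95.0], [0, 0, 365]))  # [1.0, 3.0, 1.5]
```

In the scikit-learn-style XGBoost API such weights would be passed as `fit(X, y, sample_weight=...)`; FR would skip the residual transform and train on raw prices.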

v11.0 was the ES-only ancestor of the current stack and is the canonical entry point for understanding the retraction of the v10.x LSTM+XGBoost hybrid: two layered code bugs meant the LSTM block contributed zero useful signal throughout the v10.x line. See v11.0 for the retraction narrative, and the cross-price gating entry for the Z3 ablation result that shaped the current multi-country configuration. 120+ total experiments were run across v1.0–v6.1; the v10.x retraction is documented here as a matter of methodological integrity.

Architecture & Data Versions

Versions that changed system architecture, data pipelines, or UI without retraining models.

Version	Date	Description	Impact
Cloudflare CDN	2026-05-06	Edge termination for `epf.productjorge.com`	EU TLS RTT ~150 ms → ~20 ms; Brotli + HTTP/3 + DDoS protection at the edge
Performance sprint	2026-05-05 → 2026-05-06	Lazy loading + batch endpoints	-13 KB gzip critical-path JS; -3 blocking round-trips on Multi-Country page
System Page Redesign	2026-04-23	Pass 1 + Pass 2A	Country health cards, naive-baseline accuracy context, plain-language labels
M0.6 Phase F — Cloud Run cutover	2026-04-09	ES predictions on Cloud Run Jobs	Predictions decoupled from VM cron; resolves D-10 from `PRODUCT_SCALE_PLAN.md`
v3.3	2026-03-02	Dashboard restructuring: Price Drivers promoted to top-level section	Cleaner navigation, 11 components deleted
v3.2	2026-03-02	Commodity fix (yfinance) + oil prices + Price Drivers redesign	Fixed broken commodity pipeline, added oil data
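The naive baselines behind the accuracy context can be sketched as lagged copies of the price series. The changelog names `persistence_d1`, `seasonal_weekly` and `similar_day` without defining them; this sketch assumes the common meanings (copy the same quarter-hour one day and one week earlier, respectively) for the D+1 horizon, and leaves out `similar_day`, whose selection rule is not described. `lagged_forecast` is a hypothetical helper.

```python
from __future__ import annotations

STEPS_PER_DAY = 96  # 15-minute resolution


def lagged_forecast(history: list[float], horizon_steps: int,
                    lag_steps: int) -> list[float]:
    """Forecast the next horizon_steps values by copying the value lag_steps earlier.

    lag_steps must be >= horizon_steps so every copied value is already in history.
    """
    if lag_steps < horizon_steps:
        raise ValueError("lag must cover the whole forecast horizon")
    if len(history) < lag_steps:
        raise ValueError("not enough history for this lag")
    start = len(history) - lag_steps
    return history[start:start + horizon_steps]


def persistence_d1(history: list[float]) -> list[float]:
    """D+1 forecast: tomorrow's 96 quarter-hours repeat today's."""
    return lagged_forecast(history, STEPS_PER_DAY, STEPS_PER_DAY)


def seasonal_weekly(history: list[float]) -> list[float]:
    """D+1 forecast: tomorrow's 96 quarter-hours repeat the same weekday last week."""
    return lagged_forecast(history, STEPS_PER_DAY, 7 * STEPS_PER_DAY)
```

Requiring the lag to cover the whole horizon keeps the baseline honest: it can only copy prices that are already in the history when the forecast is made.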

Key Transitions

Transition	Version	What Changed
Single → Ensemble	v2.0	Added LightGBM + XGBoost alongside HistGBT
Unified → Two-Product	v3.0	Split into day-ahead (D+1, 10:00 UTC) and strategic (D+2–D+7, 15:00 UTC)
MSE → MAE	v3.1	Switched to absolute error loss; strategic bias -6.94 → -0.30
MAE → Quantile	v4.1	Quantile loss q=0.55 targets 55th percentile; DA MAE -18.2%
Manual → Selected	v4.3	Two-stage feature selection (correlation + permutation importance) per horizon
Shallow → Deep Trees	v6.2	max_depth 8→12 with regularization; DA MAE -13.3% in scouts
Equal → Price-Weighted	v7.0	High-price samples weighted 3× during training; spike recall 56→76%
Raw → Residual Target	v7.2	Predict deviation from weekly baseline; slope ceiling 0.71→0.75, bias -51%
Tabular → LSTM Hybrid	v10.0	[RETRACTED 2026-04-09] LSTM temporal embeddings were intended to augment XGBoost features. Two layered code bugs meant the LSTM block contributed zero useful signal throughout. See v10.0 retraction.
LSTM Hybrid → Single XGBoost	v11.0	Post-LSTM correction: retracted v10.x and retrained the v8 winner (XGBoost + residual_1w + pw3x + d365) as the canonical production model. On the same evaluation window it beats v10.1 on every metric except bias.
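The MAE → Quantile transition swaps absolute error for the quantile (pinball) loss. A minimal pure-Python version follows to make the asymmetry of q=0.55 concrete; it illustrates the loss only and is not the project's training code, which the changelog does not include.

```python
def pinball_loss(y_true: list[float], y_pred: list[float], q: float = 0.55) -> float:
    """Mean quantile (pinball) loss at quantile q.

    Under-forecasts (actual above prediction) cost q per EUR of error and
    over-forecasts cost (1 - q), so q = 0.55 nudges predictions upward.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        total += max(q * error, (q - 1.0) * error)
    return total / len(y_true)


under = pinball_loss([100.0], [90.0])  # actual 10 EUR above the forecast
over = pinball_loss([90.0], [100.0])   # actual 10 EUR below the forecast
print(under > over)  # True
```

At q=0.5 the loss is exactly half the MAE, so the v4.1 change amounts to tilting MAE so that under-forecasting costs 0.55 per EUR and over-forecasting 0.45.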

Reading the Changelogs

Each version page documents:

What changed: Technical details of the update
Why: The problem or limitation being addressed
Impact: Measurable accuracy improvements or capability additions
Key files: Source code locations of the changes