Methodology Changelog
Version History
Version Tagging Convention
Every model experiment gets a version number. Status tags track lifecycle:
| Tag | Meaning |
|---|---|
| (Production) | Currently deployed to production VM |
| (Scout) | Fast exploratory experiment (90-day, single model, DA-only) — not promotion-ready |
| (Validating) | Medium-scope validation (150-day, multiple models) — promotion candidate |
| (Rejected) | Evaluated and not promoted |
| (no tag) | Historical version, superseded |
Model Versions
| Version | Date | Status | Description | DA MAE (EUR) | Strategic MAE (EUR) |
|---|---|---|---|---|---|
| Cloudflare CDN | 2026-05-06 | ✅ Production | Edge termination at epf.productjorge.com — Cloudflare Free proxy in front of dashboard. Brotli, HTTP/3, EU PoPs cut TLS RTT ~150 ms → ~20 ms. Manual cache rules for JSON still pending. | — | — |
| Performance sprint (May 2026) | 2026-05-05 → 2026-05-06 | ✅ Production | Lazy loading + batch endpoints — Lazy MonthYearPicker, InfoPanel, recharts-importing components. New /forecast/combined/batch + /market/prices/batch endpoints. -13 KB gzip critical-path JS, -3 round-trips on Multi-Country page. | — | — |
| Pipeline Monitoring Invariant | 2026-04-25 | ✅ Production | Zero-rows = failed by default — PipelineRunTracker flips zero-rows runs to failed unless allow_zero_rows=True. Intraday now fail-loud; 14 days of silent intraday failures finally surfaced. | — | — |
| Evaluation Data Integrity Fix | 2026-04-23 | ✅ Production | End-to-end data lineage cleanup — Three lineage fixes on /api/v1/evaluation/benchmark-matrix: country-filter dropouts in backfill_actual_prices, window-intersection alignment, dedupe invariants. | — | — |
| System Page Redesign | 2026-04-23 | ✅ Production | Pass 1 + Pass 2A — Country health cards at top, plain-language labels, sorted failures, naive-baseline accuracy context on the Production Model card. No new endpoints. | — | — |
| Naive Benchmarks | 2026-04-22 | ✅ Production (eval-only) | Per-country D+1 + strategic baselines — Three 15-min naive baselines (persistence_d1, seasonal_weekly, similar_day) for ES/PT/FR/DE on both run_modes. Surfaces in benchmark-matrix endpoint. | — | — |
| PT v6.0 hybrid15 fix | 2026-04-16 | ✅ Production (PT) | PT resolution-correctness fix — Plumbs country + resolution cleanly through feature builder + trainer + predictor. PT DA MAE -9.2%, Strategic MAE -21% vs v2.0. Part of the broader Phase 5 sprint. | 21.94 (PT) | 24.67 (PT) |
| Phase 5 / v6.0 / Z3 ablation | 2026-04-17 | ✅ Production (current) | Cross-price gating decision — 4-day multi-country feature sprint (Z1 holidays + Z2 gen-forecast + Z4 solar elevation) + Z3 ablation. Cross-country prices improved only DE; ES/FR/PT MAE dropped when Z3 was removed. Production: EPF_CROSS_PRICE_COUNTRIES=DE. ES v12.0-abl DA 13.99 (beats v11.0), PT v6.0-abl DA 21.94, FR v6.0-abl DA 24.52, DE v6.0 DA 27.64. | 13.99–27.64 | 17.35–35.99 |
| Multi-Country v2.0 | 2026-04-11 | ⬇️ Superseded by v6.0 | PT/FR/DE v2.0 promoted — Fixed critical resolution mismatch + ES cross-price zero-fill in v1.0 models. Per-country tuning. PT MAE 8.43 (-80%), FR MAE 4.87 (-79%), DE MAE 4.14 (-80%). 27 total scouts across 3 countries. Cross-prices were applied universally; the Z3 ablation later showed this hurt 3 of 4 countries. | 4.14–8.43 | — |
| M0.6 Phase F — Cloud Run cutover | 2026-04-09 | ✅ Production | ES predictions on Cloud Run Jobs — VM cron prediction lines disabled with #CUTOVER_2026_04_09#. Container predictor:v11.0, GCS joblibs, Cloud Scheduler triggers at 10:10/15:10 UTC, VPC connector to VM PG. Parity verified within 0.01 EUR vs VM reference. | — | — |
| v11.0 | 2026-04-09 | ⬇️ Superseded by v12.0-abl for ES DA; still canonical for ES ST | Post-LSTM correction — ES single XGBoost + residual_1w + pw3x + d365, no LSTM, no ensemble. DA MAE 14.26, Strategic MAE 17.35. | 14.26 | 17.35 |
| v10.2 | 2026-03-22 | ⚠️ Partially Retracted 2026-04-09 | Feature re-selection + residual baseline variants — H1 (LSTM-aware feature selection) retracted because the underlying LSTM was broken. H2 (4-week residual baselines) survives independent validation; the +3-4 EUR bias finding is real and is why v11.0 uses residual_1w. | 16.90 (ta-fs) | — |
| v10.1 | 2026-03-22 | ❌ Retracted 2026-04-09 | Task-aligned LSTM encoder + ensemble validation — the headline metrics (MAE 15.73, bias −0.65, SpkR 24.1%) were measured on broken code. The LSTM block contributed zero useful signal. v8-res-1w-pw3-d365 (the predecessor) actually dominated v10.1 on every metric except bias. See v11.0 for the corrected production model. | 15.73 (broken) | 18.13 (broken) |
| v10.0 | 2026-03-21 | ❌ Retracted 2026-04-09 | LSTM-XGBoost hybrid scout — same two LSTM bugs as v10.1. The 13.12 MAE was attributed to LSTM but was actually attributable to residual_1w (which works without LSTM). The “best structural metrics” were similarly bug-affected. | 13.12 (broken) | — |
| v9.0 | 2026-03-21 | Scout | Quantile ensemble — 7 experiments: q10/q50/q90 and q25/q55/q75 with various weight combos. Best MAE 14.48 (narrow spread). Didn’t beat single q=0.55 model (12.69). Averaging compressed predictions can’t break compression ceiling. | 14.48 (qens) | — |
| v8.0 | 2026-03-21 | Rejected | PyTorch neural network — 5 experiments: deep residual MLP with BatchNorm, dropout, cosine LR. Best MAE 28.13 (2.2× worse than XGB). Slope 0.39-0.55 (worse). Tabular NNs can’t beat trees on flattened features. | 28.13 (tft) | — |
| v7.2 | 2026-03-21 | Scout | Residual target transform — predict deviation from 1w baseline instead of raw EUR. Slope 0.71→0.75 (+5.5%), bias -51%, 80-130 MAE -26%, MaxPred 135→160. Structural thesis confirmed but overall MAE 12.92 (vs v7.0 12.69). | 12.92 (xgb) | — |
| v7.1 | 2026-03-20 | Rejected | MLP neural network — sklearn MLPRegressor MAE 37+ (3x worse than XGB). Slope 0.26-0.52 (worse, not better). Confirms compression is model-specific but sklearn MLP not viable. | 37.24 (mlp) | — |
| v7.0 | 2026-03-20 | Scout | Compression-breaking experiments — 17 experiments: price weighting, threshold tuning, Monday features. pw-3x-d365-t60 MAE 12.69 (-12.3% vs prod). Spike recall 76%. | 12.69 (xgb) | — |
| v6.3 | 2026-03-19 | Validated | 48-experiment deep trees validation — XGB d12+d365 MAE 13.20 (-8.8% vs prod). Compression analysis: slope 0.70, 80-130 EUR range drives 40% of error. | 13.20 (xgb) | — |
| v6.2 | 2026-03-18 | Scout | 24 structural scout experiments — deep trees winner | 12.55 (xgb) | — |
| v6.1 | 2026-03-18 | Scout | Structural diagnosis — range compression analysis + config scouts | 13.41 (xgb) | — |
| v5.1 | 2026-03-17 | Rejected | Winsorize 200 EUR + decay 365d — bias flipped positive +11.92 | 15.34 (ens) | — |
| v5.0 | 2026-03-11 | Rejected | Peak/off-peak split — halved training data, DA MAE +10.1% worse | — | — |
| v5.0b | 2026-03-17 | Rejected | Log-transform target — compressed range, DA MAE +14.5% worse | — | — |
| v4.3 | 2026-03-05 | Production | Feature selection pipeline | 14.47 (ens) | 19.79 (ens) |
| v4.2 | 2026-03-04 | | +24 crisis features | 14.95 (ens) | 20.36 |
| v4.1 | 2026-03-02 | | Quantile loss + weather interactions | 13.42 (ens) | 20.47 |
| v3.1 | 2026-02-26 | | MAE loss + asymmetric conformal | 14.85 | — |
| v3.0 | 2026-02-25 | | Two-product system | — | — |
| v2.1 | 2026-02-24 | | 15-minute resolution | — | — |
| v2.0 | 2026-02-17 | | Multi-model ensemble | — | — |
| v1.1 | 2026-02-14 | | Feature expansion | — | — |
| v1.0 | 2026-02-12 | | Initial release | — | — |
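The Pipeline Monitoring Invariant row above describes a fail-loud rule: a run that writes zero rows counts as failed unless the caller opts out with `allow_zero_rows=True`. The sketch below illustrates that rule only. `RunResult` and `finalize_run` are hypothetical names chosen for this example, not the actual `PipelineRunTracker` interface.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one pipeline run (hypothetical shape, for illustration only)."""
    job: str
    rows_written: int
    status: str  # "success" or "failed"


def finalize_run(job: str, rows_written: int, allow_zero_rows: bool = False) -> RunResult:
    """Flip a zero-rows run to failed unless the job explicitly allows empty output."""
    if rows_written == 0 and not allow_zero_rows:
        return RunResult(job, rows_written, "failed")
    return RunResult(job, rows_written, "success")


# An intraday run that silently wrote nothing is now reported as a failure.
print(finalize_run("intraday", 0).status)
# A job that may legitimately be empty opts out of the check.
print(finalize_run("holiday_calendar", 0, allow_zero_rows=True).status)
```

Defaulting to failure means a new job has to declare that empty output is acceptable, rather than being trusted implicitly.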
Current production (post Z3 ablation, 2026-04-17): Per-country single-XGBoost winners with cross-price gating.
- ES — v12.0 ablation (DA MAE 13.99, beats v11.0 14.26) + v11.0 for strategic (17.35). Z3 off.
- PT — v6.0 ablation for day-ahead (21.94) + v6.0 with Z3 for strategic (24.67). Mixed gating.
- FR — v6.0 ablation (DA 24.52) + v5.0 for strategic (28.47, v6.0-abl 28.61 close). Z3 off.
- DE — v6.0 with Z3 for both horizons (DA 27.64 / ST 35.99). The only country that benefits from cross-country prices.
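A minimal sketch of how a per-country gate driven by `EPF_CROSS_PRICE_COUNTRIES` could be read. The comma-separated format and the helper names `cross_price_countries` and `use_cross_prices` are assumptions for illustration; the production parsing may differ, and this sketch does not model per-horizon choices such as PT's mixed gating.

```python
import os


def cross_price_countries(env=None) -> set[str]:
    """Parse EPF_CROSS_PRICE_COUNTRIES into a set of upper-case country codes.

    Assumes a comma-separated value such as "DE" or "DE,FR" (hypothetical format).
    """
    env = os.environ if env is None else env
    raw = env.get("EPF_CROSS_PRICE_COUNTRIES", "")
    return {code.strip().upper() for code in raw.split(",") if code.strip()}


def use_cross_prices(country: str, env=None) -> bool:
    """Return True when cross-country price features are enabled for this country."""
    return country.upper() in cross_price_countries(env)


production_env = {"EPF_CROSS_PRICE_COUNTRIES": "DE"}
for country in ("ES", "PT", "FR", "DE"):
    print(country, use_cross_prices(country, production_env))
```

With the production value `DE`, only the DE models would receive cross-country price features under this gate.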
Production env: EPF_CROSS_PRICE_COUNTRIES=DE. All models use the shared base recipe (XGBoost with max_depth=12, quantile loss at q=0.55, 3× sample weighting above 60 EUR (pw3x), 365-day sample decay) with the residual_1w target transform (except FR, which predicts raw EUR).
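The weighting and target parts of the base recipe can be sketched in plain Python. Two details are assumptions, because the changelog does not specify them: the 365-day decay is modelled as an exponential with a 365-day time constant, and the residual_1w baseline is taken as the price in the same slot one week earlier. The function names are illustrative.

```python
import math

PRICE_THRESHOLD_EUR = 60.0  # from the recipe: 3x weighting above 60 EUR
PRICE_WEIGHT = 3.0          # from the recipe: pw3x
DECAY_DAYS = 365.0          # from the recipe: sample decay 365d


def sample_weight(price_eur: float, age_days: float) -> float:
    """Price weight multiplied by a recency weight (exponential form is assumed)."""
    price_w = PRICE_WEIGHT if price_eur > PRICE_THRESHOLD_EUR else 1.0
    recency_w = math.exp(-age_days / DECAY_DAYS)
    return price_w * recency_w


def residual_targets(prices: list[float], steps_per_week: int) -> list[float]:
    """Training target: deviation from the price one week earlier (assumed baseline)."""
    return [prices[i] - prices[i - steps_per_week]
            for i in range(steps_per_week, len(prices))]


def restore_prices(residual_preds: list[float], baselines: list[float]) -> list[float]:
    """Invert the transform: add the weekly baseline back to predicted deviations."""
    return [r + b for r, b in zip(residual_preds, baselines)]


# Toy series with a two-step "week" so the arithmetic is easy to follow.
prices = [10.0, 20.0, 15.0, 18.0]
residuals = residual_targets(prices, steps_per_week=2)
print(residuals)
print(restore_prices(residuals, prices[:2]))
```

The model is trained on the residuals with these weights, and predictions are mapped back to EUR by adding the baseline. FR would skip the transform and train on raw prices.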
v11.0 was the ES-only ancestor of the current stack and is the canonical entry point for understanding the retraction of the v10.x LSTM+XGBoost hybrid: two layered code bugs meant the LSTM block contributed zero useful signal throughout the v10.x line. See v11.0 for the retraction narrative, and the cross-price gating decision (Phase 5 / v6.0) for the Z3 ablation result that shaped the current multi-country configuration. The project ran 120+ experiments across v1.0–v6.1; the v10.x retraction is documented as part of the project’s methodology integrity.
Architecture & Data Versions
Versions that changed system architecture, data pipelines, or UI without retraining models.
| Version | Date | Description | Impact |
|---|---|---|---|
| Cloudflare CDN | 2026-05-06 | Edge termination for epf.productjorge.com | EU TLS RTT ~150 ms → ~20 ms; Brotli + HTTP/3 + DDoS protection at the edge |
| Performance sprint | 2026-05-05 → 2026-05-06 | Lazy loading + batch endpoints | -13 KB gzip critical-path JS; -3 blocking round-trips on Multi-Country page |
| System Page Redesign | 2026-04-23 | Pass 1 + Pass 2A | Country health cards, naive-baseline accuracy context, plain-language labels |
| M0.6 Phase F — Cloud Run cutover | 2026-04-09 | ES predictions on Cloud Run Jobs | Predictions decoupled from VM cron; resolves D-10 from PRODUCT_SCALE_PLAN.md |
| v3.3 | 2026-03-02 | Dashboard restructuring: Price Drivers promoted to top-level section | Cleaner navigation, 11 components deleted |
| v3.2 | 2026-03-02 | Commodity fix (yfinance) + oil prices + Price Drivers redesign | Fixed broken commodity pipeline, added oil data |
Key Transitions
| Transition | Version | What Changed |
|---|---|---|
| Single → Ensemble | v2.0 | Added LightGBM + XGBoost alongside HistGBT |
| Unified → Two-Product | v3.0 | Split into day-ahead (D+1, 10:00 UTC) and strategic (D+2–D+7, 15:00 UTC) |
| MSE → MAE | v3.1 | Switched to absolute error loss; strategic bias -6.94 → -0.30 |
| MAE → Quantile | v4.1 | Quantile loss q=0.55 targets 55th percentile; DA MAE -18.2% |
| Manual → Selected | v4.3 | Two-stage feature selection (correlation + permutation importance) per horizon |
| Shallow → Deep Trees | v6.2 | max_depth 8→12 with regularization; DA MAE -13.3% in scouts |
| Equal → Price-Weighted | v7.0 | High-price samples weighted 3× during training; spike recall 56→76% |
| Raw → Residual Target | v7.2 | Predict deviation from weekly baseline; slope ceiling 0.71→0.75, bias -51% |
| Tabular → LSTM Hybrid | v10.0 | [RETRACTED 2026-04-09] LSTM temporal embeddings were intended to augment XGBoost features. Two layered code bugs meant the LSTM block contributed zero useful signal throughout. See v10.0 retraction. |
| LSTM Hybrid → Single XGBoost | v11.0 | Post-LSTM correction — retracted v10.x, retrained the v8 winner (XGBoost + residual_1w + pw3x + d365) as the canonical production model. Strictly dominates v10.1 on every metric except bias on the same window. |
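The MAE → Quantile transition relies on the quantile (pinball) loss. A short sketch of that loss at q=0.55 follows, written in plain Python for illustration rather than taken from the training code. It shows why q above 0.5 nudges predictions upward: under-prediction is penalised slightly more than over-prediction.

```python
def pinball_loss(y_true: list[float], y_pred: list[float], q: float = 0.55) -> float:
    """Mean quantile (pinball) loss for target quantile q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff >= 0) costs q per EUR; over-prediction costs (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Missing a 10 EUR move from below costs more than missing it from above.
under = pinball_loss([10.0], [0.0])
over = pinball_loss([0.0], [10.0])
print(under > over)
```

At q=0.5 the two penalties are equal and the loss reduces to half the MAE, so q=0.55 is a mild asymmetric variant of the v3.1 objective.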
Reading the Changelogs
Each version page documents:
- What changed: Technical details of the update
- Why: The problem or limitation being addressed
- Impact: Measurable accuracy improvements or capability additions
- Key files: Source code locations of the changes