Methodology Changelog

Version History

Version Tagging Convention

Every model experiment gets a version number. Status tags track lifecycle:

| Tag | Meaning |
| --- | --- |
| (Production) | Currently deployed to production VM |
| (Scout) | Fast exploratory experiment (90-day, single model, DA-only); not promotion-ready |
| (Validating) | Medium-scope validation (150-day, multiple models); promotion candidate |
| (Rejected) | Evaluated and not promoted |
| (no tag) | Historical version, superseded |

Model Versions

| Version | Date | Status | Description | DA MAE | Strat MAE |
| --- | --- | --- | --- | --- | --- |
| Cloudflare CDN | 2026-05-06 | Production | Edge termination at epf.productjorge.com: Cloudflare Free proxy in front of dashboard. Brotli, HTTP/3, EU PoPs cut TLS RTT ~150 ms → ~20 ms. Manual cache rules for JSON still pending. | | |
| Performance sprint (May 2026) | 2026-05-05 → 2026-05-06 | Production | Lazy loading + batch endpoints: Lazy MonthYearPicker, InfoPanel, recharts-importing components. New /forecast/combined/batch + /market/prices/batch endpoints. -13 KB gzip critical-path JS, -3 round-trips on Multi-Country page. | | |
| Pipeline Monitoring Invariant | 2026-04-25 | Production | Zero-rows = failed by default: PipelineRunTracker flips zero-rows runs to failed unless allow_zero_rows=True. Intraday now fail-loud; 14 days of silent intraday failures finally surfaced. | | |
| Evaluation Data Integrity Fix | 2026-04-23 | Production | End-to-end data lineage cleanup: Three lineage fixes on /api/v1/evaluation/benchmark-matrix: country-filter dropouts in backfill_actual_prices, window-intersection alignment, dedupe invariants. | | |
| System Page Redesign | 2026-04-23 | Production | Pass 1 + Pass 2A: Country health cards at top, plain-language labels, sorted failures, naive-baseline accuracy context on the Production Model card. No new endpoints. | | |
| Naive Benchmarks | 2026-04-22 | Production (eval-only) | Per-country D+1 + strategic baselines: Three 15-min naive baselines (persistence_d1, seasonal_weekly, similar_day) for ES/PT/FR/DE on both run_modes. Surfaces in benchmark-matrix endpoint. | | |
| PT v6.0 hybrid15 fix | 2026-04-16 | Production (PT) | PT resolution-correctness fix: Plumbs country + resolution cleanly through feature builder + trainer + predictor. PT DA MAE -9.2%, Strategic MAE -21% vs v2.0. Part of the broader Phase 5 sprint. | 21.94 (PT) | 24.67 (PT) |
| Phase 5 / v6.0 / Z3 ablation | 2026-04-17 | Production (current) | Cross-price gating decision: 4-day multi-country feature sprint (Z1 holidays + Z2 gen-forecast + Z4 solar elevation) + Z3 ablation. Cross-country prices improved only DE; ES/FR/PT MAE dropped when Z3 was removed. Production: EPF_CROSS_PRICE_COUNTRIES=DE. ES v12.0-abl DA 13.99 (beats v11.0), PT v6.0-abl DA 21.94, FR v6.0-abl DA 24.52, DE v6.0 DA 27.64. | 13.99–27.64 | 17.35–35.99 |
| Multi-Country v2.0 | 2026-04-11 | ⬇️ Superseded by v6.0 | PT/FR/DE v2.0 promoted: Fixed critical resolution mismatch + ES cross-price zero-fill in v1.0 models. Per-country tuning. PT MAE 8.43 (-80%), FR MAE 4.87 (-79%), DE MAE 4.14 (-80%). 27 total scouts across 3 countries. Cross-prices were applied universally; the Z3 ablation later showed this hurt 3 of 4 countries. | 4.14–8.43 | |
| M0.6 Phase F (Cloud Run cutover) | 2026-04-09 | Production | ES predictions on Cloud Run Jobs: VM cron prediction lines disabled with #CUTOVER_2026_04_09#. Container predictor:v11.0, GCS joblibs, Cloud Scheduler triggers at 10:10/15:10 UTC, VPC connector to VM PG. Parity verified within 0.01 EUR vs VM reference. | | |
| v11.0 | 2026-04-09 | ⬇️ Superseded by v12.0-abl for ES DA; still canonical for ES ST | Post-LSTM correction: ES single XGBoost + residual_1w + pw3x + d365, no LSTM, no ensemble. DA MAE 14.26, Strategic MAE 17.35. | 14.26 | 17.35 |
| v10.2 | 2026-03-22 | ⚠️ Partially Retracted 2026-04-09 | Feature re-selection + residual baseline variants: H1 (LSTM-aware feature selection) retracted because the underlying LSTM was broken. H2 (4-week residual baselines) survives independent validation; the +3-4 EUR bias finding is real and is why v11.0 uses residual_1w. | 16.90 (ta-fs) | |
| v10.1 | 2026-03-22 | Retracted 2026-04-09 | Task-aligned LSTM encoder + ensemble validation: the headline metrics (MAE 15.73, bias −0.65, SpkR 24.1%) were measured on broken code. The LSTM block contributed zero useful signal. v8-res-1w-pw3-d365 (the predecessor) actually dominated v10.1 on every metric except bias. See v11.0 for the corrected production model. | 15.73 (broken) | 18.13 (broken) |
| v10.0 | 2026-03-21 | Retracted 2026-04-09 | LSTM-XGBoost hybrid scout: same two LSTM bugs as v10.1. The 13.12 MAE was attributed to LSTM but was actually attributable to residual_1w (which works without LSTM). The “best structural metrics” were similarly bug-affected. | 13.12 (broken) | |
| v9.0 | 2026-03-21 | Scout | Quantile ensemble: 7 experiments: q10/q50/q90 and q25/q55/q75 with various weight combos. Best MAE 14.48 (narrow spread). Didn’t beat single q=0.55 model (12.69). Averaging compressed predictions can’t break compression ceiling. | 14.48 (qens) | |
| v8.0 | 2026-03-21 | Rejected | PyTorch neural network: 5 experiments: deep residual MLP with BatchNorm, dropout, cosine LR. Best MAE 28.13 (2.2× worse than XGB). Slope 0.39-0.55 (worse). Tabular NNs can’t beat trees on flattened features. | 28.13 (tft) | |
| v7.2 | 2026-03-21 | Scout | Residual target transform: predict deviation from 1w baseline instead of raw EUR. Slope 0.71→0.75 (+5.5%), bias -51%, 80-130 MAE -26%, MaxPred 135→160. Structural thesis confirmed but overall MAE 12.92 (vs v7.0 12.69). | 12.92 (xgb) | |
| v7.1 | 2026-03-20 | Rejected | MLP neural network: sklearn MLPRegressor MAE 37+ (3x worse than XGB). Slope 0.26-0.52 (worse, not better). Confirms compression is model-specific but sklearn MLP not viable. | 37.24 (mlp) | |
| v7.0 | 2026-03-20 | Scout | Compression-breaking experiments: 17 experiments: price weighting, threshold tuning, Monday features. pw-3x-d365-t60 MAE 12.69 (-12.3% vs prod). Spike recall 76%. | 12.69 (xgb) | |
| v6.3 | 2026-03-19 | Validated | 48-experiment deep trees validation: XGB d12+d365 MAE 13.20 (-8.8% vs prod). Compression analysis: slope 0.70, 80-130 EUR range drives 40% of error. | 13.20 (xgb) | |
| v6.2 | 2026-03-18 | Scout | 24 structural scout experiments: deep trees winner | 12.55 (xgb) | |
| v6.1 | 2026-03-18 | Scout | Structural diagnosis: range compression analysis + config scouts | 13.41 (xgb) | |
| v5.1 | 2026-03-17 | Rejected | Winsorize 200 EUR + decay 365d: bias flipped positive +11.92 | 15.34 (ens) | |
| v5.0 | 2026-03-11 | Rejected | Peak/off-peak split: halved training data, DA MAE +10.1% worse | | |
| v5.0b | 2026-03-17 | Rejected | Log-transform target: compressed range, DA MAE +14.5% worse | | |
| v4.3 | 2026-03-05 | Production | Feature selection pipeline | 14.47 (ens) | 19.79 (ens) |
| v4.2 | 2026-03-04 | | +24 crisis features | 14.95 (ens) | 20.36 |
| v4.1 | 2026-03-02 | | Quantile loss + weather interactions | 13.42 (ens) | 20.47 |
| v3.1 | 2026-02-26 | | MAE loss + asymmetric conformal | 14.85 | |
| v3.0 | 2026-02-25 | | Two-product system | | |
| v2.1 | 2026-02-24 | | 15-minute resolution | | |
| v2.0 | 2026-02-17 | | Multi-model ensemble | | |
| v1.1 | 2026-02-14 | | Feature expansion | | |
| v1.0 | 2026-02-12 | | Initial release | | |
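
The Naive Benchmarks entry names three 15-minute baselines but does not define them here. The sketch below covers only the two whose usual meaning is unambiguous, and even those definitions are assumptions: persistence_d1 is taken as "same slot one day earlier" and seasonal_weekly as "same slot seven days earlier". similar_day is left out because the changelog gives no rule for it. The helper names and the flat list layout are hypothetical.

```python
SLOTS_PER_DAY = 96  # 24 hours at 15-minute resolution


def lag_baseline(prices, target_index, lag_slots):
    """Return the price lag_slots before target_index, or None if out of range."""
    source = target_index - lag_slots
    if source < 0 or source >= len(prices):
        return None
    return prices[source]


def persistence_d1(prices, target_index):
    # Assumption: "persistence" means yesterday's price at the same slot.
    return lag_baseline(prices, target_index, SLOTS_PER_DAY)


def seasonal_weekly(prices, target_index):
    # Assumption: "seasonal weekly" means the same slot seven days earlier.
    return lag_baseline(prices, target_index, 7 * SLOTS_PER_DAY)


def mae(actual, predicted):
    """Mean absolute error over the slots where a prediction exists."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if p is not None]
    if not pairs:
        raise ValueError("no overlapping predictions to score")
    return sum(abs(a - p) for a, p in pairs) / len(pairs)
```

Scoring a model and a baseline with the same `mae` helper over the same slots is what makes the comparison meaningful; slots without a baseline value are skipped rather than counted as zero error.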

Current production (post Z3 ablation, 2026-04-17): Per-country single-XGBoost winners with cross-price gating.

  • ES: v12.0 ablation for day-ahead (DA MAE 13.99, beats v11.0 at 14.26) + v11.0 for strategic (17.35). Z3 off.
  • PT: v6.0 ablation for day-ahead (21.94) + v6.0 with Z3 for strategic (24.67). Mixed gating.
  • FR: v6.0 ablation for day-ahead (24.52) + v5.0 for strategic (28.47; v6.0-abl is close at 28.61). Z3 off.
  • DE: v6.0 with Z3 for both horizons (DA 27.64 / ST 35.99). The only country that benefits from cross-country prices.
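
A minimal sketch of how a per-country gate could be read from the environment. Only the variable name EPF_CROSS_PRICE_COUNTRIES and the value DE come from this page; the comma-separated format, the function names, and the feature lists are assumptions made for illustration.

```python
import os


def cross_price_countries(env=None):
    """Parse EPF_CROSS_PRICE_COUNTRIES into a set of upper-case country codes.

    Assumption: multiple countries would be comma-separated. The changelog
    only shows the single value "DE".
    """
    env = os.environ if env is None else env
    raw = env.get("EPF_CROSS_PRICE_COUNTRIES", "")
    return {code.strip().upper() for code in raw.split(",") if code.strip()}


def select_features(country, base_features, cross_price_features, env=None):
    """Append cross-country price features only for gated-in countries."""
    if country.upper() in cross_price_countries(env):
        return list(base_features) + list(cross_price_features)
    return list(base_features)
```

With the value set to DE, `select_features("ES", ...)` returns only the base features while `select_features("DE", ...)` also includes the cross-price columns. The sketch gates per country only; it does not model the per-horizon mix listed for PT.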

Production env: EPF_CROSS_PRICE_COUNTRIES=DE. All models use the shared base recipe (XGBoost depth=12, q=0.55, pw3x above 60 EUR, sample decay 365d) with residual_1w target transform (except FR, which uses raw EUR).
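
The shared recipe can be sketched in a few lines, under stated assumptions. The numbers (depth 12, q=0.55, 3× weight above 60 EUR, 365-day decay) come from this page. The exponential form of the decay, the way the two weights are combined, the XGBoost parameter names chosen for the quantile objective, and every function name are assumptions. The residual baseline is passed in by the caller because the page does not say how the one-week baseline is computed.

```python
import math

# Assumption: quantile objective as exposed by XGBoost >= 2.0. The changelog
# does not say how q=0.55 is actually implemented.
XGB_PARAMS = {
    "max_depth": 12,
    "objective": "reg:quantileerror",
    "quantile_alpha": 0.55,
}


def sample_weight(price_eur, age_days, threshold_eur=60.0,
                  price_factor=3.0, decay_days=365.0):
    """Combine price weighting (pw3x) with time decay (d365).

    Assumption: decay is exponential with a 365-day time constant and the
    two weights multiply.
    """
    price_w = price_factor if price_eur > threshold_eur else 1.0
    decay_w = math.exp(-age_days / decay_days)
    return price_w * decay_w


def to_residual(price_eur, baseline_eur):
    """Training target: deviation from the one-week baseline."""
    return price_eur - baseline_eur


def from_residual(predicted_residual, baseline_eur):
    """Invert the transform to get a price forecast in EUR."""
    return predicted_residual + baseline_eur
```

The inverse transform has to use the same baseline value that produced the training target; otherwise the forecast shifts by the difference between the two baselines.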

v11.0 was the ES-only ancestor of the current stack and is the canonical entry point for understanding the retraction of the v10.x LSTM+XGBoost hybrid: two layered code bugs meant the LSTM block contributed zero useful signal throughout the v10.x line. See the v11.0 page for the retraction narrative and the cross-price gating page for the Z3 ablation result that shaped the current multi-country configuration. The project ran 120+ experiments across v1.0–v6.1. The v10.x retraction is documented as part of the project’s methodology integrity.

Architecture & Data Versions

Versions that changed system architecture, data pipelines, or UI without retraining models.

| Version | Date | Description | Impact |
| --- | --- | --- | --- |
| Cloudflare CDN | 2026-05-06 | Edge termination for epf.productjorge.com | EU TLS RTT ~150 ms → ~20 ms; Brotli + HTTP/3 + DDoS protection at the edge |
| Performance sprint | 2026-05-05 → 2026-05-06 | Lazy loading + batch endpoints | -13 KB gzip critical-path JS; -3 blocking round-trips on Multi-Country page |
| System Page Redesign | 2026-04-23 | Pass 1 + Pass 2A | Country health cards, naive-baseline accuracy context, plain-language labels |
| M0.6 Phase F (Cloud Run cutover) | 2026-04-09 | ES predictions on Cloud Run Jobs | Predictions decoupled from VM cron; resolves D-10 from PRODUCT_SCALE_PLAN.md |
| v3.3 | 2026-03-02 | Dashboard restructuring: Price Drivers promoted to top-level section | Cleaner navigation, 11 components deleted |
| v3.2 | 2026-03-02 | Commodity fix (yfinance) + oil prices + Price Drivers redesign | Fixed broken commodity pipeline, added oil data |
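
The Pipeline Monitoring Invariant entry (2026-04-25) in the version history describes a data-pipeline rule: a run that writes zero rows counts as failed unless the caller opts out with allow_zero_rows=True. The sketch below illustrates that rule only. The flag name comes from the changelog; `RunResult`, `classify_run`, and the status strings are hypothetical and are not the real PipelineRunTracker API.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    pipeline: str
    rows_written: int
    status: str  # "success" or "failed" (hypothetical status labels)


def classify_run(pipeline, rows_written, allow_zero_rows=False):
    """Mark a finished run as failed when it wrote nothing, unless opted out."""
    if rows_written < 0:
        raise ValueError("rows_written cannot be negative")
    if rows_written == 0 and not allow_zero_rows:
        status = "failed"
    else:
        status = "success"
    return RunResult(pipeline, rows_written, status)
```

Defaulting the flag to False is the point of the design: a pipeline that can legitimately produce nothing has to say so explicitly, so an empty intraday run is reported as a failure instead of passing silently.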

Key Transitions

| Transition | Version | What Changed |
| --- | --- | --- |
| Single → Ensemble | v2.0 | Added LightGBM + XGBoost alongside HistGBT |
| Unified → Two-Product | v3.0 | Split into day-ahead (D+1, 10:00 UTC) and strategic (D+2–D+7, 15:00 UTC) |
| MSE → MAE | v3.1 | Switched to absolute error loss; strategic bias -6.94 → -0.30 |
| MAE → Quantile | v4.1 | Quantile loss q=0.55 targets 55th percentile; DA MAE -18.2% |
| Manual → Selected | v4.3 | Two-stage feature selection (correlation + permutation importance) per horizon |
| Shallow → Deep Trees | v6.2 | max_depth 8→12 with regularization; DA MAE -13.3% in scouts |
| Equal → Price-Weighted | v7.0 | High-price samples weighted 3× during training; spike recall 56→76% |
| Raw → Residual Target | v7.2 | Predict deviation from weekly baseline; slope ceiling 0.71→0.75, bias -51% |
| Tabular → LSTM Hybrid | v10.0 | [RETRACTED 2026-04-09] LSTM temporal embeddings were intended to augment XGBoost features. Two layered code bugs meant the LSTM block contributed zero useful signal throughout. See v10.0 retraction. |
| LSTM Hybrid → Single XGBoost | v11.0 | Post-LSTM correction: retracted v10.x and retrained the v8 winner (XGBoost + residual_1w + pw3x + d365) as the canonical production model. On the same window it beats v10.1 on every metric except bias. |
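
The MAE → Quantile transition is easier to read with the loss written out. The sketch below is the standard pinball (quantile) loss, not code taken from the project, and the function names are illustrative. At q=0.5 it equals half the absolute error. At q=0.55 an under-prediction costs 0.55 per EUR and an over-prediction 0.45 per EUR, which pushes the fitted value toward the 55th percentile.

```python
def pinball_loss(actual, predicted, q=0.55):
    """Quantile (pinball) loss for a single observation."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    diff = actual - predicted
    # Under-prediction (diff >= 0) is weighted by q, over-prediction by 1 - q.
    return q * diff if diff >= 0 else (q - 1.0) * diff


def mean_pinball_loss(actuals, predictions, q=0.55):
    """Average pinball loss over paired observations."""
    if len(actuals) != len(predictions) or not actuals:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(pinball_loss(a, p, q) for a, p in zip(actuals, predictions))
    return total / len(actuals)
```

For a 10 EUR miss, the loss is about 5.5 when the model under-predicts and about 4.5 when it over-predicts, so the asymmetry is mild.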

Reading the Changelogs

Each version page documents:

  • What changed: Technical details of the update
  • Why: The problem or limitation being addressed
  • Impact: Measurable accuracy improvements or capability additions
  • Key files: Source code locations of the changes