System Page Redesign — Pass 1 + Pass 2A (Production)

Date: April 23, 2026 | Status: ✅ Production

Why this version exists

The System page had grown organically into a wall of status text written in internal vocabulary: pipeline shorthand like (CR) and (inferred), raw status flags, and data-quality bars that could read above 100%. It was useful for the project author but opaque to anyone else. Two passes on the same day rewrote the page around plain-language health summaries while leaving the underlying /api/v1/system endpoint untouched.

Pass 1 (commit 231c1b1) handled layout, language, and ordering. Pass 2A (commit fa1eb77) added an accuracy-context block that compares the production model's recent MAE against the naive baselines from the naive-benchmarks work.

What changed in Pass 1

Country health cards at the top

Four cards (one per country) sit above everything else. Each is classified as Healthy, Minor issues, Issues, or Broken based on the underlying pipeline statuses, and any card that is not Healthy carries a one-line context string explaining why. Operators can now decide which country to look at first in a couple of seconds rather than scanning the full pipeline grid.
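A minimal sketch of how such a card classification could work. The source names only the four labels; the status strings, the function name, and the exact rules below are all assumptions for illustration, not the production logic.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical status vocabulary; the real pipeline statuses may differ.
FAILED, DEGRADED, EXPECTED_FAILURE, OK = "failed", "degraded", "expected_failure", "ok"


def classify_country(statuses: list[str]) -> tuple[str, str]:
    """Return (health label, one-line context) for one country's pipelines.

    Assumed rules: every pipeline failing -> Broken, any failing -> Issues,
    any degraded -> Minor issues, otherwise Healthy.
    """
    counts = Counter(statuses)
    total = len(statuses)
    failed, degraded = counts[FAILED], counts[DEGRADED]

    if total and failed == total:
        return "Broken", f"all {total} pipelines failing"
    if failed:
        return "Issues", f"{failed} of {total} pipelines failing"
    if degraded:
        return "Minor issues", f"{degraded} of {total} pipelines degraded"
    # Assumption: expected failures do not count against a country's health.
    return "Healthy", ""


print(classify_country(["ok", "failed", "ok"]))
# ('Issues', '1 of 3 pipelines failing')
```

The context string is empty for Healthy cards, matching the description that the explanation appears only when something is wrong.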

Failures sort to the top

Within each country's pipeline list, failures are sorted to the top, then degradations, then expected-failures (yellow badge), then healthy. The previous order was insertion order, which meant the most important rows were often pushed out of view below a long run of successful rows.
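The ordering can be sketched as a sort on a severity rank. The status strings, the row shape, and the pipeline names are hypothetical stand-ins; only the order itself (failures, degradations, expected-failures, healthy) comes from the description above.

```python
from __future__ import annotations

# Lower rank sorts first. Status strings are hypothetical stand-ins.
SEVERITY_RANK = {"failed": 0, "degraded": 1, "expected_failure": 2, "ok": 3}


def sort_pipelines(pipelines: list[dict]) -> list[dict]:
    """Order rows with failures first.

    sorted() is stable, so rows with the same status keep their
    original (insertion) order. Unknown statuses sort last.
    """
    return sorted(
        pipelines,
        key=lambda p: SEVERITY_RANK.get(p["status"], len(SEVERITY_RANK)),
    )


rows = [
    {"name": "ingest", "status": "ok"},
    {"name": "forecast", "status": "failed"},
    {"name": "weather", "status": "expected_failure"},
    {"name": "features", "status": "degraded"},
    {"name": "export", "status": "ok"},
]
print([r["name"] for r in sort_pipelines(rows)])
# ['forecast', 'features', 'weather', 'ingest', 'export']
```

Relying on a stable sort means the old insertion order survives as the tie-breaker inside each severity group, so rows do not shuffle between refreshes.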

Plain-language labels

Old | New
(CR) | (Cloud Run)
(inferred) | (from predictions table)
Raw 0–∞ data-quality percentages | Capped at 100% in the bar visualization
0.0734 MAE | 0.07 EUR/MWh
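As a rough illustration, the label changes above amount to three small display helpers. The function names and the example pipeline name are hypothetical; the mappings, the 100% cap, and the two-decimal EUR/MWh format are taken from the table.

```python
# Old shorthand -> plain-language label, as listed in the table.
LABELS = {"(CR)": "(Cloud Run)", "(inferred)": "(from predictions table)"}


def plain_label(name: str) -> str:
    """Replace internal shorthand in a pipeline label."""
    for old, new in LABELS.items():
        name = name.replace(old, new)
    return name


def bar_width_pct(quality_pct: float) -> float:
    """Clamp a raw data-quality percentage to [0, 100] for the bar only."""
    return max(0.0, min(quality_pct, 100.0))


def format_mae(mae: float) -> str:
    """Render MAE with two decimals and its unit."""
    return f"{mae:.2f} EUR/MWh"


print(plain_label("price-ingest (CR)"))  # price-ingest (Cloud Run)
print(bar_width_pct(137.5))              # 100.0
print(format_mae(0.0734))                # 0.07 EUR/MWh
```

Clamping only at render time keeps the raw percentage available in the API response for anyone who needs the uncapped figure.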

Production Model card tightening

The Production Model card now renders MAE with its unit, as X.XX EUR/MWh, shows the model name in regular weight rather than monospace, and groups the metrics into a tighter visual block.

What changed in Pass 2A

Accuracy context with naive-baseline comparison

The Production Model card now includes an accuracy-context block that classifies how the production model is doing relative to the naive baselines for the same country and window:

Verdict | Meaning
beats_all | Production MAE is better than all three naive baselines on this window
beats_some | Production MAE is better than one or two of the naives, worse than the rest
loses_majority | Production MAE is worse than two or three of the naives; actionable warning
insufficient | Not enough aligned eval days to make a confident call
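A minimal sketch of the verdict logic, under stated assumptions. The table's wording overlaps when the production model beats exactly one baseline (that is also "worse than two"); this sketch assumes loses_majority takes precedence in that case. The function name and signature are hypothetical, ties are assumed to count as not beating a baseline, and the default of 7 aligned days is the threshold mentioned under the open questions.

```python
from __future__ import annotations


def classify_verdict(
    production_mae: float,
    naive_maes: list[float],
    aligned_days: int,
    min_aligned_days: int = 7,
) -> str:
    """Classify production MAE against naive-baseline MAEs (lower is better)."""
    if aligned_days < min_aligned_days or not naive_maes:
        return "insufficient"

    # Strict comparison: a tie does not count as beating the baseline (assumption).
    beaten = sum(production_mae < naive for naive in naive_maes)
    lost = len(naive_maes) - beaten

    if lost == 0:
        return "beats_all"
    if lost * 2 > len(naive_maes):
        # Worse than more than half of the baselines (two or three out of three).
        return "loses_majority"
    return "beats_some"


naives = [6.0, 7.0, 8.0]  # illustrative values only, in EUR/MWh
print(classify_verdict(5.0, naives, aligned_days=10))  # beats_all
print(classify_verdict(6.5, naives, aligned_days=10))  # beats_some
print(classify_verdict(7.5, naives, aligned_days=10))  # loses_majority
print(classify_verdict(5.0, naives, aligned_days=3))   # insufficient
```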

The block also surfaces a reference MAE (same-window backtest, with a fallback to a recent-tail computation when the same-window backtest isn’t available) and a one-sentence interpretation. There’s a deep link to the Evaluation page for operators who want the full benchmark matrix.
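The fallback described above could look roughly like this. The source does not say how the recent-tail figure is computed; this sketch assumes it is the MAE over the last few aligned actual/predicted pairs. The function names, the `tail_days` parameter, and the source labels are hypothetical.

```python
from __future__ import annotations

from typing import Optional, Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Plain MAE over two equal-length, non-empty series."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("series must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def reference_mae(
    same_window_backtest_mae: Optional[float],
    recent_actual: Sequence[float],
    recent_predicted: Sequence[float],
    tail_days: int = 7,
) -> tuple[Optional[float], str]:
    """Return (reference MAE, where it came from).

    Prefers the same-window backtest; falls back to a recent-tail MAE
    (assumed here: the last `tail_days` aligned points).
    """
    if same_window_backtest_mae is not None:
        return same_window_backtest_mae, "same_window_backtest"
    if not recent_actual or len(recent_actual) != len(recent_predicted):
        return None, "unavailable"
    tail_actual = recent_actual[-tail_days:]
    tail_predicted = recent_predicted[-tail_days:]
    return mean_absolute_error(tail_actual, tail_predicted), "recent_tail"


print(reference_mae(4.2, [], []))
# (4.2, 'same_window_backtest')
print(reference_mae(None, [10.0, 20.0, 30.0], [12.0, 17.0, 30.0], tail_days=2))
# (1.5, 'recent_tail')
```

Returning the source alongside the number lets the UI say which kind of reference it is showing, which matters because the two figures are not computed over the same window.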

_build_accuracy_context() helper

A new helper computes the verdict, reference MAE, and interpretation server-side so the frontend just renders. The new SystemAccuracyContext schema is part of the existing /api/v1/system?country= response — no new endpoint, no new round-trip.
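For orientation, a sketch of what the schema and its place in the response might look like. Only the class name and the three pieces of content (verdict, reference MAE, interpretation) come from the text; the field names, the `accuracy_context` key, the `attach_accuracy_context` helper, and the example payload are assumptions, and the real schema may well differ.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SystemAccuracyContext:
    """Assumed shape; field names are illustrative, not the real schema."""

    verdict: str                    # beats_all | beats_some | loses_majority | insufficient
    reference_mae: Optional[float]  # EUR/MWh, None when no reference is available
    interpretation: str             # one-sentence, plain-language summary


def attach_accuracy_context(system_payload: dict, context: SystemAccuracyContext) -> dict:
    """Return a copy of the existing system response with the block nested in it."""
    return {**system_payload, "accuracy_context": asdict(context)}


# Illustrative payload and values only.
payload = {"country": "DE", "pipelines": []}
context = SystemAccuracyContext(
    verdict="beats_some",
    reference_mae=6.5,
    interpretation="Production beats two of three naive baselines on this window.",
)
print(json.dumps(attach_accuracy_context(payload, context), indent=2))
```

Nesting the block inside the existing response is what keeps the page to a single request per country, as the text describes.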

Open questions (deferred)

Two improvements were considered and intentionally deferred until the team can decide on a direction:

  • Splitting the Production Model card by run_mode. Aggregate MAE hides the fact that DE and FR are noticeably weaker on D+1 than on D+2..D+7 strategic. A split view would surface that, but doubles the card’s visual weight.
  • Raising the alignment threshold from 7 to 14 aligned days. The current 7-day threshold occasionally produces a confident verdict for a country when the underlying signal is too noisy to support one; a higher threshold would classify more of those cases as insufficient.

Key files