System Page Redesign — Pass 1 + Pass 2A (Production)

Date: April 23, 2026 | Status: ✅ Production

Why this version exists

The System page had grown organically into a wall of internal-vocabulary status text — pipeline names like (CR) and (inferred), raw status flags, and data-quality bars that could read above 100%. It was useful for the project author but opaque to anyone else. Two passes on the same day rewrote the page around plain-language health summaries while leaving the underlying /api/v1/system endpoint untouched.

Pass 1 (commit 231c1b1) handled layout, language, and ordering. Pass 2A (commit fa1eb77) added an accuracy-context block that compares the production model’s recent MAE against the naive baselines from the Naive Benchmarks work.

What changed in Pass 1

Country health cards at the top

Four cards (one per country) sit above everything else. Each is classified as Healthy, Minor issues, Issues, or Broken based on the underlying pipeline statuses, and any card that is not Healthy carries a one-line context string explaining why. Operators can now see at a glance which country to look at first instead of scanning the full pipeline grid.
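This note does not spell out how pipeline statuses roll up into a card. As a rough sketch only, the roll-up could look like the following. The status names (`failed`, `degraded`, `expected_failure`, `healthy`), the thresholds, and the wording of the context strings are all assumptions made for illustration, not the rules in production.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def classify_country(statuses: Sequence[str]) -> Tuple[str, Optional[str]]:
    """Roll a country's pipeline statuses up into (card label, context string).

    Assumed rules (hypothetical, not taken from the real code):
      - every pipeline failed      -> "Broken"
      - at least one failed        -> "Issues"
      - at least one degraded      -> "Minor issues"
      - otherwise                  -> "Healthy" (expected failures are ignored)
    """
    counts = Counter(statuses)
    total = len(statuses)
    failed = counts["failed"]
    degraded = counts["degraded"]

    if total and failed == total:
        return "Broken", f"all {total} pipelines failing"
    if failed:
        return "Issues", f"{failed} of {total} pipelines failing"
    if degraded:
        return "Minor issues", f"{degraded} of {total} pipelines degraded"
    # Healthy cards carry no context string.
    return "Healthy", None
```

Under these assumed rules, `classify_country(["failed", "healthy", "degraded"])` yields an Issues card with the context "1 of 3 pipelines failing".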

Failures sort to the top

Within each country’s pipeline list, rows are now sorted by severity: failures first, then degradations, then expected failures (yellow badge), then healthy pipelines. The previous order was insertion order, which often left the most important rows below the fold, underneath a run of successes.
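The ordering above amounts to a sort on a severity rank. A minimal sketch, assuming hypothetical status names and a hypothetical `PipelineRow` shape (neither is given in this note):

```python
from dataclasses import dataclass
from typing import List

# Hypothetical status names; lower rank sorts closer to the top.
SEVERITY_RANK = {"failed": 0, "degraded": 1, "expected_failure": 2, "healthy": 3}


@dataclass
class PipelineRow:
    name: str
    status: str


def sort_pipelines(rows: List[PipelineRow]) -> List[PipelineRow]:
    """Order rows as: failures, degradations, expected failures, healthy.

    sorted() is stable, so rows sharing a status keep their insertion order.
    Statuses missing from SEVERITY_RANK sort after everything else.
    """
    return sorted(rows, key=lambda r: SEVERITY_RANK.get(r.status, len(SEVERITY_RANK)))


rows = [
    PipelineRow("ingest", "healthy"),
    PipelineRow("forecast", "failed"),
    PipelineRow("weather", "expected_failure"),
    PipelineRow("features", "degraded"),
    PipelineRow("export", "healthy"),
]
sorted_names = [r.name for r in sort_pipelines(rows)]
# sorted_names == ["forecast", "features", "weather", "ingest", "export"]
```

Relying on a stable sort keeps the old insertion order as the tie-breaker, so rows within one severity group do not jump around between page loads.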

Plain-language labels

Old	New
`(CR)`	`(Cloud Run)`
`(inferred)`	`(from predictions table)`
Raw 0–∞ data-quality percentages	Capped at 100% in the bar visualization
`0.0734` MAE	`0.07 EUR/MWh`
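The label changes in the table are small formatting rules. The sketch below expresses them in Python for illustration; the function names are hypothetical, and the real formatting may well live in the frontend (SystemPage.jsx) rather than in backend code.

```python
# Old suffix -> new plain-language suffix, as listed in the table above.
LABEL_REPLACEMENTS = {
    "(CR)": "(Cloud Run)",
    "(inferred)": "(from predictions table)",
}


def plain_label(name: str) -> str:
    """Replace internal-vocabulary suffixes with their plain-language form."""
    for old, new in LABEL_REPLACEMENTS.items():
        name = name.replace(old, new)
    return name


def bar_fill_percent(raw_percent: float) -> float:
    """Clamp a data-quality percentage to the 0-100 range for the bar width."""
    return max(0.0, min(raw_percent, 100.0))


def format_mae(mae: float, unit: str = "EUR/MWh") -> str:
    """Render MAE with two decimals and its unit."""
    return f"{mae:.2f} {unit}"


print(plain_label("forecast (CR)"))   # forecast (Cloud Run)
print(bar_fill_percent(137.5))        # 100.0
print(format_mae(0.0734))             # 0.07 EUR/MWh
```

Only the bar width is clamped here, which matches the table's wording that the cap applies to the bar visualization.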

Production Model card tightening

The Production Model card now renders MAE with its unit (X.XX EUR/MWh), shows the model name in regular weight rather than monospace, and groups the metrics into a tighter visual block.

What changed in Pass 2A

Accuracy context with naive-baseline comparison

The Production Model card now includes an accuracy-context block that classifies how the production model is doing relative to the naive baselines for the same country and window:

Verdict	Meaning
`beats_all`	Production MAE is better than all three naive baselines on this window
`beats_some`	Production MAE is better than one or two of the naives, worse than the rest
`loses_majority`	Production MAE is worse than two or three of the naives — actionable warning
`insufficient`	Not enough aligned eval days to make a confident call
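The verdict table can be read as a small decision function. The sketch below is one possible reading, not the body of `_build_accuracy_context()`. It makes three assumptions:

- Beating exactly one of three baselines fits both `beats_some` and `loses_majority` as worded in the table; the sketch lets `loses_majority` take precedence.
- A tie with a baseline counts as not beating it.
- The minimum of 7 aligned days is the current alignment threshold mentioned under the open questions, applied here as "fewer than 7 is insufficient".

```python
from typing import Sequence

MIN_ALIGNED_DAYS = 7  # current alignment threshold per the open questions


def classify_verdict(
    production_mae: float,
    naive_maes: Sequence[float],
    aligned_days: int,
    min_aligned_days: int = MIN_ALIGNED_DAYS,
) -> str:
    """Compare production MAE with the naive baselines for one country/window."""
    if aligned_days < min_aligned_days or not naive_maes:
        return "insufficient"

    # Lower MAE is better; a tie is treated as "not beaten" (assumption).
    beaten = sum(1 for naive in naive_maes if production_mae < naive)
    lost = len(naive_maes) - beaten

    if lost == 0:
        return "beats_all"
    # Assumption: losing to a strict majority outranks beats_some.
    if lost * 2 > len(naive_maes):
        return "loses_majority"
    return "beats_some"


print(classify_verdict(0.07, [0.10, 0.12, 0.09], aligned_days=10))  # beats_all
print(classify_verdict(0.10, [0.12, 0.09, 0.11], aligned_days=10))  # beats_some
print(classify_verdict(0.10, [0.12, 0.09, 0.08], aligned_days=10))  # loses_majority
print(classify_verdict(0.07, [0.10, 0.12, 0.09], aligned_days=3))   # insufficient
```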

The block also surfaces a reference MAE (same-window backtest, with a fallback to a recent-tail computation when the same-window backtest isn’t available) and a one-sentence interpretation. There’s a deep link to the Evaluation page for operators who want the full benchmark matrix.

`_build_accuracy_context()` helper

A new helper computes the verdict, reference MAE, and interpretation server-side so the frontend just renders. The new SystemAccuracyContext schema is part of the existing /api/v1/system?country= response — no new endpoint, no new round-trip.
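To make the response shape concrete, here is a rough sketch of what the context object and the reference-MAE fallback might look like. Only the name `SystemAccuracyContext` and the three pieces of content (verdict, reference MAE, interpretation) come from this note. The field names, the `reference_source` field, the helper names, and the use of a plain dataclass instead of whatever schema library src/api/schemas.py uses are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class SystemAccuracyContext:
    # Field names are illustrative guesses, not the real schema.
    verdict: str                    # beats_all / beats_some / loses_majority / insufficient
    reference_mae: Optional[float]  # EUR/MWh, None when no reference is available
    reference_source: str           # hypothetical: which computation produced the reference
    interpretation: str             # one-sentence summary rendered as-is by the frontend


def pick_reference_mae(
    same_window_mae: Optional[float],
    recent_tail_mae: Optional[float],
) -> Tuple[Optional[float], str]:
    """Prefer the same-window backtest; fall back to the recent-tail value."""
    if same_window_mae is not None:
        return same_window_mae, "same_window_backtest"
    if recent_tail_mae is not None:
        return recent_tail_mae, "recent_tail"
    return None, "unavailable"


def build_context(
    verdict: str,
    same_window_mae: Optional[float],
    recent_tail_mae: Optional[float],
    interpretation: str,
) -> SystemAccuracyContext:
    """Assemble the context server-side so the frontend only has to render it."""
    mae, source = pick_reference_mae(same_window_mae, recent_tail_mae)
    return SystemAccuracyContext(verdict, mae, source, interpretation)


example = asdict(
    build_context(
        "beats_all",
        same_window_mae=None,  # same-window backtest missing, so the fallback applies
        recent_tail_mae=0.09,
        interpretation="Production beats all three naive baselines on this window.",
    )
)
print(json.dumps(example, indent=2))
```

In this sketch the context would be nested inside the existing /api/v1/system?country= response, which is consistent with the note's point that no new endpoint or round-trip is needed.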

Open questions (deferred)

Two improvements were considered and intentionally deferred until the team can decide on a direction:

Splitting the Production Model card by run_mode. Aggregate MAE hides the fact that DE and FR are noticeably weaker on D+1 than on D+2..D+7 strategic runs. A split view would surface that, but would double the card’s visual weight.
Raising the alignment threshold from 7 to 14 aligned days. The current 7-day threshold occasionally classifies a country as insufficient when the underlying signal is noisy.

Key files

src/api/routes.py — /api/v1/system?country= and _build_accuracy_context()
src/api/schemas.py — SystemAccuracyContext
frontend/src/pages/SystemPage.jsx — frontend rendering of the cards and context block

Related pages

Naive Benchmarks — supplies the naive references the accuracy context compares against
Evaluation Data Integrity Fix — the lineage cleanup that made these comparisons trustworthy
Pipeline Monitoring Invariant — the failure detection that feeds the country health cards