Evaluation Data Integrity Fix — Country Filter + Window Alignment

Date: April 23, 2026 | Status: ✅ Production

Why this version exists

When the System page redesign started surfacing naive-baseline comparisons, several of the early classifications looked wrong: countries flipped verdict (“beats_all” → “loses_majority”) between page loads, and per-country eval rows did not match the contents of the underlying eval_daily_metrics table. A read-only audit was opened to trace the data lineage end to end before patching anything.

The audit (docs/analysis/EVALUATION_DATA_LINEAGE_AUDIT.md) found three independent lineage issues that all converged on the benchmark-matrix endpoint. The fix landed shortly after on branch fix/evaluation-data-integrity, merged via commit 5b32dd2.

What changed

Issue 1 — Country-filter dropouts in `backfill_actual_prices`

backfill_actual_prices had country-aware paths for non-ES countries (market_hourly WHERE country = :c), but the ES path still ran hardcoded ree_hourly queries and never validated the country argument. When the function was called for a non-ES country while a stale ES default was in effect, it silently fell through to ES rows, contaminating that country’s eval baseline.

Fix: the ES branches now explicitly assert country == 'ES' before reading from ree_hourly, and the dispatcher routes by country deterministically rather than falling back.
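A minimal sketch of the routing idea, not the actual code in src/models/evaluation.py. The function names, the `ts` and `price` columns, and the return shape are hypothetical; only the table names and the `country = :c` filter come from the description above.

```python
# Hypothetical helper names and column names; illustrative only.

def actuals_source(country: str) -> tuple[str, dict[str, str]]:
    """Return (sql, params) for the actual-price lookup of one country."""
    if not country:
        # No implicit default: a missing country is an error, never "ES".
        raise ValueError("country is required")
    code = country.upper()
    if code == "ES":
        return es_actuals_query(code), {}
    # Every non-ES country goes through the country-filtered table.
    return (
        "SELECT ts, price FROM market_hourly WHERE country = :c",
        {"c": code},
    )


def es_actuals_query(country: str) -> str:
    """ES-only branch; refuses to run for any other country."""
    if country != "ES":
        raise ValueError(f"ree_hourly is ES-only, got {country!r}")
    return "SELECT ts, price FROM ree_hourly"
```

The sketch raises ValueError instead of using a bare `assert` because Python strips assert statements under `-O`; whether the real fix uses `assert` or an explicit raise is not stated here.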

Issue 2 — Window misalignment between live and naive benchmarks

The benchmark-matrix endpoint was joining live model rows and naive baseline rows on date alone, without enforcing that both sides covered the same window. Naive baselines have continuous coverage; live models have gaps from re-runs and bug-affected experiment tags. A window with 132 live days and 145 naive days would produce a comparison in which the live MAE was computed over a different, smaller set of days than the naive MAE, so the two numbers were not comparable.

Fix: benchmark-matrix now computes the intersection of available dates across all compared series and reports MAE/bias/etc. over only that shared window. The endpoint surfaces the actual day count in the response so consumers (System page, Evaluation page) can show “n=132” rather than implying the comparison was over the full window.
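The shared-window rule can be sketched as follows. This is an illustration under assumptions, not the endpoint's code: the function name, the input shape (series name mapped to per-day absolute errors), and the output keys are all hypothetical.

```python
from datetime import date, timedelta


def shared_window_mae(series: dict[str, dict[date, float]]) -> dict:
    """Compute MAE per series over only the dates every series covers.

    `series` maps a series name to {eval_date: absolute_error}.
    """
    if not series:
        return {"n_days": 0, "mae": {}}
    # Intersection of the date sets across all compared series.
    shared = set.intersection(*(set(days) for days in series.values()))
    n = len(shared)
    mae = {
        name: (sum(days[d] for d in shared) / n if n else None)
        for name, days in series.items()
    }
    # n_days is returned so a consumer can display "n=132".
    return {"n_days": n, "mae": mae}


# Made-up data: 145 continuous naive days, 132 live days with a 13-day gap.
start = date(2026, 1, 1)
naive = {start + timedelta(days=i): 2.0 for i in range(145)}
live = {start + timedelta(days=i): 1.0 for i in range(145) if not 50 <= i < 63}

result = shared_window_mae({"live": live, "naive": naive})
print(result["n_days"])  # 132
```

Both MAEs are averaged over the same 132 dates, so the 13 days that only the naive series covers no longer influence the comparison.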

Issue 3 — Dedupe gaps

A handful of duplicate eval_daily_metrics rows existed for the same tag/country/date triple, left over from migrations and re-runs. The dedupe path had assumed primary-key uniqueness, so the duplicates leaked into MAE aggregation as double-weighted days.

Fix: explicit dedupe (SELECT DISTINCT + tie-break on inserted_at DESC) before aggregation, plus a unique index on (model_name, country, run_mode, eval_date) to prevent future duplicates.
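The tie-break can be illustrated in plain Python. This is a sketch of the keep-the-latest rule only, assuming rows arrive as dicts; the real dedupe runs in SQL as described above. The key columns mirror the unique index, while `dedupe_latest`, `abs_error`, and the sample values are hypothetical.

```python
from datetime import datetime

# Same columns as the unique index described above.
KEY_FIELDS = ("model_name", "country", "run_mode", "eval_date")


def dedupe_latest(rows: list[dict]) -> list[dict]:
    """Keep one row per key, preferring the most recent inserted_at."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[f] for f in KEY_FIELDS)
        kept = latest.get(key)
        if kept is None or row["inserted_at"] > kept["inserted_at"]:
            latest[key] = row
    return list(latest.values())


# Made-up rows: the first two share a key, the third is a different day.
base = {"model_name": "live_v1", "country": "FR", "run_mode": "live"}
rows = [
    {**base, "eval_date": "2026-03-01", "abs_error": 4.0,
     "inserted_at": datetime(2026, 3, 2, 6, 0)},
    {**base, "eval_date": "2026-03-01", "abs_error": 5.5,
     "inserted_at": datetime(2026, 4, 10, 9, 0)},
    {**base, "eval_date": "2026-03-02", "abs_error": 3.0,
     "inserted_at": datetime(2026, 3, 3, 6, 0)},
]

clean = dedupe_latest(rows)
print(len(clean))  # 2
```

Without this step the duplicated day would contribute two error values to the mean, which is the double-weighting described above.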

Why this matters

Per-country eval rows now align with live-window baselines. The benchmark-matrix consistency invariant is enforced at the API layer rather than relying on every consumer to dedupe and align correctly. The System page accuracy verdicts stopped flipping between page loads after the fix landed.

Key files

src/api/routes.py — /api/v1/evaluation/benchmark-matrix endpoint
src/models/evaluation.py — backfill_actual_prices, dedupe + window-intersection logic
docs/analysis/EVALUATION_DATA_LINEAGE_AUDIT.md — the audit that preceded the fix

Naive Benchmarks — the consumer that revealed these issues
System Page Redesign — the surface that started flipping verdicts before this fix
Pipeline Monitoring Invariant — companion invariant work on the ingestion side