Naive Benchmarks — Per-Country D+1 + Strategic Baselines

Date: April 18-22, 2026 | Status: ✅ Production (eval-only; no model promotion)

Why this version exists

A working naive baseline tells you whether your model is actually adding value over the cheapest possible reference. Through the Phase 5 sprint, the project’s MAE numbers were compared against other models but never systematically against “what if we did nothing?”. The Phase 5 inversion (DE: worst MAE at 27.70 but best economic capture at 67.3%; PT: good MAE at 24.80 but worst arbitrage capture at 25.1%) made it concrete that MAE-only comparisons can hide model failures, and that without a naive floor we couldn’t distinguish “PT MAE is good” from “PT prices are easy and we’re barely beating yesterday’s.”

This entry adds three 15-min naive baselines for every country and every run mode, makes them queryable from the Evaluation API, and surfaces them in the System and Evaluation dashboards.

What changed

Three baselines, four countries, both run modes

Baseline	Definition
`naive_persistence_d1`	Target day uses `prediction_date`’s actuals, quarter-hour for quarter-hour. The simplest possible “tomorrow looks like today” baseline.
`naive_seasonal_weekly`	Target day T uses actuals from `T − 7 days`. Captures the strong weekly seasonality of European DA prices without any model.
`naive_similar_day`	Target day T uses the mean of the last 4 actual days that share T’s weekday and fall on or before `prediction_date`. A small step up from `naive_seasonal_weekly`: averaging four weeks smooths out single-day noise.
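The three definitions above can be sketched in a few lines of pandas. This is a minimal illustration, not the code in src/models/evaluation.py: the function signatures, the `_day_values` helper, and the input shape (a `pd.Series` of actual prices on a timezone-naive 15-min `DatetimeIndex` with exactly 96 slots per day) are all assumptions made for the example. Missing quarter-hours propagate as NaN here; how the real implementation handles gaps is not stated in this entry.

```python
import numpy as np
import pandas as pd

SLOTS_PER_DAY = 96  # 24 hours x 4 quarter-hours (assumes no DST-shortened days)


def _day_values(actuals: pd.Series, day: pd.Timestamp) -> np.ndarray:
    """Hypothetical helper: the 96 quarter-hour actuals of one calendar day."""
    idx = pd.date_range(day.normalize(), periods=SLOTS_PER_DAY, freq="15min")
    return actuals.reindex(idx).to_numpy(dtype=float)


def naive_persistence_d1(actuals, prediction_date, target_day):
    # Target day copies prediction_date's actuals, slot for slot.
    return _day_values(actuals, prediction_date)


def naive_seasonal_weekly(actuals, prediction_date, target_day):
    # Target day T copies the actuals from T - 7 days.
    return _day_values(actuals, target_day - pd.Timedelta(days=7))


def naive_similar_day(actuals, prediction_date, target_day, n_weeks=4):
    # Walk back in 7-day steps from T, keeping only days that fall on or
    # before prediction_date, until n_weeks same-weekday days are collected.
    days = []
    candidate = target_day - pd.Timedelta(days=7)
    while len(days) < n_weeks:
        if candidate <= prediction_date:
            days.append(candidate)
        candidate -= pd.Timedelta(days=7)
    stacked = np.vstack([_day_values(actuals, d) for d in days])
    return stacked.mean(axis=0)  # slot-wise mean across the selected days
```

Each function returns a 96-element array for one target day, so the same code serves both run modes: only `target_day` changes between D+1 and D+2 to D+7.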

All three run at 15-min resolution for every country (ES, PT, FR, DE) and produce separate rows per run_mode (dayahead = D+1; strategic = D+2 to D+7). That matrix is 3 × 4 × 2 = 24 baseline series, each backfilled across the full evaluation window.
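To make the 3 × 4 × 2 count concrete, the series matrix can be enumerated directly. The constant names and the tuple key layout below are illustrative assumptions; only the baseline names, country codes, run modes, and horizons come from this entry.

```python
from itertools import product

BASELINES = ("naive_persistence_d1", "naive_seasonal_weekly", "naive_similar_day")
COUNTRIES = ("ES", "PT", "FR", "DE")
# Horizons in days ahead of prediction_date, per run mode.
RUN_MODE_HORIZONS = {
    "dayahead": (1,),                  # D+1
    "strategic": tuple(range(2, 8)),   # D+2 to D+7
}

# One key per baseline series; iterating the dict yields the run mode names.
series_keys = list(product(BASELINES, COUNTRIES, RUN_MODE_HORIZONS))

print(len(series_keys))  # 24
```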

Legacy name consolidation

Pre-existing naive rows used inconsistent names: naive_persistence, naive_weekly, naive_persistence_15min, naive_weekly_15min. A one-shot script, scripts/migrate_naive_rename.py, renamed the 15-min rows into the new tier and dropped the hourly-only rows. The new names are the only ones the Evaluation API exposes.
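The consolidation logic might look roughly like the sketch below. It is not the contents of scripts/migrate_naive_rename.py: the exact legacy-to-new mapping is inferred from the names (this entry does not spell it out), and the row shape (dicts with a `model_name` key) and the `consolidate` function are hypothetical.

```python
# Assumed mapping: 15-min legacy rows move to the new tier.
RENAMES = {
    "naive_persistence_15min": "naive_persistence_d1",
    "naive_weekly_15min": "naive_seasonal_weekly",
}
# Hourly-only legacy rows are dropped rather than renamed.
DROPPED = {"naive_persistence", "naive_weekly"}


def consolidate(rows):
    """Return rows with legacy naive names renamed or removed."""
    out = []
    for row in rows:
        name = row["model_name"]
        if name in DROPPED:
            continue  # hourly-only row: no 15-min equivalent to keep
        # Copy the row so the input list is left unmodified.
        out.append({**row, "model_name": RENAMES.get(name, name)})
    return out
```

Rows whose name is in neither table pass through unchanged, so running the function a second time has no further effect.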

API surface

The benchmark-matrix endpoint on the Evaluation API surfaces all three baselines alongside production model rows for the same country/window/run_mode. The dashboard’s accuracy panel can now classify each model’s performance as beats_all, beats_some, loses_majority, or insufficient against the naive references.
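One way the four labels could be derived from a benchmark-matrix row is sketched below. The label names come from this entry, but the decision rules are assumptions: the thresholds, the `min_samples` cutoff, the use of MAE as the comparison metric, and the `classify_vs_naive` function itself are illustrative, not the dashboard's actual logic.

```python
from typing import Mapping, Optional


def classify_vs_naive(
    model_mae: Optional[float],
    naive_maes: Mapping[str, Optional[float]],
    n_samples: int,
    min_samples: int = 96,  # assumed cutoff: one full day of 15-min slots
) -> str:
    """Label a model's MAE against the available naive baselines."""
    usable = [v for v in naive_maes.values() if v is not None]
    if model_mae is None or not usable or n_samples < min_samples:
        return "insufficient"
    # A "win" means a strictly lower MAE than that baseline.
    wins = sum(model_mae < v for v in usable)
    if wins == len(usable):
        return "beats_all"
    if wins * 2 < len(usable):  # loses to more than half of the baselines
        return "loses_majority"
    return "beats_some"
```

With three baselines, these rules yield `beats_all` for three wins, `beats_some` for two, and `loses_majority` for one or none.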

Why this matters

The Phase 5 inversion is the load-bearing example. Once the naive baselines were available:

DE’s 27.70 MAE, nominally the worst of the four, turned out to handily beat the DE naive baselines on both MAE and economic capture. The model is doing real work; DE prices are just genuinely hard.
PT’s 24.80 MAE, nominally competitive, was much closer to the PT naive baselines, and its arbitrage capture was worse than that of `naive_seasonal_weekly`. That warning sign surfaced through this work and shaped the PT v6.0 hybrid15 fix.

Without the naive layer, those two countries looked qualitatively similar from MAE alone. With the naive layer, they diverge sharply, and that divergence tells us where to invest engineering effort.

Key files

src/models/evaluation.py — naive baseline computation + benchmark-matrix endpoint
scripts/migrate_naive_rename.py — one-shot consolidation of legacy naive row names
src/api/routes.py — /api/v1/evaluation/benchmark-matrix

Multi-Country v2.0 — the country lineup these baselines cover
Phase 5 / v6.0 / Z3 Cross-Price Ablation — the sprint whose findings motivated the naive layer
System Page Redesign — surfaces naive comparisons in the production model accuracy card