M0.6 Phase F — ES Predictions on Cloud Run Jobs (Production)

Date: April 9, 2026 | Status: ✅ Production (live since 2026-04-09)

Why this version exists

M0.6 Phase F is the deployment half of the v11.0 correction. Phase C produced a working v11.0 model locally as joblibs and a backtest tag; Phase F gets those joblibs running in production by moving ES day-ahead and strategic predictions off the VM cron and onto Cloud Run Jobs. Both phases shipped on the same calendar day (2026-04-09).

The cutover also resolves the long-standing D-10 item from PRODUCT_SCALE_PLAN.md: predictions had been the single VM workload that couldn’t horizontally scale or auto-recover, and any VM reboot or cron drift directly impacted forecast freshness. Cloud Run Jobs eliminate that coupling.

What changed

Cloud Run Jobs replace the VM prediction crons

Two new Cloud Run Jobs run ES predictions:

epf-predict-es-dayahead — triggered daily at 10:10 UTC (cron: 10 10 * * *)
epf-predict-es-strategic — triggered daily at 15:10 UTC (cron: 10 15 * * *)
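
The two schedules above are five-field cron expressions (minute, hour, day of month, month, day of week), so the first field is the minute and the second is the hour. A minimal sketch for reading off the daily trigger time; the helper name daily_trigger_time is hypothetical and it only handles this fixed-minute, fixed-hour shape:

```python
from datetime import time


def daily_trigger_time(expr: str) -> time:
    """Return the fixed daily time encoded by a simple five-field cron expression.

    Only handles the shape used above: numeric minute and hour, with
    wildcards for day of month, month, and day of week.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {expr!r}")
    minute, hour, dom, month, dow = fields
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError(f"not a plain daily schedule: {expr!r}")
    return time(hour=int(hour), minute=int(minute))


SCHEDULES = {
    "epf-predict-es-dayahead": "10 10 * * *",
    "epf-predict-es-strategic": "10 15 * * *",
}

for job, expr in SCHEDULES.items():
    print(f"{job}: {daily_trigger_time(expr).strftime('%H:%M')} UTC")
```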

Both run the same container image (us-east1-docker.pkg.dev/epriceforecaster/epf/predictor:v11.0), a torch-free Python 3.12 image. The container cold-starts in ~6 seconds, downloads v11.0 joblibs from gs://epf-models-epriceforecaster/ES/v11.0/, runs the prediction, and writes rows directly to the VM PostgreSQL instance.
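
The run sequence described above (fetch artifacts, predict, write) can be sketched roughly as follows. Only the bucket URI comes from the text; parse_gcs_uri, run_prediction_job, and the three injected callables are hypothetical placeholders and do not reflect the actual contents of cloud_run_predict.py. A real job would presumably back them with a GCS client, joblib loading, and a PostgreSQL driver.

```python
from typing import Callable, Iterable, Sequence
from urllib.parse import urlparse

MODEL_URI = "gs://epf-models-epriceforecaster/ES/v11.0/"  # from the text above


def parse_gcs_uri(uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket, object prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def run_prediction_job(
    download_models: Callable[[str, str], Sequence[str]],
    predict: Callable[[Sequence[str]], Iterable[dict]],
    write_rows: Callable[[Iterable[dict]], int],
    model_uri: str = MODEL_URI,
) -> int:
    """Run the three steps in order and return the number of rows written.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    bucket, prefix = parse_gcs_uri(model_uri)
    model_paths = download_models(bucket, prefix)  # step 1: fetch the joblibs
    rows = predict(model_paths)                    # step 2: run the models
    return write_rows(rows)                        # step 3: persist the rows
```

Passing the steps in as callables keeps the ordering testable without network or database access.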

The VM cron prediction lines (predict-dayahead and predict-strategic) were not deleted — they were prefixed with #CUTOVER_2026_04_09# so the original schedule is recoverable in one revert. The non-prediction VM crons (data refresh, intraday, news, backups) are unchanged.
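
As an illustration of why the prefix approach is reversible, here is a minimal sketch that comments out matching crontab lines and restores them. The marker string and the two target names come from the text; the function names and the sample crontab lines in the usage note are hypothetical, since the real VM schedule is not shown here.

```python
CUTOVER_MARKER = "#CUTOVER_2026_04_09#"  # prefix named in the text above
PREDICTION_TARGETS = ("predict-dayahead", "predict-strategic")


def disable_prediction_lines(crontab: str) -> str:
    """Prefix active prediction cron lines with the cutover marker."""
    out = []
    for line in crontab.splitlines():
        active = bool(line.strip()) and not line.lstrip().startswith("#")
        if active and any(target in line for target in PREDICTION_TARGETS):
            line = CUTOVER_MARKER + line
        out.append(line)
    return "\n".join(out)


def restore_prediction_lines(crontab: str) -> str:
    """Undo the cutover by stripping the marker, restoring the original lines."""
    return "\n".join(
        line[len(CUTOVER_MARKER):] if line.startswith(CUTOVER_MARKER) else line
        for line in crontab.splitlines()
    )
```

Given a hypothetical crontab such as "0 10 * * * /opt/epf/run.sh predict-dayahead", disabling and then restoring returns the original text unchanged, while lines for other workloads are never touched.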

Networking and secrets

The Cloud Run Jobs reach the VM PostgreSQL instance via the epf-connector Serverless VPC Access connector (CIDR 10.8.0.0/28). On the VM side, three pieces moved into place:

A new firewall rule epf-pg-from-vpc-connector plus the epf-pg network tag on the VM open port 5432 to traffic from the connector subnet.
PostgreSQL 14 was reconfigured: listen_addresses = 'localhost,10.142.0.2' and a new pg_hba.conf line host epf epf 10.8.0.0/28 md5. The DB was restarted cleanly with no row loss.
DB credentials and the API admin key live in Secret Manager (epf-database-url, epf-admin-key) and are mounted into the Cloud Run job environment, not baked into the image.
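
To make the address ranges concrete, the following sketch uses the standard-library ipaddress module to check which source addresses the pg_hba.conf line would match. The CIDR and the listen address are taken from the text; treating 10.142.0.2 as the VM's internal address is an assumption, and allowed_by_hba is a hypothetical helper name.

```python
import ipaddress

CONNECTOR_CIDR = ipaddress.ip_network("10.8.0.0/28")  # connector range from the text
PG_LISTEN_ADDR = ipaddress.ip_address("10.142.0.2")   # assumed to be the VM's internal address


def allowed_by_hba(source_ip: str) -> bool:
    """True if source_ip falls inside the connector range named in pg_hba.conf."""
    return ipaddress.ip_address(source_ip) in CONNECTOR_CIDR


print(CONNECTOR_CIDR.num_addresses)      # size of the /28 block
print(allowed_by_hba("10.8.0.5"))        # an address inside the connector range
print(allowed_by_hba("10.8.0.16"))       # first address past the end of the range
print(PG_LISTEN_ADDR in CONNECTOR_CIDR)  # the listen address sits on a different subnet
```

A /28 covers 16 addresses (10.8.0.0 through 10.8.0.15), so the pg_hba.conf entry admits only traffic originating from the connector, which is the same scope the firewall rule targets.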

Cache purge crons keep nginx and the API in sync

Because the Cloud Run jobs don’t run on the VM, they can’t run sudo systemctl reload nginx there themselves. Two new VM cron lines run a few minutes after each prediction trigger:

nginx purge at 10:13 UTC and 15:13 UTC
API in-memory forecast-cache clear at 10:14 UTC and 15:14 UTC (via /opt/epf/scripts/api_cache_clear.sh, which calls /api/v1/forecast/cache-clear with the admin key)

Without these, nginx serves stale predictions for up to its TTL even after the new rows land in PostgreSQL.
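
The API cache-clear call could look roughly like the sketch below. Only the endpoint path comes from the text. The base URL, the POST method, and the X-Admin-Key header name are assumptions made for illustration; the actual api_cache_clear.sh script may differ on all three.

```python
import urllib.request

API_BASE = "http://localhost:8000"  # hypothetical; the text gives no host or port
CACHE_CLEAR_PATH = "/api/v1/forecast/cache-clear"  # endpoint named in the text


def build_cache_clear_request(admin_key: str, base_url: str = API_BASE) -> urllib.request.Request:
    """Build (but do not send) the cache-clear request.

    The POST method and the X-Admin-Key header name are assumptions.
    """
    return urllib.request.Request(
        base_url + CACHE_CLEAR_PATH,
        method="POST",
        headers={"X-Admin-Key": admin_key},
    )


def clear_forecast_cache(admin_key: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_cache_clear_request(admin_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Separating request construction from sending lets the URL and header be checked without a running API; a cron wrapper would call clear_forecast_cache with the key read from its environment.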

Parity verification

The originally planned 5-day shadow window was compressed to a same-day per-row diff between two reference paths:

Reference A — cloud_run_predict.py invoked directly on the VM with the v11.0 joblibs, writing rows under xgboost_hybrid15_vm_ref
Reference B — the Cloud Run job container writing rows under xgboost_hybrid15_shadow

Results on the cutover-day comparison window:

Horizon group	Pairs	Mean |diff|	Max |diff|	Notes
Dayahead	96	0.0007 EUR/MWh	0.01 EUR/MWh	Single-row float-precision outlier
Strategic	full set	small, bounded	up to ~1.6 EUR/MWh (D+2 / D+3)	Reproduced across two consecutive same-env Cloud Run runs, confirming live-data churn between fetches, not VM-vs-container drift
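
The per-row comparison behind the table can be sketched as below: pair rows from the two reference paths by target timestamp, then take the mean and maximum absolute difference. The function name parity_stats and the sample values are illustrative only and are not the production rows.

```python
from statistics import mean


def parity_stats(ref_a: dict[str, float], ref_b: dict[str, float]) -> dict:
    """Compare two prediction sets keyed by target timestamp.

    Only timestamps present in both sets are paired.
    """
    shared = sorted(ref_a.keys() & ref_b.keys())
    if not shared:
        raise ValueError("no overlapping target timestamps to compare")
    diffs = [abs(ref_a[ts] - ref_b[ts]) for ts in shared]
    return {
        "pairs": len(diffs),
        "mean_abs_diff": mean(diffs),
        "max_abs_diff": max(diffs),
    }


# Illustrative values only (EUR/MWh), not taken from the cutover-day window.
vm_ref = {"2026-04-10T00:00Z": 50.00, "2026-04-10T01:00Z": 48.50, "2026-04-10T02:00Z": 47.25}
shadow = {"2026-04-10T00:00Z": 50.00, "2026-04-10T01:00Z": 48.50, "2026-04-10T02:00Z": 47.50}
print(parity_stats(vm_ref, shadow))
```

Reporting the pair count alongside the two statistics guards against a silently shrinking overlap being mistaken for good agreement.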

Idempotency was confirmed by two back-to-back Cloud Run executions completing without primary-key conflicts. The 5-day shadow was retired with the same confidence the longer window would have given.
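
The text does not say how the job avoids primary-key conflicts on re-run. One common way to get that property is an upsert keyed on the primary key, sketched here with sqlite3 standing in for PostgreSQL (both accept this ON CONFLICT ... DO UPDATE form). The table name, column names, and key columns are hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per (model, target timestamp).
UPSERT = """
INSERT INTO predictions (model, target_ts, price_eur_mwh)
VALUES (?, ?, ?)
ON CONFLICT (model, target_ts) DO UPDATE SET price_eur_mwh = excluded.price_eur_mwh
"""


def write_predictions(conn: sqlite3.Connection, rows: list[tuple[str, str, float]]) -> None:
    """Insert rows, overwriting any existing row with the same primary key."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions ("
    " model TEXT, target_ts TEXT, price_eur_mwh REAL,"
    " PRIMARY KEY (model, target_ts))"
)
rows = [
    ("xgboost_hybrid15_shadow", "2026-04-10T00:00Z", 50.0),
    ("xgboost_hybrid15_shadow", "2026-04-10T01:00Z", 48.5),
]
write_predictions(conn, rows)
write_predictions(conn, rows)  # second run: no IntegrityError, no duplicate rows
row_count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
print(row_count)
```

With this pattern a retried or repeated execution rewrites the same keys instead of failing, which matches the back-to-back behavior described above.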

Key files

src/scripts/cloud_run_predict.py — the entrypoint script the Cloud Run container runs
Dockerfile.predictor — torch-free Python 3.12 image definition
docs/operations/DEPLOYMENT_CHECKLIST.md — operator-facing deployment runbook including manual trigger commands
src/api/routes.py — /api/v1/production-state endpoint surfaces the live state for verification

Related

v11.0 — Post-LSTM Correction — the model that Phase F deploys
Multi-Country v2.0 — built on top of the v11.0 deployment infrastructure
Phase 5 / v6.0 / Z3 Cross-Price Ablation — current production state