M0.6 Phase F — ES Predictions on Cloud Run Jobs (Production)
Date: April 9, 2026 | Status: ✅ Production (live since 2026-04-09)
Why this version exists
M0.6 Phase F is the deployment half of the v11.0 correction. Phase C produced a working v11.0 model locally as joblibs and a backtest tag; Phase F gets those joblibs running in production by moving ES day-ahead and strategic predictions off the VM cron and onto Cloud Run Jobs. Both phases shipped on the same calendar day (2026-04-09).
The cutover also resolves the long-standing D-10 item from PRODUCT_SCALE_PLAN.md: predictions had been the single VM workload that couldn’t horizontally scale or auto-recover, and any VM reboot or cron drift directly impacted forecast freshness. Cloud Run Jobs eliminate that coupling.
What changed
Cloud Run Jobs replace the VM prediction crons
Two new Cloud Run Jobs run ES predictions:
- `epf-predict-es-dayahead`: triggered daily at `10 10 * * *` UTC (10:10 UTC)
- `epf-predict-es-strategic`: triggered daily at `10 15 * * *` UTC (15:10 UTC)
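Cron fields are ordered minute first, then hour, so the second schedule fires at 15:10 UTC rather than 10:15. A minimal sketch of that reading, using only the standard library. The helper name `daily_trigger_time` is hypothetical and it only handles fixed daily schedules like the two above:

```python
from datetime import time, timezone


def daily_trigger_time(cron_expr: str) -> time:
    """Return the UTC wall-clock time of a plain daily cron schedule.

    Hypothetical helper: only supports expressions of the form 'M H * * *'.
    """
    minute, hour, day_of_month, month, day_of_week = cron_expr.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("only plain daily schedules are supported")
    return time(hour=int(hour), minute=int(minute), tzinfo=timezone.utc)


dayahead = daily_trigger_time("10 10 * * *")
strategic = daily_trigger_time("10 15 * * *")
print(dayahead.strftime("%H:%M"), strategic.strftime("%H:%M"))
```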
Both run the same container image (us-east1-docker.pkg.dev/epriceforecaster/epf/predictor:v11.0), a torch-free Python 3.12 image. The container cold-starts in ~6 seconds, downloads v11.0 joblibs from gs://epf-models-epriceforecaster/ES/v11.0/, runs the prediction, and writes rows directly to the VM PostgreSQL instance.
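The model-download step described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `cloud_run_predict.py`: the function names are hypothetical, the object names under the prefix are not known from this document (so the sketch simply lists everything under it), and it assumes the `google-cloud-storage` client library is installed in the image.

```python
from pathlib import Path
from urllib.parse import urlparse

MODEL_URI = "gs://epf-models-epriceforecaster/ES/v11.0/"  # from the text above


def split_gs_uri(uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket, object prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_models(uri: str, dest: Path) -> list[Path]:
    """Download every object under the prefix into dest (hypothetical helper)."""
    # Imported lazily so the URI helper above works without the GCS client.
    from google.cloud import storage

    bucket_name, prefix = split_gs_uri(uri)
    dest.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for blob in storage.Client().list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip folder placeholder objects
        target = dest / Path(blob.name).name
        blob.download_to_filename(str(target))
        downloaded.append(target)
    return downloaded


print(split_gs_uri(MODEL_URI))
```

Each downloaded file could then be passed to `joblib.load` before running the prediction.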
The VM cron prediction lines (predict-dayahead and predict-strategic) were not deleted — they were prefixed with #CUTOVER_2026_04_09# so the original schedule is recoverable in one revert. The non-prediction VM crons (data refresh, intraday, news, backups) are unchanged.
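Because the disabled lines carry a fixed prefix, a rollback amounts to stripping that prefix. A minimal sketch of the idea, where the function name and the sample crontab entries (schedules and commands) are made up for illustration and do not reflect the real VM crontab:

```python
CUTOVER_MARKER = "#CUTOVER_2026_04_09#"  # prefix named in the text above


def restore_cutover_lines(crontab_text: str) -> str:
    """Re-enable cron lines that were disabled with the cutover marker."""
    restored = [
        line.removeprefix(CUTOVER_MARKER) for line in crontab_text.splitlines()
    ]
    return "\n".join(restored) + "\n"


# Illustrative crontab only: the schedules and commands are hypothetical.
sample = (
    "#CUTOVER_2026_04_09#0 10 * * * run-predict-dayahead\n"
    "#CUTOVER_2026_04_09#0 15 * * * run-predict-strategic\n"
    "*/30 * * * * run-data-refresh\n"
)
print(restore_cutover_lines(sample), end="")
```

Lines without the marker, including ordinary `#` comments, pass through unchanged.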
Networking and secrets
The Cloud Run Jobs reach the VM PostgreSQL instance via the epf-connector Serverless VPC Access connector (CIDR 10.8.0.0/28). On the VM side, three pieces moved into place:
- A new firewall rule `epf-pg-from-vpc-connector` plus the `epf-pg` network tag on the VM open port 5432 to traffic from the connector subnet.
- PostgreSQL 14 was reconfigured: `listen_addresses = 'localhost,10.142.0.2'` and a new `pg_hba.conf` line `host epf epf 10.8.0.0/28 md5`. The DB was restarted cleanly with no row loss.
- DB credentials and the API admin key live in Secret Manager (`epf-database-url`, `epf-admin-key`) and are mounted into the Cloud Run job environment, not baked into the image.
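The connector range is small: a /28 covers 16 addresses, 10.8.0.0 through 10.8.0.15. The sketch below uses the standard-library `ipaddress` module to show which client addresses fall inside the range named by the firewall rule and the new `pg_hba.conf` line. The function name is hypothetical, and the check covers only that one rule, not any other entries the real `pg_hba.conf` may contain.

```python
import ipaddress

CONNECTOR_SUBNET = ipaddress.ip_network("10.8.0.0/28")  # CIDR from the text above


def matches_connector_rule(client_ip: str) -> bool:
    """True if client_ip falls inside the Serverless VPC Access connector range."""
    return ipaddress.ip_address(client_ip) in CONNECTOR_SUBNET


print(CONNECTOR_SUBNET.num_addresses)        # 16
print(matches_connector_rule("10.8.0.5"))    # True: inside the connector range
print(matches_connector_rule("10.8.0.16"))   # False: first address past the /28
print(matches_connector_rule("10.142.0.2"))  # False: the VM's own listen address
```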
Cache purge crons keep nginx and the API in sync
Because the Cloud Run jobs don’t run on the VM, they can’t sudo systemctl reload nginx directly. Two new VM cron lines run a few minutes after each prediction trigger:
- nginx purge at 10:13 UTC and 15:13 UTC
- API in-memory forecast-cache clear at 10:14 UTC and 15:14 UTC (via `/opt/epf/scripts/api_cache_clear.sh`, which calls `/api/v1/forecast/cache-clear` with the admin key)
Without these, nginx serves stale predictions for up to its TTL even after the new rows land in PostgreSQL.
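For illustration, the cache-clear call could look like the Python sketch below. The production script is a shell script whose contents are not shown here, so several details are assumptions: the HTTP method (`POST`), the header name (`X-Admin-Key`), and the base URL (`http://localhost:8000`) are all hypothetical. Only the endpoint path comes from the text above.

```python
import urllib.request

CACHE_CLEAR_PATH = "/api/v1/forecast/cache-clear"  # endpoint path from the text


def build_cache_clear_request(base_url: str, admin_key: str) -> urllib.request.Request:
    """Build the cache-clear request without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + CACHE_CLEAR_PATH,
        method="POST",                       # assumed HTTP method
        headers={"X-Admin-Key": admin_key},  # assumed header name
    )


def clear_forecast_cache(base_url: str, admin_key: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_cache_clear_request(base_url, admin_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Building the request is side-effect free; nothing is sent here.
example = build_cache_clear_request("http://localhost:8000/", "example-key")
print(example.get_method(), example.full_url)
```

In a cron context the admin key would be read from the environment or a root-only file rather than passed on the command line.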
Parity verification
The originally-planned 5-day shadow window was compressed to a same-day per-row diff between two reference paths:
- Reference A: `cloud_run_predict.py` invoked directly on the VM with the v11.0 joblibs, writing rows under `xgboost_hybrid15_vm_ref`
- Reference B: the Cloud Run job container writing rows under `xgboost_hybrid15_shadow`
Results on the cutover-day comparison window:
| Horizon group | Pairs | Mean abs. diff | Max abs. diff | Notes |
|---|---|---|---|---|
| Dayahead | 96 | 0.0007 EUR/MWh | 0.01 EUR/MWh | Single-row float-precision outlier |
| Strategic | full set | small, bounded | up to ~1.6 EUR (D+2 / D+3) | Reproduced across two consecutive same-env Cloud Run runs — confirms live-data churn between fetches, not VM-vs-container drift |
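
The pair count, mean absolute difference, and max absolute difference in the table are the kind of figures a per-row comparison yields. A minimal sketch, assuming each reference path is loaded into a mapping from a row key (for example target timestamp plus horizon) to predicted price. The function name, key format, and sample values are hypothetical and are not the cutover-day data:

```python
from statistics import fmean


def diff_stats(ref_a: dict[str, float], ref_b: dict[str, float]) -> tuple[int, float, float]:
    """Return (pairs, mean |diff|, max |diff|) over keys present in both inputs."""
    shared_keys = sorted(ref_a.keys() & ref_b.keys())
    if not shared_keys:
        raise ValueError("no overlapping rows to compare")
    diffs = [abs(ref_a[key] - ref_b[key]) for key in shared_keys]
    return len(diffs), fmean(diffs), max(diffs)


# Made-up prices in EUR/MWh, for illustration only.
vm_ref = {"2026-04-10T00:00Z": 50.0, "2026-04-10T01:00Z": 60.0, "2026-04-10T02:00Z": 70.0}
shadow = {"2026-04-10T00:00Z": 50.0, "2026-04-10T01:00Z": 60.5, "2026-04-10T03:00Z": 80.0}

pairs, mean_diff, max_diff = diff_stats(vm_ref, shadow)
print(pairs, mean_diff, max_diff)
```

Rows present on only one side are ignored by this sketch; a stricter check would also report them, since a missing row is itself a parity failure.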
Idempotency was confirmed by two back-to-back Cloud Run executions completing without primary-key conflicts. The 5-day shadow was retired with the same confidence the longer window would have given.
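One common way to get this behavior is an upsert keyed on the primary key, so that a rerun overwrites rows instead of colliding with them. Whether `cloud_run_predict.py` does exactly this is not stated here. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere. The table name, columns, and prices are hypothetical.

```python
import sqlite3

# PostgreSQL accepts the same ON CONFLICT ... DO UPDATE form (with %s
# placeholders under psycopg); the schema here is an assumption.
UPSERT_SQL = """
INSERT INTO predictions (model_name, target_ts, price_eur_mwh)
VALUES (?, ?, ?)
ON CONFLICT (model_name, target_ts)
DO UPDATE SET price_eur_mwh = excluded.price_eur_mwh
"""


def write_rows(conn: sqlite3.Connection, rows: list[tuple[str, str, float]]) -> None:
    """Insert or overwrite prediction rows in a single transaction."""
    with conn:
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions ("
    "model_name TEXT NOT NULL, target_ts TEXT NOT NULL, "
    "price_eur_mwh REAL NOT NULL, PRIMARY KEY (model_name, target_ts))"
)

first_run = [
    ("xgboost_hybrid15_shadow", "2026-04-10T00:00Z", 41.2),
    ("xgboost_hybrid15_shadow", "2026-04-10T01:00Z", 39.8),
]
second_run = [
    ("xgboost_hybrid15_shadow", "2026-04-10T00:00Z", 41.3),
    ("xgboost_hybrid15_shadow", "2026-04-10T01:00Z", 39.8),
]

write_rows(conn, first_run)
write_rows(conn, second_run)  # rerun: no primary-key error, rows are overwritten
row_count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
print(row_count)
```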
Key files
- `src/scripts/cloud_run_predict.py`: the entrypoint script the Cloud Run container runs
- `Dockerfile.predictor`: torch-free Python 3.12 image definition
- `docs/operations/DEPLOYMENT_CHECKLIST.md`: operator-facing deployment runbook including manual trigger commands
- `src/api/routes.py`: the `/api/v1/production-state` endpoint surfaces the live state for verification
Related
- v11.0 — Post-LSTM Correction — the model that Phase F deploys
- Multi-Country v2.0 — built on top of the v11.0 deployment infrastructure
- Phase 5 / v6.0 / Z3 Cross-Price Ablation — current production state