Cloud Run Jobs for Prediction
Overview
Since the M0.6 Phase F cutover (2026-04-09), EPF runs predictions on Cloud Run Jobs instead of VM cron. Each country × horizon group has its own job. Cloud Scheduler triggers the job, which cold-starts a container, downloads the relevant joblib from GCS, runs predictions, and writes rows back to the VM PostgreSQL via a VPC connector.
This replaced the previous architecture (VM cron calling daily_pipeline.sh predict-dayahead and predict-strategic inline) because:
- The VM is an e2-micro with limited RAM; model-load contention with the FastAPI service was a recurring issue.
- The VM cron had no retry, no isolation, and no observability — a prediction-run failure left stale forecasts on the dashboard.
- Moving predictions off the VM makes it trivial to add new countries (two Cloud Run Jobs + two Schedulers + one GCS folder, no VM changes).
Architecture

    Cloud Scheduler (per country × horizon)
        │
        ▼
    Cloud Run Job (predictor:v11.0)
        │
        ├─→ Download joblib: gs://epf-models-epriceforecaster/<COUNTRY>/<VERSION>/
        │
        ├─→ Fetch features from VM PG via VPC connector (epf-connector, 10.8.0.0/28)
        │
        ├─→ Run prediction with country-aware env (EPF_CROSS_PRICE_COUNTRIES, etc.)
        │
        └─→ Write predictions + latest_forecasts rows to VM PG
        │
        ▼
    VM cron (10:13 + 15:13 UTC, 10:14 + 15:14 UTC)
        │
        ├─→ Purge /tmp/nginx-epf-cache
        │
        └─→ POST /api/v1/forecast/cache-clear (API in-memory cache)

Components
Cloud Run Jobs
| Job | Country | Horizon | Trigger |
|---|---|---|---|
| epf-predict-es-dayahead | ES | D+1 | 10:10 UTC |
| epf-predict-es-strategic | ES | D+2…D+7 | 15:10 UTC |
| epf-predict-pt-dayahead | PT | D+1 | 10:10 UTC |
| epf-predict-pt-strategic | PT | D+2…D+7 | 15:10 UTC |
| epf-predict-fr-dayahead | FR | D+1 | 10:10 UTC |
| epf-predict-fr-strategic | FR | D+2…D+7 | 15:10 UTC |
| epf-predict-de-dayahead | DE | D+1 | 10:10 UTC |
| epf-predict-de-strategic | DE | D+2…D+7 | 15:10 UTC |
All jobs run the same container image (us-east1-docker.pkg.dev/epriceforecaster/epf/predictor:v11.0) but with different env variables for country, run mode, and feature gating.
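The job table follows a regular naming and scheduling pattern, which can be sketched as a small generator. This is only an illustration: the `COUNTRY` and `RUN_MODE` variable names are taken from the entrypoint command, while the function name, the dict layout, and the cron strings (derived from the 10:10 and 15:10 UTC triggers) are assumptions, not the project's actual deployment code.

```python
from itertools import product

COUNTRIES = ["ES", "PT", "FR", "DE"]

# Run mode -> (horizon label, cron expression in UTC derived from the table).
RUN_MODES = {
    "dayahead": ("D+1", "10 10 * * *"),
    "strategic": ("D+2…D+7", "10 15 * * *"),
}


def job_matrix():
    """Return one dict per Cloud Run Job, using the naming from the table."""
    jobs = []
    for country, (mode, (horizon, cron)) in product(COUNTRIES, RUN_MODES.items()):
        jobs.append(
            {
                "job": f"epf-predict-{country.lower()}-{mode}",
                "env": {"COUNTRY": country, "RUN_MODE": mode},
                "horizon": horizon,
                "schedule_utc": cron,
            }
        )
    return jobs


if __name__ == "__main__":
    for job in job_matrix():
        print(job["job"], job["schedule_utc"], job["env"])
```

Under this pattern, adding a country means appending one entry to `COUNTRIES`, which yields the two extra jobs and two extra schedules mentioned in the overview.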
Container image
The predictor container is built from Dockerfile.predictor. Key properties:
- Python 3.12, torch-free (v11.0+ doesn’t use LSTM, so PyTorch is left out of the image, which shrank from >2 GB to ~400 MB)
- Installs only prediction-path dependencies (pandas, numpy, scikit-learn, xgboost, SQLAlchemy, psycopg2, google-cloud-storage, python-dotenv)
- Entrypoint: python scripts/cloud_run_predict.py --country $COUNTRY --run-mode $RUN_MODE
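For readers unfamiliar with the entrypoint's shape, the two flags could be parsed roughly as follows. This is a hypothetical sketch, not the contents of `scripts/cloud_run_predict.py`: only the flag names `--country` and `--run-mode` come from the command above, and the allowed values are inferred from the job table.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags that the entrypoint command passes."""
    parser = argparse.ArgumentParser(description="Run one prediction batch (sketch).")
    parser.add_argument("--country", required=True, choices=["ES", "PT", "FR", "DE"])
    parser.add_argument("--run-mode", required=True, choices=["dayahead", "strategic"])
    return parser.parse_args(argv)


# Example invocation with explicit arguments instead of sys.argv.
args = parse_args(["--country", "ES", "--run-mode", "dayahead"])
print(f"country={args.country} run_mode={args.run_mode}")
```

Restricting both flags with `choices` makes a mistyped job configuration fail at start-up rather than after the model download.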
GCS joblib layout

    gs://epf-models-epriceforecaster/
    ├── ES/
    │   ├── v11.0/direct_model_xgboost_15min_hybrid_{dayahead,strategic}_2026-04-09.joblib
    │   └── v12.0-abl/direct_model_xgboost_15min_hybrid_dayahead_2026-04-16.joblib
    ├── PT/
    │   └── v6.0/direct_model_xgboost_PT_15min_hybrid_{dayahead,strategic}_2026-04-16.joblib
    ├── FR/
    │   └── v6.0/direct_model_xgboost_FR_15min_hybrid_{dayahead,strategic}_2026-04-16.joblib
    └── DE/
        └── v6.0/direct_model_xgboost_DE_15min_hybrid_{dayahead,strategic}_2026-04-15.joblib

Every Cloud Run cold-start downloads the joblib fresh. This keeps container images model-agnostic (rolling out a new model version = uploading a new joblib + updating an env var, no container rebuild).
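The naming convention in the bucket layout can be expressed as a small path builder, followed by a cold-start download. This is a sketch under assumptions: the function names, the `trained_on` parameter, and the `/tmp/model.joblib` destination are hypothetical, and how the real script learns the version and date is not described here. The `google-cloud-storage` calls (`Client`, `bucket`, `blob`, `download_to_filename`) are the library's standard ones.

```python
BUCKET = "epf-models-epriceforecaster"


def model_object_path(country: str, version: str, run_mode: str, trained_on: str) -> str:
    """Build the object path for one joblib, following the layout shown."""
    # ES filenames carry no country infix; PT, FR and DE include one.
    infix = "" if country == "ES" else f"{country}_"
    filename = f"direct_model_xgboost_{infix}15min_hybrid_{run_mode}_{trained_on}.joblib"
    return f"{country}/{version}/{filename}"


def download_model(country, version, run_mode, trained_on, dest="/tmp/model.joblib"):
    """Download the joblib to local disk (requires GCP credentials at runtime)."""
    from google.cloud import storage  # lazy import: needs google-cloud-storage

    path = model_object_path(country, version, run_mode, trained_on)
    storage.Client().bucket(BUCKET).blob(path).download_to_filename(dest)
    return dest


print(model_object_path("ES", "v11.0", "dayahead", "2026-04-09"))
```

Because the version is just a path segment, pointing a job at a new model only requires changing the value that feeds `version`, which matches the env-var rollout described above.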
Secret Manager
Two secrets are mounted per job at start:
- epf-database-url: the VM PG connection string (via internal IP 10.142.0.2)
- epf-admin-key: used to POST the cache-clear call after a successful write
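Inside the container, a job would typically validate both secrets before doing any work. The sketch below assumes the secrets are exposed as environment variables named `EPF_DATABASE_URL` and `EPF_ADMIN_KEY`; those names and the helper itself are hypothetical, since the actual mount method (environment variable or file) is not stated here.

```python
import os

# Hypothetical environment variable names for the two mounted secrets.
REQUIRED_SECRETS = ("EPF_DATABASE_URL", "EPF_ADMIN_KEY")


def load_secrets(environ=os.environ):
    """Return the required secrets, failing fast if any is missing or empty."""
    missing = [name for name in REQUIRED_SECRETS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"missing secrets: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_SECRETS}
```

Failing before the model download keeps a misconfigured job cheap and makes the cause obvious in the execution logs.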
VPC Access Connector
Cloud Run Jobs cannot directly reach the VM’s private IP. The epf-connector Serverless VPC Access connector (subnet 10.8.0.0/28, us-east1) routes traffic from Cloud Run into the VPC, where it reaches 10.142.0.2:5432 (the VM’s internal IP for PostgreSQL).
Firewall rule epf-pg-from-vpc-connector allows inbound traffic from the connector subnet to VMs with the epf-pg tag. PostgreSQL is configured with listen_addresses=localhost,10.142.0.2, and pg_hba.conf allows host epf epf 10.8.0.0/28.
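The address matching behind the firewall rule and the pg_hba.conf entry can be checked with Python's standard `ipaddress` module. This sketch models only the single 10.8.0.0/28 rule described above; the function name is hypothetical, and any other pg_hba.conf entries on the VM are ignored.

```python
import ipaddress

CONNECTOR_SUBNET = ipaddress.ip_network("10.8.0.0/28")
VM_PG_ADDRESS = ipaddress.ip_address("10.142.0.2")


def from_connector(client_ip: str) -> bool:
    """True if a connection's source address falls in the connector subnet."""
    return ipaddress.ip_address(client_ip) in CONNECTOR_SUBNET


# A /28 holds 16 addresses: 10.8.0.0 through 10.8.0.15.
print(CONNECTOR_SUBNET.num_addresses)
print(from_connector("10.8.0.5"), from_connector("10.8.0.16"))
```

Note that the VM's own address (10.142.0.2) is outside the connector subnet, so the connector's traffic is routed across the VPC to reach PostgreSQL.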
VM cron post-writes
Cloud Run can’t sudo systemctl reload nginx on the VM directly. Two VM cron lines fire ~3–4 minutes after each Cloud Run prediction run:
- 13 10,15 * * * UTC: sudo /bin/rm -rf /tmp/nginx-epf-cache/* && sudo /bin/systemctl reload nginx
- 14 10,15 * * * UTC: /opt/epf/scripts/api_cache_clear.sh (POSTs to /api/v1/forecast/cache-clear with the admin key)
Without these, nginx and the API in-memory forecast cache would serve stale predictions for up to their TTL.
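The API cache-clear call amounts to a single authenticated POST. The sketch below builds such a request with the standard library; it is not the contents of `api_cache_clear.sh`. The `X-Admin-Key` header name, the function name, and the base URL in the example are assumptions, and only the `/api/v1/forecast/cache-clear` path comes from this document.

```python
import urllib.request


def build_cache_clear_request(base_url: str, admin_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that clears the API forecast cache."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/forecast/cache-clear",
        data=b"",  # empty body; the endpoint's expected payload is not documented here
        method="POST",
        headers={"X-Admin-Key": admin_key},  # hypothetical header name
    )


req = build_cache_clear_request("http://localhost:8000/", "example-key")
print(req.get_method(), req.full_url)
```

Sending it would be `urllib.request.urlopen(req, timeout=10)`; building the request separately keeps the URL and auth logic testable without network access.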
Operational runbook
Manual trigger

    # Kick off an ES dayahead prediction now
    gcloud run jobs execute epf-predict-es-dayahead \
      --project epriceforecaster \
      --region us-east1 \
      --wait

Check the last run

    gcloud run jobs executions list \
      --job epf-predict-es-dayahead \
      --project epriceforecaster \
      --region us-east1 \
      --limit 5

Dashboard verification

    curl -s 'https://epf.productjorge.com/api/v1/forecast/combined?country=ES' | \
      jq '{rows: (.dayahead | length), latest_written: .metadata.latest_write}'

Rollback
If a Cloud Run Job is broken (e.g. a bad model rollout), re-enable the old VM cron line:

    gcloud compute ssh epf-vm --zone=us-east1-b
    sudo -u epf crontab -e
    # Remove the #CUTOVER_2026_04_09# prefix from the predict-dayahead line

The legacy daily_pipeline.sh predict-dayahead + predict-strategic paths still work; they’ve been kept as a fallback.
Why not Cloud Functions
Cloud Functions (2nd gen) would work here, but Cloud Run Jobs were a better fit:
- Jobs are explicitly batch; Functions are request-triggered (Scheduler → Pub/Sub → Function is a more complicated wiring)
- Cold-start download of a ~100 MB joblib is fine for a Job but tighter for a Function
- Jobs give longer timeouts (up to 24h) which is useful for strategic horizons that sweep D+2…D+7
Where this lives
- Scripts: scripts/cloud_run_predict.py (entrypoint), Dockerfile.predictor (image)
- Internal runbook: data/m06_phase_f_runbook.md
- Internal memory: memory/project_cloudrun_predictions.md