Cloud Run Jobs for Prediction

Overview

Since the M0.6 Phase F cutover (2026-04-09), EPF runs predictions on Cloud Run Jobs instead of VM cron. Each country × horizon group has its own job, triggered by Cloud Scheduler. On each trigger the job cold-starts a container, downloads the relevant joblib from GCS, runs predictions, and writes rows back to the VM PostgreSQL via a VPC connector.

This replaced the previous architecture (VM cron calling daily_pipeline.sh predict-dayahead and predict-strategic inline) because:

The VM is an e2-micro with only 1 GB of RAM; model-load contention with the FastAPI service was a recurring issue.
The VM cron had no retry, no isolation, and no observability — a prediction-run failure left stale forecasts on the dashboard.
Moving predictions off the VM makes it trivial to add new countries (two Cloud Run Jobs + two Schedulers + one GCS folder, no VM changes).

Architecture

Cloud Scheduler (per country × horizon)
     │
     ▼
Cloud Run Job (predictor:v11.0)
     │
     ├─→ Download joblib: gs://epf-models-epriceforecaster/<COUNTRY>/<VERSION>/
     │
     ├─→ Fetch features from VM PG via VPC connector (epf-connector, 10.8.0.0/28)
     │
     ├─→ Run prediction with country-aware env (EPF_CROSS_PRICE_COUNTRIES, etc.)
     │
     └─→ Write predictions + latest_forecasts rows to VM PG
            │
            ▼
    VM cron (10:13 + 15:13 UTC, 10:14 + 15:14 UTC)
     │
     ├─→ Purge /tmp/nginx-epf-cache
     │
     └─→ POST /api/v1/forecast/cache-clear (API in-memory cache)

Components

Cloud Run Jobs

Job	Country	Horizon	Trigger
`epf-predict-es-dayahead`	ES	D+1	10:10 UTC
`epf-predict-es-strategic`	ES	D+2…D+7	15:10 UTC
`epf-predict-pt-dayahead`	PT	D+1	10:10 UTC
`epf-predict-pt-strategic`	PT	D+2…D+7	15:10 UTC
`epf-predict-fr-dayahead`	FR	D+1	10:10 UTC
`epf-predict-fr-strategic`	FR	D+2…D+7	15:10 UTC
`epf-predict-de-dayahead`	DE	D+1	10:10 UTC
`epf-predict-de-strategic`	DE	D+2…D+7	15:10 UTC

All jobs run the same container image (us-east1-docker.pkg.dev/epriceforecaster/epf/predictor:v11.0) and differ only in the environment variables that set country, run mode, and feature gating.
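Because the job names follow a regular country × horizon pattern, the matrix can be generated rather than maintained by hand. The following is a minimal sketch, not the project's actual tooling: `job_matrix` is a hypothetical helper, and the `COUNTRY` / `RUN_MODE` variable names are assumed from the `$COUNTRY` and `$RUN_MODE` placeholders in the container entrypoint. The trigger times are the ones listed in the table.

```python
from itertools import product

COUNTRIES = ["ES", "PT", "FR", "DE"]
# Run mode -> daily trigger time (UTC), as listed in the table above.
TRIGGERS_UTC = {"dayahead": "10:10", "strategic": "15:10"}


def job_matrix(countries=COUNTRIES, triggers=TRIGGERS_UTC):
    """Return one spec per country x horizon job (hypothetical helper)."""
    specs = []
    for country, (run_mode, trigger) in product(countries, triggers.items()):
        specs.append({
            "job": f"epf-predict-{country.lower()}-{run_mode}",
            # Assumed env variable names; other gating variables are omitted.
            "env": {"COUNTRY": country, "RUN_MODE": run_mode},
            "trigger_utc": trigger,
        })
    return specs


if __name__ == "__main__":
    for spec in job_matrix():
        print(spec["job"], spec["env"], spec["trigger_utc"])
```

Adding a country under this scheme would amount to appending its code to `COUNTRIES`, which yields the two extra jobs the overview mentions.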

Container image

The predictor container is built from Dockerfile.predictor. Key properties:

Python 3.12, torch-free: v11.0+ doesn’t use an LSTM, so the image ships without PyTorch, which shrank it from >2 GB to ~400 MB
Installs only prediction-path dependencies (pandas, numpy, scikit-learn, xgboost, SQLAlchemy, psycopg2, google-cloud-storage, python-dotenv)
Entrypoint: python scripts/cloud_run_predict.py --country $COUNTRY --run-mode $RUN_MODE
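To illustrate how an entrypoint with this command line might validate its inputs, here is a minimal sketch. It is not the contents of scripts/cloud_run_predict.py; the fallback to the `COUNTRY` and `RUN_MODE` environment variables and the `parse_args` name are assumptions made for the example.

```python
import argparse
import os

RUN_MODES = ("dayahead", "strategic")


def parse_args(argv=None):
    """Parse --country / --run-mode, falling back to env vars (assumed)."""
    parser = argparse.ArgumentParser(description="Run one prediction job.")
    parser.add_argument("--country", default=os.environ.get("COUNTRY"),
                        help="Country code, e.g. ES")
    parser.add_argument("--run-mode", choices=RUN_MODES,
                        default=os.environ.get("RUN_MODE"),
                        help="Which horizon group to predict")
    args = parser.parse_args(argv)
    # Fail fast so a misconfigured job exits before downloading a model.
    if not args.country or not args.run_mode:
        parser.error("--country and --run-mode are required")
    args.country = args.country.upper()
    return args


if __name__ == "__main__":
    cli = parse_args()
    print(f"country={cli.country} run_mode={cli.run_mode}")
```

Failing before any model download keeps a misconfigured execution cheap and makes the error visible in the job's logs.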

GCS joblib layout

gs://epf-models-epriceforecaster/
├── ES/
│   ├── v11.0/direct_model_xgboost_15min_hybrid_{dayahead,strategic}_2026-04-09.joblib
│   └── v12.0-abl/direct_model_xgboost_15min_hybrid_dayahead_2026-04-16.joblib
├── PT/
│   └── v6.0/direct_model_xgboost_PT_15min_hybrid_{dayahead,strategic}_2026-04-16.joblib
├── FR/
│   └── v6.0/direct_model_xgboost_FR_15min_hybrid_{dayahead,strategic}_2026-04-16.joblib
└── DE/
    └── v6.0/direct_model_xgboost_DE_15min_hybrid_{dayahead,strategic}_2026-04-15.joblib

Every Cloud Run cold-start downloads the joblib fresh. This keeps container images model-agnostic: rolling out a new model version means uploading a new joblib and updating an env var, with no container rebuild.
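The cold-start download could look roughly like the sketch below. It assumes the version and training date reach the job as parameters (the source does not say how they are passed), and it infers from the listing above that ES filenames carry no country tag while PT, FR, and DE filenames do. `joblib_blob_name` and `download_model` are hypothetical names.

```python
from pathlib import Path

BUCKET = "epf-models-epriceforecaster"


def joblib_blob_name(country, version, run_mode, trained_on):
    """Build the object name following the bucket layout shown above."""
    # Inferred from the listing: ES files have no country tag in the name.
    tag = "" if country == "ES" else f"{country}_"
    filename = (f"direct_model_xgboost_{tag}15min_hybrid_"
                f"{run_mode}_{trained_on}.joblib")
    return f"{country}/{version}/{filename}"


def download_model(country, version, run_mode, trained_on, dest_dir="/tmp"):
    """Download the joblib for one job and load it into memory."""
    # Imported lazily so the path helper works without GCS installed.
    import joblib
    from google.cloud import storage

    blob_name = joblib_blob_name(country, version, run_mode, trained_on)
    local_path = Path(dest_dir) / Path(blob_name).name
    client = storage.Client()  # uses the job's service account credentials
    client.bucket(BUCKET).blob(blob_name).download_to_filename(str(local_path))
    return joblib.load(local_path)
```

For example, `joblib_blob_name("PT", "v6.0", "strategic", "2026-04-16")` resolves to the PT strategic object in the listing.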

Secret Manager

Two secrets are mounted per job at start:

epf-database-url — the VM PG connection string (via internal IP 10.142.0.2)
epf-admin-key — used to POST the cache-clear call after a successful write
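Inside the container, the two secrets might be consumed as in the following sketch. It assumes they are exposed as environment variables named `EPF_DATABASE_URL` and `EPF_ADMIN_KEY`; both names are hypothetical, and the secrets could equally be mounted as files.

```python
import os

# Hypothetical env variable names for the two Secret Manager secrets.
SECRET_ENV_VARS = {
    "database_url": "EPF_DATABASE_URL",  # epf-database-url
    "admin_key": "EPF_ADMIN_KEY",        # epf-admin-key
}


def load_secrets(env=None):
    """Read both secrets, failing fast if either is missing or empty."""
    env = os.environ if env is None else env
    missing = [var for var in SECRET_ENV_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"missing secrets: {', '.join(missing)}")
    return {key: env[var] for key, var in SECRET_ENV_VARS.items()}
```

Passing `env` explicitly makes the function easy to exercise in tests without touching the real process environment.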

VPC Access Connector

Cloud Run Jobs cannot directly reach the VM’s private IP. The epf-connector Serverless VPC Access connector (subnet 10.8.0.0/28, us-east1) routes traffic from Cloud Run into the VPC, where it reaches 10.142.0.2:5432 (the VM’s internal IP for PostgreSQL).

The firewall rule epf-pg-from-vpc-connector allows inbound traffic from the connector subnet to VMs with the epf-pg tag. PostgreSQL is configured with listen_addresses=localhost,10.142.0.2, and pg_hba.conf contains a host epf epf 10.8.0.0/28 entry so that connections from the connector subnet are accepted.
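The address arithmetic behind these rules can be checked with Python's standard `ipaddress` module. This is an illustrative sketch only; `allowed_by_hba` is a hypothetical helper that mimics the subnet match, not PostgreSQL's actual logic.

```python
import ipaddress

CONNECTOR_SUBNET = ipaddress.ip_network("10.8.0.0/28")
VM_PG_ADDR = ipaddress.ip_address("10.142.0.2")


def allowed_by_hba(source_ip, subnet=CONNECTOR_SUBNET):
    """True if a client address falls inside the connector subnet."""
    return ipaddress.ip_address(source_ip) in subnet


if __name__ == "__main__":
    print(CONNECTOR_SUBNET.num_addresses)   # 16 addresses in a /28
    print(allowed_by_hba("10.8.0.5"))       # True: inside 10.8.0.0-10.8.0.15
    print(allowed_by_hba("10.8.0.16"))      # False: first address past the /28
    print(VM_PG_ADDR in CONNECTOR_SUBNET)   # False: the VM sits in another range
```

The last check shows why the connector, the firewall rule, and the pg_hba.conf entry are all needed: the VM's address lies outside the connector subnet, so traffic has to be routed to it and explicitly admitted.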

VM cron post-writes

Cloud Run can’t run sudo systemctl reload nginx on the VM directly, so two VM cron lines fire 3–4 minutes after each Cloud Run job is triggered (10:10 and 15:10 UTC):

13 10,15 * * * UTC — sudo /bin/rm -rf /tmp/nginx-epf-cache/* && sudo /bin/systemctl reload nginx
14 10,15 * * * UTC — /opt/epf/scripts/api_cache_clear.sh (POSTs to /api/v1/forecast/cache-clear with the admin key)

Without these, nginx and the API in-memory forecast cache would serve stale predictions for up to their TTL.
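For illustration, the cache-clear call could be expressed in Python as below. This is a sketch, not the contents of api_cache_clear.sh: the `X-Admin-Key` header name and the base URL passed in are assumptions, and only the endpoint path comes from the text above.

```python
import urllib.request


def build_cache_clear_request(base_url, admin_key):
    """Build (but do not send) the cache-clear POST."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/forecast/cache-clear",
        data=b"",                            # empty body
        headers={"X-Admin-Key": admin_key},  # hypothetical header name
        method="POST",
    )


def clear_api_cache(base_url, admin_key, timeout=10):
    """Send the request; urlopen raises HTTPError on a non-2xx response."""
    request = build_cache_clear_request(base_url, admin_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the URL and header logic testable without a running API.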

Operational runbook

Manual trigger

# Kick off an ES dayahead prediction now
gcloud run jobs execute epf-predict-es-dayahead \
  --project epriceforecaster \
  --region us-east1 \
  --wait

Check the last run

gcloud run jobs executions list \
  --job epf-predict-es-dayahead \
  --project epriceforecaster \
  --region us-east1 \
  --limit 5

Dashboard verification

curl -s 'https://epf.productjorge.com/api/v1/forecast/combined?country=ES' | \
  jq '{rows: (.dayahead | length), latest_written: .metadata.latest_write}'
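The same check can be scripted in Python, which is convenient when it needs to feed an alert or a test. A minimal sketch, assuming the response shape implied by the jq filter above (a `dayahead` array and a `metadata.latest_write` field); `summarize` and `check` are hypothetical names.

```python
import json
import urllib.request

URL = "https://epf.productjorge.com/api/v1/forecast/combined?country={country}"


def summarize(payload):
    """Mirror the jq filter: row count plus the latest write marker."""
    return {
        "rows": len(payload.get("dayahead") or []),
        "latest_written": (payload.get("metadata") or {}).get("latest_write"),
    }


def check(country="ES", timeout=15):
    """Fetch the combined forecast for one country and summarize it."""
    with urllib.request.urlopen(URL.format(country=country),
                                timeout=timeout) as response:
        return summarize(json.load(response))


if __name__ == "__main__":
    print(check("ES"))
```

A zero row count or an old `latest_written` value after a scheduled run would suggest the job failed or the caches were not cleared.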

Rollback

If a Cloud Run Job is broken (e.g. a bad model rollout), re-enable the old VM cron line:

gcloud compute ssh epf-vm --zone=us-east1-b
sudo -u epf crontab -e
# Remove the #CUTOVER_2026_04_09# prefix from the predict-dayahead line
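The manual crontab edit amounts to a small text transformation, sketched here for clarity. This is an assumption-laden illustration rather than an existing script: `reenable` is a hypothetical helper, and it assumes the disabled line begins with the prefix exactly as written above.

```python
import sys

PREFIX = "#CUTOVER_2026_04_09#"


def reenable(crontab_text, needle="predict-dayahead", prefix=PREFIX):
    """Strip the cutover prefix from lines that mention `needle`."""
    lines = []
    for line in crontab_text.splitlines():
        if line.startswith(prefix) and needle in line:
            # Drop the prefix and any spaces that followed it.
            line = line[len(prefix):].lstrip()
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reads a crontab on stdin and writes the edited version to stdout.
    sys.stdout.write(reenable(sys.stdin.read()))
```

Lines that do not carry the prefix, or that do not mention the chosen pipeline step, pass through unchanged, so the transformation is safe to rerun.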

The legacy daily_pipeline.sh predict-dayahead + predict-strategic paths still work — they’ve been kept as a fallback.

Why not Cloud Functions

Cloud Functions (2nd gen) would work here, but Cloud Run Jobs were a better fit:

Jobs are explicitly batch; Functions are request-triggered (Scheduler → Pub/Sub → Function is more complicated wiring)
Cold-start download of a ~100 MB joblib is fine for a Job but tighter for a Function
Jobs allow longer timeouts (up to 24h), which is useful for strategic horizons that sweep D+2…D+7

Where this lives

Scripts: scripts/cloud_run_predict.py (entrypoint), Dockerfile.predictor (image)
Internal runbook: data/m06_phase_f_runbook.md
Internal memory: memory/project_cloudrun_predictions.md