Cloud Run Jobs for Prediction

Overview

Since the M0.6 Phase F cutover (2026-04-09), EPF runs predictions on Cloud Run Jobs instead of VM cron. Each country × horizon group has its own job. Cloud Scheduler triggers the job, which cold-starts a container, downloads the relevant joblib from GCS, runs predictions, and writes rows back to PostgreSQL on the VM through a VPC connector.

This replaced the previous architecture (VM cron calling daily_pipeline.sh predict-dayahead and predict-strategic inline) because:

  • The VM is an e2-micro with limited RAM; model-load contention with the FastAPI service was a recurring issue.
  • The VM cron had no retry, no isolation, and no observability — a prediction-run failure left stale forecasts on the dashboard.
  • Moving predictions off the VM makes it trivial to add new countries (two Cloud Run Jobs + two Schedulers + one GCS folder, no VM changes).

Architecture

Cloud Scheduler (per country × horizon)
  └─→ Cloud Run Job (predictor:v11.0)
        ├─→ Download joblib: gs://epf-models-epriceforecaster/<COUNTRY>/<VERSION>/
        ├─→ Fetch features from VM PG via VPC connector (epf-connector, 10.8.0.0/28)
        ├─→ Run prediction with country-aware env (EPF_CROSS_PRICE_COUNTRIES, etc.)
        └─→ Write predictions + latest_forecasts rows to VM PG

VM cron (10:13 + 15:13 UTC, 10:14 + 15:14 UTC)
  ├─→ Purge /tmp/nginx-epf-cache
  └─→ POST /api/v1/forecast/cache-clear (API in-memory cache)

Components

Cloud Run Jobs

Job                        Country  Horizon   Trigger
epf-predict-es-dayahead    ES       D+1       10:10 UTC
epf-predict-es-strategic   ES       D+2…D+7   15:10 UTC
epf-predict-pt-dayahead    PT       D+1       10:10 UTC
epf-predict-pt-strategic   PT       D+2…D+7   15:10 UTC
epf-predict-fr-dayahead    FR       D+1       10:10 UTC
epf-predict-fr-strategic   FR       D+2…D+7   15:10 UTC
epf-predict-de-dayahead    DE       D+1       10:10 UTC
epf-predict-de-strategic   DE       D+2…D+7   15:10 UTC

All jobs run the same container image (us-east1-docker.pkg.dev/epriceforecaster/epf/predictor:v11.0) but with different env variables for country, run mode, and feature gating.
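The job names and triggers in the table above follow a regular pattern, which is what makes adding a country cheap. The sketch below reproduces that matrix in Python. The helper names (`job_name`, `job_matrix`) are illustrative only and are not taken from the EPF codebase.

```python
from itertools import product

# Countries and trigger times (UTC) as listed in the Cloud Run Jobs table.
COUNTRIES = ("ES", "PT", "FR", "DE")
TRIGGERS_UTC = {"dayahead": "10:10", "strategic": "15:10"}


def job_name(country: str, run_mode: str) -> str:
    """Build the Cloud Run Job name for one country and run mode."""
    return f"epf-predict-{country.lower()}-{run_mode}"


def job_matrix() -> list[dict]:
    """Return one entry per country x run mode, in table order."""
    return [
        {
            "job": job_name(country, run_mode),
            "country": country,
            "run_mode": run_mode,
            "trigger_utc": TRIGGERS_UTC[run_mode],
        }
        for country, run_mode in product(COUNTRIES, TRIGGERS_UTC)
    ]
```

Under this pattern, adding a country means adding one entry to `COUNTRIES`, which yields the two extra jobs (day-ahead and strategic) mentioned in the overview.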

Container image

The predictor container is built from Dockerfile.predictor. Key properties:

  • Python 3.12, torch-free (v11.0+ doesn't use an LSTM, so the image ships without PyTorch; this cut the image size from >2 GB to ~400 MB)
  • Installs only prediction-path dependencies (pandas, numpy, scikit-learn, xgboost, SQLAlchemy, psycopg2, google-cloud-storage, python-dotenv)
  • Entrypoint: python scripts/cloud_run_predict.py --country $COUNTRY --run-mode $RUN_MODE
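The entrypoint above takes two flags. The actual argument handling in scripts/cloud_run_predict.py is not shown here, so the following is only a minimal sketch of how such a command line could be parsed. The allowed values are assumptions inferred from the job table (four countries, two run modes).

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --country and --run-mode flags used by the entrypoint."""
    parser = argparse.ArgumentParser(description="Run one prediction batch")
    # ASSUMPTION: the choices mirror the countries and modes in the job table.
    parser.add_argument("--country", required=True,
                        choices=["ES", "PT", "FR", "DE"])
    parser.add_argument("--run-mode", required=True,
                        choices=["dayahead", "strategic"])
    return parser.parse_args(argv)
```

With this sketch, `parse_args(["--country", "ES", "--run-mode", "dayahead"])` yields a namespace with `country="ES"` and `run_mode="dayahead"`. An unknown country or mode makes argparse exit with a usage error before any model is downloaded.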

GCS joblib layout

gs://epf-models-epriceforecaster/
├── ES/
│   ├── v11.0/direct_model_xgboost_15min_hybrid_{dayahead,strategic}_2026-04-09.joblib
│   └── v12.0-abl/direct_model_xgboost_15min_hybrid_dayahead_2026-04-16.joblib
├── PT/
│   └── v6.0/direct_model_xgboost_PT_15min_hybrid_{dayahead,strategic}_2026-04-16.joblib
├── FR/
│   └── v6.0/direct_model_xgboost_FR_15min_hybrid_{dayahead,strategic}_2026-04-16.joblib
└── DE/
    └── v6.0/direct_model_xgboost_DE_15min_hybrid_{dayahead,strategic}_2026-04-15.joblib

Every Cloud Run cold-start downloads the joblib fresh. This keeps container images model-agnostic (rolling out a new model version = uploading a new joblib + updating an env var, no container rebuild).
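The cold-start download can be sketched with the google-cloud-storage client, which is already among the image's dependencies. This is an illustration under assumptions, not the code in cloud_run_predict.py: the function names, the `/tmp` destination, and the way country, version, and filename are passed in are all hypothetical.

```python
from pathlib import Path

BUCKET = "epf-models-epriceforecaster"


def model_blob_name(country: str, version: str, filename: str) -> str:
    """Object path inside the bucket: <COUNTRY>/<VERSION>/<filename>."""
    return f"{country}/{version}/{filename}"


def download_model(country: str, version: str, filename: str,
                   dest_dir: str = "/tmp") -> Path:
    """Download one joblib from GCS and return the local path."""
    # Imported lazily so model_blob_name works without the GCS client installed.
    from google.cloud import storage

    client = storage.Client()  # uses the job's service account credentials
    blob = client.bucket(BUCKET).blob(model_blob_name(country, version, filename))
    dest = Path(dest_dir) / filename
    blob.download_to_filename(str(dest))
    return dest
```

Because the version is just part of the object path, switching models only changes the arguments passed to `download_model`, which is consistent with rolling out a model without rebuilding the container.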

Secret Manager

Two secrets are mounted per job at start:

  • epf-database-url — the VM PG connection string (via internal IP 10.142.0.2)
  • epf-admin-key — used to POST the cache-clear call after a successful write

VPC Access Connector

Cloud Run Jobs cannot directly reach the VM’s private IP. The epf-connector Serverless VPC Access connector (subnet 10.8.0.0/28, us-east1) routes traffic from Cloud Run into the VPC, where it reaches 10.142.0.2:5432 (the VM’s internal IP for PostgreSQL).

The firewall rule epf-pg-from-vpc-connector allows inbound traffic from the connector subnet to VMs with the epf-pg tag. PostgreSQL is configured with listen_addresses = 'localhost,10.142.0.2', and pg_hba.conf allows host epf epf 10.8.0.0/28.
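The address ranges above can be sanity-checked with Python's standard ipaddress module. The sketch below only models the source-range check that the firewall rule and the pg_hba.conf entry both express. The function name is illustrative, and it does not inspect any real GCP or PostgreSQL configuration.

```python
import ipaddress

# Values taken from the connector and PostgreSQL settings described above.
CONNECTOR_SUBNET = ipaddress.ip_network("10.8.0.0/28")
VM_PG_IP = ipaddress.ip_address("10.142.0.2")


def allowed_by_pg_rule(source_ip: str) -> bool:
    """True if source_ip falls inside the connector subnet (10.8.0.0/28)."""
    return ipaddress.ip_address(source_ip) in CONNECTOR_SUBNET
```

A /28 holds 16 addresses (10.8.0.0 through 10.8.0.15), so `allowed_by_pg_rule("10.8.0.5")` is true while `allowed_by_pg_rule("10.8.0.16")` is false. The VM's own address, 10.142.0.2, sits outside the connector range.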

VM cron post-writes

Cloud Run can't run sudo systemctl reload nginx on the VM directly, so two VM cron lines fire 3 and 4 minutes after each scheduled Cloud Run trigger (10:10 and 15:10 UTC):

  • 13 10,15 * * * UTC — sudo /bin/rm -rf /tmp/nginx-epf-cache/* && sudo /bin/systemctl reload nginx
  • 14 10,15 * * * UTC — /opt/epf/scripts/api_cache_clear.sh (POSTs to /api/v1/forecast/cache-clear with the admin key)

Without these, nginx and the API in-memory forecast cache would serve stale predictions for up to their TTL.
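The cache-clear call is a plain authenticated POST. The contents of api_cache_clear.sh are not shown here, so the following Python sketch is only an approximation of what such a call looks like. The `X-Admin-Key` header name is an assumption, as is the base URL passed by the caller.

```python
import urllib.request


def build_cache_clear_request(base_url: str, admin_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that clears the API forecast cache."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/forecast/cache-clear",
        method="POST",
        # ASSUMPTION: the real API may expect a different header name.
        headers={"X-Admin-Key": admin_key},
    )
```

Sending it would be `urllib.request.urlopen(req, timeout=10)`. Separating request construction from sending keeps the URL and method easy to check without touching the live API.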

Operational runbook

Manual trigger

# Kick off an ES dayahead prediction now
gcloud run jobs execute epf-predict-es-dayahead \
  --project epriceforecaster \
  --region us-east1 \
  --wait

Check the last run

gcloud run jobs executions list \
  --job epf-predict-es-dayahead \
  --project epriceforecaster \
  --region us-east1 \
  --limit 5

Dashboard verification

curl -s 'https://epf.productjorge.com/api/v1/forecast/combined?country=ES' | \
jq '{rows: (.dayahead | length), latest_written: .metadata.latest_write}'
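For scripted checks, the same summary as the jq filter can be computed in Python. This sketch assumes only the response fields the filter already reads (`dayahead` and `metadata.latest_write`). The function name is illustrative.

```python
def summarize_forecast(payload: dict) -> dict:
    """Mirror the jq filter: count day-ahead rows and report the latest write."""
    return {
        "rows": len(payload.get("dayahead") or []),
        "latest_written": payload.get("metadata", {}).get("latest_write"),
    }
```

A row count of zero, or a `latest_written` value that predates the most recent 10:10 or 15:10 UTC trigger, suggests that the job did not write or that a cache is still serving the previous run.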

Rollback

If a Cloud Run Job is broken (e.g. a bad model rollout), re-enable the old VM cron line:

gcloud compute ssh epf-vm --zone=us-east1-b
sudo -u epf crontab -e
# Remove the #CUTOVER_2026_04_09# prefix from the predict-dayahead line

The legacy daily_pipeline.sh predict-dayahead and predict-strategic paths still work; they have been kept as a fallback.

Why not Cloud Functions

Cloud Functions (2nd gen) would work here, but Cloud Run Jobs were a better fit:

  • Jobs are explicitly batch; Functions are request-triggered (Scheduler → Pub/Sub → Function is a more complicated wiring)
  • Cold-start download of a ~100 MB joblib is fine for a Job but tighter for a Function
  • Jobs give longer timeouts (up to 24h) which is useful for strategic horizons that sweep D+2…D+7

Where this lives

  • Scripts: scripts/cloud_run_predict.py (entrypoint), Dockerfile.predictor (image)
  • Internal runbook: data/m06_phase_f_runbook.md
  • Internal memory: memory/project_cloudrun_predictions.md