Origin Filtering

Overview

Origin filtering ensures that each model is trained only on samples from time windows that match its production execution schedule. Day-ahead models see only morning origins; strategic models see only afternoon origins. This prevents information leakage and aligns training conditions with deployment reality.

Why Filter Origins?

Consider a model trained on all origins (00:00–23:00 UTC). At 10:00 UTC, it has access to the morning’s weather, overnight prices, and early demand data. At 15:00 UTC, it additionally has access to D+1 published prices, afternoon demand, and updated weather.

A model trained on mixed origins would learn to use D+1 prices in some samples (when available at 15:00) and to do without them in others (when unavailable at 10:00). This creates ambiguity: the model cannot tell which information set it is operating with.

Origin filtering resolves this by training separate models on separate origin windows, each with a consistent information set.

Filter Configurations

Day-Ahead Origins

DAYAHEAD_ORIGIN_HOURS = range(8, 13) # 08:00, 09:00, 10:00, 11:00, 12:00 UTC

The production day-ahead run executes at ~10:00 UTC. Training includes origins from 08:00–12:00 to provide robustness:

  • 08:00–09:00: Slightly earlier origins where some morning data may be incomplete
  • 10:00: Exact production time
  • 11:00–12:00: Slightly later origins that might have more complete data

This window is intended to keep the model performing well even if the production run timing varies by an hour or two.

Strategic Origins

STRATEGIC_ORIGIN_HOURS = range(13, 19) # 13:00, 14:00, ..., 18:00 UTC

The production strategic run executes at ~15:00 UTC, after OMIE publishes D+1 prices (~13:00 UTC). Training includes origins from 13:00–18:00:

  • 13:00: Earliest possible after OMIE publication
  • 15:00: Target production time
  • 18:00: Latest reasonable afternoon origin

All strategic origins have access to D+1 published prices, so the model always learns with these features present.
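The two windows can be sanity-checked with a small sketch. The window constants are the ones given above. The run-hour constants and the `window_for_hour` helper are hypothetical names introduced only for illustration.

```python
from typing import Optional

# Window constants as given in this document.
DAYAHEAD_ORIGIN_HOURS = range(8, 13)    # 08:00 .. 12:00 UTC
STRATEGIC_ORIGIN_HOURS = range(13, 19)  # 13:00 .. 18:00 UTC

# Approximate production run hours stated above (hypothetical constant names).
DAYAHEAD_RUN_HOUR = 10
STRATEGIC_RUN_HOUR = 15


def window_for_hour(hour: int) -> Optional[str]:
    """Return the product whose training window contains this origin hour."""
    if hour in DAYAHEAD_ORIGIN_HOURS:
        return "dayahead"
    if hour in STRATEGIC_ORIGIN_HOURS:
        return "strategic"
    return None  # hour is outside both filtered windows


# The windows should not share any hour, otherwise one origin would feed
# two models that assume different information sets.
overlap = set(DAYAHEAD_ORIGIN_HOURS) & set(STRATEGIC_ORIGIN_HOURS)
```

With these definitions, `overlap` is empty and each production run hour falls inside its own window, which is the property the two configurations rely on.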

Implementation

During data construction, origins outside the allowed window are skipped:

for i in range(min_history, len(df)):
    origin_dt = df.index[i]
    if allowed_hours is not None and origin_dt.hour not in allowed_hours:
        continue  # skip this origin
    # ... construct samples for this origin

This filtering happens before any feature computation, ensuring no processing time is wasted on irrelevant origins.
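As a self-contained illustration of the skip logic, the sketch below applies the same hour check to a plain list of hourly timestamps instead of a DataFrame index. The function name `select_origins` and the three-day synthetic series are assumptions made for the example, not part of the actual pipeline.

```python
from datetime import datetime, timedelta, timezone


def select_origins(timestamps, allowed_hours=None, min_history=0):
    """Return the timestamps that survive the origin-hour filter.

    Positions before min_history are never used as origins. When
    allowed_hours is None, no hour filter is applied.
    """
    kept = []
    for i in range(min_history, len(timestamps)):
        origin_dt = timestamps[i]
        if allowed_hours is not None and origin_dt.hour not in allowed_hours:
            continue  # skip this origin before doing any feature work
        kept.append(origin_dt)
    return kept


# Three days of synthetic hourly timestamps in UTC.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
hourly = [start + timedelta(hours=h) for h in range(72)]

# Reserve the first 24 hours as history, leaving two full days of candidates.
dayahead = select_origins(hourly, allowed_hours=set(range(8, 13)), min_history=24)
unfiltered = select_origins(hourly, allowed_hours=None, min_history=24)
```

Here the two remaining days yield 10 day-ahead origins (5 per day) against 48 unfiltered ones, so the expensive sample construction runs on roughly a fifth of the candidates.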

Impact on Training Set Size

Origin filtering reduces the number of training samples compared to using all 24 hourly origins:

| Product | Origins per Day | Fraction of All Origins |
|---|---|---|
| Day-ahead | 5 (08–12) | 21% |
| Strategic | 6 (13–18) | 25% |
| Legacy (unfiltered) | 24 (00–23) | 100% |

With 1 year of data, the day-ahead model has approximately 1,825 origins (365 days × 5 origins per day) while the unfiltered legacy model would have 8,760 (365 × 24). The reduction is justified because:

  1. Each filtered sample is more representative of production conditions
  2. Including irrelevant origins adds noise (e.g., a 03:00 UTC origin is never used in production)
  3. The remaining sample count is still sufficient for gradient boosting training
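The figures above follow directly from the window lengths. The sketch below reproduces the arithmetic; the `origin_stats` helper is a hypothetical name, and the totals are upper bounds that ignore any origins lost to minimum-history requirements or missing data.

```python
HOURS_PER_DAY = 24
DAYS = 365

# Origin windows as defined in this document.
windows = {
    "dayahead": range(8, 13),
    "strategic": range(13, 19),
    "legacy": range(0, 24),
}


def origin_stats(hours, days=DAYS):
    """Return (origins per day, fraction of all hourly origins, origins over `days`)."""
    per_day = len(hours)
    return per_day, per_day / HOURS_PER_DAY, per_day * days


stats = {name: origin_stats(hours) for name, hours in windows.items()}

for name, (per_day, fraction, total) in stats.items():
    print(f"{name:10s} {per_day:2d}/day  {fraction:6.1%}  {total:5d} origins/year")
```

The same calculation gives the strategic model 2,190 origins per year (365 × 6).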

Legacy Mode

The legacy mode (23:00 UTC origin) applies no origin filtering and trains on all available hourly origins:

if run_mode in ("dayahead", "strategic"):
    allowed_hours = set(DAYAHEAD_ORIGIN_HOURS if run_mode == "dayahead"
                        else STRATEGIC_ORIGIN_HOURS)
else:
    allowed_hours = None  # no filter (legacy)

This backward-compatible behavior is retained for comparison and transition purposes.
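One way to package this selection is a small lookup function, sketched below. The name `resolve_allowed_hours` and the `_WINDOWS` mapping are hypothetical. The fallback to `None` for any other mode mirrors the `else` branch above.

```python
DAYAHEAD_ORIGIN_HOURS = range(8, 13)
STRATEGIC_ORIGIN_HOURS = range(13, 19)

# Hypothetical mapping from run mode to its origin window.
_WINDOWS = {
    "dayahead": DAYAHEAD_ORIGIN_HOURS,
    "strategic": STRATEGIC_ORIGIN_HOURS,
}


def resolve_allowed_hours(run_mode):
    """Return the set of allowed origin hours, or None for no filtering."""
    hours = _WINDOWS.get(run_mode)
    if hours is None:
        return None  # legacy (or any unrecognized mode): keep all origins
    return set(hours)
```

Because any unrecognized mode silently disables filtering, a misspelled mode name would train on all 24 origins. A stricter variant could raise an error for modes other than an explicit "legacy" value, at the cost of changing the backward-compatible behavior.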

Interaction with D+1 Price Features

Origin filtering and D+1 price features are tightly coupled:

| Product | Origin Window | D+1 Prices | Reason |
|---|---|---|---|
| Day-ahead | 08–12 UTC | Not available | OMIE hasn’t published yet |
| Strategic | 13–18 UTC | Available | OMIE published ~13:00 UTC |

The strategic model’s feature engineering includes D+1 price extraction precisely because all its training origins are after the OMIE publication time. The day-ahead model’s feature engineering excludes D+1 prices because none of its training origins have access to them.

This consistency between training and production information sets is the core purpose of origin filtering.
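The coupling can be expressed as a check that every hour in a window has the same D+1 availability. The sketch below treats the approximate 13:00 UTC publication time as a hard cutoff and counts the 13:00 origin as having prices, following the table above. Both function names and the cutoff constant are assumptions for illustration.

```python
# Approximate OMIE publication hour from this document, used as a hard cutoff.
OMIE_PUBLICATION_HOUR_UTC = 13


def d1_prices_available(origin_hour, publication_hour=OMIE_PUBLICATION_HOUR_UTC):
    """Assume D+1 prices exist for any origin at or after the publication hour."""
    return origin_hour >= publication_hour


def window_is_consistent(hours, expects_d1):
    """True if every origin hour in the window matches the expected availability."""
    return all(d1_prices_available(h) == expects_d1 for h in hours)


dayahead_ok = window_is_consistent(range(8, 13), expects_d1=False)
strategic_ok = window_is_consistent(range(13, 19), expects_d1=True)

# An unfiltered window mixes both cases, so neither expectation holds.
mixed_with_d1 = window_is_consistent(range(0, 24), expects_d1=True)
mixed_without_d1 = window_is_consistent(range(0, 24), expects_d1=False)
```

Both filtered windows pass, while the unfiltered 24-hour window fails under either expectation. That failure is the mixed-information-set problem described in "Why Filter Origins?".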