Origin Filtering
Overview
Origin filtering ensures that each model is trained only on samples from time windows that match its production execution schedule. Day-ahead models see only morning origins; strategic models see only afternoon origins. This prevents information leakage and aligns training conditions with deployment reality.
Why Filter Origins?
Consider a model trained on all origins (00:00–23:00 UTC). At 10:00 UTC, it has access to the morning’s weather, overnight prices, and early demand data. At 15:00 UTC, it additionally has access to D+1 published prices, afternoon demand, and updated weather.
A model trained on mixed origins would learn to use D+1 prices in some samples (when available at 15:00) and not in others (when unavailable at 10:00). This creates ambiguity: the model cannot tell which information set it is operating with.
Origin filtering resolves this by training separate models on separate origin windows, each with a consistent information set.
Filter Configurations
Day-Ahead Origins
    DAYAHEAD_ORIGIN_HOURS = range(8, 13)  # 08:00, 09:00, 10:00, 11:00, 12:00 UTC

The production day-ahead run executes at ~10:00 UTC. Training includes origins from 08:00–12:00 to provide robustness:
- 08:00–09:00: Slightly earlier origins where some morning data may be incomplete
- 10:00: Exact production time
- 11:00–12:00: Slightly later origins that might have more complete data
This window helps the model perform well even if the production run timing varies by an hour or two.
Strategic Origins
    STRATEGIC_ORIGIN_HOURS = range(13, 19)  # 13:00, 14:00, ..., 18:00 UTC

The production strategic run executes at ~15:00 UTC, after OMIE publishes D+1 prices (~13:00 UTC). Training includes origins from 13:00–18:00:
- 13:00: Earliest possible after OMIE publication
- 15:00: Target production time
- 18:00: Latest reasonable afternoon origin
All strategic origins have access to D+1 published prices, so the model always learns with these features present.
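The two windows above can be sanity-checked mechanically. The following is a minimal sketch, not part of the documented pipeline: `PRODUCTION_HOUR_UTC`, `WINDOWS`, and `check_windows` are hypothetical names introduced here, and the production hours are the approximate values quoted in the text.

```python
DAYAHEAD_ORIGIN_HOURS = range(8, 13)
STRATEGIC_ORIGIN_HOURS = range(13, 19)

# Assumption: approximate production run hours, taken from the text above.
PRODUCTION_HOUR_UTC = {"dayahead": 10, "strategic": 15}
WINDOWS = {
    "dayahead": DAYAHEAD_ORIGIN_HOURS,
    "strategic": STRATEGIC_ORIGIN_HOURS,
}


def check_windows(windows, production_hours):
    """Return a list of problems; an empty list means the config is consistent."""
    problems = []
    # Each production hour should fall inside its own training window.
    for name, hours in windows.items():
        if production_hours[name] not in hours:
            problems.append(f"{name}: production hour outside training window")
    # No hour should belong to two windows at once.
    names = list(windows)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = set(windows[a]) & set(windows[b])
            if overlap:
                problems.append(f"{a}/{b}: overlapping hours {sorted(overlap)}")
    return problems


problems = check_windows(WINDOWS, PRODUCTION_HOUR_UTC)
```

With the windows as documented, `problems` is empty. Widening the day-ahead window to `range(8, 14)` would report hour 13 as shared with the strategic window.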
Implementation
During data construction, origins outside the allowed window are skipped:
    for i in range(min_history, len(df)):
        origin_dt = df.index[i]
        if allowed_hours is not None and origin_dt.hour not in allowed_hours:
            continue  # skip this origin
        # ... construct samples for this origin

This filtering happens before any feature computation, ensuring no processing time is wasted on irrelevant origins.
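The same skip logic can be tried in isolation. The sketch below uses only the standard library and a synthetic list of hourly timestamps in place of the real `df.index`; `filter_origins` is a hypothetical helper name, not a function from the pipeline.

```python
from datetime import datetime, timedelta, timezone

DAYAHEAD_ORIGIN_HOURS = range(8, 13)


def filter_origins(timestamps, allowed_hours=None):
    """Keep timestamps whose hour is in allowed_hours; keep all if None."""
    if allowed_hours is None:
        return list(timestamps)
    allowed = set(allowed_hours)
    return [ts for ts in timestamps if ts.hour in allowed]


# Two days of hourly UTC timestamps (illustrative data only).
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
hourly = [start + timedelta(hours=h) for h in range(48)]

# 5 origins per day survive the day-ahead filter, so 10 over two days.
kept = filter_origins(hourly, DAYAHEAD_ORIGIN_HOURS)
print(len(kept), sorted({ts.hour for ts in kept}))
```

Passing `allowed_hours=None` returns every timestamp, which mirrors the unfiltered legacy behavior.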
Impact on Training Set Size
Origin filtering reduces the number of training samples compared to using all 24 hourly origins:
| Product | Origins per Day | Fraction of All Origins |
|---|---|---|
| Day-ahead | 5 (08–12) | 21% |
| Strategic | 6 (13–18) | 25% |
| Legacy (unfiltered) | 24 (00–23) | 100% |
With one year of hourly data, the day-ahead model has approximately 1,825 origins (365 days × 5 origins per day), while the unfiltered legacy model would have 8,760 (365 × 24). The reduction is justified because:
- Each filtered sample is more representative of production conditions
- Including irrelevant origins adds noise (e.g., 03:00 UTC origin is never used in production)
- The remaining sample count is still sufficient for gradient boosting training
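The figures in this section follow directly from the window lengths. As a rough check, the arithmetic can be reproduced as below; the sketch assumes a 365-day year with complete hourly data and ignores origins skipped for insufficient history, so real counts will be somewhat lower. `origin_stats` is a hypothetical helper.

```python
HOURS_PER_DAY = 24
DAYS = 365  # assumption: non-leap year, no missing hours

windows = {
    "dayahead": range(8, 13),
    "strategic": range(13, 19),
    "legacy": range(0, 24),
}


def origin_stats(hours, days=DAYS):
    """Return (origins per day, fraction of all hourly origins, origins per year)."""
    per_day = len(hours)
    return per_day, per_day / HOURS_PER_DAY, per_day * days


stats = {name: origin_stats(hours) for name, hours in windows.items()}
for name, (per_day, fraction, total) in stats.items():
    print(f"{name:10s} {per_day:2d}/day  {fraction:6.1%}  {total:,} origins/year")
```

Under the same assumptions, the strategic window yields 365 × 6 = 2,190 origins per year.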
Legacy Mode
The legacy mode (23:00 UTC origin) uses no origin filtering — it trains on all available hourly origins:
    if run_mode in ("dayahead", "strategic"):
        allowed_hours = set(DAYAHEAD_ORIGIN_HOURS if run_mode == "dayahead" else STRATEGIC_ORIGIN_HOURS)
    else:
        allowed_hours = None  # no filter (legacy)

This backward-compatible behavior is retained for comparison and transition purposes.
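The mode selection can also be wrapped in a small function, which makes it easier to test. This is one possible arrangement rather than the pipeline's actual structure: `resolve_allowed_hours` is a hypothetical name, and it keeps the original behavior of treating any other mode as unfiltered.

```python
DAYAHEAD_ORIGIN_HOURS = range(8, 13)
STRATEGIC_ORIGIN_HOURS = range(13, 19)


def resolve_allowed_hours(run_mode):
    """Map a run mode to its set of allowed origin hours, or None for no filter."""
    if run_mode == "dayahead":
        return set(DAYAHEAD_ORIGIN_HOURS)
    if run_mode == "strategic":
        return set(STRATEGIC_ORIGIN_HOURS)
    # Any other mode (including legacy) falls through to no filtering,
    # matching the else-branch in the snippet above.
    return None


for mode in ("dayahead", "strategic", "legacy"):
    print(mode, resolve_allowed_hours(mode))
```

Because unknown modes silently fall back to no filter, a misspelled mode name would train on all 24 origins. Raising an error for unrecognized modes is a stricter alternative if that fallback is not wanted.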
Interaction with D+1 Price Features
Origin filtering and D+1 price features are tightly coupled:
| Product | Origin Window | D+1 Prices | Reason |
|---|---|---|---|
| Day-ahead | 08–12 UTC | Not available | OMIE hasn’t published yet |
| Strategic | 13–18 UTC | Available | OMIE published ~13:00 UTC |
The strategic model’s feature engineering includes D+1 price extraction precisely because all its training origins are after the OMIE publication time. The day-ahead model’s feature engineering excludes D+1 prices because none of its training origins have access to them.
This consistency between training and production information sets is the core purpose of origin filtering.
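The coupling described in this section can be expressed as a check that every hour in a window sees the same D+1 price availability. The sketch below assumes publication happens at exactly 13:00 UTC, whereas the text only gives an approximate time; `OMIE_PUBLICATION_HOUR_UTC`, `d1_prices_available`, and `availability_in_window` are hypothetical names.

```python
DAYAHEAD_ORIGIN_HOURS = range(8, 13)
STRATEGIC_ORIGIN_HOURS = range(13, 19)

# Assumption: treat the approximate ~13:00 UTC publication time as exact.
OMIE_PUBLICATION_HOUR_UTC = 13


def d1_prices_available(origin_hour):
    """True if an origin at this hour falls at or after the assumed publication hour."""
    return origin_hour >= OMIE_PUBLICATION_HOUR_UTC


def availability_in_window(hours):
    """Return the set of availability states seen across a window."""
    return {d1_prices_available(h) for h in hours}


dayahead_states = availability_in_window(DAYAHEAD_ORIGIN_HOURS)    # {False}
strategic_states = availability_in_window(STRATEGIC_ORIGIN_HOURS)  # {True}
mixed_states = availability_in_window(range(24))                   # {False, True}

# A window is consistent when it contains exactly one availability state.
for name, states in [("dayahead", dayahead_states),
                     ("strategic", strategic_states),
                     ("unfiltered", mixed_states)]:
    print(name, "consistent" if len(states) == 1 else "mixed")
```

The unfiltered window contains both states, which is the ambiguity that separate origin windows are meant to remove.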