Data Construction
Overview
The direct forecasting approach requires training data structured as origin-horizon pairs: each sample represents a specific prediction origin (when the forecast is made) paired with a specific target horizon (what’s being predicted). This page explains how these samples are constructed.
The Origin-Horizon Pair
Each training sample captures:
- Origin: A point in time when a forecast could be made, with all features computed from data available at that moment
- Target: A future hour (or quarter-hour) that the forecast aims to predict
- Features: Everything known at origin time — lags, rolling statistics, weather, commodities, temporal encoding
- Label: The actual price at the target time (known in retrospect)
```
Origin: 2025-09-15 10:00 UTC
Target: 2025-09-16 14:00 UTC (28 hours ahead, DA2 group)

Features at origin:
    price_lag_24h      = price at 2025-09-14 10:00
    demand_rolling_24h = mean demand over last 24h
    hour_sin           = sin(2π × 10/24)
    target_hour_sin    = sin(2π × 14/24)
    hours_ahead        = 28
    ...

Label: actual_price at 2025-09-16 14:00 = 52.30 EUR/MWh
```

Construction Algorithm
```
for each row i in the historical dataset (after warmup period):
    origin_dt = index[i]  # the origin timestamp

    # Skip if origin hour not in allowed range
    if origin_dt.hour not in allowed_hours:
        continue

    # Extract features known at origin time
    origin_features = {
        price_lag_1h:    price[i - 1],
        price_lag_24h:   price[i - 24],
        price_lag_168h:  price[i - 168],
        rolling_24h:     mean(price[i - 24 : i]),
        ...,
        origin_hour_sin: sin(2π × origin_dt.hour / 24),
        origin_dow_sin:  sin(2π × origin_dt.dayofweek / 7),
    }

    # For each hour in the horizon group (e.g., 14-25 for DA1)
    for h in horizon_hours:
        target_idx = i + h
        target_dt = index[target_idx]

        # Add target time features
        sample = origin_features.copy()
        sample["target_hour_sin"]   = sin(2π × target_dt.hour / 24)
        sample["target_dow_sin"]    = sin(2π × target_dt.dayofweek / 7)
        sample["target_is_weekend"] = (target_dt.dayofweek >= 5)
        sample["hours_ahead"]       = h

        # Add target weather (if available)
        sample["target_temp"] = weather[target_idx]

        # Label
        sample["target_price"] = price[target_idx]

        training_samples.append(sample)
```

Warmup Period
The first 168 rows (7 days) of data are excluded from training because the longest lag feature (price_lag_168h) requires 7 days of history. For 15-minute models, the warmup is 672 steps (7 × 96).
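The construction loop above, including the warmup skip, can be made runnable. A minimal sketch assuming an hourly `pandas` price series: the single allowed origin hour, the DA2 horizon range, and all variable names are illustrative, not the production configuration:

```python
import numpy as np
import pandas as pd

WARMUP = 168                    # longest lag (price_lag_168h) needs 7 days of hourly history
ALLOWED_HOURS = {10}            # illustrative origin window; the real window is configurable
HORIZON_HOURS = range(26, 38)   # DA2 group: hours 26-37 ahead

def build_samples(price: pd.Series) -> pd.DataFrame:
    """Construct origin-horizon training samples from an hourly price series."""
    samples = []
    # Stop early enough that the farthest target index stays in range
    for i in range(WARMUP, len(price) - max(HORIZON_HOURS)):
        origin_dt = price.index[i]
        if origin_dt.hour not in ALLOWED_HOURS:
            continue
        # Features known at origin time only
        origin = {
            "price_lag_1h": price.iloc[i - 1],
            "price_lag_24h": price.iloc[i - 24],
            "price_lag_168h": price.iloc[i - 168],
            "rolling_24h": price.iloc[i - 24 : i].mean(),
            "origin_hour_sin": np.sin(2 * np.pi * origin_dt.hour / 24),
            "origin_dow_sin": np.sin(2 * np.pi * origin_dt.dayofweek / 7),
        }
        # One sample per hour in the horizon group
        for h in HORIZON_HOURS:
            target_dt = price.index[i + h]
            sample = dict(origin)
            sample["target_hour_sin"] = np.sin(2 * np.pi * target_dt.hour / 24)
            sample["target_dow_sin"] = np.sin(2 * np.pi * target_dt.dayofweek / 7)
            sample["target_is_weekend"] = target_dt.dayofweek >= 5
            sample["hours_ahead"] = h
            sample["target_price"] = price.iloc[i + h]
            samples.append(sample)
    return pd.DataFrame(samples)

# Example: two weeks of synthetic hourly prices
idx = pd.date_range("2025-09-01", periods=14 * 24, freq="h", tz="UTC")
prices = pd.Series(40 + 10 * np.sin(2 * np.pi * idx.hour / 24), index=idx)
df = build_samples(prices)
```

With two synthetic weeks, the first week is warmup, leaving six daily origins at 10:00; six origins × 12 horizon hours gives 72 rows.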
Feature Availability at Origin
A critical constraint: features must only use information that would actually be available at the origin time in production.
Available at origin:
- All historical prices up to origin time
- Weather observations up to origin time
- Commodity prices (daily, available by morning)
- REE generation and demand data (near real-time)
- D+1 published prices (strategic mode only, after ~13:00 UTC)
NOT available at origin:
- Future prices (the target we’re predicting)
- Future demand (except the REE forecast)
- Realized weather beyond origin time (forecasts may be used)
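One mechanical way to enforce this constraint for lag and rolling features is to derive them with `shift()`, so the feature row at origin *t* can only see values from before *t*. A sketch with a synthetic series whose values equal their row position, which makes alignment easy to verify; the column names are illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic hourly price series: value = row position, so alignment is easy to check
idx = pd.date_range("2025-09-01", periods=200, freq="h", tz="UTC")
price = pd.Series(np.arange(200, dtype=float), index=idx)

# shift() moves values forward in time: the row at origin t only contains
# prices from t-1h, t-24h, or the 24h window ending at t-1h.
feats = pd.DataFrame({
    "price_lag_1h": price.shift(1),
    "price_lag_24h": price.shift(24),
    "rolling_24h": price.shift(1).rolling(24).mean(),  # window excludes t itself
})

# Leakage check: the lag feature at origin t equals the raw price 24h earlier
t = idx[100]
assert feats.loc[t, "price_lag_24h"] == price.loc[t - pd.Timedelta(hours=24)]
```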
Samples Per Horizon Group
Each horizon group produces multiple samples per origin:
| Group | Hours in Group | Samples per Origin |
|---|---|---|
| DA1 | 12 (hours 14–25) | 12 |
| DA2 | 12 (hours 26–37) | 12 |
| S1 | 24 (hours 33–56) | 24 |
| S2 | 24 (hours 57–80) | 24 |
| S3 | 24 (hours 81–104) | 24 |
| S4 | 24 (hours 105–128) | 24 |
| S5 | 48 (hours 129–176) | 48 |
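The group boundaries in the table can be captured in a small lookup, so that samples-per-origin falls out of the hour range directly (a sketch; the dictionary and function names are illustrative):

```python
# Horizon groups from the table above: (first_hour_ahead, last_hour_ahead), inclusive
HORIZON_GROUPS = {
    "DA1": (14, 25),
    "DA2": (26, 37),
    "S1": (33, 56),
    "S2": (57, 80),
    "S3": (81, 104),
    "S4": (105, 128),
    "S5": (129, 176),
}

def horizon_hours(group: str) -> range:
    """Hours-ahead covered by a horizon group; its length = samples per origin."""
    lo, hi = HORIZON_GROUPS[group]
    return range(lo, hi + 1)

assert len(horizon_hours("DA1")) == 12   # 12 samples per origin
assert len(horizon_hours("S5")) == 48    # 48 samples per origin
```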
With 1 year of hourly data and hourly origins within the allowed window (5 origin hours per day for the day-ahead groups, 6 for the strategic groups):

```
DA1: 365 days × 5 origins/day × 12 hours = 21,900 samples
S5:  365 days × 6 origins/day × 48 hours = 105,120 samples
```

Data Validation During Construction
During construction, the pipeline logs:
- Target price range: Min, max, mean of target prices (catches data issues)
- Feature completeness: Percentage of non-null features (drops rows with too many missing values)
- Sample count: Total samples per horizon group (ensures sufficient training data)
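A sketch of those three checks, assuming the constructed samples arrive as a `pandas` DataFrame with a `target_price` column; the thresholds, logger name, and function name are illustrative, not the pipeline's actual values:

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_construction")

def validate_samples(df: pd.DataFrame, group: str, min_samples: int = 1000,
                     max_null_frac: float = 0.2) -> pd.DataFrame:
    """Log target range, feature completeness, and sample count; drop sparse rows."""
    # Target price range: catches unit errors, outliers, corrupted loads
    log.info("%s target price range: min=%.2f max=%.2f mean=%.2f", group,
             df["target_price"].min(), df["target_price"].max(),
             df["target_price"].mean())

    # Feature completeness: drop rows with too many missing feature values
    null_frac = df.drop(columns="target_price").isna().mean(axis=1)
    kept = df[null_frac <= max_null_frac]
    log.info("%s feature completeness: %.1f%% of rows kept", group,
             100 * len(kept) / len(df))

    # Sample count: warn if training data for this group is too thin
    if len(kept) < min_samples:
        log.warning("%s has only %d samples (< %d)", group, len(kept), min_samples)
    return kept
```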