Data Construction

Overview

The direct forecasting approach requires training data structured as origin-horizon pairs: each sample represents a specific prediction origin (when the forecast is made) paired with a specific target horizon (what’s being predicted). This page explains how these samples are constructed.

The Origin-Horizon Pair

Each training sample captures:

  • Origin: A point in time when a forecast could be made, with all features computed from data available at that moment
  • Target: A future hour (or quarter-hour) that the forecast aims to predict
  • Features: Everything known at origin time — lags, rolling statistics, weather, commodities, temporal encoding
  • Label: The actual price at the target time (known in retrospect)

For example:

    Origin: 2025-09-15 10:00 UTC
    Target: 2025-09-16 14:00 UTC (28 hours ahead, DA2 group)
    Features at origin:
        price_lag_24h = price at 2025-09-14 10:00
        demand_rolling_24h = mean demand over last 24h
        hour_sin = sin(2π × 10/24)
        target_hour_sin = sin(2π × 14/24)
        hours_ahead = 28
        ...
    Label: actual_price at 2025-09-16 14:00 = 52.30 EUR/MWh
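
The derived fields in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `cyclical_hour` is illustrative and not taken from the pipeline.

```python
import math
from datetime import datetime, timezone

def cyclical_hour(dt: datetime) -> float:
    """Sine encoding of the hour of day: sin(2π × hour / 24)."""
    return math.sin(2 * math.pi * dt.hour / 24)

origin = datetime(2025, 9, 15, 10, 0, tzinfo=timezone.utc)
target = datetime(2025, 9, 16, 14, 0, tzinfo=timezone.utc)

# Whole hours between the origin and the target
hours_ahead = int((target - origin).total_seconds() // 3600)
hour_sin = cyclical_hour(origin)
target_hour_sin = cyclical_hour(target)

print(hours_ahead)                # 28
print(round(hour_sin, 3))         # 0.5
print(round(target_hour_sin, 3))  # -0.5
```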

Construction Algorithm

For each row i in the historical dataset (after warmup period):

    origin_dt = index[i]  # the origin timestamp

    # Skip if origin hour not in allowed range
    if origin_dt.hour not in allowed_hours:
        continue

    # Extract features known at origin time
    origin_features = {
        "price_lag_1h": price[i-1],
        "price_lag_24h": price[i-24],
        "price_lag_168h": price[i-168],
        "rolling_24h": mean(price[i-24:i]),
        ...
        "origin_hour_sin": sin(2π × origin_dt.hour / 24),
        "origin_dow_sin": sin(2π × origin_dt.dayofweek / 7),
    }

    # For each hour in the horizon group (e.g., 14-25 for DA1)
    for h in horizon_hours:
        target_idx = i + h

        # Stop once the target falls beyond the end of the dataset
        if target_idx >= len(index):
            break
        target_dt = index[target_idx]

        # Add target time features
        sample = origin_features.copy()
        sample["target_hour_sin"] = sin(2π × target_dt.hour / 24)
        sample["target_dow_sin"] = sin(2π × target_dt.dayofweek / 7)
        sample["target_is_weekend"] = (target_dt.dayofweek >= 5)
        sample["hours_ahead"] = h

        # Add target weather (if available; use forecasts, not realized values)
        sample["target_temp"] = weather[target_idx]

        # Label
        sample["target_price"] = price[target_idx]
        training_samples.append(sample)
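
The pseudocode above can be turned into a runnable sketch. The version below uses only the standard library and synthetic data. The function name `build_samples`, its parameters, and the price series are assumptions made for illustration, and the weather and elided features are left out. It also stops at targets that fall past the end of the data, which the real pipeline may handle differently.

```python
import math
from datetime import datetime, timedelta, timezone

def build_samples(index, price, horizon_hours, allowed_hours, warmup=168):
    """Build origin-horizon samples from hourly timestamps and prices (illustrative)."""
    samples = []
    for i in range(warmup, len(index)):
        origin_dt = index[i]
        if origin_dt.hour not in allowed_hours:
            continue
        # Features computed only from rows before the origin
        origin_features = {
            "price_lag_1h": price[i - 1],
            "price_lag_24h": price[i - 24],
            "price_lag_168h": price[i - 168],
            "rolling_24h": sum(price[i - 24:i]) / 24,
            "origin_hour_sin": math.sin(2 * math.pi * origin_dt.hour / 24),
            "origin_dow_sin": math.sin(2 * math.pi * origin_dt.weekday() / 7),
        }
        for h in horizon_hours:  # assumed to be in ascending order
            target_idx = i + h
            if target_idx >= len(index):
                break  # later horizons are out of range as well
            target_dt = index[target_idx]
            sample = dict(origin_features)
            sample["target_hour_sin"] = math.sin(2 * math.pi * target_dt.hour / 24)
            sample["target_dow_sin"] = math.sin(2 * math.pi * target_dt.weekday() / 7)
            sample["target_is_weekend"] = target_dt.weekday() >= 5
            sample["hours_ahead"] = h
            sample["target_price"] = price[target_idx]  # label
            samples.append(sample)
    return samples

# Synthetic data: 10 days of hourly timestamps with a made-up price curve
start = datetime(2025, 1, 1, tzinfo=timezone.utc)
index = [start + timedelta(hours=k) for k in range(240)]
price = [50.0 + 10.0 * math.sin(2 * math.pi * k / 24) for k in range(240)]

samples = build_samples(index, price, horizon_hours=range(14, 26), allowed_hours={10})
print(len(samples))  # 24: two usable origins after warmup, 12 targets each
```

In this toy dataset the third 10:00 origin after warmup produces no samples, because all of its DA1 targets lie beyond the last row.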

Warmup Period

The first 168 rows (7 days) of data are excluded from training because the longest lag feature (price_lag_168h) requires 7 days of history. For 15-minute models, the warmup is 672 steps (7 × 96).
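
The warmup length follows directly from the longest lag and the data resolution. A minimal sketch, assuming a hypothetical `LAG_HOURS` mapping that lists the lag features shown on this page:

```python
# Lag features and their look-back in hours (illustrative subset)
LAG_HOURS = {
    "price_lag_1h": 1,
    "price_lag_24h": 24,
    "price_lag_168h": 168,
}

def warmup_steps(lag_hours: dict, steps_per_hour: int) -> int:
    """Rows to skip so that the longest lag always has history behind it."""
    return max(lag_hours.values()) * steps_per_hour

hourly_warmup = warmup_steps(LAG_HOURS, steps_per_hour=1)
quarter_hour_warmup = warmup_steps(LAG_HOURS, steps_per_hour=4)

print(hourly_warmup)        # 168
print(quarter_hour_warmup)  # 672
```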

Feature Availability at Origin

A critical constraint: features must only use information that would actually be available at the origin time in production.

Available at origin:

  • All historical prices up to origin time
  • Weather observations up to origin time
  • Commodity prices (daily, available by morning)
  • REE generation and demand data (near real-time)
  • D+1 published prices (strategic mode only, after ~13:00 UTC)

NOT available at origin:

  • Future prices (the target we’re predicting)
  • Future demand (except the REE forecast)
  • Realized weather beyond origin time (forecasts may be used)
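
One way to enforce this constraint is to record when each feature value became known and reject anything known only after the origin. The sketch below is an assumption about how such a guard could look, not the pipeline's actual check; `find_leaks` and the feature names are hypothetical. Note that the timestamp is the time the value was published, so a weather forecast issued before the origin passes even though it describes a later hour.

```python
from datetime import datetime, timezone

def find_leaks(origin, known_at):
    """Return the names of features whose values became known after the origin."""
    return [name for name, ts in known_at.items() if ts > origin]

origin = datetime(2025, 9, 15, 10, 0, tzinfo=timezone.utc)

# Publication time of each feature value (illustrative)
known_at = {
    "price_lag_24h": datetime(2025, 9, 14, 10, 0, tzinfo=timezone.utc),
    "weather_observation": datetime(2025, 9, 15, 10, 0, tzinfo=timezone.utc),
    "target_temp_forecast": datetime(2025, 9, 15, 6, 0, tzinfo=timezone.utc),
    "realized_target_temp": datetime(2025, 9, 16, 14, 0, tzinfo=timezone.utc),
}

leaks = find_leaks(origin, known_at)
print(leaks)  # ['realized_target_temp']
```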

Samples Per Horizon Group

Each horizon group produces multiple samples per origin:

Group   Hours in Group          Samples per Origin
DA1     12 (hours 14–25)        12
DA2     12 (hours 26–37)        12
S1      24 (hours 33–56)        24
S2      24 (hours 57–80)        24
S3      24 (hours 81–104)       24
S4      24 (hours 105–128)      24
S5      48 (hours 129–176)      48

With 1 year of hourly data and hourly origins within the allowed window (5 origin hours per day for day-ahead; the S5 figure uses 6 origin hours per day), the approximate sample counts are:

DA1: 365 days × 5 origins/day × 12 hours = 21,900 samples
S5: 365 days × 6 origins/day × 48 hours = 105,120 samples
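
The arithmetic is a simple product, sketched below. The function name is illustrative, and the result is an upper bound because it ignores the warmup rows and the origins near the end of the data whose targets are not yet available.

```python
def samples_per_group(days: int, origins_per_day: int, hours_in_group: int) -> int:
    """Upper bound on samples: one sample per (origin, target hour) pair."""
    return days * origins_per_day * hours_in_group

da1_samples = samples_per_group(days=365, origins_per_day=5, hours_in_group=12)
s5_samples = samples_per_group(days=365, origins_per_day=6, hours_in_group=48)

print(da1_samples)  # 21900
print(s5_samples)   # 105120
```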

Data Validation During Construction

During construction, the pipeline logs:

  • Target price range: Min, max, mean of target prices (catches data issues)
  • Feature completeness: Percentage of non-null features (drops rows with too many missing values)
  • Sample count: Total samples per horizon group (ensures sufficient training data)
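
These three checks can be computed in a single pass over the samples of one horizon group. The sketch below is an assumption about their shape, not the pipeline's logging code; `summarize_group` and the sample values are hypothetical, and missing values are represented as `None` or NaN.

```python
import math

def is_missing(value) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def summarize_group(samples, label_key="target_price"):
    """Summary statistics for one horizon group's samples."""
    if not samples:
        raise ValueError("no samples in this horizon group")
    labels = [s[label_key] for s in samples]
    total_cells = sum(len(s) for s in samples)
    non_null = sum(1 for s in samples for v in s.values() if not is_missing(v))
    return {
        "sample_count": len(samples),
        "target_min": min(labels),
        "target_max": max(labels),
        "target_mean": sum(labels) / len(labels),
        "completeness_pct": 100.0 * non_null / total_cells,
    }

# Two made-up samples, one with a missing weather value
group = [
    {"price_lag_24h": 48.0, "target_temp": None, "target_price": 52.3},
    {"price_lag_24h": 50.0, "target_temp": 21.5, "target_price": 47.7},
]
summary = summarize_group(group)
print(summary["sample_count"])                # 2
print(round(summary["completeness_pct"], 1))  # 83.3
```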