Data Construction
Overview
The direct forecasting approach requires training data structured as origin-horizon pairs: each sample represents a specific prediction origin (when the forecast is made) paired with a specific target horizon (what’s being predicted). This page explains how these samples are constructed.
The Origin-Horizon Pair
Each training sample captures:
- Origin: A point in time when a forecast could be made, with all features computed from data available at that moment
- Target: A future hour (or quarter-hour) that the forecast aims to predict
- Features: Everything known at origin time — lags, rolling statistics, weather, commodities, temporal encoding
- Label: The actual price at the target time (known in retrospect)
```
Origin: 2025-09-15 10:00 UTC
Target: 2025-09-16 14:00 UTC (28 hours ahead, DA2 group)

Features at origin:
    price_lag_24h      = price at 2025-09-14 10:00
    demand_rolling_24h = mean demand over last 24h
    hour_sin           = sin(2π × 10/24)
    target_hour_sin    = sin(2π × 14/24)
    hours_ahead        = 28
    ...

Label: actual_price at 2025-09-16 14:00 = 52.30 EUR/MWh
```

Construction Algorithm
```
for each row i in the historical dataset (after warmup period):
    origin_dt = index[i]  # the origin timestamp

    # Skip if origin hour not in allowed range
    if origin_dt.hour not in allowed_hours:
        continue

    # Extract features known at origin time
    origin_features = {
        price_lag_1h:    price[i - 1],
        price_lag_24h:   price[i - 24],
        price_lag_168h:  price[i - 168],
        rolling_24h:     mean(price[i - 24 : i]),
        ...,
        origin_hour_sin: sin(2π × origin_dt.hour / 24),
        origin_dow_sin:  sin(2π × origin_dt.dayofweek / 7),
    }

    # For each hour in the horizon group (e.g., 14-25 for DA1)
    for h in horizon_hours:
        target_idx = i + h
        target_dt = index[target_idx]

        # Add target time features
        sample = origin_features.copy()
        sample["target_hour_sin"]   = sin(2π × target_dt.hour / 24)
        sample["target_dow_sin"]    = sin(2π × target_dt.dayofweek / 7)
        sample["target_is_weekend"] = (target_dt.dayofweek >= 5)
        sample["hours_ahead"]       = h

        # Add target weather (if available)
        sample["target_temp"] = weather[target_idx]

        # Label
        sample["target_price"] = price[target_idx]

        training_samples.append(sample)
```

Warmup Period
The first 168 rows (7 days) of data are excluded from training because the longest lag feature (price_lag_168h) requires 7 days of history. For 15-minute models, the warmup is 672 steps (7 × 96).
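The construction loop above, including the warmup skip, can be made runnable. A minimal sketch assuming an hourly `pandas` price series: the single allowed origin hour, the DA2 horizon range, and all variable names are illustrative, not the production configuration:

```python
import numpy as np
import pandas as pd

WARMUP = 168                    # longest lag (price_lag_168h) needs 7 days of hourly history
ALLOWED_HOURS = {10}            # illustrative origin window; the real window is configurable
HORIZON_HOURS = range(26, 38)   # DA2 group: hours 26-37 ahead

def build_samples(price: pd.Series) -> pd.DataFrame:
    """Construct origin-horizon training samples from an hourly price series."""
    samples = []
    # Stop early enough that the farthest target index stays in range
    for i in range(WARMUP, len(price) - max(HORIZON_HOURS)):
        origin_dt = price.index[i]
        if origin_dt.hour not in ALLOWED_HOURS:
            continue
        # Features known at origin time only
        origin = {
            "price_lag_1h": price.iloc[i - 1],
            "price_lag_24h": price.iloc[i - 24],
            "price_lag_168h": price.iloc[i - 168],
            "rolling_24h": price.iloc[i - 24 : i].mean(),
            "origin_hour_sin": np.sin(2 * np.pi * origin_dt.hour / 24),
            "origin_dow_sin": np.sin(2 * np.pi * origin_dt.dayofweek / 7),
        }
        # One sample per hour in the horizon group
        for h in HORIZON_HOURS:
            target_dt = price.index[i + h]
            sample = dict(origin)
            sample["target_hour_sin"] = np.sin(2 * np.pi * target_dt.hour / 24)
            sample["target_dow_sin"] = np.sin(2 * np.pi * target_dt.dayofweek / 7)
            sample["target_is_weekend"] = target_dt.dayofweek >= 5
            sample["hours_ahead"] = h
            sample["target_price"] = price.iloc[i + h]
            samples.append(sample)
    return pd.DataFrame(samples)

# Example: two weeks of synthetic hourly prices
idx = pd.date_range("2025-09-01", periods=14 * 24, freq="h", tz="UTC")
prices = pd.Series(40 + 10 * np.sin(2 * np.pi * idx.hour / 24), index=idx)
df = build_samples(prices)
```

With two synthetic weeks, the first week is warmup, leaving six daily origins at 10:00; six origins × 12 horizon hours gives 72 rows.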
Feature Availability at Origin
A critical constraint: features must only use information that would actually be available at the origin time in production.
Available at origin:
- All historical prices up to origin time
- Weather observations up to origin time
- Commodity prices (daily, available by morning)
- REE generation and demand data (near real-time)
- D+1 published prices (strategic mode only, after ~13:00 UTC)
NOT available at origin:
- Future prices (the target we’re predicting)
- Future demand (except the REE forecast)
- Realized weather beyond origin time (forecasts may be used)
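One mechanical way to enforce this constraint for lag and rolling features is to derive them with `shift()`, so the feature row at origin *t* can only see values from before *t*. A sketch with a synthetic series whose values equal their row position, which makes alignment easy to verify; the column names are illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic hourly price series: value = row position, so alignment is easy to check
idx = pd.date_range("2025-09-01", periods=200, freq="h", tz="UTC")
price = pd.Series(np.arange(200, dtype=float), index=idx)

# shift() moves values forward in time: the row at origin t only contains
# prices from t-1h, t-24h, or the 24h window ending at t-1h.
feats = pd.DataFrame({
    "price_lag_1h": price.shift(1),
    "price_lag_24h": price.shift(24),
    "rolling_24h": price.shift(1).rolling(24).mean(),  # window excludes t itself
})

# Leakage check: the lag feature at origin t equals the raw price 24h earlier
t = idx[100]
assert feats.loc[t, "price_lag_24h"] == price.loc[t - pd.Timedelta(hours=24)]
```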
Samples Per Horizon Group
Each horizon group produces multiple samples per origin:
| Group | Hours in Group | Samples per Origin |
|---|---|---|
| DA1 | 12 (hours 14–25) | 12 |
| DA2 | 12 (hours 26–37) | 12 |
| S1 | 24 (hours 33–56) | 24 |
| S2 | 24 (hours 57–80) | 24 |
| S3 | 24 (hours 81–104) | 24 |
| S4 | 24 (hours 105–128) | 24 |
| S5 | 48 (hours 129–176) | 48 |
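The group boundaries in the table can be captured in a small lookup, so that samples-per-origin falls out of the hour range directly (a sketch; the dictionary and function names are illustrative):

```python
# Horizon groups from the table above: (first_hour_ahead, last_hour_ahead), inclusive
HORIZON_GROUPS = {
    "DA1": (14, 25),
    "DA2": (26, 37),
    "S1": (33, 56),
    "S2": (57, 80),
    "S3": (81, 104),
    "S4": (105, 128),
    "S5": (129, 176),
}

def horizon_hours(group: str) -> range:
    """Hours-ahead covered by a horizon group; its length = samples per origin."""
    lo, hi = HORIZON_GROUPS[group]
    return range(lo, hi + 1)

assert len(horizon_hours("DA1")) == 12   # 12 samples per origin
assert len(horizon_hours("S5")) == 48    # 48 samples per origin
```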
With 1 year of hourly data and hourly origins within the allowed window (5 origin hours per day for the day-ahead groups, 6 for the strategic groups):

```
DA1: 365 days × 5 origins/day × 12 hours = 21,900 samples
S5:  365 days × 6 origins/day × 48 hours = 105,120 samples
```

Data Validation During Construction
During construction, the pipeline logs:
- Target price range: Min, max, mean of target prices (catches data issues)
- Feature completeness: Percentage of non-null features (drops rows with too many missing values)
- Sample count: Total samples per horizon group (ensures sufficient training data)
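A sketch of those three checks, assuming the constructed samples arrive as a `pandas` DataFrame with a `target_price` column; the thresholds, logger name, and function name are illustrative, not the pipeline's actual values:

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_construction")

def validate_samples(df: pd.DataFrame, group: str, min_samples: int = 1000,
                     max_null_frac: float = 0.2) -> pd.DataFrame:
    """Log target range, feature completeness, and sample count; drop sparse rows."""
    # Target price range: catches unit errors, outliers, corrupted loads
    log.info("%s target price range: min=%.2f max=%.2f mean=%.2f", group,
             df["target_price"].min(), df["target_price"].max(),
             df["target_price"].mean())

    # Feature completeness: drop rows with too many missing feature values
    null_frac = df.drop(columns="target_price").isna().mean(axis=1)
    kept = df[null_frac <= max_null_frac]
    log.info("%s feature completeness: %.1f%% of rows kept", group,
             100 * len(kept) / len(df))

    # Sample count: warn if training data for this group is too thin
    if len(kept) < min_samples:
        log.warning("%s has only %d samples (< %d)", group, len(kept), min_samples)
    return kept
```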