Data Quality & Validation
Overview
Raw energy market data is messy. APIs return partial results, sub-hourly summation inflates values, and timestamps can be duplicated or missing. The EPF pipeline implements four validation layers that check incoming data before it enters the forecasting models.
Layer 1: Completeness Checks
Every data fetch is checked against a minimum completeness threshold before storage:
Completeness = (non-null rows) / (expected rows for time range)

Threshold: 90% minimum

If a fetch returns less than 90% complete data (e.g., 18 of 24 hours), the pipeline logs an error and raises a ValueError rather than storing partial data that could corrupt downstream features.
This prevents scenarios where an API outage returns only a few hours of data, which would create misleading lag features and rolling statistics.
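A minimal sketch of such a completeness gate, using pandas. The function name and signature are illustrative, not the pipeline's actual API:

```python
import pandas as pd

MIN_COMPLETENESS = 0.90  # Layer 1 threshold: 90% minimum

def check_completeness(df, start, end, freq="h"):
    """Return the completeness ratio; raise ValueError if below threshold.

    Hypothetical helper: compares the non-null rows actually fetched
    against the rows expected for the requested time range.
    """
    expected = pd.date_range(start=start, end=end, freq=freq)
    non_null = df.reindex(expected).dropna(how="all").shape[0]
    ratio = non_null / len(expected)
    if ratio < MIN_COMPLETENESS:
        raise ValueError(
            f"Fetch only {ratio:.0%} complete "
            f"({non_null}/{len(expected)} rows); refusing to store"
        )
    return ratio
```

With this shape, a fetch returning 18 of 24 hours (75%) raises before anything touches storage, while a complete day passes through unchanged.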
Layer 2: Range Validation
Price Sanity Check
The most critical validation: if the maximum day-ahead price exceeds 500 EUR/MWh, the pipeline aborts with a clear error message:
```
Price data appears inflated (max=210.40 EUR/MWh).
Aborting storage. Check time_agg parameter in config.
```

Why 500 EUR/MWh? Spanish day-ahead prices historically range from -20 to ~200 EUR/MWh, with rare spikes to ~400 EUR/MWh during extreme events. A price above 500 EUR/MWh almost certainly indicates a data processing error: specifically, missing the time_agg=average parameter, which causes the REE API to sum four quarter-hourly values instead of averaging them, inflating prices approximately 4×.
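A sketch of what this guard could look like in code; the constant and function names are illustrative:

```python
MAX_PLAUSIBLE_PRICE = 500.0  # EUR/MWh; above this we assume a time_agg summation bug

def check_price_sanity(prices):
    """Hypothetical Layer 2 check: abort storage if prices look ~4x inflated."""
    max_price = max(prices)
    if max_price > MAX_PLAUSIBLE_PRICE:
        raise ValueError(
            f"Price data appears inflated (max={max_price:.2f} EUR/MWh). "
            "Aborting storage. Check time_agg parameter in config."
        )
```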
Other Range Checks
- Demand: Must be positive (negative demand is physically impossible)
- Generation: Must be non-negative (individual sources cannot generate negative power)
- Temperature: Must fall within -30°C to +55°C (physically plausible for Spain)
- Wind speed: Must be non-negative
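One way to express these bounds is a single lookup table, so every variable goes through the same check. The table below mirrors the list above; the variable keys and helper name are assumptions for illustration:

```python
# Hypothetical per-variable bounds; None means unbounded on that side.
RANGE_LIMITS = {
    "demand": (0.0, None),           # negative demand is physically impossible
    "generation": (0.0, None),       # sources cannot generate negative power
    "temperature_c": (-30.0, 55.0),  # physically plausible for Spain
    "wind_speed": (0.0, None),       # non-negative
}

def check_range(variable, values):
    """Raise ValueError if any value falls outside the plausible range."""
    lo, hi = RANGE_LIMITS[variable]
    for v in values:
        if (lo is not None and v < lo) or (hi is not None and v > hi):
            raise ValueError(
                f"{variable} value {v} outside plausible range [{lo}, {hi}]"
            )
```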
Layer 3: Temporal Consistency
Duplicate Handling
Duplicate timestamps are removed (keeping the first occurrence). This handles cases where overlapping API queries return the same data points twice.
Chronological Ordering
The index is sorted chronologically after every merge operation. Out-of-order timestamps would corrupt lag calculations (e.g., lag_1h would point to the wrong row).
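In pandas terms, the deduplicate-then-sort step amounts to two idioms; this sketch assumes a DatetimeIndex on the frame:

```python
import pandas as pd

def dedupe_and_sort(df):
    """Drop duplicate timestamps (keeping the first) and restore chronological order."""
    df = df[~df.index.duplicated(keep="first")]  # first occurrence wins
    return df.sort_index()                       # lag features need monotonic time
```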
Gap Detection
Missing hours are detected and logged. The feature engineering pipeline handles gaps through forward-filling for slowly-changing variables (temperature, commodity prices) and NaN propagation for rapidly-changing variables (prices, demand). Rows with insufficient lag warmup (first 168 hours = 7 days) are dropped from training sets.
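Gap detection can be sketched as a set difference between the expected hourly grid and the timestamps actually present; the function name is illustrative:

```python
import pandas as pd

def detect_gaps(df, freq="h"):
    """Return the timestamps missing between the first and last observation."""
    expected = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    return expected.difference(df.index)  # empty index means no gaps
```

The returned index can then be logged, and downstream feature engineering decides per variable whether to forward-fill or let NaN propagate.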
Layer 4: Cross-Variable Plausibility
time_agg Configuration
The root cause of the price inflation bug is documented in the pipeline’s configuration. Every REE indicator is explicitly configured with time_agg=average:
```python
REE_INDICATORS = {
    600: {"name": "day_ahead_price", "time_agg": "average", ...},
    1293: {"name": "real_demand", "time_agg": "average", ...},
    # ... all 16 indicators
}
```

This is enforced at the configuration level, not at the query level, so individual API calls cannot accidentally omit the parameter.
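One way configuration-level enforcement can work is a query builder that always reads time_agg from the indicator table, so no call site can pass (or forget) it. This is a sketch under assumptions: the builder name, the parameter names in the returned dict, and the two-entry config are illustrative, not the pipeline's actual code:

```python
# Hypothetical two-entry slice of the indicator config.
REE_INDICATORS = {
    600: {"name": "day_ahead_price", "time_agg": "average"},
    1293: {"name": "real_demand", "time_agg": "average"},
}

def build_query_params(indicator_id, start, end):
    """Assemble API query params; time_agg is never a call-site argument."""
    cfg = REE_INDICATORS[indicator_id]
    return {
        "start_date": start,
        "end_date": end,
        "time_agg": cfg["time_agg"],  # always taken from config, never omitted
    }
```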
API Retry Strategy
Transient API failures are handled with exponential backoff:
| Attempt | Delay |
|---|---|
| 1st retry | 1 second |
| 2nd retry | 4 seconds |
| 3rd retry | 16 seconds |
After 3 failed attempts, the pipeline logs the failure and continues with available data rather than blocking the entire forecast run.
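The delays in the table follow 4^(n-1) seconds for the n-th retry. A minimal sketch of the loop, assuming the fetch raises ConnectionError on transient failure (the function name and the injectable sleep are illustrative):

```python
import time

def fetch_with_retry(fetch, retries=3, sleep=time.sleep):
    """Call fetch(), retrying on ConnectionError with 1s / 4s / 16s backoff.

    Returns None after the final failure so the caller can continue
    with available data instead of blocking the forecast run.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries:
                return None          # give up; log and move on
            sleep(4 ** attempt)      # 1, 4, 16 seconds before retries 1..3
```

Passing sleep as a parameter keeps the backoff schedule testable without actually waiting.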
Combine-First Merge Strategy
When merging new data into existing tables, the pipeline uses a combine_first strategy: new values only fill gaps in existing data. This prevents partial API responses from overwriting complete historical records.
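The asymmetry is the point: pandas' combine_first keeps the caller's existing values and takes incoming values only where the existing series is null. A small illustration with made-up prices:

```python
import pandas as pd

# Existing table has a gap at 01:00; the incoming response disagrees elsewhere.
hours = pd.date_range("2024-01-01", periods=3, freq="h")
existing = pd.Series([100.0, None, 95.0], index=hours)
incoming = pd.Series([999.0, 102.0, 999.0], index=hours)

# Existing values win; incoming data only fills the 01:00 gap.
merged = existing.combine_first(incoming)
# merged is [100.0, 102.0, 95.0]
```

A plain overwrite merge would have replaced the complete historical values 100.0 and 95.0 with the suspect 999.0 readings; combine_first cannot.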
Validation Flow
```
API Response
 │
 ├─ Layer 1: Completeness ≥ 90%?
 │    └─ No → Abort + log error
 │
 ├─ Layer 2: Price ≤ 500 EUR/MWh?
 │    └─ No → Abort + log error
 │
 ├─ Layer 3: Deduplicate, sort, detect gaps
 │    └─ Log warnings for missing hours
 │
 └─ Layer 4: Merge with combine_first
      └─ Store to SQLite
```

Why Abort Rather Than Impute?
The pipeline deliberately aborts on critical failures (inflated prices, insufficient completeness) rather than attempting to impute or fix the data. This is a design choice:
- Imputation masks errors: silently filling in bad data can propagate through lag features, rolling statistics, and model predictions without any visible warning
- Forecasting with bad data is worse than no forecast: an inflated price forecast could lead to costly trading decisions
- Manual review is fast: the error message indicates the exact problem (e.g., a time_agg misconfiguration), enabling quick resolution