Data Quality & Validation
Overview
Raw energy market data is messy. APIs return partial results, sub-hourly summation inflates values, and timestamps can be duplicated or missing. The EPF pipeline implements four validation layers that check incoming data before it enters the forecasting models.
Layer 1: Completeness Checks
Every data fetch is checked against a minimum completeness threshold before storage:
Completeness = (non-null rows) / (expected rows for time range)

Threshold: 90% minimum

If a fetch returns less than 90% complete data (e.g., 18 of 24 hours), the pipeline logs an error and raises a ValueError rather than storing partial data that could corrupt downstream features.
This prevents scenarios where an API outage returns only a few hours of data, which would create misleading lag features and rolling statistics.
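A minimal sketch of such a completeness gate, using pandas. The function name and signature are illustrative, not the pipeline's actual API:

```python
import pandas as pd

MIN_COMPLETENESS = 0.90  # Layer 1 threshold: 90% minimum

def check_completeness(df, start, end, freq="h"):
    """Return the completeness ratio; raise ValueError if below threshold.

    Hypothetical helper: compares the non-null rows actually fetched
    against the rows expected for the requested time range.
    """
    expected = pd.date_range(start=start, end=end, freq=freq)
    non_null = df.reindex(expected).dropna(how="all").shape[0]
    ratio = non_null / len(expected)
    if ratio < MIN_COMPLETENESS:
        raise ValueError(
            f"Fetch only {ratio:.0%} complete "
            f"({non_null}/{len(expected)} rows); refusing to store"
        )
    return ratio
```

With this shape, a fetch returning 18 of 24 hours (75%) raises before anything touches storage, while a complete day passes through unchanged.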
Layer 2: Range Validation
Price Sanity Check
The most critical validation: if the maximum day-ahead price exceeds 500 EUR/MWh, the pipeline aborts with a clear error message:
```
Price data appears inflated (max=210.40 EUR/MWh).
Aborting storage. Check time_agg parameter in config.
```

Why 500 EUR/MWh? Spanish day-ahead prices historically range from -20 to ~200 EUR/MWh, with rare spikes to ~400 EUR/MWh during extreme events. A price above 500 EUR/MWh almost certainly indicates a data processing error: specifically, missing the time_agg=average parameter, which causes the REE API to sum four quarter-hourly values instead of averaging them, inflating prices approximately 4×.
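A sketch of what this guard could look like in code; the constant and function names are illustrative:

```python
MAX_PLAUSIBLE_PRICE = 500.0  # EUR/MWh; above this we assume a time_agg summation bug

def check_price_sanity(prices):
    """Hypothetical Layer 2 check: abort storage if prices look ~4x inflated."""
    max_price = max(prices)
    if max_price > MAX_PLAUSIBLE_PRICE:
        raise ValueError(
            f"Price data appears inflated (max={max_price:.2f} EUR/MWh). "
            "Aborting storage. Check time_agg parameter in config."
        )
```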
Other Range Checks
- Demand: Must be positive (negative demand is physically impossible)
- Generation: Must be non-negative (individual sources cannot generate negative power)
- Temperature: Must fall within -30°C to +55°C (physically plausible for Spain)
- Wind speed: Must be non-negative
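One way to express these bounds is a single lookup table, so every variable goes through the same check. The table below mirrors the list above; the variable keys and helper name are assumptions for illustration:

```python
# Hypothetical per-variable bounds; None means unbounded on that side.
RANGE_LIMITS = {
    "demand": (0.0, None),           # negative demand is physically impossible
    "generation": (0.0, None),       # sources cannot generate negative power
    "temperature_c": (-30.0, 55.0),  # physically plausible for Spain
    "wind_speed": (0.0, None),       # non-negative
}

def check_range(variable, values):
    """Raise ValueError if any value falls outside the plausible range."""
    lo, hi = RANGE_LIMITS[variable]
    for v in values:
        if (lo is not None and v < lo) or (hi is not None and v > hi):
            raise ValueError(
                f"{variable} value {v} outside plausible range [{lo}, {hi}]"
            )
```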
Layer 3: Temporal Consistency
Duplicate Handling
Duplicate timestamps are removed (keeping the first occurrence). This handles cases where overlapping API queries return the same data points twice.
Chronological Ordering
The index is sorted chronologically after every merge operation. Out-of-order timestamps would corrupt lag calculations (e.g., lag_1h would point to the wrong row).
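In pandas terms, the deduplicate-then-sort step amounts to two idioms; this sketch assumes a DatetimeIndex on the frame:

```python
import pandas as pd

def dedupe_and_sort(df):
    """Drop duplicate timestamps (keeping the first) and restore chronological order."""
    df = df[~df.index.duplicated(keep="first")]  # first occurrence wins
    return df.sort_index()                       # lag features need monotonic time
```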
Gap Detection
Missing hours are detected and logged. The feature engineering pipeline handles gaps through forward-filling for slowly-changing variables (temperature, commodity prices) and NaN propagation for rapidly-changing variables (prices, demand). Rows with insufficient lag warmup (first 168 hours = 7 days) are dropped from training sets.
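Gap detection can be sketched as a set difference between the expected hourly grid and the timestamps actually present; the function name is illustrative:

```python
import pandas as pd

def detect_gaps(df, freq="h"):
    """Return the timestamps missing between the first and last observation."""
    expected = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    return expected.difference(df.index)  # empty index means no gaps
```

The returned index can then be logged, and downstream feature engineering decides per variable whether to forward-fill or let NaN propagate.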
Layer 4: Cross-Variable Plausibility
time_agg Configuration
The root cause of the price inflation bug is documented in the pipeline’s configuration. Every REE indicator is explicitly configured with time_agg=average:
```python
REE_INDICATORS = {
    600: {"name": "day_ahead_price", "time_agg": "average", ...},
    1293: {"name": "real_demand", "time_agg": "average", ...},
    # ... all 16 indicators
}
```

This is enforced at the configuration level, not at the query level, so individual API calls cannot accidentally omit the parameter.
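One way configuration-level enforcement can work is a query builder that always reads time_agg from the indicator table, so no call site can pass (or forget) it. This is a sketch under assumptions: the builder name, the parameter names in the returned dict, and the two-entry config are illustrative, not the pipeline's actual code:

```python
# Hypothetical two-entry slice of the indicator config.
REE_INDICATORS = {
    600: {"name": "day_ahead_price", "time_agg": "average"},
    1293: {"name": "real_demand", "time_agg": "average"},
}

def build_query_params(indicator_id, start, end):
    """Assemble API query params; time_agg is never a call-site argument."""
    cfg = REE_INDICATORS[indicator_id]
    return {
        "start_date": start,
        "end_date": end,
        "time_agg": cfg["time_agg"],  # always taken from config, never omitted
    }
```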
API Retry Strategy
Transient API failures are handled with exponential backoff:
| Attempt | Delay |
|---|---|
| 1st retry | 1 second |
| 2nd retry | 4 seconds |
| 3rd retry | 16 seconds |
After 3 failed attempts, the pipeline logs the failure and continues with available data rather than blocking the entire forecast run.
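The delays in the table follow 4^(n-1) seconds for the n-th retry. A minimal sketch of the loop, assuming the fetch raises ConnectionError on transient failure (the function name and the injectable sleep are illustrative):

```python
import time

def fetch_with_retry(fetch, retries=3, sleep=time.sleep):
    """Call fetch(), retrying on ConnectionError with 1s / 4s / 16s backoff.

    Returns None after the final failure so the caller can continue
    with available data instead of blocking the forecast run.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries:
                return None          # give up; log and move on
            sleep(4 ** attempt)      # 1, 4, 16 seconds before retries 1..3
```

Passing sleep as a parameter keeps the backoff schedule testable without actually waiting.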
Combine-First Merge Strategy
When merging new data into existing tables, the pipeline uses a combine_first strategy: new values only fill gaps in existing data. This prevents partial API responses from overwriting complete historical records.
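The asymmetry is the point: pandas' combine_first keeps the caller's existing values and takes incoming values only where the existing series is null. A small illustration with made-up prices:

```python
import pandas as pd

# Existing table has a gap at 01:00; the incoming response disagrees elsewhere.
hours = pd.date_range("2024-01-01", periods=3, freq="h")
existing = pd.Series([100.0, None, 95.0], index=hours)
incoming = pd.Series([999.0, 102.0, 999.0], index=hours)

# Existing values win; incoming data only fills the 01:00 gap.
merged = existing.combine_first(incoming)
# merged is [100.0, 102.0, 95.0]
```

A plain overwrite merge would have replaced the complete historical values 100.0 and 95.0 with the suspect 999.0 readings; combine_first cannot.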
Validation Flow
```
API Response
 │
 ├─ Layer 1: Completeness ≥ 90%?
 │    └─ No → Abort + log error
 │
 ├─ Layer 2: Price ≤ 500 EUR/MWh?
 │    └─ No → Abort + log error
 │
 ├─ Layer 3: Deduplicate, sort, detect gaps
 │    └─ Log warnings for missing hours
 │
 └─ Layer 4: Merge with combine_first
      └─ Store to SQLite
```

Why Abort Rather Than Impute?
The pipeline deliberately aborts on critical failures (inflated prices, insufficient completeness) rather than attempting to impute or fix the data. This is a design choice:
- Imputation masks errors: silently filling in bad data can propagate through lag features, rolling statistics, and model predictions without any visible warning
- Forecasting with bad data is worse than no forecast: an inflated price forecast could lead to costly trading decisions
- Manual review is fast: the error message indicates the exact problem (e.g., a time_agg misconfiguration), enabling quick resolution