Calibration Testing
What Is Calibration?
A confidence interval is calibrated when its stated coverage matches its empirical coverage. A 90% confidence interval should contain the actual price approximately 90% of the time. If it contains the actual price 95% of the time, the intervals are too wide (conservative). If it contains the actual price only 80% of the time, the intervals are too narrow (overconfident).
Coverage Rate Calculation
Coverage = (count of actuals within interval) / (total predictions with actuals)
For 90% CI:
    within = count where lower_90 ≤ actual_price ≤ upper_90
    coverage_90 = within / total
Target Coverage Levels
| Interval | Target | Acceptable Range |
|---|---|---|
| 50% CI | 50% | 45–55% |
| 90% CI | 90% | 85–95% |
Deviations beyond the acceptable range (±5 percentage points) indicate miscalibration.
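As an illustration of the coverage formula and the ±5 percentage point band, here is a minimal sketch. The function names `coverage_rate` and `classify_coverage` and the tuple layout are assumptions made for this example, not part of the system described here.

```python
from typing import Iterable, Tuple


def coverage_rate(rows: Iterable[Tuple[float, float, float]]) -> float:
    """Fraction of actuals inside their interval.

    Each row is (lower, actual, upper) for a prediction whose actual
    price has already been backfilled. Bounds are inclusive.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no predictions with actuals")
    within = sum(1 for lower, actual, upper in rows if lower <= actual <= upper)
    return within / len(rows)


def classify_coverage(coverage: float, target: float, tolerance: float = 0.05) -> str:
    """Compare empirical coverage with the target and the acceptable band."""
    # Round to avoid spurious failures from float noise at the band edges.
    deviation = round(coverage - target, 9)
    if deviation > tolerance:
        return "overcoverage"   # intervals too wide
    if deviation < -tolerance:
        return "undercoverage"  # intervals too narrow
    return "calibrated"
```

Calling `classify_coverage(coverage_rate(rows), target=0.90)` on the 90% CI rows and again with `target=0.50` on the 50% CI rows reproduces the two checks in the table.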
Calibration by Dimension
Coverage rates are checked across multiple dimensions to identify systematic miscalibration:
By Horizon Bucket
Longer horizons typically have wider intervals and are more prone to calibration drift:
| Bucket | Target coverage (90% CI) | Check |
|---|---|---|
| DA1 (D+1 morning) | 90% | Separate residual distribution |
| DA2 (D+1 afternoon) | 90% | Separate residual distribution |
| S1 (D+2) | 90% | Separate residual distribution |
| S5 (D+6–D+7) | 90% | Widest intervals, check for overcoverage |
By Hour of Day
Peak hours (10:00–14:00) have higher price volatility. Intervals should be wider at these hours. If coverage drops below 85% only at peak hours, the calibration residuals may not have enough peak-hour samples.
By Price Level
High-price periods are harder to predict and may show undercoverage (intervals too narrow). Low-price stable periods may show overcoverage (intervals unnecessarily wide).
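The per-dimension checks can be sketched as one grouping function that takes the dimension as a key. The record layout (dicts with `horizon_bucket` and `hour` fields) and the helper names are assumptions for this example. The field names `lower_90`, `actual_price`, and `upper_90` follow the formula given earlier.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Mapping


def coverage_by(
    records: Iterable[Mapping],
    key: Callable[[Mapping], Hashable],
) -> Dict[Hashable, float]:
    """90% CI coverage rate per group, where key(record) names the group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [within, total]
    for rec in records:
        group = key(rec)
        counts[group][1] += 1
        if rec["lower_90"] <= rec["actual_price"] <= rec["upper_90"]:
            counts[group][0] += 1
    return {group: within / total for group, (within, total) in counts.items()}


def is_peak_hour(rec: Mapping) -> bool:
    """Peak window 10:00-14:00; treating 14:00 as off-peak is an assumption."""
    return 10 <= rec["hour"] < 14


def by_bucket(rec: Mapping) -> str:
    """Group key for the horizon-bucket check (hypothetical field name)."""
    return rec["horizon_bucket"]
```

`coverage_by(records, by_bucket)` gives one rate per horizon bucket, and `coverage_by(records, is_peak_hour)` splits peak from off-peak hours. A price-level split works the same way with a key that bins `actual_price`. Groups with few records produce noisy rates, so the group sizes are worth reporting alongside the coverage.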
Recalibration
When coverage drifts outside the acceptable range, the conformal calibrator can be rebuilt from recent prediction data:
1. Collect recent predictions with backfilled actuals (last 168+ hours)
2. Compute residuals: actual - predicted
3. Group by horizon bucket
4. Recompute quantiles for each bucket
5. Replace old calibrator with updated one
The minimum sample size of 168 residuals (one week of hourly predictions) ensures enough data points for reliable quantile estimation.
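The recalibration steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual calibrator: the names `rebuild_calibrator`, `empirical_quantile`, `predicted_price`, and `horizon_bucket` are hypothetical. The linear-interpolation quantile, the representation of the calibrator as per-bucket residual offsets, and applying the 168-sample minimum per bucket are also assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

MIN_SAMPLES = 168  # one week of hourly predictions


def empirical_quantile(sorted_values: List[float], q: float) -> float:
    """Quantile of pre-sorted values using linear interpolation, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def rebuild_calibrator(
    records: Iterable[Mapping],
    lower_q: float = 0.05,
    upper_q: float = 0.95,
    min_samples: int = MIN_SAMPLES,
) -> Dict[str, Tuple[float, float]]:
    """Return {bucket: (lower_offset, upper_offset)} from recent residuals."""
    residuals = defaultdict(list)
    for rec in records:  # steps 2 and 3: residuals grouped by horizon bucket
        residuals[rec["horizon_bucket"]].append(
            rec["actual_price"] - rec["predicted_price"]
        )
    calibrator = {}
    for bucket, values in residuals.items():  # step 4: quantiles per bucket
        if len(values) < min_samples:
            continue  # too few residuals: caller keeps the old entry
        values.sort()
        calibrator[bucket] = (
            empirical_quantile(values, lower_q),
            empirical_quantile(values, upper_q),
        )
    return calibrator


def interval(predicted: float, offsets: Tuple[float, float]) -> Tuple[float, float]:
    """Apply residual offsets to a point forecast to get (lower, upper)."""
    return predicted + offsets[0], predicted + offsets[1]
```

Step 5 then amounts to merging the returned dict over the old calibrator, so buckets skipped for lack of data keep their previous offsets. Passing `lower_q=0.03, upper_q=0.97` gives wider, more conservative intervals.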
Common Miscalibration Patterns
Overcoverage (Intervals Too Wide)
90% CI actual coverage: 97%
Causes:
- Calibration residuals from a volatile period; current market is calmer
- Recent model improvement reduced errors, but calibrator uses old (larger) residuals
Fix: Recalibrate with recent residuals that reflect current accuracy.
Undercoverage (Intervals Too Narrow)
90% CI actual coverage: 82%
Causes:
- Market regime change (e.g., gas crisis, renewable surge) creating larger errors
- Model drift — accuracy has degraded since calibration
- Insufficient calibration samples for the specific horizon bucket
Fix: Recalibrate, investigate model drift, or widen intervals by using a more conservative quantile (e.g., 3rd and 97th percentile instead of 5th and 95th).
Asymmetric Miscalibration
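Actual prices above upper_90: 8% (too many)
Actual prices below lower_90: 2% (fine)
With a 90% interval built from the 5th and 95th percentiles, roughly 5% of actuals should fall on each side. Here the upper bound is too tight (the model underestimates upside risk), yet overall coverage is still 90%, so an aggregate coverage check alone would not catch the problem. This commonly occurs when the calibration period lacked price spikes.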
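Because an asymmetric miss pattern can hide behind a healthy overall coverage figure, it helps to track the two tails separately. A minimal sketch follows. The function name `tail_miss_rates` and the tuple layout are assumptions for illustration.

```python
from typing import Iterable, Tuple


def tail_miss_rates(rows: Iterable[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Return (below_rate, above_rate) for rows of (lower_90, actual_price, upper_90).

    For an interval built from the 5th and 95th percentiles, each rate
    should be close to 0.05.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no predictions with actuals")
    below = sum(1 for lower, actual, _ in rows if actual < lower)
    above = sum(1 for _, actual, upper in rows if actual > upper)
    return below / len(rows), above / len(rows)
```

Coverage equals `1 - below_rate - above_rate`, so a result such as `(0.02, 0.08)` corresponds to 90% coverage overall while still flagging the tight upper bound.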
Relationship to Conformal Prediction
The conformal prediction framework provides asymptotic coverage guarantees: as the calibration set grows, empirical coverage converges to the target level. In practice, with finite samples, some deviation is expected. The calibration testing framework monitors this deviation and triggers recalibration when it exceeds acceptable bounds.
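To get a rough sense of how much finite-sample deviation to expect, one can use the binomial standard error of an observed coverage rate. The sketch below assumes that each prediction is an independent hit or miss, which is a simplification. Hourly price errors are usually autocorrelated, so the real spread is likely larger. The function name is hypothetical.

```python
import math


def coverage_standard_error(target: float, n: int) -> float:
    """Standard error of an empirical coverage rate over n predictions.

    Assumes independent hits with probability `target` (binomial model).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(target * (1.0 - target) / n)


# One week of hourly predictions checked against a 90% interval.
se_week = coverage_standard_error(0.90, 168)
```

Under this simplified model, `se_week` is about 0.023, so the ±5 percentage point band corresponds to roughly two standard errors for a one-week window. Shorter windows or small per-bucket samples will leave the band by chance more often.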