Calibration Testing

What Is Calibration?

A confidence interval is calibrated when its stated coverage matches its empirical coverage. A 90% confidence interval should contain the actual price approximately 90% of the time. If it contains the actual price 95% of the time, the intervals are too wide (conservative). If it contains the actual price only 80% of the time, the intervals are too narrow (overconfident).

Coverage Rate Calculation

Coverage = (count of actuals within interval) / (total predictions with actuals)

For 90% CI:
  within = count where lower_90 ≤ actual_price ≤ upper_90
  coverage_90 = within / total
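The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the record layout `(lower, upper, actual)` and the function name `coverage_rate` are assumptions made for the example, and predictions whose actual has not been backfilled are represented as `None`.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record layout: (lower, upper, actual); actual is None until backfilled.
Record = Tuple[float, float, Optional[float]]

def coverage_rate(records: Iterable[Record]) -> Optional[float]:
    """Fraction of backfilled actuals that fall inside [lower, upper]."""
    within = 0
    total = 0
    for lower, upper, actual in records:
        if actual is None:
            continue  # no actual yet: excluded from the denominator
        total += 1
        if lower <= actual <= upper:  # bounds are inclusive, as in the formula above
            within += 1
    return within / total if total else None

sample = [
    (40.0, 60.0, 55.0),  # inside the interval
    (40.0, 60.0, 60.0),  # exactly on the bound counts as inside
    (40.0, 60.0, 72.5),  # above the upper bound
    (40.0, 60.0, None),  # actual not yet available
]
print(coverage_rate(sample))  # 2 of the 3 scored predictions are covered
```

Returning `None` for an empty set keeps "no data yet" distinct from "0% coverage".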

Target Coverage Levels

Interval	Target	Acceptable Range
50% CI	50%	45–55%
90% CI	90%	85–95%

Deviations beyond the acceptable range (±5 percentage points) indicate miscalibration.
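A check against the acceptable range might look like the sketch below. The function name `classify_coverage` and the returned labels are hypothetical; the ±5 percentage point tolerance comes from the table above. The deviation is rounded before comparison so that boundary values such as 85% for a 90% target are not rejected because of floating-point error.

```python
def classify_coverage(empirical: float, target: float, tolerance: float = 0.05) -> str:
    """Label empirical coverage relative to the target (both as fractions, e.g. 0.90)."""
    # Round away floating-point noise so that 0.85 vs 0.90 counts as exactly -0.05.
    deviation = round(empirical - target, 9)
    if deviation > tolerance:
        return "overcoverage"   # intervals too wide
    if deviation < -tolerance:
        return "undercoverage"  # intervals too narrow
    return "ok"

print(classify_coverage(0.97, 0.90))
print(classify_coverage(0.82, 0.90))
print(classify_coverage(0.85, 0.90))
```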

Calibration by Dimension

Coverage rates are checked across multiple dimensions to identify systematic miscalibration:

By Horizon Bucket

Longer horizons typically have wider intervals and are more prone to calibration drift:

Bucket	Expected	Check
DA1 (D+1 morning)	90%	Separate residual distribution
DA2 (D+1 afternoon)	90%	Separate residual distribution
S1 (D+2)	90%	Separate residual distribution
S5 (D+6–D+7)	90%	Widest intervals, check for overcoverage
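A per-bucket check can be sketched as a small group-by. The field names `lower_90`, `upper_90`, and `actual_price` follow the formula earlier in this document; the `horizon_bucket` field and the helper name `coverage_by_group` are assumptions for illustration.

```python
from collections import defaultdict

def coverage_by_group(rows, key):
    """Coverage of the 90% interval per group; rows without an actual are skipped."""
    tallies = defaultdict(lambda: [0, 0])  # group -> [within, total]
    for row in rows:
        actual = row["actual_price"]
        if actual is None:
            continue
        tally = tallies[key(row)]
        tally[1] += 1
        if row["lower_90"] <= actual <= row["upper_90"]:
            tally[0] += 1
    return {group: within / total for group, (within, total) in tallies.items()}

rows = [
    {"horizon_bucket": "DA1", "lower_90": 40.0, "upper_90": 60.0, "actual_price": 52.0},
    {"horizon_bucket": "DA1", "lower_90": 40.0, "upper_90": 60.0, "actual_price": 75.0},
    {"horizon_bucket": "S5", "lower_90": 20.0, "upper_90": 90.0, "actual_price": 61.0},
    {"horizon_bucket": "S5", "lower_90": 20.0, "upper_90": 90.0, "actual_price": None},
]
by_bucket = coverage_by_group(rows, key=lambda r: r["horizon_bucket"])
print(by_bucket)
```

Because the grouping key is a function, the same helper could be keyed on other fields if the data carries them.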

By Hour of Day

Peak hours (10:00–14:00) have higher price volatility. Intervals should be wider at these hours. If coverage drops below 85% only at peak hours, the calibration residuals may not have enough peak-hour samples.
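Since the concern here is sample count as much as coverage, a peak-hour check can report both. In this sketch the tuple layout, the `MIN_SAMPLES` threshold, and the reading of 10:00–14:00 as hours 10 through 13 are all assumptions, not values taken from the system.

```python
PEAK_HOURS = range(10, 14)  # assumption: 10:00-14:00 means hours 10, 11, 12, 13
MIN_SAMPLES = 50            # hypothetical threshold, not taken from the source

def peak_hour_check(rows):
    """Return (coverage, n, enough_samples) for the peak-hour subset.

    Each row is (hour, lower_90, upper_90, actual_price); rows without an
    actual are ignored.
    """
    peak = [r for r in rows if r[0] in PEAK_HOURS and r[3] is not None]
    n = len(peak)
    if n == 0:
        return None, 0, False
    within = sum(1 for _, lo, hi, actual in peak if lo <= actual <= hi)
    return within / n, n, n >= MIN_SAMPLES

rows = [
    (11, 40.0, 60.0, 50.0),  # peak, covered
    (12, 40.0, 60.0, 90.0),  # peak, missed
    (3, 40.0, 60.0, 50.0),   # off-peak, ignored
    (13, 40.0, 60.0, None),  # peak, no actual yet
]
print(peak_hour_check(rows))
```

A low coverage figure paired with `enough_samples == False` points toward thin data rather than a confirmed calibration problem.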

By Price Level

High-price periods are harder to predict and may show undercoverage (intervals too narrow). Low-price stable periods may show overcoverage (intervals unnecessarily wide).

Recalibration

When coverage drifts outside the acceptable range, the conformal calibrator can be rebuilt from recent prediction data:

1. Collect recent predictions with backfilled actuals (last 168+ hours)
2. Compute residuals: actual - predicted
3. Group by horizon bucket
4. Recompute quantiles for each bucket
5. Replace old calibrator with updated one

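The minimum sample size of 168 residuals (one week of hourly predictions) is intended to give enough data points for stable quantile estimation. Tail quantiles remain noisy at this size: with 168 residuals, only about 8 fall beyond the 5th or the 95th percentile.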
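The recalibration steps can be sketched as follows. This is an illustration under stated assumptions rather than the actual calibrator: the record layout `(horizon_bucket, predicted, actual)`, the function names, and the choice to apply the 168-residual minimum per bucket are all hypothetical. The 5th and 95th percentiles match the 90% interval described elsewhere in this document.

```python
from collections import defaultdict
from statistics import quantiles

MIN_RESIDUALS = 168  # one week of hourly predictions

def rebuild_calibrator(records, min_residuals=MIN_RESIDUALS):
    """Return {bucket: (q05, q95)} of signed residuals (actual - predicted).

    records: iterable of (horizon_bucket, predicted, actual), actuals backfilled.
    Buckets with too few residuals are omitted so the caller can keep old values.
    """
    residuals = defaultdict(list)
    for bucket, predicted, actual in records:
        residuals[bucket].append(actual - predicted)  # steps 2 and 3

    calibrator = {}
    for bucket, values in residuals.items():
        if len(values) < min_residuals:
            continue
        # cuts[k - 1] is the k-th percentile of the residuals (step 4)
        cuts = quantiles(values, n=100, method="inclusive")
        calibrator[bucket] = (cuts[4], cuts[94])
    return calibrator

def interval_90(calibrator, bucket, predicted):
    """Apply the residual quantiles to a point prediction."""
    q05, q95 = calibrator[bucket]
    return predicted + q05, predicted + q95

# Synthetic data: 201 residuals from -100 to 100 for DA1, too few for S5.
records = [("DA1", 50.0, 50.0 + r) for r in range(-100, 101)]
records += [("S5", 50.0, 55.0)] * 3
calibrator = rebuild_calibrator(records)
print(calibrator)
print(interval_90(calibrator, "DA1", 50.0))
```

Step 5, replacing the old calibrator, is left to the caller here; swapping the whole mapping at once avoids serving a mix of old and new quantiles.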

Common Miscalibration Patterns

Overcoverage (Intervals Too Wide)

90% CI actual coverage: 97%

Causes:

Calibration residuals from a volatile period; current market is calmer
Recent model improvement reduced errors, but calibrator uses old (larger) residuals

Fix: Recalibrate with recent residuals that reflect current accuracy.

Undercoverage (Intervals Too Narrow)

90% CI actual coverage: 82%

Causes:

Market regime change (e.g., gas crisis, renewable surge) creating larger errors
Model drift — accuracy has degraded since calibration
Insufficient calibration samples for the specific horizon bucket

Fix: Recalibrate, investigate model drift, or widen intervals by using a more conservative quantile (e.g., 3rd and 97th percentile instead of 5th and 95th).
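The effect of switching to more conservative quantiles can be seen in a short sketch. The helper `residual_bounds` and the synthetic residuals are assumptions for illustration only.

```python
from statistics import quantiles

def residual_bounds(residuals, tail_pct):
    """Lower/upper residual quantiles leaving tail_pct percent in each tail."""
    if not 1 <= tail_pct <= 49:
        raise ValueError("tail_pct must be between 1 and 49")
    cuts = quantiles(residuals, n=100, method="inclusive")  # cuts[k - 1] = k-th percentile
    return cuts[tail_pct - 1], cuts[99 - tail_pct]

residuals = list(range(-100, 101))  # synthetic, evenly spread residuals
standard = residual_bounds(residuals, 5)  # 5th and 95th percentile
widened = residual_bounds(residuals, 3)   # 3rd and 97th percentile
print(standard, widened)
```

The widened band is built from quantiles that nominally cover 94% of calibration residuals, so it is a stopgap that trades sharpness for coverage until recalibration with representative residuals is possible.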

Asymmetric Miscalibration

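Actual prices above upper_90: 8% (too many)
Actual prices below lower_90: 2% (fewer than expected)

A 90% interval built from the 5th and 95th percentiles should leave about 5% of actuals on each side. Here the overall coverage is still 90%, so the aggregate check passes, but the upper bound is too tight (the model underestimates upside risk) and the lower bound is slightly loose. This commonly occurs when the calibration period lacked price spikes.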
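Detecting this pattern requires tracking each tail separately, because the aggregate coverage figure alone can hide it. A minimal sketch, with the record layout `(lower_90, upper_90, actual_price)` and the function name `tail_rates` assumed for the example:

```python
def tail_rates(records):
    """Return (below_rate, above_rate) for records of (lower_90, upper_90, actual_price)."""
    scored = [(lo, hi, a) for lo, hi, a in records if a is not None]
    n = len(scored)
    if n == 0:
        return None
    below = sum(1 for lo, _, a in scored if a < lo) / n
    above = sum(1 for _, hi, a in scored if a > hi) / n
    return below, above

# Synthetic example reproducing the pattern above: 90 inside, 8 above, 2 below.
records = [(40.0, 60.0, 50.0)] * 90 + [(40.0, 60.0, 70.0)] * 8 + [(40.0, 60.0, 30.0)] * 2
below, above = tail_rates(records)
print(below, above)
```

Comparing each rate with the 5% expected per tail flags the tight upper bound even though overall coverage is on target.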

Relationship to Conformal Prediction

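Conformal prediction provides a finite-sample coverage guarantee: when calibration and future residuals are exchangeable and the standard finite-sample quantile correction is used, intervals cover the actual value at least as often as the target level, on average over calibration sets. Price series rarely satisfy exchangeability exactly, because regimes shift and errors are autocorrelated, and coverage measured over a finite evaluation window is itself noisy. Some deviation from the target is therefore expected in practice. The calibration testing framework monitors this deviation and triggers recalibration when it exceeds acceptable bounds.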
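As background, the textbook split conformal result shows how the guarantee depends on the calibration set size n. The notation below (absolute-residual scores s_i, miscoverage level α) is the standard formulation and is an assumption here; the calibrator described in this document works with signed residual quantiles per bucket and may differ in detail.

```latex
% Textbook split conformal bound with absolute-residual scores; background only.
% s_i = |y_i - \hat{y}_i| for i = 1, ..., n calibration points; target coverage 1 - \alpha.
\[
\hat{q} = \text{the } \lceil (n+1)(1-\alpha) \rceil\text{-th smallest of } s_1, \dots, s_n,
\qquad
C(x) = \bigl[\hat{y}(x) - \hat{q},\ \hat{y}(x) + \hat{q}\bigr]
\]
% If calibration and test points are exchangeable
% (the upper bound additionally assumes no ties among the scores):
\[
1 - \alpha \;\le\; \Pr\bigl(y_{n+1} \in C(x_{n+1})\bigr) \;\le\; 1 - \alpha + \frac{1}{n+1}
\]
```

With n = 168, the slack 1/(n+1) is 1/169, about 0.6 percentage points. The bound is marginal, meaning it averages over calibration sets; coverage for one particular calibration set can deviate by more than that.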