v8.0 — PyTorch Neural Network (Rejected)
Date: March 21, 2026 | Status: Rejected
Why This Experiment Exists
After 83 tree-based experiments, models hit a regression slope ceiling at 0.70-0.75. The v7.1 sklearn MLP proved NNs CAN output extreme values (MaxPred 207 vs XGB 135) but failed because sklearn lacks mini-batching, BatchNorm, and dropout. This experiment tested whether a proper PyTorch implementation — addressing every sklearn limitation — could break the slope ceiling.
Hypothesis: The slope ceiling is caused by tree leaf averaging. Neural networks don’t have this constraint, so a PyTorch MLP with modern training techniques should achieve slope > 0.75.
What We Built
TFTRegressor — a PyTorch deep residual MLP with:
- ResidualBlock: Linear → BatchNorm → ReLU → Dropout with skip connections
- Adam + cosine LR schedule with 5-epoch warmup
- Gradient clipping (max_norm=1.0) for training stability
- Early stopping on chronological validation split (last 10%)
- Internal NaN imputation + StandardScaler (no Pipeline wrapper needed)
- GPU acceleration (RTX 3060 CUDA)
- joblib serialization via __getstate__/__setstate__ with state_dict
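The residual block described above can be sketched as follows. This is a reconstruction from the bullet list, not the actual TFTRegressor source; names and defaults are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Dropout with an additive skip
    connection (a sketch; the production class may differ)."""

    def __init__(self, dim: int, dropout: float = 0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection keeps gradients flowing through deep stacks.
        return x + self.body(x)

block = ResidualBlock(dim=128, dropout=0.2).eval()
out = block(torch.randn(32, 128))
```

The additive skip requires input and output widths to match, which is why the block keeps a constant hidden dimension.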
sklearn-compatible interface: fit(X, y, sample_weight) / predict(X) — plugged into the existing training pipeline with zero changes to trainer or predictor code.
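The __getstate__/__setstate__ pattern can be sketched like this; the class and attribute names below are illustrative stand-ins, not the real TFTRegressor internals. The idea is to swap the live nn.Module for its state_dict bytes so joblib/pickle never sees CUDA tensors or the module graph:

```python
import io
import pickle
import torch
import torch.nn as nn

class PickleableTorchEstimator:
    """Illustrative sketch of state_dict-based pickling."""

    def __init__(self, in_dim: int = 4):
        self.in_dim = in_dim
        self.model_ = self._build_model()

    def _build_model(self) -> nn.Module:
        return nn.Linear(self.in_dim, 1)  # stand-in for the real network

    def __getstate__(self):
        state = self.__dict__.copy()
        buf = io.BytesIO()
        # Serialize weights only: device-free and version-stable.
        torch.save(state.pop("model_").state_dict(), buf)
        state["model_bytes_"] = buf.getvalue()
        return state

    def __setstate__(self, state):
        blob = state.pop("model_bytes_")
        self.__dict__.update(state)
        # Rebuild the architecture from hyperparameters, then load weights.
        self.model_ = self._build_model()
        self.model_.load_state_dict(
            torch.load(io.BytesIO(blob), map_location="cpu"))

est = PickleableTorchEstimator()
clone = pickle.loads(pickle.dumps(est))
```

Because only tensors cross the pickle boundary, the saved model loads cleanly on CPU-only machines even when trained on the GPU.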
What We Tested
Wave 1: Find working architecture (3 experiments)
| # | Name | Hidden | Layers | Dropout | LR | Loss | MAE | Slope | MaxPred |
|---|---|---|---|---|---|---|---|---|---|
| 1 | nn-base | 128 | 3 | 0.2 | 1e-3 | huber | 64.51 | 0.550 | 319 |
| 2 | nn-wide | 256 | 3 | 0.2 | 1e-3 | huber | 36.84 | 0.494 | 190 |
| 3 | nn-deep | 128 | 5 | 0.3 | 5e-4 | huber | 31.59 | 0.368 | 110 |
Diagnosis: early stopping triggered too aggressively (epochs 2-5). The model barely trains before validation loss plateaus and training halts.
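Patience-based early stopping makes the diagnosis concrete: with a small patience, a brief validation plateau halts training even when the loss would have improved a few epochs later. The values and patience below are made up for illustration:

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience`
    consecutive epochs (minimal sketch, not the experiment code)."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# With patience=2, the plateau at epochs 2-3 stops training at epoch 3,
# before the big improvement that arrives at epoch 4.
stopper = EarlyStopping(patience=2)
history = [1.0, 0.9, 0.91, 0.92, 0.5]
stopped_at = next(i for i, v in enumerate(history) if stopper.step(v))
```

Raising patience (or requiring a larger min_delta before counting an epoch as "bad") is the standard fix, which is what Wave 2 effectively achieved.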
Wave 2: Fix convergence (2 experiments)
| # | Name | Hidden | Layers | Dropout | LR | Loss | MAE | Slope | MaxPred |
|---|---|---|---|---|---|---|---|---|---|
| 4 | nn-slow-lr | 256 | 3 | 0.1 | 1e-4 | mae | 28.13 | 0.393 | 123 |
| 5 | nn-wide-mae | 256 | 3 | 0.15 | 5e-4 | mae | 29.47 | 0.423 | 136 |
For reference — XGBoost v7.0 best: MAE 12.69, Slope 0.710, MaxPred 135
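For clarity on the headline metric: slope here is assumed to be the OLS slope of predictions regressed on actuals, where 1.0 means predictions move one-for-one with prices and values below 1.0 indicate compression toward the mean. A numpy sketch with synthetic prices:

```python
import numpy as np

def regression_slope(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """OLS slope of predictions regressed on actuals (assumed definition)."""
    cov = np.cov(y_true, y_pred)[0, 1]
    return cov / np.var(y_true, ddof=1)

rng = np.random.default_rng(42)
actual = rng.normal(60.0, 25.0, 1000)        # synthetic day-ahead prices
compressed = 60.0 + 0.4 * (actual - 60.0)    # predictor shrunk toward the mean
slope = regression_slope(actual, compressed)  # ~0.4: the failure mode below
max_pred = compressed.max()
```

A mean-reverting predictor scores a low slope no matter how low its MAE is, which is why MAE and slope diverge so sharply in the tables above.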
What Failed and Why
The optimizer works — the architecture doesn’t
Wave 2 trained to epoch 80-120 with steadily decreasing val MAE (17.5 at convergence), confirming that mini-batching, BatchNorm, cosine LR, and gradient clipping all work correctly. The problem isn’t training mechanics — it’s what the model learns.
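The training mechanics named above (warmup into a cosine schedule, gradient clipping) can be sketched with standard PyTorch schedulers. This is a simplified full-batch stand-in on synthetic data, not the experiment code:

```python
import torch
import torch.nn as nn

# Tiny synthetic regression problem standing in for the real feature matrix.
torch.manual_seed(0)
X, y = torch.randn(256, 8), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# 5-epoch linear warmup, then cosine decay, mirroring the setup above.
sched = torch.optim.lr_scheduler.SequentialLR(
    opt,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=5),
        torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=95),
    ],
    milestones=[5],
)
loss_fn = nn.HuberLoss()

losses = []
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    # Gradient clipping (max_norm=1.0) for stability, as in the experiments.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
    losses.append(loss.item())
```

All of these pieces behave as intended, which is the point of the Wave 2 result: the optimization loop converges; what it converges to is the problem.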
Tabular NNs are fundamentally disadvantaged on flattened lags
The current feature matrix has ~106 columns including individual lag columns (price_lag_1h, price_lag_24h, price_lag_168h). Trees learn conditional rules naturally:
```
if price_lag_24h > 80 AND wind_share < 0.3 AND gas_price > 30: → predict 95 EUR
```

This conditional logic is trivial for tree splits but requires enormous NN capacity to approximate through weight matrices. The NN ends up learning a smoothed average rather than the sharp conditional structure trees discover.
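This asymmetry is easy to demonstrate: a depth-3 decision tree recovers a three-way conjunction of thresholds essentially exactly, because each tree level handles one conjunct. A sketch on synthetic data using the thresholds from the example above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(0, 160, n),   # stand-in for price_lag_24h
    rng.uniform(0, 1, n),     # stand-in for wind_share
    rng.uniform(0, 60, n),    # stand-in for gas_price
])
# Sharp conditional rule: price spikes only when all three conditions hold.
spike = (X[:, 0] > 80) & (X[:, 1] < 0.3) & (X[:, 2] > 30)
y = np.where(spike, 95.0, 40.0)

# Three levels suffice: one split per conjunct along a single path.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
score = tree.score(X, y)
```

An MLP fitting the same target has to carve the same axis-aligned corner out of a smooth function of weighted sums, which pulls it toward a blurred average of the two regimes instead.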
The slope DECREASED, not increased
| Model | Slope | Interpretation |
|---|---|---|
| XGBoost v7.0 | 0.710 | Captures 71% of price variation |
| nn-wide (best slope) | 0.550 | Captures only 55% — WORSE |
| nn-slow-lr (best MAE) | 0.393 | Captures only 39% — much WORSE |
Counter to the hypothesis, the NN compresses predictions MORE than trees. It learns to predict toward the mean (safe loss minimum) rather than capturing the conditional structure.
Wider networks trade slope for range
nn-wide achieved range ratio 1.027 and MaxPred 190 — proving the network CAN output extreme values. But at slope 0.494, these predictions are poorly correlated with actuals. More capacity produces more noise, not more signal.
What We Learned
The problem is representation, not optimization
Flattened lag columns lose the temporal structure that trees implicitly recover through split cascades. A tree can discover “price went up 3 days in a row” through recursive conditional splits on price_lag_24h, price_lag_48h, price_lag_72h. An MLP sees three independent numbers and must learn this temporal relationship from scratch.
The path to NN-based EPF requires sequence models
A proper Temporal Fusion Transformer or N-BEATS feeding raw price history as a time series (not flattened lags) would preserve temporal structure. This requires restructuring the data pipeline from tabular (one row per prediction) to sequential (encoder-decoder with historical context window).
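The tabular-to-sequential restructuring amounts to replacing flattened lag columns with rolling context windows. A numpy sketch, with illustrative window sizes (one week of hourly context in, one day-ahead horizon out):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hourly price history: the tabular pipeline flattens this into lag columns;
# a sequence model instead consumes rolling context windows directly.
prices = np.arange(500, dtype=float)   # stand-in for 500 hours of prices
context, horizon = 168, 24             # one week in, one day out

windows = sliding_window_view(prices, context + horizon)
encoder_in = windows[:, :context]       # (n_samples, 168) history
decoder_target = windows[:, context:]   # (n_samples, 24) day-ahead targets
```

Each row preserves the ordering of the 168 history values, so an encoder can learn patterns like "three consecutive up days" directly instead of reconstructing them from independent lag columns.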
Infrastructure is reusable
The TFTRegressor class, tft model type in the factory, and the scout pipeline can be reused for future sequence model experiments. The sklearn-compatible interface and joblib serialization are model-agnostic.
Decision
Rejected — tabular PyTorch MLP cannot compete with XGBoost on the current EPF feature set. The architecture remains available for future sequence model experiments.