v8.0 — PyTorch Neural Network (Rejected)
Date: March 21, 2026 | Status: Rejected
Why This Experiment Exists
After 83 tree-based experiments, models hit a regression slope ceiling at 0.70-0.75. The v7.1 sklearn MLP proved NNs CAN output extreme values (MaxPred 207 vs XGB 135) but failed because sklearn lacks mini-batching, BatchNorm, and dropout. This experiment tested whether a proper PyTorch implementation — addressing every sklearn limitation — could break the slope ceiling.
Hypothesis: The slope ceiling is caused by tree leaf averaging. Neural networks don’t have this constraint, so a PyTorch MLP with modern training techniques should achieve slope > 0.75.
What We Built
TFTRegressor — a PyTorch deep residual MLP with:
- ResidualBlock: Linear → BatchNorm → ReLU → Dropout with skip connections
- Adam + cosine LR schedule with 5-epoch warmup
- Gradient clipping (max_norm=1.0) for training stability
- Early stopping on chronological validation split (last 10%)
- Internal NaN imputation + StandardScaler (no Pipeline wrapper needed)
- GPU acceleration (RTX 3060 CUDA)
- joblib serialization via __getstate__/__setstate__ with state_dict
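The residual block described above can be sketched as follows. This is a reconstruction from the bullet list, not the actual TFTRegressor source; names and defaults are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Dropout with an additive skip
    connection (a sketch; the production class may differ)."""

    def __init__(self, dim: int, dropout: float = 0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection keeps gradients flowing through deep stacks.
        return x + self.body(x)

block = ResidualBlock(dim=128, dropout=0.2).eval()
out = block(torch.randn(32, 128))
```

The additive skip requires input and output widths to match, which is why the block keeps a constant hidden dimension.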
sklearn-compatible interface: fit(X, y, sample_weight) / predict(X) — plugged into the existing training pipeline with zero changes to trainer or predictor code.
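The __getstate__/__setstate__ pattern can be sketched like this; the class and attribute names below are illustrative stand-ins, not the real TFTRegressor internals. The idea is to swap the live nn.Module for its state_dict bytes so joblib/pickle never sees CUDA tensors or the module graph:

```python
import io
import pickle
import torch
import torch.nn as nn

class PickleableTorchEstimator:
    """Illustrative sketch of state_dict-based pickling."""

    def __init__(self, in_dim: int = 4):
        self.in_dim = in_dim
        self.model_ = self._build_model()

    def _build_model(self) -> nn.Module:
        return nn.Linear(self.in_dim, 1)  # stand-in for the real network

    def __getstate__(self):
        state = self.__dict__.copy()
        buf = io.BytesIO()
        # Serialize weights only: device-free and version-stable.
        torch.save(state.pop("model_").state_dict(), buf)
        state["model_bytes_"] = buf.getvalue()
        return state

    def __setstate__(self, state):
        blob = state.pop("model_bytes_")
        self.__dict__.update(state)
        # Rebuild the architecture from hyperparameters, then load weights.
        self.model_ = self._build_model()
        self.model_.load_state_dict(
            torch.load(io.BytesIO(blob), map_location="cpu"))

est = PickleableTorchEstimator()
clone = pickle.loads(pickle.dumps(est))
```

Because only tensors cross the pickle boundary, the saved model loads cleanly on CPU-only machines even when trained on the GPU.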
What We Tested
Wave 1: Find working architecture (3 experiments)
| # | Name | Hidden | Layers | Dropout | LR | Loss | MAE | Slope | MaxPred |
|---|---|---|---|---|---|---|---|---|---|
| 1 | nn-base | 128 | 3 | 0.2 | 1e-3 | huber | 64.51 | 0.550 | 319 |
| 2 | nn-wide | 256 | 3 | 0.2 | 1e-3 | huber | 36.84 | 0.494 | 190 |
| 3 | nn-deep | 128 | 5 | 0.3 | 5e-4 | huber | 31.59 | 0.368 | 110 |
Diagnosis: early stopping triggered too aggressively (epochs 2-5). The model barely trains before validation loss plateaus and training halts.
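Patience-based early stopping makes the diagnosis concrete: with a small patience, a brief validation plateau halts training even when the loss would have improved a few epochs later. The values and patience below are made up for illustration:

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience`
    consecutive epochs (minimal sketch, not the experiment code)."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# With patience=2, the plateau at epochs 2-3 stops training at epoch 3,
# before the big improvement that arrives at epoch 4.
stopper = EarlyStopping(patience=2)
history = [1.0, 0.9, 0.91, 0.92, 0.5]
stopped_at = next(i for i, v in enumerate(history) if stopper.step(v))
```

Raising patience (or requiring a larger min_delta before counting an epoch as "bad") is the standard fix, which is what Wave 2 effectively achieved.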
Wave 2: Fix convergence (2 experiments)
| # | Name | Hidden | Layers | Dropout | LR | Loss | MAE | Slope | MaxPred |
|---|---|---|---|---|---|---|---|---|---|
| 4 | nn-slow-lr | 256 | 3 | 0.1 | 1e-4 | mae | 28.13 | 0.393 | 123 |
| 5 | nn-wide-mae | 256 | 3 | 0.15 | 5e-4 | mae | 29.47 | 0.423 | 136 |
For reference — XGBoost v7.0 best: MAE 12.69, Slope 0.710, MaxPred 135
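For clarity on the headline metric: slope here is assumed to be the OLS slope of predictions regressed on actuals, where 1.0 means predictions move one-for-one with prices and values below 1.0 indicate compression toward the mean. A numpy sketch with synthetic prices:

```python
import numpy as np

def regression_slope(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """OLS slope of predictions regressed on actuals (assumed definition)."""
    cov = np.cov(y_true, y_pred)[0, 1]
    return cov / np.var(y_true, ddof=1)

rng = np.random.default_rng(42)
actual = rng.normal(60.0, 25.0, 1000)        # synthetic day-ahead prices
compressed = 60.0 + 0.4 * (actual - 60.0)    # predictor shrunk toward the mean
slope = regression_slope(actual, compressed)  # ~0.4: the failure mode below
max_pred = compressed.max()
```

A mean-reverting predictor scores a low slope no matter how low its MAE is, which is why MAE and slope diverge so sharply in the tables above.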
What Failed and Why
The optimizer works — the architecture doesn’t
Wave 2 trained to epoch 80-120 with steadily decreasing val MAE (17.5 at convergence), confirming that mini-batching, BatchNorm, cosine LR, and gradient clipping all work correctly. The problem isn’t training mechanics — it’s what the model learns.
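The training mechanics named above (warmup into a cosine schedule, gradient clipping) can be sketched with standard PyTorch schedulers. This is a simplified full-batch stand-in on synthetic data, not the experiment code:

```python
import torch
import torch.nn as nn

# Tiny synthetic regression problem standing in for the real feature matrix.
torch.manual_seed(0)
X, y = torch.randn(256, 8), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# 5-epoch linear warmup, then cosine decay, mirroring the setup above.
sched = torch.optim.lr_scheduler.SequentialLR(
    opt,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=5),
        torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=95),
    ],
    milestones=[5],
)
loss_fn = nn.HuberLoss()

losses = []
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    # Gradient clipping (max_norm=1.0) for stability, as in the experiments.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
    losses.append(loss.item())
```

All of these pieces behave as intended, which is the point of the Wave 2 result: the optimization loop converges; what it converges to is the problem.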
Tabular NNs are fundamentally disadvantaged on flattened lags
The current feature matrix has ~106 columns including individual lag columns (price_lag_1h, price_lag_24h, price_lag_168h). Trees learn conditional rules naturally:
```
if price_lag_24h > 80 AND wind_share < 0.3 AND gas_price > 30: → predict 95 EUR
```

This conditional logic is trivial for tree splits but requires enormous NN capacity to approximate through weight matrices. The NN ends up learning a smoothed average rather than the sharp conditional structure trees discover.
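This asymmetry is easy to demonstrate: a depth-3 decision tree recovers a three-way conjunction of thresholds essentially exactly, because each tree level handles one conjunct. A sketch on synthetic data using the thresholds from the example above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(0, 160, n),   # stand-in for price_lag_24h
    rng.uniform(0, 1, n),     # stand-in for wind_share
    rng.uniform(0, 60, n),    # stand-in for gas_price
])
# Sharp conditional rule: price spikes only when all three conditions hold.
spike = (X[:, 0] > 80) & (X[:, 1] < 0.3) & (X[:, 2] > 30)
y = np.where(spike, 95.0, 40.0)

# Three levels suffice: one split per conjunct along a single path.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
score = tree.score(X, y)
```

An MLP fitting the same target has to carve the same axis-aligned corner out of a smooth function of weighted sums, which pulls it toward a blurred average of the two regimes instead.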
The slope DECREASED, not increased
| Model | Slope | Interpretation |
|---|---|---|
| XGBoost v7.0 | 0.710 | Captures 71% of price variation |
| nn-wide (best slope) | 0.550 | Captures only 55% — WORSE |
| nn-slow-lr (best MAE) | 0.393 | Captures only 39% — much WORSE |
Counter to the hypothesis, the NN compresses predictions MORE than trees. It learns to predict toward the mean (safe loss minimum) rather than capturing the conditional structure.
Wider networks trade slope for range
nn-wide achieved range ratio 1.027 and MaxPred 190 — proving the network CAN output extreme values. But at slope 0.494, these predictions are poorly correlated with actuals. More capacity produces more noise, not more signal.
What We Learned
The problem is representation, not optimization
Flattened lag columns lose the temporal structure that trees implicitly recover through split cascades. A tree can discover “price went up 3 days in a row” through recursive conditional splits on price_lag_24h, price_lag_48h, price_lag_72h. An MLP sees three independent numbers and must learn this temporal relationship from scratch.
The path to NN-based EPF requires sequence models
A proper Temporal Fusion Transformer or N-BEATS feeding raw price history as a time series (not flattened lags) would preserve temporal structure. This requires restructuring the data pipeline from tabular (one row per prediction) to sequential (encoder-decoder with historical context window).
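The tabular-to-sequential restructuring amounts to replacing flattened lag columns with rolling context windows. A numpy sketch, with illustrative window sizes (one week of hourly context in, one day-ahead horizon out):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hourly price history: the tabular pipeline flattens this into lag columns;
# a sequence model instead consumes rolling context windows directly.
prices = np.arange(500, dtype=float)   # stand-in for 500 hours of prices
context, horizon = 168, 24             # one week in, one day out

windows = sliding_window_view(prices, context + horizon)
encoder_in = windows[:, :context]       # (n_samples, 168) history
decoder_target = windows[:, context:]   # (n_samples, 24) day-ahead targets
```

Each row preserves the ordering of the 168 history values, so an encoder can learn patterns like "three consecutive up days" directly instead of reconstructing them from independent lag columns.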
Infrastructure is reusable
The TFTRegressor class, tft model type in the factory, and the scout pipeline can be reused for future sequence model experiments. The sklearn-compatible interface and joblib serialization are model-agnostic.
Decision
Rejected — tabular PyTorch MLP cannot compete with XGBoost on the current EPF feature set. The architecture remains available for future sequence model experiments.