
v7.1 — MLP Neural Network (Rejected)

Date: March 20, 2026 | Status: Rejected

Why This Experiment Exists

The v6.3 validation and v7.0 compression-breaking campaigns established that tree-based models hit a regression slope ceiling at 0.70-0.72 — predictions capture only 70% of actual price variation due to leaf-node averaging. The hypothesis was that neural networks, which don’t have this structural constraint, could break through the ceiling.

Critical question: Is the compression ceiling model-specific (tree architecture) or data-specific (insufficient features)?

What We Tested

sklearn’s MLPRegressor wrapped in a Pipeline with:

  • SimpleImputer(strategy="median") — handles NaN in commodity features (30% missing)
  • StandardScaler — normalizes all features to mean=0, std=1 (critical for gradient-based optimization)
  • MLPRegressor with early stopping, various architectures and learning rates
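The three stages above can be sketched as follows. Hyperparameters shown match the mlp-base run (architecture (256,128,64), learning rate 0.001); the remaining settings (random seed, early-stopping details) are assumptions, not the recorded configuration.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs in commodity features
    ("scale", StandardScaler()),                   # mean=0, std=1 for gradient descent
    ("mlp", MLPRegressor(
        hidden_layer_sizes=(256, 128, 64),         # mlp-base architecture
        learning_rate_init=0.001,                  # mlp-base learning rate
        early_stopping=True,                       # holds out 10% as a validation split
        random_state=42,                           # assumed, for reproducibility
    )),
])
```

Because imputation and scaling live inside the Pipeline, they are fit on training folds only, avoiding leakage when cross-validating.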

Experiments

| # | Name     | Architecture    | LR    | MAE   | Slope | Range | MaxPred | Spike Rec |
|---|----------|-----------------|-------|-------|-------|-------|---------|-----------|
| 1 | mlp-base | (256,128,64)    | 0.001 | 37.24 | 0.524 | 1.079 | 207     | 33.0%     |
| 2 | mlp-lr01 | (256,128,64)    | 0.01  | 38.75 | 0.337 | 0.657 | 129     | 10.8%     |
| 3 | mlp-deep | (256,128,64,32) | 0.001 | 38.96 | 0.263 | 0.517 | 147     | 12.9%     |

For reference — best XGBoost (pw-3x-d365-t60): MAE 12.69, Slope 0.707, MaxPred ~135

What Failed and Why

The MLP is roughly 3x worse than XGBoost on MAE and loses on every other metric except MaxPred. Each run falls into one of two failure modes:

  1. Overfits wildly (mlp-base): Range ratio 1.079 means predictions oscillate MORE than actual prices. MaxPred 207 confirms NNs CAN output extreme values, but the predictions are poorly calibrated — high variance, low accuracy.

  2. Collapses to near-mean (mlp-deep, mlp-lr01): Slope 0.26-0.34 means the model learned to predict ~50-60 EUR for everything. Bias of -34 to -37 confirms near-constant output.
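The calibration metrics used above can be sketched as a small diagnostic. The exact definitions used in the campaign are assumed here: slope from a linear fit of predictions against actuals (1.0 = perfectly calibrated, below 1 = compressed toward the mean), bias as the difference of means, and range ratio as predicted vs. actual price spread.

```python
import numpy as np

def calibration_diagnostics(y_true, y_pred):
    """Assumed definitions of the campaign's slope / bias / range metrics."""
    # Slope of y_pred regressed on y_true; <1 indicates compression,
    # >1 indicates over-amplified predictions (the mlp-base case).
    slope, _ = np.polyfit(y_true, y_pred, 1)
    # Negative bias means predictions sit below the actual mean
    # (the near-constant-output collapse shows up here).
    bias = float(np.mean(y_pred) - np.mean(y_true))
    # Ratio of predicted to actual price range.
    range_ratio = (y_pred.max() - y_pred.min()) / (y_true.max() - y_true.min())
    return float(slope), bias, float(range_ratio)
```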

Root causes:

  • sklearn MLPRegressor is CPU-bound with limited optimization options. The lbfgs solver processes the entire 360K-sample × 100+ feature dataset every iteration; the adam and sgd solvers do support mini-batches, but with no GPU path and few knobs, training stayed extremely slow (2-4 hours per experiment) and prone to poor convergence.
  • No batch normalization or dropout. Without these regularization techniques, the network either memorizes noise (overfitting) or fails to learn meaningful patterns (underfitting).
  • 100+ features with different information densities. Price lags are highly informative while commodity ratios are noisy. The MLP treats all scaled features equally, diluting signal with noise.
  • Single-threaded CPU training. Each experiment took 2-4 hours vs 30 min for XGBoost, making iterative tuning impractical.

What We Learned

The compression ceiling IS model-specific, not data-specific

The mlp-base experiment proved this: MaxPred 207 (vs XGBoost’s 135) confirms that the neural network CAN output extreme values. The features contain enough signal for 200+ EUR predictions. Trees can’t make these predictions because of leaf averaging; NNs can, but sklearn’s optimizer can’t find the right weights.

sklearn MLPRegressor is the wrong tool

The failure is implementation-specific, not architecture-specific:

| Capability               | sklearn MLP                                   | PyTorch MLP                |
|--------------------------|-----------------------------------------------|----------------------------|
| Mini-batch training      | Partial (adam/sgd only; lbfgs is full-batch)  | Yes                        |
| Batch normalization      | No                                            | Yes                        |
| Dropout regularization   | No                                            | Yes                        |
| Learning rate scheduling | Limited (sgd only: invscaling, adaptive)      | Yes (cosine, warmup, etc.) |
| GPU acceleration         | No                                            | Yes                        |
| Custom loss functions    | No                                            | Yes                        |
| Training speed (360K rows) | 2-4 hours                                   | ~15-30 min (estimated)     |

A PyTorch implementation could address every failure mode we observed.
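A minimal sketch of what that infrastructure could look like, covering the missing capabilities from the table (mini-batches, batch norm, dropout, cosine LR scheduling). Architecture and hyperparameters here are illustrative assumptions, not measured configurations.

```python
import torch
import torch.nn as nn

class PriceMLP(nn.Module):
    """Illustrative MLP with the regularization sklearn lacks."""
    def __init__(self, n_features, hidden=(256, 128, 64), dropout=0.2):
        super().__init__()
        layers, prev = [], n_features
        for width in hidden:
            layers += [nn.Linear(prev, width), nn.BatchNorm1d(width),
                       nn.ReLU(), nn.Dropout(dropout)]
            prev = width
        layers.append(nn.Linear(prev, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, X, y, epochs=10, batch_size=1024, lr=1e-3):
    # Mini-batch training: each optimizer step sees one batch,
    # not the full 360K-row dataset.
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X, y),
        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = nn.L1Loss()  # optimize MAE directly, matching the reported metric
    for _ in range(epochs):
        model.train()
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        sched.step()  # cosine decay of the learning rate per epoch
    return model
```

A custom loss (e.g. one that up-weights spike hours) would drop in as a replacement for `nn.L1Loss`, which is the lever the sklearn wrapper never exposes.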

Decision

Rejected — sklearn MLPRegressor is not viable for electricity price forecasting at this dataset scale. The architecture concept (neural networks for decompression) remains valid but requires PyTorch infrastructure.

Next steps:

  • Option A: Deploy best XGBoost config (pw-3x-d365-t60, MAE 12.69) as production v7.1
  • Option B: Build PyTorch MLP infrastructure (~200 lines) for proper NN experiments
  • Option C: Pursue hybrid XGB+NN approach once PyTorch is available