Overview
This demonstration compares how different regularization schemes affect linear regression fits when the model can overfit (e.g., high-degree polynomial features). You can regenerate synthetic data, change feature degree, and sweep $\lambda$ to see how the fitted function, coefficient magnitudes, and train/test error change.
| Method | Objective or update rule |
| --- | --- |
| Control (OLS) | $\min_w\ \tfrac{1}{N}\lVert Xw - y\rVert^2$ |
| Ridge | $\min_w\ \tfrac{1}{N}\lVert Xw - y\rVert^2 + \lambda\lVert w\rVert^2$ |
| Lasso | $\min_w\ \tfrac{1}{N}\lVert Xw - y\rVert^2 + \lambda\lVert w\rVert_1$ |
| L2 penalty (GD) | $w\leftarrow w - \eta\left(\nabla_w\,\tfrac{1}{N}\lVert Xw-y\rVert^2 + 2\lambda w\right)$ |
| Weight decay (GD) | $w\leftarrow (1-2\eta\lambda)w - \eta\nabla_w\,\tfrac{1}{N}\lVert Xw-y\rVert^2$ |
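The two GD rows can be checked numerically. The sketch below is an illustration under stated assumptions, not the demo's own code: it uses NumPy, a synthetic noisy sine dataset, a degree-5 polynomial basis, and penalizes every coefficient including the constant term. The helper names (`grad_mse`, `l2_step`, `weight_decay_step`) are hypothetical. As the rules are written in the table, the L2-penalty step and the weight-decay step are algebraically the same for plain gradient descent, and both should approach the closed-form ridge solution $(X^\top X + N\lambda I)^{-1} X^\top y$ obtained by setting the ridge gradient to zero.

```python
import numpy as np

def grad_mse(X, y, w):
    # Gradient of (1/N) * ||Xw - y||^2 with respect to w.
    return (2.0 / len(y)) * X.T @ (X @ w - y)

def l2_step(X, y, w, eta, lam):
    # L2 penalty: add the penalty gradient 2*lam*w to the loss gradient.
    return w - eta * (grad_mse(X, y, w) + 2.0 * lam * w)

def weight_decay_step(X, y, w, eta, lam):
    # Weight decay: shrink w, then take a plain MSE gradient step.
    return (1.0 - 2.0 * eta * lam) * w - eta * grad_mse(X, y, w)

# Assumed synthetic data: noisy sine, degree-5 polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)
X = np.vander(x, 6, increasing=True)  # columns x^0 ... x^5
n, d = X.shape
lam = 0.1

# Hessian of the ridge objective; 1 / (largest eigenvalue) is a safe step size.
hessian = (2.0 / n) * X.T @ X + 2.0 * lam * np.eye(d)
eta = 1.0 / np.linalg.eigvalsh(hessian).max()

w_l2 = np.zeros(d)
w_wd = np.zeros(d)
for _ in range(5000):
    w_l2 = l2_step(X, y, w_l2, eta, lam)
    w_wd = weight_decay_step(X, y, w_wd, eta, lam)

# Closed-form ridge solution for the 1/N-scaled objective.
w_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

print("max |L2 - weight decay|:", np.max(np.abs(w_l2 - w_wd)))
print("max |GD - closed form|: ", np.max(np.abs(w_l2 - w_closed)))
```

The step size is taken from the Hessian only to keep the run stable; any fixed $\eta$ below twice that value should also converge for this quadratic objective.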
Visualization
Bias-variance tradeoff
Sweeping $\lambda$ and plotting the mean squared error (MSE) for each method shows how regularization affects a model's ability to generalize to unseen data. For high-capacity models, little or no regularization yields very low training MSE but much higher test MSE: the model overfits the observed samples and their noise.
As regularization increases, training MSE typically rises, but test MSE can improve because the model captures more general patterns rather than noise. When regularization becomes too strong, the model is too constrained to fit even the training data well, and both train and test MSE rise due to underfitting.
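A sweep of this kind can be sketched in a few lines. The code below is a minimal illustration, not the demo's implementation: it assumes NumPy, a noisy sine target on $[-1, 1]$, degree-9 polynomial features, closed-form ridge with every coefficient penalized, and hypothetical helper names (`poly_features`, `fit_ridge`, `sweep`). It prints train and test MSE for each $\lambda$ so the trend described above can be inspected.

```python
import numpy as np

def poly_features(x, degree):
    # Columns are x**0, x**1, ..., x**degree.
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, lam):
    # Minimizer of (1/N)||Xw - y||^2 + lam*||w||^2:
    # solve (X^T X + N*lam*I) w = X^T y.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

def sweep(lams, degree=9, n_train=30, n_test=200, noise=0.2, seed=0):
    rng = np.random.default_rng(seed)

    def sample(n):
        # Assumed data model: y = sin(pi * x) + Gaussian noise.
        x = rng.uniform(-1.0, 1.0, size=n)
        y = np.sin(np.pi * x) + noise * rng.normal(size=n)
        return poly_features(x, degree), y

    X_tr, y_tr = sample(n_train)
    X_te, y_te = sample(n_test)
    rows = []
    for lam in lams:
        w = fit_ridge(X_tr, y_tr, lam)
        rows.append((float(lam), mse(X_tr, y_tr, w), mse(X_te, y_te, w)))
    return rows

results = sweep(np.logspace(-6, 1, 8))
for lam, train_mse, test_mse in results:
    print(f"lambda={lam:.0e}  train={train_mse:.4f}  test={test_mse:.4f}")
```

For ridge, training MSE cannot decrease as $\lambda$ grows, so the train column should be non-decreasing; where the test column bottoms out depends on the seed, noise level, and degree.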
Observations
- As degree increases, OLS tends to overfit (wiggly curve, large coefficients, worse test error).
- Ridge typically shrinks coefficients smoothly, reducing variance and improving test error for a wide range of $\lambda$.
- Lasso can drive some coefficients to exactly 0, acting like a form of feature selection.
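- The GD methods diverge if the regularization strength ($\lambda$) or the learning rate ($\eta$) is set too high; what matters for the penalty term is the product $\eta\lambda$ (see "Learning rate and regularization boundaries").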
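- When the target is a sine function and the features are high-degree polynomials, the fitted models generally put most of their weight on odd-degree terms, mirroring the Taylor series $\sin x = x - \tfrac{x^3}{3!} + \tfrac{x^5}{5!} - \dots$, which contains only odd powers.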
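The difference between Ridge shrinkage and Lasso's exact zeros is easiest to see in a simplified setting. The sketch below assumes orthonormal features, $\tfrac{1}{N}X^\top X = I$, which the polynomial features in this demo do not satisfy, so it illustrates the mechanism rather than the demo's solver. Under that assumption each coordinate decouples: the Ridge solution is $w_{\text{ols}}/(1+\lambda)$ and the Lasso solution is the soft-threshold of $w_{\text{ols}}$ at $\lambda/2$. The coefficient values and the helper name `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink each entry toward 0 by t and clip at exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

lam = 1.0
w_ols = np.array([2.0, -0.3, 0.05, -1.2])  # hypothetical OLS coefficients

# Per coordinate (orthonormal features assumed):
#   Lasso minimizes w^2 - 2*w*w_ols + lam*|w|   -> soft threshold at lam/2
#   Ridge minimizes w^2 - 2*w*w_ols + lam*w^2  -> divide by (1 + lam)
w_lasso = soft_threshold(w_ols, lam / 2.0)
w_ridge = w_ols / (1.0 + lam)

print("lasso:", w_lasso)
print("ridge:", w_ridge)
print("exact zeros (lasso, ridge):",
      int(np.sum(w_lasso == 0)), int(np.sum(w_ridge == 0)))
```

Coefficients whose magnitude is below $\lambda/2$ are set to exactly 0 by the soft threshold, while Ridge only rescales them.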
Learning rate and regularization boundaries
Both the L2 penalty and weight decay converge under gradient descent only when $\eta$ and $\lambda$, and in particular their product $\eta\lambda$, stay within certain bounds.
- The L2 penalty minimizes $J(w) = \mathrm{MSE}(w) + \lambda \|w\|_2^2$, whose gradient is $\nabla J(w) = \nabla \mathrm{MSE}(w) + 2\lambda w$. The update is $$w_{t+1} = w_t - \eta \nabla J(w_t) = w_t - \eta (\nabla \mathrm{MSE}(w_t) + 2\lambda w_t).$$ When $\lambda$ is large enough that $2\lambda w_t$ dominates $\nabla \mathrm{MSE}(w_t)$, the update behaves like $w_{t+1} \approx w_t - 2\eta \lambda w_t = (1 - 2\eta \lambda) w_t$, so each step multiplies $w$ by $(1 - 2\eta\lambda)$:
  - If $2\eta \lambda > 1$, the factor is negative and the sign of $w$ flips on each step.
  - If $1 < 2\eta \lambda < 2$, the factor has magnitude below 1, so the iterates still converge but oscillate while doing so, which slows convergence and can make the loss unstable.
  - If $2\eta \lambda > 2$, the factor has magnitude above 1, so each step overshoots by more than the last and the iterates diverge rapidly.
- Weight decay behaves similarly. It uses the decoupled update rule $w_{t+1} = (1 - 2\eta \lambda) w_t - \eta \nabla \mathrm{MSE}(w_t)$ (intercept unpenalized), which for plain gradient descent is an algebraic rearrangement of the L2 update above. When $\eta\lambda$ is too large, it falls into the same $w_{t+1} \approx (1 - 2\eta \lambda) w_t$ behavior.
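- Note: the factor of 2 is not significant. It disappears from the L2 update if the regularization term is redefined as $\frac{\lambda}{2} \|w\|_2^2$, whose gradient is $\lambda w$; the shrink factor then becomes $(1 - \eta\lambda)$ and the thresholds above apply to $\eta\lambda$ instead of $2\eta\lambda$.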
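The three regimes can be reproduced with the scalar recursion alone. This is a minimal sketch of the regularization-dominated approximation $w_{t+1} = (1 - 2\eta\lambda) w_t$, with the MSE gradient deliberately ignored; the function name `shrink_trajectory` and the sample values of $2\eta\lambda$ are illustrative choices, not taken from the demo.

```python
def shrink_trajectory(two_eta_lam, w0=1.0, steps=20):
    # Iterate w_{t+1} = (1 - 2*eta*lam) * w_t, ignoring the MSE gradient.
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = (1.0 - two_eta_lam) * w
        trajectory.append(w)
    return trajectory

# One value of 2*eta*lam from each regime: below 1, between 1 and 2, above 2.
for two_eta_lam in (0.5, 1.5, 2.5):
    traj = shrink_trajectory(two_eta_lam)
    first_steps = [round(v, 3) for v in traj[:4]]
    print(f"2*eta*lam={two_eta_lam}: first steps {first_steps}, "
          f"|w| after 20 steps = {abs(traj[-1]):.3g}")
```

With real data the MSE gradient also contributes. For plain gradient descent on the quadratic ridge objective, the usual stability condition is that $\eta$ times the largest eigenvalue of $\tfrac{2}{N}X^\top X + 2\lambda I$ stays below 2, which reduces to $2\eta\lambda < 2$ when the penalty term dominates.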