Overview

This demonstration compares how different regularization schemes affect linear regression fits when the model can overfit (e.g., high-degree polynomial features). You can regenerate synthetic data, change feature degree, and sweep $\lambda$ to see how the fitted function, coefficient magnitudes, and train/test error change.

Control (OLS): $\min_w\ \tfrac{1}{N}\lVert Xw - y\rVert^2$
Ridge: $\min_w\ \tfrac{1}{N}\lVert Xw - y\rVert^2 + \lambda\lVert w\rVert^2$
Lasso: $\min_w\ \tfrac{1}{N}\lVert Xw - y\rVert^2 + \lambda\lVert w\rVert_1$
L2 penalty (GD): $w\leftarrow w - \eta\left(\nabla_w\,\tfrac{1}{N}\lVert Xw-y\rVert^2 + 2\lambda w\right)$
Weight decay (GD): $w\leftarrow (1-2\eta\lambda)w - \eta\nabla_w\,\tfrac{1}{N}\lVert Xw-y\rVert^2$
Note
We do not penalize the intercept term. Weight decay and L2 penalty are shown as training update rules; they coincide for plain gradient descent (GD) but are not identical for all optimizers, for example adaptive methods such as Adam. L2 penalty is Ridge implemented via GD rather than as a closed-form solution.
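
The relationship between these objectives and update rules can be checked numerically. The following is a minimal sketch, not the demo's own implementation: the function names, the synthetic data, and the values of $\eta$ and $\lambda$ are all assumptions made for illustration. It solves Ridge in closed form with an unpenalized intercept (by centering the data) and runs both GD update rules from the same starting point.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/N)||Xw + b - y||^2 + lam * ||w||^2 with the intercept b unpenalized."""
    N, d = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centering absorbs the intercept
    w = np.linalg.solve(Xc.T @ Xc + N * lam * np.eye(d), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def mse_grad(X, y, w, b):
    """Gradient of (1/N)||Xw + b - y||^2 with respect to w and b."""
    r = X @ w + b - y
    return (2.0 / len(y)) * (X.T @ r), 2.0 * r.mean()

def l2_penalty_step(X, y, w, b, eta, lam):
    gw, gb = mse_grad(X, y, w, b)
    return w - eta * (gw + 2.0 * lam * w), b - eta * gb

def weight_decay_step(X, y, w, b, eta, lam):
    gw, gb = mse_grad(X, y, w, b)
    return (1.0 - 2.0 * eta * lam) * w - eta * gw, b - eta * gb

# Hypothetical synthetic data, chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.7 + rng.normal(scale=0.1, size=50)
eta, lam = 0.1, 0.1

w_l2, b_l2 = np.zeros(3), 0.0
w_wd, b_wd = np.zeros(3), 0.0
for _ in range(5000):
    w_l2, b_l2 = l2_penalty_step(X, y, w_l2, b_l2, eta, lam)
    w_wd, b_wd = weight_decay_step(X, y, w_wd, b_wd, eta, lam)

w_cf, b_cf = ridge_closed_form(X, y, lam)
print("L2 penalty (GD):  ", w_l2, b_l2)
print("Weight decay (GD):", w_wd, b_wd)
print("Ridge closed form:", w_cf, b_cf)
```

With plain GD and a stable step size, all three should agree up to numerical tolerance, since the two update rules are the same expression rearranged and both descend the Ridge objective.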

Visualization

Fitted curves
The curves show model predictions on a dense grid; points are the sampled training set. Click a model to inspect its coefficients.
[Interactive panels: coefficients of the selected model; model list]

Bias-variance tradeoff

Sweeping the mean squared error (MSE) across a range of $\lambda$ values for each method shows how regularization affects a model's ability to generalize to unseen data. For high-capacity models, very weak regularization yields very low training MSE but fails to reproduce that accuracy on unseen test data: the model overfits to the observed samples and their noise.

As regularization increases, training MSE typically rises, but test MSE can improve as the model learns to capture more general patterns rather than noise. When regularization becomes too strong, the model is too inhibited to fit the training data well, and both train and test MSE rise due to underfitting.
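
A sweep of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the demo's code: the target $\sin(\pi x)$, the noise level, the sample sizes, the polynomial degree, and the helper names `poly_features`, `fit_ridge`, and `mse` are all hypothetical choices. It fits closed-form Ridge for each $\lambda$ and records train and test MSE.

```python
import numpy as np

def poly_features(x, degree):
    # Columns x^1 ... x^degree; the intercept is handled separately.
    return np.stack([x ** k for k in range(1, degree + 1)], axis=1)

def fit_ridge(X, y, lam):
    # Closed-form Ridge for (1/N)||Xw + b - y||^2 + lam * ||w||^2, intercept unpenalized.
    N, d = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + N * lam * np.eye(d), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def mse(X, y, w, b):
    return float(np.mean((X @ w + b - y) ** 2))

rng = np.random.default_rng(0)

def sample(n):
    # Assumed data-generating process: noisy sine on [-1, 1].
    x = rng.uniform(-1.0, 1.0, n)
    return x, np.sin(np.pi * x) + rng.normal(0.0, 0.3, n)

degree = 9
x_train, y_train = sample(20)
x_test, y_test = sample(500)
X_train, X_test = poly_features(x_train, degree), poly_features(x_test, degree)

lambdas = np.logspace(-6, 1, 8)
train_mse, test_mse = [], []
for lam in lambdas:
    w, b = fit_ridge(X_train, y_train, lam)
    train_mse.append(mse(X_train, y_train, w, b))
    test_mse.append(mse(X_test, y_test, w, b))

for lam, tr, te in zip(lambdas, train_mse, test_mse):
    print(f"lambda={lam:.0e}  train MSE={tr:.4f}  test MSE={te:.4f}")
```

For Ridge the training MSE cannot decrease as $\lambda$ grows; where the test MSE bottoms out depends on the particular sample, so the exact numbers will vary with the seed.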

[Interactive panels: mean squared error; bias-variance decomposition; lowest testing MSE stats]
Note
The bias-variance decomposition panel shows the expected test MSE split into the sum of bias$^2$, variance, and noise. Refer to Estimation for more on MSE.
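
One way to estimate these three terms is by Monte Carlo: refit the model on many independently drawn training sets and look at how its predictions spread around the true function. The sketch below assumes a known ground truth $\sin(\pi x)$ and noise standard deviation `sigma`, both hypothetical, and uses Ridge with an arbitrary $\lambda$; it is not necessarily how the demo computes its plot.

```python
import numpy as np

def poly_features(x, degree):
    # Columns x^1 ... x^degree; the intercept is handled separately.
    return np.stack([x ** k for k in range(1, degree + 1)], axis=1)

def fit_ridge(X, y, lam):
    # Closed-form Ridge with an unpenalized intercept (via centering).
    N, d = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + N * lam * np.eye(d), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def true_f(x):
    return np.sin(np.pi * x)        # assumed ground-truth function

rng = np.random.default_rng(1)
sigma = 0.3                         # assumed noise standard deviation
degree, lam, n_train, n_trials = 9, 1e-3, 20, 200

x_grid = np.linspace(-1.0, 1.0, 101)
X_grid = poly_features(x_grid, degree)
preds = np.empty((n_trials, x_grid.size))
for t in range(n_trials):
    x = rng.uniform(-1.0, 1.0, n_train)
    y = true_f(x) + rng.normal(0.0, sigma, n_train)
    w, b = fit_ridge(poly_features(x, degree), y, lam)
    preds[t] = X_grid @ w + b       # this trial's fitted curve on the grid

mean_pred = preds.mean(axis=0)
bias2 = float(np.mean((mean_pred - true_f(x_grid)) ** 2))
variance = float(np.mean(preds.var(axis=0)))
noise = sigma ** 2
expected_test_mse = bias2 + variance + noise
print(f"bias^2={bias2:.4f}  variance={variance:.4f}  noise={noise:.4f}")
```

Here bias$^2$ and variance are averaged over the grid, and their sum equals the average squared distance between the fitted curves and the true function; adding the noise variance gives the estimated expected test MSE.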

Observations

  • As degree increases, OLS tends to overfit (wiggly curve, large coefficients, worse test error).
  • Ridge typically shrinks coefficients smoothly, reducing variance and improving test error for a wide range of $\lambda$.
  • Lasso can drive some coefficients to exactly 0, acting like a form of feature selection.
  • The GD methods diverge if the regularization strength ($\lambda$) or the learning rate ($\eta$) is set too high.
  • When fitting the sine function with high-degree polynomials, the models generally favor odd-degree terms, mirroring its Taylor series, which contains only odd powers ($\sin x = x - \tfrac{x^3}{3!} + \tfrac{x^5}{5!} - \cdots$).
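
The exact zeros produced by Lasso come from soft-thresholding. The sketch below uses proximal gradient descent (ISTA) as an assumed solver; the demo may well use a different one, and the data and function names are hypothetical. Each step takes a gradient step on the MSE and then shrinks every coefficient toward zero by $\eta\lambda$, clipping at exactly 0.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink each entry toward 0 by t; entries with |z| <= t become exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize (1/N)||Xw + b - y||^2 + lam * ||w||_1, intercept b unpenalized."""
    N, d = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    lipschitz = 2.0 * np.linalg.eigvalsh(Xc.T @ Xc / N).max()
    eta = 1.0 / lipschitz                      # step size that guarantees descent
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / N) * (Xc.T @ (Xc @ w - yc))
        w = soft_threshold(w - eta * grad, eta * lam)
    return w, y_mean - x_mean @ w

# Hypothetical data: only the first two of five features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Smallest lambda at which every coefficient is zero for this objective.
Xc, yc = X - X.mean(axis=0), y - y.mean()
lam_max = float(np.max(np.abs((2.0 / len(y)) * (Xc.T @ yc))))

w_zero, _ = lasso_ista(X, y, 1.01 * lam_max)
w_sparse, _ = lasso_ista(X, y, 0.1 * lam_max)
print("lambda above lam_max:", w_zero)
print("lambda = 0.1 lam_max:", w_sparse)
```

Above `lam_max` every coefficient stays at exactly 0, while for smaller $\lambda$ some coefficients become nonzero and others may remain at 0, which is the feature-selection behavior described above.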

Learning rate and regularization boundaries

Both L2 penalty and weight decay require $\lambda$ and $\eta$ to stay within certain bounds for gradient descent to converge.

  • L2 minimizes $J(w) = \mathrm{MSE}(w) + \lambda \|w\|_2^2$. Its gradient is $\nabla J(w) = \nabla \mathrm{MSE}(w) + 2\lambda w$, so the update is $$w_{t+1} = w_t - \eta \nabla J(w_t) = w_t - \eta (\nabla \mathrm{MSE}(w_t) + 2\lambda w_t).$$ When $\lambda$ is large, the penalty term $2\lambda w_t$ dominates the gradient and training behaves approximately like $w_{t+1} \approx w_t - 2\eta \lambda w_t = (1 - 2\eta \lambda) w_t$. Each step then multiplies the weights by $1 - 2\eta\lambda$, so the product $\eta\lambda$ determines stability.
    • If $2\eta \lambda > 1$, the factor $1 - 2\eta\lambda$ is negative, so the sign of $w$ flips on each step.
    • If $1 < 2\eta \lambda < 2$, the iterates still converge, but they oscillate around zero while doing so, which slows convergence and can make the loss unstable.
    • If $2\eta \lambda > 2$, then $|1 - 2\eta\lambda| > 1$: each step overshoots by more than the last, and the weights diverge geometrically.
  • Weight decay behaves similarly. It uses the decoupled update rule $w_{t+1} = (1 - 2\eta \lambda) w_t - \eta \nabla \mathrm{MSE}(w_t)$ (intercept unpenalized). Expanding the L2 step shows that, for plain gradient descent, this is the same update, so when $\lambda$ or $\eta$ is too large we fall into the same $w_{t+1} \approx (1 - 2\eta \lambda) w_t$ behavior.
    • Note: The factor of 2 is a convention. It disappears from both updates if the regularization term is redefined as $\frac{\lambda}{2} \|w\|_2^2$, which gives a decay factor of $1 - \eta\lambda$.
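
The three regimes can be seen by iterating the approximate recursion $w_{t+1} = (1 - 2\eta\lambda) w_t$ on a single weight. This is a toy sketch that ignores the MSE gradient entirely; `decay_trajectory` and the chosen values of $\eta$ and $\lambda$ are illustrative assumptions.

```python
def decay_trajectory(eta, lam, w0=1.0, steps=6):
    """Iterate w <- (1 - 2*eta*lam) * w, ignoring the MSE gradient."""
    factor = 1.0 - 2.0 * eta * lam
    ws = [w0]
    for _ in range(steps):
        ws.append(factor * ws[-1])
    return ws

lam = 1.0
smooth = decay_trajectory(eta=0.25, lam=lam)       # 2*eta*lam = 0.5, factor  0.5
oscillating = decay_trajectory(eta=0.75, lam=lam)  # 2*eta*lam = 1.5, factor -0.5
diverging = decay_trajectory(eta=1.25, lam=lam)    # 2*eta*lam = 2.5, factor -1.5

for name, ws in [("smooth", smooth), ("oscillating", oscillating), ("diverging", diverging)]:
    print(f"{name:12s}", [round(w, 4) for w in ws])
```

The first trajectory shrinks monotonically, the second shrinks while alternating sign, and the third alternates sign with growing magnitude. In the full update the MSE gradient also contributes, so these thresholds are a guide to the penalty-dominated regime rather than exact convergence conditions.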