Unsupervised Supervised Learning

Overview

This demo seeks to illustrate how a supervised learning problem such as learning $p(y|x)$ can be solved by learning the joint distribution $p(x, y)$ using unsupervised density estimation, and then applying Bayes' rule to infer the posterior. We will implement two methods for estimating class-conditional densities $p(x|y)$: Kernel Density Estimation (KDE) and Gaussian Discriminant Analysis (GDA). The demo aims to provide intuition for generative modeling approaches to classification, and how they differ from direct discriminative methods.

In a supervised classification, the goal is to learn the mapping from features $x$ to labels $y$, i.e. the posterior distribution $p(y|x)$ or an equivalent decision rule for classification. This is a conditional probability of how likely each class label $y$ is given the observed features $x$. By definition, $p(y|x) = \frac{p(x,y)}{p(x)}$, so if we can estimate the joint distribution $p(x,y)$ we can compute the posterior likelihood of each class $y$ given the observed $x$ using Bayes' rule.

The joint distribution $p(x,y)$ can be decomposed as $p(x,y) = p(y) p(x|y)$, where $p(y)$ is the class prior, formed by the relative frequencies of each class in the training data, and $p(x|y)$ is the class-conditional, learned using unsupervised density estimation within each class. Once the data is split by class label $y$, we can apply any unsupervised density estimation method to learn $p(x|y)$ for each class separately. In this demo we implement two aforementioned methods: KDE, a non-parametric density estimator that places kernels at each data point and averages them; and GDA, a parametric generative model that fits a Gaussian distribution to each class. The means and covariances estimated by GDA are estimated exactly as in a purely unsupervised Gaussian model, but applied separately to each class subset of the data.

Input dataset

Paste a CSV of points $x_1, x_2, y$ where $x_1, x_2 \in \mathbb{R}$ and the class $y$ is any integer. Edits below apply automatically.

CSV data Upload CSV file

Calculations: priors and approaches

To learn $p(y)$, we compute the priors. A prior is the probability of a class before observing any data. This is computed as the fraction of samples belonging to each class, or:

$$p(y=k) = \frac{\text{num samples with y=k}}{\text{total samples}}$$

To learn $p(x|y)$, we have two approaches.

1. KDE: For a class $k$, given samples {$x_i, ...$} where $y_i = k$, we estimate the class-conditional density $p(x|y=k)$ using Kernel Density Estimation (KDE). KDE places a Gaussian kernel at each sample point and averages them to form a smooth density estimate. The bandwidth parameter $h$ sets the standard deviation of each Gaussian, controlling how fast each point's influence decays with distance. In other words, the bandwidth sets the smoothness of the estimate. The constant $d$ is the dimensionality of the input space. This estimator looks like:

$$\hat{p}(x|y=k) = \frac{1}{n_kh^d}\sum_{i:\,y_i=k} K(x; x_i, h)$$

$$K(x; x_i, h) = \frac{1}{(2\pi)^{\frac{d}{2}}} \exp\left(-\frac{\|x-x_i\|^2}{2h^2}\right)$$

Combining these two, we get a uniform mixture of Gaussians where each mean is centered at each data point and each have covariance $h^2I$. Note: KDE is non-parametric as it grows with the number of samples. This approach makes no assumption that the class forms a single cluster.

$$\hat{p}(x|y=k) = \frac{1}{n_k}\sum_{i:\,y_i=k}^{n_k} \mathcal{N}(x; x_i, h^2I)$$

2. GDA: For a class $k$, we fit a single multivariate Gaussian distribution to the samples {$x_i, ...$} where $y_i = k$. This involves estimating the mean vector $μ_k$ and covariance matrix $Σ_k$ for each class $k$ using maximum likelihood estimation:

$$\mu_k = \frac{1}{N_k}\sum_{i:\,y_i=k} x_i$$

$$\Sigma_k = \frac{1}{N_k}\sum_{i:\,y_i=k} (x_i - \mu_k)(x_i - \mu_k)^T$$

For each class $y=k$, GDA assumes that all points of class $k$ are generated from one Gaussian:

$$p(x|y=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$$

Multivariate normal density used by GDA:

$$\mathcal{N}(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}\,\exp\left(-\tfrac12 (x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

After loading data, click "Fit GDA" to compute class priors and Gaussian parameters (mean $\mu$ and covariance $\Sigma$) for each class.

Fitted parameters

Visuals: KDE and GDA

Class-conditional densities and decision regions

Click on the plot to query $p(y=k \mid x)$ and $p(x \mid y=k)$ at that point.

Query result

No query yet

Interpreting the results

After loading data and fitting GDA, the visualization panel shows the data points colored by class, along with optional overlays for the KDE density estimates and GDA Gaussian ellipses. The KDE heatmap visualizes the estimated class-conditional densities $p(x|y=k)$ for each class using Kernel Density Estimation. The GDA ellipses represent one standard deviation contours of the fitted Gaussian distributions for each class, illustrating how GDA models the data.

Clicking on the plot queries the posterior probabilities $p(y=k|x^*)$ and class-conditional densities $p(x^*|y=k)$ at the selected point $x^*$. The posterior probabilities indicate the likelihood of each class given the observed features, while the class-conditional densities show how likely the features are to appear under each class model.

Note: To see the log-space stabilization effect, reduce the bandwidth and click on a point far from the data. With a tiny bandwidth, KDE can produce extremely small likelihoods that underflow to zero in normal space, but remain finite in log-space. GDA can too, though the point would have to be much further away.

Posterior computation

Walking through the calculations for both KDE and GDA to compute the posterior probabilities $p(y|x^*)$ at a query point $x^*$.

Compute for x* = ( , )

Or click anywhere on the plot above to query a point.

No point computed yet.

Generative versus Discriminative Models

The generative approach learns the data-generating story, typically the joint distribution $p(x,y)$, then classifies by applying Bayes' rule. This differs from the discriminative approach, which directly learns the decision boundary $f(x)\to y$ or posterior $p(y\mid x)$. Other examples of generative approaches than seen here include naive Bayes, linear discriminant analysis (LDA), variational autoencoders (VAEs), and diffusion models. Examples for discriminative approaches include logistic regression, support vector machines (SVMs), and standard neural-network classifiers.

Practical implications

Assumptions: generative models often impose a distributional shape for $p(x\mid y)$ (e.g., Gaussian in GDA). If that assumption is roughly right, they can work very well; if it's wrong, not as well. Discriminative models don't need a full model of $x$, so they're usually less sensitive to "wrong density shape" assumptions.
Data efficiency: with limited labeled data, generative models can perform well because they exploit structure in $p(x\mid y)$, and learning $p(x\mid y)$ + $p(y)$ can be easier than directly learning the boundary. With lots of data, discriminative models often win on pure classification accuracy since they focus directly on optimizing the decision boundary.
Extra capabilities: generative models can score plausibility (likelihood), handle missing features more naturally, and can generate/simulate samples. Discriminative models focus on predicting $y$ from $x$.
Training objectives:
- Generative training fits parameters by maximizing the joint log-likelihood of the labeled data under a model of $p(x,y)$.
$$\max_{\theta}\ \sum_{i=1}^N \log p_{\theta}(x_i, y_i)\ =\ \sum_{i=1}^N \big[\log p_{\theta}(y_i) + \log p_{\theta}(x_i\mid y_i)\big]$$
- Discriminative training fits parameters by maximizing the conditional log-likelihood of the labeled data under a model of $p(y\mid x)$.
$$\max_{\phi}\ \sum_{i=1}^N \log p_{\phi}(y_i\mid x_i)$$