Amplitude

Amplitude is the height of a wave: how far it swings above and below its average. In a Fourier representation, the amplitude of each frequency is the magnitude of its complex coefficient.

See also: frequency, phase, discrete Fourier transform

Learn more: Fourier image decomposition

Amplitude is the wave's height above its baseline.

Approximate inference

Approximate inference estimates a posterior that has no closed form, either by sampling from it (MCMC) or by fitting a simpler distribution to it (variational inference). It trades exactness for the ability to handle models where conjugacy does not apply.

See also: MCMC, variational inference, conjugacy, posterior

Learn more: Bayesian inference

Stand in for the distribution with draws from it.

Bandwidth

Bandwidth is the smoothing parameter of kernel density estimation: the width of the kernel placed at each point. Small bandwidth gives a spiky estimate that follows the data closely; large bandwidth gives a smoother but blurrier one.

See also: kernel density estimation, kernel, density estimation

Learn more: Generative classification

Small bandwidth (spiky) vs large (smooth).

Basis

A basis is a set of independent directions in which every point can be written as one unique combination. Choosing a basis fixes the coordinates used to describe the data; PCA swaps the original basis for one aligned to the directions of greatest variance.

See also: change of basis, principal component, orthogonal

Learn more: Principal component analysis

Two directions that span the space.

Bayes' rule

Bayes' rule turns a prior and a likelihood into a posterior: $p(\theta\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\theta)\,p(\theta)}{p(\mathcal{D})}$. The denominator, the evidence, normalizes the result so the posterior integrates to one.

See also: prior, posterior, likelihood, marginal likelihood

Learn more: Bayesian inference

The posterior core sits between the prior and likelihood cores.

Bayesian inference

Bayesian inference treats unknown parameters as random and updates beliefs about them with data. Starting from a prior, Bayes' rule produces a posterior that keeps uncertainty as part of the answer rather than collapsing to a single estimate.

See also: prior, posterior, Bayes' rule, MAP

Learn more: Bayesian inference

More data, narrower belief.

Bias

The bias of an estimator is the difference between its expected value and the quantity it estimates: $\text{Bias}(\hat\theta)=\mathbb{E}[\hat\theta]-\theta$. An estimator with zero bias is centered on the target on average; a biased one is off in a systematic direction, and that offset does not wash out when repeated estimates are averaged.

See also: variance, unbiased estimator, MSE

Learn more: Estimation

Estimates land off-center: a systematic offset from the target.

Bias-variance decomposition

For squared-error prediction, the expected error splits into three nonnegative parts: $\mathbb{E}[(y-\hat f(x))^2]=\text{Bias}[\hat f(x)]^2+\text{Var}[\hat f(x)]+\sigma^2$. The first two belong to the estimator; the last is irreducible noise. Lowering bias often raises variance, so the question is which combination gives the smallest total.

See also: bias, variance, irreducible noise, MSE

Learn more: Estimation

Bias shifts the cluster off-center; variance spreads it out.

Bootstrap

The bootstrap approximates the sampling distribution of an estimator by resampling the observed data with replacement. Each resample of size $n$ is treated as a fresh dataset, the estimator is recomputed on it, and the spread of those values stands in for variability we could not otherwise see from one sample.
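
The resampling loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `bootstrap_standard_error`, the choice of 2000 resamples, and the synthetic normal data are all assumptions made for the example.

```python
import numpy as np

def bootstrap_standard_error(data, estimator, n_resamples=2000, seed=0):
    """Approximate the standard error of `estimator` by resampling with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):
        # Each resample has the same size n as the observed sample.
        resample = rng.choice(data, size=n, replace=True)
        estimates[b] = estimator(resample)
    # The spread of the recomputed estimates stands in for sampling variability.
    return estimates.std(ddof=1)

# Hypothetical data: 200 draws from a normal distribution.
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=2.0, size=200)

se_boot = bootstrap_standard_error(sample, np.mean)
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))
```

For the sample mean, `se_boot` should land close to the textbook value `se_formula`, which makes the mean a convenient sanity check before applying the same loop to estimators with no closed-form standard error.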

See also: sampling distribution, standard error, confidence interval

Learn more: Estimation

Resample with replacement; some points recur, others drop out.

Change of basis

A change of basis rewrites the same vectors in a new coordinate system without moving the points themselves. PCA is a change of basis into the principal directions, where the axes line up with how the data actually varies.

See also: basis, principal component, eigenvector

Learn more: Principal component analysis

New axes rotate and stretch the grid; the points stay put.

Chebyshev polynomials

Chebyshev polynomials are an orthogonal basis on $[-1,1]$ under the weight $1/\sqrt{1-x^2}$, which concentrates near the endpoints. A truncated Chebyshev expansion comes close to minimizing the largest pointwise error (it is near-minimax), spreading approximation error more evenly than a truncated Legendre expansion.

See also: Legendre polynomials, orthogonal, basis

Learn more: Fourier image decomposition

Orthogonal on [-1, 1]; the minimax error equioscillates between bounds, bunching toward the endpoints.

Class-conditional density

A class-conditional density is the distribution of the features within one class, $p(x\mid y=k)$. A generative classifier estimates one per class, with KDE or a Gaussian, and combines them with the class priors through Bayes' rule.

See also: generative model, class prior, kernel density estimation, GDA

Learn more: Generative classification

One density per class, here two overlapping classes.

Class prior

The class prior $p(y=k)$ is the probability of a class before any features are seen, estimated as the fraction of training samples in that class. It weights each class-conditional density when the posterior is formed.

See also: prior, generative model, Bayes' rule, posterior

Learn more: Generative classification

Each class's share, by raw counts.

Conditional log-likelihood

When a model specifies the distribution of an output given an input, $p(y\mid x;\theta)$, the conditional log-likelihood sums $\log p(y_i\mid x_i;\theta)$ over the data. Maximizing it fits a predictive model, and for common output distributions it reduces to a familiar loss such as squared error or cross-entropy.

See also: log-likelihood, maximum likelihood, cross-entropy

Learn more: Estimation

Predicted distribution of y, centered on the fit at x.

Confidence interval

A confidence interval is a data-dependent range built so that, over repeated samples, a stated proportion of such intervals contain the true value. The guarantee is about the procedure across samples, not about any single interval, which either contains the value or does not.

See also: standard error, sampling distribution, bootstrap

Learn more: Estimation

Most intervals cover the true value; some miss.

Conjugacy

Conjugacy is when a prior and likelihood pair so that the posterior stays in the same distribution family as the prior, just with updated parameters. It makes Bayesian updating exact and closed-form, as in the Beta-Binomial and Normal-Normal models.
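
The Beta-Binomial case mentioned above makes the "same family, updated parameters" idea concrete. The sketch below assumes a Beta prior on a success probability and binomial data; the function names and the specific counts are hypothetical, chosen only for illustration.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after binomial data under a Beta(alpha, beta) prior."""
    # Conjugacy: the update is just addition of counts to the prior parameters.
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2.0, 2.0)                                   # assumed prior, centered on 0.5
posterior = beta_binomial_update(*prior, successes=7, failures=3)

prior_mean = beta_mean(*prior)                       # 0.5
posterior_mean = beta_mean(*posterior)               # 9 / 14
```

The posterior is again a Beta distribution, here Beta(9, 5), so no integration is needed to carry out the update.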

See also: prior, posterior, Bayes' rule, approximate inference

Learn more: Bayesian inference

Prior (dashed) and posterior (solid) share a family.

Consistency

An estimator is consistent if it converges to the true value as the sample size grows: $\hat\theta_n \to \theta$ in probability as $n\to\infty$. Consistency is an asymptotic promise about behavior with more data, distinct from being unbiased at any fixed sample size.

See also: convergence in probability, unbiased estimator, bias

Learn more: Estimation

Uncertainty around the truth shrinks.

Convergence in probability

A sequence of estimators converges in probability to $\theta$ if, for every tolerance $\varepsilon>0$, the chance of landing farther than $\varepsilon$ from $\theta$ goes to zero as $n$ grows: $P(|\hat\theta_n-\theta|>\varepsilon)\to 0$. It is the mode of convergence behind consistency.

See also: consistency

Learn more: Estimation

The chance of a miss bigger than ε vanishes.

Covariance matrix

The covariance matrix collects how each pair of coordinates varies together, $C=\frac{1}{n-1}X^\top X$ for centered data. It is symmetric, and its eigenvectors point along the directions of greatest variance while its eigenvalues give the variance along them.
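
A small numerical check of the formula above, written as a sketch: the synthetic data and the mixing matrix are assumptions made for the example, and `np.linalg.eigh` is used because it is NumPy's routine for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 points in 3 dimensions with correlated coordinates.
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 0.2]])
X_raw = rng.normal(size=(500, 3)) @ mixing

X = X_raw - X_raw.mean(axis=0)      # center each coordinate
n = X.shape[0]
C = X.T @ X / (n - 1)               # C = X^T X / (n - 1) for centered data

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(C)
top_direction = eigenvectors[:, -1]
variance_along_top = (X @ top_direction).var(ddof=1)
```

Here `variance_along_top` matches the largest eigenvalue, which is the sense in which the eigenvalues give the variance along each eigenvector.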

See also: variance, eigenvector, eigenvalue, principal component

Learn more: Principal component analysis

The spread and tilt of the cloud; the axes are its principal directions.

Credible interval

A credible interval is a range that contains the parameter with a stated posterior probability, say 95%. Unlike a confidence interval, it is a direct statement about the parameter given the observed data, prior, and likelihood.

See also: posterior, confidence interval, Bayesian inference

Learn more: Bayesian inference

A 95% credible interval holds 95% of the posterior mass.

Cross-entropy

Cross-entropy measures the cost of using a predicted distribution $q$ when outcomes follow $p$: $-\sum_y p(y)\log q(y)$. Minimizing it over model parameters is equivalent to maximizing the conditional log-likelihood for categorical outputs, which is why it is the standard classification loss.
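
To see why the formula reduces to a log-likelihood for categorical outputs, consider a one-hot target. The sketch below is illustrative only; the function name and the three-class probabilities are assumptions.

```python
import math

def cross_entropy(p, q):
    """-sum_y p(y) log q(y) for discrete distributions given as equal-length lists."""
    # Terms with p(y) = 0 contribute nothing, so they are skipped.
    return -sum(p_y * math.log(q_y) for p_y, q_y in zip(p, q) if p_y > 0)

p = [0.0, 1.0, 0.0]     # one-hot target: the observed class is index 1
q = [0.2, 0.7, 0.1]     # hypothetical predicted class probabilities

loss = cross_entropy(p, q)
negative_log_likelihood = -math.log(q[1])
```

With a one-hot `p`, only the observed class survives the sum, so `loss` equals the negative log of the probability the model assigned to that class.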

See also: conditional log-likelihood, likelihood, maximum likelihood

Learn more: Estimation

Per-sample loss scatters around the −log(q) trend.

Cumulative distribution function (CDF)

A cumulative distribution function gives the probability that a random variable is at most $x$, $F(x)=P(X\le x)$. It never decreases, rising from 0 to 1, and for a continuous variable it is the running integral of the density.

See also: probability density function, probability distribution, random variable

Learn more: Distribution visualizer

Probability of being at most x, rising 0 to 1.

Decision boundary

A decision boundary is the surface in feature space where a classifier switches its predicted class. For two classes it is where the posteriors are equal, $p(y{=}0\mid x)=p(y{=}1\mid x)$.

See also: generative model, discriminative model, GDA, posterior

Learn more: Generative classification

Where the predicted class flips between two groups.

Density estimation

Density estimation infers the probability distribution that unlabeled data was drawn from. Generative classification applies it within each class to model $p(x\mid y)$, using either KDE or a fitted Gaussian.

See also: kernel density estimation, Gaussian distribution, class-conditional density

Learn more: Generative classification

Smooth out the histogram into a density.

Dimensionality reduction

Dimensionality reduction represents data with fewer coordinates while preserving as much structure as possible. PCA does this by keeping only the top principal components, trading a little variance for a much smaller representation.

See also: principal component, explained variance, low-rank approximation, projection

Learn more: Principal component analysis

Same points, fewer dimensions.

Dirac delta

The Dirac delta places all probability at a single point, an idealized spike of zero width and unit area. It is the limiting case of a density concentrating onto one value.

See also: probability distribution, probability density function

Learn more: Distribution visualizer

A spike of all the mass at one value.

Discrete Fourier transform (DFT)

The discrete Fourier transform rewrites a signal or image as a sum of complex sinusoids; for an $N\times N$ image, $F[u,v]=\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f[x,y]\,e^{-i2\pi(ux+vy)/N}$. Each coefficient says how much of one frequency is present, so it is a change of basis from pixels to frequencies.
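
The formula can be evaluated directly as a pair of matrix products, which makes the change-of-basis reading explicit. This is a sketch for a square $N\times N$ array; the helper name `dft2_naive` and the random test image are assumptions, and the result is compared against NumPy's `np.fft.fft2`, which uses the same sign convention and no normalization.

```python
import numpy as np

def dft2_naive(f):
    """Direct 2D DFT of a square array: F[u, v] = sum_{x, y} f[x, y] exp(-i 2 pi (u x + v y) / N)."""
    N = f.shape[0]
    idx = np.arange(N)
    # W[u, x] = exp(-i 2 pi u x / N): each row is one complex sinusoid of the basis.
    W = np.exp(-2j * np.pi * np.outer(idx, idx) / N)
    # Apply the basis along the first index, then along the second: F = W f W^T.
    return W @ f @ W.T

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))     # hypothetical 8 x 8 "image"
F = dft2_naive(image)
```

The zero-frequency coefficient `F[0, 0]` is simply the sum of all pixels. The direct version costs on the order of $N^3$ operations here, so in practice a fast Fourier transform routine is the usual choice.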

See also: frequency, spectral energy, change of basis, low-pass filter

Learn more: Fourier image decomposition

Sines at 1, 2, 3, … cycles form the basis.

Discriminative model

A discriminative model learns the decision boundary or posterior $p(y\mid x)$ directly, without modeling how the features are generated. Logistic regression and SVMs are examples; with ample data they often beat generative models on accuracy.

See also: generative model, decision boundary, posterior, conditional log-likelihood

Learn more: Generative classification

The function splits the classes; one side is the positive class.

Eigendecomposition

An eigendecomposition writes a symmetric matrix as $C=V\Lambda V^\top$: read right to left, a rotation $V^\top$ into the eigenvector coordinates, a diagonal scaling $\Lambda$ by the eigenvalues, and a rotation $V$ back. For the covariance matrix this exposes the principal directions and the variance along each.
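
A tiny worked case, as a sketch: the $2\times 2$ matrix below is an arbitrary symmetric example chosen for illustration, and `np.linalg.eigh` is NumPy's routine for symmetric matrices.

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # arbitrary symmetric example

# eigh returns ascending eigenvalues and orthonormal eigenvectors as columns of V.
eigenvalues, V = np.linalg.eigh(C)
Lambda = np.diag(eigenvalues)

C_rebuilt = V @ Lambda @ V.T        # rotate in, scale, rotate back
```

For this matrix the eigenvalues are 1 and 3, and multiplying the three factors back together recovers `C` up to floating-point error.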

See also: eigenvector, eigenvalue, covariance matrix, SVD

Learn more: Principal component analysis

The operator stretches along its eigen-axes.

Eigenvalue

An eigenvalue is the factor by which its eigenvector is scaled when the matrix is applied: $Cv=\lambda v$. For a covariance matrix the eigenvalues are the variances along the principal directions, so larger eigenvalues mark the directions that carry more of the spread.

See also: eigenvector, covariance matrix, explained variance

Learn more: Principal component analysis

Axis lengths are the eigenvalues.

Eigenvector

An eigenvector is a direction left unchanged except for scaling when a matrix is applied: $Cv=\lambda v$. The eigenvectors of the covariance matrix are the principal directions of the data, ordered by their eigenvalues.

See also: eigenvalue, covariance matrix, principal component

Learn more: Principal component analysis

An eigenvector keeps its direction; only its length changes.

Embedding

An embedding is the low-dimensional set of coordinates a manifold method assigns to each point, chosen so that selected relationships (distances or neighborhoods) are preserved. It is the unfolded, flattened view of the data.

See also: manifold, manifold learning, dimensionality reduction

Learn more: Manifold learning

The curved surface unfolded into flat coordinates.

Estimator

An estimator is a rule that maps data to an estimate of an unknown quantity, $\hat\theta=T(\mathcal{D})$. Because the data are random, the estimator is itself a random variable, so its quality is judged by properties of its distribution such as bias, variance, and consistency.

See also: parameter, bias, variance, sampling distribution

Learn more: Estimation

A rule from data to an estimate; the inputs are random, so the output is too.

Evidence lower bound (ELBO)

The evidence lower bound is a tractable lower bound on the log-evidence used by variational inference. Maximizing it is equivalent to minimizing the KL divergence from the approximation $q$ to the true posterior.

See also: variational inference, KL divergence, marginal likelihood

Learn more: Bayesian inference

Optimization pushes the bound up toward the evidence.

Explained variance

Explained variance is the share of the data's total variance captured by a component, its eigenvalue divided by the sum of all eigenvalues. It is how PCA decides how many components are worth keeping.

See also: variance, eigenvalue, principal component, dimensionality reduction

Learn more: Principal component analysis

Each direction's length is the variance it explains.

Exponential distribution

The exponential distribution models the waiting time between events in a memoryless process. Its density peaks at zero and decays at a constant rate, so longer waits are progressively less likely.

See also: probability distribution, probability density function

Learn more: Distribution visualizer

A constant-rate decay from a peak at zero.

Feature selection

Feature selection keeps only a subset of the available inputs and discards the rest. Lasso does this implicitly by driving some coefficients to exactly zero, so the fitted model ignores those features.

See also: lasso, regularization, shrinkage

Learn more: Linear regression regularization

Some features kept (accent), others dropped.

Frequency

Frequency is how many cycles a wave completes per unit of space or time. Low frequencies vary slowly and carry smooth structure; high frequencies vary rapidly and carry fine detail and edges.

See also: amplitude, phase, Nyquist frequency, discrete Fourier transform

Learn more: Fourier image decomposition

Low frequency varies slowly; high frequency varies fast.

Function estimation

Function estimation infers a whole mapping $f$ rather than a finite set of numbers, as in regression where $y=f(x)+\varepsilon$. The unknown object is effectively infinite-dimensional, so estimates are controlled through smoothness or capacity assumptions.

See also: point estimation, parameter, irreducible noise

Learn more: Estimation

Estimate the function behind the data.

Gaussian discriminant analysis (GDA)

Gaussian discriminant analysis is a generative classifier that fits one multivariate Gaussian per class, $p(x\mid y{=}k)=\mathcal{N}(\mu_k,\Sigma_k)$, then classifies with Bayes' rule. Its per-class mean and covariance come from maximum likelihood.

See also: generative model, Gaussian distribution, class-conditional density, MLE

Learn more: Generative classification

Class contours meet at the decision boundary.

Gaussian distribution

The Gaussian (normal) distribution is the bell-shaped density set by a mean and variance, or a mean vector and covariance matrix in higher dimensions. It is the default model for continuous data and the per-class model in GDA.

See also: covariance matrix, GDA, density estimation

Learn more: Generative classification

The bell curve set by a mean and spread.

Generalization

Generalization is how well a model trained on a sample performs on unseen data from the same process. Regularization improves it by discouraging fits that chase noise in the training set.

See also: overfitting, underfitting, regularization, bias-variance decomposition

Learn more: Linear regression regularization

Test error rises again when the model overfits.

Generative model

A generative model learns the joint distribution $p(x,y)=p(y)\,p(x\mid y)$, the data-generating story, then classifies by applying Bayes' rule. Unlike a discriminative model it can also score plausibility and generate samples.

See also: discriminative model, joint distribution, class-conditional density, Bayes' rule

Learn more: Generative classification

It learns the full data distribution, so it can generate new samples like the data.

Geodesic distance

Geodesic distance is the distance measured along the manifold surface rather than straight through the surrounding space. Isomap approximates it by shortest paths along a neighborhood graph, so points are not treated as close just because the surface folds back near them.

See also: manifold, k-nearest-neighbor graph, Isomap

Learn more: Manifold learning

Geodesic distance follows the surface (solid), not the straight chord (dashed).

Gibbs phenomenon

The Gibbs phenomenon is the ringing overshoot that appears near a sharp edge when it is rebuilt from a truncated set of frequencies. The overshoot does not shrink as more terms are added; it just narrows.
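
The claim that the overshoot narrows without shrinking can be checked on a standard one-dimensional example. The sketch below assumes a unit square wave (jumping from -1 to +1 at $x=0$) and its usual sine series; the function names are hypothetical.

```python
import math

def square_wave_partial_sum(x, n_terms):
    """Partial Fourier sum of a unit square wave using the first n_terms odd harmonics."""
    return (4 / math.pi) * sum(
        math.sin((2 * m - 1) * x) / (2 * m - 1) for m in range(1, n_terms + 1)
    )

def first_peak(n_terms):
    """Location and height of the first maximum to the right of the jump at x = 0."""
    # The derivative of the partial sum vanishes first at x = pi / (2 * n_terms).
    x_peak = math.pi / (2 * n_terms)
    return x_peak, square_wave_partial_sum(x_peak, n_terms)

peaks = {n: first_peak(n) for n in (10, 100, 1000)}
```

In every case the peak height stays near 1.18, well above the true value of 1, while its location moves toward the jump as terms are added.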

See also: low-pass filter, discrete Fourier transform, frequency

Learn more: Fourier image decomposition

Rebuilding a sharp edge from limited frequencies overshoots and rings.

Gradient descent

Gradient descent minimizes a loss by repeatedly stepping in the direction of steepest decrease, $w \leftarrow w - \eta\nabla J(w)$. The step size is the learning rate; too large a step can oscillate or diverge instead of settling at the minimum.
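
The update rule and the effect of the step size can be seen on a simple quadratic. This is a minimal sketch: the loss $J(w)=\tfrac12\lVert w-t\rVert^2$, the target `t`, and the two learning rates are assumptions chosen so the behavior is easy to predict.

```python
import numpy as np

def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeat w <- w - learning_rate * grad(w) for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w

target = np.array([3.0, -1.0])          # minimizer of the assumed quadratic loss

def grad(w):
    # Gradient of J(w) = 0.5 * ||w - target||^2.
    return w - target

w_small = gradient_descent(grad, [0.0, 0.0], learning_rate=0.1, n_steps=200)
w_large = gradient_descent(grad, [0.0, 0.0], learning_rate=2.5, n_steps=50)
```

For this loss each step multiplies the distance to the minimum by $|1-\eta|$, so a rate of 0.1 contracts it while a rate of 2.5 makes the iterates flip sides and grow.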

See also: learning rate, weight decay, ridge

Learn more: Linear regression regularization

Each step moves downhill toward the minimum.

Haar wavelets

Haar wavelets are a basis of translated, dilated square pulses, so each coefficient depends only on a small local region. They give a multi-scale pyramid that captures edges at several scales, at the cost of blocky coarse reconstructions.

See also: basis, orthogonal, discrete Fourier transform

Learn more: Fourier image decomposition

Reconstructions look blocky at coarse scales.

Intrinsic dimension

The intrinsic dimension is the number of coordinates genuinely needed to describe data on its manifold, regardless of how many dimensions it is measured in. A sheet curled through 3D has intrinsic dimension two.

See also: manifold, dimensionality reduction, embedding

Learn more: Manifold learning

A 2D surface curved through 3D.

Irreducible noise

Irreducible noise is the part of an outcome that no model can explain because it does not depend on the inputs: the term $\varepsilon$ in $y=f(x)+\varepsilon$, whose variance is written $\sigma^2$. It sets a floor on achievable prediction error regardless of how good the estimator is.

See also: bias-variance decomposition, function estimation

Learn more: Estimation

Scatter remains even with the true function.

Isomap

Isomap is a manifold method that preserves geodesic distances. It builds a neighborhood graph, computes shortest-path distances on it, and then applies multidimensional scaling to place points in low dimensions while keeping those distances.

See also: geodesic distance, MDS, k-nearest-neighbor graph, manifold learning

Learn more: Manifold learning

Distance along the graph, not the straight chord.

Joint distribution

The joint distribution $p(x,y)$ gives the probability of features and label together. Factoring it as $p(y)\,p(x\mid y)$ is the basis of generative classification.

See also: generative model, class prior, class-conditional density, posterior

Learn more: Generative classification

Both variables vary together; contours show the joint density.

Kernel

A kernel is a small, smooth bump, often a Gaussian, placed on a data point in kernel density estimation. Summing the kernels over all points builds a smooth estimate of the density.

See also: kernel density estimation, bandwidth, density estimation

Learn more: Generative classification

A bump centered on one point.

Kernel density estimation (KDE)

Kernel density estimation is a non-parametric density estimate that centers a kernel on every data point and averages them, $\hat p(x)=\frac{1}{n}\sum_{i=1}^{n} K(x;x_i,h)$. The bandwidth $h$ sets the smoothness, and the method makes no assumption that the data forms a single cluster.
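
The averaging formula translates almost directly into array code. The sketch below assumes a Gaussian kernel; the function name `kde_gaussian`, the six data points, and the bandwidth of 0.4 are hypothetical choices for illustration.

```python
import numpy as np

def kde_gaussian(x, data, h):
    """Evaluate p_hat(x) = (1/n) * sum_i K(x; x_i, h) with a Gaussian kernel of width h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]      # shape (m, 1)
    data = np.asarray(data, dtype=float)[None, :]               # shape (1, n)
    # One Gaussian bump per data point, evaluated at every query point.
    kernels = np.exp(-0.5 * ((x - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    # Averaging the bumps gives the density estimate.
    return kernels.mean(axis=1)

data = np.array([-2.1, -1.9, -2.0, 1.8, 2.0, 2.2])   # two well-separated clusters
grid = np.linspace(-8.0, 8.0, 1601)
density = kde_gaussian(grid, data, h=0.4)
area = density.sum() * (grid[1] - grid[0])            # Riemann-sum check of total mass
```

The estimate has one mode per cluster and very little mass in between, which a single fitted Gaussian could not represent, and `area` comes out close to 1 as a density should.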

See also: kernel, bandwidth, density estimation, class-conditional density

Learn more: Generative classification

A bump per point; their average is the density estimate.

KL divergence

The Kullback-Leibler divergence measures how much information is lost when one distribution is used in place of another, $\mathrm{KL}(q\,\|\,p)$. It is zero when the two match and grows as they differ; variational inference minimizes it.
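
For discrete distributions the divergence is a short sum, $\mathrm{KL}(q\,\|\,p)=\sum_y q(y)\log\frac{q(y)}{p(y)}$. The sketch below uses that form; the function name and the two example distributions are assumptions for illustration.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) = sum_y q(y) * log(q(y) / p(y)) for discrete distributions as lists."""
    # Terms with q(y) = 0 contribute nothing, so they are skipped.
    return sum(q_y * math.log(q_y / p_y) for q_y, p_y in zip(q, p) if q_y > 0)

q = [0.5, 0.5]      # hypothetical approximation
p = [0.9, 0.1]      # hypothetical target

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
kl_same = kl_divergence(p, p)
```

The divergence is zero only when the two distributions match, and it is not symmetric: `kl_qp` and `kl_pq` differ, so the order of the arguments matters.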

See also: variational inference, ELBO, cross-entropy

Learn more: Bayesian inference

KL divergence measures the gap between q and the true posterior p.

k-nearest-neighbor graph (kNN)

A k-nearest-neighbor graph connects each point to its $k$ closest points. It captures the local structure of data and is the scaffold manifold methods build on, for example to estimate geodesic distances or local geometry.

See also: geodesic distance, Isomap, manifold learning

Learn more: Manifold learning

Each point links to its nearest neighbors.

Laplace distribution

The Laplace distribution is a symmetric double-sided exponential with a sharp peak and heavier tails than a Gaussian. Its negative log-density is, up to a constant, an absolute-value term, which is the L1 penalty behind lasso.

See also: Gaussian distribution, lasso, probability density function

Learn more: Distribution visualizer

A sharp central peak with long tails.

Laplacian eigenmaps

Laplacian eigenmaps embed data using the eigenvectors of the graph Laplacian of a neighborhood graph, keeping nearby points nearby. The method is spectral and closely tied to diffusion and clustering on graphs.

See also: k-nearest-neighbor graph, eigenvector, embedding, manifold learning

Learn more: Manifold learning

An eigenvector varies smoothly across the graph.

Lasso

Lasso adds an L1 penalty $\lambda\lVert w\rVert_1$ to the least-squares loss. Because the L1 penalty has corners, it drives some coefficients to exactly zero, producing a sparse model and acting as feature selection.

See also: ridge, regularization, shrinkage, feature selection

Learn more: Linear regression regularization

Lasso sends some coefficients exactly to zero.

Learning rate

The learning rate $\eta$ scales each gradient-descent step. Too small and training crawls; too large and the updates overshoot, oscillate, or diverge, especially when combined with a strong penalty.

See also: gradient descent, weight decay

Learn more: Linear regression regularization

Too small crawls, just right converges, too large overshoots.

Legendre polynomials

Legendre polynomials are an orthogonal basis on $[-1,1]$ under the unweighted inner product. A degree-$d$ expansion gives the best squared-error polynomial fit on the domain, but high-degree fits diverge just outside it (Runge's phenomenon).

See also: Chebyshev polynomials, Runge's phenomenon, orthogonal

Learn more: Fourier image decomposition

Orthogonal inside [-1,1]; high degrees blow up just outside.

Likelihood

The likelihood is the probability of the observed data read as a function of the parameters, $L(\theta)=p(\mathcal{D};\theta)$. It is not a distribution over $\theta$; it ranks parameter values by how well they would have produced the data we saw.

See also: log-likelihood, maximum likelihood

Learn more: Estimation

Likelihood as a function of the parameter.

Linear regression

Linear regression fits a linear function of the inputs to a continuous target by minimizing squared error, $\min_w \lVert Xw-y\rVert^2$. Its coefficients have a closed form, and it is the base model that regularization modifies.

See also: ordinary least squares, ridge, lasso, MSE

Learn more: Linear regression regularization

Minimize the squared residuals.

Locally linear embedding (LLE)

Locally linear embedding represents each point as a weighted combination of its neighbors, then finds low-dimensional coordinates that preserve those same local weights. It captures local geometry without needing global distances.

See also: manifold learning, k-nearest-neighbor graph, embedding, Laplacian eigenmaps

Learn more: Manifold learning

Each point is a blend of its neighbors.

Log-likelihood

The log-likelihood is the logarithm of the likelihood, $\ell(\theta)=\log L(\theta)$. The log turns products over independent observations into sums, which is numerically stable and easier to differentiate, while preserving where the maximum sits.
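
The numerical-stability point is easy to demonstrate. The sketch below assumes a unit-variance normal model and 2000 simulated observations; the helper names, the seed, and the three candidate means are hypothetical.

```python
import math
import random

LOG_NORMALIZER = -0.5 * math.log(2 * math.pi)   # log of 1 / sqrt(2 pi) for unit variance

def normal_logpdf(x, mu):
    """Log-density of a unit-variance normal distribution at x."""
    return LOG_NORMALIZER - 0.5 * (x - mu) ** 2

def likelihood(data, mu):
    # Product of densities over independent observations.
    product = 1.0
    for x in data:
        product *= math.exp(normal_logpdf(x, mu))
    return product

def log_likelihood(data, mu):
    # The log turns the product into a sum.
    return sum(normal_logpdf(x, mu) for x in data)

rng = random.Random(0)
data = [rng.gauss(0.5, 1.0) for _ in range(2000)]   # simulated data, true mean 0.5

candidates = [0.0, 0.5, 1.0]
raw = [likelihood(data, mu) for mu in candidates]
logs = [log_likelihood(data, mu) for mu in candidates]
best = candidates[logs.index(max(logs))]
```

Every entry of `raw` underflows to zero in double precision, so the raw likelihood cannot rank the candidates, while the summed log-likelihoods remain finite and still pick out the candidate nearest the sample mean.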

See also: likelihood, maximum likelihood, conditional log-likelihood

Learn more: Estimation

Same maximizer as the likelihood, easier to work with.

Low-pass filter

A low-pass filter keeps the low-frequency coefficients and discards the high ones, producing a smoothed version of the signal or image. Applied to non-periodic images it can introduce Gibbs ringing at the edges.

See also: frequency, discrete Fourier transform, Gibbs phenomenon, spectral energy

Learn more: Fourier image decomposition

Keep the low frequencies inside the cutoff; drop the rest.

Low-rank approximation

A low-rank approximation rebuilds a matrix from only its largest components, $X_{\text{rank-}k}=U_k\Sigma_k V_k^\top$. Keeping the top $k$ singular values gives the closest rank-$k$ matrix under squared error, which is how PCA compresses data.
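
Truncating the SVD takes only a few lines with NumPy. This is a sketch on a small random matrix; the function name `low_rank_approximation` and the choice $k=2$ are assumptions made for the example.

```python
import numpy as np

def low_rank_approximation(X, k):
    """Rebuild X from its k largest singular values: U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values come sorted, largest first
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # hypothetical data matrix
k = 2

X_k = low_rank_approximation(X, k)
singular_values = np.linalg.svd(X, compute_uv=False)
squared_error = np.linalg.norm(X - X_k) ** 2           # squared Frobenius norm
```

The squared error of the truncation equals the sum of the squared singular values that were dropped, which is what makes the singular values a direct guide to how much each extra component buys.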

See also: SVD, dimensionality reduction, reconstruction, explained variance

Learn more: Principal component analysis

A few components rebuild most of the matrix.

Manifold

A manifold is a surface that looks flat up close but can curve through a higher-dimensional space, like a sheet rolled into a spiral. Manifold learning assumes data lies on such a low-dimensional surface embedded in many measured dimensions.

See also: manifold learning, embedding, intrinsic dimension, geodesic distance

Learn more: Manifold learning

Data lies on a low-dimensional surface that curves through space.

Manifold learning

Manifold learning recovers the low-dimensional surface that data lies on and lays it out in few coordinates. Methods like Isomap, locally linear embedding, and Laplacian eigenmaps differ in how they preserve local or global structure while unfolding it.

See also: manifold, Isomap, embedding, dimensionality reduction

Learn more: Manifold learning

Find the surface and flatten it.

Marginal likelihood (evidence)

The marginal likelihood, or evidence, is the probability of the data averaged over the prior, $p(\mathcal{D})=\int p(\mathcal{D}\mid\theta)p(\theta)\,d\theta$. It normalizes Bayes' rule and is often the hard part to compute, which motivates approximate inference.

See also: Bayes' rule, posterior, approximate inference, ELBO

Learn more: Bayesian inference

Evidence $p(\mathcal{D})$ is the total area under the product.

Markov chain Monte Carlo (MCMC)

Markov chain Monte Carlo approximates a posterior by building a Markov chain whose stationary distribution is that posterior. After enough steps its samples behave like posterior draws, targeting the true posterior at the cost of computation and mixing diagnostics.

See also: approximate inference, variational inference, posterior

Learn more: Bayesian inference

The chain lingers where the posterior is high.

Maximum a posteriori (MAP)

Maximum a posteriori estimation picks the parameter at the peak of the posterior, $\hat\theta_{\mathrm{MAP}}=\arg\max_\theta p(\theta\mid\mathcal{D})$. Taking logs, it maximizes the log-likelihood plus a log-prior term, so the prior acts as a regularizer; ridge regression is the MAP estimate for linear regression under a Gaussian prior on the weights.
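
The link between MAP and ridge can be checked numerically. The sketch below assumes a linear model with Gaussian noise of variance `sigma2` and an independent zero-mean Gaussian prior of variance `tau2` on each weight, in which case the penalty strength works out to `sigma2 / tau2`; the synthetic data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.09, 1.0                      # assumed noise and prior variances
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=50)

lam = sigma2 / tau2                           # ridge penalty implied by the prior
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # ridge closed form
w_mle = np.linalg.solve(X.T @ X, X.T @ y)                     # no prior: least squares

def neg_log_posterior(w):
    """Negative log-likelihood plus negative log-prior, dropping constants."""
    return np.sum((X @ w - y) ** 2) / (2 * sigma2) + np.sum(w ** 2) / (2 * tau2)

gradient_at_map = X.T @ (X @ w_map - y) / sigma2 + w_map / tau2
```

The gradient of the negative log-posterior vanishes at `w_map`, confirming that the ridge solution sits at the posterior peak, and the prior pulls `w_map` slightly toward zero relative to `w_mle`.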

See also: MLE, posterior, prior, ridge

Learn more: Bayesian inference

MAP is the parameter at the posterior's peak.

Maximum likelihood estimation (MLE)

Maximum likelihood estimation chooses the parameter that makes the observed data most probable: $\hat\theta=\arg\max_\theta \ell(\theta)$. It is consistent under broad conditions and connects many familiar losses to a single principle of fitting by probability.

See also: likelihood, log-likelihood, cross-entropy, consistency

Learn more: Estimation

The estimate is the parameter at the peak.

Mean squared error (MSE)

Mean squared error averages squared deviations of an estimate from the truth, $\mathbb{E}[(\hat\theta-\theta)^2]$, and decomposes exactly into $\text{Bias}(\hat\theta)^2+\text{Var}(\hat\theta)$. It rewards estimators that are both centered and stable, and penalizes large misses heavily.
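
The decomposition can be verified by simulation. The sketch below uses a deliberately biased estimator, the sample variance that divides by $n$ rather than $n-1$; the true variance of 4, the sample size of 10, and the number of trials are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, n_trials = 10, 20000

# Each row is one simulated dataset of size n.
samples = rng.normal(0.0, np.sqrt(true_var), size=(n_trials, n))
estimates = samples.var(axis=1, ddof=0)       # divides by n, so it is biased downward

bias = estimates.mean() - true_var
variance = estimates.var()                    # spread of the estimator across datasets
mse = np.mean((estimates - true_var) ** 2)
```

With empirical moments the identity holds exactly, so `mse` equals `bias**2 + variance` up to rounding, and the simulated bias comes out near the theoretical value of $-\sigma^2/n=-0.4$.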

See also: bias, variance, bias-variance decomposition

Learn more: Estimation

MSE splits into squared bias plus variance.

Mixture model

A mixture model combines several weighted component distributions into one density. It captures data with multiple clusters or modes that a single distribution cannot represent.

See also: Gaussian distribution, probability distribution, density estimation

Learn more: Distribution visualizer

Weighted components sum to one multi-modal density.

Multidimensional scaling (MDS)

Multidimensional scaling places points in a low-dimensional space so their pairwise distances match a given set of target distances as closely as possible. It is the final step of Isomap and a manifold method in its own right.

See also: Isomap, embedding, dimensionality reduction

Learn more: Manifold learning

Recover a layout that matches the distances.

Nyquist frequency

The Nyquist frequency is the highest frequency a sampled signal can represent, one cycle per two samples. Anything faster cannot be resolved and masquerades as a lower frequency (aliasing).
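
Aliasing can be shown with two cosines sampled at the same rate. The sketch below assumes 8 samples per unit time, so the Nyquist frequency is 4; the frequencies 3 and 5 are chosen because 5 lies above the limit and mirrors to 8 - 5 = 3.

```python
import math

sample_rate = 8                 # samples per unit time (assumed for the example)
nyquist = sample_rate / 2       # one cycle per two samples
f_low, f_high = 3, 5            # below and above the Nyquist frequency

def sample_cosine(frequency, n_samples=16):
    """Sample cos(2 pi f t) at t = 0, 1/sample_rate, 2/sample_rate, ..."""
    return [math.cos(2 * math.pi * frequency * n / sample_rate) for n in range(n_samples)]

samples_low = sample_cosine(f_low)
samples_high = sample_cosine(f_high)
max_difference = max(abs(a - b) for a, b in zip(samples_low, samples_high))
```

The two sample sequences agree to floating-point precision, so once sampled the faster wave is indistinguishable from the slower one.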

See also: frequency, discrete Fourier transform

Learn more: Fourier image decomposition

The fastest wave resolvable: two samples per cycle.

Ordinary least squares (OLS)

Ordinary least squares is linear regression with no penalty, choosing the coefficients that minimize the mean squared error. With many features it can overfit, fitting noise as if it were signal.

See also: linear regression, overfitting, ridge, MSE

Learn more: Linear regression regularization

Minimize the total area of the residual squares.

Orthogonal

Two directions are orthogonal if they meet at a right angle and share no common component. PCA's principal directions are mutually orthogonal, so each captures variance the others do not.

See also: principal component, basis, eigenvector

Learn more: Principal component analysis

Their dot product is zero: $u\cdot v=0$.

Overfitting

Overfitting is when a model fits the training data too closely, capturing noise rather than the underlying pattern, so it does poorly on new data. High-capacity models such as high-degree polynomials overfit unless they are regularized.

See also: underfitting, generalization, regularization, variance

Learn more: Linear regression regularization

Overfitting chases every point instead of the trend.

Parameter

A parameter is the fixed but unknown quantity a model is built around, such as a mean $\mu$ or a coefficient vector $\theta\in\mathbb{R}^p$. Estimation is the task of recovering it from data; it is the target, not a function of the data.

See also: estimator, point estimation, function estimation

Learn more: Estimation

A fixed unknown value to recover.

Periodicity

Periodicity is the property of repeating at a fixed interval. The Fourier basis is built from periodic waves, so a reconstruction treats the image as one tile of an infinitely repeating pattern.

See also: frequency, discrete Fourier transform, Gibbs phenomenon

Learn more: Fourier image decomposition

The same shape repeats every period T.

Phase

Phase is where a wave sits in its cycle, its horizontal shift. In a Fourier coefficient the phase is the argument of the complex number while the amplitude is its magnitude; both are needed to place the wave.

See also: amplitude, frequency, discrete Fourier transform

Learn more: Fourier image decomposition

A horizontal shift of the same wave.
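The magnitude and argument can be read off a coefficient with `np.abs` and `np.angle`. The sketch below assumes a 1D signal and NumPy's unnormalized FFT convention; the length, frequency index, amplitude, and phase are illustrative values.

```python
import numpy as np

N = 64                      # number of samples (assumed)
n = np.arange(N)
k = 5                       # frequency index: 5 cycles across the window
true_amplitude = 2.0
true_phase = 0.7            # radians

signal = true_amplitude * np.cos(2.0 * np.pi * k * n / N + true_phase)

F = np.fft.fft(signal)
coef = F[k]

# A real cosine splits its amplitude between bins k and N-k, and NumPy's
# forward FFT is unnormalized, hence the factor 2/N.
recovered_amplitude = 2.0 * np.abs(coef) / N
recovered_phase = np.angle(coef)
```

Dropping either quantity loses information: the magnitude alone gives the wave's height but not where its peaks fall.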

Point estimation

Point estimation produces a single best value for a finite-dimensional unknown, $\hat\theta=T(\mathcal{D})\in\mathbb{R}^p$, as opposed to an interval or a whole function. It is the most direct estimation task and the setting for bias, variance, and MSE.

See also: parameter, estimator, confidence interval, function estimation

Learn more: Estimation

Collapse the data to a single value.

Posterior

The posterior is the updated belief about a parameter after seeing data, $p(\theta\mid\mathcal{D})$. Bayes' rule forms it by combining the prior with the likelihood; it is a compromise that moves from the prior toward the likelihood as data accumulates.

See also: prior, likelihood, Bayes' rule, credible interval

Learn more: Bayesian inference

Data updates the prior (dashed) into a sharper posterior.

Posterior predictive

The posterior predictive distribution predicts new data by averaging the likelihood over the posterior, $p(x_{\text{new}}\mid\mathcal{D})=\int p(x_{\text{new}}\mid\theta)p(\theta\mid\mathcal{D})\,d\theta$. It accounts for parameter uncertainty instead of plugging in a single estimate.

See also: posterior, Bayesian inference, marginal likelihood

Learn more: Bayesian inference

Averaging candidates widens the prediction.
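To make the integral concrete, here is a sketch for a coin-flip model, assuming a Bernoulli likelihood and a uniform Beta(1, 1) prior (neither comes from the text above). The integral is approximated on a grid and compared with the known closed form for this conjugate pair.

```python
import numpy as np

heads, tails = 7, 3      # illustrative data: 7 heads in 10 flips
a, b = 1.0, 1.0          # assumed uniform Beta(1, 1) prior

# Unnormalized posterior on a grid of candidate theta values.
theta = np.linspace(0.0, 1.0, 100001)
unnormalized = theta ** (a + heads - 1) * (1.0 - theta) ** (b + tails - 1)
posterior_weights = unnormalized / np.sum(unnormalized)

# Posterior predictive P(next flip is heads | data): average the
# likelihood p(heads | theta) = theta over the posterior.
p_next_heads = float(np.sum(theta * posterior_weights))

# Plug-in prediction from the single maximum-likelihood estimate.
plug_in = heads / (heads + tails)

# Closed form for the Beta-Bernoulli pair, for comparison.
closed_form = (a + heads) / (a + b + heads + tails)
```

The averaged prediction is about 0.667, less extreme than the plug-in 0.7, because it still gives weight to parameter values the data have not ruled out.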

Precision

Precision is the inverse of variance. In Gaussian Bayesian updates, the posterior mean is a precision-weighted average of the prior mean and the data: sources with smaller variance (higher precision) get more weight.

See also: variance, posterior, prior

Learn more: Bayesian inference

More precision, narrower spread.
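The precision-weighted average can be written out for the simplest case: a Gaussian prior on a mean, with Gaussian observations of known noise variance. All numbers below are illustrative assumptions.

```python
import numpy as np

prior_mean, prior_var = 0.0, 4.0    # assumed Gaussian prior on the mean
noise_var = 1.0                     # assumed known observation variance
data = np.array([2.1, 1.9, 2.4, 1.6])
n = data.size

prior_precision = 1.0 / prior_var   # 0.25
data_precision = n / noise_var      # 4.0: precision of the sample mean

# Precisions add; the posterior mean weights each source by its precision.
post_precision = prior_precision + data_precision
post_mean = (prior_precision * prior_mean
             + data_precision * data.mean()) / post_precision
post_var = 1.0 / post_precision
```

Here the data carry sixteen times the prior's precision, so the posterior mean (about 1.88) sits much closer to the sample mean of 2.0 than to the prior mean of 0.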

Principal component

A principal component is one of the orthogonal directions PCA finds, ordered so the first captures the most variance, the second the most of what remains, and so on. They are the eigenvectors of the covariance matrix, equivalently the right singular vectors of the centered data matrix.

See also: principal component analysis, eigenvector, explained variance, covariance matrix

Learn more: Principal component analysis

The first component follows the most variance; the second, what remains.

Principal component analysis (PCA)

Principal component analysis reorganizes data around the orthogonal directions of greatest variance. Computed from the SVD or the covariance eigendecomposition, it supports visualization, dimensionality reduction, and denoising by ordering directions from most to least informative.

See also: principal component, SVD, covariance matrix, dimensionality reduction

Learn more: Principal component analysis

Rotate into the axes that capture the variance.

Prior

The prior $p(\theta)$ encodes what is believed about a parameter before seeing data. It can express genuine prior knowledge or act as a regularizer, and Bayes' rule updates it into the posterior.

See also: posterior, likelihood, Bayes' rule, regularization

Learn more: Bayesian inference

A flat prior expresses little prior knowledge.

Probability density function (PDF)

A probability density function gives the relative likelihood of each value of a continuous random variable. Areas under it are probabilities, and the total area is one.

See also: cumulative distribution function, probability distribution, random variable

Learn more: Distribution visualizer

Area under the curve is probability.

Probability distribution

A probability distribution describes how likely each value of a random variable is. For continuous variables it is summarized by a density (PDF) and a cumulative function (CDF).

See also: probability density function, cumulative distribution function, random variable

Learn more: Distribution visualizer

A density over a continuous range.

Projection

A projection drops each point onto a lower-dimensional subspace, keeping the part that lies within it. PCA projects data onto the span of the top principal components, the rank-$k$ projection that preserves the most variance.

See also: principal component, dimensionality reduction, low-rank approximation, orthogonal

Learn more: Principal component analysis

A point keeps only its component along the subspace.

Random variable

A random variable is a quantity whose value is set by a random process. Its behavior is captured by a probability distribution over the values it can take.

See also: probability distribution, probability density function, sampling distribution

Learn more: Distribution visualizer

Chance turns the process into a distribution.

Reconstruction

Reconstruction rebuilds data from its reduced representation by mapping the kept components back into the original coordinates. The reconstruction error is what the discarded components would have contributed, which PCA keeps small by dropping only low-variance directions.

See also: low-rank approximation, dimensionality reduction, explained variance

Learn more: Principal component analysis

The rebuilt signal (solid) tracks the original (faint).
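The following sketch projects synthetic data onto its top two principal directions and maps it back, assuming PCA is computed from the SVD of the centered data. The data, the column scales, and the choice `k = 2` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data whose four columns have very different spreads.
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
components = Vt[:k]                  # top-k principal directions (rows)
codes = Xc @ components.T            # reduced representation, shape (100, k)
X_rebuilt = codes @ components + mean

# Total squared reconstruction error versus the energy of what was dropped.
error = float(np.sum((X - X_rebuilt) ** 2))
discarded = float(np.sum(s[k:] ** 2))
```

The two quantities agree: the reconstruction error is exactly the sum of the squared singular values of the discarded directions.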

Regularization

Regularization adds a penalty on model complexity to the training objective, trading a little fit on the data for better behavior on new data. Ridge and lasso are the common penalties for linear models.

See also: ridge, lasso, overfitting, generalization, shrinkage

Learn more: Linear regression regularization

Penalizing wiggle yields a smoother fit.

Ridge

Ridge regression adds an L2 penalty $\lambda\lVert w\rVert^2$ to the least-squares loss. It shrinks all coefficients smoothly toward zero without forcing any to exactly zero, reducing variance at the cost of a little bias.

See also: lasso, shrinkage, regularization, bias-variance decomposition

Learn more: Linear regression regularization

Every coefficient shrinks, but stays nonzero.
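A closed-form ridge fit is sketched below, assuming the loss is the mean squared error plus $\lambda\lVert w\rVert^2$ and that there is no intercept. The helper name `ridge_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize mean((y - Xw)^2) + lam * ||w||^2 in closed form."""
    n, p = X.shape
    # Setting the gradient to zero gives (X^T X + n*lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 3.0]) + 0.3 * rng.normal(size=40)

# Coefficient norm for increasing penalty strength (lam = 0 is plain OLS).
lams = [0.0, 0.1, 1.0, 10.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lams]

w_heavy = ridge_fit(X, y, 10.0)
```

The coefficient norm falls as `lam` grows, yet even under the heaviest penalty here no coefficient is exactly zero.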

Runge's phenomenon

Runge's phenomenon is the large oscillation that high-degree polynomial fits develop near the edges of their domain. It is why global polynomial bases like Legendre extrapolate badly just outside the fitted region.

See also: Legendre polynomials, Chebyshev polynomials, overfitting

Learn more: Fourier image decomposition

It fits the middle but swings near the ends.
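The effect is easy to reproduce. The sketch below assumes equally spaced nodes on $[-1, 1]$ and uses the function $1/(1+25x^2)$, a common textbook choice rather than anything taken from this glossary.

```python
import numpy as np

def runge(x):
    """A smooth bump that is hard to interpolate with one global polynomial."""
    return 1.0 / (1.0 + 25.0 * x ** 2)

def max_interp_error(degree):
    """Interpolate at degree+1 equally spaced nodes; return the worst error."""
    nodes = np.linspace(-1.0, 1.0, degree + 1)
    coeffs = np.polyfit(nodes, runge(nodes), degree)
    grid = np.linspace(-1.0, 1.0, 2001)
    return float(np.max(np.abs(np.polyval(coeffs, grid) - runge(grid))))

err_low = max_interp_error(4)
err_high = max_interp_error(10)
```

Raising the degree makes the worst-case error larger rather than smaller, with the overshoot concentrated near the ends of the interval.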

Sampling distribution

The sampling distribution is the distribution of an estimator across repeated samples from the same process. Its center relates to bias and its spread to variance, so most questions about an estimator's reliability are questions about this distribution.

See also: standard error, bias, variance, bootstrap

Learn more: Estimation

The distribution of the estimate over repeated samples.

Shrinkage

Shrinkage pulls estimated coefficients toward zero, accepting a little bias to cut variance. It is the mechanism behind ridge regression and a common way to stabilize high-variance estimates.

See also: ridge, regularization, variance, bias

Learn more: Linear regression regularization

Coefficients are pulled toward zero.

Singular value decomposition (SVD)

The singular value decomposition factors any matrix as $X=U\Sigma V^\top$: orthonormal singular vectors in the columns of $U$ and $V$, and nonnegative singular values on the diagonal of $\Sigma$. For centered data with $n$ rows it yields PCA directly, with the right singular vectors as principal directions and the squared singular values, divided by $n-1$, as variances.

See also: principal component, covariance matrix, eigendecomposition, low-rank approximation

Learn more: Principal component analysis

X = U Σ Vᵀ
Orthogonal U and V with diagonal Σ.
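The link to PCA can be checked numerically. This sketch assumes variances use the $n-1$ denominator (the default of `np.cov`) and uses synthetic data; it compares the SVD of the centered data with the eigendecomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: 200 rows, 3 columns.
mixing = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: SVD of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_from_svd = s ** 2 / (n - 1)

# Route 2: eigendecomposition of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

# Directions agree up to sign, so compare absolute dot products.
alignment = np.abs(np.sum(Vt.T * eigvecs, axis=0))
```

Both routes give the same variances and the same directions; the SVD route avoids forming the covariance matrix explicitly.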

Spectral energy

Spectral energy is the squared magnitude of a frequency coefficient, $|F[u,v]|^2$, measuring how much of a signal's total energy sits in that one wave. Smooth images concentrate it in low frequencies; sharp ones spread it into high frequencies.

See also: frequency, discrete Fourier transform, low-pass filter

Learn more: Fourier image decomposition

Energy concentrates at the low-frequency center.
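That the per-coefficient energies add up to the signal's total energy (Parseval's theorem) can be verified in a few lines. The sketch assumes NumPy's unnormalized `fft2`, so the spectral sum must be divided by the number of pixels; the random 8-by-8 image is a stand-in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))          # stand-in for a grayscale image

F = np.fft.fft2(image)
energy = np.abs(F) ** 2             # spectral energy of each coefficient

total_spatial = float(np.sum(image ** 2))
total_spectral = float(np.sum(energy) / image.size)

# The zero-frequency (DC) coefficient is the sum of all pixel values.
dc_energy = float(energy[0, 0])
```

Note that `np.fft.fft2` places the zero-frequency term at index `[0, 0]`; `np.fft.fftshift` moves it to the center for display.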

Standard error

The standard error is the standard deviation of an estimator's sampling distribution, $\text{SE}(\hat\theta)=\sqrt{\text{Var}(\hat\theta)}$. It reports how much an estimate would wobble from sample to sample and sets the scale for confidence intervals.

See also: sampling distribution, variance, confidence interval

Learn more: Estimation

The spread of the sampling distribution.
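For the sample mean, the standard error has the familiar form $\sigma/\sqrt{n}$. The simulation below is a sketch under assumed values (normal data, $\sigma = 2$, $n = 25$) that approximates the sampling distribution by drawing many samples and compares its spread with that formula.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 2.0        # assumed population standard deviation
n = 25             # sample size
repeats = 20000    # number of simulated samples

# Each row is one sample; each row mean is one draw of the estimator.
samples = rng.normal(loc=10.0, scale=sigma, size=(repeats, n))
sample_means = samples.mean(axis=1)

# Spread of the simulated sampling distribution versus the formula.
empirical_se = float(sample_means.std(ddof=1))
theoretical_se = sigma / np.sqrt(n)    # 0.4
```

In practice only one sample is available, so $\sigma$ is replaced by the sample standard deviation, or the sampling distribution is approximated with the bootstrap.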

Swiss roll

The swiss roll is a standard test dataset: a 2D sheet rolled into a spiral through 3D. It is useful because its true intrinsic structure is known, so a method's unrolled embedding can be judged against it.

See also: manifold, intrinsic dimension, embedding

Learn more: Manifold learning

A sheet coiled into a spiral.

Unbiased estimator

An unbiased estimator has expected value equal to the quantity it estimates, $\mathbb{E}[\hat\theta]=\theta$, so it is centered on the target on average. Unbiasedness alone does not guarantee accuracy; a high-variance unbiased estimator can still miss badly on any single sample.

See also: bias, variance, MSE, consistency

Learn more: Estimation

Scattered, but centered on the target on average.
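A small exact check of $\mathbb{E}[\hat\theta]=\theta$ is sketched below for the variance estimator. It assumes samples of size two drawn with replacement from a four-value population, so every possible sample can be enumerated instead of simulated; the population values are illustrative.

```python
import itertools
import numpy as np

population = np.array([1.0, 2.0, 3.0, 4.0])
true_var = float(population.var())      # population variance: 1.25

# All 16 equally likely ordered samples of size 2, drawn with replacement.
samples = np.array(list(itertools.product(population, repeat=2)))

# Expected value of each estimator = average over every possible sample.
expected_unbiased = float(samples.var(axis=1, ddof=1).mean())  # divides by n-1
expected_biased = float(samples.var(axis=1, ddof=0).mean())    # divides by n
```

Dividing by $n-1$ gives an expected value of 1.25, matching the true variance, while dividing by $n$ gives 0.625. Individual samples still range from 0 to 4.5 under the unbiased version, which is the scatter the definition warns about.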

Underfitting

Underfitting is when a model is too constrained to capture the real pattern, so both training and test error stay high. Too strong a regularization penalty is one common cause.

See also: overfitting, generalization, regularization

Learn more: Linear regression regularization

Too simple to follow the data's shape.

Uniform distribution

The uniform distribution spreads probability evenly over an interval $[a,b]$, so every value in the range is equally likely. Its density is flat at $1/(b-a)$ inside the interval and zero outside it.

See also: probability distribution, probability density function

Learn more: Distribution visualizer

Equal likelihood across the interval.

Variance

The variance of an estimator measures how much it changes from sample to sample, $\text{Var}(\hat\theta)=\mathbb{E}[(\hat\theta-\mathbb{E}[\hat\theta])^2]$. High variance means the estimate is sensitive to the particular data drawn, even if it is correct on average.

See also: bias, standard error, MSE, bias-variance decomposition

Learn more: Estimation

Estimates scatter widely from sample to sample.

Variational inference

Variational inference approximates a posterior by choosing the closest member of a tractable family $q(\theta)$, turning inference into optimization. It maximizes the ELBO (equivalently, minimizes the KL divergence from $q$ to the posterior), trading exactness for speed and scale.

See also: MCMC, ELBO, KL divergence, approximate inference

Learn more: Bayesian inference

Optimization selects the best-fitting candidate.

Weight decay

Weight decay multiplies the weights by a factor slightly less than one on each gradient step, $w \leftarrow (1-2\eta\lambda)w - \eta\nabla\,\text{MSE}$. With plain gradient descent this is the same as a gradient step on the L2-penalized loss $\text{MSE}+\lambda\lVert w\rVert^2$; it pulls weights toward zero as training proceeds.

See also: ridge, gradient descent, learning rate, shrinkage

Learn more: Linear regression regularization

Weights are nudged toward zero each step.
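The update rule can be compared with an explicit L2 penalty in a short sketch. It assumes plain gradient descent on a linear model with mean squared error; the data, learning rate `eta`, and penalty strength `lam` are illustrative.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean((X w - y)^2) with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

eta, lam = 0.05, 0.1
w = rng.normal(size=3)
g = mse_grad(w, X, y)

# (a) One gradient step on the penalized loss MSE + lam * ||w||^2.
step_penalty = w - eta * (g + 2.0 * lam * w)

# (b) Weight decay: shrink the weights, then step on the MSE alone.
step_decay = (1.0 - 2.0 * eta * lam) * w - eta * g
```

The two updates coincide here. With optimizers that rescale the gradient, such as adaptive methods, the penalty form and the decay form generally stop being identical.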