Amplitude
Amplitude is the height of a wave, how far it swings above and below its average. In a Fourier representation, the amplitude of each frequency is the magnitude of its complex coefficient.
See also: frequency, phase, discrete Fourier transform
Learn more: Fourier image decomposition
Approximate inference
Approximate inference estimates a posterior that has no closed form, either by sampling from it (MCMC) or by fitting a simpler distribution to it (variational inference). It trades exactness for the ability to handle models where conjugacy does not apply.
See also: MCMC, variational inference, conjugacy, posterior
Learn more: Bayesian inference
Bandwidth
Bandwidth is the smoothing parameter of kernel density estimation: the width of the kernel placed at each point. Small bandwidth gives a spiky estimate that follows the data closely; large bandwidth gives a smoother but blurrier one.
See also: kernel density estimation, kernel, density estimation
Learn more: Generative classification
Basis
A basis is a set of independent directions in which every point can be written as one unique combination. Choosing a basis fixes the coordinates used to describe the data; PCA swaps the original basis for one aligned to the directions of greatest variance.
See also: change of basis, principal component, orthogonal
Learn more: Principal component analysis
Bayes' rule
Bayes' rule turns a prior and a likelihood into a posterior: $p(\theta\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\theta)\,p(\theta)}{p(\mathcal{D})}$. The denominator, the evidence, normalizes the result so the posterior integrates to one.
See also: prior, posterior, likelihood, marginal likelihood
Learn more: Bayesian inference
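As a minimal sketch of the update described under Bayes' rule, the snippet below applies it to a finite set of hypotheses, so the evidence is a sum rather than an integral. The function name `posterior` and the coin example are illustrative assumptions, not part of the glossary.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a finite set of hypotheses.

    prior[h] is p(h); likelihood[h] is p(data | h).
    """
    # Numerator of Bayes' rule for each hypothesis.
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    # The evidence p(data) is the sum of the numerators.
    evidence = sum(unnormalized.values())
    return {h: u / evidence for h, u in unnormalized.items()}


# Hypothetical example: a coin is either fair or biased toward heads,
# and a single head is observed.
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": 0.5, "biased": 0.8}
post = posterior(prior, likelihood)
```

Here the evidence is 0.5 · 0.5 + 0.8 · 0.5 = 0.65, so the posterior probability of the biased coin is 0.4 / 0.65, roughly 0.615.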
Bayesian inference
Bayesian inference treats unknown parameters as random and updates beliefs about them with data. Starting from a prior, Bayes' rule produces a posterior that keeps uncertainty as part of the answer rather than collapsing to a single estimate.
See also: prior, posterior, Bayes' rule, MAP
Learn more: Bayesian inference
Bias
The bias of an estimator is the difference between its expected value and the quantity it estimates: $\text{Bias}(\hat\theta)=\mathbb{E}[\hat\theta]-\theta$. An estimator with zero bias is centered on the target on average; a biased one is off in a systematic direction that does not wash out as more data is averaged.
See also: variance, unbiased estimator, MSE
Learn more: Estimation
Bias-variance decomposition
For squared-error prediction, the expected error splits into three nonnegative parts: $\mathbb{E}[(y-\hat f(x))^2]=\text{Bias}[\hat f(x)]^2+\text{Var}[\hat f(x)]+\sigma^2$. The first two belong to the estimator; the last is irreducible noise. Lowering bias often raises variance, so the question is which combination gives the smallest total.
See also: bias, variance, irreducible noise, MSE
Learn more: Estimation
Bootstrap
The bootstrap approximates the sampling distribution of an estimator by resampling the observed data with replacement. Each resample of size $n$ is treated as a fresh dataset, the estimator is recomputed on it, and the spread of those values stands in for variability we could not otherwise see from one sample.
See also: sampling distribution, standard error, confidence interval
Learn more: Estimation
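The resampling loop described under Bootstrap can be sketched as follows. This is an illustration under assumptions: the helper name `bootstrap_standard_error`, the number of resamples, and the synthetic Gaussian sample are all made up for the example.

```python
import random
import statistics


def bootstrap_standard_error(data, estimator, n_resamples=2000, seed=0):
    """Estimate the standard error of `estimator` by resampling `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement and recompute the estimator.
        resample = rng.choices(data, k=n)
        estimates.append(estimator(resample))
    # The spread of the recomputed values stands in for the sampling spread.
    return statistics.stdev(estimates)


# Synthetic sample (an assumption for illustration only).
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

se_boot = bootstrap_standard_error(sample, statistics.fmean)
# Textbook standard error of the mean, for comparison.
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
```

For the sample mean the two numbers should land close together; the bootstrap earns its keep for estimators, such as a median or a ratio, that have no simple standard-error formula.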
Change of basis
A change of basis rewrites the same vectors in a new coordinate system without moving the points themselves. PCA is a change of basis into the principal directions, where the axes line up with how the data actually varies.
See also: basis, principal component, eigenvector
Learn more: Principal component analysis
Chebyshev polynomials
Chebyshev polynomials are an orthogonal basis on $[-1,1]$ under the weight $1/\sqrt{1-x^2}$, which concentrates near the endpoints. A truncated Chebyshev expansion comes close to minimizing the largest pointwise error (the near-minimax property), spreading approximation error more evenly than Legendre.
See also: Legendre polynomials, orthogonal, basis
Learn more: Fourier image decomposition
Class-conditional density
A class-conditional density is the distribution of the features within one class, $p(x\mid y=k)$. A generative classifier estimates one per class, with KDE or a Gaussian, and combines them with the class priors through Bayes' rule.
See also: generative model, class prior, kernel density estimation, GDA
Learn more: Generative classification
Class prior
The class prior $p(y=k)$ is the probability of a class before any features are seen, estimated as the fraction of training samples in that class. It weights each class-conditional density when the posterior is formed.
See also: prior, generative model, Bayes' rule, posterior
Learn more: Generative classification
Conditional log-likelihood
When a model specifies the distribution of an output given an input, $p(y\mid x;\theta)$, the conditional log-likelihood sums $\log p(y_i\mid x_i;\theta)$ over the data. Maximizing it fits a predictive model, and for common output distributions it reduces to a familiar loss such as squared error or cross-entropy.
See also: log-likelihood, maximum likelihood, cross-entropy
Learn more: Estimation
Confidence interval
A confidence interval is a data-dependent range built so that, over repeated samples, a stated proportion of such intervals contain the true value. The guarantee is about the procedure across samples, not about any single interval, which either contains the value or does not.
See also: standard error, sampling distribution, bootstrap
Learn more: Estimation
Conjugacy
Conjugacy is when a prior and likelihood pair so that the posterior stays in the same distribution family as the prior, just with updated parameters. It makes Bayesian updating exact and closed-form, as in the Beta-Binomial and Normal-Normal models.
See also: prior, posterior, Bayes' rule, approximate inference
Learn more: Bayesian inference
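For the Beta-Binomial pair named under Conjugacy, the closed-form update amounts to adding counts to the prior parameters. The sketch below assumes a Beta(2, 2) prior and 7 successes in 10 trials; both the numbers and the function names are illustrative.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    # Conjugacy: the posterior is again a Beta, with counts added in.
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior Beta(2, 2) and assumed data: 7 successes, 3 failures.
alpha_post, beta_post = beta_binomial_update(2.0, 2.0, successes=7, failures=3)
posterior_mean = beta_mean(alpha_post, beta_post)
```

The posterior is Beta(9, 5) with mean 9/14, which sits between the prior mean 0.5 and the observed frequency 0.7.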
Consistency
An estimator is consistent if it converges to the true value as the sample size grows: $\hat\theta_n \to \theta$ in probability as $n\to\infty$. Consistency is an asymptotic promise about behavior with more data, distinct from being unbiased at any fixed sample size.
See also: convergence in probability, unbiased estimator, bias
Learn more: Estimation
Convergence in probability
A sequence of estimators converges in probability to $\theta$ if, for every tolerance $\varepsilon>0$, the chance of landing farther than $\varepsilon$ from $\theta$ goes to zero as $n$ grows: $P(|\hat\theta_n-\theta|>\varepsilon)\to 0$. It is the mode of convergence behind consistency.
See also: consistency
Learn more: Estimation
Covariance matrix
The covariance matrix collects how each pair of coordinates varies together, $C=\frac{1}{n-1}X^\top X$ for centered data. It is symmetric, and its eigenvectors point along the directions of greatest variance while its eigenvalues give the variance along them.
See also: variance, eigenvector, eigenvalue, principal component
Learn more: Principal component analysis
Credible interval
A credible interval is a range that contains the parameter with a stated posterior probability, say 95%. Unlike a confidence interval, it is a direct statement about the parameter given the observed data, prior, and likelihood.
See also: posterior, confidence interval, Bayesian inference
Learn more: Bayesian inference
Cross-entropy
Cross-entropy measures the cost of using a predicted distribution $q$ when outcomes follow $p$: $-\sum_y p(y)\log q(y)$. Minimizing it over model parameters is equivalent to maximizing the conditional log-likelihood for categorical outputs, which is why it is the standard classification loss.
See also: conditional log-likelihood, likelihood, maximum likelihood
Learn more: Estimation
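The cross-entropy formula above can be evaluated directly for a one-hot target, which is the usual classification setting. The snippet is a small sketch; the two predicted distributions are invented for illustration.

```python
import math


def cross_entropy(p, q):
    """Cross-entropy -sum_y p(y) log q(y) for two discrete distributions."""
    # Terms with p(y) = 0 contribute nothing, so they are skipped.
    return -sum(p_y * math.log(q_y) for p_y, q_y in zip(p, q) if p_y > 0)


# One-hot target: the true class is the first one.
p = [1.0, 0.0, 0.0]
q_good = [0.8, 0.1, 0.1]  # confident and correct (assumed prediction)
q_bad = [0.2, 0.4, 0.4]   # puts little mass on the true class

loss_good = cross_entropy(p, q_good)
loss_bad = cross_entropy(p, q_bad)
```

With a one-hot target the loss reduces to the negative log-probability assigned to the true class, which is the per-example term of the conditional log-likelihood with its sign flipped.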
Cumulative distribution function (CDF)
A cumulative distribution function gives the probability that a random variable is at most $x$, $F(x)=P(X\le x)$. It never decreases, rising from 0 to 1, and for a continuous variable it is the running integral of the density.
See also: probability density function, probability distribution, random variable
Learn more: Distribution visualizer
Decision boundary
A decision boundary is the surface in feature space where a classifier switches its predicted class. For two classes it is where the posteriors are equal, $p(y{=}0\mid x)=p(y{=}1\mid x)$.
See also: generative model, discriminative model, GDA, posterior
Learn more: Generative classification
Density estimation
Density estimation infers the probability distribution that unlabeled data was drawn from. Generative classification applies it within each class to model $p(x\mid y)$, using either KDE or a fitted Gaussian.
See also: kernel density estimation, Gaussian distribution, class-conditional density
Learn more: Generative classification
Dimensionality reduction
Dimensionality reduction represents data with fewer coordinates while preserving as much structure as possible. PCA does this by keeping only the top principal components, trading a little variance for a much smaller representation.
See also: principal component, explained variance, low-rank approximation, projection
Learn more: Principal component analysis
Dirac delta
The Dirac delta places all probability at a single point, an idealized spike of zero width and unit area. It is the limiting case of a density concentrating onto one value.
See also: probability distribution, probability density function
Learn more: Distribution visualizer
Discrete Fourier transform (DFT)
The discrete Fourier transform rewrites a signal or image as a sum of complex sinusoids; for an $N\times N$ image, $F[u,v]=\sum_{x,y} f[x,y]\,e^{-i2\pi(ux+vy)/N}$. Each coefficient says how much of one frequency is present, so it is a change of basis from pixels to frequencies.
See also: frequency, spectral energy, change of basis, low-pass filter
Learn more: Fourier image decomposition
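To make the DFT entry concrete, here is a naive one-dimensional version of the transform, a simplification of the two-dimensional image formula given above. It is a direct $O(N^2)$ sum meant for reading, not a fast Fourier transform, and the test signal is an assumption.

```python
import cmath
import math


def dft(signal):
    """Naive 1D DFT: F[u] = sum_x f[x] * exp(-i 2 pi u x / N)."""
    n = len(signal)
    return [
        sum(signal[x] * cmath.exp(-2j * math.pi * u * x / n) for x in range(n))
        for u in range(n)
    ]


# Assumed test signal: a cosine completing 3 cycles over 16 samples.
n = 16
signal = [math.cos(2 * math.pi * 3 * x / n) for x in range(n)]

coefficients = dft(signal)
# Amplitude of each frequency is the magnitude of its complex coefficient.
amplitudes = [abs(c) for c in coefficients]
```

All the energy lands in bins 3 and 13 (the mirror of bin 3 for a real signal), each with amplitude $N/2 = 8$; every other bin is zero up to rounding.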
Discriminative model
A discriminative model learns the decision boundary or posterior $p(y\mid x)$ directly, without modeling how the features are generated. Logistic regression and SVMs are examples; with ample data they often beat generative models on accuracy.
See also: generative model, decision boundary, posterior, conditional log-likelihood
Learn more: Generative classification
Eigendecomposition
An eigendecomposition writes a symmetric matrix as $C=V\Lambda V^\top$: a rotation $V$ into its eigenvectors, a diagonal scaling $\Lambda$ by the eigenvalues, and a rotation back. For the covariance matrix this exposes the principal directions and the variance along each.
See also: eigenvector, eigenvalue, covariance matrix, SVD
Learn more: Principal component analysis
Eigenvalue
An eigenvalue is the factor by which its eigenvector is scaled when the matrix is applied: $Cv=\lambda v$. For a covariance matrix the eigenvalues are the variances along the principal directions, so larger eigenvalues mark the directions that carry more of the spread.
See also: eigenvector, covariance matrix, explained variance
Learn more: Principal component analysis
Eigenvector
An eigenvector is a direction left unchanged except for scaling when a matrix is applied: $Cv=\lambda v$. The eigenvectors of the covariance matrix are the principal directions of the data, ordered by their eigenvalues.
See also: eigenvalue, covariance matrix, principal component
Learn more: Principal component analysis
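The defining relation $Cv=\lambda v$ can be checked numerically. The sketch below uses power iteration, a technique the glossary does not mention and which is chosen here only because it fits in a few lines; it finds the leading eigenvector of a small symmetric matrix. The matrix `C` is an invented example, not a covariance computed from data.

```python
import math


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def power_iteration(m, n_steps=200):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix."""
    v = [1.0] * len(m)
    for _ in range(n_steps):
        # Apply the matrix, then rescale to unit length.
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


# Assumed 2x2 symmetric matrix standing in for a covariance matrix.
C = [[3.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(C)
Cv = mat_vec(C, v)
```

Applying `C` to `v` returns the same direction scaled by `lam`, which for this matrix is $(5+\sqrt{5})/2$.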
Embedding
An embedding is the low-dimensional set of coordinates a manifold method assigns to each point, chosen so that selected relationships (distances or neighborhoods) are preserved. It is the unfolded, flattened view of the data.
See also: manifold, manifold learning, dimensionality reduction
Learn more: Manifold learning
Estimator
An estimator is a rule that maps data to an estimate of an unknown quantity, $\hat\theta=T(\mathcal{D})$. Because the data are random, the estimator is itself a random variable, so its quality is judged by properties of its distribution such as bias, variance, and consistency.
See also: parameter, bias, variance, sampling distribution
Learn more: Estimation
Evidence lower bound (ELBO)
The evidence lower bound is a tractable lower bound on the log-evidence used by variational inference. Maximizing it is equivalent to minimizing the KL divergence from the approximation $q$ to the true posterior.
See also: variational inference, KL divergence, marginal likelihood
Learn more: Bayesian inference
Explained variance
Explained variance is the share of the data's total variance captured by a component, its eigenvalue divided by the sum of all eigenvalues. It is how PCA decides how many components are worth keeping.
See also: variance, eigenvalue, principal component, dimensionality reduction
Learn more: Principal component analysis
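The ratio described under Explained variance is a one-line computation once the eigenvalues are known. The sketch below also shows one common way to pick the number of components, by a cumulative threshold; the threshold rule, the function names, and the eigenvalues are assumptions for illustration.

```python
def explained_variance_ratio(eigenvalues):
    """Share of total variance carried by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def components_needed(eigenvalues, threshold):
    """Smallest k whose top-k eigenvalues reach `threshold` of the total.

    Assumes `eigenvalues` is sorted from largest to smallest.
    """
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running >= threshold * total:
            return k
    return len(eigenvalues)


# Assumed eigenvalues of a covariance matrix, largest first.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
ratios = explained_variance_ratio(eigenvalues)
k = components_needed(eigenvalues, threshold=0.85)
```

With these values the shares are 40%, 30%, 20%, and 10%, so three components are needed to pass 85% of the variance.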
Exponential distribution
The exponential distribution models the waiting time between events in a memoryless process. Its density peaks at zero and decays at a constant rate, so longer waits are progressively less likely.
See also: probability distribution, probability density function
Learn more: Distribution visualizer
Feature selection
Feature selection keeps only a subset of the available inputs and discards the rest. Lasso does this implicitly by driving some coefficients to exactly zero, so the fitted model ignores those features.
See also: lasso, regularization, shrinkage
Learn more: Linear regression regularization
Frequency
Frequency is how many cycles a wave completes per unit of space or time. Low frequencies vary slowly and carry smooth structure; high frequencies vary rapidly and carry fine detail and edges.
See also: amplitude, phase, Nyquist frequency, discrete Fourier transform
Learn more: Fourier image decomposition
Function estimation
Function estimation infers a whole mapping $f$ rather than a finite set of numbers, as in regression where $y=f(x)+\varepsilon$. The unknown object is effectively infinite-dimensional, so estimates are controlled through smoothness or capacity assumptions.
See also: point estimation, parameter, irreducible noise
Learn more: Estimation
Gaussian discriminant analysis (GDA)
Gaussian discriminant analysis is a generative classifier that fits one multivariate Gaussian per class, $p(x\mid y{=}k)=\mathcal{N}(\mu_k,\Sigma_k)$, then classifies with Bayes' rule. Its per-class mean and covariance come from maximum likelihood.
See also: generative model, Gaussian distribution, class-conditional density, MLE
Learn more: Generative classification
Gaussian distribution
The Gaussian (normal) distribution is the bell-shaped density set by a mean and variance, or a mean vector and covariance matrix in higher dimensions. It is the default model for continuous data and the per-class model in GDA.
See also: covariance matrix, GDA, density estimation
Learn more: Generative classification
Generalization
Generalization is how well a model trained on a sample performs on unseen data from the same process. Regularization improves it by discouraging fits that chase noise in the training set.
See also: overfitting, underfitting, regularization, bias-variance decomposition
Learn more: Linear regression regularization
Generative model
A generative model learns the joint distribution $p(x,y)=p(y)\,p(x\mid y)$, the data-generating story, then classifies by applying Bayes' rule. Unlike a discriminative model it can also score plausibility and generate samples.
See also: discriminative model, joint distribution, class-conditional density, Bayes' rule
Learn more: Generative classification
Geodesic distance
Geodesic distance is the distance measured along the manifold surface rather than straight through the surrounding space. Isomap approximates it by shortest paths along a neighborhood graph, so points are not treated as close just because the surface folds back near them.
See also: manifold, k-nearest-neighbor graph, Isomap
Learn more: Manifold learning
Gibbs phenomenon
The Gibbs phenomenon is the ringing overshoot that appears near a sharp edge when it is rebuilt from a truncated set of frequencies. The overshoot does not shrink as more terms are added; it just narrows.
See also: low-pass filter, discrete Fourier transform, frequency
Learn more: Fourier image decomposition
Gradient descent
Gradient descent minimizes a loss by repeatedly stepping in the direction of steepest decrease, $w \leftarrow w - \eta\nabla J(w)$. The step size is the learning rate; too large a step can oscillate or diverge instead of settling at the minimum.
See also: learning rate, weight decay, ridge
Learn more: Linear regression regularization
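The update rule $w \leftarrow w - \eta\nabla J(w)$ and the effect of the step size can be seen on a one-dimensional quadratic. The loss $J(w)=(w-3)^2$ and both learning rates below are assumptions picked to show one run that settles and one that diverges.

```python
def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeat w <- w - learning_rate * grad(w) for n_steps iterations."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w


def grad(w):
    # Gradient of the assumed loss J(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)


# A modest step contracts the error by a factor of 0.8 per iteration.
w_good = gradient_descent(grad, w0=0.0, learning_rate=0.1, n_steps=100)
# Too large a step flips the sign and grows the error by 1.2 per iteration.
w_bad = gradient_descent(grad, w0=0.0, learning_rate=1.1, n_steps=100)
```

For this loss each step multiplies the distance to the minimum by $1-2\eta$, so any $\eta$ above 1 makes the iterates oscillate with growing size instead of converging.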
Haar wavelets
Haar wavelets are a basis of translated, dilated square pulses, so each coefficient depends only on a small local region. They give a multi-scale pyramid that captures edges at several scales, at the cost of blocky coarse reconstructions.
See also: basis, orthogonal, discrete Fourier transform
Learn more: Fourier image decomposition
Intrinsic dimension
The intrinsic dimension is the number of coordinates genuinely needed to describe data on its manifold, regardless of how many dimensions it is measured in. A sheet curled through 3D has intrinsic dimension two.
See also: manifold, dimensionality reduction, embedding
Learn more: Manifold learning
Irreducible noise
Irreducible noise is the part of an outcome that no model can explain because it does not depend on the inputs: the term $\varepsilon$ in $y=f(x)+\varepsilon$, whose variance is written $\sigma^2$. It sets a floor on achievable prediction error regardless of how good the estimator is.
See also: bias-variance decomposition, function estimation
Learn more: Estimation
Isomap
Isomap is a manifold method that preserves geodesic distances. It builds a neighborhood graph, computes shortest-path distances on it, and then applies multidimensional scaling to place points in low dimensions while keeping those distances.
See also: geodesic distance, MDS, k-nearest-neighbor graph, manifold learning
Learn more: Manifold learning
Joint distribution
The joint distribution $p(x,y)$ gives the probability of features and label together. Factoring it as $p(y)\,p(x\mid y)$ is the basis of generative classification.
See also: generative model, class prior, class-conditional density, posterior
Learn more: Generative classification
Kernel
A kernel is a small, smooth bump, often a Gaussian, placed on a data point in kernel density estimation. Summing the kernels over all points builds a smooth estimate of the density.
See also: kernel density estimation, bandwidth, density estimation
Learn more: Generative classification
Kernel density estimation (KDE)
Kernel density estimation is a non-parametric density estimate that centers a kernel on every data point and averages them, $\hat p(x)=\frac1{n}\sum_i K(x;x_i,h)$. The bandwidth $h$ sets the smoothness, and it makes no assumption that the data forms a single cluster.
See also: kernel, bandwidth, density estimation, class-conditional density
Learn more: Generative classification
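The averaging formula in the KDE entry can be written out directly. The sketch below assumes a Gaussian kernel, one of several possible choices, and a small made-up sample with two clusters; the bandwidth of 0.5 is likewise arbitrary.

```python
import math


def gaussian_kernel(x, center, h):
    """Gaussian bump of width h centered on one data point."""
    z = (x - center) / h
    return math.exp(-0.5 * z * z) / (h * math.sqrt(2 * math.pi))


def kde(x, data, h):
    """Kernel density estimate: the average of the kernels at x."""
    return sum(gaussian_kernel(x, x_i, h) for x_i in data) / len(data)


# Assumed sample with two clusters, near -2 and near +2.
data = [-2.1, -1.9, -2.0, 1.8, 2.0, 2.3]
h = 0.5

# Riemann sum over [-10, 10] to check the estimate is a proper density.
step = 0.01
grid = [-10 + step * i for i in range(2001)]
area = sum(kde(x, data, h) for x in grid) * step
```

The estimate integrates to about one and has a peak at each cluster with a dip between them, a shape a single fitted Gaussian could not reproduce.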
KL divergence
The Kullback-Leibler divergence measures how much information is lost when one distribution is used in place of another, $\mathrm{KL}(q\,\|\,p)$. It is zero when the two match and grows as they differ; variational inference minimizes it.
See also: variational inference, ELBO, cross-entropy
Learn more: Bayesian inference
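For discrete distributions the divergence in the KL entry is a short sum, $\mathrm{KL}(q\,\|\,p)=\sum_i q_i\log(q_i/p_i)$. The sketch below uses two invented two-outcome distributions to show that it is zero for a match, positive otherwise, and not symmetric in its arguments.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as lists of probabilities."""
    # Outcomes with q_i = 0 contribute nothing to the sum.
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)


# Assumed distributions over two outcomes.
q = [0.5, 0.5]
p = [0.9, 0.1]

kl_same = kl_divergence(q, q)  # identical distributions
kl_qp = kl_divergence(q, p)    # approximately 0.511
kl_pq = kl_divergence(p, q)    # approximately 0.368
```

The asymmetry matters in practice: variational inference minimizes the divergence with the approximation $q$ in the first slot, which is the direction written in the entry above.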
k-nearest-neighbor graph (kNN)
A k-nearest-neighbor graph connects each point to its $k$ closest points. It captures the local structure of data and is the scaffold manifold methods build on, for example to estimate geodesic distances or local geometry.
See also: geodesic distance, Isomap, manifold learning
Learn more: Manifold learning
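A brute-force construction of the graph described in the kNN entry might look like the following. It compares every pair of points, which is fine for small data but not how large-scale libraries do it; the function name and the toy points are assumptions.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbors."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, nearest first.
        # Ties are broken by index because tuples sort element by element.
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in distances[:k]]
    return graph


# Assumed toy data: two groups along a line, far apart from each other.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
graph = knn_graph(points, k=2)
```

Point 3 has only one close neighbor, so its second edge reaches across the gap to point 2. Choosing $k$ too large creates such shortcuts, which is why manifold methods are sensitive to it.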
Laplace distribution
The Laplace distribution is a symmetric double-sided exponential with a sharp peak and heavier tails than a Gaussian. Its negative log-density is, up to a constant, an L1 penalty, which is why a Laplace prior on the coefficients leads to lasso.
See also: Gaussian distribution, lasso, probability density function
Learn more: Distribution visualizer
Laplacian eigenmaps
Laplacian eigenmaps embed data using the eigenvectors of the graph Laplacian of a neighborhood graph, keeping nearby points nearby. It is a spectral method closely tied to diffusion and clustering on graphs.
See also: k-nearest-neighbor graph, eigenvector, embedding, manifold learning
Learn more: Manifold learning
Lasso
Lasso adds an L1 penalty $\lambda\lVert w\rVert_1$ to the least-squares loss. Because the L1 penalty has corners, it drives some coefficients to exactly zero, producing a sparse model and acting as feature selection.
See also: ridge, regularization, shrinkage, feature selection
Learn more: Linear regression regularization
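The way an L1 penalty produces exact zeros is easiest to see with a single feature and no intercept. Assuming the objective is scaled as $\tfrac12\sum_i (w x_i - y_i)^2 + \lambda|w|$ (the factor $\tfrac12$ is a convention chosen here, not taken from the entry), the minimizer is a soft-thresholded version of the least-squares solution. The data below are invented.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_1d(xs, ys, lam):
    """Minimizer of 0.5 * sum((w*x - y)^2) + lam * |w| for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx


xs = [1.0, 2.0, 3.0, 4.0]
ys_strong = [2.1, 3.9, 6.2, 7.8]   # clearly related to x (assumed data)
ys_weak = [0.3, -0.1, 0.2, -0.1]   # barely related to x (assumed data)

w_strong = lasso_1d(xs, ys_strong, lam=10.0)  # shrunk but nonzero
w_weak = lasso_1d(xs, ys_weak, lam=10.0)      # exactly zero
```

The flat region of `soft_threshold` is the corner of the L1 penalty at work: a feature whose correlation with the target is below $\lambda$ is dropped entirely.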
Learning rate
The learning rate $\eta$ scales each gradient-descent step. Too small and training crawls; too large and the updates overshoot, oscillate, or diverge, especially when combined with a strong penalty.
See also: gradient descent, weight decay
Learn more: Linear regression regularization
Legendre polynomials
Legendre polynomials are an orthogonal basis on $[-1,1]$ under the unweighted inner product. A degree-$d$ expansion gives the best squared-error polynomial fit on the domain, but high-degree fits diverge just outside it (Runge's phenomenon).
See also: Chebyshev polynomials, Runge's phenomenon, orthogonal
Learn more: Fourier image decomposition
Likelihood
The likelihood is the probability of the observed data read as a function of the parameters, $L(\theta)=p(\mathcal{D};\theta)$. It is not a distribution over $\theta$; it ranks parameter values by how well they would have produced the data we saw.
See also: log-likelihood, maximum likelihood
Learn more: Estimation
Linear regression
Linear regression fits a linear function of the inputs to a continuous target by minimizing squared error, $\min_w \lVert Xw-y\rVert^2$. Its coefficients have a closed form, and it is the base model that regularization modifies.
See also: ordinary least squares, ridge, lasso, MSE
Learn more: Linear regression regularization
Locally linear embedding (LLE)
Locally linear embedding represents each point as a weighted combination of its neighbors, then finds low-dimensional coordinates that preserve those same local weights. It captures local geometry without needing global distances.
See also: manifold learning, k-nearest-neighbor graph, embedding, Laplacian eigenmaps
Learn more: Manifold learning
Log-likelihood
The log-likelihood is the logarithm of the likelihood, $\ell(\theta)=\log L(\theta)$. The log turns products over independent observations into sums, which is numerically stable and easier to differentiate, while preserving where the maximum sits.
See also: likelihood, maximum likelihood, conditional log-likelihood
Learn more: Estimation
Low-pass filter
A low-pass filter keeps the low-frequency coefficients and discards the high ones, producing a smoothed version of the signal or image. Applied to non-periodic images it can introduce Gibbs ringing at the edges.
See also: frequency, discrete Fourier transform, Gibbs phenomenon, spectral energy
Learn more: Fourier image decomposition
Low-rank approximation
A low-rank approximation rebuilds a matrix from only its largest components, $X_{\text{rank-}k}=U_k\Sigma_k V_k^\top$. Keeping the top $k$ singular values gives the closest rank-$k$ matrix under squared error, which is how PCA compresses data.
See also: SVD, dimensionality reduction, reconstruction, explained variance
Learn more: Principal component analysis
Manifold
A manifold is a surface that looks flat up close but can curve through a higher-dimensional space, like a sheet rolled into a spiral. Manifold learning assumes data lies on such a low-dimensional surface embedded in many measured dimensions.
See also: manifold learning, embedding, intrinsic dimension, geodesic distance
Learn more: Manifold learning
Manifold learning
Manifold learning recovers the low-dimensional surface that data lies on and lays it out in few coordinates. Methods like Isomap, locally linear embedding, and Laplacian eigenmaps differ in how they preserve local or global structure while unfolding it.
See also: manifold, Isomap, embedding, dimensionality reduction
Learn more: Manifold learning
Marginal likelihood (evidence)
The marginal likelihood, or evidence, is the probability of the data averaged over the prior, $p(\mathcal{D})=\int p(\mathcal{D}\mid\theta)p(\theta)\,d\theta$. It normalizes Bayes' rule and is often the hard part to compute, which motivates approximate inference.
See also: Bayes' rule, posterior, approximate inference, ELBO
Learn more: Bayesian inference
Markov chain Monte Carlo (MCMC)
Markov chain Monte Carlo approximates a posterior by building a Markov chain whose stationary distribution is that posterior. After enough steps its samples behave like posterior draws, targeting the true posterior at the cost of computation and mixing diagnostics.
See also: approximate inference, variational inference, posterior
Learn more: Bayesian inference
Maximum a posteriori (MAP)
Maximum a posteriori estimation picks the parameter at the peak of the posterior, $\hat\theta_{\mathrm{MAP}}=\arg\max_\theta p(\theta\mid\mathcal{D})$. In log form its objective is the log-likelihood plus a log-prior term, so the prior acts as a regularizer; ridge regression is the MAP estimate for linear regression with Gaussian noise under a zero-mean Gaussian prior on the coefficients.
See also: MLE, posterior, prior, ridge
Learn more: Bayesian inference
Maximum likelihood estimation (MLE)
Maximum likelihood estimation chooses the parameter that makes the observed data most probable: $\hat\theta=\arg\max_\theta \ell(\theta)$. It is consistent under broad conditions and connects many familiar losses to a single principle of fitting by probability.
See also: likelihood, log-likelihood, cross-entropy, consistency
Learn more: Estimation
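To illustrate the $\arg\max$ in the MLE entry, the sketch below maximizes a Bernoulli log-likelihood by brute-force search over a grid of candidate parameters. The grid search is only for illustration, since this model has a closed-form answer (the sample mean), and the coin-flip data are made up.

```python
import math


def bernoulli_log_likelihood(theta, data):
    """Sum of log p(x_i; theta) for independent 0/1 observations."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)


# Assumed data: 7 successes in 10 trials.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Candidate parameters strictly inside (0, 1) so the logs stay finite.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))
```

The search lands on 0.7, the observed fraction of successes, matching the closed-form maximum likelihood estimate for a Bernoulli parameter.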
Mean squared error (MSE)
Mean squared error averages squared deviations of an estimate from the truth, $\mathbb{E}[(\hat\theta-\theta)^2]$, and decomposes exactly into $\text{Bias}(\hat\theta)^2+\text{Var}(\hat\theta)$. It rewards estimators that are both centered and stable, and penalizes large misses heavily.
See also: bias, variance, bias-variance decomposition
Learn more: Estimation
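The decomposition in the MSE entry can be checked by simulation. The sketch below repeatedly draws samples, applies a deliberately biased estimator (a shrunk sample mean), and compares the empirical MSE with bias squared plus variance. The true value, noise level, shrink factor, and sample sizes are all assumptions.

```python
import random
import statistics

rng = random.Random(0)
theta = 5.0  # assumed true value


def shrunk_mean(sample):
    # Deliberately biased: scales the sample mean toward zero.
    return 0.8 * statistics.fmean(sample)


# Sampling distribution of the estimator over 5000 simulated datasets.
estimates = [
    shrunk_mean([rng.gauss(theta, 2.0) for _ in range(20)])
    for _ in range(5000)
]

mse = statistics.fmean((e - theta) ** 2 for e in estimates)
bias = statistics.fmean(estimates) - theta
# Population variance, so the identity holds exactly for these estimates.
variance = statistics.pvariance(estimates)
```

`mse` and `bias ** 2 + variance` agree to rounding error. The bias comes out near $-1$ (since $0.8\times5-5=-1$) and dominates the total, while the shrinkage keeps the variance small.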
Mixture model
A mixture model combines several weighted component distributions into one density. It captures data with multiple clusters or modes that a single distribution cannot represent.
See also: Gaussian distribution, probability distribution, density estimation
Learn more: Distribution visualizer
Multidimensional scaling (MDS)
Multidimensional scaling places points in a low-dimensional space so their pairwise distances match a given set of target distances as closely as possible. It is the final step of Isomap and a manifold method in its own right.
See also: Isomap, embedding, dimensionality reduction
Learn more: Manifold learning
Nyquist frequency
The Nyquist frequency is the highest frequency a sampled signal can represent, one cycle per two samples. Anything faster cannot be resolved and masquerades as a lower frequency (aliasing).
See also: frequency, discrete Fourier transform
Learn more: Fourier image decomposition
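Aliasing, as described in the Nyquist entry, is easy to reproduce: sample two cosines, one below and one above the Nyquist frequency, and compare the samples. The sampling rate of 8 and the two frequencies are assumptions chosen so the numbers stay small.

```python
import math


def sample_cosine(freq, sample_rate, n_samples):
    """Samples of cos(2 pi freq t) taken at t = k / sample_rate."""
    return [
        math.cos(2 * math.pi * freq * k / sample_rate) for k in range(n_samples)
    ]


sample_rate = 8            # samples per unit time (assumed)
nyquist = sample_rate / 2  # highest representable frequency: 4

low = sample_cosine(1, sample_rate, 16)   # below the Nyquist frequency
high = sample_cosine(7, sample_rate, 16)  # above it, at sample_rate - 1

# Largest gap between the two sampled sequences.
max_difference = max(abs(a - b) for a, b in zip(low, high))
```

The two sequences are identical up to rounding, so after sampling the frequency-7 wave cannot be told apart from the frequency-1 wave.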
Ordinary least squares (OLS)
Ordinary least squares is linear regression with no penalty, choosing the coefficients that minimize the mean squared error. With many features it can overfit, fitting noise as if it were signal.
See also: linear regression, overfitting, ridge, MSE
Learn more: Linear regression regularization
Orthogonal
Two directions are orthogonal if they meet at a right angle and share no common component. PCA's principal directions are mutually orthogonal, so each captures variance the others do not.
See also: principal component, basis, eigenvector
Learn more: Principal component analysis
Overfitting
Overfitting is when a model fits the training data too closely, capturing noise rather than the underlying pattern, so it does poorly on new data. High-capacity models such as high-degree polynomials overfit unless they are regularized.
See also: underfitting, generalization, regularization, variance
Learn more: Linear regression regularization
Parameter
A parameter is the fixed but unknown quantity a model is built around, such as a mean $\mu$ or a coefficient vector $\theta\in\mathbb{R}^p$. Estimation is the task of recovering it from data; it is the target, not a function of the data.
See also: estimator, point estimation, function estimation
Learn more: Estimation
Periodicity
Periodicity is the property of repeating at a fixed interval. The Fourier basis is built from periodic waves, so a reconstruction treats the image as one tile of an infinitely repeating pattern.
See also: frequency, discrete Fourier transform, Gibbs phenomenon
Learn more: Fourier image decomposition
Phase
Phase is where a wave sits in its cycle, its horizontal shift. In a Fourier coefficient the phase is the argument of the complex number while the amplitude is its magnitude; both are needed to place the wave.
See also: amplitude, frequency, discrete Fourier transform
Learn more: Fourier image decomposition
Point estimation
Point estimation produces a single best value for a finite-dimensional unknown, $\hat\theta=T(\mathcal{D})\in\mathbb{R}^p$, as opposed to an interval or a whole function. It is the most direct estimation task and the setting for bias, variance, and MSE.
See also: parameter, estimator, confidence interval, function estimation
Learn more: Estimation
Posterior
The posterior is the updated belief about a parameter after seeing data, $p(\theta\mid\mathcal{D})$. Bayes' rule forms it by combining the prior with the likelihood; it is a compromise that moves from the prior toward the likelihood as data accumulates.
See also: prior, likelihood, Bayes' rule, credible interval
Learn more: Bayesian inference
Posterior predictive
The posterior predictive distribution predicts new data by averaging the likelihood over the posterior, $p(x_{\text{new}}\mid\mathcal{D})=\int p(x_{\text{new}}\mid\theta)p(\theta\mid\mathcal{D})\,d\theta$. It accounts for parameter uncertainty instead of plugging in a single estimate.
See also: posterior, Bayesian inference, marginal likelihood
Learn more: Bayesian inference
Precision
Precision is the inverse of variance. In conjugate Gaussian updates such as the Normal-Normal model, the posterior mean is a precision-weighted average of the prior mean and the data: sources with smaller variance (higher precision) get more weight.
See also: variance, posterior, prior
Learn more: Bayesian inference
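The precision-weighted average can be written out for the Normal-Normal case with a known noise variance. In this sketch `data_var` stands for the variance of the data summary (for a sample mean, the noise variance divided by the sample size); the function name and the numbers are assumptions.

```python
def normal_normal_posterior(prior_mean, prior_var, data_mean, data_var):
    """Posterior mean and variance for a Gaussian prior and Gaussian data."""
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / data_var
    # Precisions add, so the posterior is sharper than either source.
    post_precision = prior_precision + data_precision
    # Each source is weighted by its share of the total precision.
    post_mean = (
        prior_precision * prior_mean + data_precision * data_mean
    ) / post_precision
    return post_mean, 1.0 / post_precision


# Assumed vague prior (variance 4) and more precise data (variance 1).
post_mean, post_var = normal_normal_posterior(
    prior_mean=0.0, prior_var=4.0, data_mean=10.0, data_var=1.0
)
```

The data carry four times the precision of the prior, so the posterior mean of 8 sits four fifths of the way from the prior mean to the data mean, and the posterior variance of 0.8 is smaller than both inputs.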
Principal component
A principal component is one of the orthogonal directions PCA finds, ordered so the first captures the most variance, the second the most of what remains, and so on. They are the eigenvectors of the covariance matrix, equivalently the right singular vectors of the centered data matrix.
See also: principal component analysis, eigenvector, explained variance, covariance matrix
Learn more: Principal component analysis
Principal component analysis (PCA)
Principal component analysis reorganizes data around the orthogonal directions of greatest variance. Computed from the SVD or the covariance eigendecomposition, it supports visualization, dimensionality reduction, and denoising by ordering directions from most to least informative.
See also: principal component, SVD, covariance matrix, dimensionality reduction
Learn more: Principal component analysis
Prior
The prior $p(\theta)$ encodes what is believed about a parameter before seeing data. It can express genuine prior knowledge or act as a regularizer, and Bayes' rule updates it into the posterior.
See also: posterior, likelihood, Bayes' rule, regularization
Learn more: Bayesian inference
Probability density function (PDF)
A probability density function gives the relative likelihood of each value of a continuous random variable. Areas under it are probabilities, and the total area is one.
See also: cumulative distribution function, probability distribution, random variable
Learn more: Distribution visualizer
Probability distribution
A probability distribution describes how likely each value of a random variable is. For continuous variables it is summarized by a density (PDF) and a cumulative function (CDF).
See also: probability density function, cumulative distribution function, random variable
Learn more: Distribution visualizer
Projection
A projection drops each point onto a lower-dimensional subspace, keeping the part that lies within it. PCA projects data onto the span of the top principal components, the rank-$k$ projection that preserves the most variance.
See also: principal component, dimensionality reduction, low-rank approximation, orthogonal
Learn more: Principal component analysis
Random variable
A random variable is a quantity whose value is set by a random process. Its behavior is captured by a probability distribution over the values it can take.
See also: probability distribution, probability density function, sampling distribution
Learn more: Distribution visualizer
Reconstruction
Reconstruction rebuilds data from its reduced representation by mapping the kept components back into the original coordinates. The reconstruction error is what the discarded components would have contributed, which PCA keeps small by dropping only low-variance directions.
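The round trip can be sketched as follows: reduce synthetic data to $k=2$ components, map it back, and compare the squared error with the energy in the discarded singular values. The data and variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * np.array([4.0, 2.0, 0.5, 0.1])
mean = X.mean(axis=0)
Xc = X - mean

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
V_k = Vt[:k].T                     # (4, k): top-k principal directions

Z = Xc @ V_k                       # reduced representation, shape (200, k)
X_hat = Z @ V_k.T + mean           # mapped back into the original coordinates

sq_error = np.sum((X - X_hat) ** 2)
discarded = np.sum(s[k:] ** 2)     # squared singular values that were dropped
```

The two quantities agree up to floating-point error, which is the sense in which the error is what the discarded components would have contributed.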
See also: low-rank approximation, dimensionality reduction, explained variance
Learn more: Principal component analysis
Regularization
Regularization adds a penalty on model complexity to the training objective, trading a little fit on the data for better behavior on new data. Ridge and lasso are the common penalties for linear models.
See also: ridge, lasso, overfitting, generalization, shrinkage
Learn more: Linear regression regularization
Ridge
Ridge regression adds an L2 penalty $\lambda\lVert w\rVert^2$ to the least-squares loss. It shrinks all coefficients smoothly toward zero without forcing any to exactly zero, reducing variance at the cost of a little bias.
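A minimal closed-form sketch, assuming a sum-of-squares loss and no intercept, so the minimizer is $w=(X^\top X+\lambda I)^{-1}X^\top y$. The function name `ridge_fit` and the synthetic data are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 (no intercept, for brevity)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, -2.0, 0.0, 1.0, 0.5])   # assumed ground truth
y = X @ w_true + rng.normal(scale=0.5, size=50)

w_ols = ridge_fit(X, y, lam=0.0)     # ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # same fit with an L2 penalty
```

Comparing `w_ridge` with `w_ols` shows a smaller overall norm but no coefficient driven exactly to zero.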
See also: lasso, shrinkage, regularization, bias-variance decomposition
Learn more: Linear regression regularization
Runge's phenomenon
Runge's phenomenon is the large oscillation that high-degree polynomial fits develop near the edges of their domain, most famously when interpolating through equally spaced points. The same edge instability is why global polynomial bases like Legendre extrapolate badly just outside the fitted region.
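The effect is easy to reproduce with Runge's classic test function $1/(1+25x^2)$ on $[-1,1]$. The sketch below, with hypothetical helper names, interpolates it with a degree-10 polynomial through equally spaced nodes and, for contrast, through Chebyshev nodes, which cluster near the edges.

```python
import numpy as np

def runge(x):
    return 1.0 / (1.0 + 25.0 * x ** 2)

def max_interp_error(nodes):
    """Interpolate runge() through `nodes`; return the worst error and where it occurs."""
    deg = len(nodes) - 1
    p = np.polynomial.Polynomial.fit(nodes, runge(nodes), deg)
    x_dense = np.linspace(-1.0, 1.0, 2001)
    err = np.abs(p(x_dense) - runge(x_dense))
    return err.max(), x_dense[err.argmax()]

n = 11
equispaced = np.linspace(-1.0, 1.0, n)
chebyshev = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))

err_equi, where_equi = max_interp_error(equispaced)
err_cheb, _ = max_interp_error(chebyshev)
```

With equally spaced nodes the worst error is large and sits close to the boundary; the Chebyshev nodes keep it much smaller.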
See also: Legendre polynomials, Chebyshev polynomials, overfitting
Learn more: Fourier image decomposition
Sampling distribution
The sampling distribution is the distribution of an estimator across repeated samples from the same process. Its center relates to bias and its spread to variance, so most questions about an estimator's reliability are questions about this distribution.
See also: standard error, bias, variance, bootstrap
Learn more: Estimation
Shrinkage
Shrinkage pulls estimated coefficients toward zero, accepting a little bias to cut variance. It is the mechanism behind ridge regression and a common way to stabilize high-variance estimates.
See also: ridge, regularization, variance, bias
Learn more: Linear regression regularization
Singular value decomposition (SVD)
The singular value decomposition factors any matrix as $X=U\Sigma V^\top$: orthonormal singular vectors in the columns of $U$ and $V$, and nonnegative singular values on the diagonal of $\Sigma$. For centered data with $n$ rows it yields PCA directly, with the right singular vectors as principal directions and the squared singular values divided by $n-1$ as variances.
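The link to PCA can be checked numerically by comparing the SVD of centered data with the eigendecomposition of its covariance matrix. This is a sketch on synthetic data; the mixing matrix is arbitrary, and the $n-1$ normalization matches the default of `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(2)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.normal(size=(300, 3)) @ mix       # correlated synthetic features
Xc = X - X.mean(axis=0)
n = len(Xc)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s ** 2 / (n - 1)

cov = np.cov(Xc, rowvar=False)            # uses the same n - 1 normalization
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
```

The variances match the eigenvalues, and the rows of `Vt` match the eigenvectors up to an arbitrary sign per direction.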
See also: principal component, covariance matrix, eigendecomposition, low-rank approximation
Learn more: Principal component analysis
Spectral energy
Spectral energy is the squared magnitude of a frequency coefficient, $|F[u,v]|^2$, measuring how much of a signal's total energy sits in that one wave. Smooth images concentrate it in low frequencies; sharp ones spread it into high frequencies.
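To make the contrast concrete, the sketch below measures what fraction of $|F[u,v]|^2$ falls within a small radius of the zero frequency. The helper name, the radius, and both test images are assumptions; white noise stands in for an image full of fine detail.

```python
import numpy as np

def low_freq_fraction(img, radius=4):
    """Fraction of spectral energy within `radius` of the zero frequency."""
    F = np.fft.fftshift(np.fft.fft2(img))     # move zero frequency to the center
    energy = np.abs(F) ** 2
    h, w = img.shape
    v, u = np.ogrid[:h, :w]
    mask = (u - w // 2) ** 2 + (v - h // 2) ** 2 <= radius ** 2
    return energy[mask].sum() / energy.sum()

n = 64
y, x = np.mgrid[:n, :n]
smooth = np.cos(2 * np.pi * x / n) + np.cos(2 * np.pi * 2 * y / n)
noisy = np.random.default_rng(0).normal(size=(n, n))

smooth_fraction = low_freq_fraction(smooth)
noisy_fraction = low_freq_fraction(noisy)
```

Nearly all of the smooth image's energy lies inside the low-frequency disc, while the noisy image leaves only a small share there.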
See also: frequency, discrete Fourier transform, low-pass filter
Learn more: Fourier image decomposition
Standard error
The standard error is the standard deviation of an estimator's sampling distribution, $\text{SE}(\hat\theta)=\sqrt{\text{Var}(\hat\theta)}$. It reports how much an estimate would wobble from sample to sample and sets the scale for confidence intervals.
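A simulation makes the definition tangible for the sample mean, whose standard error is $\sigma/\sqrt{n}$. The sketch below draws many samples from an assumed normal population and compares the spread of their means with that formula; all the numbers are illustrative choices.

```python
import random
import statistics

random.seed(0)
sigma, n, reps = 2.0, 25, 2000     # population sd, sample size, repeated samples

# One estimate (the sample mean) per repeated sample.
means = [
    statistics.fmean(random.gauss(0.0, sigma) for _ in range(n))
    for _ in range(reps)
]

empirical_se = statistics.stdev(means)   # spread of the sampling distribution
theoretical_se = sigma / n ** 0.5        # sigma / sqrt(n) = 0.4
```

The list `means` is itself a draw from the sampling distribution, so its standard deviation lands close to the theoretical value.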
See also: sampling distribution, variance, confidence interval
Learn more: Estimation
Swiss roll
The swiss roll is a standard test dataset: a 2D sheet rolled into a spiral through 3D. It is useful because its true intrinsic structure is known, so a method's unrolled embedding can be judged against it.
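A sketch of one common way to generate the dataset, returning the 2D sheet coordinates alongside the 3D points so an embedding can be compared against them. The function name and the specific constants are assumptions for illustration.

```python
import numpy as np

def make_swiss_roll(n, seed=0):
    """Return (n, 3) points on a rolled-up sheet and their (n, 2) intrinsic coordinates."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.random(n))   # position along the spiral
    h = 21 * rng.random(n)                      # position across the sheet
    X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
    return X, np.column_stack([t, h])

X, intrinsic = make_swiss_roll(500)
```

Points that are far apart along `t` can still be close in 3D where the layers of the roll pass near each other, which is what makes the dataset a useful test.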
See also: manifold, intrinsic dimension, embedding
Learn more: Manifold learning
Unbiased estimator
An unbiased estimator has expected value equal to the quantity it estimates, $\mathbb{E}[\hat\theta]=\theta$, so it is centered on the target on average. Unbiasedness alone does not guarantee accuracy; a high-variance unbiased estimator can still miss badly on any single sample.
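The textbook example is the sample variance: dividing by $n-1$ is unbiased, while dividing by $n$ underestimates by a factor of $(n-1)/n$. The simulation below is a sketch with arbitrarily chosen sample size and repetition count.

```python
import random

random.seed(0)
n, reps, true_var = 5, 20000, 1.0

def sum_sq_dev(xs):
    """Sum of squared deviations from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

avg_biased = 0.0      # running average of the divide-by-n estimator
avg_unbiased = 0.0    # running average of the divide-by-(n - 1) estimator
for _ in range(reps):
    xs = [random.gauss(0.0, true_var ** 0.5) for _ in range(n)]
    ss = sum_sq_dev(xs)
    avg_biased += ss / n / reps
    avg_unbiased += ss / (n - 1) / reps
```

Averaged over many samples, the $n-1$ version settles near the true variance of 1 and the $n$ version near 0.8, even though any single estimate from five points can be far from both.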
See also: bias, variance, MSE, consistency
Learn more: Estimation
Underfitting
Underfitting is when a model is too constrained to capture the real pattern, so both training and test error stay high. Too strong a regularization penalty is one common cause.
See also: overfitting, generalization, regularization
Learn more: Linear regression regularization
Uniform distribution
The uniform distribution spreads probability evenly over an interval $[a,b]$, so every value in the range is equally likely. Its density is a flat rectangle of height $1/(b-a)$ over the interval and zero outside it.
See also: probability distribution, probability density function
Learn more: Distribution visualizer
Variance
The variance of an estimator measures how much it changes from sample to sample, $\text{Var}(\hat\theta)=\mathbb{E}[(\hat\theta-\mathbb{E}[\hat\theta])^2]$. High variance means the estimate is sensitive to the particular data drawn, even if it is correct on average.
See also: bias, standard error, MSE, bias-variance decomposition
Learn more: Estimation
Variational inference
Variational inference approximates a posterior by choosing the closest member of a tractable family $q(\theta)$, turning inference into optimization. It maximizes the ELBO (equivalently minimizes the KL divergence to the posterior), trading exactness for speed and scale.
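The equivalence between the two objectives follows from a standard identity, written here with $\mathcal{D}$ for the data and $q(\theta)$ for the approximating distribution:

$$
\log p(\mathcal{D}) \;=\; \underbrace{\mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D},\theta) - \log q(\theta)\big]}_{\text{ELBO}(q)} \;+\; \mathrm{KL}\big(q(\theta)\,\|\,p(\theta\mid\mathcal{D})\big)
$$

The left side does not depend on $q$, so raising the ELBO must lower the KL term by the same amount. Because the KL divergence is nonnegative, the ELBO is also a lower bound on the log evidence.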
See also: MCMC, ELBO, KL divergence, approximate inference
Learn more: Bayesian inference
Weight decay
Weight decay multiplies the weights by a factor slightly less than one on each gradient step, $w \leftarrow (1-2\eta\lambda)w - \eta\nabla\,\text{MSE}$, where $\eta$ is the learning rate. For plain gradient descent this is exactly the step taken on the L2-penalized loss $\text{MSE}+\lambda\lVert w\rVert^2$, so it pulls weights toward zero as training proceeds.
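The relationship to the L2 penalty can be checked directly for plain gradient descent on a linear model. In the sketch below the data, learning rate, and function names are assumptions; one update uses the weight-decay form and the other takes a gradient step on $\text{MSE}+\lambda\lVert w\rVert^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
eta, lam = 0.05, 0.1                     # learning rate and penalty strength

def grad_mse(w):
    return 2.0 / len(y) * X.T @ (X @ w - y)

def step_weight_decay(w):
    # Shrink the weights first, then take the usual MSE gradient step.
    return (1 - 2 * eta * lam) * w - eta * grad_mse(w)

def step_l2_penalty(w):
    # The gradient of MSE(w) + lam * ||w||^2 is grad_mse(w) + 2 * lam * w.
    return w - eta * (grad_mse(w) + 2 * lam * w)

w_decay = np.zeros(3)
w_penalty = np.zeros(3)
for _ in range(200):
    w_decay = step_weight_decay(w_decay)
    w_penalty = step_l2_penalty(w_penalty)
```

Both trajectories coincide up to rounding, since expanding the second update gives the first.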
See also: ridge, gradient descent, learning rate, shrinkage
Learn more: Linear regression regularization