Overview
Bayesian inference uses observed data $\mathcal{D}$ to learn about unknown parameter(s) $\theta$. For example, $\theta$ might be a coin's probability of heads, a Gaussian mean, or a vector of regression coefficients, while $\mathcal{D}$ is the dataset you actually observed.
The point of introducing a prior and a posterior is to keep track of uncertainty throughout that learning process. Before seeing data, we use a prior to describe which parameter values seem plausible. This can be seen as our initial beliefs or assumptions, or as a form of regularization. After seeing data, we update those beliefs to a posterior, which tells us which parameter values are plausible given the evidence we observed.
The central update rule is Bayes' rule, $p(\theta\mid\mathcal{D}) = p(\mathcal{D}\mid\theta)\,p(\theta)/p(\mathcal{D})$. It combines the prior $p(\theta)$ with the likelihood $p(\mathcal{D}\mid\theta)$, which measures how compatible the observed data is with each possible value of $\theta$. The denominator is the evidence (or marginal likelihood): the probability of the data averaged over the prior, which for continuous parameters is the density $p(\mathcal{D}) = \int p(\mathcal{D}\mid\theta)p(\theta)\,d\theta$. It is often hard to compute, and that difficulty motivates approximate inference.
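To make the roles of prior, likelihood, and evidence concrete, here is a minimal grid-approximation sketch for a coin's probability of heads. The flat prior, the data (7 heads in 10 flips), and the function name `grid_posterior` are assumptions chosen for illustration, not part of the original text.

```python
from math import comb

def grid_posterior(heads, flips, grid_size=1001):
    """Approximate Bayes' rule on an evenly spaced grid over [0, 1]."""
    step = 1.0 / (grid_size - 1)
    grid = [i * step for i in range(grid_size)]
    prior = [1.0 for _ in grid]  # assumed flat prior density p(theta) = 1
    likelihood = [
        comb(flips, heads) * t**heads * (1.0 - t) ** (flips - heads) for t in grid
    ]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    # Trapezoid rule for the evidence p(D) = integral of likelihood * prior
    evidence = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    posterior = [u / evidence for u in unnormalized]
    return grid, posterior, evidence

grid, posterior, evidence = grid_posterior(heads=7, flips=10)
step = grid[1] - grid[0]
posterior_mean = sum(t * d for t, d in zip(grid, posterior)) * step
print(f"evidence ~ {evidence:.4f}, posterior mean ~ {posterior_mean:.4f}")
```

For this particular setup the exact answers are known (evidence $1/11$, posterior mean $8/12$), so the grid result can be checked by hand. The same approach stops being practical once $\theta$ has more than a few dimensions, which is one way to see why the evidence integral is the bottleneck.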
Conjugacy
Conjugacy shows the core principle of Bayesian inference in its cleanest form. You start with a prior, update it with a likelihood after observing data, and end with a posterior. In general, carrying out that update can require a difficult integral for the evidence term $p(\mathcal{D})$, but conjugate models are special because the posterior is available in closed form: it can be written down explicitly with a formula instead of requiring numerical approximation.
A prior is conjugate to a likelihood when the posterior belongs to the same distribution family (e.g., Normal, Beta, Gamma) as the prior, just with updated parameter values. In practice, conjugacy appears when the prior is chosen to match the algebraic structure of the likelihood, and it is useful because it makes Bayesian updating both exact and easy to interpret.
Beta-Binomial
Suppose a coin has an unknown probability of heads $p$. You flip the coin $n$ times independently and let $X$ denote the number of observed heads. The Binomial distribution $\mathrm{Binomial}(n,p)$ describes that head count, while the Beta distribution $\mathrm{Beta}(\alpha,\beta)$ is a distribution on probabilities in $[0,1]$ with shape parameters $\alpha$ and $\beta$.
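If we use a Beta prior for the unknown probability $p$, then the Binomial likelihood and the Beta prior form a conjugate pair: with $p\sim\mathrm{Beta}(\alpha,\beta)$ and $X=k$ observed heads in $n$ flips, the posterior is $p\mid X=k\sim\mathrm{Beta}(\alpha+k,\;\beta+n-k)$.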
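As a minimal sketch of that update in Python: the standard Beta-Binomial result adds the observed heads to $\alpha$ and the observed tails to $\beta$. The function names, the Beta(2, 2) prior, and the data (7 heads in 10 flips) are made-up values for illustration.

```python
def beta_binomial_update(alpha, beta, heads, flips):
    """Conjugate update: Beta(alpha, beta) prior plus `heads` successes in `flips` trials."""
    if not 0 <= heads <= flips:
        raise ValueError("heads must be between 0 and flips")
    return alpha + heads, beta + (flips - heads)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical example: Beta(2, 2) prior, then 7 heads in 10 flips.
post_alpha, post_beta = beta_binomial_update(2.0, 2.0, heads=7, flips=10)
print(post_alpha, post_beta, beta_mean(post_alpha, post_beta))
```

The posterior mean lands between the prior mean (0.5) and the observed frequency (0.7), which is the usual way to read the prior parameters as "pseudo-counts".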
Normal-Normal (known variance)
Now suppose the data values $x_1,\dots,x_n$ come from a Normal (Gaussian) distribution with unknown mean $\mu$ and known variance $\sigma^2$. Here $\sigma^2$ describes the variance in the data-generating process, or how noisy the observations are around the true mean $\mu$. We place a Normal prior on the unknown mean: $\mu$ itself is assumed to follow a Normal distribution with prior mean $\mu_0$ and prior variance $\tau^2$.
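Because the likelihood is Normal in $\mu$ and the prior is also Normal in $\mu$, this is another conjugate pair: $x_i\mid\mu\sim\mathcal{N}(\mu,\sigma^2)$ for $i=1,\dots,n$, with $\mu\sim\mathcal{N}(\mu_0,\tau^2)$.
The posterior for the unknown mean is also Normal: $\mu\mid\mathcal{D}\sim\mathcal{N}(\mu_n,\tau_n^2)$. Here $\mu_n$ is the updated posterior mean, $\tau_n^2$ is the updated posterior variance, and $\bar x$ denotes the sample mean of the observed values: $\tau_n^2 = \left(\frac{1}{\tau^2}+\frac{n}{\sigma^2}\right)^{-1}$ and $\mu_n = \tau_n^2\left(\frac{\mu_0}{\tau^2}+\frac{n\bar x}{\sigma^2}\right)$.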
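The Normal-Normal update is easiest to read in terms of precisions (inverse variances): the posterior precision is the prior precision plus $n$ times the data precision, and the posterior mean is a precision-weighted average of $\mu_0$ and $\bar x$. The sketch below assumes that standard result. The function name, the data values, and the hyperparameters are invented for illustration.

```python
def normal_normal_update(data, sigma2, mu0, tau2):
    """Posterior mean and variance of mu for Normal data with known variance sigma2."""
    n = len(data)
    xbar = sum(data) / n
    post_precision = 1.0 / tau2 + n / sigma2  # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (mu0 / tau2 + n * xbar / sigma2)  # precision-weighted average
    return post_mean, post_var

# Hypothetical data and hyperparameters.
data = [4.8, 5.1, 5.6, 4.9, 5.2]
post_mean, post_var = normal_normal_update(data, sigma2=1.0, mu0=0.0, tau2=4.0)
print(post_mean, post_var)
```

With these numbers the posterior mean is pulled slightly from the sample mean toward $\mu_0 = 0$, and the posterior variance is smaller than both the prior variance and $\sigma^2/n$.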
Bayesian linear regression (Gaussian)
The same conjugate idea extends to linear regression. If the response vector $y$ is modeled with Gaussian noise around the linear predictor $Xw$, and the coefficient vector $w$ has a Gaussian prior, then the posterior over $w$ is also Gaussian.
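A small sketch can make this concrete for a single feature plus an intercept. It assumes a zero-mean isotropic prior $w\sim\mathcal{N}(0,\tau^2 I)$ and a known noise variance $\sigma^2$, neither of which the text above specifies. Under those assumptions the posterior precision is $I/\tau^2 + X^\top X/\sigma^2$ and the posterior mean is the posterior covariance times $X^\top y/\sigma^2$. The function name and data are hypothetical, and the 2x2 inverse is written out by hand to avoid dependencies.

```python
def blr_posterior_1d(xs, ys, sigma2, tau2):
    """Gaussian posterior over w = (intercept, slope) for y = w0 + w1 * x + noise."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = I / tau2 + X^T X / sigma2, with design rows [1, x]
    a11 = 1.0 / tau2 + n / sigma2
    a12 = sx / sigma2
    a22 = 1.0 / tau2 + sxx / sigma2
    det = a11 * a22 - a12 * a12
    # Posterior covariance is the inverse of the 2x2 precision matrix
    s11, s12, s22 = a22 / det, -a12 / det, a11 / det
    # Posterior mean = covariance @ (X^T y / sigma2)
    b1, b2 = sy / sigma2, sxy / sigma2
    mean = (s11 * b1 + s12 * b2, s12 * b1 + s22 * b2)
    cov = ((s11, s12), (s12, s22))
    return mean, cov

# Hypothetical data, roughly y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
mean, cov = blr_posterior_1d(xs, ys, sigma2=0.25, tau2=100.0)
print("posterior mean:", mean)
print("posterior covariance:", cov)
```

With a prior this weak the posterior mean sits very close to the ordinary least-squares fit. Shrinking `tau2` pulls both coefficients toward zero.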
Conjugate updating
Paste a dataset, choose a model, and see how Bayesian updating changes both your belief about the unknown parameter and your prediction for a new observation. The first plot focuses on uncertainty about the parameter itself; the second plot shows what the updated model predicts for new data.
The prior shows what parameter values were plausible before seeing the dataset, the likelihood shows which parameter values best explain the observed data, and the posterior shows the updated belief after combining both. The shaded interval on the posterior plot is a 95% credible interval, which contains parameter values that together have 95% of the posterior probability mass.
This plot summarizes predictions for new data after averaging over parameter uncertainty rather than plugging in a single estimate.
Posterior predictive distribution
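Bayesian prediction integrates over uncertainty in parameters instead of plugging in a single estimate: $p(\tilde x\mid\mathcal{D}) = \int p(\tilde x\mid\theta)\,p(\theta\mid\mathcal{D})\,d\theta$, where $\tilde x$ denotes a new observation.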
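For the coin example, averaging over the posterior has a simple closed form: under a Beta posterior, the probability that the next flip is heads equals the posterior mean. The sketch below compares that value with a Monte Carlo version that follows the integral literally (draw $p$ from the posterior, then simulate one flip). The function names, prior, and data are assumptions for illustration.

```python
import random

def predictive_prob_heads(alpha, beta, heads, flips):
    """Exact P(next flip is heads | data) under a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + flips)

def predictive_prob_heads_mc(alpha, beta, heads, flips, draws=100_000, seed=0):
    """Monte Carlo estimate: sample p from the posterior, then simulate one flip."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        p = rng.betavariate(alpha + heads, beta + flips - heads)
        hits += rng.random() < p
    return hits / draws

# Hypothetical example: Beta(2, 2) prior, 7 heads in 10 flips.
exact = predictive_prob_heads(2.0, 2.0, heads=7, flips=10)
estimate = predictive_prob_heads_mc(2.0, 2.0, heads=7, flips=10)
print(exact, estimate)
```

The Monte Carlo route is unnecessary here, but it carries over unchanged to models where the predictive integral has no closed form.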
Credible intervals vs confidence intervals
These are often confused. The difference is about what is random and what the statement conditions on. A confidence interval is a statement about a procedure over repeated samples; a credible interval is a statement about parameters conditional on observed data (given a prior and likelihood).
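One way to see the difference numerically is to compute both intervals for a Normal mean with known variance. The sketch assumes the Normal-Normal model from earlier with a Normal prior on $\mu$. The helper names, data, and hyperparameters are invented for illustration.

```python
from statistics import NormalDist

def credible_interval(data, sigma2, mu0, tau2, level=0.95):
    """Equal-tailed credible interval for mu from the Normal posterior."""
    n = len(data)
    xbar = sum(data) / n
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + n * xbar / sigma2)
    posterior = NormalDist(post_mean, post_var**0.5)
    tail = (1.0 - level) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)

def confidence_interval(data, sigma2, level=0.95):
    """Classical z-interval for mu with known variance; no prior involved."""
    n = len(data)
    xbar = sum(data) / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half_width = z * (sigma2 / n) ** 0.5
    return xbar - half_width, xbar + half_width

data = [4.8, 5.1, 5.6, 4.9, 5.2]  # hypothetical observations
cred = credible_interval(data, sigma2=1.0, mu0=0.0, tau2=4.0)
conf = confidence_interval(data, sigma2=1.0)
print("credible:", cred)
print("confidence:", conf)
```

In this model the two intervals coincide numerically as the prior becomes very diffuse (large `tau2`), which is part of why they are easy to confuse. Their interpretations still differ: the credible interval conditions on the observed data, while the confidence interval describes the procedure's coverage over repeated samples.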
Maximum a posteriori (MAP) estimation
MAP is relevant because it keeps the Bayesian idea of combining prior information with data, but replaces the full posterior with a single estimate.
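MAP chooses the parameter value with highest posterior density. In other words, instead of describing all plausible values of $\theta$ after observing $\mathcal{D}$, it asks which single value of $\theta$ is most plausible under the posterior. Formally, $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta p(\theta\mid\mathcal{D}) = \arg\max_\theta p(\mathcal{D}\mid\theta)\,p(\theta)$, since the evidence $p(\mathcal{D})$ does not depend on $\theta$.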
This formula motivates MAP as an optimization problem. Because the log function preserves maximizers, MAP can be found by maximizing log-likelihood plus log-prior, or equivalently by minimizing a data-fit term plus a penalty induced by the prior. That does not solve Bayesian inference in full, but it often gives a practical Bayesian-informed estimate even when exact conjugate updating is unavailable or unnecessary.
This is why priors often behave like regularizers: different prior choices reshape the objective and pull the estimate toward different kinds of solutions.
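Written out, the optimization view and two familiar special cases look as follows. The Gaussian and Laplace priors are standard textbook examples added here for illustration, with $\tau^2$ and $b$ denoting their assumed scale parameters.

```latex
\begin{align*}
\hat\theta_{\mathrm{MAP}}
  &= \arg\max_\theta \big[\log p(\mathcal{D}\mid\theta) + \log p(\theta)\big] \\
  &= \arg\min_\theta \big[\underbrace{-\log p(\mathcal{D}\mid\theta)}_{\text{data fit}}
     \;+\; \underbrace{-\log p(\theta)}_{\text{penalty}}\big] \\[4pt]
\theta \sim \mathcal{N}(0, \tau^2 I):\quad
  -\log p(\theta) &= \frac{1}{2\tau^2}\,\lVert\theta\rVert_2^2 + \text{const}
  \quad \text{(an } L_2 \text{ / ridge penalty)} \\
\theta_j \sim \mathrm{Laplace}(0, b):\quad
  -\log p(\theta) &= \frac{1}{b}\,\lVert\theta\rVert_1 + \text{const}
  \quad \text{(an } L_1 \text{ / lasso penalty)}
\end{align*}
```

A tighter prior (smaller $\tau^2$ or $b$) corresponds to a larger penalty weight and therefore to stronger shrinkage.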
MAP vs MLE (Normal mean)
To see that optimization view concretely, suppose $x_i\sim\mathcal{N}(\mu,\sigma^2)$ with a Gaussian prior $\mu\sim\mathcal{N}(\mu_0,\tau^2)$. The MLE is $\hat\mu_{\mathrm{MLE}}=\bar x$, while the MAP estimate adds a prior penalty and therefore balances fit to the data against agreement with the prior.
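The following sketch writes the objective as a data-fit term plus a prior penalty and checks that minimizing it numerically agrees with the closed-form MAP estimate for this model. The function names, data, hyperparameters, and the simple ternary-search minimizer are illustrative choices, not something the text prescribes.

```python
def neg_log_posterior(mu, data, sigma2, mu0, tau2):
    """Negative log posterior for mu, dropping terms that do not depend on mu."""
    data_fit = sum((x - mu) ** 2 for x in data) / (2.0 * sigma2)
    prior_penalty = (mu - mu0) ** 2 / (2.0 * tau2)
    return data_fit + prior_penalty

def map_closed_form(data, sigma2, mu0, tau2):
    """Minimizer of the quadratic objective above (precision-weighted average)."""
    n = len(data)
    xbar = sum(data) / n
    return (n * xbar / sigma2 + mu0 / tau2) / (n / sigma2 + 1.0 / tau2)

def minimize_convex_1d(f, lo, hi, iters=100):
    """Ternary search; valid because the objective is convex in mu."""
    for _ in range(iters):
        third = (hi - lo) / 3.0
        m1, m2 = lo + third, hi - third
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

data = [4.8, 5.1, 5.6, 4.9, 5.2]  # hypothetical observations
sigma2, mu0, tau2 = 1.0, 0.0, 4.0  # hypothetical known variance and prior
mle = sum(data) / len(data)
map_exact = map_closed_form(data, sigma2, mu0, tau2)
map_numeric = minimize_convex_1d(
    lambda mu: neg_log_posterior(mu, data, sigma2, mu0, tau2), -10.0, 10.0
)
print(mle, map_exact, map_numeric)
```

Because the posterior here is Normal, its mode and mean coincide, so the MAP estimate equals the posterior mean $\mu_n$ from the conjugate update. Letting $\tau^2\to\infty$ removes the penalty and recovers the MLE.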
The blue curve is the data-fit term, the white curve is the prior penalty, and the green curve is their sum, whose minimum gives the MAP estimate.
Approximate inference
When conjugacy is unavailable, we usually cannot write down the posterior in a clean closed form. In that case we turn to approximate inference: methods that either sample from the posterior indirectly or fit a simpler distribution to it.
MCMC (sampling)
Markov Chain Monte Carlo builds a Markov chain, meaning a sequence of parameter values where the next value depends only on the current one.
The transition rule is designed so that the posterior is the chain's stationary distribution: if the chain were already distributed according to the posterior, one more update step would leave that distribution unchanged. After enough steps, the samples produced by the chain behave approximately like posterior draws.
In practice, the chain is built by repeatedly proposing a move and then accepting or rejecting it in a way that favors regions of high posterior density while still exploring the full space. The appeal is that MCMC targets the true posterior rather than a simplified surrogate, but the tradeoff is computational cost: it can mix slowly, require diagnostics, and be hard to scale in high-dimensional models.
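The propose-then-accept loop can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC algorithms. The target below is the coin posterior under an assumed Beta(2, 2) prior with 7 heads in 10 flips, chosen because the exact answer is known for comparison. The function names, step size, chain length, and burn-in are all illustrative assumptions.

```python
import math
import random

def log_posterior(p, heads=7, flips=10, alpha=2.0, beta=2.0):
    """Unnormalized log posterior for a coin's heads probability (assumed Beta prior)."""
    if not 0.0 < p < 1.0:
        return -math.inf  # zero density outside (0, 1)
    return (alpha + heads - 1.0) * math.log(p) + (
        beta + flips - heads - 1.0
    ) * math.log(1.0 - p)

def metropolis(log_target, start, steps, step_size, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current));
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected move repeats the current value
    return samples

samples = metropolis(log_posterior, start=0.5, steps=50_000, step_size=0.2)
kept = samples[5_000:]  # discard an assumed burn-in period
mcmc_mean = sum(kept) / len(kept)
print(f"MCMC posterior mean ~ {mcmc_mean:.3f}")
```

Only the unnormalized posterior appears in the acceptance ratio, so the evidence $p(\mathcal{D})$ is never computed. The exact posterior here is Beta(9, 5) with mean $9/14$, which gives a reference point for judging the chain.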
Variational inference (optimization)
Variational inference takes a different route. Instead of sampling, it chooses a tractable family of distributions $q(\theta)$ and looks for the member of that family that is closest to the true posterior. The usual measure of closeness is the KL divergence $\mathrm{KL}(q\,\|\,p) = \mathbb{E}_{q}[\log q(\theta) - \log p(\theta\mid\mathcal{D})]$, which is nonnegative and equals zero only when $q(\theta)$ matches the posterior.
Because the posterior contains the hard-to-compute evidence term, VI usually works with the evidence lower bound (ELBO) instead. Maximizing the ELBO is equivalent to minimizing the KL divergence from $q(\theta)$ to the posterior, so optimization pushes $q(\theta)$ toward the best approximation available within the chosen family. If that family is flexible enough, the approximation can be very good; if it is restrictive, VI converges to the best approximation in that family rather than to the exact posterior.
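The equivalence follows from a standard identity, sketched here in the notation used above:

```latex
\begin{align*}
\mathrm{ELBO}(q)
  &= \mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D}\mid\theta) + \log p(\theta) - \log q(\theta)\big] \\
\log p(\mathcal{D})
  &= \mathrm{ELBO}(q) + \mathrm{KL}\big(q(\theta)\,\|\,p(\theta\mid\mathcal{D})\big)
\end{align*}
```

Since $\log p(\mathcal{D})$ does not depend on $q$ and the KL term is nonnegative, the ELBO is a lower bound on the log evidence, and raising it necessarily lowers the KL divergence by the same amount. The ELBO itself only needs the likelihood, the prior, and $q$, so the intractable evidence never has to be evaluated.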
Visualizing variational inference (VI)
To make VI concrete, we'll intentionally use a variational family that does not match the true conjugate posterior. For coin flips with a Beta prior, the true posterior is Beta. We'll approximate it with a logistic-normal distribution: sample $z\sim\mathcal{N}(m,s^2)$ and map $p=\sigma(z)$.
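A rough sketch of fitting such a logistic-normal $q$ is given below. It works in $z$-space, where the log joint is $a\log\sigma(z) + b\log(1-\sigma(z))$ up to a constant, with $a=\alpha+\text{heads}$ and $b=\beta+\text{tails}$; the change of variables $p=\sigma(z)$ contributes the extra $\log p + \log(1-p)$. Everything else is an assumption made for illustration: the Beta(2, 2) prior with 7 heads in 10 flips, the fixed-quantile quadrature used in place of random draws, the plain gradient-ascent settings, and all function names.

```python
import math
from statistics import NormalDist

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fixed standard-normal quantiles: a crude deterministic quadrature for E_q[.]
NODES = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]

def elbo(m, s, a, b):
    """ELBO up to an additive constant for q(z) = N(m, s^2)."""
    total = 0.0
    for eps in NODES:
        p = sigmoid(m + s * eps)  # reparameterization: z = m + s * eps
        total += a * math.log(p) + b * math.log(1.0 - p)
    return total / len(NODES) + math.log(s)  # log(s) is the entropy of q up to a constant

def fit_logistic_normal(a, b, steps=2000, lr=0.05):
    """Gradient ascent on the ELBO over (m, log s)."""
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(log_s)
        grad_m = grad_log_s = 0.0
        for eps in NODES:
            g = a - (a + b) * sigmoid(m + s * eps)  # derivative of the log joint in z
            grad_m += g
            grad_log_s += g * s * eps
        m += lr * grad_m / len(NODES)
        log_s += lr * (grad_log_s / len(NODES) + 1.0)  # +1 comes from d log(s) / d log(s)
    return m, math.exp(log_s)

# Hypothetical example: Beta(2, 2) prior and 7 heads in 10 flips give a = 9, b = 5.
a, b = 9.0, 5.0
m, s = fit_logistic_normal(a, b)
q_mean_p = sum(sigmoid(m + s * eps) for eps in NODES) / len(NODES)
print(f"m ~ {m:.3f}, s ~ {s:.3f}, E_q[p] ~ {q_mean_p:.3f}")
```

The fitted $q$ is not a Beta distribution, so it cannot match the exact posterior everywhere. What the optimization finds is the closest logistic-normal in the KL sense.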
This plot compares the prior, the exact posterior, and the variational approximation at several stages of optimization, so you can see both how the data updated the prior and how $q$ changes from its starting shape to its final fitted form.
Concluding remarks
Bayesian inference is a framework for reasoning under uncertainty. You start with a prior, update it with a likelihood after observing data, and obtain a posterior that describes which parameter values are plausible after seeing the evidence. The main advantage is that uncertainty remains part of the answer rather than being discarded after fitting a single estimate.
Conjugacy shows this updating process in its cleanest form. In conjugate models, the posterior stays in the same family as the prior, so the update can be written down exactly and interpreted directly. That makes conjugacy a useful teaching tool and a practical computational shortcut, but only for a limited class of model-prior combinations.
MLE and MAP enter when we want point estimates instead of a full posterior. MLE uses only the likelihood and chooses the parameter that best fits the observed data. MAP adds the prior and chooses the mode of the posterior, so it acts like a regularized version of likelihood-based estimation. These approaches are often simpler and faster than full Bayesian inference, but the cost is that they collapse posterior uncertainty into a single fitted value.
Approximate inference becomes necessary when the posterior cannot be written in a convenient closed form or when exact inference is too expensive. MCMC aims to approximate the full posterior by sampling from it, which preserves uncertainty well but can be computationally heavy. Variational inference instead fits a simpler distribution by optimization, which is often much faster and more scalable, but only gives the best approximation within the chosen variational family.