Overview

Estimation is about using data $\mathcal{D}$ to infer something unknown: a parameter $\theta$, a function $f$, or a prediction rule. Many of the most common ideas in statistics and machine learning can be viewed through a single lens: choose an estimator that performs well under a loss (such as squared error, which leads to MSE) or under a probabilistic model (via the likelihood).

In point estimation, the unknown object is finite-dimensional (a scalar or vector $\theta\in\mathbb{R}^p$). In function estimation, the unknown object is typically infinite-dimensional (a whole curve/surface).

$$\text{Point:}\quad X_1,\dots,X_n\sim P_\theta\;\Rightarrow\;\hat\theta\approx\theta$$
$$\text{Function:}\quad Y = f(X)+\varepsilon\;\Rightarrow\;\hat f(x)\approx f(x)$$

An estimator is a function of the data, and because the data are drawn from a random process, any function of the data is itself a random variable. In the notation below, $T$ and $A$ denote the estimation procedures (e.g., formulas, algorithms) that we design to produce good estimates.

$$\text{Point estimation:}\quad \hat\theta = T(\mathcal{D})$$
$$\text{Function estimation:}\quad \hat f = A(\mathcal{D}),\; \hat f: \mathcal{X}\to\mathbb{R}$$

Examples: (1) estimating a mean $\theta=\mu$; (2) estimating a regression function $f(x)=\mathbb{E}[Y\mid X=x]$.

Bias and Variance

Treat an estimator as a random variable (because it depends on random data). Two core quantities are:

$$\text{Bias:}\quad \mathrm{Bias}(\hat\theta) = \mathbb{E}[\hat\theta]-\theta$$
$$\text{Variance:}\quad \mathrm{Var}(\hat\theta) = \mathbb{E}\big[(\hat\theta-\mathbb{E}[\hat\theta])^2\big]$$

An estimator is considered unbiased if $\mathrm{Bias}(\hat\theta)=0$, i.e., $\mathbb{E}[\hat\theta]=\theta$. Unbiasedness is a nice property, but it is not the only thing we care about; we also want low variance.
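To make bias and variance concrete, the sketch below approximates both by Monte Carlo for two estimators of a variance: one that divides by $n$ and one that divides by $n-1$. The Normal data, the sample size, the trial count, and the helper name `simulate_estimator` are illustrative assumptions, not part of the definitions above.

```python
import random
import statistics

def simulate_estimator(estimator, sampler, n, trials, seed=0):
    """Approximate E[estimator] and Var(estimator) over repeated samples of size n."""
    rng = random.Random(seed)
    estimates = [estimator([sampler(rng) for _ in range(n)]) for _ in range(trials)]
    return statistics.fmean(estimates), statistics.pvariance(estimates)

def var_div_n(xs):
    """Variance estimator that divides by n (biased downward)."""
    m = statistics.fmean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_div_n_minus_1(xs):
    """Variance estimator that divides by n - 1 (unbiased)."""
    return statistics.variance(xs)

# Assumed setup: X_i ~ Normal(0, 2^2), so the true variance is 4.
TRUE_VAR = 4.0
N = 5

def draw(rng):
    return rng.gauss(0.0, 2.0)

results = {}
for name, est in [("divide by n", var_div_n), ("divide by n-1", var_div_n_minus_1)]:
    mean_est, var_est = simulate_estimator(est, draw, N, trials=20000)
    results[name] = (mean_est - TRUE_VAR, var_est)  # (bias, variance)
    print(f"{name}: bias ~ {results[name][0]:+.3f}, variance ~ {results[name][1]:.3f}")
```

In theory the divide-by-$n$ estimator has bias $-\sigma^2/n$ (here $-0.8$) while the divide-by-$(n-1)$ estimator has bias $0$. The biased version also has the smaller variance, which is one reason unbiasedness alone does not settle which estimator is preferable.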

For function estimation, we can define bias and variance at each input $x$:

$$\mathrm{Bias}(\hat f(x)) = \mathbb{E}[\hat f(x)]-f(x)$$
$$\mathrm{Var}(\hat f(x)) = \mathbb{E}\big[(\hat f(x)-\mathbb{E}[\hat f(x)])^2\big]$$

Determining an unbiased estimator

Let $X_1,\dots,X_n$ be i.i.d. Bernoulli($p$), i.e., $\mathbb{P}(X_i=1)=p$ and $\mathbb{P}(X_i=0)=1-p$. A natural estimator of $p$ is the sample mean:

$$\hat p = \bar X = \frac{1}{n}\sum_{i=1}^n X_i$$

First compute the expectation of a Bernoulli random variable:

$$\mathbb{E}[X_i] = 0\cdot\mathbb{P}(X_i=0) + 1\cdot\mathbb{P}(X_i=1) = p$$

Then apply linearity of expectation (independence is not required for this step):

$$\mathbb{E}[\hat p] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n}\sum_{i=1}^n p = p$$

Therefore $\mathbb{E}[\hat p]=p$, so $\hat p$ is an unbiased estimator of $p$.
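The same conclusion can be checked numerically. The sketch below computes $\mathbb{E}[\hat p]$ exactly by summing over the distribution of the number of successes, which is Binomial($n$, $p$); the function name `expected_p_hat` and the specific values of $n$ and $p$ are assumptions chosen for illustration.

```python
from math import comb

def expected_p_hat(n, p):
    """E[p_hat], where p_hat = k/n and k ~ Binomial(n, p) counts the successes."""
    return sum((k / n) * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))

for n, p in [(5, 0.3), (10, 0.3), (50, 0.9)]:
    print(n, p, expected_p_hat(n, p))  # matches p up to floating-point rounding
```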

Standard error

The standard error is the standard deviation of the estimator:

$$\text{Standard error:}\quad \mathrm{SE}(\hat\theta)=\sqrt{\mathrm{Var}(\hat\theta)}$$

In practice, we rarely know $\mathrm{Var}(\hat\theta)$ exactly, so we often estimate the standard error using the data we have sampled.

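For the Bernoulli example, we can compute the variance of $\hat p$ using three facts: $\mathrm{Var}(aX)=a^2\,\mathrm{Var}(X)$ for a constant $a$, the variance of a sum of independent random variables is the sum of their variances, and $\mathrm{Var}(X_i)=p(1-p)$ for a Bernoulli($p$) variable: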

$$\mathrm{Var}(\hat p)=\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{1}{n^2}\sum_{i=1}^n p(1-p) = \frac{p(1-p)}{n}$$
$$\mathrm{SE}(\hat p)=\sqrt{\frac{p(1-p)}{n}}\;\approx\;\sqrt{\frac{\hat p(1-\hat p)}{n}}$$

The standard error quantifies the sampling variability of the estimator, meaning how much the estimate varies across repeated samplings of the data. By the central limit theorem, the sampling distribution of a sample mean such as $\hat p$ is approximately Normal for large $n$. The standard error can then be used to construct an approximate confidence interval centered on the estimate, for example $\hat p \pm 1.96\,\mathrm{SE}(\hat p)$ for roughly 95% coverage.
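As a minimal sketch of the plug-in standard error for the Bernoulli example, the code below computes $\hat p$, its estimated standard error, and a Normal-approximation interval. The function name `bernoulli_se_and_ci`, the multiplier `z=1.96` (roughly 95% under the Normal approximation), and the toy data are assumptions for illustration; this interval can be inaccurate when $n$ is small or $p$ is near 0 or 1.

```python
import math

def bernoulli_se_and_ci(xs, z=1.96):
    """Return (p_hat, estimated SE, Normal-approximation interval) for 0/1 data."""
    n = len(xs)
    p_hat = sum(xs) / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # plug p_hat in for the unknown p
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Toy data: 30 successes in 100 trials.
xs = [1] * 30 + [0] * 70
p_hat, se, ci = bernoulli_se_and_ci(xs)
print(f"p_hat = {p_hat:.3f}, SE = {se:.4f}, interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```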

Bootstrap confidence interval

The initial sample of data is just one of many possible samples we could have drawn from the data-generating process, so it gives us only one realization of the estimator $\hat\theta$. The bootstrap is one way to gauge the uncertainty of our estimate: draw many resamples of size $n$ from the observed dataset (with replacement) and compute $\hat\theta^*$ on each resample.

  • If $\theta = \mu$, then on each resample, $\hat\theta^*=\hat\mu^*=\frac{1}{n}\sum_{i=1}^n X_i^*$

This allows us to empirically approximate the sampling distribution of $\hat\theta$ and estimate its standard error and confidence intervals. For example, the standard deviation of the $\hat\theta^*$ values estimates the standard error, and their $\alpha/2$ and $1-\alpha/2$ quantiles give a $1-\alpha$ percentile interval. This is especially useful when an analytical standard error is unavailable, inconvenient, or relies on assumptions we do not want to make. The interval has a frequentist interpretation: under repeated sampling from the data-generating process, an interval constructed this way contains the true parameter $\theta$ with probability approximately $1-\alpha$. It is not, in general, a statement that the parameter $\theta$ itself is random or that there is a $1-\alpha$ probability it lies in this particular realized interval.

Note: The basic bootstrap assumes observations are i.i.d. (or at least exchangeable). For dependent data (e.g., time series), we typically need variants such as the block bootstrap.
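The resampling procedure described above can be sketched as follows. This is one simple variant (the percentile interval), not the only way to form a bootstrap interval; the function name `bootstrap`, the number of resamples, the seeds, and the toy Normal data are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Return (bootstrap SE, percentile interval) for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n points with replacement and recompute the statistic each time.
    boot = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_resamples))
    se = statistics.stdev(boot)
    # Simple order-statistic approximation to the alpha/2 and 1 - alpha/2 quantiles.
    lo = boot[round((alpha / 2) * (n_resamples - 1))]
    hi = boot[round((1 - alpha / 2) * (n_resamples - 1))]
    return se, (lo, hi)

# Assumed toy data: 50 i.i.d. draws from Normal(10, 2^2).
data_rng = random.Random(1)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

se, (lo, hi) = bootstrap(data, statistics.fmean)
print(f"mean = {statistics.fmean(data):.3f}, bootstrap SE = {se:.3f}, "
      f"95% percentile interval = ({lo:.3f}, {hi:.3f})")
```

Because `statistic` is passed in as a function, the same sketch applies to a median or any other statistic of an i.i.d. sample by swapping `statistics.fmean` for another callable.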

Fig. 1 · Bootstrap distribution of the statistic. Note that the standard error and confidence interval tend to tighten as $n$ grows.

Mean squared error (MSE) and minimizing it

A common way to evaluate an estimator is by its mean squared error. For a parameter estimator, the MSE decomposes into bias and variance.

$$\text{MSE:}\quad \mathrm{MSE}(\hat\theta)=\mathbb{E}\big[(\hat\theta-\theta)^2\big]$$
$$\text{Decomposition:}\quad \mathrm{MSE}(\hat\theta)=\mathrm{Bias}(\hat\theta)^2+\mathrm{Var}(\hat\theta)$$

To prove this, add and subtract $\mathbb{E}[\hat\theta]$ inside the definition, expand the square, and simplify using the definition of variance and the fact that $\mathbb{E}[\hat\theta-\mathbb{E}[\hat\theta]]=0$ (the expected deviation from the mean is zero):

$$\begin{aligned} \mathbb{E}\big[(\hat\theta-\theta)^2\big] &= \mathbb{E}\big[(\hat\theta-\mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta]-\theta)^2\big]\\ &= \mathbb{E}\big[(\hat\theta-\mathbb{E}[\hat\theta])^2\big] + 2\,\mathbb{E}\big[(\hat\theta-\mathbb{E}[\hat\theta])(\underbrace{\mathbb{E}[\hat\theta]-\theta}_{\text{a constant}})\big] + \mathbb{E}\big[(\underbrace{\mathbb{E}[\hat\theta]-\theta}_{\text{a constant}})^2\big]\\ &= \mathrm{Var}(\hat\theta) + 2\,(\mathbb{E}[\hat\theta]-\theta)\,\underbrace{\mathbb{E}[\hat\theta-\mathbb{E}[\hat\theta]]}_{=\,0} + (\mathbb{E}[\hat\theta]-\theta)^2\\ &= \mathrm{Var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2. \end{aligned}$$

Minimizing MSE often means accepting a little bias to reduce variance. This is the intuition behind shrinkage methods and regularization (e.g., ridge regression), and it underlies the bias-variance tradeoff in ML.
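A small simulation can illustrate both the decomposition and the tradeoff. The sketch below compares the sample mean with a shrunken version $c\,\bar X$; the shrinkage factor `SHRINK = 0.5`, the Normal data, and the parameter values are arbitrary assumptions picked so that the noise is large relative to the mean. With other choices, shrinkage can increase the MSE instead.

```python
import random
import statistics

def mse_parts(estimates, theta):
    """Return (bias, variance, MSE) of a list of simulated estimates of theta."""
    bias = statistics.fmean(estimates) - theta
    var = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return bias, var, mse

# Assumed setup: X_i ~ Normal(MU, SIGMA^2), samples of size N.
MU, SIGMA, N, SHRINK = 1.0, 3.0, 5, 0.5

rng = random.Random(0)
sample_means = [
    statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)]) for _ in range(20000)
]
shrunk_means = [SHRINK * m for m in sample_means]

plain = mse_parts(sample_means, MU)
shrunk = mse_parts(shrunk_means, MU)
for name, (bias, var, mse) in [("sample mean", plain), ("shrunken mean", shrunk)]:
    print(f"{name}: bias^2 = {bias**2:.3f}, var = {var:.3f}, "
          f"bias^2 + var = {bias**2 + var:.3f}, MSE = {mse:.3f}")
```

Under these assumptions the theoretical values are $\mathrm{MSE}(\bar X)=\sigma^2/n=1.8$ and $\mathrm{MSE}(0.5\,\bar X)=0.25+0.45=0.7$: the shrunken estimator is biased but has lower MSE.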

Function estimation: bias-variance and prediction error

For regression, we often care about prediction error at an input $x$. A classic decomposition (under squared loss) is:

$$\mathbb{E}\big[(\hat f(x)-Y)^2\mid X=x\big] =\underbrace{\big(\mathbb{E}[\hat f(x)]-f(x)\big)^2}_{\text{bias}^2} +\underbrace{\mathrm{Var}(\hat f(x))}_{\text{variance}} +\underbrace{\mathrm{Var}(\varepsilon\mid X=x)}_{\text{irreducible noise}}$$

The irreducible noise term cannot be improved by changing the estimator; it is the part of $Y$ that is unpredictable from $X$. Only bias and variance are under our control.
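The three terms can be approximated by simulation. The sketch below uses a $k$-nearest-neighbour average as a stand-in estimator $\hat f$ and compares the sum of the three terms with the prediction error on a fresh $Y$ at a fixed input `x0`. The choice of estimator, the true function $\sin$, the uniform design on $[0,\pi]$, and all numeric settings are assumptions for illustration only.

```python
import math
import random
import statistics

def knn_predict(train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x0))[:k]
    return statistics.fmean([y for _, y in nearest])

def decompose(f, sigma, x0, n=50, k=5, trials=4000, seed=0):
    """Monte Carlo estimates of (bias^2, variance, noise, prediction error) at x0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(trials):
        # Draw a fresh training set: Y = f(X) + eps, eps ~ Normal(0, sigma^2).
        train = []
        for _ in range(n):
            x = rng.uniform(0.0, math.pi)
            train.append((x, f(x) + rng.gauss(0.0, sigma)))
        pred = knn_predict(train, x0, k)
        y_new = f(x0) + rng.gauss(0.0, sigma)  # fresh response at x0
        preds.append(pred)
        sq_errors.append((pred - y_new) ** 2)
    bias_sq = (statistics.fmean(preds) - f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance, sigma ** 2, statistics.fmean(sq_errors)

bias_sq, variance, noise, pred_err = decompose(math.sin, sigma=0.5, x0=1.0)
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, noise = {noise:.4f}")
print(f"sum = {bias_sq + variance + noise:.4f}, prediction error = {pred_err:.4f}")
```

Changing `k` shifts error between the bias and variance terms, while the noise term stays fixed at `sigma ** 2`.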

Consistency

An estimator is consistent if it converges to the truth as the sample size grows. The most common notion is convergence in probability.

$$\hat\theta_n\xrightarrow{p}\theta\quad\text{as }n\to\infty$$

Consistency is an asymptotic property; it doesn't say the estimator is good at small $n$. In practice, two methods can both be consistent, but one can dominate the other at realistic sample sizes.
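Convergence in probability means that $\mathbb{P}(|\hat\theta_n-\theta|>\epsilon)\to 0$ for every fixed $\epsilon>0$. The sketch below estimates this probability for the sample mean of Uniform(0, 1) data at a few sample sizes; the distribution, the tolerance `eps`, the trial count, and the function name `miss_rate` are assumptions for illustration.

```python
import random

def miss_rate(n, eps=0.05, trials=2000, seed=0):
    """Estimate P(|mean_n - mu| > eps) for Uniform(0, 1) data, where mu = 0.5."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        mean_n = sum(rng.random() for _ in range(n)) / n
        if abs(mean_n - 0.5) > eps:
            misses += 1
    return misses / trials

rates = {n: miss_rate(n) for n in (10, 100, 1000)}
for n, rate in rates.items():
    print(f"n = {n:5d}: estimated P(|mean - 0.5| > 0.05) = {rate:.3f}")
```

The estimated probability shrinks toward zero as $n$ grows, but at $n=10$ it is still large, which matches the point that consistency says little about small samples.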

Maximum likelihood estimation (MLE)

If we assume a probabilistic model $p_\theta(x)$ for the data, the likelihood of a dataset $\mathcal{D}=\{x_i\}_{i=1}^n$ of i.i.d. observations is $L(\theta)=\prod_{i=1}^n p_\theta(x_i)$. Viewed as a function of $\theta$, this is the probability (or, for continuous data, the density) of the data we observed. The MLE chooses the parameter that maximizes it.

$$\hat\theta_{\text{MLE}}=\arg\max_{\theta}\;\prod_{i=1}^n p_\theta(x_i)$$
$$\hat\theta_{\text{MLE}}=\arg\max_{\theta}\;\sum_{i=1}^n \log p_\theta(x_i)$$

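The second form maximizes the log-likelihood. Because $\log$ is strictly increasing, both forms have the same maximizer, but the log form turns products into sums and is far more stable numerically. Under regularity conditions (such as differentiability and identifiability), MLEs are consistent and asymptotically normal.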
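As a minimal illustration, the sketch below finds the MLE of a Bernoulli parameter by a grid search over the log-likelihood and then shows why the product form is fragile in floating point. The grid search, the toy data, and the function names are assumptions chosen for clarity; for the Bernoulli model the maximizer is available in closed form as the sample mean.

```python
import math

def log_likelihood(p, xs):
    """Bernoulli log-likelihood: sum of log p for ones and log(1 - p) for zeros."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in xs)

def likelihood(p, xs):
    """Bernoulli likelihood as a raw product (prone to underflow for large n)."""
    return math.prod(p if x == 1 else 1 - p for x in xs)

# Toy data: 7 successes in 20 trials, so the sample mean is 0.35.
xs = [1] * 7 + [0] * 13
grid = [i / 1000 for i in range(1, 1000)]
p_mle = max(grid, key=lambda p: log_likelihood(p, xs))
print("grid-search MLE:", p_mle, "sample mean:", sum(xs) / len(xs))

# With 5000 observations the raw product underflows to zero,
# while the log-likelihood remains an ordinary finite number.
big = [1, 0] * 2500
print("likelihood:", likelihood(0.5, big))
print("log-likelihood:", log_likelihood(0.5, big))
```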

Conditional log-likelihood (discriminative objective)

In supervised learning we often model $p_\theta(y\mid x)$ directly. The conditional log-likelihood objective is:

$$\hat\theta = \arg\max_{\theta}\;\sum_{i=1}^n \log p_\theta(y_i\mid x_i)$$

For multiclass classification with a softmax model, maximizing conditional log-likelihood is equivalent to minimizing cross-entropy (negative conditional log-likelihood).
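The equivalence can be seen directly in code. The sketch below computes the negative conditional log-likelihood of a softmax model and the cross-entropy against one-hot targets and checks that they agree. The logits are made-up numbers standing in for the outputs of some model, and the function names are assumptions for illustration.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, using the max-shift trick for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def negative_log_likelihood(logits_batch, labels):
    """-sum_i log p(y_i | x_i) under the softmax model."""
    return -sum(log_softmax(logits)[y] for logits, y in zip(logits_batch, labels))

def cross_entropy(logits_batch, labels):
    """Sum over examples of -sum_c t_c log q_c with one-hot targets t."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        log_probs = log_softmax(logits)
        one_hot = [1.0 if c == y else 0.0 for c in range(len(logits))]
        total += -sum(t * lp for t, lp in zip(one_hot, log_probs))
    return total

# Hypothetical logits for three examples and three classes.
logits_batch = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-1.0, 3.0, 0.0]]
labels = [0, 2, 1]
nll = negative_log_likelihood(logits_batch, labels)
ce = cross_entropy(logits_batch, labels)
print(nll, ce)  # the two values agree up to floating-point rounding
```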

Conditional log-likelihood vs. mean squared error

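Both are valid training objectives, but they encode different assumptions. Minimizing MSE targets the conditional mean $\mathbb{E}[Y\mid X=x]$ without specifying a distribution for $Y$, while conditional log-likelihood requires a model $p_\theta(y\mid x)$ for the conditional distribution of $Y$ given $X$. Neither requires a full generative model of the data, since the distribution of $X$ is left unmodeled.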

$$\text{MSE regression:}\quad \min_{f\in\mathcal{F}}\;\frac{1}{n}\sum_{i=1}^n\big(f(x_i)-y_i\big)^2$$
$$\text{Conditional log-likelihood:}\quad \max_{\theta}\;\sum_{i=1}^n \log p_\theta(y_i\mid x_i)$$

A useful bridge: if we assume a Gaussian noise model $Y\mid X=x \sim \mathcal{N}(f(x),\sigma^2)$ with constant $\sigma^2$, then minimizing MSE is (up to constants) equivalent to maximizing the conditional log-likelihood.

$$\log p(y\mid x) = -\tfrac{1}{2\sigma^2}(y-f(x))^2 + \text{const}\;\Rightarrow\;\min (y-f(x))^2\;\Leftrightarrow\;\max \log p(y\mid x)$$
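A quick numerical check of this bridge: for a fixed $\sigma$, the Gaussian negative log-likelihood is an increasing affine function of the MSE, so both objectives pick the same parameter. The one-parameter model $f(x)=wx$, the toy data, the candidate grid, and the value of `SIGMA` below are hypothetical choices for illustration.

```python
import math

def mse(w, xs, ys):
    """Mean squared error of the hypothetical model f(x) = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gaussian_nll(w, xs, ys, sigma):
    """Negative conditional log-likelihood under Y | X = x ~ Normal(w * x, sigma^2)."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (y - w * x) ** 2 / (2 * sigma ** 2) for x, y in zip(xs, ys))

# Toy data lying roughly on the line y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]
SIGMA = 0.7
candidates = [i / 100 for i in range(100, 301)]  # w from 1.00 to 3.00

w_mse = min(candidates, key=lambda w: mse(w, xs, ys))
w_nll = min(candidates, key=lambda w: gaussian_nll(w, xs, ys, SIGMA))
print("argmin MSE:", w_mse, "argmin NLL:", w_nll)

# NLL = n / (2 sigma^2) * MSE + (n / 2) * log(2 pi sigma^2) for any w.
n = len(xs)
w = 1.5
lhs = gaussian_nll(w, xs, ys, SIGMA)
rhs = n / (2 * SIGMA ** 2) * mse(w, xs, ys) + 0.5 * n * math.log(2 * math.pi * SIGMA ** 2)
print(lhs, rhs)
```

The equivalence relies on $\sigma^2$ being constant; if the noise variance depends on $x$ or is learned jointly, the two objectives no longer coincide.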