The Cramér-Rao Bound is a fascinating result.
If we think of estimators as functions of the data that try to estimate the parameter, you might imagine that if you work really hard you can always come up with a better estimator. The Cramér-Rao bound says there is a limit to how well you can do. It is a lower bound on the (co)variance of the estimator, and it is based on information-theoretic quantities of the statistical model.
First let's consider the univariate case, where the data $X \sim p(X \mid \theta)$ depend on a single scalar parameter $\theta$.
In the unbiased case, the Cramér-Rao bound states
$$
\operatorname{var}\left(\hat{\theta}\right) \geq \frac{1}{I(\theta)}
$$
where $I(\theta)$ is the **Fisher information**
$$
I(\theta) = \operatorname{E}\!\left[\left(\frac{\partial}{\partial \theta} \log p(X \mid \theta)\right)^{\!2} \,\middle|\, \theta\right].
$$
Under some mild assumptions, you can rewrite this Fisher information as
$$
I(\theta) = -\operatorname{E}\!\left[\frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta) \,\middle|\, \theta\right].
$$
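To make this concrete, here is a small numerical check (a sketch using a Bernoulli model as a hypothetical example) that the two expressions for the Fisher information agree; for $\operatorname{Bernoulli}(\theta)$ both equal $1/(\theta(1-\theta))$.

```python
import numpy as np

def bernoulli_fisher_info(theta):
    """Fisher information of Bernoulli(theta), computed two equivalent ways.

    The expectation over x in {0, 1} is exact, so no sampling is needed.
    """
    xs = np.array([0.0, 1.0])
    probs = np.array([1 - theta, theta])
    # score: d/dtheta log p(x | theta) = x/theta - (1-x)/(1-theta)
    score = xs / theta - (1 - xs) / (1 - theta)
    # second derivative of the log-likelihood
    d2 = -xs / theta**2 - (1 - xs) / (1 - theta)**2
    i_score = np.sum(probs * score**2)  # E[score^2]
    i_curv = -np.sum(probs * d2)        # -E[d^2/dtheta^2 log p]
    return i_score, i_curv

i1, i2 = bernoulli_fisher_info(0.3)
# both forms equal 1/(theta*(1-theta)) = 1/0.21 ≈ 4.76
```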
The **efficiency** of an unbiased estimator
$\hat{\theta}$ measures how close this estimator's variance comes to this lower bound; estimator efficiency is defined as
$$
e\left({\hat{\theta}}\right) = \frac{I(\theta)^{-1}}{\operatorname{var}\left({\hat{\theta}}\right)}
$$
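As an illustration (a Monte Carlo sketch, using the Gaussian mean as a hypothetical example), the sample mean of $n$ i.i.d. $\mathcal{N}(\theta, \sigma^2)$ draws attains the bound $\sigma^2/n$ (efficiency $\approx 1$), while the sample median has asymptotic efficiency $2/\pi \approx 0.64$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma = 100, 20_000, 1.0
samples = rng.normal(0.0, sigma, size=(trials, n))

# Cramér-Rao bound for the mean of N(theta, sigma^2) from n samples
crb = sigma**2 / n
eff_mean = crb / samples.mean(axis=1).var()            # ≈ 1.0 (efficient)
eff_median = crb / np.median(samples, axis=1).var()    # ≈ 2/pi ≈ 0.64
```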
The term $\frac{\partial}{\partial \theta} \log p(X \mid \theta)$ is called the **score function**.
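A useful property of the score (under the same regularity conditions) is that its expectation under the model is zero. A quick Monte Carlo sketch for $\mathcal{N}(\theta, 1)$, where the score is simply $x - \theta$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0
x = rng.normal(theta, 1.0, size=100_000)
score = x - theta           # d/dtheta log N(x | theta, 1)
mean_score = score.mean()   # ≈ 0 when x is drawn from the model
```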
Consider the straw man estimator that always returns a constant value $\hat{\theta}_{const} = \theta_0$. The variance of the estimator is 0!
The bias at $\theta_0$ is also zero, $b(\theta_0)=0$; is this a violation of the Cramér-Rao Bound? While the bias is 0 at that particular point, the estimator is biased everywhere else, $b(\theta)=\theta_0 - \theta$, so this form of the bound isn't applicable; we need a generalization that works with biased estimators.
For a biased estimator, the bound becomes
$$
\operatorname{var}\left({\hat{\theta}}\right) \geq \frac{\left[1 + \frac{d b(\theta)}{d\theta}\right]^2}{I(\theta)}
$$
where we use the bias $b(\theta) = \operatorname{E}[\hat{\theta}] - \theta$.
The resolution to the example with the straw man estimator that always returns a constant value $\hat{\theta}_\textrm{const} = \theta_0$ involves this generalization of the Cramér-Rao Bound. The bias is $b(\theta)=\theta_0 - \theta$, so the derivative is $\color{#DC2830}{\frac{d b(\theta )}{d\theta}}=-1$, and the generalized bound is $\displaystyle \operatorname {var} \left({\hat {\theta }}\right) \geq 0$, so all is well.
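To see the biased bound do nontrivial work, consider (as a hypothetical example) the shrinkage estimator $\hat{\theta} = c\,\bar{x}$ for the mean of $\mathcal{N}(\theta, \sigma^2)$: its bias is $b(\theta) = (c-1)\theta$, so $b'(\theta) = c - 1$ and the biased bound is $c^2 \sigma^2 / n$, which this estimator attains. A Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, sigma, theta, c = 50, 20_000, 1.0, 1.5, 0.8
x = rng.normal(theta, sigma, size=(trials, n))
est = c * x.mean(axis=1)                        # biased shrinkage estimator
emp_var = est.var()                             # empirical variance
bound = (1 + (c - 1))**2 * sigma**2 / n         # biased CRB = c^2 sigma^2 / n
# emp_var ≈ bound: the shrinkage estimator attains the biased bound
```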
There is a corresponding formulation for the multivariate case, where the parameter is a vector $\theta = (\theta_1, \ldots, \theta_k)^{\mathsf T}$.
Let's consider the unbiased case first, and generalize variance to covariance. We have the matrix inequality
$$
\operatorname{cov}_\theta\!\left(\hat{\theta}\right) \succeq I^{-1}(\theta)
$$
meaning that the difference $\operatorname{cov}_\theta(\hat{\theta}) - I^{-1}(\theta)$ is positive semidefinite,
where the Fisher information matrix $I(\theta)$ has entries
$$
I_{ij}(\theta) = \operatorname{E}\!\left[\frac{\partial}{\partial \theta_i} \log p(X \mid \theta)\,\frac{\partial}{\partial \theta_j} \log p(X \mid \theta) \,\middle|\, \theta\right].
$$
Under some mild assumptions, you can rewrite this Fisher information matrix as
$$
I_{ij}(\theta) = -\operatorname{E}\!\left[\frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log p(X \mid \theta) \,\middle|\, \theta\right].
$$
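As a sketch (using $\mathcal{N}(\mu, \sigma^2)$ parameterized by $(\mu, \sigma)$ as a hypothetical example), we can estimate the Fisher information matrix as the Monte Carlo average of the outer product of the score vector; the exact answer here is $\operatorname{diag}(1/\sigma^2,\; 2/\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 0.0, 2.0, 500_000
x = rng.normal(mu, sigma, size=n)

# score vector: gradient of log N(x | mu, sigma) w.r.t. (mu, sigma)
score = np.stack([(x - mu) / sigma**2,
                  -1 / sigma + (x - mu)**2 / sigma**3], axis=1)
fim = score.T @ score / n   # Monte Carlo E[score score^T]
# exact: [[1/sigma^2, 0], [0, 2/sigma^2]] = [[0.25, 0], [0, 0.5]]
```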
The generalization of the score function $\nabla_\theta \log p(X \mid \theta)$ is now a vector.
There is also a corresponding generalization for biased, multivariate estimators.
The general form of the Cramér–Rao bound then states that the covariance matrix of any estimator $T(X)$ with mean $\psi(\theta) = \operatorname{E}[T(X)]$ satisfies
$$
\operatorname{cov}_\theta\!\left(T(X)\right) \succeq \frac{\partial \psi(\theta)}{\partial \theta}\, I^{-1}(\theta) \left(\frac{\partial \psi(\theta)}{\partial \theta}\right)^{\!\mathsf T}
$$
where $\frac{\partial \psi(\theta)}{\partial \theta}$ is the Jacobian matrix with entries $\frac{\partial \psi_i(\theta)}{\partial \theta_j}$.
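A Monte Carlo sketch of the general bound (using, as a hypothetical example, the estimation of $\psi = \sigma^2$ for $\mathcal{N}(0, \sigma^2)$ parameterized by $\theta = \sigma$): here $T = \frac{1}{n}\sum_i x_i^2$, $\frac{d\psi}{d\sigma} = 2\sigma$, and $I_n(\sigma) = 2n/\sigma^2$, so the bound is $(2\sigma)^2 \sigma^2 / (2n) = 2\sigma^4/n$, which $T$ attains.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n, trials = 1.5, 100, 20_000
x = rng.normal(0.0, sigma, size=(trials, n))
T = (x**2).mean(axis=1)       # unbiased estimator of psi = sigma^2
bound = 2 * sigma**4 / n      # general CRB for psi(sigma) = sigma^2
emp_var = T.var()             # ≈ bound: T attains the general bound
```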
Importantly, under some regularity conditions, maximum likelihood estimators are asymptotically unbiased and efficient (i.e., they asymptotically saturate the inequality).
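A quick Monte Carlo sketch of this asymptotic efficiency (using, as a hypothetical example, the rate of an $\operatorname{Exponential}(\lambda)$ distribution, whose MLE is $1/\bar{x}$): the ratio of the MLE's variance to the CRB $\lambda^2/n$ approaches 1 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, trials = 1.0, 400, 20_000
x = rng.exponential(1 / lam, size=(trials, n))
mle = 1 / x.mean(axis=1)      # MLE of the rate (slightly biased for finite n)
crb = lam**2 / n              # Cramér-Rao bound for the rate
ratio = mle.var() / crb       # → 1 as n grows
```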
Later we will connect the Fisher information matrix to the topic of Information Geometry, where we can interpret it as a Riemannian metric on the space of probability distributions.
We will also see connections to the concept of Sufficiency and the exponential family.