Goals of Today's Lecture: large sample theory of the MLE; estimating equations; the method of moments.
Large Sample Theory of the MLE
Application of the law of large numbers to the likelihood function: the log likelihood ratio for $\theta$ to $\theta_0$ is
$$\ell(\theta) - \ell(\theta_0) = \sum_{i=1}^n \log\frac{f(X_i,\theta)}{f(X_i,\theta_0)},$$
where $\ell(\theta) = \sum_{i=1}^n \log f(X_i,\theta)$. By the law of large numbers,
$$\frac{1}{n}\{\ell(\theta) - \ell(\theta_0)\} \to \mu(\theta) \equiv E_{\theta_0}\!\left[\log\frac{f(X,\theta)}{f(X,\theta_0)}\right],$$
and by Jensen's inequality $\mu(\theta) \le \log E_{\theta_0}[f(X,\theta)/f(X,\theta_0)] = \log 1 = 0$, with equality only when $f(\cdot,\theta) = f(\cdot,\theta_0)$. So for large $n$ the log likelihood is larger at $\theta_0$ than at any other fixed $\theta$, and the likelihood tends to be maximized close to the true value.
Definition: A sequence $\hat\theta_n$ of estimators of $\theta$ is consistent if $\hat\theta_n$ converges weakly (or strongly) to $\theta$.
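For example, by the law of large numbers $\bar X_n$ is a consistent sequence of estimators of $\mu = E(X_1)$.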
Proto-theorem: In regular problems the mle is consistent.
Now let us study the shape of the log likelihood near the true value of $\theta_0$ under the assumption that $\hat\theta$ is a root of the likelihood equations close to $\theta_0$. We use Taylor expansion to write, for a 1-dimensional parameter $\theta$,
$$0 = U(\hat\theta) = U(\theta_0) + (\hat\theta - \theta_0)U'(\theta_0) + \frac{(\hat\theta - \theta_0)^2}{2}\,U''(\tilde\theta)$$
for some $\tilde\theta$ between $\theta_0$ and $\hat\theta$, where $U(\theta) = \partial\ell(\theta)/\partial\theta$ is the score function.
(This form of the remainder in Taylor's theorem is not valid for a multivariate $\theta$.) The derivatives of $U$ are each sums of $n$ terms and so should each be roughly proportional to $n$ in size. The second derivative term is multiplied by the square of the small quantity $\hat\theta - \theta_0$, so it should be negligible compared with the first derivative term.
If we ignore the second derivative term we get
$$\hat\theta - \theta_0 \approx -\frac{U(\theta_0)}{U'(\theta_0)}.$$
In the normal case ($X_i$ iid $N(\mu,\sigma^2)$ with $\sigma$ known) we have $U(\mu) = \sum_i (X_i - \mu)/\sigma^2$, which is linear in $\mu$; here $U'' \equiv 0$ and the approximation is exact: $\hat\mu - \mu_0 = -U(\mu_0)/U'(\mu_0) = \bar X - \mu_0$.
In general, $U(\theta_0)$ has mean 0 and approximately a normal distribution. Here is how we check the mean:
$$E_{\theta_0}\{U(\theta_0)\} = \sum_{i=1}^n E_{\theta_0}\!\left[\frac{\partial \log f(X_i,\theta_0)}{\partial\theta}\right] = n\int \frac{\partial f(x,\theta)/\partial\theta}{f(x,\theta)}\,f(x,\theta)\,dx\,\Big|_{\theta=\theta_0} = n\,\frac{\partial}{\partial\theta}\int f(x,\theta)\,dx\,\Big|_{\theta=\theta_0} = n\,\frac{\partial}{\partial\theta}\,1 = 0.$$
Notice that I have interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative. The same tactic can be applied by differentiating the identity which we just proved,
$$\int \frac{\partial \log f(x,\theta)}{\partial\theta}\,f(x,\theta)\,dx = 0.$$
Differentiating again with respect to $\theta$ (and interchanging again) gives
$$\int \frac{\partial^2 \log f(x,\theta)}{\partial\theta^2}\,f(x,\theta)\,dx + \int \left(\frac{\partial \log f(x,\theta)}{\partial\theta}\right)^{2} f(x,\theta)\,dx = 0.$$
Definition: The Fisher information (in one observation) is
$$I(\theta) = E_\theta\!\left[\left(\frac{\partial \log f(X,\theta)}{\partial\theta}\right)^{2}\right],$$
so that $\mathrm{Var}_\theta\{U(\theta)\} = nI(\theta)$. The idea is that $I$ is a measure of how curved the log likelihood tends to be at the true value of $\theta$. Big curvature means precise estimates. Our identity above is
$$I(\theta) = -E_\theta\!\left[\frac{\partial^2 \log f(X,\theta)}{\partial\theta^2}\right] = -\frac{1}{n}E_\theta\{U'(\theta)\}.$$
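For example, in the Poisson$(\lambda)$ model,
$$\log f(x,\lambda) = x\log\lambda - \lambda - \log(x!), \qquad \frac{\partial \log f}{\partial\lambda} = \frac{x}{\lambda} - 1, \qquad \frac{\partial^2 \log f}{\partial\lambda^2} = -\frac{x}{\lambda^2},$$
and both formulas give the same answer: $E_\lambda[(X/\lambda - 1)^2] = \mathrm{Var}_\lambda(X)/\lambda^2 = 1/\lambda$ and $E_\lambda[X/\lambda^2] = 1/\lambda$, so $I(\lambda) = 1/\lambda$.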
Now we return to our Taylor expansion approximation
$$\hat\theta - \theta_0 \approx -\frac{U(\theta_0)}{U'(\theta_0)}.$$
We have shown that $U(\theta_0)$ is a sum of iid mean 0 random variables, each with variance $I(\theta_0)$. The central limit theorem thus proves that
$$\frac{U(\theta_0)}{\sqrt{n}} \Rightarrow N(0, I(\theta_0)).$$
Next observe that, by the law of large numbers and the identity above,
$$-\frac{U'(\theta_0)}{n} = -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \log f(X_i,\theta_0)}{\partial\theta^2} \to I(\theta_0).$$
Combining these two facts (via Slutsky's theorem) gives
$$\sqrt{n}\,(\hat\theta - \theta_0) \approx \frac{U(\theta_0)/\sqrt{n}}{-U'(\theta_0)/n} \Rightarrow N\!\left(0, \frac{1}{I(\theta_0)}\right).$$
Summary
In regular families:
$$\hat\theta \to \theta_0 \quad\text{and}\quad \sqrt{n}\,(\hat\theta - \theta_0) \Rightarrow N\!\left(0, \frac{1}{I(\theta_0)}\right), \quad\text{i.e.}\quad \hat\theta \approx N\!\left(\theta_0, \frac{1}{nI(\theta_0)}\right).$$
We usually simply say that the mle is consistent and asymptotically normal with an asymptotic variance which is the inverse of the Fisher information. This assertion is actually valid for vector valued $\theta$, where now $I$ is a matrix with $ij$th entry
$$I_{ij}(\theta) = E_\theta\!\left[\frac{\partial \log f(X,\theta)}{\partial\theta_i}\,\frac{\partial \log f(X,\theta)}{\partial\theta_j}\right] = -E_\theta\!\left[\frac{\partial^2 \log f(X,\theta)}{\partial\theta_i\,\partial\theta_j}\right].$$
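To see the summary in action, here is a minimal simulation sketch, assuming the Exponential$(\lambda)$ model, where $\hat\lambda = 1/\bar X$ and $I(\lambda) = 1/\lambda^2$, so the asymptotic standard deviation of $\hat\lambda$ is $\lambda/\sqrt{n}$; the numbers chosen are purely illustrative.

```python
# Simulation sketch: the MLE of the rate lambda in an Exponential(lambda)
# model should be approximately N(lambda, lambda^2 / n),
# since I(lambda) = 1 / lambda^2 per observation.
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 200, 5000

# Draw reps samples of size n; the MLE of the rate is 1 / sample mean.
samples = rng.exponential(scale=1.0 / lam, size=(reps, n))
mle = 1.0 / samples.mean(axis=1)

print("empirical sd of mle:", mle.std())              # observed spread
print("theoretical sd:", lam / np.sqrt(n))            # lambda / sqrt(n)
```

With $n = 200$ the two standard deviations should agree to roughly two decimal places.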
Estimating Equations
The same ideas arise whenever estimates are derived by solving some equation. Example: large sample theory for Generalized Linear Models.
Suppose that for $i = 1,\dots,n$ we have observations of the numbers of cancer cases $Y_i$ in some group of people characterized by values $x_i$ of some covariates. You are supposed to think of $x_i$ as containing variables like age, a dummy for sex, average income, and so on. A parametric regression model for the $Y_i$ might postulate that $Y_i$ has a Poisson distribution with mean $\mu_i$, where the mean depends somehow on the covariate values. Typically we might assume that $g(\mu_i) = x_i\beta$, where $g$ is a so-called link function, often $g(\mu) = \log\mu$ for this case, and $x_i\beta$ is a matrix product with $x_i$ written as a row vector and $\beta$ a column vector. This is supposed to function as a ``linear regression model with Poisson errors''. I will do the special case where $x_i$ is a scalar and $\log\mu_i = \beta x_i$.
The log likelihood is simply
$$\ell(\beta) = \sum_{i=1}^n \{Y_i \log\mu_i - \mu_i - \log(Y_i!)\},$$
and with $\mu_i = e^{\beta x_i}$ the likelihood equation is
$$U(\beta) = \sum_{i=1}^n (Y_i - e^{\beta x_i})\,x_i = 0.$$
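This equation has no closed-form root, so in practice one solves it numerically, for instance by Newton's method. A minimal sketch (the function name poisson_mle and the simulated data are illustrative only):

```python
# Solve the Poisson score equation sum (Y_i - exp(beta*x_i)) x_i = 0
# by Newton's method in the scalar parameter beta.
import numpy as np

def poisson_mle(x, y, beta=0.0, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        mu = np.exp(beta * x)
        score = np.sum((y - mu) * x)     # U(beta)
        info = np.sum(mu * x**2)         # -U'(beta), the observed information
        step = score / info              # Newton step: beta - U/U'
        beta += step
        if abs(step) < tol:
            break
    return beta

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=100)
y = rng.poisson(np.exp(0.7 * x))
print(poisson_mle(x, y))   # should be near the true value 0.7
```

Note that $-U'(\beta) = \sum_i \mu_i x_i^2 > 0$, so the log likelihood is concave in $\beta$ and Newton's method is well behaved here.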
Other estimating equations are possible; people suggest alternatives very often. If the $w_i$ are any set of deterministic weights (even possibly depending on $x_i$) then we could define an estimate $\tilde\beta$ as the root of
$$\sum_{i=1}^n w_i\,(Y_i - e^{\beta x_i}) = 0.$$
Taking $w_i = x_i$ recovers the likelihood equation.
Method of Moments
Basic strategy: set sample moments equal to population moments and solve for the parameters.
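For example, in the Exponential$(\lambda)$ model $E(X) = 1/\lambda$, so setting $\bar X = 1/\lambda$ gives the method of moments estimate $\hat\lambda = 1/\bar X$.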
Definition: The $k$th sample moment (about the origin) is
$$\hat\mu_k' = \frac{1}{n}\sum_{i=1}^n X_i^k.$$
(The central moments are $\hat\mu_k = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^k$.)
If we have $p$ parameters we can estimate the parameters $\theta_1,\dots,\theta_p$ by solving the system of $p$ equations:
$$\hat\mu_k' = \mu_k'(\theta_1,\dots,\theta_p), \qquad k = 1,\dots,p,$$
where $\mu_k'(\theta) = E_\theta(X^k)$ is the corresponding population moment.
Gamma Example
The Gamma$(\alpha,\beta)$ density is
$$f(x;\alpha,\beta) = \frac{1}{\beta\,\Gamma(\alpha)}\left(\frac{x}{\beta}\right)^{\alpha-1} e^{-x/\beta}, \qquad x > 0,$$
with $E(X) = \alpha\beta$ and $E(X^2) = \alpha(\alpha+1)\beta^2$.
The method of moments equations are
$$\bar X = \alpha\beta, \qquad \frac{1}{n}\sum_{i=1}^n X_i^2 = \alpha(\alpha+1)\beta^2.$$
Subtracting $\bar X^2$ from the second equation leaves the sample variance $\hat\mu_2 = \alpha\beta^2$, so
$$\hat\beta = \frac{\hat\mu_2}{\bar X}, \qquad \hat\alpha = \frac{\bar X}{\hat\beta} = \frac{\bar X^2}{\hat\mu_2}.$$
These are much easier to solve than the likelihood equations, which involve the digamma function $\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$.
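A minimal numerical sketch of these two estimates (the helper name gamma_mom and the simulated data are illustrative only):

```python
# Gamma(alpha, beta) method of moments, using
# mean = alpha * beta and variance = alpha * beta^2.
import numpy as np

def gamma_mom(x):
    xbar = x.mean()
    var = x.var()            # second central moment
    beta = var / xbar        # variance / mean = beta
    alpha = xbar / beta      # mean / beta = alpha
    return alpha, beta

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=2.0, size=10_000)
print(gamma_mom(x))   # should be near (3.0, 2.0)
```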