STAT 801 Lecture 12
Goals of Today's Lecture:
- Develop large sample theory for the mle.
- Define and interpret Fisher information.
- Extend the ideas to estimating equations.
Today's notes
For $X_1,\ldots,X_n$ iid with density $f(x,\theta)$ the log likelihood is
$$\ell(\theta) = \sum_{i=1}^n \log f(X_i,\theta).$$
The score function is
$$U(\theta) = \frac{\partial \ell(\theta)}{\partial\theta}.$$
The MLE $\hat\theta$ maximizes $\ell(\theta)$. If the maximum occurs in the interior of the parameter space and the log likelihood is continuously differentiable, then $\hat\theta$ solves the likelihood equations
$$U(\theta) = 0.$$
Examples

$N(\mu,\sigma^2)$: the score function has two components,
$$U(\mu,\sigma^2) = \left( \frac{\sum_i (X_i-\mu)}{\sigma^2},\ \frac{\sum_i (X_i-\mu)^2}{2\sigma^4} - \frac{n}{2\sigma^2} \right),$$
and setting both components to 0 gives
$$\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac{\sum_i (X_i-\bar X)^2}{n}.$$
The unique root of the likelihood equations is a global maximum.

[Remark: Suppose we called $\sigma$ (rather than $\sigma^2$) the parameter. The score function still has two components: the first component is the same as before but the second component is
$$\frac{\partial\ell}{\partial\sigma} = \frac{\sum_i (X_i-\mu)^2}{\sigma^3} - \frac{n}{\sigma}.$$
Setting the new likelihood equations equal to 0 still gives
$$\hat\mu = \bar X, \qquad \hat\sigma = \left[\frac{\sum_i (X_i-\bar X)^2}{n}\right]^{1/2} = \sqrt{\hat{\sigma^2}}.$$
General invariance (or equivariance) principle: suppose $\phi = g(\theta)$ is some reparametrization of a model (a one to one relabelling of the parameter values). Then $\hat\phi = g(\hat\theta)$. This does not apply to other estimators.]
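As a numerical illustration of this invariance (a sketch I am adding, not part of the original notes; it assumes numpy and scipy are available), the code below maximizes the normal log likelihood in both the $(\mu,\sigma^2)$ and $(\mu,\sigma)$ parametrizations and checks that the answers agree with the closed forms and with $\hat\sigma = \sqrt{\hat{\sigma^2}}$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=200)      # simulated N(2, 9) data

def negloglik_var(par):                           # parametrized by (mu, sigma^2)
    mu, s2 = par
    if s2 <= 0:
        return np.inf
    return 0.5 * np.sum(np.log(2 * np.pi * s2) + (x - mu) ** 2 / s2)

def negloglik_sd(par):                            # parametrized by (mu, sigma)
    mu, s = par
    return negloglik_var([mu, s ** 2])

fit_var = minimize(negloglik_var, x0=[0.0, 1.0], method="Nelder-Mead")
fit_sd = minimize(negloglik_sd, x0=[0.0, 1.0], method="Nelder-Mead")

print("closed form     :", x.mean(), x.var())     # mu-hat and sigma^2-hat = sum (x - xbar)^2 / n
print("numerical (var) :", fit_var.x)
print("numerical (sd)  :", fit_sd.x)
print("sqrt of var-fit :", np.sqrt(fit_var.x[1])) # matches the sd-fit: invariance
```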
Cauchy: location $\theta$. The density is
$$f(x;\theta) = \frac{1}{\pi\left[1 + (x-\theta)^2\right]},$$
so the likelihood equation is
$$U(\theta) = \sum_i \frac{2(X_i-\theta)}{1+(X_i-\theta)^2} = 0.$$
There is at least 1 root of the likelihood equations but often several more. One root is a global maximum; others, if they exist, may be local minima or maxima.
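The multiple-root phenomenon is easy to see numerically. The sketch below (my own illustration, assuming the standard Cauchy location model above) evaluates the score on a grid and reports every sign change; each sign change brackets a root of the likelihood equation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(size=7) + 5.0            # small Cauchy sample with true location 5

def score(theta_grid):
    d = x[:, None] - np.asarray(theta_grid)      # shape (n, number of grid points)
    return np.sum(2 * d / (1 + d ** 2), axis=0)

grid = np.linspace(x.min() - 5, x.max() + 5, 20001)
u = score(grid)
idx = np.where(np.diff(np.sign(u)) != 0)[0]      # indices where U changes sign
roots = (grid[idx] + grid[idx + 1]) / 2
print("approximate roots of the likelihood equation:", np.round(roots, 3))
```

Rerunning with different seeds shows how often extra roots appear in small Cauchy samples.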
Binomial$(n,\theta)$: the score function is
$$U(\theta) = \frac{X}{\theta} - \frac{n-X}{1-\theta}.$$
If $X=0$ or $X=n$: no root of the likelihood equations; the likelihood is monotone. For other values of $X$: a unique root, $\hat\theta = X/n$, which is a global maximum. The global maximum is at $\hat\theta = X/n$ even if $X=0$ or $n$.
The 2 parameter exponential

The density is
$$f(x;\alpha,\beta) = \frac{1}{\beta}\, e^{-(x-\alpha)/\beta}\, 1(x > \alpha).$$
The log-likelihood is
$$\ell(\alpha,\beta) = -n\log\beta - \sum_i (X_i-\alpha)/\beta$$
for $\alpha \le X_{(1)} \equiv \min\{X_1,\ldots,X_n\}$ and is $-\infty$ otherwise. It is an increasing function of $\alpha$ till $\alpha$ reaches $X_{(1)}$, which gives the mle of $\alpha$:
$$\hat\alpha = X_{(1)}.$$
Now plug in $\hat\alpha$ for $\alpha$; we get the so-called profile likelihood for $\beta$:
$$\ell_{\rm profile}(\beta) = -n\log\beta - \sum_i (X_i - X_{(1)})/\beta.$$
Set the $\beta$ derivative equal to 0 to get
$$\hat\beta = \frac{\sum_i (X_i - X_{(1)})}{n}.$$
Notice the mle $\hat\alpha$ does not solve the likelihood equations; we had to look at the edge of the possible parameter space. $\alpha$ is called a support or truncation parameter. ML methods behave oddly in problems with such parameters.
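A short numerical check of these formulas (a sketch I am adding; it simulates from the shifted exponential model above):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_true, beta_true = 3.0, 2.0
x = alpha_true + rng.exponential(scale=beta_true, size=500)

alpha_hat = x.min()                    # mle of the truncation parameter: the smallest observation
beta_hat = np.mean(x - alpha_hat)      # mle of beta from the profile likelihood
print(alpha_hat, beta_hat)             # close to (3, 2) for a sample this large
```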
Three parameter Weibull

The density in question is
$$f(x;\alpha,\beta,\gamma) = \frac{\gamma}{\beta}\left(\frac{x-\alpha}{\beta}\right)^{\gamma-1} \exp\left[-\left(\frac{x-\alpha}{\beta}\right)^{\gamma}\right]\, 1(x>\alpha).$$
There are three likelihood equations:
$$\frac{\partial\ell}{\partial\alpha} = 0, \qquad \frac{\partial\ell}{\partial\beta} = 0, \qquad \frac{\partial\ell}{\partial\gamma} = 0.$$
Set the $\beta$ derivative equal to 0; get
$$\hat\beta(\alpha,\gamma) = \left[\frac{\sum_i (X_i-\alpha)^\gamma}{n}\right]^{1/\gamma},$$
where the notation indicates that the mle of $\beta$ could be found by finding the mles of the other two parameters and then plugging in to the formula above.
It is not possible to find the remaining two parameters explicitly; numerical methods are needed. However, you can see that putting $\gamma < 1$ and letting $\alpha \uparrow X_{(1)}$ will make the log likelihood go to $\infty$. The mle is not uniquely defined, then, since any $\gamma < 1$ and any $\beta$ will do.
If the true value of $\gamma$ is more than 1 then the probability that there is a root of the likelihood equations is high; in this case there must be two more roots: a local maximum and a saddle point! For a true value of $\gamma > 1$ the theory we detail below applies to this local maximum and not to the global maximum of the likelihood.
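Since the remaining two parameters require numerical work, here is a sketch (my own, not from the notes) that maximizes the profile log likelihood in $(\alpha,\gamma)$, using the plug-in formula for $\hat\beta$ at each step. A local optimizer started from a sensible point finds the local maximum that the theory below describes; remember that the likelihood is unbounded as $\alpha \uparrow X_{(1)}$ with $\gamma < 1$, so a global search would be meaningless:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
alpha0, beta0, gamma0 = 10.0, 2.0, 1.5                   # true values, gamma > 1
x = alpha0 + beta0 * rng.weibull(gamma0, size=300)

def neg_profile_loglik(par):
    alpha, gamma = par
    if gamma <= 0 or alpha >= x.min():
        return np.inf                                    # outside the allowed region
    z = x - alpha
    beta = np.mean(z ** gamma) ** (1 / gamma)            # plug-in formula for beta-hat
    return -(len(x) * np.log(gamma) - len(x) * gamma * np.log(beta)
             + (gamma - 1) * np.sum(np.log(z)) - np.sum((z / beta) ** gamma))

fit = minimize(neg_profile_loglik, x0=[x.min() - 1.0, 1.0], method="Nelder-Mead")
alpha_hat, gamma_hat = fit.x
beta_hat = np.mean((x - alpha_hat) ** gamma_hat) ** (1 / gamma_hat)
print(alpha_hat, beta_hat, gamma_hat)
```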
Large Sample Theory
We now study the approximate behaviour of $\hat\theta$ by studying the function $U$. Notice first that $U$ is a sum of independent random variables.
Theorem: If $Y_1, Y_2, \ldots$ are iid with mean $\mu$ then
$$\bar Y \to \mu.$$
This is called the law of large numbers. The strong law says
$$P\left(\lim_{n\to\infty} \bar Y = \mu\right) = 1$$
and the weak law that
$$\lim_{n\to\infty} P\left(|\bar Y - \mu| > \epsilon\right) = 0 \quad\text{for each } \epsilon > 0.$$
For iid $Y_i$ the stronger conclusion holds, but for our heuristics we will ignore the differences between these notions.
Now suppose $\theta_0$ is the true value of $\theta$. Then
$$U(\theta)/n \to \mu(\theta),$$
where
$$\mu(\theta) = E_{\theta_0}\left[\frac{\partial \log f(X_i,\theta)}{\partial\theta}\right] = \int \frac{\partial \log f(x,\theta)}{\partial\theta}\, f(x,\theta_0)\, dx.$$
Consider as an example the case of $N(\mu,1)$ data, where
$$U(\mu)/n = \sum_i (X_i-\mu)/n = \bar X - \mu.$$
If the true mean is $\mu_0$ then $\bar X \to \mu_0$ and
$$U(\mu)/n \to \mu_0 - \mu.$$
If we think of a $\mu < \mu_0$ we see that the derivative of $\ell(\mu)$ is likely to be positive, so that $\ell$ increases as we increase $\mu$. For $\mu$ more than $\mu_0$ the derivative is probably negative, and so $\ell$ tends to be decreasing for $\mu > \mu_0$. It follows that $\ell$ is likely to be maximized close to $\mu_0$.
Repeat these ideas for the more general case. Study the rv
$$\log\left[\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right].$$
You know the inequality
$$E(X)^2 \le E(X^2)$$
(the difference is $\mathrm{Var}(X) \ge 0$). A generalization is called Jensen's inequality: for $g$ a convex function ($g'' \ge 0$, roughly),
$$g\left(E(X)\right) \le E\left(g(X)\right).$$
The inequality above has $g(x)=x^2$. Use $g(x) = -\log(x)$: convex because $g''(x) = x^{-2} > 0$. We get
$$-\log\left( E_{\theta_0}\left[\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right] \right) \le E_{\theta_0}\left[-\log\left(\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right)\right].$$
But
$$E_{\theta_0}\left[\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right] = \int \frac{f(x,\theta)}{f(x,\theta_0)}\, f(x,\theta_0)\, dx = \int f(x,\theta)\, dx = 1.$$
We can reassemble the inequality and this calculation to get
$$E_{\theta_0}\left[\log\left(\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right)\right] \le 0.$$
It is possible to prove that the inequality is strict unless the $\theta$ and $\theta_0$ densities are actually the same.
Let $\lambda(\theta)$ be this expected value. Then for each $\theta \ne \theta_0$ we find
$$\frac{\ell(\theta) - \ell(\theta_0)}{n} = \frac{1}{n}\sum_i \log\left[\frac{f(X_i,\theta)}{f(X_i,\theta_0)}\right] \to \lambda(\theta) < 0.$$
This proves the likelihood is probably higher at $\theta_0$ than at any other single $\theta$. This idea can often be stretched to prove that the mle is consistent.
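A quick Monte Carlo check of this inequality (my own sketch, using a $N(\theta,1)$ model with $\theta_0=0$, where the exact value of the expectation is $-\theta^2/2$):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)     # draws from the theta_0 = 0 density

def mean_log_ratio(theta):
    # Monte Carlo estimate of E_0{ log[ f(X, theta) / f(X, 0) ] } for the N(theta, 1) model
    return np.mean(-0.5 * (x - theta) ** 2 + 0.5 * x ** 2)

for theta in [-1.0, -0.1, 0.0, 0.1, 1.0]:
    print(theta, mean_log_ratio(theta))              # negative except at theta = theta_0 = 0
```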
Definition: A sequence $\hat\theta_n$ of estimators of $\theta$ is consistent if $\hat\theta_n$ converges weakly (or strongly) to $\theta$.
Proto theorem: In regular problems the mle
is consistent.
Now let us study the shape of the log likelihood near the true value of $\theta$ under the assumption that $\hat\theta$ is a root of the likelihood equations close to $\theta_0$. We use Taylor expansion to write, for a 1 dimensional parameter $\theta$,
$$0 = U(\hat\theta) = U(\theta_0) + U'(\theta_0)(\hat\theta-\theta_0) + \tfrac{1}{2}\, U''(\theta^*)(\hat\theta-\theta_0)^2$$
for some $\theta^*$ between $\theta_0$ and $\hat\theta$. (This form of the remainder in Taylor's theorem is not valid for multivariate $\theta$.) The derivatives of $U$ are each sums of $n$ terms and so should both be roughly proportional to $n$ in size. The second derivative term is multiplied by the square of the small number $\hat\theta - \theta_0$ and so should be negligible compared to the first derivative term. If we ignore the second derivative term we get
$$-U'(\theta_0)(\hat\theta - \theta_0) \approx U(\theta_0).$$
Now let's look at the terms $U$ and $U'$.
In the normal case
$$U(\mu) = \sum_i (X_i - \mu)$$
has a normal distribution with mean 0 and variance $n$ (SD $\sqrt n$). The derivative is simply $U'(\mu) = -n$, and the next derivative $U''$ is 0. We will analyze the general case by noticing that both $U$ and $U'$ are sums of iid random variables. Let
$$U_i = \frac{\partial \log f(X_i,\theta)}{\partial\theta} \qquad\text{and}\qquad V_i = -\frac{\partial^2 \log f(X_i,\theta)}{\partial\theta^2}.$$
In general, $U(\theta_0)$ has mean 0 and approximately a normal distribution. Here is how we check that:
$$E_{\theta_0}\left[U(\theta_0)\right] = n\, E_{\theta_0}(U_1) = n \int \frac{\partial \log f(x,\theta_0)}{\partial\theta}\, f(x,\theta_0)\, dx = n \int \frac{\partial f(x,\theta_0)/\partial\theta}{f(x,\theta_0)}\, f(x,\theta_0)\, dx = n \left.\frac{\partial}{\partial\theta} \int f(x,\theta)\, dx\right|_{\theta=\theta_0} = n\, \frac{\partial}{\partial\theta}\, 1 = 0.$$
Notice that I have interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative. The same tactic can be applied by differentiating the identity which we just proved,
$$\int \frac{\partial \log f(x,\theta)}{\partial\theta}\, f(x,\theta)\, dx = 0.$$
Taking the derivative of both sides with respect to $\theta$ and pulling the derivative under the integral sign again gives
$$\frac{\partial}{\partial\theta} \int \frac{\partial \log f(x,\theta)}{\partial\theta}\, f(x,\theta)\, dx = 0.$$
Do the derivative and get
$$\int \frac{\partial^2 \log f(x,\theta)}{\partial\theta^2}\, f(x,\theta)\, dx + \int \left[\frac{\partial \log f(x,\theta)}{\partial\theta}\right]^2 f(x,\theta)\, dx = 0,$$
that is,
$$E_\theta\left[-\frac{\partial^2 \log f(X,\theta)}{\partial\theta^2}\right] = E_\theta\left[\left(\frac{\partial \log f(X,\theta)}{\partial\theta}\right)^2\right] = \mathrm{Var}_\theta\left(\frac{\partial \log f(X,\theta)}{\partial\theta}\right).$$
Definition: The Fisher Information is
$$I(\theta) = \mathrm{Var}_\theta\left(U(\theta)\right).$$
We refer to
$$\mathcal{I}(\theta) = I(\theta)/n = \mathrm{Var}_\theta(U_1) = E_\theta(V_1)$$
as the information in 1 observation.
The idea is that $I$ is a measure of how curved the log likelihood tends to be at the true value of $\theta$. Big curvature means precise estimates. Our identity above is
$$I(\theta) = \mathrm{Var}_\theta\left(U(\theta)\right) = -E_\theta\left(U'(\theta)\right) = n\,\mathcal{I}(\theta).$$
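As a quick check of this identity, here is the computation for a single Poisson($\theta$) observation (my own illustration, not one of the lecture's examples):
$$\log f(x,\theta) = x\log\theta - \theta - \log(x!), \qquad \frac{\partial \log f}{\partial\theta} = \frac{x}{\theta} - 1, \qquad -\frac{\partial^2 \log f}{\partial\theta^2} = \frac{x}{\theta^2}.$$
Then $\mathrm{Var}_\theta(X/\theta - 1) = \mathrm{Var}_\theta(X)/\theta^2 = 1/\theta$ and $E_\theta(X/\theta^2) = 1/\theta$, so both routes give $\mathcal{I}(\theta) = 1/\theta$, as the identity requires.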
Now we return to our Taylor expansion approximation
$$-U'(\theta_0)(\hat\theta - \theta_0) \approx U(\theta_0)$$
and study the two appearances of $U$.
We have shown that $U(\theta_0)$ is a sum of iid mean 0 random variables. The central limit theorem thus proves that
$$n^{-1/2}\, U(\theta_0) \Rightarrow N(0,\sigma^2),$$
where $\sigma^2 = \mathrm{Var}(U_1) = E(U_1^2) = \mathcal{I}(\theta_0)$.
Next observe that
$$-U'(\theta) = \sum_i V_i,$$
where again
$$E_{\theta_0}(V_i) = \mathcal{I}(\theta_0).$$
The law of large numbers can be applied to show
$$-U'(\theta_0)/n \to \mathcal{I}(\theta_0).$$
Now manipulate our Taylor expansion as follows:
$$n^{1/2}(\hat\theta - \theta_0) \approx \left[\frac{-U'(\theta_0)}{n}\right]^{-1} \frac{U(\theta_0)}{\sqrt n}.$$
Apply Slutsky's Theorem to conclude that the right hand side of this converges in distribution to $N\left(0, \sigma^2/\mathcal{I}(\theta_0)^2\right)$, which simplifies, because of the identities, to $N\left(0, 1/\mathcal{I}(\theta_0)\right)$.
Summary

In regular families:
$$n^{1/2}(\hat\theta - \theta_0) \Rightarrow N\left(0, \frac{1}{\mathcal{I}(\theta_0)}\right), \qquad\text{i.e. approximately}\qquad \hat\theta \approx N\left(\theta_0, \frac{1}{I(\theta_0)}\right).$$
We usually simply say that the mle is consistent and asymptotically normal with an asymptotic variance which is the inverse of the Fisher information. This assertion is actually valid for vector valued $\theta$, where now $I$ is a matrix with $ij$th entry
$$I_{ij}(\theta) = -E_\theta\left[\frac{\partial^2 \ell(\theta)}{\partial\theta_i\,\partial\theta_j}\right].$$
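To see the theorem in action, here is a small simulation I am adding (not from the notes): an exponential model with rate $\theta$, for which $\mathcal{I}(\theta) = 1/\theta^2$ and $\hat\theta = 1/\bar X$, so the theory says $\sqrt{n}(\hat\theta - \theta_0)$ should have standard deviation close to $1/\sqrt{\mathcal{I}(\theta_0)} = \theta_0$:

```python
import numpy as np

rng = np.random.default_rng(5)
theta0, n, nrep = 2.0, 200, 5000

samples = rng.exponential(scale=1 / theta0, size=(nrep, n))
theta_hat = 1 / samples.mean(axis=1)                 # mle of the rate in each replication

z = np.sqrt(n) * (theta_hat - theta0)
print("simulated sd of sqrt(n)(theta-hat - theta0):", z.std())
print("theoretical sd, 1/sqrt(info per obs)       :", theta0)
```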
Estimating Equations
The same ideas arise whenever estimates are derived by solving some equation. Example: large sample theory for Generalized Linear Models.
Suppose that for $i=1,\ldots,n$ we have observations of the numbers of cancer cases $Y_i$ in some group of people characterized by values $x_i$ of some covariates. You are supposed to think of $x_i$ as containing variables like age, a dummy for sex, average income, and so on. A parametric regression model for the $Y_i$ might postulate that $Y_i$ has a Poisson distribution with mean $\mu_i$, where the mean depends somehow on the covariate values. Typically we might assume that
$$g(\mu_i) = x_i\beta,$$
where $g$ is a so-called link function, often for this case $g(\mu)=\log(\mu)$, and $x_i\beta$ is a matrix product with $x_i$ written as a row vector and $\beta$ a column vector. This is supposed to function as a ``linear regression model with Poisson errors''.
I will do as a special case
$$\log(\mu_i) = \beta x_i,$$
where $x_i$ is a scalar.
The log likelihood is simply
$$\ell(\beta) = \sum_i \left( Y_i \log\mu_i - \mu_i \right),$$
ignoring irrelevant factorials. The score function is, since $\log\mu_i = \beta x_i$,
$$U(\beta) = \sum_i \left( Y_i x_i - x_i \mu_i \right) = \sum_i x_i (Y_i - \mu_i).$$
(Notice again that the score has mean 0 when you plug in the true parameter value.) The key observation, however, is that it is not necessary to believe that $Y_i$ has a Poisson distribution to make solving the equation $U=0$ sensible. Suppose only that
$$E(Y_i) = \mu_i = e^{\beta x_i}.$$
Then we have assumed that
$$E_\beta\left(U(\beta)\right) = 0.$$
This was the key condition in proving that there was a root of the likelihood equations which was consistent, and here it is what is needed, roughly, to prove that the equation
$$U(\beta) = 0$$
has a consistent root $\hat\beta$. Ignoring higher order terms in a Taylor expansion will give
$$V(\beta)(\hat\beta - \beta) \approx U(\beta),$$
where
$$V(\beta) = -U'(\beta) = \sum_i x_i^2 \mu_i.$$
In the mle case we had identities relating the expectation of $V$ to the variance of $U$. In general here we have
$$\mathrm{Var}\left(U(\beta)\right) = \sum_i x_i^2\, \mathrm{Var}(Y_i).$$
If $Y_i$ is Poisson with mean $\mu_i$ (and so $\mathrm{Var}(Y_i) = \mu_i$) this is
$$\mathrm{Var}\left(U(\beta)\right) = \sum_i x_i^2 \mu_i.$$
Moreover we have
$$V(\beta) = \sum_i x_i^2 \mu_i$$
and so $\mathrm{Var}(U(\beta)) = V(\beta)$ in the Poisson case.
The central limit theorem (the Lyapunov kind) will show that $U(\beta)$ has an approximate normal distribution with variance
$$\sigma_U^2 = \sum_i x_i^2\, \mathrm{Var}(Y_i),$$
and so
$$\hat\beta - \beta \approx N\left(0,\ \frac{\sum_i x_i^2\, \mathrm{Var}(Y_i)}{\left[\sum_i x_i^2 \mu_i\right]^2}\right).$$
If
$$\mathrm{Var}(Y_i) = \mu_i,$$
as it is for the Poisson case, the asymptotic variance simplifies to
$$\frac{1}{\sum_i x_i^2 \mu_i}.$$
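The sketch below (my own illustration; the data are simulated) solves the estimating equation $U(\beta)=\sum_i x_i(Y_i - e^{\beta x_i})=0$ by Newton-Raphson and computes both variance formulas: the model-based one, $1/\sum_i x_i^2\mu_i$, and the more general form above with $\mathrm{Var}(Y_i)$ estimated by $(Y_i-\hat\mu_i)^2$ (that plug-in estimate is my addition, in the spirit of the sandwich variance used with estimating equations):

```python
import numpy as np

rng = np.random.default_rng(6)
n, beta_true = 500, 0.3
x = rng.uniform(0.0, 2.0, size=n)
y = rng.poisson(np.exp(beta_true * x))

beta = 0.0                                   # starting value
for _ in range(25):                          # Newton-Raphson on U(beta) = 0
    mu = np.exp(beta * x)
    U = np.sum(x * (y - mu))
    V = np.sum(x ** 2 * mu)                  # V(beta) = -U'(beta)
    beta += U / V

mu = np.exp(beta * x)
V = np.sum(x ** 2 * mu)
var_model = 1 / V                                        # valid when Var(Y_i) = mu_i (Poisson)
var_general = np.sum(x ** 2 * (y - mu) ** 2) / V ** 2    # only needs E(Y_i) = mu_i
print(beta, var_model, var_general)
```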
Notice that other estimating equations are possible. People suggest alternatives very often. If $w_i$ is any set of deterministic weights (even possibly depending on the parameter) then we could define
$$U(\beta) = \sum_i w_i x_i (Y_i - \mu_i)$$
and still conclude that $U=0$ probably has a consistent root which has an asymptotic normal distribution. The idea is widely used: see, e.g., Zeger and Liang's idea of Generalized Estimating Equations (GEE), which the econometricians call Generalized Method of Moments.
Problems with maximum likelihood
1. In problems with many parameters the approximations don't work very well and maximum likelihood estimators can be far from the right answer. See your homework for the Neyman Scott example where the mle is not consistent.

2. When there are multiple roots of the likelihood equation you must choose the right root. To do so you might start with a different consistent estimator and then apply some iterative scheme like Newton-Raphson to the likelihood equations to find the mle. It turns out that not many steps of NR are generally required if the starting point is a reasonable estimate; a sketch of this strategy is given below.
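Here is a sketch of the strategy in item 2 (my own example, reusing the Cauchy location model from earlier, with the sample median as the preliminary consistent estimator):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_cauchy(size=100) + 5.0      # Cauchy data with true location 5

def U(theta):                                # score for the Cauchy location model
    d = x - theta
    return np.sum(2 * d / (1 + d ** 2))

def Uprime(theta):
    d = x - theta
    return np.sum(2 * (d ** 2 - 1) / (1 + d ** 2) ** 2)

theta = np.median(x)                         # preliminary consistent estimator
for _ in range(5):                           # a few Newton-Raphson steps usually suffice
    theta -= U(theta) / Uprime(theta)
print(theta)
```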
Finding (good) preliminary Point Estimates
Method of Moments
Basic strategy: set sample moments equal to population moments and
solve for the parameters.
Definition: The $k$th sample moment (about the origin) is
$$\hat\mu_k' = \frac{1}{n}\sum_{i=1}^n X_i^k.$$
The $k$th population moment is
$$\mu_k' = E(X^k).$$
(Central moments are
$$\hat\mu_k = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^k \qquad\text{and}\qquad \mu_k = E\left[(X-\mu)^k\right].)$$
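For example, for a Gamma distribution with shape $\alpha$ and scale $\beta$ we have $E(X)=\alpha\beta$ and $\mathrm{Var}(X)=\alpha\beta^2$, so matching the first two moments gives $\tilde\beta = \hat\sigma^2/\bar X$ and $\tilde\alpha = \bar X/\tilde\beta$. A short sketch of the computation (my own example, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.gamma(shape=3.0, scale=2.0, size=1000)

xbar = x.mean()
s2 = x.var()                      # second central sample moment (divides by n)
beta_mm = s2 / xbar               # matches Var(X) = alpha * beta^2 and E(X) = alpha * beta
alpha_mm = xbar / beta_mm
print(alpha_mm, beta_mm)          # useful as starting values for Newton-Raphson on the likelihood
```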
Richard Lockhart
2000-02-11