STAT 801 Lecture 16

Reading for Today's Lecture:

Goals of Today's Lecture:


Optimality theory for point estimates

Cramér Rao Inequality

Suppose T(X) is an unbiased estimate of $\theta$. Then

\begin{displaymath}{\rm Var}_\theta(T) \ge \frac{1}{I(\theta)}
\end{displaymath}

is called the Cramér Rao Lower Bound. The inequality is strict unless the correlation between the score $U(\theta)$ and T is $\pm 1$, which would require that

\begin{displaymath}U(\theta) = A(\theta) T(X) + B(\theta)
\end{displaymath}

for some non-random constants A and B (which might depend on $\theta$). This would prove that

\begin{displaymath}\ell(\theta) = A^*(\theta) T(X) + B^*(\theta) + C(X)
\end{displaymath}

for some further constants $A^*$ and $B^*$ and finally

\begin{displaymath}f(x,\theta) = h(x) e^{A^*(\theta)T(x)+B^*(\theta)}
\end{displaymath}

for $h=e^C$.
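As a quick side example of this exponential family form: the Poisson($\theta$) family has

\begin{displaymath}f(x,\theta) = \frac{e^{-\theta}\theta^x}{x!}
= \frac{1}{x!} e^{x\log\theta - \theta}
\end{displaymath}

so that $T(x)=x$, $A^*(\theta)=\log\theta$, $B^*(\theta)=-\theta$ and $h(x)=1/x!$; for a sample of size n we get $I(\theta)=n/\theta$ and ${\rm Var}_\theta(\overline{X})=\theta/n$, so $\overline{X}$ attains the CRLB for $\theta$.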

Summary of Implications

What can we do to find UMVUEs when the CRLB is a strict inequality?

Example: Suppose X has a Binomial(n,p) distribution. The score function is

\begin{displaymath}U(p) =
\frac{1}{p(1-p)} X - \frac{n}{1-p}
\end{displaymath}
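For completeness, here is the intermediate algebra (using the Binomial log likelihood $\ell(p) = X\log(p) + (n-X)\log(1-p)$):

\begin{displaymath}U(p) = \frac{X}{p} - \frac{n-X}{1-p}
= \frac{X-np}{p(1-p)}
= \frac{1}{p(1-p)} X - \frac{n}{1-p}
\end{displaymath}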

The CRLB will be strict unless T=cX for some c. If we are trying to estimate p, then choosing $c=1/n$ does give an unbiased estimate $\hat p = X/n$; moreover T=X/n achieves the CRLB, so it is UMVU.
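To check directly that the bound is attained: from the score above,

\begin{displaymath}I(p) = {\rm Var}_p(U(p)) = \frac{{\rm Var}_p(X)}{[p(1-p)]^2}
= \frac{np(1-p)}{p^2(1-p)^2} = \frac{n}{p(1-p)}
\end{displaymath}

while

\begin{displaymath}{\rm Var}_p(\hat p) = \frac{p(1-p)}{n} = \frac{1}{I(p)} \, .
\end{displaymath}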

Different tactic: Suppose T(X) is some unbiased function of X. Then we have

\begin{displaymath}E_p(T(X)-X/n) \equiv 0\end{displaymath}

because $\hat p = X/n$ is also unbiased. If h(k) = T(k)-k/n then

\begin{displaymath}E_p(h(X)) = \sum_{k=0}^n h(k)
\dbinom{n}{k} p^k (1-p)^{n-k}
\equiv 0
\end{displaymath}

The left hand side of the $\equiv$ sign is a polynomial function of p, as is the right hand side. Thus if the left hand side is expanded out, the coefficient of each power $p^k$ is 0. The constant term occurs only in the term k=0 and its coefficient is

\begin{displaymath}h(0)
\dbinom{n}{0}= h(0)
\end{displaymath}

Thus h(0) = 0. Now $p^1=p$ occurs only in the term k=1, with coefficient nh(1), so h(1)=0. Since the terms with k=0 or 1 are 0, the power $p^2$ occurs only in the term with k=2, with coefficient

n(n-1)h(2)/2

so h(2)=0. We can continue in this way to see that in fact h(k)=0 for each k and so the only unbiased function of X is X/n.
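As a concrete illustration of the coefficient argument, take n=2; the identity becomes

\begin{displaymath}h(0)(1-p)^2 + 2h(1)p(1-p) + h(2)p^2
= h(0) + 2[h(1)-h(0)]p + [h(0)-2h(1)+h(2)]p^2 \equiv 0
\end{displaymath}

and matching the coefficients of 1, p and $p^2$ in turn gives h(0)=0, then h(1)=0, then h(2)=0.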

A Binomial random variable is a sum of n iid Bernoulli(p) rvs. If $Y_1,\ldots,Y_n$ are iid Bernoulli(p) then $X=\sum Y_i$ is Binomial(n,p). Could we do better than $\hat p = X/n$ by trying $T(Y_1,\ldots,Y_n)$ for some other function T?

Try n=2. There are 4 possible values for $(Y_1,Y_2)$. If $h(Y_1,Y_2) = T(Y_1,Y_2) - [Y_1+Y_2]/2$ then

\begin{displaymath}E_p(h(Y_1,Y_2)) \equiv 0
\end{displaymath}

and we have

\begin{eqnarray*}E_p( h(Y_1,Y_2)) &= & h(0,0)(1-p)^2
\\
&& +
[h(1,0)+h(0,1)]p(1-p)
\\
&& + h(1,1) p^2 \, .
\end{eqnarray*}


This can be rewritten in the form

\begin{displaymath}\sum_{k=0}^n w(k)
\dbinom{n}{k}
p^k(1-p)^{n-k}
\end{displaymath}

where w(0)=h(0,0), 2w(1) = h(1,0)+h(0,1), and w(2) = h(1,1). Just as before it follows that w(0)=w(1)=w(2)=0. This argument can be used to prove that for any unbiased estimate $T(Y_1,\ldots,Y_n)$, the average value of $T(y_1,\ldots,y_n)$ over vectors $y_1,\ldots,y_n$ which have exactly k 1s and n-k 0s is k/n. Now let's look at the variance of T:
\begin{align*}{\rm Var}(T) = & E_p( [T(Y_1,\ldots,Y_n) - p]^2)
\\
=& E_p( [T(Y_1,\ldots,Y_n) - X/n]^2)
\\
& + 2E_p( [T(Y_1,\ldots,Y_n)
-X/n][X/n-p])
\\ & + E_p([X/n-p]^2)
\end{align*}
Claim: the cross product term is 0, which will prove that the variance of T is the variance of X/n plus a non-negative quantity (positive unless $T(Y_1,\ldots,Y_n) \equiv X/n$). Compute the cross product term by writing
\begin{multline*}E_p( [T(Y_1,\ldots,Y_n)-X/n][X/n-p])
\\
=\sum_{y_1,\ldots,y_n} [T(y_1,\ldots,y_n)-\sum y_i/n][\sum y_i/n -p]
\\
\times p^{\sum y_i} (1-p)^{n-\sum y_i}
\end{multline*}
Do the sum by summing over those $y_1,\ldots,y_n$ whose sum is a given integer x and then summing over x. We get
\begin{multline*}E_p( [T(Y_1,\ldots,Y_n)-X/n][X/n-p])
\\
= \sum_{x=0}^n
\left[\sum_{\sum y_i = x} [T(y_1,\ldots,y_n)-x/n]\right][x/n -p]
\\
\times p^{x} (1-p)^{n-x}
\end{multline*}

We have already shown that the sum in square brackets is 0!

This long, algebraically involved, method of proving that $\hat p = X/n$ is the UMVUE of p is one special case of a general tactic.

To get more insight I begin by rewriting
\begin{multline*}E_p(T(Y_1,\ldots,Y_n))
\\
= \sum_{x=0}^n \sum_{\sum y_i = x} T(y_1,\ldots,y_n) p^x(1-p)^{n-x}
\\
= \sum_{x=0}^n \frac{\sum_{\sum y_i = x} T(y_1,\ldots,y_n)}{
\dbinom{n}{x}}
\dbinom{n}{x}
p^x(1-p)^{n-x}
\end{multline*}
Notice that the large fraction in the formula is the average value of T over values of y when $\sum y_i$ is held fixed at x. Notice that the weights in this average do not depend on p. Notice that this average is actually
\begin{multline*}E(T(Y_1,\ldots,Y_n)\vert X=x)
\\
= \sum_{y_1,\ldots,y_n} T(y_1,\ldots,y_n)\\
\times
P(Y_1=y_1,\ldots,Y_n=y_n\vert X=x)
\end{multline*}
Notice that these conditional probabilities do not depend on p. In a sequence of Bernoulli trials, if I tell you that 5 of 17 trials were heads and the rest tails, then the positions of the 5 heads are like a subset chosen at random from the 17 trials: all $\dbinom{17}{5}$ possible subsets have the same chance, and this chance does not depend on p.
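This is easy to verify directly: for any $y_1,\ldots,y_n \in \{0,1\}$ with $\sum y_i = x$,

\begin{displaymath}P(Y_1=y_1,\ldots,Y_n=y_n\vert X=x)
= \frac{p^x(1-p)^{n-x}}{\dbinom{n}{x}p^x(1-p)^{n-x}}
= \frac{1}{\dbinom{n}{x}}
\end{displaymath}

which is free of p.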

Notice: with data $Y_1,\ldots,Y_n$ the log likelihood is

\begin{displaymath}\ell(p) = \sum Y_i \log(p) + (n-\sum Y_i) \log(1-p)
\end{displaymath}

and

\begin{displaymath}U(p) =
\frac{1}{p(1-p)} X - \frac{n}{1-p}
\end{displaymath}

as before. Again the CRLB is strict except for multiples of X. Since the only unbiased multiple of X is $\hat p = X/n$, the UMVUE of p is $\hat p$.
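(In one line: a multiple cX is unbiased for p exactly when

\begin{displaymath}E_p(cX) = cnp \equiv p \quad \mbox{for all } p\in(0,1)
\end{displaymath}

which forces c=1/n.)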

Sufficiency

In the binomial situation the conditional distribution of the data $Y_1,\ldots,Y_n$ given X is the same for all values of $\theta$; we say this conditional distribution is free of $\theta$.

Defn: A statistic T(X) is sufficient for the model $\{ P_\theta;\theta \in \Theta\}$ if the conditional distribution of the data X given T=t is free of $\theta$.

Intuition: Data tell us about $\theta$ if different values of $\theta$ give different distributions to X. If two different values of $\theta$ correspond to the same density or cdf for X we cannot distinguish these two values of $\theta$ by examining X. Extension of this notion: if two values of $\theta$ give the same conditional distribution of X given T, then observing X in addition to T doesn't improve our ability to distinguish the two values.

Mathematically precise version of this intuition: Suppose T(X) is a sufficient statistic and S(X) is any estimate or confidence interval or ... If you only know the value of T then:

1.
You can generate a new data set X* from the conditional distribution of the data given T (this is possible without knowing $\theta$ because the conditional distribution is free of $\theta$).

2.
S(X*) has the same distribution as S(X), so it is just as good a procedure as S(X).

You can carry out the first step only if the statistic T is sufficient; otherwise you need to know the true value of $\theta$ to generate X*.

Example 1: $Y_1,\ldots,Y_n$ iid Bernoulli(p). Given $\sum Y_i = y$ the set of indices of the y successes is equally likely to be any one of the $\dbinom{n}{y}$ subsets of $\{1,\ldots,n\}$ of size y. This chance does not depend on p, so $T(Y_1,\ldots,Y_n) = \sum Y_i$ is a sufficient statistic.

Example 2: $X_1,\ldots,X_n$ iid $N(\mu,1)$. Joint distribution of $X_1,\ldots,X_n,\overline{X}$ is multivariate normal. All entries of mean vector are $\mu$. Variance covariance matrix can be partitioned as

\begin{displaymath}\left[\begin{array}{cc} I_{n \times n} & {\bf 1}_n /n
\\
{\bf 1}_n^t /n & 1/n \end{array}\right]
\end{displaymath}

where ${\bf 1}_n$ is a column vector of n 1s and $I_{n \times n}$ is $n \times n$ identity matrix.

You can now compute the conditional means and variances of Xi given $\overline{X}$ and use the fact that the conditional law is multivariate normal to prove that the conditional distribution of the data given $\overline{X} = x$ is multivariate normal with mean vector all of whose entries are x and variance-covariance matrix given by $I_{n\times n} - {\bf 1}_n{\bf 1}_n^t /n $. Since this does not depend on $\mu$ we find that $\overline{X}$ is sufficient.
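In detail, the usual partitioned-normal formulas give, writing ${\bf X}$ for the data vector $(X_1,\ldots,X_n)^t$,

\begin{align*}E({\bf X}\vert \overline{X} = \bar x) & = \mu {\bf 1}_n
+ ({\bf 1}_n/n)(1/n)^{-1}(\bar x - \mu) = \bar x\, {\bf 1}_n
\\
{\rm Var}({\bf X}\vert \overline{X} = \bar x) & = I_{n\times n}
- ({\bf 1}_n/n)(1/n)^{-1}({\bf 1}_n^t/n)
= I_{n\times n} - {\bf 1}_n{\bf 1}_n^t/n
\end{align*}

neither of which depends on $\mu$.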

WARNING: Whether or not a statistic is sufficient depends on the density function and on $\Theta$.

Theorem: [Rao-Blackwell] Suppose S(X) is a sufficient statistic for model $\{P_\theta,\theta\in\Theta\}$. If T is an estimate of $\phi(\theta)$ then:

1.
E(T|S) is a statistic.

2.
E(T|S) has the same bias as T; if T is unbiased so is E(T|S).

3.
${\rm Var}_\theta(E(T\vert S)) \le {\rm Var}_\theta(T)$ and the inequality is strict unless T is a function of S.

4.
MSE of E(T|S) is no more than MSE of T.

Proof: We first review conditional distributions. The abstract definition of conditional expectation is:

Defn: E(Y|X) is any function of X such that

\begin{displaymath}E\left[R(X)E(Y\vert X)\right] = E\left[R(X) Y\right]
\end{displaymath}

for any function R(X). E(Y|X=x) is a function g(x) such that

g(X) = E(Y|X)

Fact: If (X,Y) has joint density $f_{X,Y}(x,y)$ and conditional density $f(y\vert x)$ then

\begin{displaymath}g(x) = \int y f(y\vert x) dy
\end{displaymath}

satisfies these definitions.

Proof:
\begin{align*}E(R(X)g(X)) & = \int R(x) g(x)f_X(x) dx
\\
& = \int\int R(x) y f(y\vert x) f_X(x) dy dx
\\
&= \int\int R(x)y f_{X,Y}(x,y) dy dx
\\
&= E(R(X)Y)
\end{align*}

Think of E(Y|X) as the average of Y holding X fixed. It behaves like an ordinary expected value, but functions of X alone are treated like constants.
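Here is a brief sketch of how the theorem follows from this definition. Taking $R(S) \equiv 1$ gives $E_\theta\{E(T\vert S)\} = E_\theta(T)$, so E(T|S) has the same bias as T. For the variance claim, write T = E(T|S) + [T - E(T|S)]; taking $R(S) = E(T\vert S) - E_\theta(T)$ in the definition shows that the cross term vanishes, so

\begin{displaymath}{\rm Var}_\theta(T) = {\rm Var}_\theta(E(T\vert S))
+ E_\theta\left\{[T-E(T\vert S)]^2\right\}
\ge {\rm Var}_\theta(E(T\vert S))
\end{displaymath}

with equality only if T is already a function of S. Sufficiency of S is what makes E(T|S) a statistic: the conditional distribution of the data given S is free of $\theta$, so E(T|S) can be computed without knowing $\theta$.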





Richard Lockhart
2000-02-25