STAT 330 Lecture 19

Reading for Today's Lecture: 10.1.

Goals of Today's Lecture:


$I$ samples

Data: $X_{ij}$ is observation $j$ in sample $i$, for $j = 1, \ldots, n_i$ and $i$ from 1 to $I$.

Jargon: ``$I$ levels of some factor influencing the response variable $X$.''

Model:

\[ X_{ij} = \mu_i + \epsilon_{ij}, \qquad \epsilon_{ij} \mbox{ iid } N(0, \sigma^2). \]

First problem of interest: give hypothesis tests for $H_0: \mu_1 = \cdots = \mu_I$.

Technique: ANalysis Of VAriance or ANOVA.

Idea: Compare two independent estimates of $\sigma^2$ using an $F$ test.

The theory:

1: Mean Square for Error or MSE is

\[ \mathrm{MSE} = \frac{\mathrm{SSE}}{n-I} = \frac{1}{n-I} \sum_{i=1}^I \sum_{j=1}^{n_i} (X_{ij} - \bar X_{i\cdot})^2 \]

where $n = \sum_i n_i$ is the total number of observations in all the samples and $I$ is the number of samples.
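As a sketch of the formula above (my own illustration, not part of the notes; the function name and list-of-lists data layout are assumptions), the MSE is just the pooled within-sample variance:

```python
# Illustrative sketch: Mean Square for Error (pooled within-sample variance)
# for I samples of possibly unequal sizes. Names are assumptions.

def mse(samples):
    """Return SSE / (n - I), the Mean Square for Error."""
    I = len(samples)                           # number of samples
    n = sum(len(s) for s in samples)           # total number of observations
    sse = 0.0
    for s in samples:
        xbar = sum(s) / len(s)                 # mean of sample i
        sse += sum((x - xbar) ** 2 for x in s)
    return sse / (n - I)

# Example: group means 2 and 4, SSE = 2 + 8 = 10, n - I = 4.
print(mse([[1, 2, 3], [2, 4, 6]]))  # 2.5
```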

2: Two motivations for the second estimate of $\sigma^2$:

A: If $H_0: \mu_1 = \cdots = \mu_I = \mu$ holds and all the $n_i = J$ then $\bar X_{1\cdot}, \ldots, \bar X_{I\cdot}$ are an iid sample of size $I$ from a population which has a $N(\mu, \sigma^2/J)$ distribution. The sample variance of the $\bar X_{i\cdot}$ is

\[ S_{\bar X}^2 = \frac{1}{I-1} \sum_{i=1}^I (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2 \]

where now

\[ \bar X_{\cdot\cdot} = \frac{1}{I} \sum_{i=1}^I \bar X_{i\cdot} . \]

This sample variance is an estimate of the population variance $\sigma^2/J$ and can be used to estimate $\sigma^2$ by multiplying by $J$ to get

\[ \mathrm{MSTr} = \frac{J}{I-1} \sum_{i=1}^I (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2 . \]

This quantity is called the Mean Square for Treatment (MSTr) or the Mean Square Between Groups. The numerator is called the Sum of Squares for Treatments:

\begin{eqnarray*}
\mathrm{SSTr} & = & \sum_{i=1}^I n_i (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2 \\
\mathrm{MSTr} & = & \frac{\mathrm{SSTr}}{I-1} .
\end{eqnarray*}

The last two formulas work even when the sample sizes $n_i$ are not all equal (with $\bar X_{\cdot\cdot}$ then taken to be the grand mean of all $n$ observations). Our test of $H_0: \mu_1 = \cdots = \mu_I$ is based on the ratio of these two variance estimates:

\[ F = \frac{\mathrm{MSTr}}{\mathrm{MSE}} . \]
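Putting the two estimates together, the whole $F$ ratio can be sketched in a few lines (again an illustration, not from the notes; names and data layout are assumptions, and unequal sample sizes are allowed):

```python
# Illustrative sketch of the one-way ANOVA F statistic, F = MSTr / MSE.

def anova_f(samples):
    I = len(samples)
    n = sum(len(s) for s in samples)
    grand = sum(x for s in samples for x in s) / n           # grand mean
    means = [sum(s) / len(s) for s in samples]               # group means
    sstr = sum(len(s) * (m - grand) ** 2
               for s, m in zip(samples, means))              # SSTr
    sse = sum((x - m) ** 2
              for s, m in zip(samples, means) for x in s)    # SSE
    return (sstr / (I - 1)) / (sse / (n - I))                # MSTr / MSE

# Example: well-separated group means give a large F.
print(anova_f([[1, 2, 3], [4, 5, 6]]))  # 13.5
```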

Fact: If $H_0$ is true then

  1. $\mathrm{SSTr}/\sigma^2 \sim \chi^2_{I-1}$.
  2. $\mathrm{SSE}/\sigma^2 \sim \chi^2_{n-I}$, where the notation is MSE = SSE/(n-I) and SSE stands for Sum of Squares for Error.
  3. SSE and SSTr are independent.
  4. The previous 3 facts prove that

    \[ F = \frac{\mathrm{MSTr}}{\mathrm{MSE}} \sim F_{I-1,\,n-I} . \]

B: Alternative motivation for the test:

\[ E(\mathrm{MSTr}) = \sigma^2 + \frac{J \sum_{i=1}^I (\mu_i - \bar\mu)^2}{I-1} \]

where we define

\[ \bar\mu = \frac{1}{I} \sum_{i=1}^I \mu_i . \]

A natural estimate of $\sum_i (\mu_i - \bar\mu)^2$ is

\[ \sum_{i=1}^I (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2 \]

(we are just plugging in sample means for population means).

BUT: we can compute the expected value of this estimate and get

\[ E\left[ \sum_{i=1}^I (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2 \right] = \sum_{i=1}^I (\mu_i - \bar\mu)^2 + \frac{(I-1)\sigma^2}{J} \]

so that the natural estimate tends to be a bit more than what we want to estimate. We divide by an estimate of $(I-1)\sigma^2/J$, namely $(I-1)\mathrm{MSE}/J$, to get

\[ F = \frac{J \sum_{i=1}^I (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2}{(I-1)\,\mathrm{MSE}} = \frac{\mathrm{MSTr}}{\mathrm{MSE}} \]

as an estimate of

\[ 1 + \frac{J \sum_{i=1}^I (\mu_i - \bar\mu)^2}{(I-1)\sigma^2} . \]

Thus the null hypothesis predicts $F \approx 1$ while the alternative predicts $F > 1$; we will reject $H_0$ for large values of $F$.
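The claim that the null hypothesis predicts $F \approx 1$ can be checked by simulation. This sketch (my own illustration, not part of the notes; the group sizes and replication count are arbitrary) draws balanced normal samples under $H_0$ and averages the resulting $F$ ratios; the true mean of an $F_{I-1,\,n-I}$ variable is $(n-I)/(n-I-2)$, just above 1.

```python
import random

# Simulate the ANOVA F ratio under H0 (all population means equal) and
# check its average is near (n - I)/(n - I - 2), the mean of F_{I-1, n-I}.

def f_ratio(samples):
    I = len(samples)
    n = sum(len(s) for s in samples)
    grand = sum(x for s in samples for x in s) / n
    means = [sum(s) / len(s) for s in samples]
    sstr = sum(len(s) * (m - grand) ** 2 for s, m in zip(samples, means))
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    return (sstr / (I - 1)) / (sse / (n - I))

random.seed(1)
I, J, reps = 4, 6, 2000                      # n - I = 20 here
fs = [f_ratio([[random.gauss(0, 1) for _ in range(J)] for _ in range(I)])
      for _ in range(reps)]
avg = sum(fs) / reps
print(round(avg, 2))  # close to 20/18, i.e. about 1.11
```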

ANOVA Tables

We generally record the arithmetic of our analysis in a table called an ANOVA table.

                       Sum of    Mean                                                       Expected
Source      df         Squares   Square              F          P                           Mean Square
Treatments  $I-1$      SSTr      MSTr = SSTr/(I-1)   MSTr/MSE   $P(F_{I-1,I(J-1)} \ge F)$   $\sigma^2 + J\sum_i(\mu_i-\bar\mu)^2/(I-1)$
Error       $I(J-1)$   SSE       MSE = SSE/(I(J-1))                                         $\sigma^2$
Total       $IJ-1$     SSTot

Remark: The only easily interpreted number in this table is the P value.

Remark: A central point of ANOVA tables is that the columns labelled df and Sum of Squares each add up to the Total line.

Remark: The table is traditionally filled in by calculating two lines and filling in the rest by subtraction. In a computer age this is no longer relevant, and neither are the short cut formulas for computing the sums of squares by hand (see the top of page 398 in the text for formulas involving subtraction of two squares).
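Indeed, the whole table can be computed at once. This sketch (illustrative names, not from the notes; equal or unequal sample sizes) fills every cell directly and demonstrates the additivity of the df and Sum of Squares columns:

```python
# Illustrative one-way ANOVA table as a dict of (df, SS, MS) rows.

def anova_table(samples):
    I = len(samples)
    n = sum(len(s) for s in samples)
    grand = sum(x for s in samples for x in s) / n
    means = [sum(s) / len(s) for s in samples]
    sstr = sum(len(s) * (m - grand) ** 2 for s, m in zip(samples, means))
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    sstot = sum((x - grand) ** 2 for s in samples for x in s)
    return {
        "Treatments": (I - 1, sstr, sstr / (I - 1)),
        "Error":      (n - I, sse, sse / (n - I)),
        "Total":      (n - 1, sstot, None),
    }

tbl = anova_table([[1, 2, 3], [4, 5, 6]])
# The df and Sum of Squares columns each add up to the Total line:
assert tbl["Treatments"][0] + tbl["Error"][0] == tbl["Total"][0]
assert abs(tbl["Treatments"][1] + tbl["Error"][1] - tbl["Total"][1]) < 1e-9
```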

Why does the table add up?

Pythagoras's Theorem:

If $x$ and $y$ are perpendicular vectors in $R^n$ then

\[ \|x + y\|^2 = \|x\|^2 + \|y\|^2 . \]
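A quick numeric check of the theorem in $R^3$ (the vectors are made up for illustration):

```python
# Pythagoras: for perpendicular x and y, |x + y|^2 = |x|^2 + |y|^2.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

x = [1.0, 2.0, 2.0]
y = [2.0, 1.0, -2.0]
assert dot(x, y) == 0.0            # x and y are perpendicular
z = [a + b for a, b in zip(x, y)]  # z = x + y
print(dot(z, z), dot(x, x) + dot(y, y))  # the two squared lengths agree
```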

The sum of squares decomposition in one example

The data consist of blood coagulation times for 24 animals, each fed one of 4 different diets. In the following I write the data in a table and decompose that table into a sum of several tables. The 4 columns of the table correspond to Diets A, B, C and D. Later in the course we will do matrix linear algebra and will then want to think of stacking these 24 values into a single column vector, but the tables save space.

\[ X_{ij} = \bar X_{\cdot\cdot} + (\bar X_{i\cdot} - \bar X_{\cdot\cdot}) + (X_{ij} - \bar X_{i\cdot}) \]

Squaring and summing the entries of each of these arrays gives the sum of squares decomposition. On the left hand side is $\sum_{i,j} X_{ij}^2$. This is the uncorrected total sum of squares. The first term on the right hand side gives $24 \bar X_{\cdot\cdot}^2$. This term is sometimes put in ANOVA tables as the Sum of Squares due to the Grand Mean, but it is usually subtracted from the total to produce the Total Sum of Squares we usually put at the bottom of the table, often called the Corrected (or Adjusted) Total Sum of Squares. In this case the corrected sum of squares is the squared length of the table with entries

\[ X_{ij} - \bar X_{\cdot\cdot} \]

which is 340.

The second term on the right hand side of the equation has squared length equal to the Treatment Sum of Squares produced by SAS. The formula for this Sum of Squares is

\[ \mathrm{SSTr} = \sum_{i=1}^4 n_i (\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2 \]

but I want you to see that the formula is just the squared length of the vector of individual sample means minus the grand mean. The last vector of the decomposition is called the residual vector; its squared length is the Error Sum of Squares, SSE.

Corresponding to the decomposition of the total squared length of the data vector is a decomposition of its dimension, 24, into the dimensions of subspaces. For instance, the grand mean is always a multiple of the single vector all of whose entries are 1; this describes a one dimensional space. The second vector, of deviations of the sample means from the grand mean, lies in the three dimensional subspace of tables which are constant in each column and have a total equal to 0. Similarly, the vector of residuals lies in a 20 dimensional subspace: the set of all tables each of whose columns sums to 0. This decomposition of dimensions is the decomposition of degrees of freedom. So 24 = 1 + 3 + 20, and the degrees of freedom for treatment and error are 3 and 20 respectively. The vector whose squared length is the Corrected Total Sum of Squares lies in the 23 dimensional subspace of vectors whose entries sum to 0; this produces the 23 total degrees of freedom in the usual ANOVA table.
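The coagulation data themselves are not reproduced here, but the decomposition can be sketched on any small balanced table. The numbers below are made up for illustration (2 groups of 3), and the code checks that the squared lengths of the three pieces add to the uncorrected total sum of squares, just as the dimensions add:

```python
# Decompose X_ij = grand mean + (group mean - grand mean) + residual
# and verify the sum of squares decomposition. Data are hypothetical.

data = [[61.0, 63.0, 65.0], [66.0, 68.0, 70.0]]
n = sum(len(col) for col in data)
grand = sum(x for col in data for x in col) / n
means = [sum(col) / len(col) for col in data]

ss_grand = n * grand ** 2                                            # grand mean piece
ss_treat = sum(len(col) * (m - grand) ** 2
               for col, m in zip(data, means))                       # treatment piece
ss_resid = sum((x - m) ** 2
               for col, m in zip(data, means) for x in col)          # residual piece
ss_total = sum(x ** 2 for col in data for x in col)                  # uncorrected total

assert abs(ss_grand + ss_treat + ss_resid - ss_total) < 1e-9
# Degrees of freedom decompose the dimension the same way: 6 = 1 + 1 + 4.
print(ss_treat, ss_resid)
```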





Richard Lockhart
Mon Feb 9 13:32:20 PST 1998