STAT 330 Lecture 22
Reading for Today's Lecture: 10.1.
Goals of Today's Lecture:
Today's notes
Residual Analysis
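Details: Residuals are the observations minus their group means, $\hat\epsilon_{ij} = X_{ij} - \bar X_{i\cdot}$.
Fitted values are the group means, $\hat\mu_i = \bar X_{i\cdot}$.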
Making a Q-Q plot: Plot sorted residuals against ``normal quantiles''.
These are the points which split the area under the normal curve into n+1 equal pieces.
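The normal quantiles described above can be computed directly. This is a minimal sketch using only the Python standard library; the function names `normal_quantiles` and `qq_points` are illustrative choices, not anything from the course software.

```python
from statistics import NormalDist


def normal_quantiles(n):
    """Points splitting the area under the standard normal curve into n+1 equal pieces."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]


def qq_points(residuals):
    """Pair each normal quantile with the corresponding sorted residual."""
    quantiles = normal_quantiles(len(residuals))
    return list(zip(quantiles, sorted(residuals)))


# With n = 24 residuals the areas to the left are 1/25, 2/25, ..., 24/25.
q = normal_quantiles(24)
print(round(q[0], 2), round(q[-1], 2))
```

Plotting the pairs returned by `qq_points` gives the Q-Q plot; a roughly straight line is what one hopes to see.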
For the coagulation data, here is a dot plot of the residuals. Each point is labelled according to the corresponding diet. There are too few points for a histogram to really work and also too few to warrant separate plots for each of the 4 groups.
I am looking for signs of non-normality or for outlying residuals. I see no sign of any problems here. I am also looking for evidence that the assumption of homoscedasticity (constant variance) is wrong; I don't see such evidence.
We do not have time order information for the coagulation data. Here is a plot of residual versus fitted value. Again I have labelled the group.
I see no problem here. I am looking for a trend in the variation with the mean; in more sophisticated models I would also be looking for evidence that for certain ranges of fitted values the residuals were either predominantly negative or predominantly positive, indicating a failure of the model equation.
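The numbers behind a residual versus fitted value plot can be sketched as follows. The data here are hypothetical (they are not the coagulation data), and the sketch assumes the one-way layout in which the fitted value for an observation is its group mean. Printing the spread of the residuals group by group, in order of fitted value, is a crude numerical check for a trend in the variation with the mean.

```python
from statistics import mean, stdev

# Hypothetical data, NOT the coagulation data: three groups of four observations.
groups = {"a": [10, 12, 11, 13], "b": [14, 15, 13, 16], "c": [9, 8, 10, 9]}


def fitted_and_residuals(groups):
    """Fitted value = group mean; residual = observation minus its group mean."""
    fitted = {g: mean(obs) for g, obs in groups.items()}
    residuals = {g: [x - fitted[g] for x in obs] for g, obs in groups.items()}
    return fitted, residuals


fitted, residuals = fitted_and_residuals(groups)

# List groups in order of fitted value with the standard deviation of their residuals.
for g in sorted(groups, key=fitted.get):
    print(g, fitted[g], round(stdev(residuals[g]), 3))
```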
Finally here is a Q-Q plot. There are 24 points so $n+1=25$. The normal quantiles are the points on the normal curve so that the area to the left of them is $1/25, 2/25, \ldots, 24/25$. For instance the first normal quantile is -1.75 because the table shows that the area to the left of -1.75 is 0.04 = 1/25.
The plot is acceptably straight; there does not seem to be a major problem with assuming the population distributions are normal.
In practice you will make the Q-Q plot not by hand but with software; a SAS example is here.
What would I do if I saw problems?
For non-normality, non-constant variance, or a trend in variability with the fitted value I might entertain a transformation of the data such as taking square roots or logs or trying the so-called Box-Cox transformation.
For non-normality with non-constant variance I would consider using a generalized linear model as in STAT 402.
For non-normality, outliers and heteroscedasticity I might try a robust (non-parametric) analysis using trimmed means, medians or ... . See STAT 430.
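The transformations mentioned above can be sketched in a few lines. The version below uses the usual textbook form of the Box-Cox family, $(y^\lambda - 1)/\lambda$ with the logarithm at $\lambda = 0$; the function name `box_cox` is an illustrative choice, and choosing $\lambda$ from the data is not attempted here.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of positive data: (y**lam - 1)/lam, or log(y) when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# lam = 0.5 is a shifted and rescaled square root; lam = 0 is the log transform.
print(box_cox([1, 4, 9], 0.5))
print(box_cox([1, 10, 100], 0))
```

After transforming, the residual analysis would be repeated on the transformed responses to see whether the problem has gone away.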
Confidence Intervals
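Usually we are interested in confidence intervals for differences between group means, that is, for things like $\mu_1 - \mu_2$. Let
$$\widehat{\text{SE}} = \sqrt{\text{MSE}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} ,$$
where $n_i$ is the number of observations in group $i$ and $\bar X_{i\cdot}$ denotes the sample mean of group $i$.
Then it is a fact that
$$\frac{\bar X_{1\cdot} - \bar X_{2\cdot} - (\mu_1 - \mu_2)}{\widehat{\text{SE}}} \sim t_\nu$$
where $\nu$ is the degrees of freedom used in computing the MSE (which will usually be more than $n_1 + n_2 - 2$, as would be the case for a two sample comparison).
Thus
$$\bar X_{1\cdot} - \bar X_{2\cdot} \pm t_{\nu,\alpha/2}\, \widehat{\text{SE}} ,$$
with $t_{\nu,\alpha/2}$ the upper $\alpha/2$ critical point of the $t_\nu$ distribution,
is a level $1-\alpha$ confidence interval for $\mu_1 - \mu_2$.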
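As a worked illustration, the interval can be computed as estimate plus or minus a $t$ critical value times $\sqrt{\text{MSE}(1/n_i + 1/n_j)}$, with the MSE pooled over all groups. This is a sketch under stated assumptions: the data are hypothetical (not the coagulation data), the function names are illustrative, and the critical value is passed in from a $t$ table rather than computed.

```python
from math import sqrt

# Hypothetical data, NOT the coagulation data: three groups of four observations.
groups = {"a": [10, 12, 11, 13], "b": [14, 15, 13, 16], "c": [9, 8, 10, 9]}


def pooled_mse(groups):
    """Return (MSE, degrees of freedom) pooled over all groups."""
    sse = 0.0
    n_total = 0
    for obs in groups.values():
        group_mean = sum(obs) / len(obs)
        sse += sum((x - group_mean) ** 2 for x in obs)
        n_total += len(obs)
    df = n_total - len(groups)  # total sample size minus number of groups
    return sse / df, df


def difference_ci(groups, i, j, t_crit):
    """Confidence interval for mu_i - mu_j using the pooled MSE from all groups."""
    mse, _ = pooled_mse(groups)
    xi, xj = groups[i], groups[j]
    estimate = sum(xi) / len(xi) - sum(xj) / len(xj)
    margin = t_crit * sqrt(mse * (1 / len(xi) + 1 / len(xj)))
    return estimate - margin, estimate + margin


mse, df = pooled_mse(groups)  # df is 9 here, not the 6 of a two sample comparison
# 2.262 is the tabled upper 2.5% point of t with 9 degrees of freedom (95% interval).
lower, upper = difference_ci(groups, "a", "b", t_crit=2.262)
print(df, round(lower, 3), round(upper, 3))
```

Note that the third group contributes to the MSE and its degrees of freedom even though its mean does not appear in the difference being estimated.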