STAT 330 Lecture 30
Reading for Today's Lecture: 12.1, 12.2, 12.3
Goals of Today's Lecture: the simple linear regression model, distribution theory for the estimates, confidence intervals and tests for the slope, residual plots, and prediction.
Simple Linear Regression Model:
We assume for each observation a model equation of the form
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,
where the errors \epsilon_i are independent with mean 0 and common variance \sigma^2.
Distribution theory for estimates:
The least squares estimates \hat\beta_1 and \hat\beta_0 are given by
\hat\beta_1 = S_{xy} / S_{xx}
and
\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}
where
S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2
and
S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})
and the error variance \sigma^2 is estimated by
MSE = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 / (n-2).
If we also assume that the errors are N(0, \sigma^2) then
t = (\hat\beta_1 - \beta_1) / \sqrt{MSE / S_{xx}}
has a t distribution on n-2 degrees of freedom.
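The estimates and the t ratio above can be sketched in a few lines of Python; the data here are made up for illustration, not from the lecture:

```python
# Sketch: least squares estimates and the slope's t ratio for
# simple linear regression, with made-up data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                    # estimated slope
b0 = ybar - b1 * xbar             # estimated intercept

# residual variance estimate on n - 2 degrees of freedom
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

se_b1 = math.sqrt(mse / sxx)      # standard error of the slope
t = b1 / se_b1                    # t statistic for H0: slope = 0
```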
We can use this distribution theory to test hypotheses and give confidence intervals:
Confidence Intervals
The inequality |t| \le t_{\alpha/2, n-2} can be solved as usual to get the interval
\hat\beta_1 \pm t_{\alpha/2, n-2} \sqrt{MSE / S_{xx}}.
This confidence interval, which is exact for normally distributed errors, can also be used in large samples for non-normal errors.
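As a numerical sketch, with a hypothetical slope estimate and standard error and n - 2 = 12 degrees of freedom (the tables give t_{0.025,12} = 2.179):

```python
# Sketch: 95% confidence interval for the slope. The estimate and
# standard error below are hypothetical numbers for illustration.
b1_hat = 1.711     # hypothetical estimated slope
se_b1 = 0.0997     # hypothetical standard error
t_crit = 2.179     # t_{0.025, 12} from t tables

lo = b1_hat - t_crit * se_b1
hi = b1_hat + t_crit * se_b1
print(f"95% CI for slope: ({lo:.3f}, {hi:.3f})")
# prints: 95% CI for slope: (1.494, 1.928)
```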
Hypothesis Tests
We can test H_0: \beta_1 = \beta_{1,0} by computing
t = (\hat\beta_1 - \beta_{1,0}) / \sqrt{MSE / S_{xx}}
and getting P values from t tables. Again this test can be used in large samples even if the errors are not normal. The most common value for \beta_{1,0} is 0. In this case
t = \hat\beta_1 / \sqrt{MSE / S_{xx}}
and
t^2
is the F statistic from the ANOVA table.
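As a numerical check of the identity t^2 = F, here is a sketch using the slope estimate and standard error reported in the SAS output later in this lecture:

```python
# Sketch: with beta_{1,0} = 0, the square of the slope's t statistic
# equals the F statistic in the one-degree-of-freedom ANOVA line.
# Numbers taken from the SAS output in this lecture.
b1_hat = 1.71143233
se_b1 = 0.09968772

t = b1_hat / se_b1   # SAS reports T = 17.17
f = t ** 2           # SAS reports F Value = 294.74
```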
Residual Plots
After fitting the model you should examine residual plots. The fitted residuals are defined by
\hat\epsilon_i = Y_i - \hat\beta_0 - \hat\beta_1 x_i.
You should plot: the residuals against x, the residuals against the fitted values \hat{Y}_i, and a normal quantile-quantile (Q-Q) plot of the residuals.
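A minimal sketch of computing the fitted values and residuals (the coefficients and data below are hypothetical):

```python
# Sketch: fitted residuals e_i = Y_i - b0 - b1 * x_i for an assumed
# fitted line. Coefficients and data are made up for illustration.
b0, b1 = 0.5, 2.0
x = [1.0, 2.0, 3.0]
y = [2.4, 4.6, 6.4]

yhat = [b0 + b1 * xi for xi in x]                 # fitted values
resid = [yi - yh for yi, yh in zip(y, yhat)]      # fitted residuals
# plot resid against x and against yhat and look for patterns
```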
An experiment was conducted to relate a variable Y, the production of nitrous oxides, to a variable x, the "Burner Area Liberation Rate" (a measure of energy produced per square foot of area of some burner in a power plant). The data are:
x | 100 | 125 | 125 | 150 | 150 | 200 | 200 | 250 | 250 | 300 | 300 | 350 | 400 | 400 |
y | 150 | 140 | 180 | 210 | 190 | 320 | 280 | 400 | 430 | 440 | 390 | 600 | 610 | 670 |
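As a check, a few lines of Python (my own sketch, not part of the original analysis) reproduce the least squares estimates that SAS reports for these data:

```python
# Sketch: least squares fit to the burner data above. This should
# reproduce the SAS estimates (slope about 1.7114, intercept about -45.55).
x = [100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, 400]
y = [150, 140, 180, 210, 190, 320, 280, 400, 430, 440, 390, 600, 610, 670]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx            # estimated slope
b0 = ybar - b1 * xbar     # estimated intercept
```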
Here is a plot of the data:
I used SAS to fit the regression model. In particular I used proc glm (glm stands for general linear model). Here is the SAS code:
options pagesize=60 linesize=80;
data nox;
  infile 'ch12q9.dat';
  input area emission;
proc glm data=nox;
  model emission = area;
  output out=noxfit p=yhat r=resid;
proc univariate data=noxfit plot normal;
  var resid;
proc plot;
  plot resid*area;
  plot resid*yhat;
run;

The line labelled model says that I am interested in the effects of area (my shorthand name for "Burner Area Liberation Rate") on emissions.
The output from proc glm is
The SAS System          10:00 Monday, November 20, 1995

General Linear Models Procedure
Number of observations in data set = 14

Dependent Variable: EMISSION

                                    Sum of           Mean
Source             DF             Squares         Square   F Value   Pr > F
Model               1        398030.26093   398030.26093    294.74   0.0001
Error              12         16205.45335     1350.45445
Corrected Total    13        414235.71429

  R-Square          C.V.       Root MSE    EMISSION Mean
  0.960879      10.26905      36.748530        357.85714

Source             DF           Type I SS    Mean Square   F Value   Pr > F
AREA                1        398030.26093   398030.26093    294.74   0.0001

Source             DF         Type III SS    Mean Square   F Value   Pr > F
AREA                1        398030.26093   398030.26093    294.74   0.0001

                                 T for H0:              Std Error of
Parameter         Estimate     Parameter=0   Pr > |T|      Estimate
INTERCEPT     -45.55190539           -1.79     0.0989   25.46779420
AREA            1.71143233           17.17     0.0001    0.09968772
The conclusions are that AREA has a very significant and strong effect on emissions, that the intercept of the linear regression might be 0 (P = 0.0989), and that the estimated slope is \hat\beta_1 = 1.711 with standard error 0.0997.
The diagnostic plots show one possible outlier in Y at x = 300.
Plot of RESID*AREA. Legend: A = 1 obs, B = 2 obs, etc.
RESID |
|
60 +
|
|
|
| A A
|
40 +
|
|
| A
|
|A A
20 +
| A
|
| A
|
|
0 + A
|
|
|
|
| A
-20 + A
|
| A
| A A
|
|
-40 +
|
|
|
|
|
-60 +
|
|
|
|
| A
-80 +
|
-+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
100 125 150 175 200 225 250 275 300 325 350 375 400
AREA
Here is a Q-Q plot of the residuals
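As a sketch of what a Q-Q plot computes: the sorted residuals are plotted against standard normal quantiles. The residuals below are hypothetical, and the plotting positions (i + 0.5)/n are one common convention:

```python
# Sketch: coordinates for a normal Q-Q plot of residuals.
# Residuals are made up; plotting positions (i + 0.5)/n are one
# common convention among several.
from statistics import NormalDist

resid = [-12.0, -5.5, -1.0, 0.5, 3.0, 4.5, 10.5]   # hypothetical residuals
n = len(resid)

sorted_r = sorted(resid)
theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# plotting sorted_r against theo should look roughly linear
# if the errors are normally distributed
pairs = list(zip(theo, sorted_r))
```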
Prediction Intervals
University admissions officers would like to guess a student's GPA at the end of first year on the basis of her/his high school record. In the simplest case that high school record might be summarized by x, the high school GPA. The mathematical version of this problem is that there is a data set of pairs (x_i, Y_i), i = 1, \ldots, n, and a new individual with covariate value x for which we desire to guess the corresponding Y. A related, but different problem is to guess the average first year GPA for a large group of students whose high school GPA is x.
We will use the following notation: Y denotes the response of the new individual, \mu(x) = \beta_0 + \beta_1 x denotes the mean response at x, and \hat{Y} denotes our prediction.
If we have fitted a simple linear regression model to our data set, obtaining estimated slope \hat\beta_1 and intercept \hat\beta_0, then we predict both the individual and the average of the group using the regression line:
\hat{Y} = \hat\mu(x) = \hat\beta_0 + \hat\beta_1 x.
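A minimal sketch of the prediction step (the fitted coefficients are hypothetical and x = 275 is an arbitrary new value):

```python
# Sketch: both the individual prediction and the estimated group mean
# at a new x use the same fitted line. Coefficients are hypothetical.
b0, b1 = -45.55, 1.7114
x_new = 275

y_pred = b0 + b1 * x_new   # predicted Y = estimated mean response at x_new
```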
Next lecture we will develop the theory to get an estimate of the likely size of the prediction error Y - \hat{Y}, a prediction interval for Y (of the form \hat{Y} \pm t_{\alpha/2, n-2} \, \widehat{SE}), and a standard error and confidence interval for \mu(x).