
STAT 330 Lecture 30

Reading for Today's Lecture: 12.1, 12.2, 12.3

Goals of Today's Lecture:


Simple Linear Regression Model:

We assume for each observation a model equation of the form

    Y_i = \beta_0 + \beta_1 x_i + \epsilon_i ,   i = 1, \ldots, n,

where the $x_i$ are known constants, $\beta_0$ (the intercept) and $\beta_1$ (the slope) are unknown parameters, and the errors $\epsilon_i$ are independent with mean 0 and common variance $\sigma^2$.

Distribution theory for estimates:

  1. The least squares estimates $\hat\beta_0$ and $\hat\beta_1$ minimize

     \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2

     and are given by

     \hat\beta_1 = \frac{S_{xy}}{S_{xx}}

     and

     \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x} ,

     where

     S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})
     and
     S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 .

  2. The standard deviation, $\sigma$, of the errors is estimated by

     s = \sqrt{\frac{\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{n-2}} .

  3. Other formulas, useful for computing means and variances, write the estimates as linear combinations of the $Y_i$:

     \hat\beta_1 = \sum_{i=1}^n \frac{(x_i - \bar{x})}{S_{xx}} \, Y_i

     and

     \hat\beta_0 = \sum_{i=1}^n \left[ \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{S_{xx}} \right] Y_i

     (a short derivation of the unbiasedness and variance assertions below, based on this representation, appears just after this list).

  4. $\hat\beta_0$ and $\hat\beta_1$ are unbiased:

     E(\hat\beta_0) = \beta_0

     and

     E(\hat\beta_1) = \beta_1 .

  5. $\mathrm{Var}(\hat\beta_1) = \sigma^2 / S_{xx}$.
  6. $s^2$ is an unbiased estimate of $\sigma^2$.

    If we also assume that the errors are $N(0, \sigma^2)$ then

  7. the least squares estimates are also the maximum likelihood estimates.
  8. $\hat\beta_1$ has a normal distribution with mean $\beta_1$ and variance $\sigma^2/S_{xx}$. NOTE: in samples where the largest $(x_i - \bar{x})^2$ is small compared to $S_{xx}$ (most large samples) this normality is approximately true even if the errors do NOT have a normal distribution.
  9. $\hat\beta_0$ and $\hat\beta_1$ are independent of $s$.
  10. $(n-2)s^2/\sigma^2$ has a $\chi^2$ distribution on n-2 degrees of freedom.
  11. The pivot

      t = \frac{\hat\beta_1 - \beta_1}{s/\sqrt{S_{xx}}}

      has a t distribution on n-2 degrees of freedom.
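Here is the derivation referred to in item 3, using only the linear-combination form of $\hat\beta_1$ and the model assumptions (independent errors with mean 0 and variance $\sigma^2$). Write $\hat\beta_1 = \sum_i c_i Y_i$ with $c_i = (x_i - \bar{x})/S_{xx}$, and note that $\sum_i c_i = 0$ and $\sum_i c_i x_i = \sum_i (x_i - \bar{x}) x_i / S_{xx} = 1$. Then

    E(\hat\beta_1) = \sum_i c_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum_i c_i + \beta_1 \sum_i c_i x_i = \beta_1

and

    \mathrm{Var}(\hat\beta_1) = \sigma^2 \sum_i c_i^2 = \sigma^2 \, \frac{\sum_i (x_i - \bar{x})^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}} ,

which are the assertions in items 4 and 5.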

We can use this distribution theory to test hypotheses and give confidence intervals:

Confidence Intervals

The inequality

    -t_{\alpha/2, n-2} \le \frac{\hat\beta_1 - \beta_1}{s/\sqrt{S_{xx}}} \le t_{\alpha/2, n-2}

can be solved as usual to get the interval

    \hat\beta_1 \pm t_{\alpha/2, n-2} \, \frac{s}{\sqrt{S_{xx}}} .

This confidence interval, which is exact for normally distributed errors, can also be used in large samples with non-normal errors.
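For instance, with the NOx data analysed later in these notes ($\hat\beta_1 = 1.7114$, estimated standard error $s/\sqrt{S_{xx}} = 0.0997$, and $n = 14$ so that $t_{0.025,12} = 2.179$), the 95% confidence interval for the slope works out to

    1.7114 \pm 2.179 \times 0.0997 = 1.7114 \pm 0.2172 ,

that is, roughly (1.49, 1.93).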

Hypothesis Tests

We can test $H_0: \beta_1 = \beta_{1,0}$ by computing

    t = \frac{\hat\beta_1 - \beta_{1,0}}{s/\sqrt{S_{xx}}}

and getting P values from t tables. Again, this test can be used in large samples even if the errors are not normal. The most common value for $\beta_{1,0}$ is 0. In this case

    t = \frac{\hat\beta_1}{s/\sqrt{S_{xx}}}

and

    t^2 = \frac{\hat\beta_1^2 \, S_{xx}}{s^2}

is the F statistic from the ANOVA table.
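In the NOx example below, for instance, the t statistic for testing $\beta_1 = 0$ and its square are

    t = \frac{1.7114}{0.0997} = 17.17 , \qquad t^2 = 17.17^2 \approx 294.7 ,

which agrees, up to rounding, with the F value of 294.74 reported in the ANOVA table.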

Residual Plots

After fitting the model you should examine residual plots. The fitted residuals are defined by

    \hat\epsilon_i = Y_i - \hat{Y}_i = Y_i - \hat\beta_0 - \hat\beta_1 x_i .

You should plot:

  1. the residuals against the covariate x,
  2. the residuals against the fitted values $\hat{Y}_i$, and
  3. a normal probability (Q-Q) plot of the residuals.

(These are exactly the plots produced by the SAS code in the example below.)

Example: see text chapter 12 question 9.

An experiment was conducted to relate a variable Y, the production of nitrous oxides, to a variable x, the "Burner Area Liberation Rate" (a measure of energy produced per square foot of area of some burner in a power plant). The data are

  x:  100  125  125  150  150  200  200  250  250  300  300  350  400  400
  y:  150  140  180  210  190  320  280  400  430  440  390  600  610  670

[Scatter plot of the data, y (emission) against x (area), not reproduced here.]

I used SAS to fit the regression model. In particular I used proc glm (glm stands for general linear model). Here is the SAS code:

  options pagesize=60 linesize=80;   /* layout for the printed output */
  data nox;                          /* read the data */
  infile 'ch12q9.dat';
  input area emission ;
  proc glm  data=nox;                /* fit the simple linear regression */
   model emission = area;
   output out=noxfit p=yhat r=resid ;   /* save fitted values and residuals */
  proc univariate data=noxfit plot normal;   /* summaries and normal plot of residuals */
   var resid;
  proc plot;                         /* line-printer residual plots */
   plot resid*area;
   plot resid*yhat;
  run;
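If the raw data file ch12q9.dat is not at hand, the same data set can be created inline. Here is a minimal sketch (variable names as above; the data typed in from the table):

  data nox;
   input area emission @@;   /* @@ allows several (x, y) pairs per line */
   datalines;
  100 150  125 140  125 180  150 210  150 190  200 320  200 280
  250 400  250 430  300 440  300 390  350 600  400 610  400 670
  ;
  run;

The rest of the program (proc glm onwards) is unchanged.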
The line labelled model says that I am interested in the effects of area (my shorthand name for ``Burner Area Liberation Rate'') on emissions.

The output from proc glm is


               General Linear Models Procedure

           Number of observations in data set = 14


Dependent Variable: EMISSION
                         Sum of            Mean
Source          DF      Squares          Square     F Value     Pr > F

Model            1   398030.26093    398030.26093    294.74     0.0001

Error           12    16205.45335      1350.45445

Corrected Total 13   414235.71429

      R-Square             C.V.        Root MSE        EMISSION Mean

      0.960879         10.26905       36.748530            357.85714


Source      DF        Type I SS     Mean Square   F Value     Pr > F

AREA         1     398030.26093    398030.26093    294.74     0.0001

Source      DF      Type III SS     Mean Square   F Value     Pr > F

AREA         1     398030.26093    398030.26093    294.74     0.0001


                                  T for H0:    Pr > |T|   Std Error of
Parameter            Estimate    Parameter=0                Estimate

INTERCEPT        -45.55190539          -1.79     0.0989    25.46779420
AREA               1.71143233          17.17     0.0001     0.09968772

The conclusions are that AREA has a very significant and strong effect on emissions, that the intercept of the linear regression might be 0 (P = 0.0989), and that the estimated slope is

    \hat\beta_1 = 1.711 , with estimated standard error 0.0997 .
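The other quantities in the listing connect to the notation of these notes as follows: the Error Mean Square is $s^2$, the Root MSE is $s$, and R-Square is the Model Sum of Squares divided by the Corrected Total Sum of Squares. Numerically,

    s^2 = 1350.45 , \qquad s = \sqrt{1350.45} = 36.75 , \qquad R^2 = \frac{398030.26}{414235.71} = 0.9609 ,

matching the Root MSE and R-Square entries printed by proc glm.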

The diagnostic plots show one possible Y outlier at x = 300.

            Plot of RESID*AREA.  Legend: A = 1 obs, B = 2 obs, etc.
RESID |
      |
   60 +
      |
      |
      |
      |                                    A                       A
      |
   40 +
      |
      |
      |                                                                        A
      |
      |A                       A
   20 +
      |                                    A
      |
      |      A
      |
      |
    0 +            A
      |
      |
      |
      |
      |                        A
  -20 +            A
      |
      |                                                A
      |      A                                                                 A
      |
      |
  -40 +
      |
      |
      |
      |
      |
  -60 +
      |
      |
      |
      |
      |                                                A
  -80 +
      |
      -+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
      100   125   150   175   200   225   250   275   300   325   350   375  400
                                         AREA

[Normal Q-Q plot of the residuals, produced by proc univariate, not reproduced here.]

Prediction Intervals

University admissions officers would like to guess a student's GPA at the end of first year on the basis of her/his high school record. In the simplest case that high school record might be summarized by x, the high school GPA. The mathematical version of this problem is that there is a data set of n pairs $(x_1, Y_1), \ldots, (x_n, Y_n)$ and a new individual with covariate value x for whom we desire to guess the corresponding Y. A related, but different, problem is to guess the average first year GPA for a large group of students whose high school GPA is x.

We will use the following notation: x is the covariate value for the new individual (or group), Y is the new individual's (as yet unobserved) response, and $\mu = \beta_0 + \beta_1 x$ is the mean response of individuals whose covariate value is x.

If we have fitted a simple linear regression model to our data set, obtaining estimated slope $\hat\beta_1$ and intercept $\hat\beta_0$, then we predict both the individual value and the group average using the regression line:

    \hat{Y} = \hat\mu = \hat\beta_0 + \hat\beta_1 x .

Next lecture we will develop the theory to get an estimate of the likely size of the prediction error $Y - \hat{Y}$, a prediction interval for Y (of the form $\hat{Y} \pm$ a margin of error), and a standard error and confidence interval for $\mu$.
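For instance, using the fitted NOx model above, the predicted emission rate at a burner area liberation rate of x = 250 would be

    \hat{Y} = -45.552 + 1.7114 \times 250 \approx 382.3 ;

the two observed values of Y at x = 250 were 400 and 430, so the prediction errors there are about 18 and 48. Quantifying how large such errors typically are is the subject of the next lecture.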





Richard Lockhart
Wed Mar 11 06:59:23 PST 1998