STAT 330 Lecture 34
Reading for Today's Lecture: Chapter 13.
Goals of Today's Lecture:
The General Linear Model
$$Y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i$$
for $i = 1, \ldots, n$.
Special Cases and Examples:
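One Way Layout:
$$Y_{ij} = \mu_i + \epsilon_{ij}$$
with parameters $\mu_1, \ldots, \mu_I$, and $p = I$, or, writing $\mu_i = \mu + \alpha_i$, parameters $\mu, \alpha_1, \ldots, \alpha_{I-1}$.
Note: $\alpha_I$ is redundant because $\alpha_1 + \cdots + \alpha_I = 0$.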
Special notes:
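Two Way Layout without replicates:
$$Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$$
with the restrictions
$$\sum_i \alpha_i = 0$$
and
$$\sum_j \beta_j = 0.$$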
Multiple Regression
In multiple regression we have an equation like the one above, but with the $x_{ij}$ filled in with the values of more than one independent variable.
Example: We now regress hardness on SAND and FIBRE content. Previously we treated each of these variables as merely having 3 (unordered) categories. Now we use the numerical values of those categories as the values $u_i$ (sand content) and $v_i$ (fibre content) of two independent variables.
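All the models above can be written in the form
$$Y = X\beta + \epsilon$$
where $Y$ is the vector of $n$ responses, $X$ is the $n \times p$ design matrix, $\beta$ is the vector of $p$ parameters and $\epsilon$ is the vector of errors.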
In the two way layout example we have, for instance:
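As a hedged illustration of what such a design matrix looks like, the sketch below builds $X$ for a two way layout with 3 levels of each factor and no replicates, using the sum-to-zero restrictions to eliminate the last level of each factor. The function names `effect_code` and `two_way_design`, and the column ordering (overall mean, two row effects, two column effects), are choices made here for illustration and are not taken from the lecture.

```python
def effect_code(level, n_levels):
    """Sum-to-zero coding for one factor.

    Levels 1..n_levels-1 get an indicator column each; the last level
    gets -1 in every column because its effect is minus the sum of the others.
    """
    if level == n_levels:
        return [-1] * (n_levels - 1)
    return [1 if level == k else 0 for k in range(1, n_levels)]


def two_way_design(n_rows, n_cols):
    """Rows of X for the cells (i, j), with j varying fastest."""
    X = []
    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            X.append([1] + effect_code(i, n_rows) + effect_code(j, n_cols))
    return X


X = two_way_design(3, 3)
for row in X:
    print(row)
```

With 3 levels of each factor this gives 9 rows and 5 columns, one column per free parameter.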
Analysis Principles
The parameters are estimated by least squares, giving the matrix algebra solution
$$\hat\beta = (X^T X)^{-1} X^T Y.$$
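A minimal sketch of the least squares computation, assuming nothing beyond the Python standard library: it forms $X^T X$ and $X^T Y$ and solves the normal equations by Gaussian elimination. The data are made up so that $y = 1 + 2u + 3v$ holds exactly; they are not the plaster data, and the helper names are hypothetical.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


# Made-up toy data (illustration only): y = 1 + 2u + 3v with no error.
u = [0, 1, 2, 0, 1, 2]
v = [0, 0, 0, 1, 1, 1]
y = [1 + 2 * ui + 3 * vi for ui, vi in zip(u, v)]

X = [[1, ui, vi] for ui, vi in zip(u, v)]   # design matrix with intercept column
Xt = transpose(X)
XtX = matmul(Xt, X)                          # X^T X
XtY = [sum(xc * yi for xc, yi in zip(col, y)) for col in Xt]  # X^T Y
beta_hat = solve(XtX, XtY)                   # solves (X^T X) beta = X^T Y
print(beta_hat)
```

Because the toy responses lie exactly on a plane, the estimates recover the coefficients 1, 2 and 3 up to rounding error. Solving the normal equations directly is fine for small, well-conditioned problems like this one; statistical software generally uses more stable decompositions.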
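Source | SS | df |
Regression | $\sum_i \hat Y_i^2$ | p |
Error | $\sum_i (Y_i - \hat Y_i)^2$ | n-p |
Total (not corrected) | $\sum_i Y_i^2$ | n |

Source | SS | df |
Regression | $\sum_i (\hat Y_i - \bar Y)^2$ | p-1 |
Error | $\sum_i (Y_i - \hat Y_i)^2$ | n-p |
Total (corrected) | $\sum_i (Y_i - \bar Y)^2$ | n-1 |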
SAS example: Multiple Regression
The data consist of casting hardnesses for 18 samples prepared under 3 levels of sand added and 3 levels of carbon fibre added. See Q 15 in Chapter 11. I use proc glm to regress hardness on sand content and fibre content but now treat them as continuous variables.
I ran the following SAS code:
options pagesize=60 linesize=80;
data plaster;
  infile 'plaster.dat';
  input sand fibre hardness strength;
proc glm data=plaster;
  model hardness = sand fibre;
  output out=plasfit p=yhat r=resid;
proc univariate data=plasfit plot normal;
  var resid;
proc plot;
  plot resid*sand;
  plot resid*fibre;
  plot resid*yhat;
run;
The line labelled model says that I am interested in the effects of sand and fibre; the lack of a class statement makes glm do multiple regression.
The abridged output from proc glm is:
General Linear Models Procedure
Number of observations in data set = 18

Dependent Variable: HARDNESS
                              Sum of           Mean
Source            DF         Squares         Square   F Value   Pr > F
Model              2    167.41666667    83.70833333     11.53   0.0009
Error             15    108.86111111     7.25740741
Corrected Total   17    276.27777778

R-Square        C.V.     Root MSE    HARDNESS Mean
0.605972    3.870011    2.6939576        69.611111

Source   DF      Type I SS     Mean Square   F Value   Pr > F
SAND      1   102.08333333    102.08333333     14.07   0.0019
FIBRE     1    65.33333333     65.33333333      9.00   0.0090

                            T for H0:    Pr > |T|   Std Error of
Parameter      Estimate   Parameter=0                 Estimate
INTERCEPT   64.36111111         50.68      0.0001   1.26994378
SAND         0.19444444          3.75      0.0019   0.05184524
FIBRE        0.09333333          3.00      0.0090   0.03110714
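The derived quantities in this output follow from the sums of squares and degrees of freedom by simple arithmetic. The sketch below recomputes them from the numbers printed above; the variable names are chosen here for illustration, and the inputs are copied from the listing rather than recomputed from the raw data.

```python
import math

# Values copied from the proc glm listing.
ss_model, df_model = 167.41666667, 2
ss_error, df_error = 108.86111111, 15
ss_total, df_total = 276.27777778, 17
y_mean = 69.611111

ms_model = ss_model / df_model        # mean square for the model
ms_error = ss_error / df_error        # mean square error
f_value = ms_model / ms_error         # overall F statistic
r_square = ss_model / ss_total        # proportion of variation explained
root_mse = math.sqrt(ms_error)        # estimate of the error SD
cv = 100 * root_mse / y_mean          # coefficient of variation, in percent

print(round(f_value, 2), round(r_square, 6), round(root_mse, 7), round(cv, 6))
```

The model and error rows add up to the corrected total, both in sums of squares and in degrees of freedom (2 + 15 = 17 = n - 1).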
The conclusions are that both sand and fibre have an effect on hardness (I read the so-called Type I SS table, see P values of 0.0019 and 0.0090, and reject the two null hypotheses). The last table permits the construction of confidence intervals for the slopes. You can, for instance, predict that a SAND content of 10% and a FIBRE content of 20% would produce a hardness of
$$64.361 + 0.1944(10) + 0.0933(20) \approx 68.17.$$
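The prediction and the slope confidence intervals mentioned above can be sketched as follows. The estimates and standard errors are copied from the parameter table; the critical value 2.131 is the tabulated 0.975 quantile of the t distribution with 15 degrees of freedom (the error df), and the 95% level is an assumption made here for illustration. The function name `predict` is hypothetical.

```python
# Estimates and standard errors copied from the proc glm parameter table.
intercept, b_sand, b_fibre = 64.36111111, 0.19444444, 0.09333333
se_sand, se_fibre = 0.05184524, 0.03110714

# Tabulated 0.975 quantile of t with 15 df (assumed 95% confidence level).
t_crit = 2.131


def predict(sand, fibre):
    """Fitted hardness at the given sand and fibre percentages."""
    return intercept + b_sand * sand + b_fibre * fibre


def conf_int(estimate, std_err):
    """Two-sided interval: estimate plus or minus t_crit standard errors."""
    half_width = t_crit * std_err
    return (estimate - half_width, estimate + half_width)


pred = predict(10, 20)
ci_sand = conf_int(b_sand, se_sand)
ci_fibre = conf_int(b_fibre, se_fibre)
print(round(pred, 2), ci_sand, ci_fibre)
```

The predicted hardness is about 68.17. The intervals are roughly (0.084, 0.305) for the SAND slope and (0.027, 0.160) for the FIBRE slope; neither contains 0, in agreement with the small P values.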
The model fit should be checked by examining various diagnostic statistics and plots:
Univariate Procedure
Variable=RESID

Moments
N           18          Sum Wgts    18
Mean        0           Sum         0
Std Dev     2.530533    Variance    6.403595
Skewness    -0.1431     Kurtosis    -0.29863
USS         108.8611    CSS         108.8611
CV          .           Std Mean    0.596452
T:Mean=0    0           Pr>|T|      1.0000
Num ^= 0    18          Num > 0     7
M(Sign)     -2          Pr>=|M|     0.4807
Sgn Rank    0.5         Pr>=|S|     0.9915
W:Normal    0.976631    Pr<W        0.8888

Quantiles(Def=5)
100% Max    4.388889    99%    4.388889
 75% Q3     2.055556    95%    4.388889
 50% Med    -0.40278    90%    3.805556
 25% Q1     -1.36111    10%    -3.36111
  0% Min    -5.19444     5%    -5.19444
                         1%    -5.19444
Range       9.583333
Q3-Q1       3.416667
Mode        -0.86111

Extremes
Lowest      Obs    Highest     Obs
-5.19444(    5)    2.055556(   16)
-3.36111(    1)    2.305556(    7)
-2.94444(   15)    2.305556(    8)
-2.02778(   13)    3.805556(    6)
-1.36111(    2)    4.388889(   10)

Stem Leaf       #   Boxplot
   4 4          1      |
   2 1338       4   +-----+
   0 57         2   |  +  |
  -0 4996530    7   *-----*
  -2 490        3      |
  -4 2          1      |
     ----+----+----+----+

[Normal Probability Plot of RESID: vertical axis from -5 to 5, horizontal axis from -2 to +2]

[Plot of RESID*SAND. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis RESID from -6 to 6, horizontal axis SAND from 0 to 30]

[Plot of RESID*FIBRE. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis RESID from -6 to 6, horizontal axis FIBRE from 0 to 50]

[Plot of RESID*YHAT. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis RESID from -6 to 6, horizontal axis YHAT from 64 to 76]

The diagnostic plots seem fine to me.
In the two way ANOVA model fitted to these data we allowed for the possibility that the effect of SAND depended on the level of FIBRE. We can do the same here by including an interaction term in the model. The model equation fitted by the previous run of SAS is
$$Y_i = \beta_1 + \beta_2 u_i + \beta_3 v_i + \epsilon_i$$
for $i = 1, \ldots, 18$. Here $Y$ is hardness, $u$ is sand content (in %) and $v$ is fibre content (in %). To include an interaction term we modify the model equation to
$$Y_i = \beta_1 + \beta_2 u_i + \beta_3 v_i + \beta_4 u_i v_i + \epsilon_i.$$
The coefficient $\beta_4$ is then the interaction.
options pagesize=60 linesize=80;
data plaster;
  infile 'plaster.dat';
  input sand fibre hardness strength;
proc glm data=plaster;
  model hardness = sand|fibre;
run;

which produces
General Linear Models Procedure

Dependent Variable: HARDNESS
                              Sum of           Mean
Source            DF         Squares         Square   F Value   Pr > F
Model              3    168.54166667    56.18055556      7.30   0.0035
Error             14    107.73611111     7.69543651
Corrected Total   17    276.27777778

R-Square        C.V.     Root MSE    HARDNESS Mean
0.610044    3.985089    2.7740650        69.611111

Source       DF      Type I SS     Mean Square   F Value   Pr > F
SAND          1   102.08333333    102.08333333     13.27   0.0027
FIBRE         1    65.33333333     65.33333333      8.49   0.0113
SAND*FIBRE    1     1.12500000      1.12500000      0.15   0.7079

                             T for H0:    Pr > |T|   Std Error of
Parameter       Estimate   Parameter=0                 Estimate
INTERCEPT    63.98611111         39.14      0.0001   1.63463347
SAND          0.21944444          2.60      0.0210   0.08441211
FIBRE         0.10833333          2.14      0.0505   0.05064727
SAND*FIBRE   -0.00100000         -0.38      0.7079   0.00261541

There is no sign of a need for an interaction term, so the original model seems to be reasonable. Notice that the resulting model with only 3 parameters is more parsimonious than the model for the two way layout, which has 5 parameters (or 9 with an interaction term). The model asserts that hardness actually increases linearly with sand content and also with fibre content.
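The interaction test reported above can be reproduced by comparing the error sums of squares of the two fitted models. The sketch below is a hedged illustration using only numbers copied from the two proc glm listings; the variable names are chosen here and are not part of the SAS output.

```python
# Error SS and df copied from the two proc glm listings.
sse_reduced, df_reduced = 108.86111111, 15   # model without interaction
sse_full, df_full = 107.73611111, 14         # model with SAND*FIBRE

# Extra sum of squares F statistic for the single added term.
extra_ss = sse_reduced - sse_full
f_interaction = (extra_ss / (df_reduced - df_full)) / (sse_full / df_full)

# t statistic for the interaction coefficient: estimate / standard error.
t_interaction = -0.00100000 / 0.00261541

print(round(extra_ss, 3), round(f_interaction, 2), round(t_interaction, 2))
```

The extra sum of squares is 1.125, matching the Type I SS for SAND*FIBRE, and the F statistic is about 0.15. Because only one term is added, this F statistic equals the square of the t statistic for the interaction coefficient, up to rounding in the printed values.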