STAT 350: Lecture 20
The SCENIC data set, continued
See Lecture 18 for plots of the data and Lecture 19 for our first analysis.
We have found that STAY, CULTURE and CHEST are significant and that we must retain one of the three variables BEDS, NURSES and CENSUS, which measure the size of the hospital. Picking the one of the three which produces the largest multiple R-squared (a comparison sketched below), we go with NURSES. Now we look at the question of adding further variables to that 4-covariate model.
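The following sketch shows one way to make that comparison; it is not part of the original session, but assumes only the data frame scenic and the column names used elsewhere in these notes. Each line fits one candidate 4-covariate model and extracts its multiple R-squared:

# Compare the three size variables by the multiple R-squared
# of the 4-covariate model each one produces.
summary(lm(Risk ~ Stay + Culture + Chest + Beds,   data = scenic))$r.squared
summary(lm(Risk ~ Stay + Culture + Chest + Nurses, data = scenic))$r.squared
summary(lm(Risk ~ Stay + Culture + Chest + Census, data = scenic))$r.squared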
> anova(fit.n, fit.full)

Analysis of Variance Table

Response: Risk

        Resid. Df      RSS Test Df Sum of Sq   F Value     Pr(F)
REDUCED       108 98.62932
FULL          104 95.63982       4    2.9895 0.8127053 0.5198417

This suggests we need not consider adding further variables.
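The F statistic above is just the extra sum of squares test, and can be checked by hand from the two residual sums of squares. A quick check in R, using the numbers printed above:

# (drop in RSS / extra df) divided by the full-model mean squared error
Fstat <- ((98.62932 - 95.63982)/4) / (95.63982/104)
Fstat                  # 0.8127, as printed
1 - pf(Fstat, 4, 104)  # 0.5198, matching Pr(F)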
However, we should examine diagnostics and consider the question of how variables are likely to influence RISK.
Suggestion: Transform other variables.
Define NURSE.RATIO = NURSES/CENSUS. Idea: large values indicate more intensive nursing care.
Define CROWDING = CENSUS/BEDS. Idea: large values indicate a crowded hospital.
Add these variables to the model.
> Nurse.Ratio <- scenic$Nurse/scenic$Census
> sc.ext <- data.frame(scenic, Nurse.Ratio)
> Crowding <- scenic$Census/scenic$Beds
> sc.ext <- data.frame(sc.ext, Crowding)
> fit.l20 <- lm(Risk ~ Stay + Culture + Chest + Nurses + Crowding + Nurse.Ratio,
+               data = sc.ext)
> summary(fit.l20)

Residuals:
    Min      1Q  Median     3Q   Max
 -2.036 -0.6102 0.01268 0.3956 2.798

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -1.2762     0.8753 -1.4581   0.1478
       Stay  0.2196     0.0594  3.6983   0.0003
    Culture  0.0424     0.0099  4.2740   0.0000
      Chest  0.0093     0.0055  1.7040   0.0913
     Nurses  0.0014     0.0007  1.9627   0.0523
   Crowding  1.4296     0.9455  1.5121   0.1335
Nurse.Ratio  0.8238     0.3298  2.4979   0.0140

Residual standard error: 0.9359 on 106 degrees of freedom
Multiple R-Squared: 0.5389
F-statistic: 20.65 on 6 and 106 degrees of freedom, the p-value is 6.661e-16

Correlation of Coefficients:
            (Intercept)    Stay Culture   Chest  Nurses Crowding
       Stay     -0.3314
    Culture      0.1738 -0.1725
      Chest     -0.1170 -0.3422 -0.3010
     Nurses      0.3162 -0.2737 -0.0803  0.1608
   Crowding     -0.7108 -0.2136 -0.0321 -0.0605 -0.3032
Nurse.Ratio     -0.6321  0.2561 -0.1365 -0.2548 -0.3056   0.3849
Conclusion: NURSE.RATIO is a useful predictor.
Can we discard CHEST and CROWDING? NURSES is marginal, but it seems reasonable to keep this variable since we are keeping NURSE.RATIO.
> fit.l20.t <- lm(Risk ~ Stay + Culture + Nurse.Ratio + Nurses, data = sc.ext)
> summary(fit.l20.t)

Residuals:
    Min      1Q  Median     3Q   Max
 -2.214 -0.6387 0.06483 0.5021 2.655

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.0831     0.6092 -0.1365   0.8917
       Stay  0.2767     0.0549  5.0417   0.0000
    Culture  0.0482     0.0096  5.0311   0.0000
Nurse.Ratio  0.7695     0.2994  2.5701   0.0115
     Nurses  0.0016     0.0007  2.2607   0.0258

Residual standard error: 0.9511 on 108 degrees of freedom
Multiple R-Squared: 0.5149
F-statistic: 28.66 on 4 and 108 degrees of freedom, the p-value is 3.331e-16

Correlation of Coefficients:
            (Intercept)    Stay Culture Nurse.Ratio
       Stay     -0.8669
    Culture      0.1569 -0.3317
Nurse.Ratio     -0.6468  0.3148 -0.2287
     Nurses      0.1916 -0.3356 -0.0521     -0.1851

> anova(fit.l20, fit.l20.t)

Analysis of Variance Table

Response: Risk

        Resid. Df    RSS Test Df Sum of Sq F Value Pr(F)
FULL          106 92.852
REDUCED       108 97.689       2      4.84    2.76 0.068
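As before, the F statistic can be recovered by hand from the two residual sums of squares:

Fstat <- ((97.689 - 92.852)/2) / (92.852/106)
Fstat                  # 2.76, as printed
1 - pf(Fstat, 2, 106)  # about 0.068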
Conclusion: we can discard CHEST and CROWDING, but not NURSES.
Remaining Issues
To demonstrate that changing X causes changes in Y, we hold all other important variables constant and try experimental units at various settings of X. Variables we don't know about or can't control are equalized between the different levels of X by randomly assigning units to the different values of X.
An observational study is one where X cannot be controlled and other variables cannot be held constant. Think about a case where men have generally higher values of both X and Y, women have generally lower values, but within each sex there is no relation between X and Y. Here is a possible plot, the triangles being men.

[Figure: scatterplot of Y against X; the triangles (men) cluster at high X and high Y, the remaining points (women) at low X and low Y, with no trend within either group.]
If you didn't know about the influence of sex you would see a positive correlation between X and Y, but if you compute separate correlations for the two groups you see the variables are unrelated. Remember, if you manipulate X in the picture you are either doing so for a woman (and X and Y are unrelated for women) or for a man (and again X and Y are unrelated); in either case Y will be unaffected, because you would not be affecting the sex of the person.
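A small simulation (hypothetical, not from the original notes) reproduces the picture: within each sex X and Y are independent, yet pooling the two groups produces a clear positive correlation.

# Men are shifted up in both x and y; within each sex x and y are independent.
set.seed(350)
sex <- rep(c("M", "F"), c(50, 50))
x <- rnorm(100) + ifelse(sex == "M", 2, 0)
y <- rnorm(100) + ifelse(sex == "M", 2, 0)
cor(x, y)                          # pooled: clearly positive
cor(x[sex == "M"], y[sex == "M"])  # within men: near 0
cor(x[sex == "F"], y[sex == "F"])  # within women: near 0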
Doing multiple regression is very much like this. Imagine you have a response variable Y, a variable X whose influence on Y is of primary interest, and some other variables which probably influence Y and may influence X as well. You would like to look at the relation between X and Y in groups of cases where all the other covariate values are the same; this is not generally possible. Instead, we estimate the average value of Y for each possible combination of the variable X and the other variables, and ask whether this mean depends on X. We say we are adjusting for the other covariates.
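Regression performs this adjustment automatically. Continuing the simulated example above: x alone looks like a strong predictor of y, but once sex is in the model the coefficient of x is near zero.

summary(lm(y ~ x))$coefficients                # unadjusted: x looks important
summary(lm(y ~ x + factor(sex)))$coefficients  # adjusted for sex: x near 0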
The method works pretty well if we have identified all the possible confounding variables, so that we can adjust for them all. So, e.g., in our example, lowering the nursing ratio would be asserted to lower the risk of nosocomial infection. The trouble is that no such deduction is rigorously possible: you would need to be sure there was not a third variable, correlated with both X and Y, which is the real cause of variation in both and for which you haven't adjusted. In randomized designed experiments this possibility is dealt with by the randomization.
The slope corresponding to X in a regression model measures the change expected in Y when X is changed by 1 unit and all the other variables in the regression are held constant. It is in this sense that the regression method is used to adjust for the other covariates. Researchers say things like "Adjusted for length of service and publication rate, sex has no impact on the salaries of professors."