STAT 350: Lecture 19
The SCENIC data set, continued
See Lecture 18 for plots of the data. Here we fit several models and discuss interpretation of the coefficients.
I begin by thinking about how the variables should influence Risk. It seems natural that risk of infection would increase with length of stay. Of course this data provides only ecological correlations, that is, average risk for a whole group of patients is being related to average stay. Nothing in the data can indicate clearly whether or not patients with long stays are actually getting infected more often. Nevertheless, it seems reasonable that if we could hold other features of the hospital environment constant then hospitals with long average stays would expose their patients to more risk, that is, would have a higher probability of infection. Similar remarks should be made about all the discussion below.
The variable Age would be expected to be positively correlated with risk of infection since, presumably, older patients are more susceptible.
The two variables Chest and Culture are measures of how hard the hospital looks for otherwise unsuspected infection. If you look harder you should find more so that the coefficients of these variables in any regression model would be expected to be positive.
The variables Beds, Census and Nurses all measure the size of the hospital. It is not clear if a big hospital should put a patient at risk of infection or not. However, I point out that the relations between these three variables might make a difference. A hospital with high Census compared to Beds may be overcrowded and so more likely to support infections. A hospital with lots of Nurses relative to patients might be expected to be kept in better condition and so lower the risk.
The variable Facilities seems to measure the sophistication of medical treatment at the hospital. This might be positive or it might suggest more exotic diseases among the patient population so I can't guess intelligently about the direction of the effect.
We begin by fitting a model using all the continuous predictors, that is ignoring only SCHOOL and REGION. Here is Splus code and results.
> fit.full <- lm(Risk ~ Stay + Age + Culture + Chest + Beds + Census + Nurses + Facilities, data=scenic) > summary(fit.full) Call: lm(formula = Risk ~ Stay + Age + Culture + Chest + Beds + Census + Nurses + Facilities, data = scenic) Residuals: Min 1Q Median 3Q Max -2.222 -0.5918 0.01833 0.5453 2.556 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) -0.7473 1.2076 -0.6188 0.5374 Stay 0.1769 0.0691 2.5621 0.0118 Age 0.0162 0.0223 0.7285 0.4679 Culture 0.0470 0.0107 4.3719 0.0000 Chest 0.0120 0.0055 2.1943 0.0304 Beds -0.0014 0.0027 -0.5340 0.5945 Census 0.0007 0.0035 0.2097 0.8343 Nurses 0.0019 0.0018 1.0873 0.2794 Facilities 0.0163 0.0102 1.5978 0.1131 Residual standard error: 0.959 on 104 degrees of freedom Multiple R-Squared: 0.5251 F-statistic: 14.37 on 8 and 104 degrees of freedom, the p-value is 6.117e-14 Correlation of Coefficients: (Intercept) Stay Age Culture Chest Beds Census Nurses Stay -0.0146 Age -0.8622 -0.3347 Culture -0.1606 -0.2930 0.3042 Chest -0.1914 -0.3039 0.0281 -0.2911 Beds -0.0377 0.2726 -0.0775 -0.0551 0.0077 Census 0.0536 -0.4625 0.1369 0.1530 0.0657 -0.8766 Nurses 0.0041 0.2486 -0.0410 -0.1935 -0.0657 -0.1681 -0.2265 Facilities -0.1544 -0.0475 -0.0232 -0.0363 -0.0613 -0.2009 0.0670 -0.2229The output suggests that Stay, Culture and Chest are important predictors but none of the others are. So we fit next the model retaining only these 3 predictors.
> fit.1 <- lm( Risk ~ Stay + Culture + Chest, data=scenic) > summary(fit.1) Call: lm(formula = Risk ~ Stay + Culture + Chest, data = scenic) Residuals: Min 1Q Median 3Q Max -2.181 -0.7678 -0.04002 0.696 2.594 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 0.3092 0.5425 0.5700 0.5698 Stay 0.2450 0.0540 4.5349 0.0000 Culture 0.0494 0.0103 4.8008 0.0000 Chest 0.0110 0.0056 1.9839 0.0498 Residual standard error: 0.99 on 109 degrees of freedom Multiple R-Squared: 0.4696 F-statistic: 32.16 on 3 and 109 degrees of freedom, the p-value is 5.662e-15 Correlation of Coefficients: (Intercept) Stay Culture Stay -0.6633 Culture 0.1766 -0.1963 Chest -0.4611 -0.2848 -0.3435We can compare these models in Splus by carrying out an extra Sum of Squares F-test. I have edited the output to make it fit - changing the columns labelled Terms to the shorthand FULL and REDUCED
> anova(fit.1,fit.full) Analysis of Variance Table Response: Risk Terms Resid. Df RSS Test Df Sum of Sq F Value Pr(F) 1 REDUCED 109 106.8206 2 FULL 104 95.6398 5 11.18079 2.431627 0.03972748You see that the test of the hypothesis of no influence of the 5 extra variables suggests that you cannot delete them all, even though individaul t-tests for the 5 individual coefficients are not significant. To see what can happen consider adding each of the three variables which measure size
> fit.b <- lm(Risk ~ Stay + Culture + Chest + Beds , data = scenic) > fit.c <- lm(Risk ~ Stay + Culture + Chest + Census, data = scenic) > fit.n <- lm(Risk ~ Stay + Culture + Chest + Nurses, data = scenic) > summary(fit.b) Call: lm(formula = Risk ~ Stay + Culture + Chest + Beds, data = scenic) Residuals: Min 1Q Median 3Q Max -1.993 -0.7365 0.05695 0.66 2.291 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 0.4149 0.5309 0.7816 0.4362 Stay 0.1845 0.0578 3.1940 0.0018 Culture 0.0480 0.0101 4.7701 0.0000 Chest 0.0130 0.0055 2.3761 0.0193 Beds 0.0013 0.0005 2.5516 0.0121 Residual standard error: 0.9658 on 108 degrees of freedom Multiple R-Squared: 0.4997 F-statistic: 26.97 on 4 and 108 degrees of freedom, the p-value is 1.554e-15 Correlation of Coefficients: (Intercept) Stay Culture Chest Stay -0.6351 Culture 0.1714 -0.1558 Chest -0.4439 -0.3155 -0.3475 Beds 0.0780 -0.4099 -0.0560 0.1424 > summary(fit.c) Call: lm(formula = Risk ~ Stay + Culture + Chest + Census, data = scenic) Residuals: Min 1Q Median 3Q Max -1.984 -0.7584 0.07387 0.6545 2.447 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 0.5233 0.5357 0.9770 0.3307 Stay 0.1719 0.0599 2.8694 0.0049 Culture 0.0484 0.0101 4.8199 0.0000 Chest 0.0132 0.0055 2.3952 0.0183 Census 0.0017 0.0007 2.5656 0.0117 Residual standard error: 0.9655 on 108 degrees of freedom Multiple R-Squared: 0.5 F-statistic: 27 on 4 and 108 degrees of freedom, the p-value is 1.554e-15 Correlation of Coefficients: (Intercept) Stay Culture Chest Stay -0.6504 Culture 0.1683 -0.1542 Chest -0.4270 -0.3189 -0.3452 Census 0.1558 -0.4757 -0.0384 0.1497 > summary(fit.n) Call: lm(formula = Risk ~ Stay + Culture + Chest + Nurses, data = scenic) Residuals: Min 1Q Median 3Q Max -1.958 -0.7093 0.02961 0.5473 2.453 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 0.3703 0.5241 0.7065 0.4814 Stay 0.1936 0.0549 3.5263 0.0006 Culture 0.0456 0.0100 4.5505 0.0000 Chest 0.0127 0.0054 2.3480 0.0207 Nurses 0.0021 0.0007 2.9949 0.0034 Residual standard error: 0.9556 on 108 degrees of freedom Multiple R-Squared: 0.5102 F-statistic: 28.13 on 4 and 108 degrees of freedom, the p-value is 5.551e-16 Correlation of Coefficients: (Intercept) Stay Culture Chest Stay -0.6417 Culture 0.1700 -0.1450 Chest -0.4544 -0.3008 -0.3519 Nurses 0.0389 -0.3126 -0.1276 0.1013
You should note that each of the measures of size is significant; that is, it appears that you should add each of them. But because the three measures of size are quite multicollinear you get no additional predictive power out of including more than one of them in the model. Look what happens when I add both Census and Beds:
> fit.bc <- lm(Risk ~ Stay + Culture + Chest + Census + Beds, data = scenic) > summary(fit.bc) Call: lm(formula = Risk ~ Stay + Culture + Chest + Census + Beds, data = scenic) Residuals: Min 1Q Median 3Q Max -1.986 -0.7498 0.07771 0.6563 2.386 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 0.4841 0.5742 0.8431 0.4010 Stay 0.1760 0.0637 2.7611 0.0068 Culture 0.0483 0.0101 4.7610 0.0000 Chest 0.0131 0.0055 2.3797 0.0191 Census 0.0011 0.0034 0.3243 0.7463 Beds 0.0005 0.0026 0.1956 0.8453 Residual standard error: 0.9699 on 107 degrees of freedom Multiple R-Squared: 0.5002 F-statistic: 21.42 on 5 and 107 degrees of freedom, the p-value is 8.549e-15 Correlation of Coefficients: (Intercept) Stay Culture Chest Census Stay -0.6906 Culture 0.1887 -0.1749 Chest -0.3926 -0.3080 -0.3417 Census 0.3715 -0.4140 0.0810 0.0513 Beds -0.3492 0.3301 -0.0906 -0.0215 -0.9794Notice again that neither the coefficient of Beds nor that of Census is significantly different from 0. The reason for the confusion is in the matrix of correlations (which is converted to correlation form by dividing the ijth entry by the square root of the product of the ith and jth diagonal entries). In particular the components for Beds and Census are very highly negatively correlated.
Here is Splus code for an ANOVA table comparing the model with Beds and Census to the model without them.
> anova(fit.1,fit.bc) Analysis of Variance Table Response: Risk Terms Resid. Df RSS Test Df Sum of Sq F Value Pr(F) 1 Stay + Culture + Chest 109 106.8206 2 Stay + Culture + Chest + Census + Beds 107 100.6482 +Census+Beds 2 6.17243 3.280984 0.04140679Finally look what happens if we put in Beds and Nurses.
> fit.nb <- lm(Risk ~ Stay + Culture + Chest + Nurses + Beds, data = scenic) > summary(fit.nb) Call: lm(formula = Risk ~ Stay + Culture + Chest + Nurses + Beds, data = scenic) Residuals: Min 1Q Median 3Q Max -1.959 -0.7159 0.05094 0.5114 2.513 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 0.3523 0.5290 0.6660 0.5069 Stay 0.1998 0.0582 3.4312 0.0009 Culture 0.0451 0.0102 4.4383 0.0000 Chest 0.0125 0.0055 2.2807 0.0245 Nurses 0.0026 0.0017 1.5528 0.1234 Beds -0.0004 0.0012 -0.3336 0.7394 Residual standard error: 0.9596 on 107 degrees of freedom Multiple R-Squared: 0.5107 F-statistic: 22.34 on 5 and 107 degrees of freedom, the p-value is 2.776e-15 Correlation of Coefficients: (Intercept) Stay Culture Chest Nurses Stay -0.6371 Culture 0.1819 -0.1818 Chest -0.4364 -0.3217 -0.3285 Nurses -0.0763 0.1693 -0.1821 -0.0678 Beds 0.1019 -0.3230 0.1422 0.1211 -0.9079Notice particularly that Beds now has a negative coefficient (though not significantly so).
SUMMARY: Because this is an observational study you cannot interpret the parameter estimates as measuring the amount by which the response would be expected to increase if you increased the corresponding variable by 1 unit. The trouble is that in the population, large values of Beds go with, usually, larger values of Census. So the coefficient of Beds measures something rather more like: how much would the response increase if I increased Beds by 1 unit and all the other covariates changed as you would expect from the relations in the variables seen in this population. Thus you can't really base a policy decision about buildig more beds in a hospital on the sign of the coefficient of Beds in a regression model like this.