next up previous

STAT 350: Lecture 32

Example: transformation vs glm

At each of 5 doses of some drug 40 animals were tested and the number dying were recorded. The log doses are -3.204, -2.903, -2.602, -2.301, and -2.000 and the numbers surviving are 7, 18, 32, 35, 38. A plot of the data is

After the arcsine transform we have

A more standard transformation for this problem is the logit or logistic

displaymath46

Here is the data analyzed by least squares after the logistic transformation

THE DATA FILE

Dose   Dead  Tested
-3.204    7     40
-2.903   18     40
-2.602   32     40
-2.301   35     40
-2.000   38     40

I used SPlus to analyze the data; here is a file full of commands I used.

dead <- read.table("data", header = T)
attach(dead)
postscript("logist.ps",onefile=F,horizontal=F)
plot(Dose, Dead/Tested,xlab="Log Dose",ylab="Proportion Dying")
dev.off()
postscript("arc_logist.ps",onefile=F,horizontal=F)
plot(Dose, Dead/Tested,xlab="Log Dose",ylab="Arc Sine Transform")
dev.off()
postscript("logit_logist.ps",onefile=F,horizontal=F)
plot(Dose, log(Dead/(Tested-Dead)),xlab="Log Dose",ylab="Logit Transform")
dev.off()
linfit <- lm( log(Dead/(Tested-Dead)) ~ Dose, data=dead)
summary(linfit)
glmfit <- glm( cbind(Dead,Tested-Dead) ~ Dose, data=dead, family=binomial)
summary(glmfit)
postscript("logist_plus_curve.ps",onefile=F,horizontal=F)
plot(Dose, Dead/Tested,xlab="Log Dose",ylab="Proportion Dying")
d <- seq(-3.3,-1.9,length=200)
etalin <- coef(linfit)[1] + d*coef(linfit)[2]
p  <- exp(etalin)/(1+exp(etalin))
lines(d,p)
etaglm <- coef(glmfit)[1] + d*coef(glmfit)[2]
p  <- exp(etaglm)/(1+exp(etaglm))
lines(d,p,lty=2)
dev.off()
The output consists of 4 postscript files which have graphs in them and the following printout:
S-PLUS : Copyright (c) 1988, 1996 MathSoft, Inc.
S : Copyright AT&T.
Version 3.4 Release 1 for Sun SPARC, SunOS 5.3 : 1996 
Working data will be in .Data 
> dead <- read.table("data", header = T)
#
#    Read in the data.  Columns are named by the words 
#    read off line 1 because of the header=T bit.
#
> attach(dead)
#
#    Makes the variable which are columns of dead
#    accessible to the plotting rooutines
#
> postscript("logist.ps",onefile=F,horizontal=F)
#
#    Declares that the next graph should be put
#    in a postscript file called logist.ps.  The 
#    file should be encapsulated postscript and in 
#    portrait orientation.
#
> plot(Dose, Dead/Tested,xlab="Log Dose",ylab="Proportion Dying")
#
#    Plot Proportion dying on the y axis against Dose and
#    label the axes
#
> dev.off()
#
#    Finish up the postscript file
#
Starting to make postscript file.
Finished postscript file, executing command "lpr -h logist.ps &".
 null device 
           1
> postscript("arc_logist.ps",onefile=F,horizontal=F)
> plot(Dose, asin(sqrt(Dead/Tested)),xlab="Log Dose",ylab="Arc Sine Transform")
> dev.off()
Starting to make postscript file.
Finished postscript file, executing command "lpr -h arc_logist.ps &".
 null device 
           1
> postscript("logit_logist.ps",onefile=F,horizontal=F)
> plot(Dose, log(Dead/(Tested-Dead)),xlab="Log Dose",ylab="Logit Transform")
> dev.off()
Starting to make postscript file.
Finished postscript file, executing command "lpr -h logit_logist.ps &".
 null device 
           1
> > linfit <- lm( log(Dead/(Tested-Dead)) ~ Dose, data=dead)
#
#   Regress log(Y/(n-Y)) on Dose
#
> summary(linfit)
#
#   Print out a summary of the regression results.
#
Call: lm(formula = log(Dead/(Tested - Dead)) ~ Dose, data = dead)
Residuals:
       1       2      3        4      5 
 -0.2283 0.00792 0.4812 -0.07283 -0.188

Coefficients:
              Value Std. Error t value Pr(>|t|) 
(Intercept) 10.5322  0.9109    11.5626  0.0014 
       Dose  3.6999  0.3455    10.7095  0.0017 

Residual standard error: 0.3288 on 3 degrees of freedom
Multiple R-Squared: 0.9745 
F-statistic: 114.7 on 1 and 3 degrees of freedom, the p-value is 0.001741 

Correlation of Coefficients:
     (Intercept) 
Dose 0.9869     
> glmfit <- glm( cbind(Dead,Tested-Dead) ~ Dose, data=dead, family=binomial)
# 
#   This fits the model that log(E(Y)/(n-E(Y)) is a linear function
#   of Dose
#
> summary(glmfit)

Call: glm(formula = cbind(Dead, Tested - Dead) ~ Dose, family = binomial, data = dead
	)
Deviance Residuals:
          1           2        3         4          5 
 -0.4319303 -0.03563685 1.026423 -0.476305 -0.5453582

Coefficients:
                Value Std. Error  t value 
(Intercept) 11.238232  1.5651480 7.180300
       Dose  3.936472  0.5604985 7.023162

(Dispersion Parameter for Binomial family taken to be 1 )

    Null Deviance: 80.77441 on 4 degrees of freedom

Residual Deviance: 1.76566 on 3 degrees of freedom

Number of Fisher Scoring Iterations: 3 

Correlation of Coefficients:
     (Intercept) 
Dose 0.9929832  
> > > postscript("logist_plus_curve.ps",onefile=F,horizontal=F)
> plot(Dose, Dead/Tested,xlab="Log Dose",ylab="Proportion Dying")
> d <- seq(-3.3,-1.9,length=200)
> etalin <- coef(linfit)[1] + d*coef(linfit)[2]
> p  <- exp(etalin)/(1+exp(etalin))
#
#    For each dose in d, which is a list of 200 numbers running
#    from -3.3 to -1.9, compute the fitted probability according to
#    the logit model: if log(x/(1-x))=p then p=exp(x)/(1+exp(x))
#
> lines(d,p)
#
#   Plot the fitted curve for the least squares method on the 
#   graph of the data
#
> etaglm <- coef(glmfit)[1] + d*coef(glmfit)[2]
> p  <- exp(etaglm)/(1+exp(etaglm))
#
#    Do the same for the generalized model fit
#
> lines(d,p,lty=2)
#
#   Plot the fitted curve for the glm method on the 
#   graph of the data.  Use a dashed (lty=2) line.
#
> dev.off()
Starting to make postscript file.
Finished postscript file, executing command "lpr -h logist_plus_curve.ps &".
 null device 
           1
> q() # end-of-file

Poisson Regression: Count Data

In the table below the first row is the number of times a carton of glass objects was transferred from one aircraft to another during shipping. The second row is the number of broken objects.

i: 1 2 3 4 5 6 7 8 9 10
tex2html_wrap_inline50 1 0 2 0 3 1 0 1 2 0
tex2html_wrap_inline52 16 9 17 12 22 13 8 15 19 11

A reasonable model is that tex2html_wrap_inline52 has a Poisson distribution with mean tex2html_wrap_inline56 which depends in some way on tex2html_wrap_inline50 . We fit 3 models:

  1. The ordinary linear regression model

    displaymath60

    for which the relevant plot is

  2. The transformed regression model

    displaymath62

    for which the relevant plot is

  3. The Poisson regression model in which tex2html_wrap_inline52 has a Poisson( tex2html_wrap_inline56 ) distribution and

    displaymath68

    or equivalently

    displaymath70

The following SPlus code fits these models

dat <- read.table("data", header = T)
attach(dat)
postscript("xyplot.ps",onefile=F,horizontal=F)
plot(Transfers,Broken,xlab="Number of Transfers",ylab="Number Broken")
dev.off()
postscript("xrootyplot.ps",onefile=F,horizontal=F)
plot(Transfers, sqrt(Broken),xlab="Number of Transfers",ylab="Square Root Transform")
dev.off()
#
#   Regress Number Broken on Number of Transfers
#
linfit <- lm( Broken ~ Transfers, data=dat)
summary(linfit)
diag(linfit)
#
#   Regress Square Root of Number Broken on Number of Transfers
#
rootlinfit <- lm( sqrt(Broken) ~ Transfers, data=dat)
summary(rootlinfit)
diag(rootlinfit)
# 
#   The following fits log(E(Y)) is a linear function
#   of Dose and variance is equal to the mean
#
glmfit <- glm( Broken ~ Transfers, data=dat, family=Poisson)
summary(glmfit)
postscript("points_plus_curve.ps",onefile=F,horizontal=F)
plot(Transfers,Broken,xlab="Number of Transfers",ylab="Number Broken")
d <- seq(0,4,length=200)
etalin <- coef(linfit)[1] + d*coef(linfit)[2]
lines(d,etalin)
etarootlin <- coef(rootlinfit)[1] + d*coef(rootlinfit)[2]
lines(d,etarootlin^2,lty=2)
etaglm <- coef(glmfit)[1] + d*coef(glmfit)[2]
p  <- exp(etaglm)
lines(d,p,lty=3)
legend(0,20,lty=1:3,legend=c("OLS","OLS on Root Y","GLM"))
dev.off()

EDITED OUTPUT (Complete output.)

> summary(linfit)

Call: lm(formula = Broken ~ Transfers, data = dat)
Residuals:
  Min   1Q Median  3Q Max 
 -2.2 -1.2    0.3 0.8 1.8

Coefficients:
              Value Std. Error t value Pr(>|t|) 
(Intercept) 10.2000  0.6633    15.3771  0.0000 
  Transfers  4.0000  0.4690     8.5280  0.0000 

Residual standard error: 1.483 on 8 degrees of freedom
Multiple R-Squared: 0.9009 
F-statistic: 72.73 on 1 and 8 degrees of freedom, the p-value is 2.749e-05 

Correlation of Coefficients:
          (Intercept) 
Transfers -0.7071    

> summary(rootlinfit)

Call: lm(formula = sqrt(Broken) ~ Transfers, data = dat)
Residuals:
     Min      1Q  Median     3Q   Max 
 -0.3722 -0.1263 0.01059 0.1392 0.274

Coefficients:
              Value Std. Error t value Pr(>|t|) 
(Intercept)  3.2006  0.1010    31.6789  0.0000 
  Transfers  0.5254  0.0714     7.3538  0.0001 

Residual standard error: 0.2259 on 8 degrees of freedom
Multiple R-Squared: 0.8711 
F-statistic: 54.08 on 1 and 8 degrees of freedom, the p-value is 7.965e-05 

Correlation of Coefficients:
          (Intercept) 
Transfers -0.7071    

> summary(glmfit)

Call: glm(formula = Broken ~ Transfers, family = poisson, data = dat)
Deviance Residuals:
        Min         1Q      Median        3Q       Max 
 -0.8105292 -0.2389281 -0.02029464 0.3299091 0.6074179

Coefficients:
                Value Std. Error   t value 
(Intercept) 2.3529495  0.1317376 17.860883
  Transfers 0.2638422  0.0792345  3.329891

(Dispersion Parameter for Poisson family taken to be 1 )

    Null Deviance: 12.56868 on 9 degrees of freedom

Residual Deviance: 1.813176 on 8 degrees of freedom

Number of Fisher Scoring Iterations: 3 

Correlation of Coefficients:
          (Intercept) 
Transfers -0.770864

The fits together


next up previous



Richard Lockhart
Thu Mar 20 23:13:31 PST 1997