STAT 330 Lecture 35
Reading for Today's Lecture: Chapter 13.
Goals of Today's Lecture:
Today's notes
Non-linear Regression
Example: Sand dunes can be dated by geophysicists by a technique known as thermoluminescence dating. The idea is this. When sediment is buried under ground for a long time cosmic radiation and radiation from naturally occurring traces of radioactive elements in the ground cause electrons in the sand to be excited to a higher energy level. If the sand is then heated gently in an oven it glows with a blue light called thermoluminescence. The amount, Y, of this blue light can be measured. It depends on how long the sample of sand has been buried, how much background radiation it has been exposed to and how sensitive electrons in the particular sample are to radiation.
A natural model for the relation is, for old sediments, to assume that
for some parameter values and where d is the total amount of radiation impinging on the sample while it was buried. The quantity d should be the of the form d=tr where r is the rate at which radiation has hit the sample and t is how long it has been buried. The quantity r can be measured (or guessed) by physicists and so if we knew d we could guess t. If we knew and and the error were not big we could guess d from Y and then guess t.
In order to guess the parameters and we artificially vary d in the laboratory by applying various doses of radiation to the sample and measuring Y. If we apply, in the lab, a dose to a sample which has already received while buried then the total dose is and we get the model equation
in which the are covariates under the control of the experimenter but , and are unknown parameters. Here is the data:
0 | 10852.72 | 0 | 13069.23 | 0 | 13815.53 |
0 | 28048.12 | 0 | 17280.65 | 100 | 75309.49 |
100 | 74637.4 | 100 | 69289.66 | 100 | 75616.82 |
200 | 101647.1 | 200 | 100214.6 | 200 | 107503 |
200 | 105080.7 | 300 | 120423 | 300 | 127952.9 |
300 | 123771.8 | 300 | 133668.7 | 300 | 132273.8 |
300 | 127244.9 | 300 | 129482 | 400 | 139236.3 |
400 | 140716.4 | 400 | 138808.5 | 400 | 133130.2 |
500 | 175240.1 | 500 | 166539.2 | 500 | 150112.8 |
600 | 170131.9 | 600 | 170064.9 | 600 | 157032.5 |
600 | 153997 | 800 | 205493.6 | 800 | 186204.3 |
800 | 180135.2 | 800 | 177266.2 | 1000 | 197666.2 |
1000 | 209013.9 | 1000 | 201754 |
And here is a plot:
We generally assume that the are independent and all have mean 0 and variance . Moreover, we often assume that the have a normal distribution.
Under these circumstances the maximum likelihood estimates of the parameters are the same as least squares: they minimize
In later courses you will learn the theory which underlies the computation of estimated standard errors for this sort of data set. The essential mathematical idea is that when n is large the uncertainty in the parameter estimates is small. This permits the approximation of the function
by a linear function of the parameters and then use of multiple regression formulas to approximate the standard errors of the parameter estimates.
Here is SAS code to fit this model to the data set graphed above.
options pagesize=60 linesize=80; data thermo; infile 'nlin.dat' firstobs=2; input dose tl ; proc nlin ; parameters i0=210000 dadd=40 drate=360; model tl=i0*(1-exp(-(dose+dadd)/drate)); run;You should notice the line parameters. In problems other than linear regression estimation is carried out by solving some equations numerically. This process needs starting values and I provided these on the parameters line. You might find the firstobs=2 option useful; it says the first data point is on the second line of the input file. This allows me to put titles on the columns of data in the data file.
Here is the SAS output.
Non-Linear Least Squares Iterative Phase Dependent Variable TL Method: Gauss-Newton Iter I0 DADD DRATE Sum of Squares 0 210000 40.000000 360.000000 2850924481 1 206915 38.829762 360.130510 2658973703 2 206918 38.816671 360.146389 2658972887 3 206918 38.817016 360.148131 2658972886 NOTE: Convergence criterion met. Non-Linear Least Squares Summary Statistics Dependent Variable TL Source DF Sum of Squares Mean Square Regression 3 705788442288 235262814096 Residual 35 2658972886 75970654 Uncorrected Total 38 708447415175 (Corrected Total) 37 117263307042 Parameter Estimate Asymptotic Asymptotic 95 % Std. Error Confidence Interval Lower Upper I0 206918.3830 5960.3342157 194818.33309 219018.43294 DADD 38.8170 8.3251990 21.91606 55.71797 DRATE 360.1481 28.5679715 302.15241 418.14385 Asymptotic Correlation Matrix Corr I0 DADD DRATE ----------------------------------------------------------- I0 1 0.4462823664 0.9042557338 DADD 0.4462823664 1 0.679588767 DRATE 0.9042557338 0.679588767 1In this output the most useful line is the line DADD in the table with columns Parameter, Estimate and so on. The Asymptotic Std. Error gives an approximate standard error for the estimated radiation dose the sample received while buried. Beside this standard error is a 95% confidence interval. From data on the rate at which radiation impacted this sample while buried the geophysicist would convert this confidence interval into a confidence interval for the length of time the sample has been buried.
Here is a plot of the data with the fitted curve superimposed. The fit seems OK.
Finally here is a residual plot of residual
against dose.
The plot seems to me to be a bit suspect. There is an area of low doses where the residuals are all positive and then another area on the right of the plot where the residuals are all negative.
Logistic Regression: an example
Example: At each of 5 doses of some drug 40 animals were tested and the numbers dying were recorded. The log doses are -3.204, -2.903, -2.602, -2.301, and -2.000 and the numbers surviving are 7, 18, 32, 35, 38.
In older statistics courses it was recommended to regress the proportion surviving on the log dose after making a transformation. However, the modern approach is to do logistic regression via fitting a so-called generalized linear model.
In this case our model is that , the number dying at dose i has a Binomial distribution with parameters and
The logistic regression model postulates that the probability of death is related to the log dose by the so-called logistic transformation:
This model can be solved for to give
Here is SAS code to fit the model:
data logist; infile "logist.dat" firstobs=2; input dose dead tested; proc logist; model dead/tested=dose;and here is the output
The LOGISTIC Procedure Data Set: WORK.LOGIST Response Variable (Events): DEAD Response Variable (Trials): TESTED Number of Observations: 5 Link Function: Logit Response Profile Ordered Binary Value Outcome Count 1 EVENT 130 2 NO EVENT 70 Model Fitting Information and Testing Global Null Hypothesis BETA=0 Intercept Intercept and Criterion Only Covariates Chi-Square for Covariates AIC 260.979 183.970 . SC 264.277 190.567 . -2 LOG L 258.979 179.970 79.009 with 1 DF (p=0.0001) Score . . 68.582 with 1 DF (p=0.0001) Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr > Standardized Odds Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio INTERCPT 1 11.2382 1.5660 51.5010 0.0001 . . DOSE 1 3.9365 0.5608 49.2757 0.0001 0.926164 51.238 Association of Predicted Probabilities and Observed Responses Concordant = 78.5% Somers' D = 0.695 Discordant = 9.0% Gamma = 0.793 Tied = 12.5% Tau-a = 0.318 (9100 pairs) c = 0.847The output gives estimates and standard errors for and and tests the hypotheses that and that by a technique called a Wald test and by two other likelihood methods. All the tests agree with what we already knew from looking at the data - there is a clear effect of dose on death rate. Probability more useful is the confidence interval estimate for which can be calculated from the estimate and the standard error.
Now here is a plot of the proportion surviving against the log dose with the fitted curve
superimposed.