STAT 350: Lecture 31
Heteroscedastic Errors
If plots and/or tests show that the error variances depend on i, there are several standard approaches to fixing the problem, depending on the nature of the dependence.
This usually arises realistically in the following situations:
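Transformation followed by ordinary least squares fits a linear model to the mean of a transformed response, $E[g(Y_i)] = x_i^T\beta$, while generalized linear models use $g(E[Y_i]) = x_i^T\beta$.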
Generally the latter approach offers more flexibility, since it is then possible to model the variance as a general function of the mean, while for transformation followed by ordinary least squares the transformed data must follow a homoscedastic linear model.
Weighted Least Squares
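If
$$Y_i = x_i^T\beta + \epsilon_i$$
and
$$\mathrm{Var}(\epsilon_i) = \sigma^2 / w_i$$
for known weights $w_i > 0$, and the errors are independent with normal distributions, then the likelihood is
$$L(\beta,\sigma) = \prod_{i=1}^n \frac{\sqrt{w_i}}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{w_i (Y_i - x_i^T\beta)^2}{2\sigma^2}\right\}$$
To choose $\beta$ to maximize this likelihood we minimize the quantity
$$\sum_{i=1}^n w_i (Y_i - x_i^T\beta)^2$$
The process is called weighted least squares.
Algebraically it is easy to see how to do the minimization. Rewrite the quantity to be minimized as
$$\sum_{i=1}^n \left(w_i^{1/2} Y_i - w_i^{1/2} x_i^T\beta\right)^2$$
This is just an ordinary least squares problem with the response variable being
$$Y_i^* = w_i^{1/2} Y_i$$
and the covariates being
$$x_i^* = w_i^{1/2} x_i$$
The calculation can be written in matrix form. If $W^{1/2}$ is a diagonal matrix with $w_i^{1/2}$ in the $i$th diagonal position then put $Y^* = W^{1/2}Y$ and $X^* = W^{1/2}X$. Then
$$Y = X\beta + \epsilon$$
becomes
$$Y^* = X^*\beta + \epsilon^*$$
where $\epsilon^* = W^{1/2}\epsilon$. If $\epsilon$ had mean 0, independent entries and $\mathrm{Var}(\epsilon_i) = \sigma^2/w_i$ then $\epsilon^*$ has mean 0, independent entries and $\mathrm{Var}(\epsilon_i^*) = \sigma^2$, so that ordinary multiple regression theory applies. The estimate of $\beta$ is
$$\hat\beta_w = (X^{*T}X^*)^{-1}X^{*T}Y^* = (X^TWX)^{-1}X^TWY$$
where now $W$ is a diagonal matrix with $w_i$ on the diagonal. This estimate is unbiased and has variance covariance matrix
$$\sigma^2 (X^TWX)^{-1}$$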
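The equivalence between weighted least squares and ordinary least squares on rescaled data can be checked numerically. The following is a minimal sketch in plain Python, assuming one covariate plus an intercept. The data, the weights and the helper name `solve_normal_equations` are made up for illustration and are not part of the course material.

```python
import math

def solve_normal_equations(X, y, w=None):
    """Solve (X^T W X) b = X^T W y for a design matrix X with two columns.

    X is a list of rows [x_i1, x_i2]; w defaults to all ones,
    which gives ordinary least squares.
    """
    if w is None:
        w = [1.0] * len(y)
    # Entries of the symmetric 2x2 matrix X^T W X and of the vector X^T W y.
    a11 = sum(wi * r[0] * r[0] for wi, r in zip(w, X))
    a12 = sum(wi * r[0] * r[1] for wi, r in zip(w, X))
    a22 = sum(wi * r[1] * r[1] for wi, r in zip(w, X))
    c1 = sum(wi * r[0] * yi for wi, r, yi in zip(w, X, y))
    c2 = sum(wi * r[1] * yi for wi, r, yi in zip(w, X, y))
    det = a11 * a22 - a12 * a12
    # Explicit inverse of the 2x2 matrix applied to (c1, c2).
    return [(a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det]

# Made-up illustrative data (NOT taken from the SENIC data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.3]
w = [1.0, 2.0, 4.0, 2.0, 1.0]
X = [[1.0, xi] for xi in x]          # intercept column plus one covariate

# Route 1: weighted normal equations, (X^T W X)^{-1} X^T W Y.
beta_weighted = solve_normal_equations(X, y, w)

# Route 2: ordinary least squares after multiplying each row by sqrt(w_i).
X_star = [[math.sqrt(wi) * v for v in row] for wi, row in zip(w, X)]
y_star = [math.sqrt(wi) * yi for wi, yi in zip(w, y)]
beta_rescaled = solve_normal_equations(X_star, y_star)

agree = all(math.isclose(a, b, rel_tol=1e-9)
            for a, b in zip(beta_weighted, beta_rescaled))
print(beta_weighted, beta_rescaled, agree)
```

Both routes should return the same coefficients up to rounding error, which is the content of the rescaling argument above. A real analysis would use a regression package rather than a hand-written 2x2 solve.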
Example
It is possible to do weighted least squares in SAS fairly easily. As an example we consider the SENIC data set, taking the variance of RISK to be proportional to 1/CENSUS. (Motivation: RISK is an estimated proportion; the variance of a Binomial proportion is inversely proportional to the sample size. This makes the weight just CENSUS.)
proc reg data=scenic;
  model Risk = Culture Stay Nratio Chest Facil;
  weight Census;
run;
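The choice of weight in the WEIGHT statement can be written out as a one-line calculation. This is only a sketch under an idealised assumption: RISK for hospital $i$ is treated as a Binomial proportion $\hat p_i$ based on $n_i$ patients, with $n_i$ proportional to CENSUS and $p_i(1-p_i)$ roughly constant across hospitals. The symbols $\hat p_i$, $p_i$, $n_i$ and $w_i$ are introduced here for illustration.

```latex
\mathrm{Var}(\hat p_i) = \frac{p_i(1-p_i)}{n_i} \approx \frac{\sigma^2}{n_i}
\quad\Longrightarrow\quad
w_i \propto \frac{1}{\mathrm{Var}(\hat p_i)} \propto n_i \propto \mathrm{CENSUS}_i .
```

Weights only matter up to a constant multiple for the coefficient estimates, so using CENSUS itself is enough.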
EDITED OUTPUT

Dependent Variable: RISK

Analysis of Variance

Source     DF  Sum of Squares  Mean Square  F Value  Prob>F
Model       5     12876.94280   2575.38856   17.819  0.0001
Error     107     15464.46721    144.52773
C Total   112     28341.41001

Root MSE   12.02197    R-square   0.4544
Dep Mean    4.76215    Adj R-sq   0.4289
C.V.      252.44833

Parameter Estimates

Variable  DF  Parameter Estimate  Standard Error  T for H0: Parameter=0  Prob > |T|
INTERCEP   1            0.468108      0.62393433                  0.750      0.4547
CULTURE    1            0.030005      0.00891714                  3.365      0.0011
STAY       1            0.237420      0.04444810                  5.342      0.0001
NRATIO     1            0.623850      0.34803271                  1.793      0.0759
CHEST      1            0.003547      0.00444160                  0.799      0.4263
FACIL      1            0.008854      0.00603368                  1.467      0.1452

EDITED OUTPUT FOR UNWEIGHTED CASE
Dependent Variable: RISK

Analysis of Variance

Source     DF  Sum of Squares  Mean Square  F Value  Prob>F
Model       5       108.32717     21.66543   24.913  0.0001
Error     107        93.05266      0.86965
C Total   112       201.37982

Root MSE    0.93255    R-square   0.5379
Dep Mean    4.35487    Adj R-sq   0.5163
C.V.       21.41399

Parameter Estimates

Variable  DF  Parameter Estimate  Standard Error  T for H0: Parameter=0  Prob > |T|
INTERCEP   1           -0.768043      0.61022741                 -1.259      0.2109
CULTURE    1            0.043189      0.00984976                  4.385      0.0001
STAY       1            0.233926      0.05741114                  4.075      0.0001
NRATIO     1            0.672403      0.29931440                  2.246      0.0267
CHEST      1            0.009179      0.00540681                  1.698      0.0925
FACIL      1            0.018439      0.00629673                  2.928      0.0042
Transformation
Sometimes the response variable will have a distribution which makes it likely that the errors will be far from normal and will not be homoscedastic. Typical examples:
Example: For each of several doses, a number of animals are treated with that dose of some drug. The number, Y, dying at dose d is Binomial with death probability h(d).
The traditional analysis method is to try transformation:
BIGGEST PROBLEM: If the model was linear before transformation then it will not be linear after transformation.
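The notes do not say at this point which transformation is intended. As one classical illustration for the Binomial example, the sketch below assumes the arcsine square-root transform, a standard variance-stabilising choice for proportions that is swapped in here purely for illustration. It computes exact variances from the Binomial probabilities. The sample size n = 50, the two probabilities and the helper names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def variance_of(transform, n, p):
    """Exact variance of transform(Y / n) when Y ~ Binomial(n, p)."""
    values = [transform(k / n) for k in range(n + 1)]
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(v * q for v, q in zip(values, probs))
    return sum(q * (v - mean) ** 2 for v, q in zip(values, probs))

def identity(x):
    return x

def arcsine_sqrt(x):
    # Assumed variance-stabilising transform for a proportion.
    return math.asin(math.sqrt(x))

n = 50  # hypothetical number of animals per dose
results = {}
for p in (0.2, 0.5):  # hypothetical death probabilities h(d)
    raw = variance_of(identity, n, p)
    stabilised = variance_of(arcsine_sqrt, n, p)
    results[p] = (raw, stabilised)
    print(f"p={p}: Var(Y/n)={raw:.5f}, "
          f"Var(asin(sqrt(Y/n)))={stabilised:.5f}, 1/(4n)={1 / (4 * n):.5f}")
```

The variance of the raw proportion changes with p, while the transformed variance stays close to 1/(4n) for both values. The price is the one named above: if h(d) is linear in d, the mean of the transformed response is not.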