Example 36.2: Computing Predicted Values for a Tobit Model

The LIFEREG procedure can be used to perform a Tobit analysis. The Tobit model, described by Tobin (1958), is a regression model for left-censored data that assumes a normally distributed error term. The model parameters are estimated by maximum likelihood. PROC LIFEREG provides estimates of the parameters of the distribution of the uncensored data. Refer to Greene (1993) and Maddala (1983) for a more complete discussion of censored normal data and related distributions. This example shows how you can use PROC LIFEREG and the DATA step to compute two of the three types of predicted values discussed there.

Consider a continuous random variable Y, and a constant C. If you were to sample from the distribution of Y but discard values less than (greater than) C, the distribution of the remaining observations would be truncated on the left (right). If you were to sample from the distribution of Y and report values less than (greater than) C as C, the distribution of the sample would be left (right) censored.
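The difference between the two sampling schemes can be sketched with a short DATA step. This is only an illustration: the standard normal distribution, the cutoff C=0, the seed, the sample size, and the data set and variable names (TRUNCATED, CENSORED, Ycens) are all assumptions made for the sketch and are not part of the example that follows.

```sas
data truncated censored;
   call streaminit(2718);          /* arbitrary seed, for reproducibility only */
   C = 0;                          /* assumed cutoff */
   do i = 1 to 1000;
      Y = rand('NORMAL');          /* standard normal draw */
      Ycens = max(Y, C);           /* left censoring: values below C are reported as C */
      output censored;             /* every draw is kept in the censored sample */
      if Y >= C then output truncated;   /* left truncation: draws below C are discarded */
   end;
   keep Y Ycens;
run;
```

The CENSORED data set keeps all 1000 draws, with a point mass at C in Ycens, while TRUNCATED holds only the draws that were not discarded, so its size is itself random.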

The probability density function of the left-truncated random variable Y' is given by

$f_{Y'}(y) = \frac{f_Y(y)}{\Pr(Y > C)} \quad \mbox{for } y > C$

where f_Y( y ) is the probability density function of Y. PROC LIFEREG cannot compute the proper likelihood function to estimate parameters or predicted values for a truncated distribution.
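
As a worked illustration (an assumed special case, not taken from the example data), let Y be standard normal and let C = 0. Then $\Pr(Y > 0) = 1/2$, so the left-truncated density is

$f_{Y'}(y) = \frac{\phi(y)}{1/2} = 2\phi(y) \quad \mbox{for } y > 0$

where $\phi$ is the standard normal density. Dividing by $\Pr(Y > C)$ rescales the retained part of the density so that it again integrates to one: $\int_0^\infty 2\phi(y)\,dy = 1$.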

Suppose the model being fit is specified as follows:

${\rm Y}_i^\ast = x_{i}'{\beta} + \epsilon_i$

where $\epsilon_i$ is a normal error term with zero mean and standard deviation $\sigma$ .

Define the censored random variable Y_i as

${\rm Y}_i = \left\{ \begin{array}{ll} 0 & \mbox{if } {\rm Y}_i^\ast \leq 0 \\ {\rm Y}_i^\ast & \mbox{if } {\rm Y}_i^\ast > 0 \end{array} \right.$

This is the Tobit model for left-censored normal data. ${\rm Y}_i^\ast$ is sometimes called the latent variable. PROC LIFEREG estimates parameters of the distribution of ${\rm Y}_i^\ast$ by maximum likelihood.
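
For reference, the log likelihood of this model in its standard textbook form can be written as follows. This is a sketch of the usual Tobit likelihood; the exact parameterization that PROC LIFEREG uses internally is not described in this example.

$\ln L = \sum_{{\rm Y}_i > 0} \ln\left[ \frac{1}{\sigma}\, \phi\left( \frac{{\rm Y}_i - x_{i}'{\beta}}{\sigma} \right) \right] + \sum_{{\rm Y}_i = 0} \ln \Phi\left( -\frac{x_{i}'{\beta}}{\sigma} \right)$

Here $\phi$ and $\Phi$ are the standard normal density and cumulative distribution functions. Each uncensored observation contributes a normal density term, and each censored observation contributes the probability $\Pr({\rm Y}_i^\ast \leq 0)$.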

You can use the LIFEREG procedure to compute predicted values based on the mean functions of the latent and observed variables. The mean of the latent variable ${\rm Y}_i^\ast$ is $x_{i}'{\beta}$, and you can compute values of the mean for different settings of $x_i$ by specifying XBETA=variable-name in an OUTPUT statement. Estimates of $x_{i}'{\beta}$ for each observation are written to the OUT= data set. Predicted values of the observed variable ${\rm Y}_i$ can be computed based on the mean

$E({\rm Y}_i) = \Phi\left(\frac{x_{i}'{\beta}}{\sigma}\right) \left( x_{i}'{\beta} + \sigma\lambda_i \right)$

where

$\lambda_i = \frac{\phi(x_{i}'{\beta}/\sigma)}{\Phi(x_{i}'{\beta}/\sigma)}$

and $\phi$ and $\Phi$ represent the standard normal probability density and cumulative distribution functions, respectively.
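
The expression for $E({\rm Y}_i)$ follows from conditioning on whether the latent variable is positive. A short derivation, using the standard result for the mean of a truncated normal variable, is sketched here:

$E({\rm Y}_i) = \Pr({\rm Y}_i^\ast \leq 0) \cdot 0 + \Pr({\rm Y}_i^\ast > 0)\, E({\rm Y}_i^\ast \mid {\rm Y}_i^\ast > 0)$

$\Pr({\rm Y}_i^\ast > 0) = \Phi\left(\frac{x_{i}'{\beta}}{\sigma}\right), \qquad E({\rm Y}_i^\ast \mid {\rm Y}_i^\ast > 0) = x_{i}'{\beta} + \sigma\lambda_i$

Multiplying the two factors gives the mean shown above. Because $\Phi(x_{i}'{\beta}/\sigma)\,\lambda_i = \phi(x_{i}'{\beta}/\sigma)$, the same quantity can be written as

$E({\rm Y}_i) = \Phi\left(\frac{x_{i}'{\beta}}{\sigma}\right) x_{i}'{\beta} + \sigma\, \phi\left(\frac{x_{i}'{\beta}}{\sigma}\right)$

which is positive even when $x_{i}'{\beta}$ is negative, as expected for a variable that cannot fall below zero.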

The following table shows a subset of the Mroz (1987) data set. In these data, Hours is the number of hours the wife worked outside the household in a given year, Yrs_Ed is the years of education, and Yrs_Exp is the years of work experience. A Tobit model is fit to the hours worked, with years of education and experience as covariates.

Hours	Yrs_Ed	Yrs_Exp
0	8	9
0	8	12
0	9	10
0	10	15
0	11	4
0	11	6
1000	12	1
1960	12	29
0	13	3
2100	13	36
3686	14	11
1920	14	38
0	15	14
1728	16	3
1568	16	19
1316	17	7
0	17	15

If the wife was not employed (worked 0 hours), her hours worked will be left censored at zero. In order to accommodate left censoring in PROC LIFEREG, you need two variables to indicate censoring status of observations. You can think of these variables as lower and upper endpoints of interval censoring. If there is no censoring, set both variables to the observed value of Hours. To indicate left censoring, set the lower endpoint to missing and the upper endpoint to the censored value, zero in this case.

The following statements create a SAS data set with the variables Hours, Yrs_Ed, and Yrs_Exp from the data above. A new variable, Lower, is created such that Lower=. if Hours=0 and Lower=Hours if Hours>0.

   data subset;
      input Hours Yrs_Ed Yrs_Exp @@;
      if Hours eq 0 
         then Lower=.;
         else Lower=Hours;
   datalines;
   0 8 9 0 8 12 0 9 10 0 10 15 0 11 4 0 11 6 
   1000 12 1 1960 12 29 0 13 3 2100 13 36 
   3686 14 11 1920 14 38 0 15 14 1728 16 3
   1568 16 19 1316 17 7 0 17 15
   ;
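
Before fitting the model, it can be worth confirming that the censoring indicator was built as intended. The following step is a minimal check, assuming the SUBSET data set has just been created by the statements above; it is not part of the original example.

```sas
/* Count nonmissing and missing values of the two endpoint variables.   */
/* A missing Lower marks a left-censored observation (Hours equal to 0). */
proc means data=subset n nmiss;
   var Lower Hours;
run;
```

For these 17 observations, Lower should show 8 nonmissing and 9 missing values, and Hours should have no missing values. The 9 missing values of Lower correspond to the 9 rows with Hours=0 in the table above.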

The following statements fit a normal regression model to the left-censored Hours data using Yrs_Ed and Yrs_Exp as covariates. You need the estimated standard deviation of the normal distribution to compute the predicted values of the censored distribution from the formulas above. The data set OUTEST contains the standard deviation estimate in a variable named _SCALE_. You also need estimates of $x_{i}'{\beta}$. These are contained in the data set OUT as the variable Xbeta.

   proc lifereg data=subset outest=OUTEST(keep=_scale_);
      model (lower, hours) = yrs_ed yrs_exp / d=normal;
      output out=OUT xbeta=Xbeta;
   run;

Output 36.2.1 shows the results of the model fit. These tables show parameter estimates for the uncensored, or latent variable, distribution.

Output 36.2.1: Parameter Estimates from PROC LIFEREG

The LIFEREG Procedure

Model Information
Data Set	WORK.SUBSET
Dependent Variable	Lower
Dependent Variable	Hours
Number of Observations	17
Noncensored Values	8
Right Censored Values	0
Left Censored Values	9
Interval Censored Values	0
Name of Distribution	NORMAL
Log Likelihood	-74.9369977

Analysis of Parameter Estimates
Variable	DF	Estimate	Standard Error	Chi-Square	Pr > ChiSq	Label
Intercept	1	-5598.6	2850.2	3.8583	0.0495	Intercept
Yrs_Ed	1	373.14771	191.88717	3.7815	0.0518
Yrs_Exp	1	63.33711	38.36317	2.7258	0.0987
Scale	1	1582.9	442.67318			Normal scale
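
As a worked check of how these estimates produce the latent mean, consider the first observation in the data (Yrs_Ed = 8, Yrs_Exp = 9). Using the displayed estimates, which are rounded:

$x_{1}'\hat{\beta} = -5598.6 + 373.14771 \times 8 + 63.33711 \times 9 = -5598.6 + 2985.18168 + 570.03399 \approx -2043.38$

This agrees with the value of about -2043.42 that PROC LIFEREG writes to the OUT= data set for that observation; the small difference is consistent with the intercept being displayed to only one decimal place.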

The following statements combine the two data sets created by PROC LIFEREG to compute predicted values for the censored distribution. The OUTEST= data set contains the estimate of the standard deviation from the uncensored distribution, and the OUT= data set contains estimates of $x_{i}'{\beta}$.

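   data predict;
      drop lambda _scale_ _prob_;
      set out;
      /* Read the single OUTEST row once; _scale_ is then retained for every row of OUT */
      if _n_ eq 1 then set outest;
      /* lambda = phi(z)/Phi(z), where z = Xbeta/_scale_ */
      lambda = pdf('NORMAL',Xbeta/_scale_) 
               / cdf('NORMAL',Xbeta/_scale_);
      /* Mean of the censored variable: Phi(z)*(Xbeta + sigma*lambda) */
      Predict = cdf('NORMAL', Xbeta/_scale_) 
                * (Xbeta + _scale_*lambda);
      label Xbeta='MEAN OF UNCENSORED VARIABLE'
            Predict = 'MEAN OF CENSORED VARIABLE';
   run;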

   proc print data=predict noobs label;
      var hours lower yrs: xbeta predict;
   run;

Output 36.2.2 shows the original variables, the predicted means of the uncensored distribution, and the predicted means of the censored distribution.

Output 36.2.2: Predicted Means from PROC LIFEREG

Hours	Lower	Yrs_Ed	Yrs_Exp	MEAN OF UNCENSORED VARIABLE	MEAN OF CENSORED VARIABLE
0	.	8	9	-2043.42	73.46
0	.	8	12	-1853.41	94.23
0	.	9	10	-1606.94	128.10
0	.	10	15	-917.10	276.04
0	.	11	4	-1240.67	195.76
0	.	11	6	-1113.99	224.72
1000	1000	12	1	-1057.53	238.63
1960	1960	12	29	715.91	1052.94
0	.	13	3	-557.71	391.42
2100	2100	13	36	1532.42	1672.50
3686	3686	14	11	322.14	805.58
1920	1920	14	38	2032.24	2106.81
0	.	15	14	885.30	1170.39
1728	1728	16	3	561.74	951.69
1568	1568	16	19	1575.13	1708.24
1316	1316	17	7	1188.23	1395.61
0	.	17	15	1694.93	1809.97
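
A single row of this output can be spot-checked with a short DATA step. The sketch below is not part of the original example; it types in the rounded values for the first observation by hand and uses the algebraically equivalent form of the censored mean, in which the product of the cumulative distribution function and lambda reduces to the density.

```sas
data _null_;
   /* Values copied by hand from the printed output, so they are rounded */
   xbeta = -2043.42;     /* latent mean, first row of Output 36.2.2 */
   sigma = 1582.9;       /* Scale estimate from Output 36.2.1 */
   z = xbeta / sigma;
   /* Phi(z)*(xbeta + sigma*lambda) simplifies to Phi(z)*xbeta + sigma*phi(z) */
   predict = cdf('NORMAL', z) * xbeta + sigma * pdf('NORMAL', z);
   put z= predict=;      /* write the results to the SAS log */
run;
```

The value written to the log should be close to the 73.46 reported for the first observation. Small differences are expected because the inputs here are rounded, whereas the PREDICT data set uses the full-precision estimates.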
