Example 64.5: Computing a Bootstrap Confidence Interval

The TPSPLINE Procedure

Example 64.5: Computing a Bootstrap Confidence Interval

The following example illustrates how you can construct a bootstrap confidence interval by using the multiple responses feature in PROC TPSPLINE.

Numerous epidemiological observations have indicated that exposure to solar radiation is an important factor in the etiology of melanoma. The following data present age-adjusted melanoma incidences for 37 years from the Connecticut Tumor Registry (Houghton, Flannery, and Viola 1980). The data are analyzed by Ramsay and Silverman (1997).

   data melanoma; 
      input  year incidences @@;
      datalines;
      1936    0.9   1937   0.8  1938   0.8  1939   1.3
      1940    1.4   1941   1.2  1942   1.7  1943   1.8
      1944    1.6   1945   1.5  1946   1.5  1947   2.0
      1948    2.5   1949   2.7  1950   2.9  1951   2.5
      1952    3.1   1953   2.4  1954   2.2  1955   2.9
      1956    2.5   1957   2.6  1958   3.2  1959   3.8
      1960    4.2   1961   3.9  1962   3.7  1963   3.3
      1964    3.7   1965   3.9  1966   4.1  1967   3.8
      1968    4.7   1969   4.4  1970   4.8  1971   4.8
      1972    4.8
      ;
   run;

The variable incidences records the number of melanoma cases per 100,000 people for the years 1936 to 1972. The following model fits the data and requests a 90% Bayesian confidence interval along with the estimate.

   proc tpspline data=melanoma;
      model incidences = (year) /alpha = 0.1;
      output out = result pred uclm lclm;
   run;

The output is displayed in Output 64.5.1

Output 64.5.1: Output from PROC TPSPLINE for the Melanoma Data Set

The TPSPLINE Procedure

Dependent Variable: incidences

Summary of Input Data Set
Number of Non-Missing Observations	37
Number of Missing Observations	0
Unique Smoothing Design Points	37

Summary of Final Model
Number of Regression Variables	0
Number of Smoothing Variables	1
Order of Derivative in the Penalty	2
Dimension of Polynomial Space	2

Summary Statistics of Final Estimation
*log10(nLambda)**	-0.060735
Smoothing Penalty	0.517132
Residual SS	1.224264
Tr(I-A)	22.585173
Model DF	14.414827
Standard Deviation	0.232823

The following statements produce a plot of the estimated curve:

   symbol1 h=1pct ;
   symbol2 i=join v=none;
   symbol3 i=join v=none;
   symbol4 i=join v=none c=cyan;

   legend1 frame cframe=ligr cborder=black 
           label=none position=center;
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;

   title1 'Age-adjusted Melanoma Incidences for 37 years';

   proc gplot data=result;
      plot  incidences*year=1
            p_incidences*year=2
            lclm_incidences*year=3
            uclm_incidences*year=4 /overlay legend=legend1 
                                    vaxis=axis1 haxis=axis2
                                    frame cframe=ligr;
   run;

The estimated curve is displayed with 90% confidence interval bands in Output 64.5.2. The number of melanoma incidences exhibits a periodic pattern and increases over the years. The periodic pattern is related to sunspot activity and the accompanying fluctuations in solar radiation.

Output 64.5.2: TPSPLINE Estimate and 90% Confidence Interval of Melanoma Data

Wang and Wahba (1995) compared several bootstrap confidence intervals to Bayesian confidence intervals for smoothing splines. Both bootstrap and Bayesian confidence intervals are across-the-curve intervals, not point-wise intervals. They concluded that bootstrap confidence intervals work as well as Bayesian intervals concerning average coverage probability. Additionally, bootstrap confidence intervals appear to be better for small sample sizes. Based on their simulation, the "percentile-t interval" bootstrap interval performs better than the other types of bootstrap intervals.

Suppose that $\hat{f}_{\hat{\lambda}}$ and $\hat{\sigma}$ are the estimates of f and $\sigma$ from the data. Assume that $\hat{f}_{\hat{\lambda}}$ is the "true" f, and generate the bootstrap sample as follows:

$y_i^* = \hat{f}_{\hat{\lambda}}(x_i) + \epsilon_i^*, i=1, ... , n$

where ${\epsilon}^* = (\epsilon_1^*, ... ,\epsilon_n^*)^T \approx N(0,\hat{\sigma}I_{nx n})$ . Denote $f^*_{\hat{\lambda}}(x_i)$ as the random variable of the bootstrap estimate at x_i. Repeat this process K times, so that at each point x_i, you have K bootstrap estimates $\hat{f}_{\hat{\lambda}}(x_i)$ or K realizations of $f^*_{\hat{\lambda}}(x_i)$ . For each fixed x_i, consider the following statistic D_i^*, which is similar to a Student's t statistic:

$D_i^* = (f^*_{\hat{\lambda}}(x_i) - \hat{f}_{\hat{\lambda}}(x_i)) /\hat{\sigma_i}^*$

where $\hat{\sigma_i}^*$ is the estimate of $\hat{\sigma}$ based on the ith bootstrap sample.

Suppose $\chi_{\alpha/2}$ and $\chi_{1-\alpha/2}$ are the lower and upper $\alpha/2$ points of the empirical distribution of D_i^*. The $(1-\alpha)$ 100% bootstrap confidence interval is defined as

$(\hat{f}_{\hat{\lambda}}(x_i) - \chi_{1-\alpha/2} \hat{\sigma}, \hat{f}_{\hat{\lambda}}(x_i) - \chi_{\alpha/2} \hat{\sigma} )$

Bootstrap confidence intervals are easy to interpret and can be used with any distribution. However, because they require K model fits, their construction is computationally intensive.

The multiple dependent variables feature in PROC TPSPLINE enables you to fit multiple models with the same independent variables. The procedure calculates the matrix decomposition part of the calculations only once regardless of the number of dependent variables in the model. These calculations are responsible for most of the computing time used by the TPSPLINE procedure. This feature is particularly useful when you need to generate a bootstrap confidence interval. To construct a bootstrap confidence interval, perform the following tasks:

Fit the data using PROC TPSPLINE and obtain estimates $\hat{f}_{\hat{\lambda}}(x_i)$ and $\hat{\sigma}$ .
Generate K bootstrap samples based on $\hat{f}_{\hat{\lambda}}(x_i)$ and $\hat{\sigma}$ .
Fit the K bootstrap samples with the TPSPLINE procedure to obtain estimates of $\hat{f}^*_{\hat{\lambda}}(x_i)$ and $\hat{\sigma_i}^*$ .
Compute D_i^* and the values $\chi_{\alpha/2}$ and $\chi_{1-\alpha/2}$ .

The following statements illustrate this process:

   proc tpspline data=melanoma;
      model incidences = (year) /alpha = 0.05;
      output out = result pred uclm lclm;
   run;

The output from the initial PROC TPSPLINE analysis is displayed in Output 64.5.3. The data set result contains the predicted values and confidence limits from the analysis.

Output 64.5.3: Output from PROC TPSPLINE for the Melanoma Data Set

The TPSPLINE Procedure

Dependent Variable: incidences

Summary of Input Data Set
Number of Non-Missing Observations	37
Number of Missing Observations	0
Unique Smoothing Design Points	37

Summary of Final Model
Number of Regression Variables	0
Number of Smoothing Variables	1
Order of Derivative in the Penalty	2
Dimension of Polynomial Space	2

Summary Statistics of Final Estimation
*log10(nLambda)**	-0.060735
Smoothing Penalty	0.517132
Residual SS	1.224264
Tr(I-A)	22.585173
Model DF	14.414827
Standard Deviation	0.232823

The following statements illustrate how you can obtain a bootstrap confidence interval for the Melanoma data. The following statements create the data set bootstrap. The observations are created with information from the preceding PROC TPSPLINE execution; as displayed in Output 64.5.3, $\hat{\sigma} = 0.232823$ . The values of $\hat{f}_{\hat{\lambda}}(x_i)$ are stored in the data set result in the variable P_incidence.

   data bootstrap; 
      set result;
      array y{1070} y1-y1070;
      do i=1 to 1070;
         y{i} = p_incidences + 0.232823*rannor(123456789);
      end;
      keep y1-y1070 p_incidences year;
   run;

  ods listing close;

   proc tpspline data=bootstrap;
      ods output FitStatistics=FitResult;
      id p_incidences;
      model y1-y1070 = (year);
      output out=result2;
   run;
   ods listing;

The DATA step generates 1,070 bootstrap samples based on the previous estimate from PROC TPSPLINE. For this data set, some of the bootstrap samples result in $\lambda$ s (selected by the GCV function) that cause problematic behavior. Thus, an additional 70 bootstrap samples are generated.

The ODS listing destination is closed before PROC TPSPLINE is invoked. The model fits all the y1 -y1070 variables as dependent variables, and the models are fit for all bootstrap samples simultaneously. The output data set result2 contains the variables year, y1 -y1070, p_y1 -p_y1070, and P_incidences.

The ODS OUTPUT statement writes the FitStatistics table to the data set FitResult. The data set FitResult contains the two variables. They are Parameter and Value. The FitResult data set is used in subsequent calculations for D_i^*.

In the data set FitResult, there are 63 estimates with a standard deviation of zero, suggesting that the estimates provide perfect fits of the data and are caused by $\hat{\lambda}$ s that are approximately equal to zero. For small sample sizes, there is a positive probability that the $\lambda$ chosen by the GCV function will be zero (refer to Wang and Wahba 1995).

In the following steps, these cases are removed from the bootstrap samples as "bad" samples: they represent failure of the GCV function.

The following SAS statements manipulate the data set FitResult, retaining the standard deviations for all bootstrap samples and merging FitResult with the data set result2, which contains the estimates for bootstrap samples. In the final data set boot, the D^*_i statistics are calculated.

   data FitResult; set FitResult;
      if Parameter="Standard Deviation";
      keep Value;
   run;

   proc transpose data=sum2 out=sd prefix=sd;

   data result2; 
      if _N_ = 1 then set sd;
      set result2;

   data boot; 
      set result2;
      array y{1070}  p_y1-p_y1070;
      array sd{1070} sd1-sd1070;
      do i=1 to 1070;
         if sd{i} > 0 then do;
            d = (y{i} - P_incidences)/sd{i};
            obs = _N_;
            output;
         end;
      end;
      keep d obs P_incidences year;
   run;

The following SAS statements retain the first 1000 bootstrap samples and calculate the values $\chi_{\alpha/2}$ and $\chi_{1-\alpha/2}$ with $\alpha=0.1$ .

   proc sort data=boot; 
      by obs;
   run;

   data boot; 
      set boot; 
         by obs;
      retain n;

      if first.obs then n=1;
         else n=n+1;
      if n > 1000 then delete;
   run;


   proc sort data=boot; 
      by obs error;
   run;

   data chi1 chi2 ; 
      set boot;
      if (_N_ = (obs-1)*1000+50)  then output chi1;
      if (_N_ = (obs-1)*1000+950) then output chi2; 
   run;

   proc sort data=result; 
      by year;
   run;

   proc sort data=chi1;   
      by year;
   run;

   proc sort data=chi2;   
      by year;
   run;

   data result; 
      merge result
         chi1(rename=(d=chi05))
         chi2(rename=(d=chi95));
      keep year incidences P_incidences lower upper 
           LCLM_incidences UCLM_incidences;
   
      lower = -chi95*0.232823 + P_incidences;
      upper = -chi05*0.232823 + P_incidences;
   
      label  lower="Lower 95% CL (Bootstrap)"
             upper="Upper 95% CL (Bootstrap)"
             lclm_incidences="Lower 95% CL (Bayesian)"
             uclm_incidences="Upper 95% CL (Bayesian)";
   run;

The data set result contains the variables year, incidences, the TPSPLINE estimate P_incidences, and the 90% Bayesian and 90% bootstrap confidence intervals.

The following statements produce Output 64.5.4:

   symbol1  v=dot  h=1pct ;
   symbol2  i=join v=none l=1;
   symbol3  i=join v=none l=33;
   symbol4  i=join v=none l=33;
   symbol5  i=join v=none l=43 c=green;
   symbol6  i=join v=none l=43 c=green;

   title1 'Age-adjusted Melanoma Incidences for 37 years';
   proc gplot data=result;
      plot       incidences * year=1
               p_incidences * year=2
            lclm_incidences * year=3
            uclm_incidences * year=3
                      lower * year=4
                      upper * year=4
            /overlay      legend=legend1 
             vaxis=axis1  haxis=axis2
             frame        cframe=ligr;
   run;

Output 64.5.4 displays the plot of the variable incidences, the predicted values, and the Bayesian and bootstrap confidence intervals.

The plot shows that the bootstrap confidence interval is similar to the Bayesian confidence interval. However, the Bayesian confidence interval is symmetric around the estimates, while the bootstrap confidence interval is not.

Output 64.5.4: Comparison of Bayesian and Bootstrap Confidence Interval for Melanoma Data

Chapter Contents
Previous
Next
Top