Chapter Contents |
Previous |
Next |
The TPSPLINE Procedure |
Numerous epidemiological observations have indicated that exposure to solar radiation is an important factor in the etiology of melanoma. The following data present age-adjusted melanoma incidences for 37 years from the Connecticut Tumor Registry (Houghton, Flannery, and Viola 1980). The data are analyzed by Ramsay and Silverman (1997).
data melanoma; input year incidences @@; datalines; 1936 0.9 1937 0.8 1938 0.8 1939 1.3 1940 1.4 1941 1.2 1942 1.7 1943 1.8 1944 1.6 1945 1.5 1946 1.5 1947 2.0 1948 2.5 1949 2.7 1950 2.9 1951 2.5 1952 3.1 1953 2.4 1954 2.2 1955 2.9 1956 2.5 1957 2.6 1958 3.2 1959 3.8 1960 4.2 1961 3.9 1962 3.7 1963 3.3 1964 3.7 1965 3.9 1966 4.1 1967 3.8 1968 4.7 1969 4.4 1970 4.8 1971 4.8 1972 4.8 ; run;
The variable incidences records the number of melanoma cases per 100,000 people for the years 1936 to 1972. The following model fits the data and requests a 90% Bayesian confidence interval along with the estimate.
proc tpspline data=melanoma; model incidences = (year) /alpha = 0.1; output out = result pred uclm lclm; run;
The output is displayed in Output 64.5.1
Output 64.5.1: Output from PROC TPSPLINE for the Melanoma Data Setsymbol1 h=1pct ; symbol2 i=join v=none; symbol3 i=join v=none; symbol4 i=join v=none c=cyan; legend1 frame cframe=ligr cborder=black label=none position=center; axis1 label=(angle=90 rotate=0) minor=none; axis2 minor=none; title1 'Age-adjusted Melanoma Incidences for 37 years'; proc gplot data=result; plot incidences*year=1 p_incidences*year=2 lclm_incidences*year=3 uclm_incidences*year=4 /overlay legend=legend1 vaxis=axis1 haxis=axis2 frame cframe=ligr; run;
The estimated curve is displayed with 90% confidence interval bands in Output 64.5.2. The number of melanoma incidences exhibits a periodic pattern and increases over the years. The periodic pattern is related to sunspot activity and the accompanying fluctuations in solar radiation.
Output 64.5.2: TPSPLINE Estimate and 90% Confidence Interval of Melanoma DataSuppose that and are the estimates of f and from the data. Assume that is the "true" f, and generate the bootstrap sample as follows:
where . Denote as the random variable of the bootstrap estimate at xi. Repeat this process K times, so that at each point xi, you have K bootstrap estimates or K realizations of . For each fixed xi, consider the following statistic Di*, which is similar to a Student's t statistic:
where is the estimate of based on the ith bootstrap sample.
Suppose and are the lower and upper points of the empirical distribution of Di*. The 100% bootstrap confidence interval is defined as
Bootstrap confidence intervals are easy to interpret and can be used with any distribution. However, because they require K model fits, their construction is computationally intensive.
The multiple dependent variables feature in PROC TPSPLINE enables you to fit multiple models with the same independent variables. The procedure calculates the matrix decomposition part of the calculations only once regardless of the number of dependent variables in the model. These calculations are responsible for most of the computing time used by the TPSPLINE procedure. This feature is particularly useful when you need to generate a bootstrap confidence interval. To construct a bootstrap confidence interval, perform the following tasks:
The following statements illustrate this process:
proc tpspline data=melanoma; model incidences = (year) /alpha = 0.05; output out = result pred uclm lclm; run;
The output from the initial PROC TPSPLINE analysis is displayed in Output 64.5.3. The data set result contains the predicted values and confidence limits from the analysis.
Output 64.5.3: Output from PROC TPSPLINE for the Melanoma Data Set
|
data bootstrap; set result; array y{1070} y1-y1070; do i=1 to 1070; y{i} = p_incidences + 0.232823*rannor(123456789); end; keep y1-y1070 p_incidences year; run; ods listing close; proc tpspline data=bootstrap; ods output FitStatistics=FitResult; id p_incidences; model y1-y1070 = (year); output out=result2; run; ods listing;
The DATA step generates 1,070 bootstrap samples based on the previous estimate from PROC TPSPLINE. For this data set, some of the bootstrap samples result in s (selected by the GCV function) that cause problematic behavior. Thus, an additional 70 bootstrap samples are generated.
The ODS listing destination is closed before PROC TPSPLINE is invoked. The model fits all the y1 -y1070 variables as dependent variables, and the models are fit for all bootstrap samples simultaneously. The output data set result2 contains the variables year, y1 -y1070, p_y1 -p_y1070, and P_incidences.
The ODS OUTPUT statement writes the FitStatistics table to the data set FitResult. The data set FitResult contains the two variables. They are Parameter and Value. The FitResult data set is used in subsequent calculations for Di*.
In the data set FitResult, there are 63 estimates with a standard deviation of zero, suggesting that the estimates provide perfect fits of the data and are caused by s that are approximately equal to zero. For small sample sizes, there is a positive probability that the chosen by the GCV function will be zero (refer to Wang and Wahba 1995).
In the following steps, these cases are removed from the bootstrap samples as "bad" samples: they represent failure of the GCV function.
The following SAS statements manipulate the data set FitResult, retaining the standard deviations for all bootstrap samples and merging FitResult with the data set result2, which contains the estimates for bootstrap samples. In the final data set boot, the D*i statistics are calculated.
data FitResult; set FitResult; if Parameter="Standard Deviation"; keep Value; run; proc transpose data=sum2 out=sd prefix=sd; data result2; if _N_ = 1 then set sd; set result2; data boot; set result2; array y{1070} p_y1-p_y1070; array sd{1070} sd1-sd1070; do i=1 to 1070; if sd{i} > 0 then do; d = (y{i} - P_incidences)/sd{i}; obs = _N_; output; end; end; keep d obs P_incidences year; run;
The following SAS statements retain the first 1000 bootstrap samples and calculate the values and with .
proc sort data=boot; by obs; run; data boot; set boot; by obs; retain n; if first.obs then n=1; else n=n+1; if n > 1000 then delete; run; proc sort data=boot; by obs error; run; data chi1 chi2 ; set boot; if (_N_ = (obs-1)*1000+50) then output chi1; if (_N_ = (obs-1)*1000+950) then output chi2; run; proc sort data=result; by year; run; proc sort data=chi1; by year; run; proc sort data=chi2; by year; run; data result; merge result chi1(rename=(d=chi05)) chi2(rename=(d=chi95)); keep year incidences P_incidences lower upper LCLM_incidences UCLM_incidences; lower = -chi95*0.232823 + P_incidences; upper = -chi05*0.232823 + P_incidences; label lower="Lower 95% CL (Bootstrap)" upper="Upper 95% CL (Bootstrap)" lclm_incidences="Lower 95% CL (Bayesian)" uclm_incidences="Upper 95% CL (Bayesian)"; run;
The data set result contains the variables year, incidences, the TPSPLINE estimate P_incidences, and the 90% Bayesian and 90% bootstrap confidence intervals.
The following statements produce Output 64.5.4:
symbol1 v=dot h=1pct ; symbol2 i=join v=none l=1; symbol3 i=join v=none l=33; symbol4 i=join v=none l=33; symbol5 i=join v=none l=43 c=green; symbol6 i=join v=none l=43 c=green; title1 'Age-adjusted Melanoma Incidences for 37 years'; proc gplot data=result; plot incidences * year=1 p_incidences * year=2 lclm_incidences * year=3 uclm_incidences * year=3 lower * year=4 upper * year=4 /overlay legend=legend1 vaxis=axis1 haxis=axis2 frame cframe=ligr; run;
Output 64.5.4 displays the plot of the variable incidences, the predicted values, and the Bayesian and bootstrap confidence intervals.
The plot shows that the bootstrap confidence interval is similar to the Bayesian confidence interval. However, the Bayesian confidence interval is symmetric around the estimates, while the bootstrap confidence interval is not.
Output 64.5.4: Comparison of Bayesian and Bootstrap Confidence Interval for Melanoma Data
Chapter Contents |
Previous |
Next |
Top |
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.