Polynomial Regression

The REG Procedure

Consider a response variable Y that can be predicted by a polynomial function of a regressor variable X. You can estimate $\beta_0$ , the intercept, $\beta_1$ , the slope due to X, and $\beta_2$ , the slope due to X², in

$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i$

for the observations i = 1,2, ... ,n.
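
Because this model is linear in the parameters, it can be estimated by ordinary least squares in the same way as a multiple regression with regressors X and X². As a sketch in matrix notation (the symbols $\mathbf{X}$, $\mathbf{Y}$, and $\hat{\boldsymbol{\beta}}$ are introduced here for illustration only), the i-th row of the design matrix $\mathbf{X}$ is $(1, X_i, X_i^2)$, $\mathbf{Y}$ is the vector of responses, and the least squares estimate is

$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$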

Consider the following example on population growth trends. The population of the United States from 1790 to 1970 is fit to linear and quadratic functions of time. Note that the quadratic term, YearSq, is created in the DATA step; this is done since polynomial effects such as Year*Year cannot be specified in the MODEL statement in PROC REG. The data are as follows:

   data USPopulation;
      input Population @@;
      retain Year 1780;
      Year=Year+10;
      YearSq=Year*Year;
      Population=Population/1000;
      datalines;
   3929 5308 7239 9638 12866 17069 23191 31443 39818 50155
   62947 75994 91972 105710 122775 131669 151325 179323 203211
   ;

The following statements begin the analysis. (Influence diagnostics and autocorrelation information for the full model are shown in Figure 55.42 and Figure 55.55.)

   symbol1 c=blue;
   proc reg data=USPopulation;
      var YearSq;
      model Population=Year / r cli clm;
      plot r.*p. / cframe=ligr;
   run;

The DATA= option ensures that the procedure uses the intended data set. Any variable that you might add to the model later but that is not included in the first MODEL statement must appear in the VAR statement. Three options are specified in the MODEL statement: R requests a residual analysis, CLI requests 95% confidence limits for an individual value, and CLM requests 95% confidence limits for the expected value of the dependent variable. You can request specific $100(1-\alpha)$% limits with the ALPHA= option in the PROC REG or MODEL statement. The PLOT statement requests a plot of the residuals against the predicted values.
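
As a minimal sketch of the ALPHA= option, a separate PROC REG step such as the following would request 90% rather than 95% limits for the same linear model. The value 0.10 is chosen here purely for illustration and is not used elsewhere in this example.

```sas
/* Illustrative only: alpha=0.10 gives 100(1-0.10)% = 90% limits */
proc reg data=USPopulation;
   model Population=Year / r cli clm alpha=0.10;
run;
```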

The ANOVA table is displayed in Figure 55.4.

The REG Procedure

Model: MODEL1

Dependent Variable: Population

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	66336	66336	201.87	<.0001
Error	17	5586.29253	328.60544
Corrected Total	18	71923

Root MSE	18.12748	R-Square	0.9223
Dependent Mean	69.76747	Adj R-Sq	0.9178
Coeff Var	25.98271

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|
Intercept	1	-1958.36630	142.80455	-13.71	<.0001
Year	1	1.07879	0.07593	14.21	<.0001

Figure 55.4: ANOVA Table and Parameter Estimates

The Model F statistic is significant (F=201.873, p<0.0001), indicating that the model accounts for a significant portion of variation in the data. The R-Square indicates that the model accounts for 92% of the variation in population growth. The fitted equation for this model is

Population = -1958.37 + 1.08 × Year
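
The summary statistics quoted above can be checked by hand from the ANOVA table. The following arithmetic uses the values as printed (the sums of squares are displayed rounded), so the last digit can differ slightly from the output.

$R^2 = \dfrac{SS_{\text{Model}}}{SS_{\text{Total}}} = \dfrac{66336}{71923} \approx 0.9223$

$F = \dfrac{MS_{\text{Model}}}{MS_{\text{Error}}} = \dfrac{66336}{328.60544} \approx 201.87$

$\text{Root MSE} = \sqrt{MS_{\text{Error}}} = \sqrt{328.60544} \approx 18.1275$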

Figure 55.5 shows the confidence limits for both individual and expected values resulting from the CLM and CLI options.

The REG Procedure

Model: MODEL1

Dependent Variable: Population

Output Statistics
Obs	Dep Var Population	Predicted Value	Std Error Mean Predict	95% CL Mean		95% CL Predict		Residual	Std Error Residual	Student Residual	-2-1 0 1 2	Cook's D
1	3.9290	-27.3240	7.9995	-44.2015	-10.4466	-69.1281	14.4800	31.2530	16.267	1.921	\| \|*** \|	0.446
2	5.3080	-16.5361	7.3615	-32.0674	-1.0048	-57.8150	24.7428	21.8441	16.565	1.319	\| \|** \|	0.172
3	7.2390	-5.7481	6.7486	-19.9864	8.4901	-46.5582	35.0619	12.9871	16.824	0.772	\| \|* \|	0.048
4	9.6380	5.0398	6.1684	-7.9744	18.0540	-35.3594	45.4390	4.5982	17.046	0.270	\| \| \|	0.005
5	12.8660	15.8277	5.6309	3.9475	27.7080	-24.2206	55.8761	-2.9617	17.231	-0.172	\| \| \|	0.002
6	17.0690	26.6157	5.1497	15.7509	37.4805	-13.1432	66.3746	-9.5467	17.381	-0.549	\| *\| \|	0.013
7	23.1910	37.4036	4.7417	27.3996	47.4077	-2.1288	76.9360	-14.2126	17.496	-0.812	\| *\| \|	0.024
8	31.4430	48.1916	4.4273	38.8508	57.5324	8.8218	87.5614	-16.7486	17.579	-0.953	\| *\| \|	0.029
9	39.8180	58.9795	4.2275	50.0603	67.8987	19.7076	98.2514	-19.1615	17.628	-1.087	\| **\| \|	0.034
10	50.1550	69.7675	4.1587	60.9933	78.5416	30.5283	109.0067	-19.6125	17.644	-1.112	\| **\| \|	0.034
11	62.9470	80.5554	4.2275	71.6362	89.4746	41.2835	119.8273	-17.6084	17.628	-0.999	\| *\| \|	0.029
12	75.9940	91.3434	4.4273	82.0026	100.6842	51.9736	130.7131	-15.3494	17.579	-0.873	\| *\| \|	0.024
13	91.9720	102.1313	4.7417	92.1272	112.1354	62.5989	141.6637	-10.1593	17.496	-0.581	\| *\| \|	0.012
14	105.7100	112.9193	5.1497	102.0544	123.7841	73.1603	152.6782	-7.2093	17.381	-0.415	\| \| \|	0.008
15	122.7750	123.7072	5.6309	111.8269	135.5875	83.6589	163.7555	-0.9322	17.231	-0.0541	\| \| \|	0.000
16	131.6690	134.4951	6.1684	121.4810	147.5093	94.0959	174.8944	-2.8261	17.046	-0.166	\| \| \|	0.002
17	151.3250	145.2831	6.7486	131.0448	159.5214	104.4731	186.0931	6.0419	16.824	0.359	\| \| \|	0.010
18	179.3230	156.0710	7.3615	140.5397	171.6024	114.7921	197.3500	23.2520	16.565	1.404	\| \|** \|	0.195
19	203.2110	166.8590	7.9995	149.9816	183.7364	125.0550	208.6630	36.3520	16.267	2.235	\| \|**** \|	0.604

Sum of Residuals	0
Sum of Squared Residuals	5586.29253
Predicted Residual SS (PRESS)	7619.90354

Figure 55.5: Confidence Limits

The observed dependent variable is displayed for each observation along with its predicted value from the regression equation and the standard error of the mean predicted value. The 95% CL Mean columns are the confidence limits for the expected value of each observation. The 95% CL Predict columns are the confidence limits for the individual observations.
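
As a worked check of these columns, consider observation 10. The limits are built from the predicted value, its standard error, and the 0.975 quantile of the t distribution with 17 error degrees of freedom, which is approximately 2.1098 (a standard tabled value, not shown in the output). For the mean:

$69.7675 \pm 2.1098 \times 4.1587 \approx (60.99,\ 78.54)$

For an individual value, the standard error also includes the error variance (MSE = 328.60544):

$69.7675 \pm 2.1098 \times \sqrt{328.60544 + 4.1587^2} \approx 69.7675 \pm 39.24 \approx (30.53,\ 109.01)$

Both intervals agree with the displayed limits to the precision of this hand calculation.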

Figure 55.5 also displays the residual analysis requested by the R option.

The residual, its standard error, and the studentized residuals are displayed for each observation. The studentized residual is the residual divided by its standard error. The magnitude of each studentized residual is shown in a plot. Studentized residuals follow a t distribution and can be used to identify outlying or extreme observations. Asterisks (*) extending beyond the dashed lines indicate that the residual is more than three standard errors from zero. Many observations having absolute studentized residuals greater than 2 may indicate an inadequate model. The wave pattern seen in this plot is also an indication that the model is inadequate; a quadratic term may be needed or autocorrelation may be present in the data. Cook's D is a measure of the change in the predicted values upon deletion of that observation from the data set; hence, it measures the influence of the observation on the estimated regression coefficients. A fairly close agreement between the PRESS statistic (see Table 55.5) and the Sum of Squared Residuals indicates that the MSE is a reasonable measure of the predictive accuracy of the fitted model (Neter, Wasserman, and Kutner, 1990).
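
These quantities can be reproduced by hand for observation 19 of the linear fit, using MSE = 328.60544 and p = 2 estimated parameters. The formulas below are the standard textbook expressions, written in terms of the columns shown in the output; the arithmetic is approximate because the inputs are rounded.

$\text{Std Error Residual} = \sqrt{MSE - (\text{Std Error Mean Predict})^2} = \sqrt{328.60544 - 7.9995^2} \approx 16.267$

$\text{Student Residual} = \dfrac{36.3520}{16.267} \approx 2.235$

$\text{Cook's } D = \dfrac{2.235^2}{2} \times \dfrac{7.9995^2}{16.267^2} \approx 0.604$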

A plot of the residuals versus predicted values is shown in Figure 55.6.

Figure 55.6: Plot of Residual vs. Predicted Values

The wave pattern of the studentized residual plot is seen here again. The semi-circle shape indicates an inadequate model; perhaps additional terms (such as the quadratic) are needed, or perhaps the data need to be transformed before analysis. If a model fits well, the plot of residuals against predicted values should exhibit no apparent trends.

Using the interactive feature of PROC REG, the following commands add the variable YearSq to the independent variables and refit the model.

   add YearSq;
   print;
   plot / cframe=ligr;
   run;

The ADD statement requests that YearSq be added to the model, and the PRINT command displays the ANOVA table for the new model. The PLOT statement with no variables recreates the most recent plot requested, in this case a plot of residual versus predicted values.
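
If you do not need the interactive session, a rough equivalent is to fit the quadratic model directly in a new PROC REG step, as sketched below. This is an alternative to the ADD statement rather than what the example itself does; in particular, the model label in the output would be expected to read MODEL1 instead of MODEL1.1.

```sas
/* Non-interactive sketch: both regressors in one MODEL statement */
proc reg data=USPopulation;
   model Population=Year YearSq / r cli clm;
   plot r.*p. / cframe=ligr;
run;
```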

Figure 55.7 displays the ANOVA table and estimates for the new model.

The REG Procedure

Model: MODEL1.1

Dependent Variable: Population

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	71799	35900	4641.72	<.0001
Error	16	123.74557	7.73410
Corrected Total	18	71923

Root MSE	2.78102	R-Square	0.9983
Dependent Mean	69.76747	Adj R-Sq	0.9981
Coeff Var	3.98613

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|
Intercept	1	20450	843.47533	24.25	<.0001
Year	1	-22.78061	0.89785	-25.37	<.0001
YearSq	1	0.00635	0.00023877	26.58	<.0001

Figure 55.7: ANOVA Table and Parameter Estimates

The overall F statistic is still significant (F=4641.719, p<0.0001). The R-square has increased from 0.9223 to 0.9983, indicating that the model now accounts for 99.8% of the variation in Population. All effects are significant with p<0.0001 for each effect in the model.
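
One way to quantify the contribution of the quadratic term is to compare the error sums of squares of the linear and quadratic fits. The following partial F calculation is added here as an illustration; it uses the printed values, so it matches the square of the YearSq t value only up to rounding.

$F_{\text{YearSq}} = \dfrac{SSE_{\text{linear}} - SSE_{\text{quadratic}}}{MSE_{\text{quadratic}}} = \dfrac{5586.29253 - 123.74557}{7.73410} \approx 706.3$

$t_{\text{YearSq}}^2 = 26.58^2 \approx 706.5$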

The fitted equation is now

Population = 20450 - 22.781 × Year + 0.00635 × YearSq
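
The coefficients in this equation are rounded for display, and the three terms nearly cancel, so the rounded equation is not suitable for computing predictions. As an illustration (arithmetic added here, using the estimates as printed in the Parameter Estimates table), for Year = 1970:

$20450 - 22.78061 \times 1970 + 0.00635 \times 1970^2 \approx 215.9$

The predicted value that PROC REG reports for 1970 in the output statistics is 199.2215, so the discrepancy of roughly 17 comes entirely from rounding the displayed estimates.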

The confidence limits and residual analysis for the second model are displayed in Figure 55.8.

The REG Procedure

Model: MODEL1.1

Dependent Variable: Population

Output Statistics
Obs	Dep Var Population	Predicted Value	Std Error Mean Predict	95% CL Mean		95% CL Predict		Residual	Std Error Residual	Student Residual	-2-1 0 1 2	Cook's D
1	3.9290	5.0384	1.7289	1.3734	8.7035	-1.9034	11.9803	-1.1094	2.178	-0.509	\| *\| \|	0.054
2	5.3080	5.0389	1.3909	2.0904	7.9874	-1.5528	11.6306	0.2691	2.408	0.112	\| \| \|	0.001
3	7.2390	6.3085	1.1304	3.9122	8.7047	-0.0554	12.6724	0.9305	2.541	0.366	\| \| \|	0.009
4	9.6380	8.8472	0.9571	6.8182	10.8761	2.6123	15.0820	0.7908	2.611	0.303	\| \| \|	0.004
5	12.8660	12.6550	0.8721	10.8062	14.5037	6.4764	18.8335	0.2110	2.641	0.0799	\| \| \|	0.000
6	17.0690	17.7319	0.8578	15.9133	19.5504	11.5623	23.9015	-0.6629	2.645	-0.251	\| \| \|	0.002
7	23.1910	24.0779	0.8835	22.2049	25.9509	17.8920	30.2638	-0.8869	2.637	-0.336	\| \| \|	0.004
8	31.4430	31.6931	0.9202	29.7424	33.6437	25.4832	37.9029	-0.2501	2.624	-0.0953	\| \| \|	0.000
9	39.8180	40.5773	0.9487	38.5661	42.5885	34.3482	46.8065	-0.7593	2.614	-0.290	\| \| \|	0.004
10	50.1550	50.7307	0.9592	48.6972	52.7642	44.4944	56.9671	-0.5757	2.610	-0.221	\| \| \|	0.002
11	62.9470	62.1532	0.9487	60.1420	64.1644	55.9241	68.3823	0.7938	2.614	0.304	\| \| \|	0.004
12	75.9940	74.8448	0.9202	72.8942	76.7955	68.6350	81.0547	1.1492	2.624	0.438	\| \| \|	0.008
13	91.9720	88.8056	0.8835	86.9326	90.6785	82.6197	94.9915	3.1664	2.637	1.201	\| \|** \|	0.054
14	105.7100	104.0354	0.8578	102.2169	105.8540	97.8658	110.2051	1.6746	2.645	0.633	\| \|* \|	0.014
15	122.7750	120.5344	0.8721	118.6857	122.3831	114.3558	126.7130	2.2406	2.641	0.848	\| \|* \|	0.026
16	131.6690	138.3025	0.9571	136.2735	140.3315	132.0676	144.5374	-6.6335	2.611	-2.540	\| *****\| \|	0.289
17	151.3250	157.3397	1.1304	154.9434	159.7360	150.9758	163.7036	-6.0147	2.541	-2.367	\| ****\| \|	0.370
18	179.3230	177.6460	1.3909	174.6975	180.5945	171.0543	184.2377	1.6770	2.408	0.696	\| \|* \|	0.054
19	203.2110	199.2215	1.7289	195.5564	202.8865	192.2796	206.1633	3.9895	2.178	1.831	\| \|*** \|	0.704

Sum of Residuals	-5.8175E-11
Sum of Squared Residuals	123.74557
Predicted Residual SS (PRESS)	188.54924

Figure 55.8: Confidence Limits and Residual Analysis

The plot of the studentized residuals shows that the wave structure is gone. The PRESS statistic is now much closer in absolute terms to the Sum of Squared Residuals (188.55 versus 123.75, compared with 7619.90 versus 5586.29 for the linear model), and both statistics have been dramatically reduced. Most of the Cook's D statistics have also been reduced.

Figure 55.9: Plot of Residual vs. Predicted Values

The plot of residuals versus predicted values seen in Figure 55.9 has improved since a major trend is no longer visible.

To create a plot of the observed values, predicted values, and confidence limits against Year all on the same plot and to exert some control over the look of the resulting plot, you can submit the following statements.

   symbol1 v=dot     c=yellow h=.3;
   symbol2 v=square  c=red;
   symbol3 f=simplex c=blue  h=2 v='-';
   symbol4 f=simplex c=blue  h=2 v='-';
   plot (Population predicted. u95. l95.)*Year
        / overlay cframe=ligr;
   run;

Figure 55.10: Plot of Population vs Year with Confidence Limits

The SYMBOL statements request that the actual data be displayed as dots, the predicted values as squares, and the upper and lower 95% confidence limits for an individual value (sometimes called a prediction interval) as dashes. PROC REG provides the shorthand options CONF and PRED to request confidence and prediction intervals for simple regression models; see the "PLOT Statement" section for details.

To complete an analysis of these data, you may want to examine influence statistics and, since the data are essentially time series data, examine the Durbin-Watson statistic. You might also want to examine other residual plots, such as the residuals vs. regressors.
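
A minimal sketch of these follow-up checks is shown below, assuming a fresh PROC REG step on the same data set. INFLUENCE and DW are MODEL statement options that request influence statistics and the Durbin-Watson statistic; the PLOT request for residuals against each regressor is one possible choice of additional residual plot.

```sas
/* Sketch: influence diagnostics, Durbin-Watson, residuals vs. regressors */
proc reg data=USPopulation;
   model Population=Year YearSq / influence dw;
   plot r.*(Year YearSq) / cframe=ligr;
run;
```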
