The PLS Procedure |
The following example, from Umetrics (1995), demonstrates different ways to examine a PLS model. The data come from the field of drug discovery. New drugs are developed from chemicals that are biologically active. Testing a compound for biological activity is an expensive procedure, so it is useful to be able to predict biological activity from cheaper chemical measurements. In fact, computational chemistry makes it possible to calculate certain chemical measurements without even making the compound. These measurements include size, lipophilicity, and polarity at various sites on the molecule. The following statements create a data set named penta, which contains these data.
data penta;
   input obsnam $ S1 L1 P1 S2 L2 P2 S3 L3 P3 S4 L4 P4
         S5 L5 P5 log_RAI @@;
   n = _n_;
   datalines;
VESSK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 1.9607 -1.6324 0.5746 1.9607 -1.6324 0.5746 2.8369 1.4092 -3.1398 0.00
VESAK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 1.9607 -1.6324 0.5746 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 0.28
VEASK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 0.0744 -1.7333 0.0902 1.9607 -1.6324 0.5746 2.8369 1.4092 -3.1398 0.20
VEAAK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 0.51
VKAAK -2.6931 -2.5271 -1.2871 2.8369 1.4092 -3.1398 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 0.11
VEWAK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 2.73
VEAAP -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 -1.2201 0.8829 2.2253 0.18
VEHAK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 2.4064 1.7438 1.1057 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 1.53
VAAAK -2.6931 -2.5271 -1.2871 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 -0.10
GEAAK 2.2261 -5.3648 0.3049 3.0777 0.3891 -0.0701 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 -0.52
LEAAK -4.1921 -1.0285 -0.9801 3.0777 0.3891 -0.0701 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 0.40
FEAAK -4.9217 1.2977 0.4473 3.0777 0.3891 -0.0701 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 0.30
VEGGK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 2.2261 -5.3648 0.3049 2.2261 -5.3648 0.3049 2.8369 1.4092 -3.1398 -1.00
VEFAK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 -4.9217 1.2977 0.4473 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 1.57
VELAK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 -4.1921 -1.0285 -0.9801 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 0.59
AAAAA 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 -0.10
AAYAA 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 -1.3944 2.3230 0.0139 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 0.46
AAWAA 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 0.75
VAWAA -2.6931 -2.5271 -1.2871 0.0744 -1.7333 0.0902 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 1.43
VAWAK -2.6931 -2.5271 -1.2871 0.0744 -1.7333 0.0902 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 1.45
VKWAA -2.6931 -2.5271 -1.2871 2.8369 1.4092 -3.1398 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 1.71
VWAAK -2.6931 -2.5271 -1.2871 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 2.8369 1.4092 -3.1398 0.04
VAAWK -2.6931 -2.5271 -1.2871 0.0744 -1.7333 0.0902 0.0744 -1.7333 0.0902 -4.7548 3.6521 0.8524 2.8369 1.4092 -3.1398 0.23
EKWAP 3.0777 0.3891 -0.0701 2.8369 1.4092 -3.1398 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 -1.2201 0.8829 2.2253 1.30
VKWAP -2.6931 -2.5271 -1.2871 2.8369 1.4092 -3.1398 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 -1.2201 0.8829 2.2253 2.35
RKWAP 2.8827 2.5215 -3.4435 2.8369 1.4092 -3.1398 -4.7548 3.6521 0.8524 0.0744 -1.7333 0.0902 -1.2201 0.8829 2.2253 1.98
VEWVK -2.6931 -2.5271 -1.2871 3.0777 0.3891 -0.0701 -4.7548 3.6521 0.8524 -2.6931 -2.5271 -1.2871 2.8369 1.4092 -3.1398 1.71
PGFSP -1.2201 0.8829 2.2253 2.2261 -5.3648 0.3049 -4.9217 1.2977 0.4473 1.9607 -1.6324 0.5746 -1.2201 0.8829 2.2253 0.90
FSPFR -4.9217 1.2977 0.4473 1.9607 -1.6324 0.5746 -1.2201 0.8829 2.2253 -4.9217 1.2977 0.4473 2.8827 2.5215 -3.4435 0.64
RYLPT 2.8827 2.5215 -3.4435 -1.3944 2.3230 0.0139 -4.1921 -1.0285 -0.9801 -1.2201 0.8829 2.2253 0.9243 -2.0921 -1.3996 0.40
GGGGG 2.2261 -5.3648 0.3049 2.2261 -5.3648 0.3049 2.2261 -5.3648 0.3049 2.2261 -5.3648 0.3049 2.2261 -5.3648 0.3049 .
;
data ptrain; set penta; if (n <= 15);
data ptest ; set penta; if (n > 15);
run;
You would like to study the relationship between these measurements and the activity of the compound, represented by the logarithm of the relative Bradykinin activating activity (log_RAI). Notice that these data consist of many predictors relative to the number of observations. Partial least squares is especially appropriate in this situation, since it finds a few underlying predictive factors that account for most of the variation in the response. Typically, the model is fit to part of the data (the "training" or "work" set), and the quality of the fit is judged by how well it predicts the other part of the data (the "test" or "prediction" set). For this example, the first 15 observations serve as the training set and the rest constitute the test set (refer to Ufkes et al. 1978, 1982).
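Conceptually, partial least squares extracts each factor as the direction in predictor space whose scores have maximal covariance with the response, then deflates both X and y and repeats. The following is a minimal, hypothetical pure-Python sketch of this extraction for a single response (PLS1), run on a small made-up centered matrix; it illustrates the idea only and is not the algorithm PROC PLS implements internally.

```python
# Minimal PLS1 (single-response) factor extraction -- an illustrative
# sketch only. The 4x3 data matrix and response below are made up and
# assumed to be already centered.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_factors(X, y, nfac):
    """Extract nfac PLS factors from centered X (list of rows) and y.
    Returns a list of (weight vector, score vector, y-loading) triples."""
    X = [row[:] for row in X]
    y = y[:]
    factors = []
    for _ in range(nfac):
        cols = list(zip(*X))
        # Weight vector: X-space direction with maximal covariance with y
        w = [dot(col, y) for col in cols]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
        # X-scores: projection of each observation onto that direction
        t = [dot(row, w) for row in X]
        tt = dot(t, t)
        # X-loadings and y-loading, then deflate X and y
        p = [dot(col, t) / tt for col in cols]
        c = dot(y, t) / tt
        X = [[xij - ti * pj for xij, pj in zip(row, p)]
             for row, ti in zip(X, t)]
        y = [yi - c * ti for yi, ti in zip(y, t)]
        factors.append((w, t, c))
    return factors

# Made-up centered data: 4 observations, 3 predictors
X = [[1.0, -1.0, 0.5], [-1.0, 0.5, -0.5],
     [0.5, 1.0, -1.0], [-0.5, -0.5, 1.0]]
y = [1.0, -1.0, 0.5, -0.5]
facs = pls1_factors(X, y, 2)
```

By construction each weight vector has unit length, and because X is deflated by its loadings after each factor, successive score vectors are mutually orthogonal.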
When you fit a PLS model, you hope to find a few PLS factors that explain most of the variation in both predictors and responses. Factors that explain response variation provide good predictive models for new responses, and factors that explain predictor variation are well represented by the observed values of the predictors. The following statements fit a PLS model with two factors and save predicted values, residuals, and other information for each data point in a data set named outpls.
proc pls data=ptrain nfac=2;
   model log_RAI = S1-S5 L1-L5 P1-P5;
   output out=outpls predicted  = yhat1
                     yresidual  = yres1
                     xresidual  = xres1-xres15
                     xscore     = xscr
                     yscore     = yscr;
run;
The PLS procedure displays a table, shown in Output 51.1.1, that summarizes how much predictor and response variation each PLS factor explains.
Output 51.1.1: Amount of Training Set Variation Explained

Partial least squares algorithms choose successive orthogonal factors that maximize the covariance between each X-score and the corresponding Y-score. For a good PLS model, the first few factors show a high correlation between the X- and Y-scores. The correlation usually decreases from one factor to the next. You can plot the X-scores versus the Y-scores for the first PLS factor using the following SAS statements.
%let ifac = 1;
data pltanno;
   set outpls;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2'
          ysys '2' color 'blue' style 'swissb';
   text=%str(n);
   x=xscr&ifac;
   y=yscr&ifac;
axis1 label=(angle=270 rotate=90 "Y score &ifac")
      major=(number=5) minor=none;
axis2 label=("X-score &ifac") minor=none;
symbol1 v=none i=none;
proc gplot data=outpls;
   plot yscr&ifac*xscr&ifac=1 / anno=pltanno vaxis=axis1
                                haxis=axis2 frame cframe=ligr;
run;
By changing the macro variable ifac to 2 instead of 1, you can use the same statements to plot the X-scores versus the Y-scores for the second PLS factor. The resulting plots are shown in Output 51.1.2 and Output 51.1.3. The numbers on the plot represent the observation number in the penta data set.
Output 51.1.2: First X- and Y-scores for Penta-Peptide Model 1

Output 51.1.3: Second X- and Y-scores for Penta-Peptide Model 1
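The factor-by-factor diagnostic described above (high X-score/Y-score correlation for early factors, decreasing thereafter) can also be checked numerically rather than visually. Below is a small Python sketch with made-up score values; the helper `pearson` and the data are illustrative, not output from the procedure.

```python
# Pearson correlation between paired X-scores and Y-scores for one
# factor -- an illustrative check, using made-up score values.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical first-factor scores: strongly related, so the
# correlation is close to 1, as expected for a good first factor.
xscr1 = [1.2, -0.8, 0.5, -1.0, 0.1]
yscr1 = [1.0, -0.9, 0.6, -0.8, 0.2]
r1 = pearson(xscr1, yscr1)
```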
You can also plot the X-scores against each other to look for irregularities in the data. You should look for patterns or clearly grouped observations. If you see a curved pattern, for example, you may want to add a quadratic term. Two or more groupings of observations indicate that it might be better to analyze the groups separately, perhaps by including classification effects in the model. The following SAS statements produce a plot of the first and second X-scores:
data pltanno;
   set outpls;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2'
          ysys '2' color 'blue' style 'swissb';
   text=%str(n);
   x=xscr1;
   y=xscr2;
axis1 label=(angle=270 rotate=90 "X score 2")
      major=(number=5) minor=none;
axis2 label=("X-score 1") minor=none;
symbol1 v=none i=none;
proc gplot data=outpls;
   plot xscr2*xscr1=1 / anno=pltanno vaxis=axis1
                        haxis=axis2 frame cframe=ligr;
run;
The plot is shown in Output 51.1.4.
Output 51.1.4: First and Second X-scores for Penta-Peptide Model 1

Plots of the weights give the directions toward which each PLS factor projects. They show which predictors are most represented in each factor. Those predictors with small weights in absolute value are less important than those with large weights.
You can use the DETAILS option in the PROC PLS statement to display various model details, including the X-weights. You can then use the ODS statement to send the weights to an output data set, as follows:
ods output XWeights=xweights;
proc pls data=ptrain nfac=2 details;
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;
Once the X-weights are in a data set, you can use the following statements to plot the weights for the first two PLS factors against one another:
proc transpose data=xweights(drop=NumberOfFactors InnerRegCoef)
               out =xweights;
data xweights;
   set xweights;
   rename col1=w1 col2=w2;
data wt_anno;
   set xweights;
   length text $ 2;
   retain function 'label' position '5' hsys '3' xsys '2'
          ysys '2' color 'blue' style 'swissb';
   text=%str(_name_);
   x=w1;
   y=w2;
run;
axis1 label=(angle=270 rotate=90 "X weight 2")
      major=(number=5) minor=none;
axis2 label=("X-weight 1") minor=none;
symbol1 v=none i=none;
proc gplot data=xweights;
   plot w2*w1=1 / anno=wt_anno vaxis=axis1
                  haxis=axis2 frame cframe=ligr;
run; quit;
The plot of the X-weights is shown in Output 51.1.5.
Output 51.1.5: First and Second X-weights for Penta-Peptide Model 1

To explore further which predictors can be eliminated from the analysis, you can look at the regression coefficients for the standardized data. Predictors with small coefficients (in absolute value) make a small contribution to the response prediction. Another statistic summarizing the contribution a variable makes to the model is the Variable Importance for Projection (VIP) of Wold (1994). Whereas the regression coefficients represent the importance each predictor has in the prediction of just the response, the VIP represents the value of each predictor in fitting the PLS model for both predictors and response. If a predictor has a relatively small coefficient (in absolute value) and a small value of VIP, then it is a prime candidate for deletion. Wold (1994) considers a value less than 0.8 to be "small" for the VIP. The following statements produce coefficients and the VIP:
/*
/ Put coefficients, weights, and R**2's into data sets.
/-------------------------------------------------------*/
ods listing close;
ods output PercentVariation = pctvar
           XWeights         = xweights
           CenScaleParms    = solution;
proc pls data=ptrain nfac=2 details;
   model log_RAI = S1 L1 P1
                   S2 L2 P2
                   S3 L3 P3
                   S4 L4 P4
                   S5 L5 P5 / solution;
run;
ods listing;

/*
/ Just reformat the coefficients.
/-------------------------------------------------------*/
data solution;
   set solution;
   format log_RAI 8.5;
   if (RowName = 'Intercept') then delete;
   rename RowName = Predictor log_RAI = B;
run;

/*
/ Transpose weights and R**2's.
/-------------------------------------------------------*/
data xweights; set xweights; _name_='W'||trim(left(_n_));
data pctvar  ; set pctvar  ; _name_='R'||trim(left(_n_));
proc transpose data=xweights(drop=NumberOfFactors InnerRegCoef)
               out =xweights;
proc transpose data=pctvar(keep=_name_ CurrentYVariation)
               out =pctvar;
run;

/*
/ Sum the squared weights times the normalized R**2's.
/ The VIP is defined as the square root of this
/ weighted average times the number of predictors.
/-------------------------------------------------------*/
proc sql;
   create table vip as
      select *
      from xweights left join pctvar(drop=_name_) on 1;
data vip;
   set vip;
   keep _name_ vip;
   array w{2};
   array r{2};
   VIP = 0;
   do i = 1 to 2;
      VIP = VIP + r{i}*(w{i}**2)/sum(of r1-r2);
   end;
   VIP = sqrt(VIP * 15);
data vipbpls;
   merge solution vip(drop=_name_);
proc print data=vipbpls;
run;
The output appears in Output 51.1.6.
Output 51.1.6: Estimated PLS Regression Coefficients and VIP (Model 1)
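The VIP computation carried out by the DATA and SQL steps above can be restated compactly: the VIP for predictor j is the square root of the number of predictors times the weighted average of that predictor's squared X-weights across factors, where each factor is weighted by its share of explained response variation. The following Python sketch uses made-up weights and made-up per-factor response R**2 values, purely to illustrate the formula.

```python
# VIP (Variable Importance for Projection) -- an illustrative
# re-statement of the formula, with made-up inputs.
import math

def vip(weights, r2, n_predictors):
    """weights[a][j] is the X-weight of predictor j in factor a;
    r2[a] is the response variation explained by factor a.
    Returns one VIP value per predictor."""
    total_r2 = sum(r2)
    out = []
    for j in range(n_predictors):
        s = sum(r2[a] * weights[a][j] ** 2 for a in range(len(r2)))
        out.append(math.sqrt(n_predictors * s / total_r2))
    return out

# Two factors, three hypothetical predictors. Each factor's weight
# vector has unit length, so the squared VIPs sum to the number of
# predictors. Predictor 3 comes out below Wold's 0.8 cutoff here,
# marking it as a candidate for deletion in this toy setting.
w  = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]
r2 = [60.0, 20.0]
vips = vip(w, r2, 3)
```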
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.