The MODECLUS Procedure
This section illustrates how PROC MODECLUS can be used to examine the clusters of data in the following artificial data set.
   data example;
      input x y @@;
      datalines;
   18 18  20 22  21 20  12 23  17 12  23 25  25 20  16 27
   20 13  28 22  80 20  75 19  77 23  81 26  55 21  64 24
   72 26  70 35  75 30  78 42  18 52  27 57  41 61  48 64
   59 72  69 72  80 80  31 53  51 69  72 81
   ;
It is good practice to plot the data before the analysis to check for obvious clusters or pathologies. The interactive graphics of the SAS/INSIGHT product are effective for visualizing clusters. In this example, with only two variables and a small sample size, the GPLOT procedure is adequate. The following statements produce Figure 42.1:
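   /* The vertical axis runs to 100 so that the observation with y=81 */
   /* stays inside the axis range and is not dropped from the plot.   */
   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 100 by 20);
   axis2 minor=none;

   proc gplot data=example;
      plot y*x /frame cframe=ligr vaxis=axis1 haxis=axis2;
   run;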
The plot suggests three clusters. Of these clusters, the one in the lower left corner is the most compact, while the lower right cluster is more dispersed.
The upper cluster is elongated and would be difficult for most clustering algorithms to identify as a single cluster. The plot also suggests that a Euclidean distance of 10 or 20 is a good initial guess for the neighborhood size in density estimation and clustering.
To obtain a cluster analysis, you must specify the METHOD= option; for most purposes, METHOD=1 is recommended. The cluster analysis can be performed with a list of radii (R=10 15 35), as illustrated in the following PROC MODECLUS step. An output data set containing the cluster membership is created with the OUT= option and then used by PROC GPLOT to display the membership. The following statements produce Figure 42.2 through Figure 42.5:
   proc modeclus data=example method=1 r=10 15 35 out=out;
   run;
For each cluster solution, PROC MODECLUS produces a table of cluster statistics including the cluster number, the number of observations in the cluster, the maximum estimated density within the cluster, the number of observations in the cluster having a neighbor that belongs to a different cluster (the boundary frequency), and the estimated saddle density of the cluster. The results are displayed in Figure 42.2, Figure 42.3, and Figure 42.4 for the three radii. The smallest radius (R=10) yields the largest number of clusters (6), as displayed in Figure 42.2; the largest radius (R=35) includes all observations in a single cluster, as displayed in Figure 42.4. Note that all clusters in these three figures are "isolated," since their boundary frequencies are all 0. Therefore, all the estimated saddle densities are missing.
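As a quick cross-check of the cluster counts reported for each radius, you can tabulate the OUT= data set. The following is a minimal sketch added for illustration, not part of the original example. It assumes that the data set out created by the preceding PROC MODECLUS step identifies each solution by the variable _R_ and each cluster by the variable CLUSTER, the same two variables that the PROC GPLOT step later in this section uses.

```sas
/* Sketch: count the observations in each cluster for every radius.   */
/* Assumes OUT=out from the PROC MODECLUS step, with _R_ and CLUSTER. */
proc freq data=out;
   /* One row per radius, one column per cluster number;  */
   /* suppress percentages so only the raw counts remain. */
   tables _r_*cluster / norow nocol nopercent;
run;
```

Each row of the resulting table corresponds to one radius, and the number of nonzero cells in a row is the number of clusters found for that radius. Observations with a missing value of CLUSTER, if any, are excluded from the table by default.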
A table summarizing each cluster solution is then produced, as displayed in Figure 42.5.
The OUT= data set contains a complete copy of the input data set for each cluster solution. Using a BY statement in the following PROC GPLOT step, you can examine the differences in cluster memberships for each radius. The following statements produce Figure 42.6 through Figure 42.8:
   symbol1 v='1' font=swiss c=white;
   symbol2 v='2' font=swiss c=yellow;
   symbol3 v='3' font=swiss c=cyan;
   symbol4 v='4' font=swiss c=green;
   symbol5 v='5' font=swiss c=orange;
   symbol6 v='6' font=swiss c=blue;
   symbol7 v='7' font=swiss c=black;

   proc gplot data=out;
      plot y*x=cluster /frame cframe=ligr nolegend vaxis=axis1 haxis=axis2;
      by _r_;
   run;
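If you prefer a numeric summary of one solution to a plot, you can subset the OUT= data set by radius and summarize the coordinates within each cluster. The following sketch is an illustration added here, not output from the original example. The choice of R=15 and of the statistics requested is arbitrary, and the step assumes the variables _R_ and CLUSTER in the data set out, as used in the preceding PROC GPLOT step.

```sas
/* Sketch: describe the clusters of a single solution (here R=15). */
/* Assumes OUT=out from PROC MODECLUS, with _R_ and CLUSTER.       */
proc means data=out n mean min max maxdec=1;
   where _r_ = 15;      /* keep only the copy of the data for R=15  */
   class cluster;       /* one set of statistics per cluster number */
   var x y;             /* the two coordinates that were clustered  */
run;
```

The N column gives the size of each cluster, and the MEAN, MIN, and MAX columns give a rough idea of where each cluster lies and how far it extends along each axis. Changing the value in the WHERE statement to 10 or 35 summarizes the other two solutions.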
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.