Clustering Methods
The following notation is used, with lowercase symbols generally
pertaining to observations and uppercase symbols pertaining to clusters:
- $n$: number of observations
- $v$: number of variables if data are coordinates
- $G$: number of clusters at any given level of the hierarchy
- $x_i$ or $\mathbf{x}_i$: $i$th observation (row vector if coordinate data)
- $C_K$: $K$th cluster, subset of $\{1, 2, \ldots, n\}$
- $N_K$: number of observations in $C_K$
- $\bar{x}$: sample mean vector
- $\bar{x}_K$: mean vector for cluster $C_K$
- $|x|$: Euclidean length of the vector $x$, that is, the square root of the sum of the squares of the elements of $x$
- $T = \sum_{i=1}^{n} |x_i - \bar{x}|^2$
- $W_K = \sum_{i \in C_K} |x_i - \bar{x}_K|^2$
- $P_G = \sum W_K$, where the summation is over the $G$ clusters at the $G$th level of the hierarchy
- $B_{KL} = W_M - W_K - W_L$ if $C_M = C_K \cup C_L$
- $d(x, y)$: any distance or dissimilarity measure between observations or vectors $x$ and $y$
- $D_{KL}$: any distance or dissimilarity measure between clusters $C_K$ and $C_L$
The distance between two clusters can be defined
either directly or combinatorially (Lance and Williams
1967), that is, by an equation for updating a distance
matrix when two clusters are joined.
In all of the following combinatorial formulas, it is assumed
that clusters $C_K$ and $C_L$ are merged to form $C_M$, and the
formula gives the distance between the new cluster $C_M$ and
any other cluster $C_J$.
For an introduction to most of the methods used in
the CLUSTER procedure, refer to Massart and Kaufman (1983).
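As a concrete illustration (not part of the original text), the following sketch shows the typical pattern: select a method with the METHOD= option, save the cluster hierarchy with OUTTREE=, and cut the tree with the TREE procedure. The data set Scores and the variables x1-x3 are hypothetical, and METHOD=AVERAGE is used purely as an example.

   proc cluster data=Scores method=average outtree=Tree;
      var x1 x2 x3;
   run;

   /* cut the hierarchy into, say, three clusters */
   proc tree data=Tree nclusters=3 out=Clusters noprint;
      copy x1 x2 x3;
   run;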
Average Linkage
The following method is obtained by specifying METHOD=AVERAGE.
The distance between two clusters is defined by
$$D_{KL} = \frac{1}{N_K N_L} \sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j)$$
If $d(x, y) = |x - y|^2$, then
$$D_{KL} = |\bar{x}_K - \bar{x}_L|^2 + \frac{W_K}{N_K} + \frac{W_L}{N_L}$$
The combinatorial formula is
$$D_{JM} = \frac{N_K D_{JK} + N_L D_{JL}}{N_M}$$
In average linkage the distance between two
clusters is the average distance between
pairs of observations, one in each cluster.
Average linkage tends to join clusters with
small variances, and it is slightly biased toward
producing clusters with the same variance.
Average linkage was originated by Sokal and Michener (1958).
Centroid Method
The following method is obtained by specifying METHOD=CENTROID.
The distance between two clusters is defined by
$$D_{KL} = |\bar{x}_K - \bar{x}_L|^2$$
If $d(x, y) = |x - y|^2$, then the combinatorial formula is
$$D_{JM} = \frac{N_K D_{JK} + N_L D_{JL}}{N_M} - \frac{N_K N_L\, D_{KL}}{N_M^2}$$
In the centroid method, the distance between two
clusters is defined as the (squared) Euclidean
distance between their centroids or means.
The centroid method is more robust to outliers than most other
hierarchical methods but in other respects may not perform
as well as Ward's method or average linkage (Milligan 1980).
The centroid method was originated by Sokal and Michener (1958).
Complete Linkage
The following method is obtained by specifying METHOD=COMPLETE.
The distance between two clusters is defined by
$$D_{KL} = \max_{i \in C_K} \max_{j \in C_L} d(x_i, x_j)$$
The combinatorial formula is
$$D_{JM} = \max(D_{JK}, D_{JL})$$
In complete linkage, the distance between two clusters
is the maximum distance between an observation in one
cluster and an observation in the other cluster.
Complete linkage is strongly biased toward producing
clusters with roughly equal diameters, and it can be
severely distorted by moderate outliers (Milligan 1980).
Complete linkage was originated by Sorensen (1948).
Density Linkage
The phrase density linkage is used here to refer
to a class of clustering methods using nonparametric
probability density estimates (for example, Hartigan
1975, pp. 205-212; Wong 1982; Wong and Lane 1983).
Density linkage consists of two steps:
- A new dissimilarity measure, $d^*$, based on
density estimates and adjacencies is computed.
If $x_i$ and $x_j$ are adjacent (the definition of
adjacency depends on the method of density
estimation), then $d^*(x_i, x_j)$ is the reciprocal
of an estimate of the density midway between $x_i$
and $x_j$; otherwise, $d^*(x_i, x_j)$ is infinite.
- A single linkage cluster analysis is
performed using d*.
The CLUSTER procedure supports three types of
density linkage: the kth-nearest-neighbor method,
the uniform kernel method, and Wong's hybrid method.
These are obtained by using METHOD=DENSITY and
the K=, R=, and HYBRID options, respectively.
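For example (a sketch using the same hypothetical data set and variables as above), the kth-nearest-neighbor and uniform-kernel variants might be requested as follows; the hybrid variant, which requires a preliminary analysis, is sketched in the section on Wong's hybrid method.

   /* kth-nearest-neighbor density linkage with k=10 */
   proc cluster data=Scores method=density k=10 outtree=TreeK;
      var x1 x2 x3;
   run;

   /* uniform-kernel density linkage with radius r=2 */
   proc cluster data=Scores method=density r=2 outtree=TreeR;
      var x1 x2 x3;
   run;

The values K=10 and R=2 are illustrative; see the section "Using the K= and R= Options" for guidance on choosing them.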
kth-Nearest Neighbor Method
The kth-nearest-neighbor method (Wong and Lane 1983)
uses kth-nearest neighbor density estimates.
Let $r_k(x)$ be the distance from point $x$ to the $k$th-nearest
observation, where $k$ is the value specified for the K= option.
Consider a closed sphere centered at $x$ with radius $r_k(x)$.
The estimated density at $x$, $f(x)$, is the proportion of
observations within the sphere divided by the volume of the sphere.
The new dissimilarity measure is computed as
$$d^*(x_i, x_j) = \begin{cases} \dfrac{1}{2}\left(\dfrac{1}{f(x_i)} + \dfrac{1}{f(x_j)}\right) & \text{if } d(x_i, x_j) \le \max\bigl(r_k(x_i), r_k(x_j)\bigr) \\ \infty & \text{otherwise} \end{cases}$$
Wong and Lane (1983) show that $k$th-nearest-neighbor density
linkage is strongly set consistent for high-density
(density-contour) clusters if $k$ is chosen such that
$k/n \rightarrow 0$ and $k/\ln n \rightarrow \infty$ as $n \rightarrow \infty$.
Wong and Schaack (1982) discuss methods for estimating the number
of population clusters using kth-nearest-neighbor clustering.
Uniform-Kernel Method
The uniform-kernel method uses uniform-kernel density estimates.
Let r be the value specified for the R= option.
Consider a closed sphere centered at point x with radius r.
The estimated density at x, f(x), is
the proportion of observations within the
sphere divided by the volume of the sphere.
The new dissimilarity measure is computed as
$$d^*(x_i, x_j) = \begin{cases} \dfrac{1}{2}\left(\dfrac{1}{f(x_i)} + \dfrac{1}{f(x_j)}\right) & \text{if } d(x_i, x_j) \le r \\ \infty & \text{otherwise} \end{cases}$$
Wong's Hybrid Method
Wong's (1982) hybrid clustering method uses density estimates
based on a preliminary cluster analysis by the k-means method.
The preliminary clustering can be done by the FASTCLUS procedure,
using the MEAN= option to create a data set containing cluster
means, frequencies, and root-mean-square standard deviations.
This data set is used as input to the CLUSTER
procedure, and the HYBRID option is specified
with METHOD=DENSITY to request the hybrid analysis.
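The two-step workflow just described might look as follows; this is a sketch, and the data set names, variable names, and number of preliminary clusters are illustrative.

   /* step 1: preliminary k-means clustering; the MEAN= data set    */
   /* receives the cluster means, frequencies, and root-mean-square */
   /* standard deviations                                           */
   proc fastclus data=Scores maxclusters=40 mean=Prelim noprint;
      var x1 x2 x3;
   run;

   /* step 2: hybrid density linkage on the preliminary clusters */
   proc cluster data=Prelim method=density hybrid outtree=Tree;
      var x1 x2 x3;
   run;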
The hybrid method is appropriate for very large data
sets but should not be used with small data sets,
say fewer than 100 observations in the original data.
The term preliminary cluster refers
to an observation in the DATA= data set.
For preliminary cluster $C_K$, $N_K$ and $W_K$ are obtained from
the input data set, as are the cluster means or the distances
between the cluster means. Preliminary clusters $C_K$ and $C_L$
are considered adjacent if the midpoint between $\bar{x}_K$ and
$\bar{x}_L$ is closer to either $\bar{x}_K$ or $\bar{x}_L$ than
to any other preliminary cluster mean or, equivalently, if
$$d^2(\bar{x}_K, \bar{x}_L) < d^2(\bar{x}_K, \bar{x}_M) + d^2(\bar{x}_L, \bar{x}_M)$$
for all other preliminary clusters $C_M$, $M \ne K$ or $L$.
The new dissimilarity measure is computed as
$$d^*(\bar{x}_K, \bar{x}_L) = \begin{cases} \dfrac{\left(W_K + W_L + \frac{1}{4}(N_K + N_L)\, d^2(\bar{x}_K, \bar{x}_L)\right)^{v/2}}{(N_K + N_L)^{1 + v/2}} & \text{if } C_K \text{ and } C_L \text{ are adjacent} \\ \infty & \text{otherwise} \end{cases}$$
Using the K= and R= Options
The values of the K= and R= options
are called smoothing parameters.
Small values of K= or R= produce jagged density
estimates and, as a consequence, many modes.
Large values of K= or R= produce smoother
density estimates and fewer modes.
In the hybrid method, the smoothing parameter is the
number of clusters in the preliminary cluster analysis.
The number of modes in the final analysis tends to increase as
the number of clusters in the preliminary analysis increases.
Wong (1982) suggests using $n^{0.3}$ preliminary clusters,
where n is the number of observations in the original data set.
There is no general rule of thumb for selecting K= values.
For all types of density linkage, you should repeat
the analysis with several different values of
the smoothing parameter (Wong and Schaack 1982).
There is no simple answer to the question of which smoothing
parameter to use (Silverman 1986, pp. 43-61, 84-88, and 98-99).
It is usually necessary to try several
different smoothing parameters.
A reasonable first guess for the R= option in
many coordinate data sets is given by
$$r = \left[\frac{2^{v+2}\,(v+2)\,\Gamma\!\left(\tfrac{v}{2}+1\right)}{n v^2}\right]^{1/(v+4)} \sqrt{\sum_{l=1}^{v} s_l^2}$$
where $s_l^2$ is the variance of the $l$th variable.
The estimate for R= can be computed in a DATA step
using the GAMMA function for $\Gamma\!\left(\tfrac{v}{2}+1\right)$.
This formula is derived under the assumption that
the data are sampled from a multivariate normal
distribution and tends, therefore, to be too large
(oversmooth) if the true distribution is multimodal.
Robust estimates of the standard deviations
may be preferable if there are outliers.
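The DATA step below (a sketch; the values of n and v are illustrative) evaluates this formula for standardized variables, for which the sum of the variances is simply v. For n=100 and v=5 it yields 2.28, matching Table 23.1.

   data _null_;
      n = 100;                /* number of observations              */
      v = 5;                  /* number of variables                 */
      sumvar = v;             /* sum of variances; equals v when the */
                              /* variables are standardized          */
      r = ( 2**(v+2) * (v+2) * gamma(v/2 + 1) / (n * v**2) )**(1/(v+4))
          * sqrt(sumvar);
      put r=;                 /* writes r=2.2799... to the log       */
   run;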
If the data are distances, the factor $\sqrt{\sum s_l^2}$ can
be replaced by an average (mean, trimmed mean, median,
root mean square, and so on) distance divided by $\sqrt{2}$.
To prevent outliers from appearing as separate clusters, you can
also specify K=2, or more generally K=$m$, $m \ge 2$, which in
most cases forces clusters to have at least $m$ members.
If the variables all have unit variance (for example,
if the STANDARD option is used), Table 23.1 can
be used to obtain an initial guess for the R= option:
Table 23.1: Reasonable First Guess for the R= Option for Standardized Data
Number of    |                      Number of Variables
Observations |   1  |   2  |   3  |   4  |   5  |   6  |   7  |   8  |   9  |  10
20 | 1.01 | 1.36 | 1.77 | 2.23 | 2.73 | 3.25 | 3.81 | 4.38 | 4.98 | 5.60 |
35 | 0.91 | 1.24 | 1.64 | 2.08 | 2.56 | 3.08 | 3.62 | 4.18 | 4.77 | 5.38 |
50 | 0.84 | 1.17 | 1.56 | 1.99 | 2.46 | 2.97 | 3.50 | 4.06 | 4.64 | 5.24 |
75 | 0.78 | 1.09 | 1.47 | 1.89 | 2.35 | 2.85 | 3.38 | 3.93 | 4.50 | 5.09 |
100 | 0.73 | 1.04 | 1.41 | 1.82 | 2.28 | 2.77 | 3.29 | 3.83 | 4.40 | 4.99 |
150 | 0.68 | 0.97 | 1.33 | 1.73 | 2.18 | 2.66 | 3.17 | 3.71 | 4.27 | 4.85 |
200 | 0.64 | 0.93 | 1.28 | 1.67 | 2.11 | 2.58 | 3.09 | 3.62 | 4.17 | 4.75 |
350 | 0.57 | 0.85 | 1.18 | 1.56 | 1.98 | 2.44 | 2.93 | 3.45 | 4.00 | 4.56 |
500 | 0.53 | 0.80 | 1.12 | 1.49 | 1.91 | 2.36 | 2.84 | 3.35 | 3.89 | 4.45 |
750 | 0.49 | 0.74 | 1.06 | 1.42 | 1.82 | 2.26 | 2.74 | 3.24 | 3.77 | 4.32 |
1000 | 0.46 | 0.71 | 1.01 | 1.37 | 1.77 | 2.20 | 2.67 | 3.16 | 3.69 | 4.23 |
1500 | 0.43 | 0.66 | 0.96 | 1.30 | 1.69 | 2.11 | 2.57 | 3.06 | 3.57 | 4.11 |
2000 | 0.40 | 0.63 | 0.92 | 1.25 | 1.63 | 2.05 | 2.50 | 2.99 | 3.49 | 4.03 |
Since infinite d* values occur in density linkage,
the final number of clusters can exceed one when
there are wide gaps between the clusters or when
the smoothing parameter results in little smoothing.
Density linkage applies no constraints to the shapes
of the clusters and, unlike most other hierarchical
clustering methods, is capable of recovering clusters
with elongated or irregular shapes.
Since density linkage employs less prior knowledge about the
shape of the clusters than do methods restricted to compact
clusters, density linkage is less effective at recovering
compact clusters from small samples than are methods that
always recover compact clusters, regardless of the data.
EML
The following method is obtained by specifying METHOD=EML.
The distance between two clusters is given by
$$D_{KL} = nv \ln\left(1 + \frac{B_{KL}}{P_G}\right) - 2\left(N_M \ln N_M - N_K \ln N_K - N_L \ln N_L\right)$$
The EML method joins clusters to maximize the likelihood at each
level of the hierarchy under the following assumptions:
- multivariate normal mixture
- equal spherical covariance matrices
- unequal sampling probabilities
The EML method is similar to Ward's minimum-variance method
but removes the bias toward equal-sized clusters.
Practical experience has indicated that EML is
somewhat biased toward unequal-sized clusters.
You can specify the PENALTY= option to adjust the degree of bias.
If you specify PENALTY=$p$, the formula is modified to
$$D_{KL} = nv \ln\left(1 + \frac{B_{KL}}{P_G}\right) - p\left(N_M \ln N_M - N_K \ln N_K - N_L \ln N_L\right)$$
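For instance (a sketch with hypothetical names), doubling the default penalty of 2 weights the cluster-size terms in the formula more heavily:

   proc cluster data=Scores method=eml penalty=4 outtree=Tree;
      var x1 x2 x3;
   run;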
The EML method was derived by W.S. Sarle of SAS Institute
Inc. from the maximum-likelihood formula obtained by
Symons (1981, p. 37, equation 8) for disjoint clustering.
There are currently no other published
references on the EML method.
Flexible-Beta Method
The following method is obtained by specifying METHOD=FLEXIBLE.
The combinatorial formula is
$$D_{JM} = (D_{JK} + D_{JL})\,\frac{1 - b}{2} + D_{KL}\, b$$
where $b$ is the value of the BETA= option, or $-0.25$ by default.
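For example (hypothetical names), a value of b other than the default can be supplied directly:

   /* flexible-beta with b = -0.5 instead of the default -0.25 */
   proc cluster data=Scores method=flexible beta=-0.5 outtree=Tree;
      var x1 x2 x3;
   run;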
The flexible-beta method was developed
by Lance and Williams (1967).
See also Milligan (1987).
McQuitty's Similarity Analysis
The following method is obtained by specifying METHOD=MCQUITTY.
The combinatorial formula is
$$D_{JM} = \frac{D_{JK} + D_{JL}}{2}$$
The method was independently developed by
Sokal and Michener (1958) and McQuitty (1966).
Median Method
The following method is obtained by specifying METHOD=MEDIAN.
If $d(x, y) = |x - y|^2$,
then the combinatorial formula is
$$D_{JM} = \frac{D_{JK} + D_{JL}}{2} - \frac{D_{KL}}{4}$$
The median method was developed by Gower (1967).
Single Linkage
The following method is obtained by specifying METHOD=SINGLE.
The distance between two clusters is defined by
$$D_{KL} = \min_{i \in C_K} \min_{j \in C_L} d(x_i, x_j)$$
The combinatorial formula is
$$D_{JM} = \min(D_{JK}, D_{JL})$$
In single linkage, the distance between two clusters
is the minimum distance between an observation in
one cluster and an observation in the other cluster.
Single linkage has many desirable theoretical
properties (Jardine and Sibson 1971; Fisher and
Van Ness 1971; Hartigan 1981) but has fared poorly
in Monte Carlo studies (for example, Milligan 1980).
By imposing no constraints on the shape of clusters,
single linkage sacrifices performance in the recovery
of compact clusters in return for the ability to
detect elongated and irregular clusters.
You must also recognize that single linkage tends
to chop off the tails of distributions before
separating the main clusters (Hartigan 1981).
The notorious chaining tendency of single linkage can be
alleviated by specifying the TRIM= option
(Wishart 1969, pp. 296-298).
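A sketch of such an analysis follows (hypothetical names); because trimming is based on estimated densities, a smoothing parameter such as K= is supplied along with TRIM=.

   /* single linkage with the lowest-density 10% of */
   /* points trimmed as outliers                    */
   proc cluster data=Scores method=single trim=10 k=5 outtree=Tree;
      var x1 x2 x3;
   run;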
Density linkage and two-stage density linkage retain
most of the virtues of single linkage while performing
better with compact clusters and possessing better
asymptotic properties (Wong and Lane 1983).
Single linkage was originated by Florek et al. (1951a, 1951b)
and later reinvented by McQuitty (1957) and Sneath (1957).
Two-Stage Density Linkage
If you specify METHOD=DENSITY, the modal clusters often
merge before all the points in the tails have clustered.
Two-stage density linkage, obtained by specifying METHOD=TWOSTAGE,
is a modification of density linkage that ensures that all points
are assigned to modal clusters before the modal clusters
are allowed to join.
The CLUSTER procedure supports the same three varieties of
two-stage density linkage as ordinary density linkage:
kth-nearest neighbor, uniform kernel, and hybrid.
In the first stage, disjoint modal clusters are formed.
The algorithm is the same as the single linkage algorithm
ordinarily used with density linkage, with one exception:
two clusters are joined only if at least one of the two clusters
has fewer members than the number specified by the MODE= option.
At the end of the first stage, each
point belongs to one modal cluster.
In the second stage, the modal clusters are
hierarchically joined by single linkage.
The final number of clusters can exceed one
when there are wide gaps between the clusters
or when the smoothing parameter is small.
Each stage forms a tree that can
be plotted by the TREE procedure.
By default, the TREE procedure
plots the tree from the first stage.
To obtain the tree for the second stage,
use the option HEIGHT=MODE in the PROC TREE statement.
You can also produce a single tree diagram containing
both stages, with the number of clusters as the height
axis, by using the option HEIGHT=N in the PROC TREE statement.
To produce an output data set from PROC TREE containing
the modal clusters, use _HEIGHT_ for the HEIGHT
variable (the default) and specify LEVEL=0.
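Putting these statements together (a sketch; data set names and smoothing parameters are illustrative):

   proc cluster data=Scores method=twostage k=10 outtree=Tree;
      var x1 x2 x3;
   run;

   /* tree for the second stage */
   proc tree data=Tree height=mode;
   run;

   /* output data set containing the modal clusters; */
   /* the default HEIGHT variable _HEIGHT_ is used   */
   proc tree data=Tree level=0 out=Modal noprint;
   run;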
Two-stage density linkage was developed
by W.S. Sarle of SAS Institute Inc.
There are currently no other published
references on two-stage density linkage.
Ward's Minimum-Variance Method
The following method is obtained by specifying METHOD=WARD.
The distance between two clusters is defined by
$$D_{KL} = B_{KL} = \frac{|\bar{x}_K - \bar{x}_L|^2}{\frac{1}{N_K} + \frac{1}{N_L}}$$
If $d(x, y) = \frac{1}{2}\,|x - y|^2$,
then the combinatorial formula is
$$D_{JM} = \frac{(N_J + N_K)\, D_{JK} + (N_J + N_L)\, D_{JL} - N_J\, D_{KL}}{N_J + N_M}$$
In Ward's minimum-variance method, the distance between
two clusters is the ANOVA sum of squares between
the two clusters added up over all the variables.
At each generation, the within-cluster sum of squares
is minimized over all partitions obtainable by merging
two clusters from the previous generation.
The sums of squares are easier to interpret when
they are divided by the total sum of squares to give
proportions of variance (squared semipartial correlations).
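For METHOD=WARD, these quantities are available in the OUTTREE= data set; the sketch below (hypothetical data set and variable names) prints the squared semipartial correlation recorded for each join, assuming the _NCL_ and _SPRSQ_ variables that PROC CLUSTER writes for this method.

   proc cluster data=Scores method=ward outtree=Tree noprint;
      var x1 x2 x3;
   run;

   proc print data=Tree;
      var _ncl_ _sprsq_;   /* number of clusters and semipartial R-squared */
   run;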
Ward's method joins clusters to maximize the likelihood
at each level of the hierarchy under the following assumptions:
- multivariate normal mixture
- equal spherical covariance matrices
- equal sampling probabilities
Ward's method tends to join clusters with a small number
of observations, and it is strongly biased toward producing
clusters with roughly the same number of observations.
It is also very sensitive to outliers (Milligan 1980).
Ward (1963) describes a class of hierarchical clustering
methods including the minimum variance method.