The CORR Procedure
Pearson Product-Moment Correlation |
The sample correlation, such as a Pearson product-moment correlation or weighted product-moment correlation, estimates the true correlation. The formula for the Pearson product-moment correlation is

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \]

where \(\bar{x}\) is the sample mean of \(x\) and \(\bar{y}\) is the sample mean of \(y\).

The formula for a weighted Pearson product-moment correlation is

\[ r_w = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \, \sum_i w_i (y_i - \bar{y}_w)^2}} \]

where

\[ \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i} \qquad \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i} \]

Note that \(\bar{x}_w\) is the weighted mean of \(x\), \(\bar{y}_w\) is the weighted mean of \(y\), and \(w_i\) is the weight.
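As an illustration only (not PROC CORR's implementation; the helper names and data values are hypothetical), the following Python sketch evaluates both formulas directly:

    import numpy as np

    def pearson(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        dx, dy = x - x.mean(), y - y.mean()
        return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

    def weighted_pearson(x, y, w):
        x, y, w = (np.asarray(a, float) for a in (x, y, w))
        xw = np.sum(w * x) / np.sum(w)    # weighted mean of x
        yw = np.sum(w * y) / np.sum(w)    # weighted mean of y
        num = np.sum(w * (x - xw) * (y - yw))
        den = np.sqrt(np.sum(w * (x - xw) ** 2) * np.sum(w * (y - yw) ** 2))
        return num / den

    x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
    print(pearson(x, y))                    # 0.8
    print(weighted_pearson(x, y, [1] * 5))  # unit weights reproduce 0.8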
When one variable is dichotomous (0,1) and the other variable is continuous, a Pearson correlation is equivalent to a point biserial correlation. When both variables are dichotomous, a Pearson correlation coefficient is equivalent to the phi coefficient.
Spearman Rank-Order Correlation

Spearman rank-order correlation is a nonparametric measure of association based on the ranks of the data values. The formula is

\[ \theta = \frac{\sum_i (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_i (R_i - \bar{R})^2 \, \sum_i (S_i - \bar{S})^2}} \]

where \(R_i\) is the rank of the \(i\)th \(x\) value, \(S_i\) is the rank of the \(i\)th \(y\) value, \(\bar{R}\) is the mean of the \(R_i\) values, and \(\bar{S}\) is the mean of the \(S_i\) values.
PROC CORR computes the Spearman correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In case of ties, the averaged ranks are used.
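A minimal Python sketch of this rank-then-correlate approach (assuming scipy is available; rankdata assigns averaged ranks to ties, as described):

    import numpy as np
    from scipy.stats import rankdata

    def spearman(x, y):
        # rank each variable (ties receive averaged ranks), then apply the
        # Pearson product-moment formula to the ranks
        return np.corrcoef(rankdata(x), rankdata(y))[0, 1]

    print(spearman([1, 2, 2, 4], [10, 20, 20, 40]))   # 1.0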
Kendall's tau-b

Kendall's tau-b is a nonparametric measure of association based on the number of concordances and discordances in paired observations. The formula for Kendall's tau-b is

\[ \tau_b = \frac{\sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)}{\sqrt{(T_0 - T_1)(T_0 - T_2)}} \]

where

\[ T_0 = \frac{n(n-1)}{2} \qquad T_1 = \sum_k \frac{t_k(t_k - 1)}{2} \qquad T_2 = \sum_l \frac{u_l(u_l - 1)}{2} \]

and where \(t_k\) is the number of tied \(x\) values in the \(k\)th group of tied \(x\) values, \(u_l\) is the number of tied \(y\) values in the \(l\)th group of tied \(y\) values, \(n\) is the number of observations, and \(\operatorname{sgn}(z)\) is defined as

\[ \operatorname{sgn}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases} \]
PROC CORR computes Kendall's correlation by ranking the data and using a method similar to Knight (1966). The data are double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. PROC CORR computes Kendall's tau-b from the number of interchanges of the first variable and corrects for tied pairs (pairs of observations with equal values of X or equal values of Y).
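The following Python sketch evaluates the tau-b formula directly in O(n^2) time; it is not the sort-and-count method just described, but it produces the same statistic on small, hypothetical data:

    import numpy as np
    from itertools import combinations

    def kendall_tau_b(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
                for i, j in combinations(range(n), 2))
        t0 = n * (n - 1) / 2
        # t_k, u_l: sizes of the groups of tied x values and tied y values
        t1 = sum(t * (t - 1) / 2 for t in np.unique(x, return_counts=True)[1])
        t2 = sum(u * (u - 1) / 2 for u in np.unique(y, return_counts=True)[1])
        return s / np.sqrt((t0 - t1) * (t0 - t2))

    print(kendall_tau_b([1, 2, 2, 3], [1, 3, 2, 4]))   # 5/sqrt(30) = 0.9129...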
Hoeffding's Measure of Dependence, D

Hoeffding's measure of dependence, D, is a nonparametric measure of association that detects more general departures from independence. The formula for Hoeffding's D is

\[ D = 30\,\frac{(n-2)(n-3)D_1 + D_2 - 2(n-2)D_3}{n(n-1)(n-2)(n-3)(n-4)} \]

where

\[ D_1 = \sum_i (Q_i - 1)(Q_i - 2) \qquad D_2 = \sum_i (R_i - 1)(R_i - 2)(S_i - 1)(S_i - 2) \qquad D_3 = \sum_i (R_i - 2)(S_i - 2)(Q_i - 1) \]

\(R_i\) is the rank of \(x_i\), \(S_i\) is the rank of \(y_i\), and \(Q_i\) (also called the bivariate rank) is 1 plus the number of points with both \(x\) and \(y\) values less than the \(i\)th point. A point that is tied on only the \(x\) value or the \(y\) value contributes 1/2 to \(Q_i\) if the other value is less than the corresponding value for the \(i\)th point. A point that is tied on both \(x\) and \(y\) contributes 1/4 to \(Q_i\).
PROC CORR obtains the \(Q_i\) values by first ranking the data. The data are then double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. Hoeffding's D statistic is computed using the number of interchanges of the first variable.
When no ties occur among data set observations, the D statistic values are between -0.5 and 1, with 1 indicating complete dependence. However, when ties occur, the D statistic may take a smaller value. That is, for a pair of variables with identical values, Hoeffding's D statistic may be less than 1. With a large number of ties in a small data set, the D statistic may be less than -0.5. For more information on Hoeffding's D, see Hollander and Wolfe (1973, p. 228).
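For concreteness, here is an O(n^2) Python sketch of the D formula above, including the 1/2 and 1/4 tie contributions to the bivariate rank Q_i (PROC CORR's double-sort algorithm is faster but computes the same statistic):

    import numpy as np
    from scipy.stats import rankdata

    def hoeffding_d(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        r, s = rankdata(x), rankdata(y)   # averaged ranks R_i, S_i
        q = np.empty(n)
        for i in range(n):                # bivariate rank Q_i
            q[i] = (1.0
                    + np.sum((x < x[i]) & (y < y[i]))
                    + 0.5 * np.sum((x == x[i]) & (y < y[i]))
                    + 0.5 * np.sum((x < x[i]) & (y == y[i]))
                    # the -1 excludes point i from the both-tied count
                    + 0.25 * (np.sum((x == x[i]) & (y == y[i])) - 1))
        d1 = np.sum((q - 1) * (q - 2))
        d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
        d3 = np.sum((r - 2) * (s - 2) * (q - 1))
        return (30.0 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3)
                / (n * (n - 1) * (n - 2) * (n - 3) * (n - 4)))

    x = np.arange(20.0)
    print(hoeffding_d(x, x))   # no ties, complete dependence: D = 1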
Partial Correlation

A partial correlation measures the strength of the relationship between two variables while controlling the effect of other variables. Let \(\mathbf{y} = (y_1, \ldots, y_v)\) be the set of variables to correlate and \(\mathbf{z} = (z_1, \ldots, z_p)\) the set of controlling variables. Then

\[ \mathrm{E}(\mathbf{y} \mid \mathbf{z}) = \boldsymbol{\alpha} + \mathbf{z}\boldsymbol{\beta} \]

is a regression model for \(\mathbf{y}\) given \(\mathbf{z}\). The population Pearson partial correlation between the \(i\)th and the \(j\)th variables of \(\mathbf{y}\) given \(\mathbf{z}\) is defined as the correlation between errors \((y_i - \mathrm{E}(y_i))\) and \((y_j - \mathrm{E}(y_j))\).

If the exact values of \(\boldsymbol{\alpha}\) and \(\boldsymbol{\beta}\) are unknown, you can use a sample Pearson partial correlation to estimate the population Pearson partial correlation. For a given sample of observations, you estimate the sets of unknown parameters \(\boldsymbol{\alpha}\) and \(\boldsymbol{\beta}\) using the least-squares estimators \(\hat{\boldsymbol{\alpha}}\) and \(\hat{\boldsymbol{\beta}}\). Then the fitted least-squares regression model is

\[ \hat{\mathbf{y}} = \hat{\boldsymbol{\alpha}} + \mathbf{z}\hat{\boldsymbol{\beta}} \]
The partial corrected sums of squares and crossproducts (CSSCP) of \(\mathbf{y}\) given \(\mathbf{z}\) are the corrected sums of squares and crossproducts of the residuals \(\mathbf{y} - \hat{\mathbf{y}}\). Using these partial corrected sums of squares and crossproducts, you can calculate the partial variances, partial covariances, and partial correlations.
PROC CORR derives the partial corrected sums of squares and crossproducts matrix by applying the Cholesky decomposition algorithm to the CSSCP matrix. For Pearson partial correlations, let \(S\) be the partitioned CSSCP matrix between two sets of variables, \(\mathbf{z}\) and \(\mathbf{y}\):

\[ S = \begin{pmatrix} S_{zz} & S_{zy} \\ S_{zy}^{\prime} & S_{yy} \end{pmatrix} \]
PROC CORR calculates \(S_{yy.z}\), the partial CSSCP matrix of \(\mathbf{y}\) after controlling for \(\mathbf{z}\), by applying the Cholesky decomposition algorithm sequentially on the rows associated with \(\mathbf{z}\), the variables being partialled out.
After applying the Cholesky decomposition algorithm to each row associated with variables \(\mathbf{z}\), PROC CORR checks all higher-numbered diagonal elements associated with \(\mathbf{z}\) for singularity. After the Cholesky decomposition, a variable is considered singular if the value of the corresponding diagonal element is less than \(\varepsilon\) times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion \(\varepsilon\) using the SINGULAR= option. For Pearson partial correlations, a controlling variable \(z\) is considered singular if the \(R^2\) for predicting this variable from the variables that are already partialled out exceeds \(1 - \varepsilon\). When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable \(y\) is considered singular if the \(R^2\) for predicting this variable from the controlling variables exceeds \(1 - \varepsilon\). When this happens, its associated diagonal element and all higher-numbered elements in this row or column are set to zero.
After the Cholesky decomposition algorithm is performed on all rows associated with \(\mathbf{z}\), the resulting matrix has the form

\[ T = \begin{pmatrix} T_{zz} & T_{zy} \\ 0 & S_{yy.z} \end{pmatrix} \]

where \(T_{zz}\) is an upper triangular matrix with

\[ T_{zz}^{\prime}\,T_{zz} = S_{zz}, \qquad T_{zz}^{\prime}\,T_{zy} = S_{zy}, \qquad S_{yy.z} = S_{yy} - T_{zy}^{\prime}\,T_{zy} \]

If \(S_{zz}\) is positive definite, then the partial CSSCP matrix \(S_{yy.z}\) is identical to the matrix derived from the formula

\[ S_{yy.z} = S_{yy} - S_{zy}^{\prime}\,S_{zz}^{-1}\,S_{zy} \]
The partial variance-covariance matrix is calculated with the variance divisor (VARDEF= option). PROC CORR can then use the standard Pearson correlation formula on the partial variance-covariance matrix to calculate the Pearson partial correlation matrix. Another way to calculate Pearson partial correlation is by applying the Cholesky decomposition algorithm directly to the correlation matrix and by using the correlation formula on the resulting matrix.
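The sketch below (Python, with hypothetical generated data) computes a Pearson partial correlation two equivalent ways: from the partial CSSCP matrix \(S_{yy.z}\) and from the correlation of least-squares residuals; the two printed values agree:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    z = rng.normal(size=(n, 2))                       # controlling variables
    y = z @ rng.normal(size=(2, 2)) + rng.normal(size=(n, 2))

    # Route 1: partial CSSCP of y given z from the partitioned CSSCP matrix.
    m = np.hstack([z, y])
    c = m - m.mean(axis=0)
    S = c.T @ c
    Szz, Szy, Syy = S[:2, :2], S[:2, 2:], S[2:, 2:]
    Syy_z = Syy - Szy.T @ np.linalg.solve(Szz, Szy)
    d = np.sqrt(np.diag(Syy_z))
    print(Syy_z[0, 1] / (d[0] * d[1]))                # partial correlation

    # Route 2: correlate the residuals of regressions of y on (1, z).
    X = np.hstack([np.ones((n, 1)), z])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    print(np.corrcoef(resid[:, 0], resid[:, 1])[0, 1])  # same value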
To derive the corresponding Spearman partial rank-order correlations and Kendall partial tau-b correlations, PROC CORR applies the Cholesky decomposition algorithm to the Spearman rank-order correlation matrix and the Kendall tau-b correlation matrix and uses the correlation formula. The singularity criterion for nonparametric partial correlations is identical to that for Pearson partial correlations, except that PROC CORR uses a matrix of nonparametric correlations and sets a singular variable's associated correlations to missing. The partial tau-b correlations range from -1 to 1. However, the sampling distribution of this partial tau-b is unknown; therefore, the probability values are not available.
When a correlation matrix (Pearson, Spearman, or Kendall tau-b correlation matrix) is positive definite, the resulting partial correlation between variables \(y\) and \(x\) after adjusting for a single variable \(z\) is identical to that obtained from the first-order partial correlation formula

\[ r_{yx.z} = \frac{r_{yx} - r_{yz}\,r_{xz}}{\sqrt{(1 - r_{yz}^2)(1 - r_{xz}^2)}} \]

where \(r_{yx}\), \(r_{yz}\), and \(r_{xz}\) are the appropriate correlations.

The formula for higher-order partial correlations is a straightforward extension of the preceding first-order formula. For example, when the correlation matrix is positive definite, the partial correlation between \(y\) and \(x\) controlling for both \(z_1\) and \(z_2\) is identical to the second-order partial correlation formula

\[ r_{yx.z_1 z_2} = \frac{r_{yx.z_1} - r_{yz_2.z_1}\,r_{xz_2.z_1}}{\sqrt{(1 - r_{yz_2.z_1}^2)(1 - r_{xz_2.z_1}^2)}} \]

where \(r_{yx.z_1}\), \(r_{yz_2.z_1}\), and \(r_{xz_2.z_1}\) are first-order partial correlations among variables \(y\), \(x\), and \(z_2\) given \(z_1\).
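A small Python sketch of this recursion, using hypothetical simple correlations among y, x, z1, and z2 (chosen so the 4x4 correlation matrix is positive definite):

    import math

    def partial_r(r_ab, r_ac, r_bc):
        # first-order formula: correlation of a and b given c
        return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

    r_yx, r_yz1, r_xz1 = 0.6, 0.4, 0.3
    r_yz2, r_xz2, r_z1z2 = 0.5, 0.2, 0.1

    # second-order partial r_{yx.z1 z2}: apply the same formula to the
    # first-order partial correlations given z1
    r_yx_z1 = partial_r(r_yx, r_yz1, r_xz1)
    r_yz2_z1 = partial_r(r_yz2, r_yz1, r_z1z2)
    r_xz2_z1 = partial_r(r_xz2, r_xz1, r_z1z2)
    print(partial_r(r_yx_z1, r_yz2_z1, r_xz2_z1))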
Cronbach's Coefficient Alpha

When a value is recorded, the observed value contains some degree of measurement error. Two sets of measurements on the same variable for the same individual may not have identical values. However, repeated measurements for a series of individuals will show some consistency. Reliability measures internal consistency from one set of measurements to another. The observed value \(Y\) is divided into two components, a true value \(T\) and a measurement error \(E\). The measurement error is assumed to be independent of the true value, that is,

\[ Y = T + E, \qquad \operatorname{cov}(T, E) = 0 \]

The reliability coefficient of a measurement test is defined as the squared correlation between the observed value \(Y\) and the true value \(T\), that is,

\[ \rho^2(Y, T) = \frac{\operatorname{var}(T)}{\operatorname{var}(Y)} \]
which is the proportion of the observed variance due to true differences among individuals in the sample. If Y is the sum of several observed variables measuring the same feature, you can estimate var(T). Cronbach's coefficient alpha, based on a lower bound for var(T), is an estimate of the reliability coefficient.
Suppose \(p\) variables are used with \(Y_j = T_j + E_j\) for \(j = 1, 2, \ldots, p\), where \(Y_j\) is the observed value, \(T_j\) is the true value, and \(E_j\) is the measurement error. The measurement errors (\(E_j\)) are independent of the true values (\(T_j\)) and are also independent of each other. Let \(Y_0 = \sum_{j=1}^{p} Y_j\) be the total observed score and \(T_0 = \sum_{j=1}^{p} T_j\) be the total true score. Because

\[ (p-1) \sum_{j} \operatorname{var}(T_j) \ge \sum_{j \ne k} \operatorname{cov}(T_j, T_k) \]

a lower bound for \(\operatorname{var}(T_0)\) is given by

\[ \frac{p}{p-1} \sum_{j \ne k} \operatorname{cov}(T_j, T_k) \]

With \(\operatorname{cov}(E_j, E_k) = 0\) for \(j \ne k\), a lower bound for the reliability coefficient \(\operatorname{var}(T_0)/\operatorname{var}(Y_0)\) is then given by Cronbach's coefficient alpha:

\[ \alpha = \frac{p}{p-1}\,\frac{\sum_{j \ne k} \operatorname{cov}(Y_j, Y_k)}{\operatorname{var}(Y_0)} = \frac{p}{p-1}\left(1 - \frac{\sum_{j} \operatorname{var}(Y_j)}{\operatorname{var}(Y_0)}\right) \]
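A direct Python sketch of this formula; rows of `items` are individuals, columns are the \(p\) variables, and the data values are hypothetical:

    import numpy as np

    def cronbach_alpha(items):
        items = np.asarray(items, float)
        p = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)        # var(Y_j)
        total_var = items.sum(axis=1).var(ddof=1)    # var(Y_0)
        return (p / (p - 1)) * (1 - item_vars.sum() / total_var)

    items = [[3, 4, 3], [5, 5, 4], [2, 2, 2], [4, 5, 5], [1, 2, 1]]
    print(cronbach_alpha(items))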
If the variances of the items vary widely, you can standardize the items to a standard deviation of 1 before computing the coefficient alpha. If the variables are dichotomous (0,1), the coefficient alpha is equivalent to the Kuder-Richardson 20 (KR-20) reliability measure.
When the correlation between each pair of variables is 1, the coefficient alpha has a maximum value of 1. With negative correlations between some variables, the coefficient alpha can have a value less than zero. The larger the overall alpha coefficient, the more likely that items contribute to a reliable scale. Nunnally (1978) suggests .70 as an acceptable reliability coefficient; smaller reliability coefficients are seen as inadequate. However, this varies by discipline.
To determine how each item reflects the reliability of the scale, you calculate a coefficient alpha after deleting each variable independently from the scale. Cronbach's coefficient alpha from all variables except the \(k\)th variable is given by

\[ \alpha_k = \frac{p-1}{p-2}\left(1 - \frac{\sum_{j \ne k} \operatorname{var}(Y_j)}{\operatorname{var}\!\left(\sum_{j \ne k} Y_j\right)}\right) \]
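This is just alpha recomputed on the remaining \(p-1\) variables, so a short continuation of the previous sketch (reusing cronbach_alpha() and items from above) suffices:

    import numpy as np

    items = np.asarray(items, float)
    print([cronbach_alpha(np.delete(items, k, axis=1))
           for k in range(items.shape[1])])   # one alpha_k per deleted variable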
If the reliability coefficient increases after deleting an item from the scale, you can assume that the item is not highly correlated with other items in the scale. Conversely, if the reliability coefficient decreases, you can assume that the item is highly correlated with other items in the scale. See SAS Communications, 4th Quarter 1994, for more information on how to interpret Cronbach's coefficient alpha.
Listwise deletion of observations with missing values is necessary to correctly calculate Cronbach's coefficient alpha. PROC CORR does not automatically use listwise deletion when you specify ALPHA. Therefore, use the NOMISS option if the data set contains missing values. Otherwise, PROC CORR prints a warning message in the SAS log indicating the need to use NOMISS with ALPHA.
Probability Values

Probability values for the Pearson and Spearman correlations are computed by treating

\[ t = r\,\sqrt{\frac{n-2}{1-r^2}} \]

as coming from a \(t\) distribution with \(n - 2\) degrees of freedom, where \(r\) is the appropriate correlation and \(n\) is the number of observations.
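For example, in Python (using scipy for the t distribution; the function name is illustrative):

    import math
    from scipy import stats

    def corr_pvalue(r, n):
        t = r * math.sqrt((n - 2) / (1 - r ** 2))
        return 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value

    print(corr_pvalue(0.8, 5))   # about 0.104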
Probability values for the Pearson and Spearman partial correlations are computed by treating

\[ t = r\,\sqrt{\frac{n-k-2}{1-r^2}} \]

as coming from a \(t\) distribution with \(n - k - 2\) degrees of freedom, where \(r\) is the appropriate partial correlation and \(k\) is the number of variables being partialled out.
Probability values for Kendall correlations are computed by treating

\[ \frac{s}{\sqrt{\operatorname{var}(s)}} \]

as coming from a normal distribution when

\[ s = \sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j) \]

and where \(x_i\) are the values of the first variable, \(y_i\) are the values of the second variable, and the function \(\operatorname{sgn}(z)\) is defined as

\[ \operatorname{sgn}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases} \]
The formula for the variance of \(s\), \(\operatorname{var}(s)\), is

\[ \operatorname{var}(s) = \frac{v_0 - v_t - v_u}{18} + \frac{v_1}{2n(n-1)} + \frac{v_2}{9n(n-1)(n-2)} \]

where

\[ v_0 = n(n-1)(2n+5) \qquad v_t = \sum_i t_i(t_i-1)(2t_i+5) \qquad v_u = \sum_i u_i(u_i-1)(2u_i+5) \]

\[ v_1 = \Bigl(\sum_i t_i(t_i-1)\Bigr)\Bigl(\sum_i u_i(u_i-1)\Bigr) \qquad v_2 = \Bigl(\sum_i t_i(t_i-1)(t_i-2)\Bigr)\Bigl(\sum_i u_i(u_i-1)(u_i-2)\Bigr) \]

The sums are over groups of tied values, where \(t_i\) is the number of tied \(x\) values in the \(i\)th group and \(u_i\) is the number of tied \(y\) values in the \(i\)th group (Noether 1967). The sampling distribution of Kendall's partial tau-b is unknown; therefore, the probability values are not available.
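The normal approximation with this tie-corrected variance can be sketched in Python as follows (an O(n^2) computation of s; not PROC CORR's sort-based algorithm):

    import numpy as np
    from itertools import combinations
    from scipy import stats

    def kendall_pvalue(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
                for i, j in combinations(range(n), 2))
        t = np.unique(x, return_counts=True)[1]   # tied-group sizes t_i
        u = np.unique(y, return_counts=True)[1]   # tied-group sizes u_i
        v0 = n * (n - 1) * (2 * n + 5)
        vt = np.sum(t * (t - 1) * (2 * t + 5))
        vu = np.sum(u * (u - 1) * (2 * u + 5))
        v1 = np.sum(t * (t - 1)) * np.sum(u * (u - 1))
        v2 = np.sum(t * (t - 1) * (t - 2)) * np.sum(u * (u - 1) * (u - 2))
        var_s = ((v0 - vt - vu) / 18.0
                 + v1 / (2.0 * n * (n - 1))
                 + v2 / (9.0 * n * (n - 1) * (n - 2)))
        return 2 * stats.norm.sf(abs(s) / np.sqrt(var_s))   # two-sided p-value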
The probability values for Hoeffding's D statistic are computed using the asymptotic distribution computed by Blum, Kiefer, and Rosenblatt (1961). The formula is

\[ \frac{(n-1)\,\pi^4}{60}\,D + \frac{\pi^4}{72} \]

which comes from the asymptotic distribution. When the sample size is less than 10, refer to the tables for the distribution of D in Hollander and Wolfe (1973).
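Computing the test statistic itself is straightforward (Python sketch below); the corresponding probability must then be read from the Blum-Kiefer-Rosenblatt asymptotic distribution, for which no ready-made scipy function exists:

    import math

    def bkr_statistic(d, n):
        # statistic referred to the Blum, Kiefer, and Rosenblatt (1961)
        # asymptotic distribution; d is Hoeffding's D, n the sample size
        return (n - 1) * math.pi ** 4 / 60.0 * d + math.pi ** 4 / 72.0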