The CORR Procedure
Pearson Product-Moment Correlation
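The sample correlation, such as a Pearson product-moment correlation or weighted product-moment correlation, estimates the true correlation. The formula for the Pearson product-moment correlation is
$$ r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$
where $\bar{x}$ is the sample mean of $x$ and $\bar{y}$ is the sample mean of $y$.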
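The formula for a weighted Pearson product-moment correlation is
$$ r_{xy} = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \, \sum_i w_i (y_i - \bar{y}_w)^2}} $$
where
$$ \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i} $$
Note that $\bar{x}_w$ is the weighted mean of $x$, $\bar{y}_w$ is the weighted mean of $y$, and $w_i$ is the weight.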
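The two product-moment formulas differ only in the weights, so a single routine can illustrate both. The following is a minimal Python sketch, not PROC CORR's implementation; the function name `weighted_pearson` and the sample data are hypothetical, and unit weights are assumed when no weights are given.

```python
import math

def weighted_pearson(x, y, w=None):
    """Pearson correlation of x and y; with weights w, the weighted version."""
    if w is None:
        w = [1.0] * len(x)  # unit weights reduce to the ordinary Pearson r
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r_plain = weighted_pearson(x, y)
r_weighted = weighted_pearson(x, y, w=[1, 1, 2, 2, 1])
```

With integer weights, the weighted correlation equals the unweighted correlation of a data set in which each observation is repeated as many times as its weight.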
When one variable is dichotomous (0,1) and the other variable is continuous, a Pearson correlation is equivalent to a point biserial correlation. When both variables are dichotomous, a Pearson correlation coefficient is equivalent to the phi coefficient.
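Spearman Rank-Order Correlation
The formula for the Spearman rank-order correlation is
$$ \theta = \frac{\sum_i (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_i (R_i - \bar{R})^2 \, \sum_i (S_i - \bar{S})^2}} $$
where $R_i$ is the rank of the $i$th $x$ value, $S_i$ is the rank of the $i$th $y$ value, $\bar{R}$ is the mean of the $R_i$ values, and $\bar{S}$ is the mean of the $S_i$ values.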
PROC CORR computes the Spearman correlation by ranking the data and using the ranks in the Pearson product-moment correlation formula. In case of ties, the averaged ranks are used.
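The rank-then-correlate procedure can be sketched in Python as follows. This is an illustration under simple assumptions (complete data, plain lists); the helper names `average_ranks`, `pearson`, and `spearman` are hypothetical and are not part of SAS.

```python
import math

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2.0 + 1.0  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(a, b):
    """Ordinary Pearson product-moment correlation."""
    abar = sum(a) / len(a)
    bbar = sum(b) / len(b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `average_ranks([10, 20, 20, 40])` gives the two tied values the rank 2.5, the average of ranks 2 and 3.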
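Kendall's tau-b
The formula for Kendall's tau-b is
$$ \tau = \frac{\sum_{i<j} \mathrm{sgn}(x_i - x_j)\,\mathrm{sgn}(y_i - y_j)}{\sqrt{(T_0 - T_1)(T_0 - T_2)}} $$
where
$$ T_0 = \frac{n(n-1)}{2}, \qquad T_1 = \sum_i \frac{t_i(t_i - 1)}{2}, \qquad T_2 = \sum_i \frac{u_i(u_i - 1)}{2} $$
and where $t_i$ is the number of tied $x$ values in the $i$th group of tied $x$ values, $u_i$ is the number of tied $y$ values in the $i$th group of tied $y$ values, $n$ is the number of observations, and sgn(z) is defined as
$$ \mathrm{sgn}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases} $$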
PROC CORR computes Kendall's correlation by ranking the data and using a method similar to Knight (1966). The data are double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. PROC CORR computes Kendall's tau-b from the number of interchanges of the first variable and corrects for tied pairs (pairs of observations with equal values of X or equal values of Y).
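For small data sets, tau-b can be checked by counting concordant and discordant pairs directly. The Python sketch below uses that direct O(n^2) pair count with the usual tie correction, not the sorting method of Knight (1966) that PROC CORR is described as using; the function names are hypothetical.

```python
import math
from collections import Counter

def sgn(z):
    """Sign function: 1 for z > 0, 0 for z == 0, -1 for z < 0."""
    return (z > 0) - (z < 0)

def kendall_tau_b(x, y):
    """Kendall's tau-b by direct pair counting, corrected for ties."""
    n = len(x)
    # Sum of sign products over all pairs i < j (concordant minus discordant).
    s = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    t0 = n * (n - 1) / 2                                    # total pairs
    t1 = sum(t * (t - 1) / 2 for t in Counter(x).values())  # pairs tied on x
    t2 = sum(u * (u - 1) / 2 for u in Counter(y).values())  # pairs tied on y
    return s / math.sqrt((t0 - t1) * (t0 - t2))
```

Groups of size 1 contribute nothing to the tie terms, so untied data reduce to the ratio of the sign sum to the number of pairs.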
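Hoeffding's Measure of Dependence, D
The formula for Hoeffding's D is
$$ D = 30\,\frac{(n-2)(n-3)D_1 + D_2 - 2(n-2)D_3}{n(n-1)(n-2)(n-3)(n-4)} $$
where
$$ D_1 = \sum_i (Q_i - 1)(Q_i - 2), \qquad D_2 = \sum_i (R_i - 1)(R_i - 2)(S_i - 1)(S_i - 2), \qquad D_3 = \sum_i (R_i - 2)(S_i - 2)(Q_i - 1) $$
$R_i$ is the rank of $x_i$, $S_i$ is the rank of $y_i$, and $Q_i$ (also called the bivariate rank) is 1 plus the number of points with both $x$ and $y$ values less than the $i$th point. A point that is tied on only the $x$ value or $y$ value contributes 1/2 to $Q_i$ if the other value is less than the corresponding value for the $i$th point. A point that is tied on both $x$ and $y$ contributes 1/4 to $Q_i$.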
PROC CORR obtains the $Q_i$ values by first ranking the data. The data are then double sorted by ranking observations according to values of the first variable and reranking the observations according to values of the second variable. Hoeffding's D statistic is computed using the number of interchanges of the first variable.
When no ties occur among data set observations, the D statistic values are between -0.5 and 1, with 1 indicating complete dependence. However, when ties occur, the D statistic may result in a smaller value. That is, for a pair of variables with identical values, Hoeffding's D statistic may be less than 1. With a large number of ties in a small data set, the D statistic may be less than -0.5. For more information on Hoeffding's D, see Hollander and Wolfe (1973, p. 228).
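The rank and bivariate-rank definitions above translate into a short brute-force computation. The Python sketch below assumes the commonly published form of D, namely 30 times ((n-2)(n-3)D1 + D2 - 2(n-2)D3) divided by n(n-1)(n-2)(n-3)(n-4), with D1, D2, and D3 built from the ranks and bivariate ranks. It counts pairs directly rather than using the double-sort method, and the function name `hoeffding_d` is hypothetical.

```python
def hoeffding_d(x, y):
    """Hoeffding's D by direct O(n^2) counting; requires n >= 5."""
    n = len(x)
    if n < 5:
        raise ValueError("need at least 5 observations")

    def below(a, b):
        # 1 if a is less than b, 1/2 if tied, 0 otherwise
        if a < b:
            return 1.0
        return 0.5 if a == b else 0.0

    d1 = d2 = d3 = 0.0
    for i in range(n):
        r = s = q = 1.0  # rank of x_i, rank of y_i, bivariate rank
        for j in range(n):
            if j == i:
                continue
            bx = below(x[j], x[i])
            by = below(y[j], y[i])
            r += bx        # averaged rank of x_i
            s += by        # averaged rank of y_i
            q += bx * by   # 1, 1/2, or 1/4 according to the tie rules
        d1 += (q - 1) * (q - 2)
        d2 += (r - 1) * (r - 2) * (s - 1) * (s - 2)
        d3 += (r - 2) * (s - 2) * (q - 1)
    num = (n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3
    den = n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    return 30.0 * num / den
```

With untied data, a perfectly increasing or perfectly decreasing relationship both give D = 1, since D measures dependence rather than direction.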
Partial Correlation
Let $y = (y_1, y_2, \ldots, y_v)$ be the set of variables to correlate and $z = (z_1, z_2, \ldots, z_p)$ be the set of controlling variables. Suppose that
$$ E(y) = \alpha + z\beta $$
is a regression model for $y$ given $z$. The population Pearson partial correlation between the $i$th and the $j$th variables of $y$ given $z$ is defined as the correlation between errors $(y_i - E(y_i))$ and $(y_j - E(y_j))$.
If the exact values of $\alpha$ and $\beta$ are unknown, you can use a sample Pearson partial correlation to estimate the population Pearson partial correlation. For a given sample of observations, you estimate the sets of unknown parameters $\alpha$ and $\beta$ using the least-squares estimators $\hat{\alpha}$ and $\hat{\beta}$. Then the fitted least-squares regression model is
$$ \hat{y} = \hat{\alpha} + z\hat{\beta} $$
The partial corrected sums of squares and crossproducts (CSSCP) of $y$ given $z$ are the corrected sums of squares and crossproducts of the residuals $y - \hat{y}$. Using these partial corrected sums of squares and crossproducts, you can calculate the partial variances, partial covariances, and partial correlations.
PROC CORR derives the partial corrected sums of squares and crossproducts matrix by applying the Cholesky decomposition algorithm to the CSSCP matrix. For Pearson partial correlations, let $S$ be the partitioned CSSCP matrix between two sets of variables, $z$ and $y$:
$$ S = \begin{bmatrix} S_{zz} & S_{zy} \\ S_{zy}' & S_{yy} \end{bmatrix} $$
PROC CORR calculates $S_{yy.z}$, the partial CSSCP matrix of $y$ after controlling for $z$, by applying the Cholesky decomposition algorithm sequentially on the rows associated with $z$, the variables being partialled out.
After applying the Cholesky decomposition algorithm to each row associated with variables $z$, PROC CORR checks all higher numbered diagonal elements associated with $z$ for singularity. After the Cholesky decomposition, a variable is considered singular if the value of the corresponding diagonal element is less than $\varepsilon$ times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion $\varepsilon$ using the SINGULAR= option. For Pearson partial correlations, a controlling variable $z$ is considered singular if the $R^2$ for predicting this variable from the variables that are already partialled out exceeds $1 - \varepsilon$. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable is considered singular if the $R^2$ for predicting this variable from the controlling variables exceeds $1 - \varepsilon$. When this happens, its associated diagonal element and all higher numbered elements in this row or column are set to zero.
After the Cholesky decomposition algorithm is performed on all rows associated with $z$, the resulting matrix has the form
$$ T = \begin{bmatrix} T_{zz} & T_{zy} \\ 0 & S_{yy.z} \end{bmatrix} $$
where $T_{zz}$ is an upper triangular matrix with
$$ T_{zz}' T_{zz} = S_{zz}, \qquad T_{zz}' T_{zy} = S_{zy}, \qquad S_{yy.z} = S_{yy} - T_{zy}' T_{zy} $$
If $S_{zz}$ is positive definite, then the partial CSSCP matrix $S_{yy.z}$ is identical to the matrix derived from the formula
$$ S_{yy.z} = S_{yy} - S_{zy}' S_{zz}^{-1} S_{zy} $$
The partial variance-covariance matrix is calculated with the variance divisor (VARDEF= option). PROC CORR can then use the standard Pearson correlation formula on the partial variance-covariance matrix to calculate the Pearson partial correlation matrix. Another way to calculate Pearson partial correlation is by applying the Cholesky decomposition algorithm directly to the correlation matrix and by using the correlation formula on the resulting matrix.
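The sequential partialling described above can be sketched in Python as an elimination sweep over the rows of the controlling variables, which leaves the partial CSSCP matrix in the lower-right block. This is a simplified illustration, not PROC CORR's code. It assumes the controlling variables occupy the first `n_z` rows and columns, the function names are hypothetical, the `eps` default is only a placeholder for the SINGULAR= criterion, and the zeroing of singular non-controlling variables is omitted.

```python
import math

def partial_csscp(s, n_z, eps=1e-8):
    """Partial the first n_z variables out of the symmetric CSSCP matrix s."""
    a = [row[:] for row in s]  # work on a copy
    m = len(a)
    orig_diag = [a[k][k] for k in range(m)]
    for k in range(n_z):
        pivot = a[k][k]
        if pivot < eps * orig_diag[k]:
            continue  # singular controlling variable: leave it out
        # Subtract the contribution of variable k from all later rows/columns.
        for i in range(k + 1, m):
            for j in range(k + 1, m):
                a[i][j] -= a[k][i] * a[k][j] / pivot
    return [row[n_z:] for row in a[n_z:]]

def to_correlation(c):
    """Apply the standard correlation formula to a (partial) CSSCP matrix."""
    p = len(c)
    return [[c[i][j] / math.sqrt(c[i][i] * c[j][j]) for j in range(p)]
            for i in range(p)]

# Hypothetical CSSCP matrix: variable 0 is z, variables 1 and 2 are y.
s = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
s_yy_z = partial_csscp(s, n_z=1)
r_partial = to_correlation(s_yy_z)[0][1]
```

Because a correlation is a ratio, the variance divisor cancels, so the correlation formula can be applied to the partial CSSCP matrix directly in this sketch.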
To derive the corresponding Spearman partial rank-order correlations and Kendall partial tau-b correlations, PROC CORR applies the Cholesky decomposition algorithm to the Spearman rank-order correlation matrix and Kendall tau-b correlation matrix and uses the correlation formula. The singularity criterion for nonparametric partial correlations is identical to Pearson partial correlation except that PROC CORR uses a matrix of nonparametric correlations and sets a singular variable's associated correlations to missing. The partial tau-b correlations range from -1 to 1. However, the sampling distribution of this partial tau-b is unknown; therefore, the probability values are not available.
When a correlation matrix (Pearson, Spearman, or Kendall tau-b correlation matrix) is positive definite, the resulting partial correlation between variables $x$ and $y$ after adjusting for a single variable $z$ is identical to that obtained from the first-order partial correlation formula
$$ r_{xy.z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} $$
where $r_{xy}$, $r_{xz}$, and $r_{yz}$ are the appropriate correlations.
The formula for higher-order partial correlations is a straightforward extension of the above first-order formula. For example, when the correlation matrix is positive definite, the partial correlation between $x$ and $y$ controlling for both $z_1$ and $z_2$ is identical to the second-order partial correlation formula
$$ r_{xy.z_1 z_2} = \frac{r_{xy.z_1} - r_{x z_2 . z_1}\, r_{y z_2 . z_1}}{\sqrt{(1 - r_{x z_2 . z_1}^2)(1 - r_{y z_2 . z_1}^2)}} $$
where $r_{xy.z_1}$, $r_{x z_2 . z_1}$, and $r_{y z_2 . z_1}$ are first-order partial correlations among variables $x$, $y$, and $z_2$ given $z_1$.
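The first-order formula and its second-order extension can be written as two small Python functions. This is a sketch for a positive definite correlation matrix held as a nested list; the names `partial_r` and `second_order_partial_r` and the index arguments are hypothetical.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def second_order_partial_r(corr, x, y, z1, z2):
    """Partial correlation of x and y given z1 and z2.

    corr is a correlation matrix; x, y, z1, z2 are row/column indices.
    """
    # First-order partials among x, y, and z2, each given z1.
    r_xy_z1 = partial_r(corr[x][y], corr[x][z1], corr[y][z1])
    r_xz2_z1 = partial_r(corr[x][z2], corr[x][z1], corr[z2][z1])
    r_yz2_z1 = partial_r(corr[y][z2], corr[y][z1], corr[z2][z1])
    # Apply the same formula again to the first-order partials.
    return partial_r(r_xy_z1, r_xz2_z1, r_yz2_z1)
```

For four variables that all correlate 0.5 with one another, each first-order partial is 1/3 and the second-order partial is 0.25.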
Cronbach's Coefficient Alpha
When a value is recorded, the observed value contains some degree of measurement error. Two sets of measurements on the same variable for the same individual may not have identical values. However, repeated measurements for a series of individuals will show some consistency. Reliability measures internal consistency from one set of measurements to another. The observed value Y is divided into two components, a true value T and a measurement error E. The measurement error is assumed to be independent of the true value, that is,
$$ Y = T + E, \qquad \mathrm{cov}(T, E) = 0 $$
The reliability coefficient of a measurement test is defined as the squared correlation between the observed value Y and the true value T, that is,
$$ r^2(Y, T) = \frac{\mathrm{cov}(Y, T)^2}{\mathrm{var}(Y)\,\mathrm{var}(T)} = \frac{\mathrm{var}(T)^2}{\mathrm{var}(Y)\,\mathrm{var}(T)} = \frac{\mathrm{var}(T)}{\mathrm{var}(Y)} $$
which is the proportion of the observed variance due to true differences among individuals in the sample. If Y is the sum of several observed variables measuring the same feature, you can estimate var(T). Cronbach's coefficient alpha, based on a lower bound for var(T), is an estimate of the reliability coefficient.
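Suppose $p$ variables are used with $Y_j = T_j + E_j$ for $j = 1, 2, \ldots, p$, where $Y_j$ is the observed value, $T_j$ is the true value, and $E_j$ is the measurement error. The measurement errors ($E_j$) are independent of the true values ($T_j$) and are also independent of each other. Let $Y_0 = \sum_j Y_j$ be the total observed score and $T_0 = \sum_j T_j$ be the total true score. Because
$$ (p - 1) \sum_j \mathrm{var}(T_j) \ge \sum_{i \ne j} \mathrm{cov}(T_i, T_j) $$
a lower bound for $\mathrm{var}(T_0)$ is given by
$$ \frac{p}{p-1} \sum_{i \ne j} \mathrm{cov}(T_i, T_j) $$
With $\mathrm{cov}(Y_i, Y_j) = \mathrm{cov}(T_i, T_j)$ for $i \ne j$, a lower bound for the reliability coefficient is then given by the Cronbach's coefficient alpha:
$$ \alpha = \frac{p}{p-1}\,\frac{\sum_{i \ne j} \mathrm{cov}(Y_i, Y_j)}{\mathrm{var}(Y_0)} = \frac{p}{p-1}\left(1 - \frac{\sum_j \mathrm{var}(Y_j)}{\mathrm{var}(Y_0)}\right) $$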
If the variances of the items vary widely, you can standardize the items to a standard deviation of 1 before computing the coefficient alpha. If the variables are dichotomous (0,1), the coefficient alpha is equivalent to the Kuder-Richardson 20 (KR-20) reliability measure.
When the correlation between each pair of variables is 1, the coefficient alpha has a maximum value of 1. With negative correlations between some variables, the coefficient alpha can have a value less than zero. The larger the overall alpha coefficient, the more likely that items contribute to a reliable scale. Nunnally (1978) suggests .70 as an acceptable reliability coefficient; smaller reliability coefficients are seen as inadequate. However, this varies by discipline.
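To determine how each item reflects the reliability of the scale, you calculate a coefficient alpha after deleting each variable independently from the scale. The Cronbach's coefficient alpha from all variables except the $k$th variable is given by
$$ \alpha_k = \frac{p-1}{p-2}\left(1 - \frac{\sum_{i \ne k} \mathrm{var}(Y_i)}{\mathrm{var}\left(\sum_{i \ne k} Y_i\right)}\right) $$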
If the reliability coefficient increases after deleting an item from the scale, you can assume that the item is not correlated highly with other items in the scale. Conversely, if the reliability coefficient decreases you can assume that the item is highly correlated with other items in the scale. See SAS Communications, 4th Quarter 1994, for more information on how to interpret Cronbach's coefficient alpha.
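The overall coefficient and the coefficient after deleting each item can be sketched in Python as follows. The sketch uses the raw (unstandardized) form, alpha = p/(p-1) times (1 minus the sum of item variances over the variance of the total score), and assumes complete data with at least three items; the function names and sample scores are hypothetical.

```python
def variance(v):
    """Sample variance with divisor n - 1."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def cronbach_alpha(items):
    """Raw coefficient alpha; items is a list of p columns of n scores each."""
    p = len(items)
    total = [sum(row) for row in zip(*items)]  # total observed score per subject
    return p / (p - 1) * (1 - sum(variance(c) for c in items) / variance(total))

def alpha_if_deleted(items):
    """Coefficient alpha recomputed with each item left out in turn."""
    return [cronbach_alpha(items[:k] + items[k + 1:]) for k in range(len(items))]

# Hypothetical scores: three items measured on five subjects.
items = [[1, 2, 3, 4, 5],
         [2, 1, 4, 3, 5],
         [1, 3, 2, 5, 4]]
alpha = cronbach_alpha(items)
deleted = alpha_if_deleted(items)
```

Comparing each entry of `deleted` with `alpha` shows which items raise or lower the reliability of the scale when removed.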
Listwise deletion of observations with missing values is necessary to correctly calculate Cronbach's coefficient alpha. PROC CORR does not automatically use listwise deletion when you specify ALPHA. Therefore, use the NOMISS option if the data set contains missing values. Otherwise, PROC CORR prints a warning message in the SAS log indicating the need to use NOMISS with ALPHA.
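Probability Values
Probability values for the Pearson and Spearman correlations are computed by treating
$$ t = (n-2)^{1/2} \left(\frac{r^2}{1 - r^2}\right)^{1/2} $$
as coming from a t distribution with $n-2$ degrees of freedom, where $r$ is the appropriate correlation.
Probability values for the Pearson and Spearman partial correlations are computed by treating
$$ t = \frac{(n-k-2)^{1/2}\, r}{(1 - r^2)^{1/2}} $$
as coming from a t distribution with $n-k-2$ degrees of freedom, where $r$ is the appropriate partial correlation and $k$ is the number of variables being partialled out.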
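The test statistic for both cases can be sketched in Python with one function, since setting the number of partialled-out variables to zero gives the ordinary case. The sketch assumes the usual statistic, the square root of the degrees of freedom times r divided by the square root of 1 - r^2, and keeps the sign of r. The function name `corr_t_statistic` is hypothetical. Converting the statistic to a probability value needs the t distribution function, which the Python standard library does not provide, so that step is left out.

```python
import math

def corr_t_statistic(r, n, k=0):
    """Return (t, df) for a correlation r from n observations.

    k is the number of variables partialled out (0 for an ordinary correlation).
    """
    df = n - k - 2
    if df <= 0:
        raise ValueError("not enough observations for the test")
    t = math.sqrt(df) * r / math.sqrt(1 - r * r)
    return t, df
```

For example, r = 0.6 from 18 observations gives 16 degrees of freedom and a statistic of about 3.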
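Probability values for Kendall correlations are computed by treating
$$ \frac{s}{\sqrt{\mathrm{var}(s)}} $$
as coming from a normal distribution, where
$$ s = \sum_{i<j} \mathrm{sgn}(x_i - x_j)\,\mathrm{sgn}(y_i - y_j) $$
and where $x_i$ are the values of the first variable, $y_i$ are the values of the second variable, and the function sgn(z) is defined as
$$ \mathrm{sgn}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases} $$
The formula for the variance of $s$, var($s$), is computed as
$$ \mathrm{var}(s) = \frac{v_0 - v_t - v_u}{18} + \frac{v_1}{2n(n-1)} + \frac{v_2}{9n(n-1)(n-2)} $$
where
$$ v_0 = n(n-1)(2n+5) $$
$$ v_t = \sum_i t_i(t_i - 1)(2t_i + 5) $$
$$ v_u = \sum_i u_i(u_i - 1)(2u_i + 5) $$
$$ v_1 = \left(\sum_i t_i(t_i - 1)\right)\left(\sum_i u_i(u_i - 1)\right) $$
$$ v_2 = \left(\sum_i t_i(t_i - 1)(t_i - 2)\right)\left(\sum_i u_i(u_i - 1)(u_i - 2)\right) $$
The sums are over tied groups of values where $t_i$ is the number of tied $x$ values and $u_i$ is the number of tied $y$ values (Noether 1967). The sampling distribution of Kendall's partial tau-b is unknown; therefore, the probability values are not available.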
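The normal approximation for Kendall's statistic can be sketched in Python as follows. The sketch assumes the usual tie-corrected variance, var(s) = (v0 - vt - vu)/18 + v1/(2n(n-1)) + v2/(9n(n-1)(n-2)), and assumes a two-sided probability value; the function name `kendall_test` is hypothetical, and at least three observations are required.

```python
import math
from collections import Counter

def sgn(z):
    return (z > 0) - (z < 0)

def kendall_test(x, y):
    """Return (s, var_s, p) using the normal approximation with tie correction."""
    n = len(x)
    s = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    t = list(Counter(x).values())  # sizes of tied groups of x values
    u = list(Counter(y).values())  # sizes of tied groups of y values
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(ti * (ti - 1) * (2 * ti + 5) for ti in t)
    vu = sum(ui * (ui - 1) * (2 * ui + 5) for ui in u)
    v1 = sum(ti * (ti - 1) for ti in t) * sum(ui * (ui - 1) for ui in u)
    v2 = (sum(ti * (ti - 1) * (ti - 2) for ti in t)
          * sum(ui * (ui - 1) * (ui - 2) for ui in u))
    var_s = ((v0 - vt - vu) / 18
             + v1 / (2 * n * (n - 1))
             + v2 / (9 * n * (n - 1) * (n - 2)))
    z = s / math.sqrt(var_s)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return s, var_s, p
```

Tied groups of size 1 contribute zero to every tie term, so untied data give var(s) = n(n-1)(2n+5)/18.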
The probability values for Hoeffding's D statistic are computed using the asymptotic distribution computed by Blum, Kiefer, and Rosenblatt (1961). The formula is
$$ \frac{(n-1)\pi^4}{60}\, D + \frac{\pi^4}{72} $$
which comes from the asymptotic distribution. When the sample size is less than 10, see the tables for the distribution of D in Hollander and Wolfe (1973).
Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.