The UNIVARIATE Procedure |
Confidence Limits for Parameters of the Normal Distribution |
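The two-sided $100(1-\alpha)$ percent confidence interval for the mean has upper and lower limits
$$\bar{x} \pm t_{(1-\alpha/2;\,n-1)} \frac{s}{\sqrt{n}}$$
where $s$ is $\sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$ and $t_{(1-\alpha/2;\,n-1)}$ is the $(1-\alpha/2)$ percentile of the t distribution with $n-1$ degrees of freedom.
The one-sided $100(1-\alpha)$ percent confidence limit is computed as
$$\bar{x} + t_{(1-\alpha;\,n-1)} \frac{s}{\sqrt{n}} \ \text{(upper)} \qquad \bar{x} - t_{(1-\alpha;\,n-1)} \frac{s}{\sqrt{n}} \ \text{(lower)}$$
The two-sided $100(1-\alpha)$ percent confidence interval for the standard deviation has lower and upper limits
$$s\sqrt{\frac{n-1}{\chi^2_{(1-\alpha/2;\,n-1)}}} \quad \text{and} \quad s\sqrt{\frac{n-1}{\chi^2_{(\alpha/2;\,n-1)}}}$$
where $\chi^2_{(1-\alpha/2;\,n-1)}$ and $\chi^2_{(\alpha/2;\,n-1)}$ are the $(1-\alpha/2)$ and $\alpha/2$ percentiles of the chi-square distribution with $n-1$ degrees of freedom. A one-sided $100(1-\alpha)$ percent confidence limit is computed by replacing $\alpha/2$ with $\alpha$.
A $100(1-\alpha)$ percent confidence interval for the variance has upper and lower limits equal to the squares of the corresponding upper and lower limits for the standard deviation.
When you use the WEIGHT statement and specify VARDEF=DF in the PROC statement, the $100(1-\alpha)$ percent confidence interval for the weighted mean is
$$\bar{x}_w \pm t_{(1-\alpha/2)} \frac{s_w}{\sqrt{\sum_{i=1}^{n} w_i}}$$
where $\bar{x}_w$ is the weighted mean, $s_w$ is the weighted standard deviation, $w_i$ is the weight for the $i$th observation, and $t_{(1-\alpha/2)}$ is the $(1-\alpha/2)$ critical percentage for the t distribution with $n-1$ degrees of freedom.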
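As a minimal sketch of the unweighted limits above, the following steps request them with the CIBASIC option and then recompute the two-sided limits by hand with the TINV and CINV functions. The data set SCORES, its values, and the variable names in the LIMITS data set are hypothetical and serve only as an illustration.

```sas
/* Hypothetical data: ten measurements */
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 12.2
;

/* CIBASIC requests confidence limits for the mean, */
/* standard deviation, and variance                 */
proc univariate data=scores cibasic alpha=0.05;
   var x;
run;

/* Recompute the two-sided limits from n, the mean, and s */
proc means data=scores noprint;
   var x;
   output out=stats n=n mean=xbar std=s;
run;

data limits;
   set stats;
   alpha = 0.05;
   t = tinv(1 - alpha/2, n - 1);          /* t percentile, n-1 df */
   mean_lower = xbar - t*s/sqrt(n);
   mean_upper = xbar + t*s/sqrt(n);
   std_lower = s*sqrt((n - 1)/cinv(1 - alpha/2, n - 1));
   std_upper = s*sqrt((n - 1)/cinv(alpha/2, n - 1));
run;

proc print data=limits;
run;
```

The hand-computed limits should agree with the values in the Basic Confidence Limits table, up to rounding in the printed output.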
Tests for Location |
The Student's t test is appropriate when the data are from an approximately normal population; otherwise, use nonparametric tests such as the sign test or the signed rank test. For large sample situations, the t test is asymptotically equivalent to a z test.
If you use the WEIGHT statement, PROC UNIVARIATE computes only one weighted test for location, the t test. You must use the default value for the VARDEF= option in the PROC statement.
You can also
compare means or medians of paired data. Data are said to be
paired when subjects or units are matched in pairs according to one or more
variables, such as pairs of subjects with the same age and gender. Paired
data also occur when each subject or unit is measured at two times or under
two conditions. To compare the means or medians of the two times, create an
analysis variable that is the difference between the two measures. The test
that the mean or the median difference of the variables equals zero is equivalent
to the test that the means or medians of the two original variables are equal.
See Performing a Sign Test Using Paired Data .
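The paired-data approach described above can be sketched as follows. The data set PAIRED and its variables BEFORE, AFTER, and DIFF are hypothetical; MU0=0 is the default and is written out only for clarity.

```sas
/* Hypothetical paired measurements on eight subjects */
data paired;
   input before after @@;
   diff = after - before;   /* analysis variable: the paired difference */
   datalines;
120 118  132 127  115 117  140 133
128 125  119 119  135 130  126 121
;

/* The tests for location on DIFF test whether the mean */
/* (t test) or the median (sign and signed rank tests)  */
/* of the differences equals zero                       */
proc univariate data=paired mu0=0;
   var diff;
run;
```

In this sketch one pair has a zero difference; as the text notes for the sign and signed rank tests, values equal to the hypothesized location are discarded.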
The t statistic is computed as
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where $\bar{x}$ is the sample mean, $n$ is the number of nonmissing values for a variable, and $s$ is the sample standard deviation. Under the null hypothesis, the population mean equals $\mu_0$. When the data values are approximately normally distributed, the probability under the null hypothesis of a t statistic that is as extreme as, or more extreme than, the observed value (the p-value) is obtained from the t distribution with $n-1$ degrees of freedom. For large $n$, the t statistic is asymptotically equivalent to a z test.
When you use the WEIGHT statement and the default value of VARDEF=, which is DF, the t statistic is calculated as
$$t_w = \frac{\bar{x}_w - \mu_0}{s_w / \sqrt{\sum_{i=1}^{n} w_i}}$$
where $\bar{x}_w$ is the weighted mean, $s_w$ is the weighted standard deviation, and $w_i$ is the weight for the $i$th observation. The $t_w$ statistic is treated as having a Student's t distribution with $n-1$ degrees of freedom. If you specify the EXCLNPWGT option in the PROC statement, $n$ is the number of nonmissing observations when the value of the WEIGHT variable is positive. By default, $n$ is the number of nonmissing observations for the WEIGHT variable.
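The sign test statistic is computed as
$$M = \frac{n^+ - n^-}{2}$$
where $n^+$ is the number of values that are greater than $\mu_0$ and $n^-$ is the number of values that are less than $\mu_0$. Values equal to $\mu_0$ are discarded.
Under the null hypothesis that the population median is equal to $\mu_0$, the p-value for the observed statistic M is
$$\Pr(|M_{obs}| \ge |M|) = 0.5^{(n_t - 1)} \sum_{j=0}^{\min(n^+,\,n^-)} \binom{n_t}{j}$$
where $n_t = n^+ + n^-$ is the number of $x_i$ values not equal to $\mu_0$.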
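As a worked illustration of the sign test, the DATA step below evaluates the statistic and its two-sided binomial p-value for hypothetical counts of 7 values above and 2 values below the hypothesized median. The variable names are assumptions. The p-value is written as twice a cumulative binomial probability with success probability 0.5, which is algebraically the same as $0.5^{(n_t-1)}$ times the sum of binomial coefficients up to the smaller count. The cap at 1 is an addition for this sketch.

```sas
data _null_;
   nplus  = 7;                   /* hypothetical count above mu0 */
   nminus = 2;                   /* hypothetical count below mu0 */
   nt = nplus + nminus;          /* values not equal to mu0      */
   m  = (nplus - nminus)/2;      /* sign statistic M             */
   /* 2*Pr(Binomial(nt, 0.5) <= min(nplus, nminus)), capped at 1 */
   p = min(1, 2*probbnml(0.5, nt, min(nplus, nminus)));
   put m= p=;
run;
```

For these counts the sum of binomial coefficients is 1 + 9 + 36 = 46, so the p-value is 46/256, or about 0.1797.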
The signed rank statistic is computed as
$$S = \sum r_i^+ - \frac{n_t (n_t + 1)}{4}$$
where $r_i^+$ is the rank of $|x_i - \mu_0|$ after discarding values of $x_i$ equal to $\mu_0$, $n_t$ is the number of $x_i$ values not equal to $\mu_0$, and the sum is calculated for values of $x_i - \mu_0$ greater than 0. Average ranks are used for tied values.
The p-value is the probability of obtaining a signed rank statistic greater in absolute value than the absolute value of the observed statistic S. If $n_t \le 20$, the significance level of $S$ is computed from the exact distribution of $S$, which can be enumerated under the null hypothesis that the distribution is symmetric about $\mu_0$. When $n_t > 20$, the significance level of $S$ is computed by treating
$$S \sqrt{\frac{n_t - 1}{n_t V - S^2}}$$
as a Student's t variate with $n_t - 1$ degrees of freedom. $V$ is computed as
$$V = \frac{1}{24} n_t (n_t + 1)(2 n_t + 1) - \frac{1}{48} \sum t_i (t_i + 1)(t_i - 1)$$
where the sum is calculated over groups that are tied in absolute value, and $t_i$ is the number of tied values in the $i$th group (Iman 1974; Conover 1980).
The Wilcoxon signed rank test assumes that the distribution is symmetric. If the assumption is not valid, you can use the sign test to test that the median is $\mu_0$. See Lehmann (1975) for more details.
Goodness-of-Fit Tests |
You determine whether to reject the null hypothesis by examining the probability that is associated with a test statistic. When the p-value is less than the predetermined critical value (alpha value), you reject the null hypothesis and conclude that the data did not come from the theoretical distribution.
If you want to test the normality assumptions that underlie analysis
of variance methods, beware of using a statistical test for normality alone.
A test's ability to reject the null hypothesis (known as the power
of the test) increases with the sample size. As the sample size becomes larger,
increasingly smaller departures from normality can be detected. Since small
deviations from normality do not severely affect the validity of analysis
of variance tests, it is important to examine other statistics and plots to
make a final assessment of normality. The skewness and kurtosis measures
and the plots that are provided by the PLOTS option, the HISTOGRAM statement,
PROBPLOT statement, and QQPLOT statement can be very helpful. For small sample
sizes, power is low for detecting larger departures from normality that may
be important. To increase the test's ability to detect such deviations, you
may want to declare significance at higher levels, such as 0.15 or 0.20, rather
than the often-used 0.05 level. Again, consulting plots and additional statistics
will help you assess the severity of the deviations from normality.
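One way to combine the formal tests with the supporting statistics and plots mentioned above is sketched here. The data set SCORES is hypothetical, and the graphics statements assume that high-resolution graphics are available in your installation.

```sas
/* Hypothetical data */
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 12.2
;

/* NORMAL requests the tests for normality; PLOTS adds the   */
/* stem-and-leaf, box, and normal probability plots. The     */
/* default output already lists skewness and kurtosis.       */
proc univariate data=scores normal plots;
   var x;
   histogram x / normal;                    /* fitted normal curve */
   probplot  x / normal(mu=est sigma=est);  /* probability plot    */
   qqplot    x / normal(mu=est sigma=est);  /* Q-Q plot            */
run;
```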
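For the Shapiro-Wilk statistic $W_n$, the p-value is obtained from a normalizing transformation $Z_n$ of $W_n$ whose form depends on the sample size $n$. The parameters of the transformation are functions of $n$, obtained from simulation results, and $Z_n$ is a standard normal variate. Large values of $Z_n$ indicate departure from normality.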
Note that the empirical distribution function $F_n(x)$ is a step function that takes a step of height $1/n$ at each observation. This function estimates the distribution function $F(x)$. At any value $x$, $F_n(x)$ is the proportion of observations that are less than or equal to $x$, while $F(x)$ is the theoretical probability of an observation that is less than or equal to $x$. EDF statistics measure the discrepancy between $F_n(x)$ and $F(x)$.
The computational formulas for the EDF statistics use the probability integral transformation $U = F(X)$. If $F(X)$ is the distribution function of $X$, the random variable $U$ is uniformly distributed between 0 and 1.
Given $n$ observations $X_{(1)}, \ldots, X_{(n)}$, PROC UNIVARIATE computes the values $U_{(i)} = F(X_{(i)})$ by applying the transformation.
When you specify the NORMAL option in the PROC UNIVARIATE statement or use the HISTOGRAM statement to fit a parametric distribution, PROC UNIVARIATE provides a series of goodness-of-fit tests that are based on the empirical distribution function (EDF): Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises.
Once the EDF test statistics are computed, the associated p-values are calculated. PROC UNIVARIATE uses internal tables of probability levels that are similar to those given by D'Agostino and Stephens (1986). If the value lies between two probability levels, then linear interpolation is used to estimate the probability value.
Note: PROC UNIVARIATE does not support some of the EDF tests when you
use the HISTOGRAM statement and you estimate the parameters of the specified
distribution. See Availability of EDF Tests for more information.
The Kolmogorov-Smirnov statistic ($D$) belongs to the supremum class of EDF statistics. This class of statistics is based on the largest vertical difference between $F(x)$ and $F_n(x)$.
The Kolmogorov-Smirnov statistic is computed as the maximum of $D^+$ and $D^-$. $D^+$ is the largest vertical distance between the EDF and the distribution function when the EDF is greater than the distribution function. $D^-$ is the largest vertical distance when the EDF is less than the distribution function.
$$D^+ = \max_i \left( \frac{i}{n} - U_{(i)} \right), \qquad D^- = \max_i \left( U_{(i)} - \frac{i-1}{n} \right), \qquad D = \max(D^+, D^-)$$
PROC UNIVARIATE uses a modified
Kolmogorov D statistic to test the data against a normal distribution
with mean and variance equal to the sample mean and variance.
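The Anderson-Darling statistic and the Cramer-von Mises statistic belong to the quadratic class of EDF statistics, which have the form
$$Q = n \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 \psi(x) \, dF(x)$$
The function $\psi(x)$ weights the squared difference $(F_n(x) - F(x))^2$.
The Anderson-Darling statistic ($A^2$) is defined as
$$A^2 = n \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 \left[ F(x)(1 - F(x)) \right]^{-1} dF(x)$$
where the weight function is $\psi(x) = [F(x)(1 - F(x))]^{-1}$.
The Anderson-Darling statistic is computed as
$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \left[ (2i - 1) \log U_{(i)} + (2n + 1 - 2i) \log \left( 1 - U_{(i)} \right) \right]$$
The Cramer-von Mises statistic ($W^2$) is defined as
$$W^2 = n \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 dF(x)$$
where the weight function is $\psi(x) = 1$.
The Cramer-von Mises statistic is computed as
$$W^2 = \sum_{i=1}^{n} \left( U_{(i)} - \frac{2i - 1}{2n} \right)^2 + \frac{1}{12n}$$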
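The computational formulas for the three EDF statistics can be sketched in a DATA step. This illustration assumes a fully specified normal distribution with mean 12 and standard deviation 0.5, so it does not reproduce the modified statistics that PROC UNIVARIATE reports when parameters are estimated. The data set and variable names are hypothetical.

```sas
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 12.2
;

/* Order the observations so that _N_ is the rank i */
proc sort data=scores out=sorted;
   by x;
run;

data _null_;
   set sorted nobs=n end=last;
   retain dplus dminus a2sum w2sum 0;
   i = _n_;
   u = probnorm((x - 12)/0.5);        /* U(i) = F(X(i)), assumed F */
   dplus  = max(dplus,  i/n - u);     /* EDF above F               */
   dminus = max(dminus, u - (i - 1)/n);  /* EDF below F            */
   a2sum + ((2*i - 1)*log(u) + (2*n + 1 - 2*i)*log(1 - u));
   w2sum + (u - (2*i - 1)/(2*n))**2;
   if last then do;
      d  = max(dplus, dminus);        /* Kolmogorov-Smirnov D      */
      a2 = -n - a2sum/n;              /* Anderson-Darling A-square */
      w2 = w2sum + 1/(12*n);          /* Cramer-von Mises W-square */
      put d= a2= w2=;
   end;
run;
```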
The probability value depends upon the parameters that are known and the parameters that PROC UNIVARIATE estimates for the fitted distribution. Availability of EDF Tests summarizes different combinations of estimated parameters for which EDF tests are available.
Note: PROC UNIVARIATE assumes that the threshold (THETA=) parameter
for the beta, exponential, gamma, lognormal, and Weibull distributions is
known. If you omit its value, PROC UNIVARIATE assumes that it is zero and
that it is known. Likewise, PROC UNIVARIATE assumes that the SIGMA= parameter,
which determines the upper threshold (SIGMA) for the beta distribution, is
known. If you omit its value, PROC UNIVARIATE assumes that the value is one.
These parameters are not listed in Availability of EDF Tests because they are assumed to be known in
all cases, and they do not affect which EDF statistics PROC UNIVARIATE computes.
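Distribution | Parameters | EDF Tests Available |
---|---|---|
Beta | $\alpha$ and $\beta$ unknown | none |
Beta | $\alpha$ known, $\beta$ unknown | none |
Beta | $\alpha$ unknown, $\beta$ known | none |
Beta | $\alpha$ and $\beta$ known | all |
Exponential | $\sigma$ unknown | all |
Exponential | $\sigma$ known | all |
Gamma | $\alpha$ and $\sigma$ unknown | none |
Gamma | $\alpha$ known, $\sigma$ unknown | none |
Gamma | $\alpha$ unknown, $\sigma$ known | none |
Gamma | $\alpha$ and $\sigma$ known | all |
Lognormal | $\zeta$ and $\sigma$ unknown | all |
Lognormal | $\zeta$ known, $\sigma$ unknown | $A^2$ and $W^2$ |
Lognormal | $\zeta$ unknown, $\sigma$ known | $A^2$ and $W^2$ |
Lognormal | $\zeta$ and $\sigma$ known | all |
Normal | $\mu$ and $\sigma$ unknown | all |
Normal | $\mu$ known, $\sigma$ unknown | $A^2$ and $W^2$ |
Normal | $\mu$ unknown, $\sigma$ known | $A^2$ and $W^2$ |
Normal | $\mu$ and $\sigma$ known | all |
Weibull | $c$ and $\sigma$ unknown | $A^2$ and $W^2$ |
Weibull | $c$ known, $\sigma$ unknown | $A^2$ and $W^2$ |
Weibull | $c$ unknown, $\sigma$ known | $A^2$ and $W^2$ |
Weibull | $c$ and $\sigma$ known | all |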
Robust Estimators |
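The $k$-times Winsorized mean is
$$\bar{x}_{wk} = \frac{1}{n} \left( (k+1) x_{(k+1)} + \sum_{i=k+2}^{n-k-1} x_{(i)} + (k+1) x_{(n-k)} \right)$$
The Winsorized mean is computed after the $k$ smallest observations are replaced by the $(k+1)$st smallest observation, and the $k$ largest observations are replaced by the $(k+1)$st largest observation.
For a symmetric distribution, the symmetrically Winsorized mean is an unbiased estimate of the population mean. But the Winsorized mean does not have a normal distribution even if the data are from a normal population.
The Winsorized sum of squared deviations is defined as
$$s_{wk}^2 = (k+1) \left( x_{(k+1)} - \bar{x}_{wk} \right)^2 + \sum_{i=k+2}^{n-k-1} \left( x_{(i)} - \bar{x}_{wk} \right)^2 + (k+1) \left( x_{(n-k)} - \bar{x}_{wk} \right)^2$$
A Winsorized t test is given by
$$t_{wk} = \frac{\bar{x}_{wk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{wk})}$$
where the standard error of the Winsorized mean is
$$\mathrm{STDERR}(\bar{x}_{wk}) = \frac{n-1}{n-2k-1} \times \frac{s_{wk}}{\sqrt{n(n-1)}}$$
When the data are from a symmetric distribution, the distribution of the Winsorized t statistic is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).
A $100(1-\alpha)$ percent confidence interval for the Winsorized mean has upper and lower limits
$$\bar{x}_{wk} \pm t_{(1-\alpha/2)} \, \mathrm{STDERR}(\bar{x}_{wk})$$
and the $(1-\alpha/2)$ critical value of the Student's t statistic has $n-2k-1$ degrees of freedom.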
The $k$-times trimmed mean is
$$\bar{x}_{tk} = \frac{1}{n-2k} \sum_{i=k+1}^{n-k} x_{(i)}$$
The trimmed mean is computed after the $k$ smallest and $k$ largest observations are deleted from the sample. In other words, the observations are trimmed at each end.
For a symmetric distribution, the symmetrically trimmed mean is an unbiased estimate of the population mean. But the trimmed mean does not have a normal distribution even if the data are from a normal population.
A robust estimate of the variance of the trimmed mean can be based on the Winsorized sum of squared deviations (Tukey and McLaughlin 1963). The resulting trimmed t test is given by
$$t_{tk} = \frac{\bar{x}_{tk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{tk})}$$
where the standard error of the trimmed mean is
$$\mathrm{STDERR}(\bar{x}_{tk}) = \frac{s_{wk}}{\sqrt{(n-2k)(n-2k-1)}}$$
and $s_{wk}$ is the square root of the Winsorized sum of squared deviations.
When the data are from a symmetric distribution, the distribution of the trimmed t statistic is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom (Tukey and McLaughlin 1963, Dixon and Tukey 1968).
A $100(1-\alpha)$ percent confidence interval for the trimmed mean has upper and lower limits
$$\bar{x}_{tk} \pm t_{(1-\alpha/2)} \, \mathrm{STDERR}(\bar{x}_{tk})$$
and the $(1-\alpha/2)$ critical value of the Student's t statistic has $n-2k-1$ degrees of freedom.
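A minimal sketch of requesting these estimators follows. The data set SCORES is hypothetical and includes one deliberately extreme value; the choice of trimming and Winsorizing two observations from each tail and the value MU0=12 are assumptions for illustration.

```sas
/* Hypothetical data with one outlying value (19.5) */
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 19.5
;

/* TRIMMED=2 and WINSORIZED=2 use k=2 observations in each   */
/* tail; the t tests compare the robust means with MU0       */
proc univariate data=scores trimmed=2 winsorized=2 mu0=12;
   var x;
run;
```

With $n = 10$ and $k = 2$, both t statistics are referred to a Student's t distribution with $n - 2k - 1 = 5$ degrees of freedom.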
PROC UNIVARIATE computes robust measures of scale that include the interquartile range, Gini's mean difference $G$, MAD, $Q_n$, and $S_n$, with their corresponding estimates of $\sigma$.
The interquartile range is a simple robust scale estimator, which is the difference between the upper and lower quartiles. For a normal population, the standard deviation can be estimated by dividing the interquartile range by 1.34898.
Gini's mean difference is also a robust estimator of the standard deviation $\sigma$. For a normal population, Gini's mean difference has expected value $2\sigma/\sqrt{\pi}$. Thus, multiplying Gini's mean difference by $\sqrt{\pi}/2$ yields a robust estimator of the standard deviation when the data are from a normal sample. The constructed estimator has high efficiency for the normal distribution relative to the usual sample standard deviation. It is also less sensitive to the presence of outliers than the sample standard deviation.
Gini's mean difference is computed as
$$G = \frac{1}{\binom{n}{2}} \sum_{i<j} |x_i - x_j|$$
If the observations are from a normal distribution, then $\sqrt{\pi}\,G/2$ is an unbiased estimator of the standard deviation $\sigma$.
A very robust scale estimator is the MAD, the median absolute deviation about the median (Hampel 1974):
$$\mathrm{MAD} = \mathrm{med}_i \left( \left| x_i - \mathrm{med}_j (x_j) \right| \right)$$
where the inner median, $\mathrm{med}_j(x_j)$, is the median of the $n$ observations and the outer median, $\mathrm{med}_i$, is the median of the $n$ absolute values of the deviations about the median.
For a normal distribution, 1.4826·MAD can be used to estimate the standard deviation $\sigma$.
The MAD statistic has low efficiency for normal distributions, and it may not be appropriate for asymmetric distributions. Rousseeuw and Croux (1993) proposed two new statistics as alternatives to the MAD statistic.
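The first statistic is
$$S_n = 1.1926 \; \mathrm{med}_i \left( \mathrm{med}_j \left( |x_i - x_j| \right) \right)$$
where the outer median, $\mathrm{med}_i$, is the median of the $n$ medians of $\{ |x_i - x_j| ;\ j = 1, 2, \ldots, n \}$.
To reduce the small-sample bias, $c_{sn} S_n$ is used to estimate the standard deviation $\sigma$, where $c_{sn}$ is the correction factor (Croux and Rousseeuw 1992).
The second statistic is
$$Q_n = 2.2219 \, \{ |x_i - x_j| ;\ i < j \}_{(k)}$$
where $k = \binom{h}{2}$, $h = [n/2] + 1$, and $[n/2]$ is the integer part of $n/2$. That is, $Q_n$ is 2.2219 times the $k$th order statistic of the $\binom{n}{2}$ distances between data points.
The bias-corrected statistic, $c_{qn} Q_n$, is used to estimate the standard deviation $\sigma$, where $c_{qn}$ is a correction factor.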
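The following sketch requests the robust measures of scale with the ROBUSTSCALE option and, for comparison, computes the MAD-based estimate 1.4826·MAD by hand. The data set SCORES and all intermediate data set and variable names are hypothetical.

```sas
/* Hypothetical data with one outlying value (19.5) */
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 19.5
;

/* ROBUSTSCALE prints the interquartile range, Gini's mean   */
/* difference, MAD, Sn, and Qn with their estimates of sigma */
proc univariate data=scores robustscale;
   var x;
run;

/* Hand computation of MAD: median of |x - median(x)| */
proc means data=scores noprint;
   var x;
   output out=med median=m;
run;

data dev;
   if _n_ = 1 then set med(keep=m);   /* carry the median along */
   set scores;
   absdev = abs(x - m);
run;

proc means data=dev noprint;
   var absdev;
   output out=madout median=mad;
run;

data madout;
   set madout;
   sigma_hat = 1.4826*mad;   /* estimate of sigma for normal data */
run;

proc print data=madout;
   var mad sigma_hat;
run;
```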
Calculating Percentiles |
To compute the quantile that each observation falls in, use PROC RANK
with the GROUPS= option. To calculate percentiles other than the default percentiles,
use PCTLPTS= and PCTLPRE= in the OUTPUT statement.
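Both suggestions can be sketched as follows. The data set SCORES, the output data set names, the prefix P_, and the choice of the 2.5th and 97.5th percentiles are hypothetical.

```sas
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 12.2
;

/* Assign each observation to a quartile group (0 to 3) */
proc rank data=scores groups=4 out=ranked;
   var x;
   ranks x_quartile;
run;

/* Save nondefault percentiles; the variable names are built */
/* from the PCTLPRE= prefix and the PCTLPTS= values          */
proc univariate data=scores noprint;
   var x;
   output out=pctls pctlpts=2.5 97.5 pctlpre=p_;
run;

proc print data=pctls;
run;
```

The output data set PCTLS contains the variables P_2_5 and P_97_5.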
When , the two-sided percent confidence interval for quantiles that are based on normal data has lower and upper limits
where is the percentile .
When , the lower and upper limits are
A one-sided percent confidence limit is computed by replacing with . The factor is described in Owen and Hua (1977) and Odeh and Owen (1980).
The two-sided distribution-free $100(1-\alpha)$% confidence interval for quantiles from a sample of size $n$ is
$$\left( X_{(l)}, X_{(u)} \right)$$
where $X_{(j)}$ is the $j$th order statistic. The lower rank $l$ and upper rank $u$ are integers that are symmetric or nearly symmetric around $[np] + 1$, where $[np]$ is the integral part of $np$.
The $l$ and $u$ are chosen so that the order statistics $X_{(l)}$ and $X_{(u)}$ are approximately symmetric about the center rank, are as close to it as possible, and satisfy the coverage probability requirement
$$Q(u-1;\, n, p) - Q(l-1;\, n, p) \ge 1 - \alpha$$
where $Q$ is the cumulative binomial probability, $0 < l < u \le n$, and $0 < p < 1$.
The coverage probability is sometimes less than $1-\alpha$. This can occur in the tails of the distribution when the sample size is small. To avoid this problem, you can specify the option TYPE=ASYMMETRIC, which causes PROC UNIVARIATE to use asymmetric values of $l$ and $u$. However, PROC UNIVARIATE first attempts to compute confidence limits that satisfy all three conditions. If the last condition is not satisfied, then the first condition is relaxed. Thus, some of the confidence limits may be symmetric while others, especially in the extremes, are not.
A one-sided distribution-free lower $100(1-\alpha)$ percent confidence limit is computed as $X_{(l)}$ when $l$ is the largest integer that satisfies the inequality
$$1 - Q(l-1;\, n, p) \ge 1 - \alpha$$
where $0 < l \le n$, and $0 < p < 1$. Likewise, a one-sided distribution-free upper $100(1-\alpha)$% confidence limit is computed as $X_{(u)}$ when $u$ is the smallest integer that satisfies the inequality
$$Q(u-1;\, n, p) \ge 1 - \alpha$$
where $0 < u \le n$, and $0 < p < 1$.
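As a brief sketch, the confidence limits for quantiles can be requested with the CIPCTLNORMAL and CIPCTLDF options. The data set SCORES is hypothetical, and the 90 percent level is an arbitrary choice for illustration.

```sas
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 12.2
11.8 12.9 12.0 12.4 11.7 12.6 13.1 11.5 12.2 12.3
;

/* CIPCTLNORMAL: limits that assume normal data.              */
/* CIPCTLDF: distribution-free limits from order statistics;  */
/* TYPE=ASYMMETRIC allows asymmetric ranks where needed.      */
proc univariate data=scores
                cipctlnormal(alpha=0.10)
                cipctldf(type=asymmetric alpha=0.10);
   var x;
run;
```

With a sample this small, distribution-free limits for extreme percentiles may be unavailable because no pair of order statistics attains the required coverage.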
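When you use the WEIGHT statement, the percentiles are computed as follows. Let $x_i$ be the $i$th ordered nonmissing value, $x_1 \le x_2 \le \ldots \le x_n$. Then, for a given value of $p$ between 0 and 1, the $p$th weighted quantile (or $100p$th weighted percentile), $y$, is computed from the empirical distribution function with averaging
$$y = \begin{cases} \frac{1}{2} (x_i + x_{i+1}) & \text{if } \sum_{j=1}^{i} w_j = pW \\ x_{i+1} & \text{if } \sum_{j=1}^{i} w_j < pW < \sum_{j=1}^{i+1} w_j \end{cases}$$
where $w_i$ is the weight associated with $x_i$, and $W = \sum_{i=1}^{n} w_i$ is the sum of the weights.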
When the observations have identical weights, the weighted percentiles are the same as the unweighted percentiles with PCTLDEF=5.
Calculating the Mode |
The WEIGHT statement has no effect on the mode.
Formulas for Fitted Continuous Distributions |
where and
This notation is consistent with that of other distributions that you can fit with the HISTOGRAM statement. However, many texts, including Johnson, et al. (1994), write the beta density function as:
$$p(x) = \frac{(x-a)^{p-1} (b-x)^{q-1}}{B(p,q)\,(b-a)^{p+q-1}} \quad \text{for } a < x < b$$
The two notations are related as follows:
$$\sigma = b - a, \qquad \theta = a, \qquad \alpha = p, \qquad \beta = q$$
The range of the beta distribution is bounded below by a threshold parameter $\theta = a$ and above by $\theta + \sigma = b$. If you specify a fitted beta curve using the BETA option, $\theta$ must be less than the minimum data value, and $\theta + \sigma$ must be greater than the maximum data value. You can specify $\theta$ and $\sigma$ with the THETA= and SIGMA= value in parentheses after the keyword BETA. By default, $\sigma = 1$ and $\theta = 0$. If you specify THETA=EST and SIGMA=EST, maximum likelihood estimates are computed for $\theta$ and $\sigma$.
Note: However, three- and four-parameter maximum
likelihood estimation may not always converge.
In addition, you can specify $\alpha$ and $\beta$ with the ALPHA= and BETA= beta-options, respectively. By default, the procedure calculates maximum likelihood estimates for $\alpha$ and $\beta$. For example, to fit a beta density curve to a set of data bounded below by 32 and above by 212 with maximum likelihood estimates for $\alpha$ and $\beta$, use the following statement:
histogram length / beta(theta=32 sigma=180);
The beta distributions are also referred to as Pearson Type I or II distributions. These include the power-function distribution ($\beta = 1$), the arc-sine distribution ($\alpha = \beta = \frac{1}{2}$), and the generalized arc-sine distributions ($\alpha + \beta = 1$, $\beta \ne \frac{1}{2}$). You can use the DATA step function BETAINV to compute beta quantiles and the DATA step function PROBBETA to compute beta probabilities.
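A short sketch of the two DATA step functions follows. The shape values 2 and 3 are hypothetical; the threshold 32 and scale 180 repeat the values from the HISTOGRAM statement above. Both functions work on the standard beta distribution on (0, 1), so the quantile is rescaled by hand.

```sas
data _null_;
   q = betainv(0.5, 2, 3);    /* median of the standard beta(2,3)  */
   p = probbeta(q, 2, 3);     /* recovers the probability 0.5      */
   x = 32 + 180*q;            /* quantile on the scale theta=32,   */
                              /* sigma=180                         */
   put q= p= x=;
run;
```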
where
The threshold parameter $\theta$ must be less than or equal to the minimum data value. You can specify $\theta$ with the THRESHOLD= exponential-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. In addition, you can specify $\sigma$ with the SCALE= exponential-option. By default, the procedure calculates a maximum likelihood estimate for $\sigma$. Note that some authors define the scale parameter as $1/\sigma$.
The exponential distribution is a special case of both the gamma distribution (with $\alpha = 1$) and the Weibull distribution (with $c = 1$). A related distribution is the extreme value distribution. If $Y = \exp(-X)$ has an exponential distribution, then $X$ has an extreme value distribution.
where
The threshold parameter $\theta$ must be less than the minimum data value. You can specify $\theta$ with the THRESHOLD= gamma-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. In addition, you can specify $\sigma$ and $\alpha$ with the SCALE= and ALPHA= gamma-options. By default, the procedure calculates maximum likelihood estimates for $\sigma$ and $\alpha$.
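The gamma distributions are also referred to as Pearson Type III distributions, and they include the chi-square, exponential, and Erlang distributions. The probability density function for the chi-square distribution is
$$p(x) = \begin{cases} \frac{1}{2\Gamma(\nu/2)} \left( \frac{x}{2} \right)^{\nu/2 - 1} \exp\left( -\frac{x}{2} \right) & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
Notice that this is a gamma distribution with $\alpha = \nu/2$, $\sigma = 2$, and $\theta = 0$. The exponential distribution is a gamma distribution with $\alpha = 1$, and the Erlang distribution is a gamma distribution with $\alpha$ being a positive integer. A related distribution is the Rayleigh distribution. If $R = \frac{\max(X_1, \ldots, X_n)}{\min(X_1, \ldots, X_n)}$ where the $X_i$'s are independent $\chi^2_\nu$ variables, then $\log R$ is distributed with a $\chi_\nu$ distribution having a probability density function of
$$p(x) = \begin{cases} \left[ 2^{\nu/2 - 1} \Gamma(\nu/2) \right]^{-1} x^{\nu - 1} \exp\left( -\frac{x^2}{2} \right) & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
If $\nu = 2$, the preceding distribution is referred to as the Rayleigh distribution. You can use the DATA step function GAMINV to compute gamma quantiles and the DATA step function PROBGAM to compute gamma probabilities.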
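The relationship between the gamma and chi-square distributions can be checked with a short DATA step. GAMINV and PROBGAM work on the standard gamma distribution (unit scale, zero threshold), so the scale factor of 2 is applied by hand. The degrees of freedom and probability level are hypothetical.

```sas
data _null_;
   nu = 4;                           /* chi-square degrees of freedom */
   q_gamma = 2*gaminv(0.95, nu/2);   /* gamma quantile, alpha=nu/2,   */
                                     /* rescaled by sigma=2           */
   q_chisq = cinv(0.95, nu);         /* chi-square quantile directly  */
   p = probgam(q_gamma/2, nu/2);     /* recovers the probability 0.95 */
   put q_gamma= q_chisq= p=;
run;
```

The two quantiles agree, which illustrates that the chi-square distribution is a gamma distribution with shape $\nu/2$ and scale 2.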
where
The threshold parameter $\theta$ must be less than the minimum data value. You can specify $\theta$ with the THRESHOLD= lognormal-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. You can specify $\zeta$ and $\sigma$ with the SCALE= and SHAPE= lognormal-options, respectively. By default, the procedure calculates maximum likelihood estimates for these parameters.
Note: $\sigma$ denotes the shape parameter of the lognormal distribution, whereas $\sigma$ denotes the scale parameter of the beta, exponential, gamma, normal, and Weibull distributions. The use of $\sigma$ to denote the lognormal shape parameter is based on the fact that $\frac{1}{\sigma} \left( \log(X - \theta) - \zeta \right)$ has a standard normal distribution if $X$ is lognormally distributed.
where
You can specify $\mu$ and $\sigma$ with the MU= and SIGMA= normal-options, respectively. By default, the procedure estimates $\mu$ with the sample mean and $\sigma$ with the sample standard deviation. You can use the DATA step function PROBIT to compute normal quantiles and the DATA step function PROBNORM to compute probabilities.
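For example, a DATA step along these lines converts between probabilities and quantiles for a normal distribution. PROBIT and PROBNORM work on the standard normal scale, so the mean and standard deviation, which are hypothetical here, are applied by hand.

```sas
data _null_;
   mu = 12;  sigma = 0.5;            /* hypothetical parameters      */
   z = probit(0.975);                /* standard normal quantile     */
   x = mu + sigma*z;                 /* 97.5th percentile of N(12, 0.5**2) */
   p = probnorm((x - mu)/sigma);     /* recovers the probability     */
   put z= x= p=;
run;
```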
where
The threshold parameter $\theta$ must be less than the minimum data value. You can specify $\theta$ with the THRESHOLD= Weibull-option. By default, $\theta = 0$. If you specify THETA=EST, a maximum likelihood estimate is computed for $\theta$. You can specify $\sigma$ and $c$ with the SCALE= and SHAPE= Weibull-options, respectively. By default, the procedure calculates maximum likelihood estimates for $\sigma$ and $c$.
The exponential distribution is a special case of the Weibull distribution where $c = 1$.
The general form of the kernel density estimator is
$$\hat{f}_\lambda(x) = \frac{1}{n\lambda} \sum_{i=1}^{n} K_0 \left( \frac{x - x_i}{\lambda} \right)$$
where $K_0(\cdot)$ is a kernel function, $\lambda$ is the bandwidth, $n$ is the sample size, and $x_i$ is the $i$th observation.
The KERNEL option provides three kernel functions ($K_0$): normal, quadratic, and triangular. You can specify the function with the K=kernel-function in parentheses after the KERNEL option. Values for the K= option are NORMAL, QUADRATIC, and TRIANGULAR (with aliases of N, Q, and T, respectively). By default, a normal kernel is used. The formulas for the kernel functions are
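Normal | $K_0(t) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} t^2 \right)$ | for $-\infty < t < \infty$ |
Quadratic | $K_0(t) = \frac{3}{4} (1 - t^2)$ | for $\lvert t \rvert \le 1$ |
Triangular | $K_0(t) = 1 - \lvert t \rvert$ | for $\lvert t \rvert \le 1$ |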
The value of $\lambda$, referred to as the bandwidth parameter, determines the degree of smoothness in the estimated density function. You specify $\lambda$ indirectly by specifying a standardized bandwidth $c$ with the C=kernel-option. If $Q$ is the interquartile range, and $n$ is the sample size, then $c$ is related to $\lambda$ by the formula
$$\lambda = c \, Q \, n^{-1/5}$$
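For a specific kernel function, the discrepancy between the density estimator $\hat{f}_\lambda(x)$ and the true density $f(x)$ is measured by the mean integrated square error (MISE):
$$\mathrm{MISE}(\lambda) = \int_x \left\{ E\left( \hat{f}_\lambda(x) \right) - f(x) \right\}^2 dx + \int_x \mathrm{Var}\left( \hat{f}_\lambda(x) \right) dx$$
The MISE is the sum of the integrated squared bias and the variance. An approximate mean integrated square error (AMISE) is
$$\mathrm{AMISE}(\lambda) = \frac{1}{4} \lambda^4 \left( \int_t t^2 K(t) \, dt \right)^2 \int_x \left( f''(x) \right)^2 dx + \frac{1}{n\lambda} \int_t K(t)^2 \, dt$$
A bandwidth that minimizes AMISE can be derived by treating $f(x)$ as the normal density having parameters $\mu$ and $\sigma$ estimated by the sample mean and standard deviation. If you do not specify a bandwidth parameter or if you specify C=MISE, the bandwidth that minimizes AMISE is used. The value of AMISE can be used to compare different density estimates. For each estimate, the bandwidth parameter $c$, the kernel function type, and the value of AMISE are reported in the SAS log.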
Theoretical Distributions for Quantile-Quantile and Probability Plots |
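Distribution | Density Function $p(x)$ | Range | Location Parameter | Scale Parameter | Shape Parameter |
---|---|---|---|---|---|
Beta | $\frac{(x-\theta)^{\alpha-1} (\theta+\sigma-x)^{\beta-1}}{B(\alpha,\beta)\,\sigma^{\alpha+\beta-1}}$ | $\theta < x < \theta + \sigma$ | $\theta$ | $\sigma$ | $\alpha$, $\beta$ |
Exponential | $\frac{1}{\sigma} \exp\left( -\frac{x-\theta}{\sigma} \right)$ | $x \ge \theta$ | $\theta$ | $\sigma$ | |
Gamma | $\frac{1}{\sigma\Gamma(\alpha)} \left( \frac{x-\theta}{\sigma} \right)^{\alpha-1} \exp\left( -\frac{x-\theta}{\sigma} \right)$ | $x > \theta$ | $\theta$ | $\sigma$ | $\alpha$ |
Lognormal (3-parameter) | $\frac{1}{\sigma\sqrt{2\pi}\,(x-\theta)} \exp\left( -\frac{(\log(x-\theta)-\zeta)^2}{2\sigma^2} \right)$ | $x > \theta$ | $\theta$ | $\zeta$ | $\sigma$ |
Normal | $\frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$ | all $x$ | $\mu$ | $\sigma$ | |
Weibull (3-parameter) | $\frac{c}{\sigma} \left( \frac{x-\theta}{\sigma} \right)^{c-1} \exp\left( -\left( \frac{x-\theta}{\sigma} \right)^c \right)$ | $x > \theta$ | $\theta$ | $\sigma$ | $c$ |
Weibull2 (2-parameter) | $\frac{c}{\sigma} \left( \frac{x-\theta_0}{\sigma} \right)^{c-1} \exp\left( -\left( \frac{x-\theta_0}{\sigma} \right)^c \right)$ | $x > \theta_0$ | $\theta_0$ (known) | $\sigma$ | $c$ |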
The following sections provide the details for constructing Q-Q plots
that are based on these distributions. Probability plots are constructed similarly
except that the horizontal axis is scaled in percentile units.
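The point pattern on the plot for ALPHA=$\alpha$ and BETA=$\beta$ tends to be linear with intercept $\theta$ and slope $\sigma$ if the data are beta distributed with the specific density function
$$p(x) = \begin{cases} \frac{(x-\theta)^{\alpha-1} (\theta+\sigma-x)^{\beta-1}}{B(\alpha,\beta)\,\sigma^{\alpha+\beta-1}} & \text{for } \theta < x < \theta + \sigma \\ 0 & \text{otherwise} \end{cases}$$
where $B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ and $\theta$ is the lower threshold parameter, $\sigma$ is the scale parameter ($\sigma > 0$), $\alpha$ is the first shape parameter ($\alpha > 0$), and $\beta$ is the second shape parameter ($\beta > 0$).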
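The point pattern on the plot tends to be linear with intercept $\theta$ and slope $\sigma$ if the data are exponentially distributed with the specific density function
$$p(x) = \begin{cases} \frac{1}{\sigma} \exp\left( -\frac{x-\theta}{\sigma} \right) & \text{for } x \ge \theta \\ 0 & \text{for } x < \theta \end{cases}$$
where $\theta$ is a threshold parameter and $\sigma$ is a positive scale parameter.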
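The point pattern on the plot for ALPHA=$\alpha$ tends to be linear with intercept $\theta$ and slope $\sigma$ if the data are gamma distributed with the specific density function
$$p(x) = \begin{cases} \frac{1}{\sigma\Gamma(\alpha)} \left( \frac{x-\theta}{\sigma} \right)^{\alpha-1} \exp\left( -\frac{x-\theta}{\sigma} \right) & \text{for } x > \theta \\ 0 & \text{for } x \le \theta \end{cases}$$
where $\theta$ is the threshold parameter, $\sigma$ is the scale parameter ($\sigma > 0$), and $\alpha$ is the shape parameter ($\alpha > 0$).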
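The point pattern on the plot for SIGMA=$\sigma$ tends to be linear with intercept $\theta$ and slope $\exp(\zeta)$ if the data are lognormally distributed with the specific density function
$$p(x) = \begin{cases} \frac{1}{\sigma\sqrt{2\pi}\,(x-\theta)} \exp\left( -\frac{(\log(x-\theta)-\zeta)^2}{2\sigma^2} \right) & \text{for } x > \theta \\ 0 & \text{for } x \le \theta \end{cases}$$
where $\theta$ is the threshold parameter, $\zeta$ is the scale parameter, and $\sigma$ is the shape parameter ($\sigma > 0$).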
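The point pattern on the plot tends to be linear with intercept $\mu$ and slope $\sigma$ if the data are normally distributed with the specific density function
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \quad \text{for all } x$$
where $\mu$ is the mean and $\sigma$ is the standard deviation ($\sigma > 0$).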
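The point pattern on the plot for C=$c$ tends to be linear with intercept $\theta$ and slope $\sigma$ if the data are Weibull distributed with the specific density function
$$p(x) = \begin{cases} \frac{c}{\sigma} \left( \frac{x-\theta}{\sigma} \right)^{c-1} \exp\left( -\left( \frac{x-\theta}{\sigma} \right)^c \right) & \text{for } x > \theta \\ 0 & \text{for } x \le \theta \end{cases}$$
where $\theta$ is the threshold parameter, $\sigma$ is the scale parameter ($\sigma > 0$), and $c$ is the shape parameter ($c > 0$).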
Unlike the three-parameter Weibull quantile, the preceding expression is free of distribution parameters. This is why the C= shape parameter is not required in the WEIBULL2 option.
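The point pattern on the plot for THETA=$\theta_0$ tends to be linear with intercept $\log(\sigma)$ and slope $1/c$ if the data are Weibull distributed with the specific density function
$$p(x) = \begin{cases} \frac{c}{\sigma} \left( \frac{x-\theta_0}{\sigma} \right)^{c-1} \exp\left( -\left( \frac{x-\theta_0}{\sigma} \right)^c \right) & \text{for } x > \theta_0 \\ 0 & \text{for } x \le \theta_0 \end{cases}$$
where $\theta_0$ is the known lower threshold, $\sigma$ is the scale parameter ($\sigma > 0$), and $c$ is the shape parameter ($c > 0$).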
Distribution Keyword | Required Shape Parameter Option | Range |
---|---|---|
BETA | ALPHA=$\alpha$, BETA=$\beta$ | $\alpha > 0$, $\beta > 0$ |
EXPONENTIAL | None | |
GAMMA | ALPHA=$\alpha$ | $\alpha > 0$ |
LOGNORMAL | SIGMA=$\sigma$ | $\sigma > 0$ |
NORMAL | None | |
WEIBULL | C=$c$ | $c > 0$ |
WEIBULL2 | None | |
Note: For Q-Q plots that are requested with the WEIBULL2 option, you can estimate the shape parameter $c$ from a linear pattern by using the fact that the slope of the pattern is $1/c$.
Note: Close visual agreement may not necessarily mean that the distribution
is a good fit based on the criteria that are used by formal goodness-of-fit
tests.
When the point pattern on a Q-Q plot is linear, its intercept and slope provide estimates of the location and scale parameters. (An exception to this rule is the two-parameter Weibull distribution, for which the intercept and slope are related to the scale and shape parameters.) When you use the QQPLOT statement to specify or estimate the slope and intercept of the line, a diagonal distribution reference line appears on the Q-Q plot. This line allows you to check the linearity of the point pattern.
The following table shows which parameters to specify to determine the intercept and slope of the line:
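Distribution | Location Parameter | Scale Parameter | Shape Parameter | Intercept of Linear Pattern | Slope of Linear Pattern |
---|---|---|---|---|---|
BETA | $\theta$ | $\sigma$ | $\alpha$, $\beta$ | $\theta$ | $\sigma$ |
EXPONENTIAL | $\theta$ | $\sigma$ | | $\theta$ | $\sigma$ |
GAMMA | $\theta$ | $\sigma$ | $\alpha$ | $\theta$ | $\sigma$ |
LOGNORMAL | $\theta$ | $\zeta$ | $\sigma$ | $\theta$ | $\exp(\zeta)$ |
NORMAL | $\mu$ | $\sigma$ | | $\mu$ | $\sigma$ |
WEIBULL (3-parameter) | $\theta$ | $\sigma$ | $c$ | $\theta$ | $\sigma$ |
WEIBULL2 (2-parameter) | $\theta_0$ (known) | $\sigma$ | $c$ | $\log(\sigma)$ | $1/c$ |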
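As a minimal sketch of adding a distribution reference line, the statements below request a normal Q-Q plot whose line uses the estimated location and scale parameters, and a two-parameter Weibull Q-Q plot without a reference line. The data set SCORES is hypothetical, and the QQPLOT statement assumes that high-resolution graphics are available.

```sas
data scores;
   input x @@;
   datalines;
12.1 11.4 13.0 12.7 11.9 12.3 12.8 11.6 12.5 12.2
;

proc univariate data=scores noprint;
   /* Reference line with intercept mu and slope sigma, */
   /* both estimated from the data                      */
   qqplot x / normal(mu=est sigma=est);
   /* Two-parameter Weibull plot: the slope of a linear */
   /* point pattern estimates 1/c                       */
   qqplot x / weibull2;
run;
```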
Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.