Robust Estimators
The CAPABILITY procedure provides several methods for
computing robust estimates of location and scale,
which are insensitive to outliers in the data.
The $k$-times Winsorized mean is a robust estimator of location that is computed as
\[ \bar{x}_{wk} = \frac{1}{n} \left( (k+1)\, x_{(k+1)} + \sum_{i=k+2}^{n-k-1} x_{(i)} + (k+1)\, x_{(n-k)} \right) \]
where $n$ is the number of observations and $x_{(i)}$ is the $i$th order statistic when the observations are arranged in increasing order:
\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} \]
The Winsorized mean is the mean computed after replacing the $k$ smallest observations with the $(k+1)$st smallest observation and the $k$ largest observations with the $(k+1)$st largest observation.
For data from a symmetric distribution, the
Winsorized mean is an unbiased estimate of the population
mean. However, the Winsorized mean does not have a
normal distribution even if the data are normally
distributed.
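The replacement rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure's implementation; the function name `winsorized_mean` and the sample data are assumptions made for the example.

```python
def winsorized_mean(data, k):
    """k-times Winsorized mean (illustrative sketch).

    Replaces the k smallest values with the (k+1)st smallest and the
    k largest values with the (k+1)st largest, then averages all n values.
    """
    x = sorted(data)
    n = len(x)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    low, high = x[k], x[n - k - 1]   # x_(k+1) and x_(n-k) in 1-based notation
    winsorized = [low] * k + x[k:n - k] + [high] * k
    return sum(winsorized) / n


# Hypothetical sample with one gross outlier (98).
sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]
print(winsorized_mean(sample, 0))   # ordinary mean, pulled upward by the outlier
print(winsorized_mean(sample, 1))   # 1-times Winsorized mean
```

With `k = 0` the function reduces to the ordinary sample mean, which makes the effect of Winsorizing easy to compare on the same data.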
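The Winsorized sum of squared deviations about the Winsorized mean $\bar{x}_{wk}$ is defined as
\[ s_{wk}^2 = (k+1)\,(x_{(k+1)} - \bar{x}_{wk})^2 + \sum_{i=k+2}^{n-k-1} (x_{(i)} - \bar{x}_{wk})^2 + (k+1)\,(x_{(n-k)} - \bar{x}_{wk})^2 \]
A Winsorized t test is given by
\[ t_{wk} = \frac{\bar{x}_{wk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{wk})} \]
where $\mu_0$ denotes the location under the null hypothesis and the standard error of the Winsorized mean is
\[ \mathrm{STDERR}(\bar{x}_{wk}) = \frac{n-1}{n-2k-1} \cdot \frac{s_{wk}}{\sqrt{n(n-1)}} \]
When the data are from a symmetric distribution, the distribution of $t_{wk}$ is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom.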
Refer to
Tukey and McLaughlin (1963)
and
Dixon and Tukey (1968).
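A $100(1-\alpha)$% Winsorized confidence interval for the mean has upper and lower limits
\[ \bar{x}_{wk} \pm t_{1-\alpha/2}\, \mathrm{STDERR}(\bar{x}_{wk}) \]
where $\bar{x}_{wk}$ is the Winsorized mean, $\mathrm{STDERR}(\bar{x}_{wk})$ is its standard error, and $t_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the Student's t distribution with $n-2k-1$ degrees of freedom.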
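As a rough illustration of how the Winsorized sum of squared deviations, the standard error, the t statistic, and the confidence limits fit together, consider the following Python sketch. It assumes the standard error takes the form $\frac{n-1}{n-2k-1} \cdot s_{wk}/\sqrt{n(n-1)}$; the function names, the `mu0` and `t_crit` parameters, and the sample data are hypothetical. The t percentile is passed in by the caller (for example, from a table) rather than computed.

```python
import math


def winsorize(data, k):
    """Return the sorted sample with k values replaced in each tail."""
    x = sorted(data)
    n = len(x)
    if k < 0 or n - 2 * k - 1 <= 0:
        raise ValueError("need k >= 0 and n - 2k - 1 > 0")
    return [x[k]] * k + x[k:n - k] + [x[n - k - 1]] * k


def winsorized_t(data, k, mu0=0.0):
    """Return (Winsorized mean, standard error, t statistic, degrees of freedom)."""
    w = winsorize(data, k)
    n = len(w)
    mean_w = sum(w) / n
    # Winsorized sum of squared deviations about the Winsorized mean
    ss_w = sum((v - mean_w) ** 2 for v in w)
    stderr = (n - 1) / (n - 2 * k - 1) * math.sqrt(ss_w / (n * (n - 1)))
    t_stat = (mean_w - mu0) / stderr
    return mean_w, stderr, t_stat, n - 2 * k - 1


def winsorized_ci(data, k, t_crit):
    """Confidence limits, given the t percentile t_crit for n-2k-1 df."""
    mean_w, stderr, _, _ = winsorized_t(data, k)
    return mean_w - t_crit * stderr, mean_w + t_crit * stderr


sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]        # hypothetical data
mean_w, stderr, t_stat, df = winsorized_t(sample, k=1, mu0=5.0)
# 2.365 is the approximate 97.5th percentile of Student's t with 7 df.
lower, upper = winsorized_ci(sample, k=1, t_crit=2.365)
print(mean_w, stderr, t_stat, df, lower, upper)
```

Passing the percentile as an argument keeps the sketch free of third-party dependencies; a statistics library could supply it instead.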
The $k$-times trimmed mean is a robust estimator of location that is computed as
\[ \bar{x}_{tk} = \frac{1}{n-2k} \sum_{i=k+1}^{n-k} x_{(i)} \]
where $n$ is the number of observations and $x_{(i)}$ is the $i$th order statistic when the observations are arranged in increasing order:
\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} \]
The trimmed mean is the mean computed after the $k$ smallest observations and the $k$ largest observations in the sample are deleted.
For data from a symmetric distribution, the
trimmed mean is an unbiased estimate of the population
mean. However, the trimmed mean does not have a
normal distribution even if the data are normally
distributed.
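A minimal Python sketch of the trimming rule follows. The function name `trimmed_mean` and the sample data are assumptions made for illustration.

```python
def trimmed_mean(data, k):
    """k-times trimmed mean (illustrative sketch).

    Deletes the k smallest and the k largest observations and
    averages the remaining n - 2k values.
    """
    x = sorted(data)
    n = len(x)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n/2")
    kept = x[k:n - k]                 # x_(k+1), ..., x_(n-k)
    return sum(kept) / (n - 2 * k)


sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]   # hypothetical data
print(trimmed_mean(sample, 0))   # ordinary mean
print(trimmed_mean(sample, 1))   # mean after dropping 2 and 98
```

Unlike Winsorizing, trimming discards the extreme observations outright, so the divisor is the number of retained values rather than the full sample size.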
A robust estimate of the variance of the trimmed mean $\bar{x}_{tk}$ can be obtained from the Winsorized sum of squared deviations; refer to Tukey and McLaughlin (1963). The corresponding trimmed t test is given by
\[ t_{tk} = \frac{\bar{x}_{tk} - \mu_0}{\mathrm{STDERR}(\bar{x}_{tk})} \]
where $\mu_0$ denotes the location under the null hypothesis, the standard error of the trimmed mean is
\[ \mathrm{STDERR}(\bar{x}_{tk}) = \frac{s_{wk}}{\sqrt{(n-2k)(n-2k-1)}} \]
and $s_{wk}$ is the square root of the Winsorized sum of squared deviations.
When the data are from a symmetric distribution, the distribution of $t_{tk}$ is approximated by a Student's t distribution with $n-2k-1$ degrees of freedom.
Refer to
Tukey and McLaughlin (1963)
and
Dixon and Tukey (1968).
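A $100(1-\alpha)$% trimmed confidence interval for the mean has upper and lower limits
\[ \bar{x}_{tk} \pm t_{1-\alpha/2}\, \mathrm{STDERR}(\bar{x}_{tk}) \]
where $\bar{x}_{tk}$ is the trimmed mean, $\mathrm{STDERR}(\bar{x}_{tk})$ is its standard error, and $t_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the Student's t distribution with $n-2k-1$ degrees of freedom.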
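The trimmed t test combines the trimmed mean with a spread estimate taken from the Winsorized sample. The Python sketch below illustrates this under the assumption that the standard error is $s_{wk}/\sqrt{(n-2k)(n-2k-1)}$; the function name, the `mu0` and `t_crit` parameters, and the data are hypothetical, and the t percentile is supplied by the caller.

```python
import math


def trimmed_t(data, k, mu0=0.0):
    """Return (trimmed mean, standard error, t statistic, degrees of freedom)."""
    x = sorted(data)
    n = len(x)
    df = n - 2 * k - 1
    if k < 0 or df <= 0:
        raise ValueError("need k >= 0 and n - 2k - 1 > 0")
    kept = x[k:n - k]
    mean_t = sum(kept) / (n - 2 * k)
    # The spread comes from the Winsorized sample, not the trimmed one.
    w = [x[k]] * k + kept + [x[n - k - 1]] * k
    mean_w = sum(w) / n
    ss_w = sum((v - mean_w) ** 2 for v in w)   # Winsorized sum of squared deviations
    stderr = math.sqrt(ss_w) / math.sqrt((n - 2 * k) * df)
    return mean_t, stderr, (mean_t - mu0) / stderr, df


sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]     # hypothetical data
mean_t, stderr, t_stat, df = trimmed_t(sample, k=1, mu0=5.0)
t_crit = 2.365   # approximate 97.5th percentile of Student's t with 7 df
lower, upper = mean_t - t_crit * stderr, mean_t + t_crit * stderr
print(mean_t, stderr, t_stat, df, lower, upper)
```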
The sample standard deviation, which
is the most commonly used estimator of
scale, is sensitive to outliers.
Robust scale estimators, on the other hand,
remain bounded when a single data value
is replaced by an arbitrarily large or small value.
The CAPABILITY procedure computes several robust measures of scale, including the interquartile range, Gini's mean difference $G$, the median absolute deviation about the median (MAD), $Q_n$, and $S_n$.
In addition, the procedure computes estimates of the normal standard deviation $\sigma$ derived from each of these measures.
The interquartile range (IQR) is simply the difference
between the upper and lower quartiles.
For a normal population, the standard deviation $\sigma$ can be estimated as IQR/1.34898.
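Gini's mean difference is computed as
\[ G = \frac{1}{\binom{n}{2}} \sum_{i<j} |x_i - x_j| \]
For a normal population, the expected value of $G$ is $2\sigma/\sqrt{\pi}$. Thus $G\sqrt{\pi}/2$ is a robust estimator of $\sigma$ when the data are from a normal sample.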
For the normal distribution,
this estimator has high efficiency
relative to the usual sample standard deviation,
and it is also less sensitive to the presence of outliers.
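Gini's mean difference is the average absolute difference over all distinct pairs of observations, which the following Python sketch computes directly. The function names and data are assumptions for illustration, and the rescaling by $\sqrt{\pi}/2$ applies only under the normal-population assumption.

```python
import math
from itertools import combinations


def gini_mean_difference(data):
    """Average of |x_i - x_j| over all pairs i < j (illustrative sketch)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    n_pairs = n * (n - 1) // 2
    return sum(abs(a - b) for a, b in combinations(data, 2)) / n_pairs


def gini_sigma(data):
    """Estimate of the normal standard deviation derived from G."""
    return gini_mean_difference(data) * math.sqrt(math.pi) / 2


values = [1, 2, 4]                    # hypothetical data
print(gini_mean_difference(values))   # pairs: |1-2|, |1-4|, |2-4|
print(gini_sigma(values))
```

The pairwise loop costs on the order of $n^2$ operations, which is fine for a sketch but not how a production routine would necessarily compute it.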
A very robust scale estimator is the MAD, the median absolute deviation from the median (Hampel, 1974), which is computed as
\[ \mathrm{MAD} = \mathrm{med}_i \left( \left| x_i - \mathrm{med}_j (x_j) \right| \right) \]
where the inner median, $\mathrm{med}_j(x_j)$, is the median of the $n$ observations, and the outer median (taken over $i$) is the median of the $n$ absolute values of the deviations about the inner median.
For a normal population, $1.4826 \cdot \mathrm{MAD}$ is an estimator of $\sigma$.
The MAD has low efficiency for normal distributions, and because it weights deviations above and below the median equally, it may not always be appropriate for asymmetric distributions.
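The two nested medians can be written out as a short Python sketch. The function name `mad` and the data are assumptions; `statistics.median` averages the two middle values when the count is even, which may differ from the convention a particular implementation uses.

```python
from statistics import median


def mad(data):
    """Median absolute deviation about the median (illustrative sketch)."""
    center = median(data)                              # inner median
    return median([abs(v - center) for v in data])     # outer median


sample = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]   # hypothetical data with an outlier
mad_value = mad(sample)
sigma_hat = 1.4826 * mad_value               # normal-consistent scale estimate
print(mad_value, sigma_hat)
```

The single outlier (98) contributes only one large absolute deviation, so it has no effect on the outer median here.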
Rousseeuw and Croux (1993) proposed two statistics as alternatives
to the MAD.
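The first is
\[ S_n = 1.1926\, \mathrm{med}_i \left( \mathrm{med}_j \left( |x_i - x_j| \right) \right) \]
where the outer median (taken over $i$) is the median of the $n$ medians of $|x_i - x_j|$, $j = 1, 2, \ldots, n$.
To reduce small-sample bias, $c_{sn} S_n$ is used to estimate $\sigma$, where $c_{sn}$ is a correction factor; refer to Croux and Rousseeuw (1992).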
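A direct, quadratic-time reading of this definition can be sketched in Python as follows. This is an assumption-laden illustration: it uses the ordinary median for both the inner and outer steps, takes 1.1926 as the scaling constant, and omits the small-sample correction factor, whose values are not given here. The function name `sn_naive` is hypothetical.

```python
from statistics import median


def sn_naive(data):
    """Uncorrected S_n computed directly from its definition (O(n^2) sketch)."""
    # For each x_i, take the median of its distances to every x_j, j = 1..n.
    inner = [median([abs(xi - xj) for xj in data]) for xi in data]
    # Then take the median of those n medians and apply the constant.
    return 1.1926 * median(inner)


values = [1, 2, 3, 4, 5]   # hypothetical data
print(sn_naive(values))
```

Croux and Rousseeuw (1992) describe faster algorithms; the sketch favors clarity over speed.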
The second statistic is
\[ Q_n = 2.219\, \{ |x_i - x_j| ;\ i < j \}_{(k)} \]
where
\[ k = \binom{h}{2} \]
and $h = [n/2] + 1$.
In other words, $Q_n$ is 2.219 times the $k$th order statistic of the $\binom{n}{2}$ distances between the data points.
The bias-corrected statistic $c_{qn} Q_n$ is used to estimate $\sigma$, where $c_{qn}$ is a correction factor; refer to Croux and Rousseeuw (1992).
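The order-statistic description of $Q_n$ translates into a brief Python sketch. It assumes $k = \binom{h}{2}$ with $h = [n/2] + 1$, sorts all pairwise distances explicitly, and leaves out the correction factor, whose values are not given here. The function name `qn_naive` is hypothetical.

```python
from itertools import combinations


def qn_naive(data):
    """Uncorrected Q_n computed directly from its definition (O(n^2) sketch)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1
    k = h * (h - 1) // 2                  # k = C(h, 2)
    distances = sorted(abs(a - b) for a, b in combinations(data, 2))
    return 2.219 * distances[k - 1]       # kth order statistic (1-based)


values = [1, 2, 3, 4, 5]   # hypothetical data
print(qn_naive(values))
```

Sorting all $\binom{n}{2}$ distances is memory-hungry for large samples; the cited reference discusses more efficient algorithms.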
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.