The UNIVARIATE Procedure
Rounding
When ROUND=1 and the analysis variable values are between -2.5 and 2.5, the intervals are as follows:
i | Interval | Midpoint | Left endpoint rounds to | Right endpoint rounds to |
---|---|---|---|---|
-2 | [-2.5,-1.5] | -2 | -2 | -2 |
-1 | [-1.5,-0.5] | -1 | -2 | 0 |
0 | [-0.5,0.5] | 0 | 0 | 0 |
1 | [0.5,1.5] | 1 | 0 | 2 |
2 | [1.5,2.5] | 2 | 2 | 2 |
When ROUND=0.5 and the analysis variable values are between -1.25 and 1.25, the intervals are as follows:
i | Interval | Midpoint | Left endpoint rounds to | Right endpoint rounds to |
---|---|---|---|---|
-2 | [-1.25,-0.75] | -1.0 | -1 | -1 |
-1 | [-0.75,-0.25] | -0.5 | -1 | 0 |
0 | [-0.25,0.25] | 0.0 | 0 | 0 |
1 | [0.25,0.75] | 0.5 | 0 | 1 |
2 | [0.75,1.25] | 1.0 | 1 | 1 |
As the rounding unit increases, the interval width also increases. This reduces the number of unique values and decreases the amount of memory that PROC UNIVARIATE needs.
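The endpoint columns in the tables above are consistent with rounding to the nearest multiple of the rounding unit, with exact ties going to the even multiple. The following Python sketch (not SAS code) reproduces that behavior under this assumption; the function name `round_to_unit` is hypothetical. It relies on Python's built-in `round()`, which resolves exact halves to the even integer, and it is only exact for values that binary floating point can represent exactly, as the table endpoints are.

```python
def round_to_unit(value, unit):
    """Round value to the nearest multiple of unit; exact ties go to the even multiple."""
    # round() with one argument uses round-half-to-even on exact .5 ties.
    return round(value / unit) * unit


# Interval endpoints from the table for a rounding unit of 1
for x in (-2.5, -1.5, -0.5, 0.5, 1.5, 2.5):
    print(x, "->", round_to_unit(x, 1))

# Interval endpoints from the table for a rounding unit of 0.5
for x in (-1.25, -0.75, -0.25, 0.25, 0.75, 1.25):
    print(x, "->", round_to_unit(x, 0.5))
```

Every value inside one interval maps to the same midpoint, which is why a larger rounding unit leaves fewer unique values to store.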
Generating Line Printer Plots
To change the number of stems that the plot displays, use PLOTSIZE= to increase or decrease the number of rows. Instructions that appear below the plot explain how to determine the values of the variable. If no instructions appear, you multiply Stem.Leaf by 1 to determine the values of the variable. For example, if the stem value is 10 and the leaf value is 1, then the variable value is approximately 10.1.
For the stem-and-leaf plot, the procedure rounds a variable value to
the nearest leaf. If the variable value is exactly halfway between two leaves,
the value rounds to the nearest leaf with an even integer value. For example,
a variable value of 3.15 has a stem value of 3 and a leaf value of 2.
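The leaf rounding rule can be sketched in Python (not SAS code). This is a minimal illustration that assumes nonnegative values, a Stem.Leaf multiplier of 1, and a single leaf digit worth 0.1; the function name `stem_and_leaf` is hypothetical. Values are passed as decimal strings so that 3.15 is an exact tie rather than a binary floating-point approximation.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def stem_and_leaf(value_text):
    """Split a nonnegative decimal string into (stem, leaf) with one leaf digit."""
    # Round to the nearest tenth; an exact tie goes to the even leaf digit.
    rounded = Decimal(value_text).quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)
    stem = int(rounded)                # integer part of the rounded value
    leaf = int((rounded - stem) * 10)  # single digit after the decimal point
    return stem, leaf


print(stem_and_leaf("3.15"))  # (3, 2)
print(stem_and_leaf("10.1"))  # (10, 1)
```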
To generate a box plot using high-resolution graphics, use the BOXPLOT
procedure in SAS/STAT software.
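In the normal probability plot, the vertical coordinate is the data value and the horizontal coordinate is $\Phi^{-1}(v_i)$, where
$$v_i = \frac{r_i - \frac{3}{8}}{n + \frac{1}{4}}$$
and where
$\Phi^{-1}$ | is the inverse of the standard normal distribution function. |
$r_i$ | is the rank of the $i$th data value when ordered from smallest to largest. |
$n$ | is the number of nonmissing data values. |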
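For a weighted normal probability plot, the $i$th ordered observation is plotted against $\Phi^{-1}(v_i)$, where
$$v_i = \frac{\left(1 - \frac{3}{8i}\right) \sum_{j=1}^{i} w_{(j)}}{\left(1 + \frac{1}{4n}\right) W}$$
where $w_{(j)}$ is the weight that is associated with $y_{(j)}$ for the $j$th ordered observation and $W = \sum_{i=1}^{n} w_i$ is the sum of the individual weights.
When each observation has an identical weight, $w_j = w$, the formula for $v_i$ reduces to the expression for $v_i$ in the unweighted normal probability plot.
When the value of VARDEF= is WDF or WEIGHT, PROC UNIVARIATE draws a reference line with intercept $\hat{\mu}$ and slope $\hat{\sigma}$. When the value of VARDEF= is DF or N, the slope is $\hat{\sigma} / \sqrt{\bar{w}}$, where $\bar{w} = W / n$ is the average weight.
When each observation has an identical weight and the value of VARDEF= is DF, N, or WEIGHT, the reference line reduces to the usual reference line with intercept $\hat{\mu}$ and slope $\hat{\sigma}$ in the unweighted normal probability plot.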
If the data are normally distributed with mean $\mu$, standard deviation $\sigma$, and each observation has an identical weight $w$, then, as in the unweighted normal probability plot, the points on the plot should lie approximately on a straight line. The intercept is $\mu$, and the slope is $\sigma$ when VARDEF= is WDF or WEIGHT and $\sigma / \sqrt{w}$ when VARDEF= is DF or N.
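The horizontal coordinates of the normal probability plot can be sketched in Python (not SAS code). The sketch assumes the unweighted plotting position $v_i = (i - 3/8)/(n + 1/4)$ with no tied ranks, and it assumes a weighted form $v_i = (1 - 3/(8i)) \sum_{j \le i} w_{(j)} / ((1 + 1/(4n)) W)$, chosen because it reduces to the unweighted expression when all weights are equal. The function names are hypothetical, and the weights must be supplied in the order of the sorted data values.

```python
from statistics import NormalDist


def unweighted_positions(n):
    """v_i = (i - 3/8) / (n + 1/4) for ranks i = 1..n (no ties assumed)."""
    return [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]


def weighted_positions(ordered_weights):
    """Assumed weighted form based on cumulative weights of the ordered observations."""
    n = len(ordered_weights)
    total = sum(ordered_weights)  # W, the sum of the individual weights
    positions, running = [], 0.0
    for i, w in enumerate(ordered_weights, start=1):
        running += w  # cumulative weight through the i-th ordered observation
        positions.append((1 - 3 / (8 * i)) * running / ((1 + 1 / (4 * n)) * total))
    return positions


def normal_scores(positions):
    """Apply the inverse standard normal distribution function to each position."""
    return [NormalDist().inv_cdf(v) for v in positions]


# With identical weights, both sets of positions agree up to rounding error.
print(unweighted_positions(5))
print(weighted_positions([2.0] * 5))
print(normal_scores(unweighted_positions(5)))
```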
For more information on how to interpret these plots, see SAS System for Elementary Statistical Analysis and SAS System for Statistical Graphics.
Generating High-Resolution Graphics
The HISTOGRAM statement generates histograms and comparative histograms that allow you to examine the data distribution. You can optionally fit families of density curves and superimpose kernel density estimates on the histograms. For additional information about the fitted distributions and kernel density estimates, see Formulas for Fitted Continuous Distributions.
The PROBPLOT statement generates a probability plot, which compares
ordered values of a variable with percentiles of a specified theoretical distribution.
The QQPLOT statement generates a quantile-quantile plot, which compares ordered
values of a variable with quantiles of a specified theoretical distribution.
Thus, you can use these plots to determine how well a theoretical distribution
models a set of measures.
Construction of a Q-Q Plot
First, the $n$ nonmissing values of the variable are ordered from smallest to largest: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then, the $i$th ordered value $x_{(i)}$ is represented on the plot by a point whose $y$-coordinate is $x_{(i)}$ and whose $x$-coordinate is $F^{-1}\left(\frac{i - 0.375}{n + 0.25}\right)$, where $F(\cdot)$ is the theoretical distribution with a zero location parameter and a unit scale parameter. For additional information about the theoretical distributions that you can request, see Theoretical Distributions for Quantile-Quantile and Probability Plots.
You can modify the adjustment constants -0.375 and 0.25 with
the RANKADJ=
and NADJ= options. The default combination is recommended by Blom (1958).
For additional information, see Chambers et al. (1983). Since
$x_{(i)}$ is a quantile of the empirical cumulative distribution
function (ecdf), a Q-Q plot compares quantiles of the ecdf with quantiles
of a theoretical distribution. Probability plots are constructed the same
way, except that the $x$-axis is scaled nonlinearly in percentiles.
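As a rough illustration of the construction described above, the following Python sketch (not SAS code) computes Q-Q plot coordinates for a standard normal theoretical distribution. The function name `qq_points` and its parameters `rankadj` and `nadj` are hypothetical, chosen to mirror the RANKADJ= and NADJ= options; the plotting position $(i + \text{rankadj}) / (n + \text{nadj})$ is assumed from the adjustment constants -0.375 and 0.25. Missing values are represented by `None` in this sketch.

```python
from statistics import NormalDist


def qq_points(values, rankadj=-0.375, nadj=0.25, inv_cdf=NormalDist().inv_cdf):
    """Return (theoretical quantile, ordered value) pairs for a Q-Q plot."""
    # Drop missing values, then order from smallest to largest.
    ordered = sorted(v for v in values if v is not None)
    n = len(ordered)
    # x-coordinate: theoretical quantile; y-coordinate: the ordered data value.
    return [(inv_cdf((i + rankadj) / (n + nadj)), x)
            for i, x in enumerate(ordered, start=1)]


for q, x in qq_points([4.1, 5.3, None, 3.8, 6.0, 4.9]):
    print(f"{q:8.4f}  {x}")
```

Passing a different `inv_cdf` callable would swap in another theoretical distribution with zero location and unit scale.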
Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters because the $x$-axis of a Q-Q plot is scaled linearly. On the other hand, probability plots are more convenient for estimating percentiles or probabilities. There are many reasons why the point pattern in a Q-Q plot may not be linear. Chambers et al. (1983) and Fowlkes (1987) discuss the interpretations of commonly encountered departures from linearity, and these are summarized in the following table.
Description of Point Pattern | Possible Interpretation |
---|---|
All but a few points fall on a line | Outliers in the data |
Left end of pattern is below the line; right end of pattern is above the line | Long tails at both ends of the data distribution |
Left end of pattern is above the line; right end of pattern is below the line | Short tails at both ends of the distribution |
Curved pattern with slope increasing from left to right | Data distribution is skewed to the right |
Curved pattern with slope decreasing from left to right | Data distribution is skewed to the left |
Staircase pattern (plateaus and gaps) | Data have been rounded or are discrete |
In some applications, a nonlinear pattern may be more revealing than a linear pattern. However, as noted by Chambers et al. (1983), departures from linearity can also be due to chance variation.
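The earlier remark that a Q-Q plot supports graphical estimation of location and scale can be made concrete: when the points are roughly linear, the intercept of a line through them estimates the location parameter and the slope estimates the scale parameter. The Python sketch below (not SAS code) fits an ordinary least-squares line through normal Q-Q points. This is one illustrative way to read the two values off the plot, not a description of how PROC UNIVARIATE draws its reference line, and the function name `fit_location_scale` is hypothetical.

```python
from statistics import NormalDist, fmean


def fit_location_scale(values):
    """Least-squares line through normal Q-Q points: returns (intercept, slope)."""
    ordered = sorted(values)
    n = len(ordered)
    # Theoretical standard normal quantiles at the default plotting positions.
    q = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    q_bar, x_bar = fmean(q), fmean(ordered)
    sxy = sum((a - q_bar) * (b - x_bar) for a, b in zip(q, ordered))
    sxx = sum((a - q_bar) ** 2 for a in q)
    slope = sxy / sxx                   # estimate of the scale parameter
    intercept = x_bar - slope * q_bar   # estimate of the location parameter
    return intercept, slope


location, scale = fit_location_scale([9.1, 10.4, 7.8, 12.2, 10.0, 11.3, 8.6])
print(f"location ~ {location:.3f}, scale ~ {scale:.3f}")
```

A few outlying points can pull a least-squares line noticeably, so the fitted values are best read alongside the point pattern itself.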
Determining Computer Resources
The only factor that limits the number of variables that you can analyze is the computer resources that are available. The amount of temporary storage and CPU time that PROC UNIVARIATE requires depends on the statements and the options that you specify. To calculate the computer resources the procedure needs, let
$N$ | be the number of observations in the data set |
$V$ | be the number of variables in the VAR statement |
$U_i$ | be the number of unique values for the $i$th variable. |
If the required number of bytes is not available, PROC UNIVARIATE must process the data multiple times to compute all the statistics, which reduces the minimum memory requirement.
ROUND= reduces the number of unique values for each variable, thereby reducing memory requirements. ROBUSTSCALE requires additional temporary storage.
Several factors affect the CPU time requirement, and each has a different constant of proportionality. For additional information on how to optimize CPU performance and memory usage, see the SAS documentation for your operating environment.
Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.