The COMPARE Procedure

Results

PROC COMPARE reports the results of its comparisons in the following ways:


SAS Log
When you use the WARNING, PRINTALL, or ERROR option, PROC COMPARE writes a description of the differences to the SAS log.


Macro Return Codes (SYSINFO)
PROC COMPARE stores a return code in the automatic macro variable SYSINFO. The value of the return code provides information about the result of the comparison. By checking the value of SYSINFO after PROC COMPARE has run and before any other step begins, SAS macros can use the results of a PROC COMPARE step to determine what action to take or what parts of a SAS program to execute.

The Macro Return Codes table is a key for interpreting the SYSINFO return code from PROC COMPARE. For each condition listed, the associated value is added to the return code if the condition is true. Thus, the SYSINFO return code is the sum of the codes listed in the table for the applicable conditions:

Macro Return Codes
Bit  Condition   Code  Hex    Description
  1  DSLABEL        1  0001X  Data set labels differ
  2  DSTYPE         2  0002X  Data set types differ
  3  INFORMAT       4  0004X  Variable has different informat
  4  FORMAT         8  0008X  Variable has different format
  5  LENGTH        16  0010X  Variable has different length
  6  LABEL         32  0020X  Variable has different label
  7  BASEOBS       64  0040X  Base data set has observation not in comparison
  8  COMPOBS      128  0080X  Comparison data set has observation not in base
  9  BASEBY       256  0100X  Base data set has BY group not in comparison
 10  COMPBY       512  0200X  Comparison data set has BY group not in base
 11  BASEVAR     1024  0400X  Base data set has variable not in comparison
 12  COMPVAR     2048  0800X  Comparison data set has variable not in base
 13  VALUE       4096  1000X  A value comparison was unequal
 14  TYPE        8192  2000X  Conflicting variable types
 15  BYVAR      16384  4000X  BY variables do not match
 16  ERROR      32768  8000X  Fatal error: comparison not done

These codes are ordered and scaled to allow a simple check of the degree to which the data sets differ. For example, if you want to check that two data sets contain the same variables, observations, and values, but you do not care about differences in labels, formats, and so forth, use the following statements:

proc compare base=SAS-data-set
             compare=SAS-data-set;
run;

%if &sysinfo >= 64 %then
   %do;
      %put ERROR: The data sets differ (SYSINFO=&sysinfo).;
   %end;
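The threshold of 64 works because the six attribute codes (1 through 32) sum to at most 63, so a value of 64 or more means that at least one observation, BY group, variable, value, or error condition is set. The following is a minimal sketch that wraps the check in a macro, which also keeps the %IF logic usable in releases that do not support %IF in open code. The macro name check_compare and its parameters are hypothetical and are not part of SAS.

```sas
/* Hypothetical helper: run PROC COMPARE and classify the result. */
%macro check_compare(base=, compare=);
   proc compare base=&base compare=&compare noprint;
   run;

   /* Save SYSINFO at once; the next step would reset it. */
   %local rc;
   %let rc = &sysinfo;

   %if &rc >= 64 %then
      %put WARNING: Observations or variables or values differ (SYSINFO=&rc).;
   %else %if &rc > 0 %then
      %put NOTE: Only data set or variable attributes differ (SYSINFO=&rc).;
   %else
      %put NOTE: No differences were found.;
%mend check_compare;

%check_compare(base=proclib.one, compare=proclib.two)
```

Copying SYSINFO into a local macro variable first means that the later tests all see the same value, even if more steps are added to the macro.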

You can examine individual bits in the SYSINFO value by using DATA step bit-testing features to check for specific conditions. For example, to check for the presence of observations in the base data set that are not in the comparison data set, use the following statements:

proc compare base=SAS-data-set
             compare=SAS-data-set;
run;

%let rc=&sysinfo;
data _null_;
   if &rc='1......'b then
      put 'Observations in Base but not in Comparison Data Set';
run;

PROC COMPARE must run before you check SYSINFO, and you must obtain the SYSINFO value before another SAS step starts, because every SAS step resets SYSINFO.
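As an illustration of testing every bit rather than just one, the sketch below saves SYSINFO and then reports each condition that is set. It uses the BAND (bitwise AND) function instead of a bit mask. The macro variable rc, the array cond, and the variables name and bit are names chosen for this example only.

```sas
proc compare base=proclib.one compare=proclib.two noprint;
run;
%let rc = &sysinfo;   /* capture before the next step resets it */

data _null_;
   /* Condition names in bit order, as listed in Macro Return Codes. */
   array cond[16] $ 8 _temporary_
      ('DSLABEL' 'DSTYPE'  'INFORMAT' 'FORMAT'
       'LENGTH'  'LABEL'   'BASEOBS'  'COMPOBS'
       'BASEBY'  'COMPBY'  'BASEVAR'  'COMPVAR'
       'VALUE'   'TYPE'    'BYVAR'    'ERROR');
   length name $ 8;
   rc = &rc;
   do bit = 1 to 16;
      /* BAND is nonzero when this bit is set in rc. */
      if band(rc, 2**(bit - 1)) then do;
         name = cond[bit];
         put 'Condition set: ' name;
      end;
   end;
run;
```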


Procedure Output
The following sections show and describe the default output for the two data sets shown in Overview. Because PROC COMPARE produces lengthy output, the output is presented in seven pieces.


Data Set Summary

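This report lists the attributes of the data sets being compared. These attributes include the data set names, the dates created and last modified, the number of variables and the number of observations in each data set, and the data set labels, if any.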

Partial Output shows the Data Set Summary.

Partial Output
                              COMPARE Procedure
                   Comparison of PROCLIB.ONE with PROCLIB.TWO
                                 (Method=EXACT)

                               Data Set Summary 

 Dataset               Created          Modified  NVar    NObs  Label    
     

 PROCLIB.ONE  11SEP97:15:11:07  11SEP97:15:11:09     5       4  First Data Set 
 PROCLIB.TWO  11SEP97:15:11:10  11SEP97:15:11:10     6       5  Second Data Set


Variables Summary

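This report compares the variables in the two data sets. The first part of the report lists the number of variables the data sets have in common, the number of variables that are in one data set but not in the other, the number of common variables with conflicting types, and the number of common variables with differing attributes.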

The second part of the report lists matching variables with different attributes and shows how the attributes differ. (The COMPARE procedure omits variable labels if the line size is too small for them.)

Partial Output shows the Variables Summary.

Partial Output
                             Variables Summary

               Number of Variables in Common: 5.
               Number of Variables in PROCLIB.TWO but not in PROCLIB.ONE: 1.
               Number of Variables with Conflicting Types: 1.
               Number of Variables with Differing Attributes: 3.


               Listing of Common Variables with Conflicting Types

                      Variable  Dataset      Type  Length

                      student   PROCLIB.ONE  Num        8
                                PROCLIB.TWO  Char       8


             Listing of Common Variables with Differing Attributes

           Variable  Dataset      Type  Length  Format  Label        

           year      PROCLIB.ONE  Char       8          Year of Birth
                     PROCLIB.TWO  Char       8                       
           state     PROCLIB.ONE  Char       8                       
                     PROCLIB.TWO  Char       8          Home State   
           gr1       PROCLIB.ONE  Num        8  4.1                  
                     PROCLIB.TWO  Num        8  5.2                  


Observation Summary

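This report provides information about observations in the base and comparison data sets. First, the report identifies the first and last observation in each data set, the first and last matching observations, and the first and last differing observations. Then, the report lists the number of observations the data sets have in common, the number of observations that are in one data set but not in the other, the total number of observations read from each data set, the number of observations with some compared variables unequal, and the number of observations with all compared variables equal.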

Partial Output shows the Observation Summary.

Partial Output
                            Observation Summary

                         Observation      Base  Compare

                         First Obs           1        1
                         First Unequal       1        1
                         Last  Unequal       4        4
                         Last  Match         4        4
                         Last  Obs           .        5

        Number of Observations in Common: 4.
        Number of Observations in PROCLIB.TWO but not in PROCLIB.ONE: 1.
        Total Number of Observations Read from PROCLIB.ONE: 4.
        Total Number of Observations Read from PROCLIB.TWO: 5.

        Number of Observations with Some Compared Variables Unequal: 4.
        Number of Observations with All Compared Variables Equal: 0.


Values Comparison Summary

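This report first lists the number of variables compared with all observations equal, the number of variables compared with some observations unequal, the total number of values judged unequal, and the maximum difference found between matching values.

In addition, for the variables for which some matching observations have unequal values, the report lists the variable name, type, length, and label, the number of unequal values (Ndif), and, for numeric variables, the maximum difference (MaxDif).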

Partial Output shows the Values Comparison Summary.

Partial Output
                         Values Comparison Summary

        Number of Variables Compared with All Observations Equal: 1.
        Number of Variables Compared with Some Observations Unequal: 3.
        Total Number of Values which Compare Unequal: 6.
        Maximum Difference: 20.

                       Variables with Unequal Values

              Variable  Type  Len   Compare Label  Ndif   MaxDif

              state     CHAR    8   Home State        2
              gr1       NUM     8                     2    1.000
              gr2       NUM     8                     2   20.000


Value Comparison Results

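This report consists of a table for each pair of matching variables judged unequal at one or more observations. When comparing character values, PROC COMPARE displays only the first 20 characters. When you use the TRANSPOSE option, it displays only the first 12 characters. Each table shows the observation number, the value of the variable in the base data set, the value of the variable in the comparison data set, and, for numeric variables, the difference and the percent difference between the two values.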

Partial Output shows the Value Comparison Results for Variables.

Partial Output
                     Value Comparison Results for Variables

           __________________________________________________________
                      ||  Home State
                      ||  Base Value           Compare Value
                  Obs ||  state                 state
            ________  ||  ________              ________
                      ||
                   2  ||  MD                    MA
                   4  ||  MA                    MD
           __________________________________________________________


           __________________________________________________________
                      ||       Base    Compare
                  Obs ||        gr1        gr1      Diff.     % Diff
            ________  ||  _________  _________  _________  _________
                      ||
                   1  ||       85.0      84.00    -1.0000    -1.1765
                   3  ||       78.0      79.00     1.0000     1.2821
           __________________________________________________________


           __________________________________________________________
                      ||       Base    Compare
                  Obs ||        gr2        gr2      Diff.     % Diff
            ________  ||  _________  _________  _________  _________
                      ||
                   3  ||    72.0000    73.0000     1.0000     1.3889
                   4  ||    94.0000    74.0000   -20.0000   -21.2766
           __________________________________________________________
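The Diff. and % Diff columns can be checked by hand. The sketch below reproduces the gr2 row for observation 4, assuming that the difference is the comparison value minus the base value and that the percent difference is that difference taken as a percentage of the base value. Both formulas are inferred from the sample values above, which all have positive base values; the variable names are chosen for this example.

```sas
data _null_;
   /* gr2 at observation 4, taken from the sample output */
   base    = 94;
   compare = 74;
   diff    = compare - base;        /* -20 */
   pctdiff = 100 * diff / base;     /* about -21.2766 */
   put diff= pctdiff= 10.4;
run;
```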

You can suppress the value comparison results with the NOVALUES option. If you use both the NOVALUES and TRANSPOSE options, PROC COMPARE lists for each observation the names of the variables with values judged unequal but does not display the values and differences.

Table of Summary Statistics

If you use the STATS, ALLSTATS, or PRINTALL option, the Value Comparison Results for Variables section contains summary statistics for the numeric variables being compared. The STATS option generates these statistics for only the numeric variables whose values are judged unequal. The ALLSTATS and PRINTALL options generate these statistics for all numeric variables, even if all values are judged equal.

Note: In all cases, PROC COMPARE calculates the summary statistics based on all matching observations that do not contain missing values, not just on those containing unequal values.

Partial Output shows the following summary statistics for base data set values, comparison data set values, differences, and percent differences:

N
the number of nonmissing values

MEAN
the mean, or average, of the values

STD
the standard deviation

MAX
the maximum value

MIN
the minimum value

STDERR
the standard error of the mean

T
the T ratio (MEAN/STDERR)

PROB>|T|
the probability of a greater absolute T value if the true population mean is 0.

NDIF
the number of matching observations judged unequal, and the percent of the matching observations that were judged unequal.

DIFMEANS
the difference between the mean of the base values and the mean of the comparison values. This line contains three numbers. The first is the difference in the two means expressed as a percentage of the base values mean. The second is the difference expressed as a percentage of the comparison values mean. The third is the difference in the two means (the comparison mean minus the base mean).

R
the correlation of the base and comparison values for matching observations that are nonmissing in both data sets.

RSQ
the square of the correlation of the base and comparison values for matching observations that are nonmissing in both data sets.

Partial Output is from the ALLSTATS option using the two data sets shown in "Overview":

Partial Output
                     Value Comparison Results for Variables

           __________________________________________________________
                      ||       Base    Compare
                  Obs ||        gr1        gr1      Diff.     % Diff
            ________  ||  _________  _________  _________  _________
                      ||
                   1  ||       85.0      84.00    -1.0000    -1.1765
                   3  ||       78.0      79.00     1.0000     1.2821
            ________  ||  _________  _________  _________  _________
                      ||
               N      ||          4          4          4          4
              Mean    ||    85.5000    85.5000          0     0.0264
              Std     ||     5.8023     5.4467     0.8165     1.0042
              Max     ||    92.0000    92.0000     1.0000     1.2821
              Min     ||    78.0000    79.0000    -1.0000    -1.1765
             StdErr   ||     2.9011     2.7234     0.4082     0.5021
               t      ||    29.4711    31.3951     0.0000     0.0526
            Prob>|t|  ||     <.0001     <.0001     1.0000     0.9614
                      ||
             Ndif     ||          2     50.000%
            DifMeans  ||      0.000%     0.000%         0
             r, rsq   ||      0.991      0.983
           __________________________________________________________

           __________________________________________________________
                      ||       Base    Compare
                  Obs ||        gr2        gr2      Diff.     % Diff
            ________  ||  _________  _________  _________  _________
                      ||
                   3  ||    72.0000    73.0000     1.0000     1.3889
                   4  ||    94.0000    74.0000   -20.0000   -21.2766
            ________  ||  _________  _________  _________  _________
                      ||
               N      ||          4          4          4          4
              Mean    ||    86.2500    81.5000    -4.7500    -4.9719
              Std     ||     9.9457     9.4692    10.1776    10.8895
              Max     ||    94.0000    92.0000     1.0000     1.3889
              Min     ||    72.0000    73.0000   -20.0000   -21.2766
             StdErr   ||     4.9728     4.7346     5.0888     5.4447
               t      ||    17.3442    17.2136    -0.9334    -0.9132
            Prob>|t|  ||     0.0004     0.0004     0.4195     0.4285
                      ||
             Ndif     ||          2     50.000%
            DifMeans  ||     -5.507%    -5.828%   -4.7500
             r, rsq   ||      0.451      0.204
           __________________________________________________________
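The derived statistics in this output can be reproduced from the printed values. The sketch below recomputes the DifMeans line and the T ratio of the differences for gr2. It assumes the usual definition of the standard error as STD divided by the square root of N, which the text above does not spell out, and it starts from the rounded values printed in the report, so results agree only to the displayed precision. All variable names are chosen for this example.

```sas
data _null_;
   /* gr2 statistics copied from the ALLSTATS output */
   n         = 4;
   base_mean = 86.25;
   comp_mean = 81.50;
   dif_std   = 10.1776;                      /* Std of the differences */

   dif_mean    = comp_mean - base_mean;      /* DifMeans, third number  */
   pct_of_base = 100 * dif_mean / base_mean; /* DifMeans, first number  */
   pct_of_comp = 100 * dif_mean / comp_mean; /* DifMeans, second number */

   dif_stderr = dif_std / sqrt(n);           /* StdErr of the differences */
   t          = dif_mean / dif_stderr;       /* T = MEAN/STDERR */

   put dif_mean= pct_of_base= 8.3 pct_of_comp= 8.3;
   put dif_stderr= 8.4 t= 8.4;
run;
```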

Note: If you use a wide line size with PRINTALL, PROC COMPARE prints the value comparison result for character variables next to the result for numeric variables. In that case, PROC COMPARE calculates only NDIF for the character variables.

Comparison Results for Observations (Using the TRANSPOSE Option)

The TRANSPOSE option prints the comparison results by observation instead of by variable. The comparison results precede the observation summary report. By default, the source of the values for each row of the table is indicated by the following label:
 _OBS_1=number-1  _OBS_2=number-2
where number-1 is the number of the observation in the base data set for which the value of the variable is shown, and number-2 is the number of the observation in the comparison data set.

Partial Output shows the differences in PROCLIB.ONE and PROCLIB.TWO by observation instead of by variable.

Partial Output
                      Comparison Results for Observations

        _OBS_1=1 _OBS_2=1:
        Variable    Base Value       Compare         Diff.        % Diff
             gr1          85.0         84.00     -1.000000     -1.176471

        _OBS_1=2 _OBS_2=2:
        Variable    Base Value       Compare
           state            MD            MA

        _OBS_1=3 _OBS_2=3:
        Variable    Base Value       Compare         Diff.        % Diff
             gr1          78.0         79.00      1.000000      1.282051
             gr2     72.000000     73.000000      1.000000      1.388889

        _OBS_1=4 _OBS_2=4:
        Variable    Base Value       Compare         Diff.        % Diff
             gr2     94.000000     74.000000    -20.000000    -21.276596
           state            MA            MD

If you use an ID statement, the identifying label has the following form:

ID-1=ID-value-1 ... ID-n=ID-value-n
where ID is the name of an ID variable and ID-value is the value of the ID variable.

Note: When you use the TRANSPOSE option, PROC COMPARE prints only the first 12 characters of the value.
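As a brief illustration, the following steps request the by-observation report for the two sample data sets, first with values and then with the NOVALUES option added.

```sas
/* Report differences by observation rather than by variable. */
proc compare base=proclib.one compare=proclib.two transpose;
run;

/* With NOVALUES, only the names of the unequal variables are listed. */
proc compare base=proclib.one compare=proclib.two transpose novalues;
run;
```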


Output Data Set (OUT=)
By default, the OUT= data set contains an observation for each pair of matching observations. The OUT= data set contains the BY variables, the ID variables, and the matching variables (or, if you use the VAR statement, the VAR variables) from the data sets you are comparing.

In addition, the data set contains two variables created by PROC COMPARE to identify the source of the values for the matching variables: _TYPE_ and _OBS_.

_TYPE_
is a character variable of length 8. Its value indicates the source of the values for the matching (or VAR) variables in that observation. (For ID and BY variables, which are not compared, the values are the values from the original data sets.) _TYPE_ has the label Type of Observation. The four possible values of this variable are as follows:

BASE
The values in this observation are from an observation in the base data set. PROC COMPARE writes this type of observation to the OUT= data set when you specify the OUTBASE option.

COMPARE
The values in this observation are from an observation in the comparison data set. PROC COMPARE writes this type of observation to the OUT= data set when you specify the OUTCOMP option.

DIF
The values in this observation are the differences between the values in the base and comparison data sets. For character variables, PROC COMPARE uses a period (.) to represent equal characters and an X to represent unequal characters. PROC COMPARE writes this type of observation to the OUT= data set by default. However, if you request any other type of observation with the OUTBASE, OUTCOMP, or OUTPERCENT option, you must specify the OUTDIF option to generate observations of this type in the OUT= data set.

PERCENT
The values in this observation are the percent differences between the values in the base and comparison data sets. For character variables the values in observations of type PERCENT are the same as the values in observations of type DIF.

_OBS_
is a numeric variable containing a number further identifying the source of the OUT= observations.

For observations with _TYPE_ equal to BASE, _OBS_ is the number of the observation in the base data set from which the values of the VAR variables were copied. Similarly, for observations with _TYPE_ equal to COMPARE, _OBS_ is the number of the observation in the comparison data set from which the values of the VAR variables were copied.

For observations with _TYPE_ equal to DIF or PERCENT, _OBS_ is a sequence number that counts the matching observations in the BY group.

_OBS_ has the label Observation Number.

The COMPARE procedure takes variable names and attributes for the OUT= data set from the base data set except for the lengths of ID and VAR variables, for which it uses the longer length regardless of which data set that length is from. This behavior has two important repercussions:

For an example of the OUT= option, see Comparing Values of Observations Using an Output Data Set (OUT=).
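A minimal sketch of the observation types described above: the step below writes BASE, COMPARE, and DIF observations to one output data set and prints it. The data set name WORK.RESULT is an assumption made for this example. OUTDIF is specified explicitly because OUTBASE and OUTCOMP are also requested.

```sas
proc compare base=proclib.one compare=proclib.two
             out=work.result outbase outcomp outdif noprint;
run;

/* _TYPE_ and _OBS_ identify the source of each row. */
proc print data=work.result;
run;
```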


Output Statistics Data Set (OUTSTATS=)
When you use the OUTSTATS= option, PROC COMPARE calculates the same summary statistics as the ALLSTATS option for each pair of numeric variables compared (see Table of Summary Statistics). The OUTSTATS= data set contains an observation for each summary statistic for each pair of variables. The data set also contains the BY variables used in the comparison and several variables created by PROC COMPARE:

_VAR_
is a character variable containing the name of the variable from the base data set for which the statistic in the observation was calculated.

_WITH_
is a character variable containing the name of the variable from the comparison data set for which the statistic in the observation was calculated. The _WITH_ variable is not included in the OUTSTATS= data set unless you use the WITH statement.

_TYPE_
is a character variable containing the name of the statistic contained in the observation. Values of the _TYPE_ variable are N, MEAN, STD, MIN, MAX, STDERR, T, PROBT, NDIF, DIFMEANS, and R, RSQ.

_BASE_
is a numeric variable containing the value of the statistic calculated from the values of the variable named by _VAR_ in the observations in the base data set with matching observations in the comparison data set.

_COMP_
is a numeric variable containing the value of the statistic calculated from the values of the variable named by the _VAR_ variable (or by the _WITH_ variable if you use the WITH statement) in the observations in the comparison data set with matching observations in the base data set.

_DIF_
is a numeric variable containing the value of the statistic calculated from the differences of the values of the variable named by the _VAR_ variable in the base data set and the matching variable (named by the _VAR_ or _WITH_ variable) in the comparison data set.

_PCTDIF_
is a numeric variable containing the value of the statistic calculated from the percent differences of the values of the variable named by the _VAR_ variable in the base data set and the matching variable (named by the _VAR_ or _WITH_ variable) in the comparison data set.

Note:   For both types of output data sets, PROC COMPARE assigns one of the following data set labels:

Comparison of base-SAS-data-set
with comparison-SAS-data-set

Comparison of variables in base-SAS-data-set
Labels are limited to 40 characters.

See Creating an Output Data Set of Statistics (OUTSTATS=) for an example of an OUTSTATS= data set.
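As a short sketch of how the variables described above might be used, the following step writes the statistics to a data set and prints only the NDIF and DIFMEANS rows. The data set name WORK.DIFFSTAT is an assumption made for this example.

```sas
proc compare base=proclib.one compare=proclib.two
             outstats=work.diffstat noprint;
run;

/* One row per statistic per variable pair; keep two statistics. */
proc print data=work.diffstat;
   where _type_ in ('NDIF', 'DIFMEANS');
run;
```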



Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.