Combining SAS Data Sets: The Essentials

What You Need to Know before Combining Information Stored In Multiple SAS Data Sets

Many applications require input data to be in a specific format before the data can be processed to produce meaningful results. The data typically comes from multiple sources and may be in different formats. Therefore, you often, if not always, have to take intermediate steps to logically relate and process data before you can analyze it or create reports from it.

Application requirements vary, but there are common factors for all applications that access, combine, and process data. Once you have determined what you want the output to look like, you must

determine how the input data is related
ensure that the data is properly sorted or indexed, if necessary
select the appropriate access method to process the input data
select the appropriate SAS tools to complete the task.

The Four Ways That Data Can Be Related

Relationships among multiple sources of input data exist when each of the sources contains common data, either at the physical or logical level. For example, employee data and department data could be related through an employee ID variable that shares common values. Another data set could contain numeric sequence numbers whose partial values logically relate it to a separate data set by observation number.

You must be able to identify the existing relationships in your data. This knowledge is crucial for understanding how to process input data in order to produce desired results. All related data fall into one of these four categories, characterized by how observations relate among the data sets:

one-to-one
one-to-many
many-to-one
many-to-many.

To obtain the results you want, you should understand how each of these methods combines observations, how each method treats duplicate values of common variables, and how each method treats missing values or nonmatched values of common variables. Some of the methods also require that you preprocess your data sets by sorting them or by creating indexes. See the description of each method in Combining SAS Data Sets: Methods.

One-to-One

In a one-to-one relationship, typically a single observation in one data set is related to a single observation from another based on the values of one or more selected variables. A one-to-one relationship implies that each value of the selected variable occurs no more than once in each data set. When you work with multiple selected variables, this relationship implies that each combination of values occurs no more than once in each data set.

In the following example, observations in data sets SALARY and TAXES are related by common values for EmployeeNumber.

One-to-One Relationship

[IMAGE]

One-to-Many and Many-to-One

A one-to-many or many-to-one relationship between input data sets implies that one data set has at most one observation with a specific value of the selected variable, but the other input data set may have more than one occurrence of each value. When you work with multiple selected variables, this relationship implies that each combination of values occurs no more than once in one data set, but may occur more than once in the other data set. The order in which the input data sets are processed determines whether the relationship is one-to-many or many-to-one.

In the following example, observations in data sets ONE and TWO are related by common values for variable A. Values of A are unique in data set ONE but not in data set TWO.

One-to-Many Relationship

[IMAGE]

In the following example, observations in data sets ONE, TWO, and THREE are related by common values for variable ID. Values of ID are unique in data sets ONE and THREE but not in TWO. For values 2 and 3 of ID, a one-to-many relationship exists between observations in data sets ONE and TWO, and a many-to-one relationship exists between observations in data sets TWO and THREE.

One-to-Many and Many-to-One Relationships

[IMAGE]

Many-to-Many

The many-to-many category implies that multiple observations from each input data set may be related based on values of one or more common variables.

In the following example, observations in data sets BREAKDOWN and MAINTENANCE are related by common values for variable Vehicle. Values of Vehicle are not unique in either data set. A many-to-many relationship exists between observations in these data sets for values AAA and CCC of Vehicle.

Many-to-Many Relationship

[IMAGE]

Access Methods: Sequential versus Direct

by an observation number
by the value of one or more variables through a simple or composite index.

To access observations directly by their observation number, use the POINT= option with the SET or MODIFY statement. The POINT= option names a variable whose current value determines which observation a SET or MODIFY statement reads.

To access observations directly based on the values of one or more specified variables, you must first create an index for the variables and then read the data set using the KEY= statement option with the SET or MODIFY statement. An index is a separate structure that contains the data values of the key variable or variables, paired with a location identifier for the observations containing the value.

Note: You can also use the SAS File I/O functions such as CUROBS, NOTE, POINT and FETCHOBS to access observations by observation number. [cautionend]

Overview of Methods for Combining SAS Data Sets

You can use these methods to combine SAS data sets:

concatenating
interleaving
one-to-one reading
one-to-one merging
match merging
updating.

Concatenating

The following figure shows the results of concatenating two SAS data sets. Concatenating the data sets appends the observations from one data set to another data set. The DATA step reads DATA1 sequentially until all observations have been processed, and then reads DATA2. Data set COMBINED contains the results of the concatenation. Note that the data sets are processed in the order in which they are listed in the SET statement.

Concatenating Two Data Sets

[IMAGE]

Interleaving

The following figure shows the results of interleaving two SAS data sets. Interleaving intersperses observations from two or more data sets, based on one or more common variables. Data set COMBINED shows the result.

Interleaving Two Data Sets

[IMAGE]

One-to-One Reading and One-to-One Merging

The following figure shows the results of one-to-one reading and one-to-one merging. One-to-one reading combines observations from two or more SAS data sets by creating observations that contain all of the variables from each contributing data set. Observations are combined based on their relative position in each data set, that is, the first observation in one data set with the first in the other, and so on. The DATA step stops after it has read the last observation from the smallest data set. One-to-one merging is similar to a one-to-one reading, with two exceptions: you use the MERGE statement instead of multiple SET statements, and the DATA step reads all observations from all data sets. Data set COMBINED shows the result.

One-to-One Reading and One-to-One Merging

[IMAGE]

Match-Merging

The following figure shows the results of match-merging. Match-merging combines observations from two or more SAS data sets into a single observation in a new data set based on the values of one or more common variables. Data set COMBINED shows the results.

Match-Merging Two Data Sets

[IMAGE]

Updating

The following figure shows the results of updating a master data set. Updating uses information from observations in a transaction data set to delete, add, or alter information in observations in a master data set. You can update a master data set by using the UPDATE statement or the MODIFY statement. If you use the UPDATE statement, your input data sets must be sorted by the values of the variables listed in the BY statement. (In this example, MASTER and TRANSACTION are both sorted by Year.) If you use the MODIFY statement, your input data does not need to be sorted.

UPDATE replaces an existing file with a new file, allowing you to add, delete, or rename columns. MODIFY performs an update in place by rewriting only those records that have changed, or by appending new records to the end of the file.

Note that by default, UPDATE and MODIFY do not replace nonmissing values in a master data set with missing values from a transaction data set.

Updating a Master Data Set

[IMAGE]

Overview of Tools for Combining SAS Data Sets

Using Statements and Procedures

Once you understand the basics of establishing relationships among data, the ways to access data, and the ways that you can combine SAS data sets, you can choose from a variety of SAS tools for accessing, combining, and processing your data. The following table lists and briefly describes the DATA step statements and the procedures that you can use for combining SAS data sets.

Statements or Procedures for Combining SAS Data Sets

Access Method

Statement or Procedure Action Performed Sequential Direct Can Use withBY statement Comments

BY controls the operation of a SET, MERGE, UPDATE, or MODIFY statement in the DATA step and sets up special grouping variables. NA NA NA BY-group processing is a means of processing observations that have the same values of one or more variables.

MERGE reads observations from two or more SAS data sets and joins them into a single observation. X
X When using MERGE with BY, the data must be sorted or indexed on the BY variable.

MODIFY processes observations in a SAS data set in place. (Contrast with UPDATE.) X X X Sorted or indexed data are not required for use with BY, but are recommended for performance.

SET reads an observation from one or more SAS data sets. X X X Use KEY= or POINT= statement options for directly accessing data.

UPDATE applies transactions to observations in a master SAS data set. UPDATE does not update observations in place; it produces an updated copy of the current data set. X
X Both the master and transaction data sets must be sorted by or indexed on the BY variable.

PROC APPEND adds the observations from one SAS data set to the end of another SAS data set. X

PROC SQL (table note 1) reads an observation from one or more SAS data sets; reads observations from up to 32 SAS data sets and joins them into single observations; manipulates observations in a SAS data set in place; easily produces a Cartesian product. X X X All three access methods are available in PROC SQL, but the access method is chosen by the internal optimizer.

***Statements or Procedures for Combining SAS Data Sets***
		Access Method
Statement or Procedure	Action Performed	Sequential	Direct	Can Use withBY statement	Comments
BY	controls the operation of a SET, MERGE, UPDATE, or MODIFY statement in the DATA step and sets up special grouping variables.	NA	NA	NA	BY-group processing is a means of processing observations that have the same values of one or more variables.
MERGE	reads observations from two or more SAS data sets and joins them into a single observation.	X		X	When using MERGE with BY, the data must be sorted or indexed on the BY variable.
MODIFY	processes observations in a SAS data set in place. (Contrast with UPDATE.)	X	X	X	Sorted or indexed data are not required for use with BY, but are recommended for performance.
SET	reads an observation from one or more SAS data sets.	X	X	X	Use KEY= or POINT= statement options for directly accessing data.
UPDATE	applies transactions to observations in a master SAS data set. UPDATE does not update observations in place; it produces an updated copy of the current data set.	X		X	Both the master and transaction data sets must be sorted by or indexed on the BY variable.
PROC APPEND	adds the observations from one SAS data set to the end of another SAS data set.	X
PROC SQL (table note 1)	reads an observation from one or more SAS data sets; reads observations from up to 32 SAS data sets and joins them into single observations; manipulates observations in a SAS data set in place; easily produces a Cartesian product.	X	X	X	All three access methods are available in PROC SQL, but the access method is chosen by the internal optimizer.

TABLE NOTE 1:

PROC SQL is the SAS implementation of Structured Query Language. In addition to expected SQL capabilities, PROC SQL includes additional capabilities specific to SAS, such as the use of formats and SAS macro language.

Using Error Checking

You can use the _IORC_ automatic variable and the SYSRC autocall macro to perform error checking in a DATA step. Use these tools with the MODIFY statement or with the SET statement and the KEY= option. For more information about these tools, see Error Checking When Using Indexes to Randomly Access or Update Data.

How to Prepare Your Data Sets

Before combining SAS data sets, follow these guidelines to produce the results you want:

Know the structure and the contents of the data sets.
Look at sources of common problems.
Ensure that observations are in the correct order, or that they can be retreived in the correct order (for example, by using an index).
Test your program.

Knowing the Structure and Contents of the Data Sets

To help determine how your data are related, look at the structure of the data sets. To see the data set structure, execute the DATASETS procedure, the CONTENTS procedure, or access the SAS Explorer window in your windowing environment to display the descriptor information. Descriptor information includes the number of observations in each data set, the name and attributes of each variable, and which variables are included in indexes. To print a sample of the observations, use the PRINT procedure or the REPORT procedure.

You can also use functions such as VTYPE, VLENGTH, and VLENGTHX to show specific descriptor information. For a short description of these functions, see the Variable Information functions in Functions and CALL Routines. For complete information about these functions, see "Functions and CALL Routines" in SAS Language Reference: Dictionary.

Looking at Sources of Common Problems

If your program does not execute correctly, review your input data for the following errors:

variables that have the same name but that represent different data
SAS includes only one variable of a given name in the new data set. If you are merging two data sets that have variables with the same names but different data, the values from the last data set that was read are written over the values from other data sets.
To correct the error, you can rename variables before you combine the data sets by using the RENAME= data set option in the SET, UPDATE, or MERGE statement, or you can use the DATASETS procedure.
common variables with the same data but different attributes

The way SAS handles these differences depends on which attributes are different:
- type attribute
  If the type attribute is different, SAS stops processing the DATA step and issues an error message stating that the variables are incompatible.
  To correct this error, you must use a DATA step to re-create the variables. The SAS statements you use depend on the nature of the variable.
- length attribute
  If the length attribute is different, SAS takes the length from the first data set that contains the variable. In the following example, all data sets that are listed in the MERGE statement contain the variable Mileage. In QUARTER1, the length of the variable Mileage is four bytes; in QUARTER2, it is eight bytes and in QUARTER3 and QUARTER4, it is six bytes. In the output data set YEARLY, the length of the variable Mileage is four bytes, which is the length derived from QUARTER1.
```
data yearly;
   merge quarter1 quarter2 quarter3 quarter4;
   by Account;
run;
```
  To override the default and set the length yourself, specify the appropriate length in a LENGTH statement that precedes the SET, MERGE, MODIFY, or UPDATE statement.
- label, format, and informat attributes
  If any of these attributes are different, SAS takes the attribute from the first data set that contains the variable with that attribute. However, any label, format, or informat that you explicitly specify overrides a default. If all data sets contain explicitly specified attributes, the one specified in the first data set overrides the others. To ensure that the new output data set has the attributes you prefer, use an ATTRIB statement.
  You can also use the SAS File I/O functions such as VLABEL, VLABELX, and other Variable Information functions to access this information. For a short description of these functions, see the Variable Information functions in Functions and CALL Routines by Category. For complete information about these functions, see "Functions and CALL Routines" in SAS Language Reference: Dictionary.

Ensuring Correct Order

If you use BY-group processing with the UPDATE, SET, and MERGE statements to combine data sets, ensure that the observations in the data sets are sorted in the order of the variables that are listed in the BY statement, or that the data sets have an appropriate index. If you use BY-group processing in a MODIFY statement, your data does not need to be sorted, but sorting the data improves efficiency. The BY variable or variables must be common to both data sets, and they must have the same attributes. For more information, see BY-Group Processing in the DATA Step.

Testing Your Program

As a final step in preparing your data sets, you should test your program. Create small temporary SAS data sets that contain a sample of observations that test all of your program's logic. If your logic is faulty and you get unexpected output, you can use the DATA step debugger to debug your program. For complete information about the DATA Step Debugger, see SAS Language Reference: Dictionary.

Chapter Contents
Previous
Next
Top of Page

Combining SAS Data Sets: The Essentials

One-to-One

One-to-Many and Many-to-One

Many-to-Many

Overview

Sequential Access

Direct Access

Concatenating

Interleaving

One-to-One Reading and One-to-One Merging

Match-Merging

Updating

Using Statements and Procedures

Using Error Checking

Knowing the Structure and Contents of the Data Sets

Looking at Sources of Common Problems

Ensuring Correct Order

Testing Your Program