Techniques for Optimizing I/O


Overview of Techniques for Optimizing I/O

I/O is one of the most important factors for optimizing performance. Most SAS jobs consist of repeated cycles of reading a particular set of data to perform various data analysis and data manipulation tasks. To improve the performance of a SAS job, you must reduce the number of times SAS accesses disk or tape devices.

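To do this, you can modify your SAS programs to process only the necessary variables and observations by using WHERE-expression processing, DROP and KEEP statements, LENGTH statements, and the OBS= and FIRSTOBS= data set options.

You can also modify your programs to reduce the number of times SAS processes the data internally by creating SAS data sets, using indexes, accessing data through views, and using engines efficiently.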

You can reduce the number of data accesses by processing more data each time a device is accessed by setting the BUFNO=, BUFSIZE=, CATCACHE=, and COMPRESS= system options.

Sometimes you might be able to use more than one method, making your SAS job even more efficient.


Using WHERE-Expression Processing

You might be able to use a WHERE statement in a procedure to perform the same task as a DATA step with a subsetting IF statement. The WHERE statement can eliminate extra DATA step processing when performing certain analyses because unneeded observations are not processed.

For example, the following DATA step creates a data set SEATBELT, which contains only those observations from the AUTO.SURVEY data set for which the value of SEATBELT is yes. The new data set is then printed.

libname auto '/users/autodata';
data seatbelt;
   set auto.survey;
   if seatbelt='yes';
run;

proc print data=seatbelt;
run;

However, you can get the same output from the PROC PRINT step without creating a data set if you use a WHERE statement in the PROC PRINT step, as in the following example:

proc print data=auto.survey;
   where seatbelt='yes';
run;

The WHERE statement can save resources by reducing the number of times you process the data. In this example, you might be able to use less time and memory by eliminating the DATA step. Also, you use less I/O because there is no intermediate data set. Note that you cannot use a WHERE statement in a DATA step that reads raw data.

The savings that you can achieve depend on many factors, including the size of the data set. Test your programs to determine which solution is the most efficient. See Deciding Whether to Use a WHERE Expression or a Subsetting IF Statement for more information.
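Because a WHERE statement is not available when a DATA step reads raw data, a subsetting IF statement is the usual alternative in that case. The following is a minimal sketch only; the raw file name and the variable layout are hypothetical and are not taken from the AUTO.SURVEY example:

```sas
/* Hypothetical raw file and variable layout, for illustration only. */
data seatbelt;
   infile '/users/autodata/survey.dat';
   input id seatbelt $ age;
   /* Subsetting IF: only observations that pass the test are written. */
   if seatbelt='yes';
run;
```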


Using DROP and KEEP Statements

Another way to improve efficiency is to use DROP and KEEP statements to reduce the size of your observations. When you create a temporary data set and include only the variables that you need, you can reduce the number of I/O operations that are required to process the data. See SAS Language Reference: Dictionary for more information on the DROP and KEEP statements.
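As an illustration, the following sketch creates a temporary data set that contains only two variables. It assumes that the AUTO libref is already assigned and that AUTO.SURVEY contains variables named AGE and SEATBELT; AGE and the output data set name are hypothetical.

```sas
/* AGE and WORK.SUBSET are assumed names, used for illustration only. */
data work.subset;
   set auto.survey;
   keep age seatbelt;   /* only these variables are written to WORK.SUBSET */
run;
```

The KEEP= data set option on the SET statement, as in set auto.survey(keep=age seatbelt), is a variant that also keeps the unneeded variables out of the DATA step while it runs.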


Using LENGTH Statements

You can also use LENGTH statements to reduce the size of your observations. When you include only the necessary storage space for each variable, you can reduce the number of I/O operations that are required to process the data. Before you change the length of a numeric variable, however, see Specifying Variable Lengths. See SAS Language Reference: Dictionary for more information on the LENGTH statement.
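The following sketch shows one way to apply this to a character variable. With list input, a character variable receives a default length of 8 bytes; the LENGTH statement here reduces that to 3. The data set name and the sample values are hypothetical.

```sas
/* WORK.ANSWERS and its values are made up for this sketch. */
data work.answers;
   length seatbelt $ 3;   /* 3 bytes instead of the default 8 */
   input id seatbelt $;
   datalines;
1 yes
2 no
3 yes
;
run;
```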


Using the OBS= and FIRSTOBS= Data Set Options

You can also use the OBS= and FIRSTOBS= options to reduce the number of observations processed. When you create a temporary data set and include only the necessary observations, you can reduce the number of I/O operations that are required to process the data. See SAS Language Reference: Dictionary for more information on the OBS= and FIRSTOBS= data set options.
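For example, the following sketch processes only observations 101 through 200 of AUTO.SURVEY, assuming that the AUTO libref is already assigned and that the data set has at least that many observations. Note that OBS= names the last observation to process, not a count of observations.

```sas
/* The observation numbers are arbitrary values chosen for illustration. */
proc print data=auto.survey(firstobs=101 obs=200);
run;
```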


Creating SAS Data Sets

If you process the same raw data repeatedly, it is usually more efficient to create a SAS data set. SAS can process SAS data sets more efficiently than it can process raw data files.
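A minimal sketch of this approach follows: the raw file is read once and stored as a permanent SAS data set, and later steps read the data set rather than the raw file. The libref, file names, and variable names are all hypothetical.

```sas
/* Hypothetical library path, raw file, and variable layout. */
libname mylib '/users/myid/mydir';

/* Read the raw data once and store it as a SAS data set. */
data mylib.readings;
   infile '/users/myid/rawdata/readings.dat';
   input id reading;
run;

/* Later steps read the SAS data set instead of the raw file. */
proc means data=mylib.readings;
   var reading;
run;
```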

Another consideration is whether you are using data sets that were created with previous releases of SAS. If you frequently process such data sets, it is sometimes more efficient to convert them by re-creating them in the most recent version of SAS. See Compatibility of Version 8 with Earlier Releases for more information.


Using Indexes

An index is an auxiliary data structure that is used in conjunction with WHERE-expression processing, BY-group processing, or a MODIFY or SET statement with the KEY= option to locate and select specific observations by the value of the indexed variable. By creating and using an index, you can access an observation faster. Without the index, SAS must start at the top of the data set and read the observations sequentially to the end of the data set, applying the WHERE clause or BY statement to each observation. In contrast, an index lets SAS go directly to the observations that it needs, and it returns those observations in sorted order.

Note:   Indexing does not always improve the performance of an application. If you are continually rewriting a data set, indexing its variables would be wasteful because the index must be re-created each time the data set is rewritten.

See SAS Data Files for more information about indexes.
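As a sketch of how an index might be created and then used, the following example builds a simple index on the SEATBELT variable of AUTO.SURVEY with the DATASETS procedure. It assumes that the AUTO libref is already assigned. SAS decides at run time whether using the index is worthwhile for a given WHERE expression, so the second step is not guaranteed to use it.

```sas
/* Create a simple index on SEATBELT in AUTO.SURVEY. */
proc datasets library=auto nolist;
   modify survey;
   index create seatbelt;
quit;

/* A WHERE expression on the indexed variable can now use the index. */
proc print data=auto.survey;
   where seatbelt='yes';
run;
```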


Accessing Data Through Views

You can use the SQL procedure or a DATA step to create views of your data. A view is a stored set of instructions that subsets your data with fewer statements. Also, you can use a view to group data from several data sets without creating a new data set, saving both processing time and disk space. See SAS Data Views and the SAS Procedures Guide for more details.
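The following sketch defines a PROC SQL view over AUTO.SURVEY, assuming that the AUTO libref is already assigned. The view name WORK.BELTED is hypothetical. The view stores only the query; the observations are read from AUTO.SURVEY each time the view is used.

```sas
/* WORK.BELTED is a hypothetical view name. */
proc sql;
   create view work.belted as
      select *
      from auto.survey
      where seatbelt='yes';
quit;

/* The view is referenced like a data set; no copy of the data is stored. */
proc print data=work.belted;
run;
```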


Using Engines Efficiently

If you do not specify an engine on a LIBNAME statement, SAS must perform extra processing steps to determine which engine to associate with the data library. SAS must look at all of the files in the directory until it has enough information to determine which engine to use. For example, the following statement is efficient because it explicitly tells SAS to use a specific engine with the libref FRUITS:

/* Engine specified. */
libname fruits v8 '/users/myid/mydir';

The following statement does not explicitly specify the V8 engine; notice the NOTE about mixed engine types that is generated:

/* Engine not specified. */
libname fruits '/users/myid/mydir';

NOTE: Directory for library FRUITS contains 
      files of mixed engine types.
NOTE: Libref FRUITS was successfully assigned 
      as follows:
      Engine:        V8
      Physical Name: /users/myid/mydir

Operating Environment Information:   In the OS/390 operating environment, you do not need to specify an engine for certain types of libraries.

See SAS I/O Engines for more information about SAS engines.


Setting the BUFNO=, BUFSIZE=, CATCACHE=, and COMPRESS= System Options

The following SAS system options can help you reduce the number of disk accesses that are needed for SAS files, though they might increase memory usage.

BUFNO=
SAS uses the BUFNO= option to adjust the number of open page buffers when it processes a SAS data set. Increasing this option's value can improve your application's performance by allowing SAS to read more data with fewer passes; however, your memory usage increases. Experiment with different values for this option to determine the optimal value for your needs.

Note:   You can also use the CBUFNO= system option to control the number of extra page buffers to allocate for each open SAS catalog.

See "System Options" in SAS Language Reference: Dictionary and the SAS documentation for your operating environment for more details on this option.
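As a starting point for experimenting with BUFNO=, the option can be set in an OPTIONS statement before the steps that read the data. The value 10 below is arbitrary and is not a recommendation; the sketch assumes that the AUTO libref is already assigned.

```sas
/* The value 10 is an arbitrary starting point, not a recommendation. */
options bufno=10;

proc print data=auto.survey;
run;
```

Comparing the resource statistics in the log across several values is one way to judge whether a larger setting helps.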

BUFSIZE=
When the V8 engine creates a data set, it uses the BUFSIZE= buffer size option to determine the page size of the data set. In each subsequent I/O operation, SAS moves the number of pages that is set by the BUFNO= option. Whether you use your operating environment's default value or specify a value, the engine always writes complete pages regardless of how full or empty those pages are.

If you know that the total amount of data is going to be small, you can enforce a small page size with the BUFSIZE= option, so that the total data set size remains small and you minimize the amount of wasted space on a page. In contrast, if you know that you are going to have many observations in a data set, you should optimize BUFSIZE= so that as little overhead as possible is needed. Note that each page requires some additional overhead.

Large data sets that are accessed sequentially benefit from larger page sizes because larger pages reduce the number of system calls that are required to read the data set. Note that because observations cannot span pages, typically there is unused space on a page.

Calculating Data Set Size discusses how to estimate data set size.

See "System Options" in SAS Language Reference: Dictionary and the SAS documentation for your operating environment for more details on this option.
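Because the page size is fixed when a data set is created, BUFSIZE= must be in effect before the data set is written. The following sketch sets the system option and then creates a copy of AUTO.SURVEY; the value 32768 and the output data set name are hypothetical choices for illustration.

```sas
/* 32768 bytes is an arbitrary page size chosen for illustration. */
options bufsize=32768;

/* WORK.SURVEY_BIG is a hypothetical output data set. */
data work.survey_big;
   set auto.survey;
run;
```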

CATCACHE=
SAS uses this option to determine the number of SAS catalogs to keep open at one time. Increasing its value can use more memory, although this may be warranted if your application uses catalogs that will be needed relatively soon by other applications. (The catalogs closed by the first application are cached and can be accessed more efficiently by subsequent applications.)

See "System Options" in SAS Language Reference: Dictionary and the SAS documentation for your operating environment for more details on this option.

COMPRESS=
Another technique that can reduce I/O processing is to store your data as compressed data sets by using the COMPRESS= system option or data set option. Storing your data this way means that more CPU time is needed to decompress the observations as they are made available to SAS. If your concern is I/O rather than CPU usage, however, compressing your data might improve the I/O performance of your application.

See SAS Language Reference: Dictionary for more details on this option.
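A minimal sketch of creating a compressed copy of a data set follows. It uses the COMPRESS= data set option on the output data set and assumes that the AUTO libref is already assigned; the output data set name is hypothetical.

```sas
/* WORK.SURVEY_C is a hypothetical output data set. */
data work.survey_c(compress=yes);
   set auto.survey;
run;
```

Whether compression saves space depends on the data, so it is worth comparing the size of the compressed and uncompressed data sets before adopting it.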



Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.