Chapter Contents |
Previous |
Next |
SAS/SHARE User's Guide |
Now that you understand how SAS and SAS/SHARE software use files and computer resources, it's time to apply that knowledge to the design and implementation of your applications.
The most productive way to optimize the performance of your application is programming it to work as efficiently as possible. You can almost always realize more performance improvement by coding your application to exploit features of SAS than you can gain by adjusting the operation of SAS.
When you decide to adjust SAS to operate differently, remember that tuning is a balancing act and invariably requires compromise. Of course, to effectively tune SAS you must understand what your application's bottlenecks are.
This section first lists some programming techniques that are based on the information presented earlier. After that, it describes the tuning options of SAS/SHARE software and SAS.
Programming Techniques |
As a SAS data file matures, users add new observations, update existing observations, and forget about old observations. In most cases the level of activity is greatest on the newest observations. If the users of your application do not frequently access older information, consider moving older observations from files that are concurrently updated to archive files that are accessed directly (instead of through a server).
Also as a SAS data file matures, new variables are added, some variables turn out to be larger than they need to be, and some variables lose their usefulness. Periodically check your application's SAS data files for variables that are longer than they need to be and for variables that are no longer used.
While compressing a SAS data file reduces the number of pages in it, compression cannot eliminate unused space as effectively as you can by deleting unused observations and variables and by shortening variables that are longer than necessary.
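As a sketch of this kind of cleanup, a single DATA step can move older observations to a directly accessed archive file, drop a variable that is no longer used, and shorten a variable that is longer than it needs to be. (The library, file, and variable names and the cutoff date here are all hypothetical.)

```sas
/* Keep recent observations in the concurrently updated file; */
/* move older ones to an archive file accessed directly.      */
DATA MYLIB.ORDERS ARCHLIB.OLDORDS;
   LENGTH STATUS $ 8;               /* shorten an oversized variable */
   SET MYLIB.ORDERS(DROP=OLDNOTE);  /* drop an unused variable       */
   IF ORDDATE >= '01JAN1999'D THEN OUTPUT MYLIB.ORDERS;
   ELSE OUTPUT ARCHLIB.OLDORDS;
RUN;
```

The archive file can then be accessed directly instead of through the server.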
Smaller data files improve the performance of all SAS sessions by reducing the amount of disk space required by each file, by reducing the number of I/O operations required to process the data in each file, and by reducing the number and size of messages required to transmit the data in a file between a server and its users.
When an index is not used to locate directly the observations that satisfy a WHERE clause, the process in the server's session must read observations from the data file until it finds one that matches the WHERE clause. This can consume a very large amount of the I/O and CPU resources. Those resource requirements can be greatly reduced when the variables in the WHERE clause are indexed.
The subsetting IF statement of the DATA step and the FIND, SEARCH, and LOCATE commands of SAS/FSP procedures perform the specified comparison in the user's SAS session instead of in a process in a server. This requires that every observation of the SAS data set be transmitted from the server's session to the user's session, which can consume a very large amount of the messages resource, in addition to the I/O and CPU resources required to read the data file. Since the comparisons of a WHERE clause are performed in the server's session, only the desired observations are transmitted to the user's session and the message resource is conserved.
The I/O resource consumption is the same for an unoptimized WHERE clause, a subsetting IF statement, and the FSP FIND, SEARCH, and LOCATE commands. Using WHERE clauses is recommended, however, because the messages resource consumption is higher for the subsetting IF statement and the FIND, SEARCH, and LOCATE commands.
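As a sketch (the file and variable names are hypothetical), both of the following steps produce the same subset, but only the first performs the comparison in the server's session, so only the matching observations cross the network:

```sas
/* WHERE clause: evaluated in the server's session; only */
/* the desired observations are transmitted to the user. */
DATA WORK.EAST;
   SET MYLIB.SALES;
   WHERE REGION = 'EAST';
RUN;

/* Subsetting IF: every observation is transmitted to the */
/* user's session before the comparison is made.          */
DATA WORK.EAST;
   SET MYLIB.SALES;
   IF REGION = 'EAST';
RUN;
```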
Adding indexes may be a good idea if your application seems to be taking too long to execute WHERE clauses. However, indexes require extra memory and may present a problem for a server that is memory constrained.
A complete description of index usage can be found in the paper "Effective Use of Indexes in the SAS System," in the Proceedings of the Sixteenth Annual SAS Users Group International Conference.
When a SAS data file is accessed through a server, creating an index on it prevents the server from responding to other users' requests until the index is complete. While it can be convenient to create an index while a data file is accessed through a server, indexes on large data files should be created during off-hours, not while a server is expected to respond quickly to users' requests.
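For example, an index can be created with the DATASETS procedure. (The library, member, and variable names here are hypothetical; the variable chosen should be one that appears frequently in the application's WHERE clauses.)

```sas
/* Create a simple index on the variable REGION */
PROC DATASETS LIBRARY=MYLIB;
   MODIFY SALES;
   INDEX CREATE REGION;
QUIT;
```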
Large page sizes can be useful if most of the observations on each page are likely to be accessed each time the page is read into the server's memory, or if a large page size causes all or most of a SAS data file to be kept in the server's memory. Otherwise, large page sizes can increase the amount of time required to perform I/O activity in the server's SAS session, to the detriment of the server's ability to provide timely response to users' requests.
The components of SAS that are used most frequently to access data in a random order are:
The server will, by default, transmit multiple observations per read for both the 'IS' and 'IN' open modes.
If the application's use of data is predominantly sequential but you occasionally need to reread a previously read observation, use an open mode of 'IN' or 'UN' in your SCL OPEN() function. If the application's use of data is strictly sequential (you will never revisit a previously read observation), use the open mode 'IS' or 'US'. The 'IS' and 'US' open modes are the most efficient for SCL. An 'IS' or 'US' open mode, however, restricts an SCL application to those functions that access data sequentially. The SCL functions that access data in a random pattern are:
Specifying an access pattern on an SCL OPEN() function is documented in SAS Component Language: Reference. An example of specifying a sequential access pattern on an SCL OPEN() function is:
DSID = OPEN( 'MYLIB.A', 'IN' );
There are three strategies for using SAS data sets in an SCL program:
The initialization code of an application is the place to open the SAS data sets that will be used throughout the execution of the application. But if an application's initialization code must open a large number of files, the application may take a long time to start. By studying how an application is used, you may discover some SAS data sets that can instead be opened as their functions are requested while the application executes; this reduces the time the application takes to initialize and spreads out the time required to open files.
Whether they are opened during initialization or later, lookup tables that are small should usually not be closed until an application terminates because the single I/O buffer required by such a lookup table does not require a large amount of memory. In such a case it is frequently economical to use a small amount of the memory resource to conserve the CPU resource that would be required to open and close the lookup table over and over.
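As a sketch (the data set name is hypothetical), a small lookup table can be opened once in an SCL program's INIT section and left open until the TERM section, so the open and close costs are paid only once:

```sas
INIT:
   /* open the lookup table once, for sequential input, */
   /* and keep it open for the life of the application  */
   CODESID = OPEN( 'MYLIB.CODES', 'IS' );
RETURN;

TERM:
   /* close the lookup table when the application ends */
   IF CODESID > 0 THEN RC = CLOSE( CODESID );
RETURN;
```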
Larger SAS data sets, and SAS data sets that are used
extremely infrequently (for example, once during initialization) or during
a seldom-used function (for example, a lookup table on a rarely updated field),
should usually be left closed whenever they are not being used.
For example, a travel agent making reservations or a stock broker making trades requires every query to show up-to-the-second information. On the other hand, daily reports or analysis of long-term trends can use data that are out of date by several hours, several days, or even several weeks or months.
When copying data from a server, you can subset it horizontally with a WHERE clause and vertically with the DROP= or KEEP= data set options. (In relational terminology, horizontal subsetting is selection and vertical subsetting is projection.) Be sure to take advantage of both methods when copying a file from a server, to make the copy of the file as small as possible and thus ensure that reports are generated as efficiently as possible.
Don't forget that files can be stored in users' WORK libraries. It can be very efficient to copy a concurrently updated file from a server to a user's WORK library and then use that copy more than once to generate reports. Such a copy of a file contains very timely data yet is not especially expensive to create or use.
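For example (the file and variable names are hypothetical), the following step copies only the needed observations and variables from a server-accessed file into the user's WORK library, where several reports can then read the copy directly:

```sas
/* Selection with WHERE, projection with KEEP= */
DATA WORK.EASTACCT;
   SET MYLIB.MASTER(KEEP=ACCTNUM BALANCE REGION);
   WHERE REGION = 'EAST';
RUN;
```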
A SAS data file that is accessed directly is almost always less costly to use than a file that is accessed through a server.
A lookup file that is used frequently but updated less often than, say, once a week is likely to be a good candidate for direct access (not through a server), provided you can find some time during the week to update the file while the application is not in use. On the other hand, a lookup file that is updated many times each day should, in many cases, be accessed through a server, because updating the file is then convenient: the lookup file can be updated while users continue to use it to perform queries.
SAS catalog entries illustrate another way that update frequency can change:
An application may use only a few or many catalog entries. Like lookup files, catalog entries that are updated frequently are likely candidates for access through a server. But catalog entries that are never changed, or only changed very infrequently, should not be accessed through a server.
The update frequency may change for some of an application's catalog entries over time. For example, while an application is under development and being tested, it can be extremely convenient for the developers of the application to be able to update any catalog entry while those testing the application continue with their work. During this phase, the convenience of accessing the catalog entries through a server can more than pay for the cost of the overhead of server access. After the testing is completed and the application has stabilized, some or all of the application's catalogs can be moved to a SAS library that is accessed directly by the users of the application; in this phase efficient access by the users is more important than being able to update the catalog entries conveniently.
Remember that not all of an application's catalog entries must be accessed the same way. Catalog entries that must be frequently updated can continue to be accessed through a server, while other catalog entries that change very seldom can be stored in SAS catalogs that are accessed directly by the users of the application.
If it is advantageous to your application for its DATA step views to be interpreted in a server's session, be sure that any external files read by the DATA step view are available to the server's SAS session.
Tuning Options in SAS/SHARE Software |
SAS/SHARE software automatically attempts to conserve the message resource by transmitting observations in groups whenever possible. Observations can always be transmitted in groups when a data set is being created or replaced, but when a data set is opened for update access it is never appropriate to transmit more than one observation at a time. The grouping of observations when a data set is opened for input depends on the situation; you control whether observations are transmitted in groups according to:
The factors that control how many observations are transmitted in each group are:
The value of the TBUFSIZE= option is used by a server to automatically calculate how many observations to combine into each group for all of the data sets accessed through it. Therefore, this option can be the easiest way to cause a server to generally combine more or fewer observations into each group.
When the data sets accessed through a server have large observations, only a few observations fit into the default buffer of 32,768 bytes, so the number of observations transferred in each message is small. A small number of observations per message conserves the memory resource at the expense of the messages resource; since messages tend to be expensive relative to memory, that trade-off will probably not allow your application to perform as well as it could.
The number of observations per group is calculated by subtracting a small amount of overhead from the MOTB size and dividing the result by the length of one observation. For example, consider a SAS data set in which each observation is 10,000 bytes long. The default MOTB size is 32,768. In this case three observations are transmitted in each message when the application opens the data set for sequential access. To increase the number of observations per message to 6, specify the option TBUFSIZE=64k on the PROC SERVER statement.
An example of using the TBUFSIZE= option is:
PROC SERVER TBUFSIZE=128K <other PROC SERVER options>;
The TOBSNO= option may be specified wherever SAS data set options are accepted: as an argument to the OPEN() function of SAS Screen Control Language, on the DATA= option of a SAS procedure, and on the SET, MERGE, UPDATE, and MODIFY statements of the DATA step. It must be specified for each data set for which you want grouping behavior different from the default.
When a data set is opened for input with a sequential access pattern, a server calculates the number of observations per group as the smallest of:
When a SAS data set is opened for input with a random access pattern, the default behavior is transmitting observations individually (the group size is one). This ensures that a user always receives up-to-date data when they position to an observation, and it reduces wasted communications bandwidth because no observations are transmitted to a user's session except the specific observations requested.
At other times, the TOBSNO= data set option can be used to increase the number of observations transferred in each group. For example, consider an SCL program in which the dsid of a SAS data set is passed to the DATALISTC() or DATALISTN() function. The function reads the data set from beginning to end, and then rereads the observation chosen by the user. Because the OPEN() function of SCL specifies a random access pattern by default, observations for that dsid are transmitted individually. But the access pattern of the DATALISTC() and DATALISTN() functions is really skip-sequential, so transmitting observations individually is not optimal. Specifying TOBSNO=4 in a case like this reduces the number of messages by three-quarters. (Note that the user could change the open mode from 'I' to 'IN' as an alternative to specifying the TOBSNO= data set option.)
The number of observations transmitted when a data set is opened for input is determined as described above. An example of using the TOBSNO= data set option is:
PROC FSVIEW DATA=MYLIB.A(TOBSNO=10);
Each MOTB is assigned when a request for data is made and is not released until the request is satisfied. Therefore, when a process in a server's session takes a long time to satisfy a WHERE clause, an MOTB is assigned to that process for a long time. Such a situation reduces the server's ability to reuse MOTBs for other processes; one solution to that problem is to specify a larger value for the TBUFNO= option. Of course, you should also check the WHERE clauses being used to ensure that they can be evaluated as efficiently as possible.
When a user requests one or more observations and a server cannot assign an MOTB to the process that will execute the user's request, the server writes a message to its SAS log, waits for a brief period of time, and tries again to assign an MOTB to the process for the user's request. If there are still no MOTBs available, the sequence repeats. You can tell when the number of MOTBs is not sufficient by the presence of these messages in a server's SAS log.
An example of using the TBUFNO= option is:
PROC SERVER TBUFNO=8 <other PROC SERVER options>;
Some PROC SQL views are especially good candidates for interpretation in a server's SAS session because the number of observations produced by the view is much smaller than the number of observations read by the view, the data sets read by the view are available to the server, and the amount of processing necessary to build each observation is not large.
Other PROC SQL views should be interpreted in users' SAS sessions because the number of observations produced by the view is not appreciably smaller than the number of observations read by the view, some of the data sets read by the view can be directly accessed by the users' SAS sessions, and the amount of processing done by the view is considerable.
By default, SAS data views are interpreted in a server's SAS session, but the RMTVIEW= option of the LIBNAME statement enables you to have the views in a library interpreted in users' SAS sessions instead. The NORMTVIEW option on the PROC SERVER statement enables you to prevent all SAS data views from being interpreted in the server's session.
SAS/ACCESS views do not provide update access to the underlying database when they are interpreted in a server's session, so it is often more practical to interpret SAS/ACCESS views in users' SAS sessions.
If it is useful for your application to have a SAS/ACCESS view interpreted in a server's session, make sure that all of the necessary database interface components are available to the server's session.
If a user's SAS session is capable of using a SAS/ACCESS interface engine to access the underlying database, it is more efficient to execute the SAS/ACCESS interface engine in the user's SAS session. Note that in this case it may be convenient to store the view file in a SAS library that is accessed through a server if the view will be updated frequently and used by more than one user.
Like SAS/ACCESS views, DATA step views are very often most useful when interpreted in users' SAS sessions. See "Know your application's DATA step views", above, for more information about interpreting DATA step views in a server's session.
See Introduction for a complete description of the RMTVIEW= option of the LIBNAME statement.
Examples of specifying the RMTVIEW= and NORMTVIEW options are:
LIBNAME MYLIB 'my SAS data library' RMTVIEW=YES <other LIBNAME options>;
PROC SERVER NORMTVIEW <other PROC SERVER options>;
Changing the rate at which control is yielded is delicate because the act of yielding control consumes some CPU resource: increasing the frequency at which control is yielded increases a server's CPU consumption all by itself. You can change the rate at which the processes in a server yield control by varying the value of the PROC SERVER option LRPYIELD=. The default value of this option is 10,000; the option has no units.
To make long-running processes yield relatively more frequently, specify a value greater than 10,000. While a higher value may have the effect of providing more even response time to a server's users, this comes at the expense of increased consumption of the server's CPU resource. Also, the processes that run for a long time run even longer when they are asked to yield more frequently.
To make a long-running process yield less frequently, specify a value smaller than 10,000. A lower LRPYIELD= value may make some individual user requests (like an SQL join with a sort) complete sooner, but the server's other users are forced to wait as the long-running process does more work before it yields control. Response time can become more uneven when long-running processes are allowed to yield less frequently.
This option is documented in The SERVER Procedure.
An example of specifying the LRPYIELD= option is:
PROC SERVER LRPYIELD=5000 <other PROC SERVER options>;
This is not an option you specify on a SAS program statement; instead it is a method of managing the workload of concurrent access to SAS data sets.
If you determine that a server is consuming too much of a resource and you can not reduce the server's consumption of that resource any further, creating an additional server allows you to divide your applications' workload among several servers.
SAS/SHARE software includes a family of SAS macros that help you manage SAS file access through multiple servers. Those macros are documented in Using SAS/SHARE Macros for Server Access.
SAS System Options |
You might find it useful to balance the pattern in which a file is randomly accessed against the number of observations stored on each page of the file. If most random access sequences access observations in very different locations in the file, then a small page size will improve performance because most of the observations on each page are not used. On the other hand, if most random access sequences are likely to be to observations that are physically near each other in the file, you might be able to take advantage of a large page size to have many of the observations read from the file into the server's memory at once.
If you want to keep all or most of a SAS data file in memory, you can choose a very large page size. Of course, this can consume a lot of the server's memory, so use such a page size only when keeping the file in memory is worth that cost. If you expect that not much data from a large file will need to be in memory at one time, choose a small page size to make reading and writing each page as fast as possible.
If you find that your server is spending a significant amount of time waiting for I/O operations to complete, consider recreating the files that are not used for sequential access with a smaller page size.
An example of using the BUFSIZE= data set option is:
DATA MYLIB.A(BUFSIZE=6K);
   SET MYLIB.A;
RUN;
Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.