Chapter Contents |
Previous |
Next |
Definition of SAS Indexes |
For example, suppose you want the observation with SSN (social security number) equal to 465-33-8613:
Benefits of an Index |
In general, SAS can use an index to improve performance in the following situations:
Note: If the SORT procedure is used, the index is not
used.
In addition, an index can benefit other areas of the SAS System. In SCL (SAS Component Language), an index improves the performance of table lookup operations. For the SQL procedure, an index enables the software to process certain classes of queries more efficiently, for example, join queries. For the SAS/IML software, you can explicitly specify that an index be used for read, delete, list, or append operations.
Even though an index can reduce the time required to locate a set of observations, especially for a large data file, there are costs associated with creating, storing, and maintaining the index. When deciding whether to create an index, you must consider increased resource usage, along with the performance improvement.
Note: An index is never
used for the subsetting IF statement in a DATA step or for the FIND and SEARCH
commands in the FSEDIT procedure.
Index File |
The index file may show up as a separate file or appear to be part of the data file, depending on the operating environment. In any case, the index file is stored in the same SAS data library as its data file.
The index file consists of entries that are organized hierarchically and connected by pointers, all of which are maintained by SAS. The lowest level in the index file hierarchy consists of entries that represent each distinct value for an indexed variable, in ascending value order. Each entry consists of
That is, in an index file, each value is followed by one or more RIDs, which identifies the observation(s) in the data file containing the value. (Multiple RIDs result from multiple occurrences of the same value.) For example, the following represents index file entries for the variable LASTNAME:
Avery 10 Brown 6,22,43 Craig 5,50 Dunn 1
When an index is used to process a request, such as a WHERE expression, SAS does a binary search on the index file and positions the index to the first entry that contains a qualified value. SAS then uses the value's RID(s) to read the observation(s) that contain the value. Subsequent entries' higher (greater) than the requested value are found by reading the remaining entries and then following the pointers to entries that contain higher values. The result is that SAS can quickly locate the observations that are associated with a value or range of values. For example, using an index to process the WHERE expression,
where age > 20 and age < 35;SAS positions the index to the index entry for the first value greater than 20 and uses the value's RID(s) to read the observation(s). SAS then moves sequentially through the index entries reading observations until it reaches the index entry for the value that is equal to or greater than 35.
SAS automatically keeps the index file balanced as updates are made, which means that it ensures a uniform cost to access any index entry, and all space that is occupied by deleted values is recovered and reused.
Types of Indexes |
In addition to deciding whether you want a simple index
or a composite index, you can also limit an index (and its data file) to unique values and exclude from the index missing values.
The following example shows the DATASETS procedure statements that are used to create two simple indexes for variables CLASS and MAJOR in data file COLLEGE.SURVEY:
proc datasets library=college; modify survey; index create class; index create major; run;
To process a WHERE expression using an index, SAS uses only one index. When the WHERE expression has multiple conditions using multiple key variables, SAS determines which condition qualifies the smallest subset. For example, suppose that COLLEGE.SURVEY contains the following data:
With simple indexes on CLASS and MAJOR, SAS would select MAJOR to process the following WHERE expression:
where class=97 and major='Biology';
The following example shows the DATASETS procedure statements that are used to create a composite index for the data file COLLEGE.MAILLIST, specifying two key variables: ZIPCODE and SCHOOLID.
proc datasets library=college; modify maillist; index create zipid=(zipcode schoolid); run;
Often, only the first variable of a composite index is used. For example, for a composite index on ZIPCODE and SCHOOLID, the following WHERE expression can use the composite index for the variable ZIPCODE because it is the first key variable in the composite index:
where zipcode = 78753;
where zipcode = 78753 and schoolid = 55;
When you are deciding whether to create a simple index
or a composite index, consider how you will access the data. If you often
access data for a single variable, a simple index will do. But if you frequently
access data for multiple variables, a composite index could be beneficial.
The following example creates a simple index for the variable IDNUM and requires that all values for IDNUM be unique:
proc datasets library=college; modify student; index create idnum / unique; run;
The following example creates a simple index for the variable RELIGION and specifies that the index does not maintain missing values for the variable:
proc datasets library=college; modify student; index create religion / nomiss; run;
In contrast to the UNIQUE option, observations with missing values for the key variable can be added to the data file, even though the missing values are not added to the index.
SAS will not use an index that was created with the NOMISS option to process a BY statement or to process a WHERE expression that qualifies observations containing missing values. For example, suppose the index AGE was created with the NOMISS option and observations exist that contain missing values for the variable AGE. SAS will not use the index for the following:
proc print data=mydata.employee; where age < 35; run;
Deciding Whether to Create an Index |
An index exists to improve performance. However, an index conserves some resources at the expense of others. Therefore, you must consider costs associated with creating, using, and maintaining an index. The following topics provide information on resource usage and give you some guidelines for creating indexes.
When you are deciding whether to create an index, you
must consider CPU cost, I/O cost, buffer requirements, and disk space requirements.
When SAS uses an index to read an observation from a data file, there is also increased CPU usage. The increased usage results from SAS using a more complicated process than is used when SAS retrieves data sequentially. Although CPU usage is greater, you benefit from SAS reading only those observations that meet the conditions. Note that this is why using an index is more expensive when there is a larger number of observations that meet the conditions.
Note: To compare CPU usage with
and without an index, for some operating environments, you can issue the STIMER
or FULLSTIMER system options to write performance statistics to the SAS log.
To process a request using an index, the following occurs:
The result is that the more random the data, the more I/Os are required to use the index. If the data is ordered more like the index, which is in ascending value order, fewer I/Os are required to access the data.
The number of buffers determines how many pages of data can simultaneously be in memory. Frequently, the larger the number of buffers, the fewer number of I/Os will be required. For example, if the page size is 4096 bytes and one buffer is allocated, then one I/O transfers 4096 bytes of data (or one page). To reduce I/Os, you can increase the page size but you will need a larger buffer. To reduce the buffer size, you can decrease the page size but you will use more I/Os.
For information on data file characteristics like the data file page size and the number of data file pages, issue the CONTENTS procedure (or use the CONTENTS statement in the DATASETS procedure). With this information, you can determine the data file page size and experiment with different sizes. Note that the information that is available from PROC CONTENTS depends on the operating environment.
The BUFSIZE= data set option (or system option) sets
the page size for a data file when it is created. The BUFNO= data set option
(or system option) specifies how many buffers to allocate for a data file
and for the overall system for a given execution of SAS; that is, BUFNO= is
not stored as a data set attribute.
For information on the index file size, issue the CONTENTS procedure (or the CONTENTS statement in the DATASETS procedure). Note that the available information from PROC CONTENTS depends on the operating environment.
Guidelines for Creating Indexes |
For example, consider a data file with variables LASTNAME and GENDER.
where lastname='LeVoux' and gender='F';Note that when you create a composite index, the first key variable should be the most discriminating.
Methods of Creating an Index |
You can create one index for a data file, which can be either a simple index or a composite index, or you can create multiple indexes, which can be multiple simple indexes, multiple composite indexes, or a combination of both simple and composite. In general, the process of creating an index is as follows:
Note: If a data file's sort assertion
is set from a SORTEDBY= data set option, SAS validates that the data is sorted
as specified by the data set option. If the data is not sorted appropriately,
the index will not be created, and a message displays telling you that the
index was not created because values are not sorted in ascending order.
Methods to create an index are briefly described in
this section; for details, refer to the INDEX= data set option in the SAS Language Reference: Dictionary.
proc datasets library=mylib; modify employee; index delete salary age; index create empnum / unique nomiss; index create names=(lastname frstname);
Note: If you delete and create indexes in the same step, place the INDEX DELETE
statement before the INDEX CREATE statement so that space occupied by deleted
indexes can be reused during index creation.
data finances(index=(stock) /unique);
The next example uses the variables SSN, CITY, and STATE to create a simple index named SSN and a composite index named CITYST:
data employee(index=(ssn cityst=(city state)));
The SQL procedure supports index creation and deletion and the UNIQUE option. Note that the variable list requires that variable names be separated by commas (which is an SQL convention) instead of blanks (which is a SAS convention).
The DROP INDEX statement deletes indexes. The CREATE INDEX statement specifies the UNIQUE option, the name of the index, the target data file, and the variable(s) to be indexed. For example:
drop index salary from employee; create unique index empnum on employee (empnum); create index names on employee (lastname, frstname);
You can also create and delete indexes using other SAS utilities and products, such as the SAS Explorer, SAS/IML software, SAS Component Language, and SAS/Warehouse Administrator software.
Using an Index for WHERE Processing |
To process a WHERE expression, by default SAS decides whether to use an index or read all the observations in the data file sequentially. To make this decision, SAS does the following:
SAS attempts to use an index for the following types of conditions:
Condition | Examples |
---|---|
comparison operators, which include the EQ operator; directional comparisons like less than or greater than; and the IN operator | where empnum eq 3374; where empnum < 2000; where state in ('NC','TX'); |
comparison operators with NOT | where empnum ^= 3374; where x not in (5,10); |
comparison operators with the colon modifier | where lastname gt: 'Sm'; |
CONTAINS operator | where lastname contains 'Sm'; |
fully-bounded range conditions specifying both an upper and lower limit, which includes the BETWEEN-AND operator | where 1 < x < 10; where empnum between 500 and 1000; |
pattern-matching operators LIKE and NOT LIKE | where frstname like '%Rob_%' |
IS NULL or IS MISSING operator | where name is null; where idnum is missing; |
TRIM function | where trim(state)='Texas'; |
SUBSTR function in the form of: WHERE SUBSTR (variable, position, length)='string'; when the following conditions are met: position is equal to 1, length is less than or equal to the length of variable, and length is equal to the length of string |
where substr (name,1,3)='Mac' and (city='Charleston' or city='Atlanta'); |
The following examples illustrate optimizing a single condition:
where major in ('Biology', 'Chemistry', 'Agriculture'); where class=90 and major in ('Biology', 'Agriculture');
where zipcode = 78753;
However, the following condition cannot use the composite index because the variable SCHOOLID is not the first key variable in the composite index:
where schoolid gt 1000;
Note: An index is not supported for arithmetic operators,
a variable-to-variable condition, and the sounds-like operator.
For example, suppose you have a composite index for LASTNAME and FRSTNAME. If you issue the following WHERE expression, SAS uses the concatenated values for the first two variables, then SAS further evaluates each qualified observation for the EMPID value:
where lastname eq 'Smith' and frstname eq 'John' and empid=3374;
For compound optimization to occur, all of the following must be true.
where lastname eq 'Smith' and frstname eq 'John';Any conditions connected using the OR logical operator must specify the same variable:
where frstname eq 'John' and (lastname='Smith' or lastname = 'Jones');
Note: The same conditions that are
acceptable for optimizing a single condition are acceptable for compound optimization
except for the CONTAINS operator, the pattern-matching operators LIKE and
NOT LIKE, and the IS NULL and IS MISSING operators. Also, functions are not
supported.
For the following examples, assume there is a composite index named IJK for variables I, J, and K:
where i = 1 and j not in (3,4) and 10 < k < 12;
where i < 5 and j in (3,4) and k =3;
where i in (1,4) and j = 5 and k like '%c'l
where j = 1 and k = 2;
where x < 5 and i = 1 and j = 2;
Starting with Version 7, the software's ability to estimate the number of observations that will be qualified is improved because the software stores additional statistics called cumulative percentiles (or centiles for short). Centiles information represents the distribution of values in an index so that SAS does not have to assume a uniform distribution as in prior releases. To print centiles information for an indexed data file, include the CENTILES option in PROC CONTENTS (or in the CONTENTS statement in the DATASETS procedure).
Note that, by default, SAS does not update centiles information after every data file change. When you create an index, you can include the UPDATECENTILES option to specify when centiles information is updated. That is, you can specify that centiles information be updated every time the data file is closed, when a certain percent of values for the key variable have been changed, or never. In addition, you can also request that centiles information is updated immediately, regardless of the value of UPDATECENTILES, by issuing the INDEX CENTILES statement in PROC DATASETS.
As a general rule, SAS uses an index if it estimates that the WHERE expression will select approximately one-third or fewer of the total number of observations in the data file.
Note: If SAS estimates that the number of qualified observations is less than 3%
of the data file (or if no observations are qualified), SAS automatically
uses the index. In other words, in this case, SAS does not bother comparing
resource usage.
This decision is much like a reader deciding whether to use an index at the back of a book. A book's index is designed to allow a reader to locate a topic along with the specific page number(s). Using the index, the reader would go to the specific page number(s) and read only about a specific topic. If the book covers 42 topics and the reader is interested in only a couple of topics, then the index saves time by preventing the reader from reading other topics. However, if the reader is interested in 39 topics, searching the index for each topic would take more time than simply reading the entire book.
To compare resource usage, SAS does the following:
If the observations are randomly distributed throughout the data file, the observations will be located on multiple data file pages. This means an I/O will be needed for each page. Therefore, the more random the data in the data file, the more I/Os it takes to use the index. If the data in the data file is ordered more like the index, which is in ascending value order, fewer I/Os are needed to use the index.
Factors that affect the comparison include the size of the subset relative to the size of the data file, data file value order, data file page size, the number of allocated buffers, and the cost to uncompress a compressed data file for a sequential read.
Note: If comparing resource costs results in a tie, SAS chooses the index.
The IDXWHERE= data set option overrides the software's decision regarding whether to use an index to satisfy the conditions of a WHERE expression as follows:
The following example tells SAS to decide which index is the best for optimizing the WHERE expression. SAS will disregard the possibility that a sequential search of the data file might be more resource efficient.
data mydata.empnew; set mydata.employee (idxwhere=yes); where empnum < 2000;
For details, see the IDXWHERE data set option in SAS Language Reference: Dictionary.
The IDXNAME= data set option directs SAS to use a specific index in order to satisfy the conditions of a WHERE expression.
By specifying IDXNAME=index-name, you are specifying the name of a simple or composite index for the data file.
The following example uses the IDXNAME= data set option to direct SAS to use a specific index to optimize the WHERE expression. SAS will disregard the possibility that a sequential search of the data file might be more resource efficient and does not attempt to determine if the specified index is the best one. (Note that the EMPNUM index was not created with the NOMISS option.)
data mydata.empnew; set mydata.employee (idxname=empnum); where empnum < 2000;
For details, see the IDXNAME data set option in SAS Language Reference: Dictionary.
To display information in the SAS log regarding index
usage, change the value of the MSGLEVEL= system option from its default value
of N to I. When you issue
options msglevel=i;
, the following
occurs:
In this example, you create an SQL view named STAT from data file CRIME, which has the key variable STATE. In addition, the view definition includes a WHERE expression:
proc sql; create view stat as select * from crime where murder > 7; quit;
If you issue the following PRINT procedure, which refers to the SQL view, along with a WHERE statement that specifies the key variable STATE, SAS cannot optimize the WHERE statement with the index. SQL views cannot join a WHERE expression that was defined in the view to a WHERE expression that was specified in another procedure, DATA step, or SCL:
proc print data=stat; where state > 42; run;
However, if you issue PROC SQL with an SQL WHERE clause that specifies the key variable STATE, then the SQL view can join the two conditions, which allows SAS to use the index STATE:
proc sql; select * from stat where state > 42; quit;
Using an Index for BY Processing |
For example, if an index exists for LASTNAME, the following BY statement would use the index to order the values by last names:
proc print; by lastname;
When you specify a BY statement, SAS looks for an appropriate index. If one exists, the software automatically retrieves the observations from the data file in indexed order.
A BY statement will use an index in the following situations:
For example, if the variable MAJOR has a simple index, the following BY statements use the index to order the values by MAJOR:
by major; by major state;
If a composite index named ZIPID exists consisting of the variables ZIPCODE and SCHOOLID, the following BY statements use the index:
by zipcode; by zipcode schoolid; by zipcode schoolid name;
However, the composite index ZIPID is not used for these BY statements:
by schoolid; by schoolid zipcode;
In addition, a BY statement will not use an index in these situations:
Note: Using an index to process a
BY statement may not always be more efficient than simply sorting the data
file, particularly if the data file has a high blocking factor of observations
per page. Therefore, using an index for a BY statement is generally for convenience,
not performance.
Using an Index for Both WHERE and BY Processing |
With a BY statement, SAS cannot use an index to optimize a WHERE expression if the optimization would invalidate the BY order. For example, the following statements could use an index on the variable LASTNAME to optimize the WHERE expression because the order of the observations returned by the index does not conflict with the order required by the BY statement:
proc print; by lastname; where lastname >= 'Smith'; run;
However, the following statements cannot use an index on LASTNAME to optimize the WHERE expression because the BY statement requires that the observations be returned in EMPID order:
proc print; by empid; where lastname = 'Smith'; run;
Specifying an Index with the KEY= Option for SET and MODIFY Statements |
The following MODIFY statement shows how to use the KEY= option to take advantage of the fact that the data file INVTY.STOCK has an index on the variable PARTNO. Using the KEY= option tells SAS to use the index to directly access the correct observations to modify.
modify invty.stock key=partno;
Note: A BY statement is not allowed in the same
DATA step with the KEY= option, and WHERE processing is not allowed for a
data file with the KEY= option.
Taking Advantage of an Index |
Applications that typically do not use indexes can be rewritten to take advantage of an index. For example:
Maintaining Indexes |
SAS provides several procedures that you can issue to
maintain indexes, and there are several operations within SAS that automatically
maintain indexes for you.
Note: The available information depends
on the operating environment.
Output of PROC CONTENTS
The SAS System The CONTENTS Procedure Data Set Name: SASUSER.STAFF Observations: 148 Member Type: DATA Variables: 6 Engine: V8 Indexes: 2 Created: 9:59 Tuesday, May 11, 1999 Observation Length: 63 Last Modified: 10:03 Tuesday, May 11, 1999 Deleted Observations: 0 Protection: Compressed: NO Data Set Type: Sorted: NO Label: -----Engine/Host Dependent Information----- Data Set Page Size: 8192 Number of Data Set Pages: 3 First Data Page: 1 Max Obs per Page: 129 Obs in First Data Page: 104 Index File Page Size: 8192 The SAS System The CONTENTS Procedure -----Engine/Host Dependent Information----- Number of Index File Pages: 3 Number of Data Set Repairs: 0 File Name: /remote/obi01/wan0.2/u/sasXXX/sasuser.devn/staff.sas7bdat Release Created: 8.00.00B Host Created: HP-UX Inode Number: 237883 Access Permission: rw-r--r-- Owner Name: XXXXXX File Size (bytes): 32768 The SAS System The CONTENTS Procedure -----Alphabetic List of Variables and Attributes----- # Variable Type Len Pos ----------------------------------- 4 city Char 15 34 3 fname Char 15 19 6 hphone Char 12 51 1 idnum Char 4 0 2 lname Char 15 4 5 state Char 2 49 |
The SAS System The CONTENTS Procedure -----Alphabetic List of Indexes and Attributes----- Current # of Unique Update Update Unique # Index Option Centiles Percent Values Variables ---------------------------------------------------------------------------------------- 1 idnum YES 5 0 148 --- 1009 --- 1065 --- 1105 --- 1115 --- 1123 --- 1130 --- 1221 --- 1352 --- 1385 --- 1405 --- 1412 The SAS System The CONTENTS Procedure -----Alphabetic List of Indexes and Attributes----- Current # of Unique Update Update Unique # Index Option Centiles Percent Values Variables ---------------------------------------------------------------------------------------- --- 1421 --- 1429 --- 1436 --- 1475 --- 1521 --- 1616 --- 1739 --- 1845 --- 1919 --- 1995 2 names 5 0 148 fname lname --- ABDULLAH ,ALHERTANI The SAS System The CONTENTS Procedure -----Alphabetic List of Indexes and Attributes----- Current # of Unique Update Update Unique # Index Option Centiles Percent Values Variables ---------------------------------------------------------------------------------------- --- ALICE ,MURPHY --- ANTHONY ,COOPER --- CAROL ,PEARCE --- CLYDE ,HERRERO --- DIANE ,NORRIS --- ELIZABETH ,VARNER --- GRETCHEN ,HOWARD --- JAKOB ,BREWCZAK --- JEFF ,LI --- JOHN ,MARKS --- JULIA ,RODRIGUEZ --- LARRY ,UPCHURCH |
The SAS System The CONTENTS Procedure -----Alphabetic List of Indexes and Attributes----- Current # of Unique Update Update Unique # Index Option Centiles Percent Values Variables ---------------------------------------------------------------------------------------- --- LEVI ,GOLDSTEIN --- MARY ,PARKER --- NADINE ,WELLS --- RANDY ,SANYERS --- ROGER ,DENNIS --- SANDRA ,NEWKIRK --- THOMAS ,BURNETTE --- WILLIAM ,PHELPS |
If you copy from disk to disk, the index is recreated. If you copy from disk to tape, the index is not recreated on tape. However, after copying from disk to tape, if you then copy back from tape to disk, the index can be recreated. Note that if you move a data file with the MOVE option in PROC COPY, the index file is deleted from IN= library and recreated in OUT= library.
The CPORT procedure also has INDEX=YES|NO to specify
whether to export indexes with indexed data files. By default, PROC CPORT
exports indexes with indexed data files. The CIMPORT procedure, however, does
not handle the index file at all, and the index(es) must be recreated.
Task | Result |
---|---|
delete a data set | index file is deleted |
rename a data set | index file is renamed |
rename key variable | simple index is renamed |
delete key variable | simple index is deleted |
add observation | index entries are added |
delete observations | index entries are deleted and space is recovered for resuse |
update observations | index entries are deleted and new ones are inserted |
Note: Use the SAS System to perform additions, modifications
and deletions to your data sets. Using operating system commands to perform
these operations will make your files unusable.
Note: If you sort an indexed data file with the FORCE option,
the index file is deleted.
In Version 7 and later releases, SAS remembers the previous
position in the index so that when inserting more occurrences of the same
value, the end of the RID list is found quickly.
proc datasets library=mylib; repair mydata; run;
Chapter Contents |
Previous |
Next |
Top of Page |
Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.