RXPARSE |
Category: | Character String Matching |
Syntax |
rx=RXPARSE(pattern-expression) |
Syntax Description |
Arguments |
rx=rxparse("$'a-z'");
rx=rxparse("^'a-d'");
Tip: | You can use an exclamation point (!) instead of a vertical bar (|). |
See: | Reusing Character Classes |
See: | Pattern Abbreviations, Default Character Classes |
See: | Matching Balanced Symbols |
See: | Special Symbols |
See: | Scores |
See: | Tag Expression |
See: | Change Expressions |
See: | Change Items |
Character Classes |
$a or $A | matches any alphabetic upper- or lowercase letter in a substring ($'a-zA-Z'). |
$c or $C | matches any character allowed in a version 6 SAS name that is found in a substring ($'0-9a-zA-Z_'). |
$d or $D | matches any digit in a substring ($'0-9'). |
$i or $I | matches any initial character in a version 6 SAS name that is found in a substring ($'a-zA-Z_'). |
$l or $L | matches any lowercase letter in a substring ($'a-z'). |
$u or $U | matches any uppercase letter in a substring ($'A-Z'). |
$w or $W | matches any white space character, such as blank, tab, backspace, carriage return, etc., in a substring. |
See also: | Character Class Complements |
Note: A hyphen appearing at the beginning or end of a character class is treated as a member of the class rather than as a range symbol.
This statement and these values produce these matches.
rx=rxparse("$character-class");
Pattern | Input string | Position of match | Value of match |
---|---|---|---|
$L or $l | 3+Y STRIkeS | 9 | k |
$U or $u | 0*5x49XY | 7 | X (uppercase) |
The following example shows how to use a default character class in a DATA step.
data _null_;
   stringA='3+Y STRIkeS';
   rx=rxparse("$L");
   matchA = rxmatch(rx,stringA);
   valueA=substr(stringA,matchA,1);
   put 'Example A: ' matchA = valueA= ;
run;

data _null_;
   stringA2='0*5x49XY';
   rx=rxparse("$u");
   matchA2 = rxmatch(rx,stringA2);
   valueA2 = substr(stringA2, matchA2,1);
   put 'Example A2: ' matchA2 = valueA2= ;
run;
The SAS log shows the following results:
Example A: matchA=9 valueA=k
Example A2: matchA2=7 valueA2=X
Note: Ranges of values are indicated by a hyphen (-).
This statement and these values produce these matches.
rx=rxparse("$'pattern'");
Pattern | Input string | Position of match | Value of match |
---|---|---|---|
$'abcde' | 3+yE strikes | 11 | e |
$'1-9' | z0*549xy | 4 | 5 |
The following example shows how to use a user-defined character class in a DATA step.
data _null_;
   stringB='3+yE strikes';
   rx=rxparse("$'abcde'");
   matchB = rxmatch(rx,stringB);
   valueB=substr(stringB,matchB,1);
   put 'Example B: ' matchB= valueB= ;
run;

data _null_;
   stringB2='z0*549xy';
   rx=rxparse("$'1-9'");
   matchB2=rxmatch(rx,stringB2);
   valueB2=substr(stringB2,matchB2,1);
   put 'Example B2: ' matchB2= valueB2= ;
run;
The SAS log shows the following results:
Example B: matchB=11 valueB=e
Example B2: matchB2=4 valueB2=5
You can also define your own character class complements.
For details about character class complements, see Character Class Complements.
A character class complement begins with a caret (^) or a tilde (~) and is followed by a string in quotation marks. A character class complement matches any one character that is not matched by the corresponding character class. For details about character classes, see Character Classes.
This statement and these values produce these matches.
rx=rxparse(^character-class | ~character-class);
Pattern | Input string | Position of match | Value of match |
---|---|---|---|
^u or ~u | 0*5x49XY | 1 | 0 |
^'A-z' or ~'A-z' | Abc de45 | 4 | the first space |
The following example shows how to use a character class complement in a DATA step.
data _null_;
   stringC='0*5x49XY';
   rx=rxparse('^u');
   matchC = rxmatch(rx,stringC);
   valueC=substr(stringC,matchC,1);
   put 'Example C: ' matchC = valueC=;
run;

data _null_;
   stringC2='Abc de45';
   rx=rxparse("~'A-z'");
   matchC2=rxmatch(rx,stringC2);
   valueC2=substr(stringC2,matchC2,1);
   put 'Example C2: ' matchC2= valueC2= ;
run;
The SAS log shows the following results:
Example C: matchC=1 valueC=0
Example C2: matchC2=4 valueC2=
You can reuse character classes you previously defined by using one of the following patterns:
$int | reuses the intth character class that you defined. |
Restriction: | int is a nonzero integer. |
Example: | If you defined a character class in a pattern and want to use the same character class again in the same pattern, use $int to refer to the intth character class you defined. If int is negative, count backwards from the last pattern to identify the character class for -int. For example,
rx=rxparse("$'AB' $1 $'XYZ' $2 $-2");
is equivalent to
rx=rxparse("$'AB' $'AB' $'XYZ' $'XYZ' $'AB'"); |
~int | reuses the complement of the intth character class. |
Restriction: | int is a nonzero integer. |
Example: | This example shows character-class elements ($'Al', $'Jo', $'Li') and reuse numbers ($1, $2, $3, ~2):
rx=rxparse("$'Al' $1 $'Jo' $2 $'Li' $3 ~2");
is equivalent to
rx=rxparse("$'Al' $'Al' $'Jo' $'Jo' $'Li' $'Li' $'Al' $'Li'");
The ~2 matches patterns 1 (Al) and 3 (Li), and excludes pattern 2 (Jo). |
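The reuse notation can be tried in a short DATA step. The following is a minimal sketch rather than an example from the original page: the input string 'xxBAyy' and the variable names are hypothetical, and it assumes that blanks between pattern elements are not significant, as in the patterns shown above.

```sas
data _null_;
   text='xxBAyy';                      /* hypothetical input string */
   /* $1 reuses the first character class, $'AB', so the pattern   */
   /* asks for two consecutive characters taken from that class.   */
   rx=rxparse("$'AB' $1");
   call rxsubstr(rx,text,position,length);
   /* position is 0 when no match is found */
   put 'Reuse sketch: ' position= length=;
   call rxfree(rx);                    /* release the parsed pattern */
run;
```

Writing the class once and referring to it by number keeps long patterns shorter and avoids typing the same class definition twice.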
Pattern Abbreviations |
You can use the following list of elements in your pattern:
$f or $F | matches a floating point number. |
$n or $N | matches a SAS name. |
$p or $P | indicates a prefix option. |
$q or $Q | matches a string in quotation marks. |
$s or $S | indicates a suffix option. |
This statement and input string produce these matches.
rx=rxparse("$pattern-abbreviation pattern");
Pattern | Input string | Position of match | Value of match |
---|---|---|---|
$p wood | woodchucks eat wood | 1 | characters "wood" in woodchucks |
wood $s | woodchucks eat firewood | 20 | wood |
The following example shows how to use a pattern abbreviation in a DATA step.
data _null_;
   stringD='woodchucks eat firewood';
   rx=rxparse("$p 'wood'");
   PositionOfMatchD=rxmatch(rx,stringD);
   call rxsubstr(rx,stringD,positionD,lengthD);
   valueD=substr(stringD,PositionOfMatchD);
   put 'Example D: ' lengthD= valueD= ;
run;

data _null_;
   stringD2='woodchucks eat firewood';
   rx=rxparse("'wood' $s");
   PositionOfMatchD2=rxmatch(rx,stringD2);
   call rxsubstr(rx,stringD2,positionD2,lengthD2);
   valueD2=substr(stringD2,PositionOfMatchD2);
   put 'Example D2: ' lengthD2= valueD2= ;
run;
The SAS log shows the following results:
Example D: lengthD=4 valueD=woodchucks eat firewood
Example D2: lengthD2=4 valueD2=wood
Matching Balanced Symbols |
Restriction: | int is a positive integer. |
Tip: | Using smaller values increases the efficiency of finding a match. |
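Example: | This statement and input string produce this match.
rx=rxparse("$(2)");
Pattern | Input string | Position of match | Value of match |
---|---|---|---|
$(2) | (((a+b)*5)/43) | 2 | ((a+b)*5) |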
The following example shows how to use mathematical symbol matching in a DATA step.
data _null_;
   stringE='(((a+b)*5)/43)';
   rx=rxparse("$(2)");
   call rxsubstr(rx,stringE,positionE,lengthE);
   PositionOfMatchE=rxmatch(rx,stringE);
   valueE=substr(stringE,PositionOfMatchE);
   put 'Example E: ' lengthE= valueE= ;
run;
The SAS log shows the following results:
Example E: lengthE=9 valueE=((a+b)*5)/43)
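In Example E, SUBSTR receives only a starting position, so valueE runs from the start of the match to the end of the string. To extract just the matched text, pass SUBSTR both the position and the length that CALL RXSUBSTR returns. The following is a minimal sketch under that assumption; it reuses the names from Example E and adds only a guard for the no-match case.

```sas
data _null_;
   stringE='(((a+b)*5)/43)';
   rx=rxparse("$(2)");
   call rxsubstr(rx,stringE,positionE,lengthE);
   /* A position of 0 means that the pattern was not found. */
   if positionE>0 then do;
      /* Limit the extraction to the matched length. */
      valueE=substr(stringE,positionE,lengthE);
      put 'Matched text only: ' valueE=;
   end;
   call rxfree(rx);                    /* release the parsed pattern */
run;
```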
Special Symbols |
You can use the following list of special symbols in your pattern:
\ | sets the beginning of a match to the current position. |
/ | sets the end of a match to the current position. |
$# | requests the match with the highest score, regardless of the starting position. |
$- | scans a string from right to left. |
$@ | requires the match to begin where the scan of the text begins. |
The following table shows how a pattern matches an input string.
Pattern | Input string | Value of match |
---|---|---|
c\ow | How now brown cow? | characters "ow" in cow |
ow/n | How now brown cow? | characters "ow" in brown |
@3:\ow | How now brown cow? | characters "ow" in now |
The following example shows how to use special symbol matching in a DATA step.
data _null_;
   stringF='How now brown cow?';
   rx=rxparse("$'c\ow'");
   matchF=rxmatch(rx,stringF);
   valueF=substr(stringF,matchF,2);
   put 'Example F= ' matchF= valueF= ;
run;

data _null_;
   stringF2='How now brown cow?';
   rx=rxparse("@3:\ow");
   matchF2=rxmatch(rx,stringF2);
   valueF2=substr(stringF2,matchF2,2);
   put 'Example F2= ' matchF2= valueF2= ;
run;
The SAS log shows the following results:
Example F= matchF=2 valueF=ow
Example F2= matchF2=6 valueF2=ow
Scores |
The score for any substring begins at zero. When #int is encountered in the pattern, the value of int is added to the score. If two or more matching substrings begin at the same leftmost position, SAS selects the substring with the highest score value. If two substrings begin at the same leftmost position and have the same score value, SAS selects the longer substring. The following is a list of score representations:
#int | adds int to the score, where int is a positive or negative integer. |
#*int | multiplies the score by nonnegative int. |
#/int | divides the score by positive int. |
#=int | assigns the value of int to the score. |
#>int | finds a match if the current score exceeds int. |
Tag Expression |
You can assign a substring of the string being searched to a character variable with the expression name=<pattern>, where pattern specifies any pattern expression. The substring matched by this expression is assigned to the variable name.
If you enclose a pattern in less-than/greater-than symbols (<>) and do not specify a variable name, SAS automatically assigns the pattern to a variable. SAS assigns the variable _1 to the first occurrence of the pattern, _2 to the second occurrence, etc. This assignment is called tagging. SAS tags the corresponding substring of the matched string.
The following shows the syntax of a tag expression:
Change Expressions |
A pattern change operation replaces a matched string by concatenating values to the replacement string. The operation concatenates
You can have multiple parallel operations within the RXPARSE argument. In the following example,
rx=rxparse("x TO y, y TO x");
each x in a substring is replaced by y, and each y in a substring is replaced by x.
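A parallel change can be tried with CALL RXCHANGE. The following is a minimal sketch, not an example from the original page: the input value 'xyzzy' and the variable names original and swapped are hypothetical, and the value 999 for the maximum number of changes simply follows the Example section of this page.

```sas
data _null_;
   length swapped $20;                 /* receives the changed copy */
   original='xyzzy';                   /* hypothetical input value  */
   rx=rxparse("x TO y, y TO x");
   /* Apply up to 999 changes and write the result to swapped,     */
   /* leaving original untouched.                                  */
   call rxchange(rx,999,original,swapped);
   put original= swapped=;
   call rxfree(rx);                    /* release the parsed pattern */
run;
```

Writing to a separate variable, as here, makes it easy to compare the input with the result in the log.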
A change expression can include the items in the following list. Each item in the list is followed by the description of the value concatenated to the replacement string at the position of the pointer.
Change Items |
@int | moves the pointer to column int where the next string added to the replacement string will start. |
@= | moves the pointer one column past the end of the matched substring. |
>int | moves the pointer to the right to column int. If the pointer is already to the right of column int, the pointer is not moved. |
>= | moves the pointer to the right, one column past the end of the matched substring. |
<int | moves pointer to the left to column int. If the pointer is already to the left of column int, the pointer is not moved. |
<= | moves the pointer to the left, one column past the end of the matched substring. |
+int | moves the pointer int columns to the right. |
-int | moves the pointer int columns to the left. |
-L | left-aligns the result of the previous item or expression in parentheses. |
-R | right-aligns the result of the previous item or expression in parentheses. |
-C | centers the result of the previous item or expression in parentheses. |
*int | repeats the result of the previous item or expression in parentheses int-1 times, producing a total of int copies. |
Details |
" 'O' '' connor" matches an uppercase O, followed by a single quotation mark, followed by the letters "connor" in either upper or lower case.
Comparisons |
The regular expression (RX) functions and CALL routines work together to manipulate strings that match patterns. Use the RXPARSE function to parse a pattern you specify. Use the RXMATCH function and the CALL RXCHANGE and CALL RXSUBSTR routines to match or modify your data. Use the CALL RXFREE routine to free allocated space.
Note: Use RXPARSE only with other regular expression (RX) functions and CALL routines.
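The division of labor described above can be sketched as a single DATA step that parses once, matches on every observation, and frees the pattern at the end. This is an illustrative sketch only: the data set name notes, the variable text, and the sample values are hypothetical, and only RXPARSE, RXMATCH, and CALL RXFREE from this page are used.

```sas
data notes;                            /* hypothetical sample data */
   input text $20.;
   datalines;
call 555 now
no digits here
;

data _null_;
   set notes end=last;
   retain rx;                          /* keep the parsed pattern across observations */
   if _n_=1 then rx=rxparse("$d");     /* parse the pattern only once */
   position=rxmatch(rx,text);          /* 0 means that no digit was found */
   if position>0 then put text= position=;
   else put text= '(no match)';
   if last then call rxfree(rx);       /* free the allocated space after the last observation */
run;
```

Parsing inside the `_n_=1` test, as the Example section also does, avoids re-parsing the same pattern for every observation.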
Example |
The following example uses RXPARSE to parse an input string and change the value of the string.
data test;
   input string $;
   datalines;
abcxyzpq
xyyzxyZx
x2z..X7z
;

data _null_;
   set;
   length to $20;
   if _n_=1 then
      rx=rxparse("` x < ? > 'z' to ABC =1 '@#%'");
   retain rx;
   drop rx;
   put string=;
   match=rxmatch(rx,string);
   put @3 match=;
   call rxsubstr(rx,string,position);
   put @3 position=;
   call rxsubstr(rx,string,position,length,score);
   put @3 position= Length= Score=;
   call rxchange(rx,999,string,to);
   put @3 to=;
   call rxchange(rx,999,string);
   put @3 'New ' string=;
run;
      cpu time            0.05 seconds

1    data test;
2       input string $;
3       datalines;

NOTE: The data set WORK.TEST has 3 observations and 1 variables.
NOTE: DATA statement used:
      real time           0.34 seconds
      cpu time            0.21 seconds

7    ;
8
9    data _null_;
10      set;
11      length to $20;
12      if _n_=1 then
13         rx=rxparse("` x < ? > 'z' to ABC =1 '@#%'");
14      retain rx;
15      drop rx;
16      put string=;
17      match=rxmatch(rx,string);
18      put @3 match=;
19      call rxsubstr(rx,string,position);
20      put @3 position=;
21      call rxsubstr(rx,string,position,length,score);
22      put @3 position= Length= Score=;
23      call rxchange(rx,999,string,to);
24      put @3 to=;
25      call rxchange(rx,999,string);
26      put @3 'New ' string=;
27   run;

string=abcxyzpq
  match=4
  position=4
  position=4 length=3 score=0
  to=abcabcy@#%pq
  New string=abcabcy@
string=xyyzxyZx
  match=0
  position=0
  position=0 length=0 score=0
  to=xyyzxyZx
  New string=xyyzxyZx
string=x2z..X7z
  match=1
  position=1
  position=1 length=3 score=0
  to=abc2@#%..Abc7@#%
  New string=abc2@#%.
NOTE: DATA statement used:
      real time           0.67 seconds
      cpu time            0.45 seconds
See Also |
Functions and CALL routines:
Aho, Hopcroft, and Ullman, Chapter 9 (See References) |
Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.