SPEDIS



Determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words.

Category: Character


Syntax

SPEDIS(query,keyword)

Arguments

query
identifies the word to query for the likelihood of a match. SPEDIS removes trailing blanks before comparing the value.

keyword
specifies a target word for the query. SPEDIS removes trailing blanks before comparing the value.


Details

SPEDIS returns the distance between the query and a keyword. The distance is a nonnegative value that is usually less than 100 and, with the default costs, never greater than 200.

SPEDIS computes an asymmetric spelling distance between two words as the normalized cost for converting the keyword to the query word via a sequence of operations. SPEDIS(QUERY, KEYWORD) is NOT the same as SPEDIS(KEYWORD, QUERY).
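
The asymmetry can be checked by calling SPEDIS twice with the arguments swapped. The following is a minimal sketch, and the variable names D1 and D2 are illustrative. The truncate row of the example output in this section reports 12 for the first call. Under the cost table, the reversed call would be treated as an append rather than a truncate, so a different value is expected.

```sas
data _null_;
   /* 'fuzz' is the query, 'fuzzy' is the keyword */
   d1 = spedis('fuzz', 'fuzzy');
   /* same two words with the roles reversed */
   d2 = spedis('fuzzy', 'fuzz');
   /* write both distances to the SAS log for comparison */
   put d1= d2=;
run;
```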

The costs for each operation that is required to convert the keyword to the query are as follows:

Operation    Cost    Explanation
match           0    no change
singlet        25    delete one of a double letter
doublet        50    double a letter
swap           50    reverse the order of two consecutive letters
truncate       50    delete a letter from the end
append         35    add a letter to the end
delete         50    delete a letter from the middle
insert        100    insert a letter in the middle
replace       100    replace a letter in the middle
firstdel      100    delete the first letter
firstins      200    insert a letter at the beginning
firstrep      200    replace the first letter

The distance is the sum of the costs divided by the length of the query. The division uses integer arithmetic, so any fractional part of the result is discarded.
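
The effect of the integer division can be sketched with the singlet case, where the query 'fuzy' has length 4 and the cost is 25. This is an illustration of the arithmetic only, not of how SPEDIS is implemented. The variable names TOTAL_COST and RECOVERED are hypothetical.

```sas
data _null_;
   query = 'fuzy';
   total_cost = 25;                            /* singlet cost from the table   */
   dist = floor(total_cost / length(query));   /* 25 / 4 = 6.25, truncated to 6 */
   recovered = dist * length(query);           /* 6 * 4 = 24, not the full 25   */
   put dist= recovered=;                       /* writes dist=6 recovered=24    */
run;
```

Because the remainder is lost, multiplying the distance by the query length can understate the total cost, which is why the singlet row in the example output shows 24 rather than 25.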


Examples

options nodate pageno=1 linesize=64;
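data words;
   /* list input: operation name, query word, keyword */
   input oper $ query $ keyword $;
   /* normalized spelling distance for converting keyword to query */
   dist = spedis(query,keyword);
   /* approximate total cost; integer division may have dropped a remainder */
   cost = dist * length(query);
   /* write each result to the SAS log */
   put oper $10. query $10. keyword $10. 
       dist 5. cost 5.;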
datalines;
match       fuzzy        fuzzy
singlet     fuzy         fuzzy
doublet     fuuzzy       fuzzy
swap        fzuzy        fuzzy
truncate    fuzz         fuzzy
append      fuzzys       fuzzy
delete      fzzy         fuzzy
insert      fluzzy       fuzzy
replace     fizzy        fuzzy
firstdel    uzzy         fuzzy
firstins    pfuzzy       fuzzy
firstrep    wuzzy        fuzzy
several     floozy       fuzzy
;

proc print data = words;
run;

The output from PROC PRINT is as follows:

                         The SAS System                        1
      OBS    OPER        QUERY     KEYWORD    DIST    COST

        1    match       fuzzy      fuzzy       0        0
        2    singlet     fuzy       fuzzy       6       24
        3    doublet     fuuzzy     fuzzy       8       48
        4    swap        fzuzy      fuzzy      10       50
        5    truncate    fuzz       fuzzy      12       48
        6    append      fuzzys     fuzzy       5       30
        7    delete      fzzy       fuzzy      12       48
        8    insert      fluzzy     fuzzy      16       96
        9    replace     fizzy      fuzzy      20      100
       10    firstdel    uzzy       fuzzy      25      100
       11    firstins    pfuzzy     fuzzy      33      198
       12    firstrep    wuzzy      fuzzy      40      200
       13    several     floozy     fuzzy      50      300
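
One way to use these distances, sketched here as an assumption rather than as part of the original example, is to keep only the observations whose distance is at or below a cutoff. The cutoff value 15 and the data set name CLOSE are hypothetical. A suitable cutoff depends on the data.

```sas
/* Keep only rows of WORDS whose distance is at or below a cutoff. */
/* The cutoff 15 and the name CLOSE are illustrative assumptions.  */
data close;
   set words;
   if dist <= 15;   /* subsetting IF: other rows are dropped */
run;

proc print data = close;
run;
```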


Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.