next up previous contents index CD CD Algorithms
Next: Text Compression Up: Set and String Problems Previous: String Matching

Approximate String Matching

  

  figure21893

Input description: A text string t and a pattern string p. An edit cost bound k.

Problem description: Can we transform t to p using at most k insertions, deletions, and substitutions?

Discussion: Approximate string matching is fundamental to text processing, because we live in an error-prone world. Any spelling correction program must be able to identify the closest match for any text string not found in a dictionary. Genbank has become a fundamental tool for molecular biology by supporting homology (similarity) searches on DNA sequences. Suppose you were to sequence a new gene in man, and you discovered that it is similar to the hemoglobin gene in rats. It is likely that this new gene also produces hemoglobin, and any differences are the result of genetic mutations during evolution.      

I once encountered approximate string matching in evaluating the performance of an optical character recognition system that we built. After scanning and recognizing a test document, we needed to compare the correct answers with those produced by our system. To improve our system, it was important to count how often each pair of letters were getting confused and to identify gibberish when the program was trying to make out letters where none existed. The solution was to do an alignment between the two texts. Insertions and deletions corresponded to gibberish, while substitutions signaled errors in our recognizers. This same principle is used in file difference programs, which identify the lines that have changed between two versions of a file.         

When no errors are permitted, the problem becomes that of exact string matching, which is presented in Section gif. Here, we restrict our discussion to dealing with errors.   

Dynamic programming provides the basic approach to approximate string matching, as discussed in Section gif. Let D[i,j] denote the cost of editing the first i characters of the text string t into the first j characters of the pattern string p. The recurrence follows because we must have done something with tex2html_wrap_inline30771 and tex2html_wrap_inline30773 . The only options are matching them, substituting one for the other, deleting tex2html_wrap_inline30775 , or inserting a match for tex2html_wrap_inline30777 . Thus D[i,j] is the minimum of the costs of these possibilities:

  1. If tex2html_wrap_inline30781 then D[i-1,j-1] else D[i-1,j-1] + substitution cost.
  2. D[i-1,j] + deletion cost of tex2html_wrap_inline30789 .
  3. D[i,j-1] + deletion cost of tex2html_wrap_inline30793 .

Several issues remain before we can make full use of this recurrence:

Implementations: The best available software tools for approximate pattern matching are glimpse and agrep [WM92a, WM92b], developed by Manber and Wu at the University of Arizona and available from http://glimpse.cs.arizona.edu/. Glimpse is a tool for building and using an index to search through file systems, while agrep (approximate general regular expression pattern matcher) is a tool supporting text search with spelling errors.    Both programs are widely used and respected.

ht://Dig is an alternative WWW text search engine from Andrew Scherpbier, which contains implementations of Soundex and Metaphone. It is available from http://htdig.sdsu.edu/ and is released under the GNU general public license.  

Implementations in C of the Soundex and dynamic programming edit-distance algorithms appear in [BR95]. The code for these algorithms is printed in the text and is available on disk for a modest fee.   

Bare bones implementations in C and Pascal of several algorithms for exact and approximate string matching appear in [GBY91]. See Section gif for further details.

Notes: The wide range of applications for approximate string matching was made apparent in Sankoff and Kruskal's book [SK83], which remains a useful historical reference for the problem. Surveys on approximate pattern matching include [HD80]. The basic dynamic programming alignment algorithm is attributed to [WF74], although it is apparently folklore. The edit distance between two strings is sometimes referred to as the Levenshtein distance. Expositions of dynamic programming to compute Levenshtein distance include [Baa88, CLR90, Man89]. Expositions of Hirschberg's linear-space algorithm [Hir75] include [CR94, Gus97].

Masek and Paterson [MP80] compute the edit distance between m- and n-length strings in time tex2html_wrap_inline30845 for constant-sized alphabets, using ideas from the four Russians algorithm for Boolean matrix multiplication [ADKF70].   The shortest path formulation leads to a variety of algorithms that are good when the edit distance is small, including an tex2html_wrap_inline30847 algorithm due to Myers [Mye86]. Longest increasing subsequence can be done in tex2html_wrap_inline30849 time [HS77], as presented in [Man89].  

Soundex was invented and patented by M. K. Odell and R. C. Russell. Expositions on Soundex include [BR95, Knu73b]. Metaphone is a recent attempt to improve on Soundex [BR95, Par90].   

Related Problems: String matching (see page gif), longest common substring (see page gif).    


next up previous contents index CD CD Algorithms
Next: Text Compression Up: Set and String Problems Previous: String Matching

Algorithms
Mon Jun 2 23:33:50 EDT 1997