Approximate String Matching

Next: Text Compression Up: Set and String Problems Previous: String Matching

Approximate String Matching

Input description: A text string t and a pattern string p. An edit cost bound k.

Problem description: Can we transform t to p using at most k insertions, deletions, and substitutions?

Discussion: Approximate string matching is fundamental to text processing, because we live in an error-prone world. Any spelling correction program must be able to identify the closest match for any text string not found in a dictionary. Genbank has become a fundamental tool for molecular biology by supporting homology (similarity) searches on DNA sequences. Suppose you were to sequence a new gene in man, and you discovered that it is similar to the hemoglobin gene in rats. It is likely that this new gene also produces hemoglobin, and any differences are the result of genetic mutations during evolution.

I once encountered approximate string matching in evaluating the performance of an optical character recognition system that we built. After scanning and recognizing a test document, we needed to compare the correct answers with those produced by our system. To improve our system, it was important to count how often each pair of letters were getting confused and to identify gibberish when the program was trying to make out letters where none existed. The solution was to do an alignment between the two texts. Insertions and deletions corresponded to gibberish, while substitutions signaled errors in our recognizers. This same principle is used in file difference programs, which identify the lines that have changed between two versions of a file.

When no errors are permitted, the problem becomes that of exact string matching, which is presented in Section . Here, we restrict our discussion to dealing with errors.

Dynamic programming provides the basic approach to approximate string matching, as discussed in Section . Let D[i,j] denote the cost of editing the first i characters of the text string t into the first j characters of the pattern string p. The recurrence follows because we must have done something with and . The only options are matching them, substituting one for the other, deleting , or inserting a match for . Thus D[i,j] is the minimum of the costs of these possibilities:

If then D[i-1,j-1] else D[i-1,j-1] + substitution cost.
D[i-1,j] + deletion cost of .
D[i,j-1] + deletion cost of .

Several issues remain before we can make full use of this recurrence:

Do I want to match the pattern against the full text, or against a substring? - Appropriately initializing the boundary conditions of the recurrence distinguishes between the algorithms for string matching and substring matching. Suppose we want to align the full text against the full pattern. Then the cost of D[i,0] must be that of deleting the first i characters of the text, so D[i,0] = i. Similarly, D[0,j] = j.
Now suppose that the pattern may occur anywhere within the text. The proper cost of D[i,0] is now 0, since there should be no penalty for starting the alignment in the ith position. The cost of D[0,j] remains j, however, since the only way to match the first j pattern characters with nothing is to delete all of them. The cost of the best pattern match against the text will be given by .
How should I select the substitution and insertion/deletion costs? - The basic algorithm can be easily modified to use different costs for insertion, deletion, and the substitutions of specific pairs of characters. Which costs you use depends upon what you are planning to do with the alignment.
The most common cost assignment charges the same for insertion, deletion, or substitution. Charging a substitution cost of more than insertion + deletion is a good way to ensure that substitutions never get performed, since it will be cheaper to edit both characters out of the string. With just insertion and deletion to work with, the problem reduces to longest common subsequence, discussed in Section . Often, it pays to tweak the edit distance costs and study the resulting alignments until you find a set that does the job.
How do I get the actual alignment of the strings, instead of just the cost? - To obtain the transcript of the editing operations performed in achieving the minimum cost alignment, we can work backwards from the complete cost matrix D. To get to cell D[n,m], we had to come from one of D[n-1,m] (insertion), D[n,m-1] (deletion), or D[n-1,m-1] (substitution/match). Which of the three options was chosen can be reconstructed given these costs and the characters and . By continuing to work backwards from the previous cell, we can trace the entire path and thus reconstruct the alignment.
What if the two strings are very similar to each other? - The dynamic programming algorithm can be thought of as finding a shortest path across an grid, where the cost of each edge depends upon which operation it represents. If we seek an alignment involving a combination of at most d insertions, deletions, and substitutions, we need only traverse the band of O(d n) cells within a distance d of the central diagonal. If no such low-cost alignment exists within this band, then no low-cost alignment can exist in the full cost matrix.
How can I minimize the required storage? - Dynamic programming takes both quadratic time and space. For many applications, the space required to store the dynamic programming table is a much more serious problem. Observe that only space is needed to compute D[m,n], since we need only maintain two active rows (or columns) of the matrix in order to compute the final value. The entire matrix is required only if we need to reconstruct the actual sequence alignment.
To save space, we can use Hirschberg's clever recursive algorithm. During one pass of the linear-space algorithm above to compute D[m,n], we maintain all the values for the (m/2)nd column and identify which middle-element cell D[m/2,x] was used to optimize D[m,n]. This reduces our problem to finding the best paths from D[1,1] to D[m/2,x] and from D[m/2,x] to D[m/2,n], both of which can be solved recursively. Each time we throw away half of the matrix elements from consideration, and so the total time remains O(mn). This linear-space algorithm proves to be a big win in practice on long strings, although it is somewhat more difficult to program.
Does string similarity mean that the strings sound alike? - Other models of approximate pattern matching become more appropriate than edit distance in certain applications. Particularly interesting is Soundex, a hashing scheme that attempts to pair up English words that sound alike. This can be useful in testing whether two names that have been spelled differently are likely to be the same. For example, my last name is often spelled ``Skina'', ``Skinnia'', ``Schiena'', and occasionally ``Skiena''. All of these hash to the same Soundex code, S25.
The algorithm works by dropping vowels and silent letters, removing doubled letters, and then assigning the remaining letters numbers from the following classes: BFPV gets a 1, CGJKQSXZ gets a 2, DT gets a 3, L gets a 4, MN gets a 5, and R gets a 6. The code starts with the first letter and contains at most three digits. Although this all sounds very hokey, experience shows that it works reasonably well. Experience indeed: Soundex has been used since the 1920's.

Implementations: The best available software tools for approximate pattern matching are glimpse and agrep [WM92a, WM92b], developed by Manber and Wu at the University of Arizona and available from http://glimpse.cs.arizona.edu/. Glimpse is a tool for building and using an index to search through file systems, while agrep (approximate general regular expression pattern matcher) is a tool supporting text search with spelling errors. Both programs are widely used and respected.

ht://Dig is an alternative WWW text search engine from Andrew Scherpbier, which contains implementations of Soundex and Metaphone. It is available from http://htdig.sdsu.edu/ and is released under the GNU general public license.

Implementations in C of the Soundex and dynamic programming edit-distance algorithms appear in [BR95]. The code for these algorithms is printed in the text and is available on disk for a modest fee.

Bare bones implementations in C and Pascal of several algorithms for exact and approximate string matching appear in [GBY91]. See Section for further details.

Notes: The wide range of applications for approximate string matching was made apparent in Sankoff and Kruskal's book [SK83], which remains a useful historical reference for the problem. Surveys on approximate pattern matching include [HD80]. The basic dynamic programming alignment algorithm is attributed to [WF74], although it is apparently folklore. The edit distance between two strings is sometimes referred to as the Levenshtein distance. Expositions of dynamic programming to compute Levenshtein distance include [Baa88, CLR90, Man89]. Expositions of Hirschberg's linear-space algorithm [Hir75] include [CR94, Gus97].

Masek and Paterson [MP80] compute the edit distance between m- and n-length strings in time for constant-sized alphabets, using ideas from the four Russians algorithm for Boolean matrix multiplication [ADKF70]. The shortest path formulation leads to a variety of algorithms that are good when the edit distance is small, including an algorithm due to Myers [Mye86]. Longest increasing subsequence can be done in time [HS77], as presented in [Man89].

Soundex was invented and patented by M. K. Odell and R. C. Russell. Expositions on Soundex include [BR95, Knu73b]. Metaphone is a recent attempt to improve on Soundex [BR95, Par90].

Related Problems: String matching (see page ), longest common substring (see page ).

Next: Text Compression Up: Set and String Problems Previous: String Matching

Algorithms
Mon Jun 2 23:33:50 EDT 1997