next up previous contents index CD CD Algorithms
Next: Approximate String Matching Up: Set and String Problems Previous: Set Packing

String Matching

  

  figure21657

Input description: A text string t of length n. A pattern string p of length m.

Problem description: Find the first (or all) instances of the pattern in the text.

Discussion: String matching is fundamental to database and text processing applications. Every text editor must contain a mechanism to search the current document for arbitrary strings. Pattern matching programming languages such as Perl and Awk derive much of their power from their built-in string matching primitives, making it easy to fashion programs that filter and modify text. Spelling checkers scan an input text for words in the dictionary and reject any strings that do not match.       

Several issues arise in identifying the right string matching algorithm for the job:

Implementations: SPARE Parts is a string pattern recognition toolkit, written in C++ by Bruce Watson. It provides production-quality implementations of all major variants of the classical string matching algorithms for single patterns (both Knuth-Morris-Pratt and Boyer-Moore) and multiple patterns (both Aho-Corasick and Commentz-Walter). SPARE Parts is available by anonymous ftp from ftp.win.tue.nl in /pub/techreports/pi/watson.phd/. A greatly improved commercial version is available from www.RibbitSoft.com.    

XTango (see Section gif) provides animations for both the Boyer-Moore and Knuth-Morris-Pratt algorithms. The C source code for each animation is included.   

Implementations in C and Pascal of several algorithms for exact and approximate string matching appear in [GBY91]. Sedgewick provides similar implementations of Knuth-Morris-Pratt, Rabin-Karp, and Boyer-Moore in C++. See Section gif for details on both codes.  

Implementations in C of the Boyer-Moore, Aho-Corasick, and regular expression matching algorithms appear in [BR95]. The code for these algorithms is printed in the text and available on disk for a modest fee.

Notes: All books on string algorithms contain thorough discussions of exact string matching, including [CR94, Ste94, Gus97]. Good expositions on the Boyer-Moore [BM77] and Knuth-Morris-Pratt algorithms [KMP77] include [Baa88, CLR90, Man89].

Aho [Aho90] provides a good survey on algorithms for pattern matching in strings, particularly where the patterns are regular expressions instead of strings, and for the Aho-Corasick algorithm for multiple patterns [AC75]. An algorithm merging Aho-Corasick and Boyer-Moore can be faster for small numbers of patterns [CW79], but the window where it wins is apparently fairly narrow.

Empirical comparisons of string matching algorithms include [DB86, Hor80, dVS82]. Which algorithm performs best depends upon the properties of the strings and the size of the alphabet. For long patterns and texts, I recommend that you use the best implementation of Boyer-Moore that you can find.  

An interesting classical problem is determining the minimum number of comparisons needed to perform exact string matching. A version of Boyer-Moore never makes more than 2n comparisons independent of the number of occurrences of the pattern in the text [AG86]. More recent results are very technical, depending upon the details of the model and the alphabet size. There is a lower bound of n-m+1 text characters that any algorithm must be examine in the worst case [Riv77]. The history of string matching algorithms is somewhat checkered because several published proofs were incorrect or incomplete [Gus97].

The Karp-Rabin algorithm [KR87] uses a hash function to perform string matching in linear expected time. Its worst-case time remains quadratic, and its performance in practice appears somewhat worse than the character comparison methods described above.  

Related Problems: Suffix trees (see page gif), approximate string matching (see page gif).    


next up previous contents index CD CD Algorithms
Next: Approximate String Matching Up: Set and String Problems Previous: Set Packing

Algorithms
Mon Jun 2 23:33:50 EDT 1997