next up previous contents index CD CD Algorithms
Next: Shortest Common Superstring Up: Set and String Problems Previous: Finite State Machine Minimization

Longest Common Substring

  

  figure22785

Input description: A set S of strings tex2html_wrap_inline31069 .

Problem description: What is the longest string S' such that for each tex2html_wrap_inline31073 , tex2html_wrap_inline31075 , the characters of S appear as a subsequence of tex2html_wrap_inline31077 ?

Discussion: The problem of longest common subsequence arises whenever we search for similarities across multiple texts. A particularly important application is in finding a consensus among DNA sequences. The genes for building particular proteins evolve with time, but the functional regions must remain consistent in order to work correctly. By finding the longest common subsequence of the same gene in different species, we learn what has been conserved over time.     

The longest common substring problem is a special case of edit distance (see Section gif), when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations. Under these conditions, the edit distance between p and t is n+m-2 |lcs(p,t)|, since we can delete the missing characters from p to the lcs(p,t) and insert the missing characters from t to transform p to t. This is particularly interesting because the longest common subsequence can be faster to compute than edit distance.   

Issues arising include:

Implementations: MAP (Multiple Alignment Program) [Hua94] by Xiaoqiu Huang is a C language program that computes a global multiple alignment of sequences using an iterative pairwise method. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.

Combinatorica [Ski90] provides a Mathematica implementation of an algorithm to construct the longest increasing subsequence of a permutation, which is a special case of longest common subsequence. This algorithm is based on Young tableaux rather than dynamic programming. See Section gif.    

Notes: Good expositions on longest common subsequence include [AHU83, CLR90]. A survey of algorithmic results appears in [GBY91]. The algorithm for the case where all the characters in each sequence are distinct or infrequent is due to Hunt and Szymanski [HS77]. Expositions of this algorithm include [Aho90, Man89]. Multiple sequence alignment for computational biology is treated in [Wat95].

Certain problems on strings become easier when we assume a constant-sized alphabet. Masek and Paterson [MP80] solve longest common subsequence in tex2html_wrap_inline31137 for constant-sized alphabets, using the four Russians technique.  

Related Problems: Approximate string matching (see page gif), shortest common superstring (see page gif).    


next up previous contents index CD CD Algorithms
Next: Shortest Common Superstring Up: Set and String Problems Previous: Finite State Machine Minimization

Algorithms
Mon Jun 2 23:33:50 EDT 1997