Longest Common Substring

Next: Shortest Common Superstring Up: Set and String Problems Previous: Finite State Machine Minimization

Longest Common Substring

Input description: A set S of strings .

Problem description: What is the longest string S' such that for each , , the characters of S appear as a subsequence of ?

Discussion: The problem of longest common subsequence arises whenever we search for similarities across multiple texts. A particularly important application is in finding a consensus among DNA sequences. The genes for building particular proteins evolve with time, but the functional regions must remain consistent in order to work correctly. By finding the longest common subsequence of the same gene in different species, we learn what has been conserved over time.

The longest common substring problem is a special case of edit distance (see Section ), when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations. Under these conditions, the edit distance between p and t is n+m-2 |lcs(p,t)|, since we can delete the missing characters from p to the lcs(p,t) and insert the missing characters from t to transform p to t. This is particularly interesting because the longest common subsequence can be faster to compute than edit distance.

Issues arising include:

Are you looking for a common substring or scattered subsequence? - In detecting plagiarism or attempting to identify the authors of anonymous works, we might need to find the longest phrases shared between several documents. Since phrases are strings of consecutive characters, we need the longest common substring between the texts.
The longest common substring of a set of strings can be identified in linear time using suffix trees, discussed in Section . The trick is to build a suffix tree containing all the strings, label each leaf with the set of strings that contain it, and then do a depth-first traversal to identify the deepest node that has descendents from each input string.
For the rest of this discussion, we will restrict attention to finding common scattered subsequences. Dynamic programming can be used to find the longest common subsequence of two strings, S and T, of n and m characters each. This algorithm is a special case of the edit-distance computation of Section . Let M[i,j] denote the number of characters in the longest common substring of and . In general, if , there is no way the last pair of characters could match, so . If S[i] = T[j], we have the option to select this character for our substring, so . This gives a recurrence that computes M, and thus finds the length of the longest common subsequence in O(nm) time. We can reconstruct the actual common substring by walking backward from M[n,m] and establishing which characters were actually matched along the way.
What if there are relatively few sets of matching characters? - For strings that do not contain too many copies of the same character, there is a faster algorithm. Let r be the number of pairs of positions (i,j) such that . Thus r can be as large as mn if both strings consist entirely of the same character, but r = n if the two strings are permutations of . This technique treats the pairs of r as defining points in the plane.
The complete set of r such points can be found in O(n + m + r) time by bucketing techniques. For each string, we create a bucket for each letter of the alphabet and then partition all of its characters into the appropriate buckets. For each letter c of the alphabet, create a point (,t) from every pair and , where and are buckets for c.
A common substring represents a path through these points that moves only up and to the right, never down or to the left. Given these points, the longest such path can be found in time. We will sort the points in order of increasing x-coordinate (breaking ties in favor of increasing y-coordinate. We will insert these points one by one in this order, and for each k, , maintain the minimum y-coordinate of any path going through exactly k points. Inserting a new point will change exactly one of these paths by reducing the y-coordinate of the path whose last point is barely greater than the new point.
What if the strings are permutations? - If the strings are permutations, then there are exactly n pairs of matching characters, and the above algorithm runs in time. A particularly important case of this occurs in finding the longest increasing subsequence of a sequence of numbers. Sorting this sequence and then replacing each number by its rank in the total order gives us a permutation p. The longest common subsequence of p and gives the longest increasing subsequence.
What if we have more than two strings to align? - The basic dynamic programming algorithm can be generalized to k strings, taking time, where n is the length of the longest string. This algorithm is exponential in the number of strings k, and so it will likely be too expensive for more than 3 to 4 strings. Further, the problem is NP-complete, so no better exact algorithm is destined to come along.
This problem of multiple sequence alignment has received considerable attention, and numerous heuristics have been proposed. Many heuristics begin by computing the pairwise alignment between each of the pairs of strings, and then work to merge these alignments. One approach is to build a graph with a vertex for each character of each string. There will be an edge between and if the corresponding characters are matched in the alignment between S and T. Any k-clique (see Section ) in this graph describes a commonly aligned character, and all such cliques can be found efficiently because of the sparse structure of this graph.
Although these cliques will define a common subsequence, there is no reason to believe that it will be the longest such substring. Appropriately weakening the clique requirement provides a way to increase it, but still there can be no promises.

Implementations: MAP (Multiple Alignment Program) [Hua94] by Xiaoqiu Huang is a C language program that computes a global multiple alignment of sequences using an iterative pairwise method. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.

Combinatorica [Ski90] provides a Mathematica implementation of an algorithm to construct the longest increasing subsequence of a permutation, which is a special case of longest common subsequence. This algorithm is based on Young tableaux rather than dynamic programming. See Section .

Notes: Good expositions on longest common subsequence include [AHU83, CLR90]. A survey of algorithmic results appears in [GBY91]. The algorithm for the case where all the characters in each sequence are distinct or infrequent is due to Hunt and Szymanski [HS77]. Expositions of this algorithm include [Aho90, Man89]. Multiple sequence alignment for computational biology is treated in [Wat95].

Certain problems on strings become easier when we assume a constant-sized alphabet. Masek and Paterson [MP80] solve longest common subsequence in for constant-sized alphabets, using the four Russians technique.

Related Problems: Approximate string matching (see page ), shortest common superstring (see page ).

Next: Shortest Common Superstring Up: Set and String Problems Previous: Finite State Machine Minimization

Algorithms
Mon Jun 2 23:33:50 EDT 1997