next up previous contents index CD CD Algorithms
Next: Algorithmic Resources Up: Set and String Problems Previous: Longest Common Substring

Shortest Common Superstring

  

  figure22965

Input description: A set of strings tex2html_wrap_inline31165 .

Problem description: Find the shortest string S' that contains each element of S as a substring.

Discussion: Shortest common superstring arises in a variety of applications, including sparse matrix compression. Suppose we have an tex2html_wrap_inline31169 matrix with most of the elements being zero. We can partition each row into m/k runs of k elements each and construct the shortest common superstring S' of these runs. We now have reduced the problem to storing the superstring, plus an tex2html_wrap_inline31175 array of pointers into the superstring denoting where each of the runs starts. Accessing a particular element M[i,j] still takes constant time, but there is a space savings when |S| << mn.     

Another application arises in DNA sequencing. It happens to be easy to sequence small fragments of DNA, say up to about 500 base pairs or characters. However, the real interest is in sequencing large molecules. The standard approach to large-scale, ``shotgun'' sequencing clones many copies of the target molecule, breaks them randomly into small fragments, sequences the fragments, and then proposes the shortest superstring of the fragments as the correct sequence. While it is an article of faith that the shortest superstring will be the most likely sequence, this seems to work reasonably well in practice.   

Finding a superstring of all the substrings is not difficult, as we can simply concatenate them all together. It is finding the shortest such string that is problematic. Indeed, shortest common superstring remains NP-complete under all reasonable classes of strings.   

The problem of finding the shortest common superstring can easily be reduced to that of the traveling salesman problem (see Section gif). Create an overlap graph G where vertex tex2html_wrap_inline31181 represents string tex2html_wrap_inline31183 . Edge tex2html_wrap_inline31185 will have weight equal to the length of tex2html_wrap_inline31187 minus the overlap of tex2html_wrap_inline31189 with tex2html_wrap_inline31191 . The path visiting all the vertices of minimum total weight defines the shortest common superstring. The edge weights of this graph are not symmetric, after all, the overlap of tex2html_wrap_inline31193 and tex2html_wrap_inline31195 is not the same as the overlap of tex2html_wrap_inline31197 and tex2html_wrap_inline31199 .   Thus only programs capable of solving asymmetric TSPs can be applied to this problem.

The greedy heuristic is the standard approach to approximating the shortest common superstring. Find the pair of strings with the maximum number of characters of overlap. Replace them by the merged string, and repeat until only one string remains. Given the overlap graph above, this heuristic can be efficiently implemented by inserting all of the edge weights into a heap (see Section gif) and then merging if the appropriate ends of the two strings have not yet be used, which can be maintained with an array of Boolean flags.    

The potentially time-consuming part of this heuristic is in building the overlap graph. The brute-force approach to finding the maximum overlap of two length-l strings takes tex2html_wrap_inline31201 , which must be repeated tex2html_wrap_inline31203 times. Faster times are possible by appropriately using suffix trees (see Section gif). Build a tree containing all suffixes of all reversed strings of S. String tex2html_wrap_inline31205 overlaps with tex2html_wrap_inline31207 if a suffix of tex2html_wrap_inline31209 matches a suffix of the reverse of tex2html_wrap_inline31211 . The longest overlap for each fragment can be found in time linear in its length.     

How well does the greedy heuristic perform? If we are unlucky with the input, the greedy heuristic can be fooled into creating a superstring that is at least twice as long as optimal. Usually, it will be a lot better in practice. It is known that the resulting superstring can never be more than 2.75 times optimal.

Building superstrings becomes more difficult with positive and negative substrings, where negative substrings cannot be substrings of the superstring. The problem of deciding whether any such consistent substring exists is NP-complete, unless you are allowed to add an extra character to the alphabet to use as a spacer.  

Implementations: CAP (Contig Assembly Program) [Hua92] by Xiaoqiu Huang is a C language program supporting DNA shotgun sequencing by finding the shortest common superstring of a set of fragments. As to performance, CAP took 4 hours to assemble 1,015 fragments of a total of 252,000 characters on a Sun SPARCstation SLC. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.    

Notes: The shortest common superstring problem and its application to DNA shotgun assembly is ably surveyed in [Wat95]. Kececioglu and Myers [KM95] report on an algorithm for the more general version of shortest common superstring, where the strings may have errors. Their paper is recommended reading to anyone interested in fragment assembly.

Blum et al. [BJL tex2html_wrap_inline32732 91] gave the first constant-factor approximation algorithms for shortest common superstring, with a variation of the greedy heuristic. More recent research has beaten the constant down to 2.75, progress towards the expected factor-two result.

Related Problems: Suffix trees (see page gif), text compression (see page gif).    


next up previous contents index CD CD Algorithms
Next: Algorithmic Resources Up: Set and String Problems Previous: Longest Common Substring

Algorithms
Mon Jun 2 23:33:50 EDT 1997