Text Compression

Next: Cryptography Up: Set and String Problems Previous: Approximate String Matching

Text Compression

Input description: A text string S.

Problem description: A shorter text string S' such that S can be correctly reconstructed from S'.

Discussion: Secondary storage devices fill up quickly on every computer system, even though their capacity doubles each year. Decreasing storage prices have only increased interest in data compression, since there is now more data to compress than ever before. Data compression is the algorithmic problem of finding alternative, space-efficient encodings for a given data file. With the rise of computer networks, a new mission for data compression has arisen, that of increasing the effective bandwidth of networks by reducing the number of bits before transmission.

Data compression is a problem for which practical people like to invent ad hoc methods, designed for their particular applications. Sometimes these outperform general methods, but often they do not. The following issues arise in selecting the right data compression algorithm:

Must we exactly reconstruct the input text after compression? - A primary issue in data compression algorithms is the question of lossy versus lossless encodings. Text applications typically demand lossless encodings, because users become disturbed whenever their data files have been corrupted. However, fidelity is not such an issue in image or video compression, where the presence of small artifacts will be imperceptible to the viewer. Significantly greater compression ratios can be obtained using lossy compression, which is why all image/video/audio compression algorithms take advantage of this freedom.
Can I simplify my data before I compress it? - The most effective way to free up space on a disk is to delete files you don't need. Likewise, any preprocessing you can do to a file to reduce its information content before compression will pay off later in better performance. For example, is it possible to eliminate extra blank spaces or lines from the file? Can the document be converted entirely to uppercase characters or have formatting information removed?
Does it matter whether the algorithm is patented? - One concern is that many data compression algorithms are patented, in particular the LZW variation of the Lempel-Ziv algorithm discussed below. Further, Unisys, the owner of the patent, makes periodic attempts to collect. My personal (although not legal) recommendation is to ignore them, unless you are in the business of selling text compression software. If this makes you uncomfortable, note that there are other variations on the Lempel-Ziv algorithm that are not under patent protection and perform about as well. See the notes and implementations below.
How do I compress image data - Run-length coding is the simplest lossless compression algorithm for image data, where we replace runs of identical pixel values with one instance of the pixel and an integer giving the length of the run. This works well on binary images with large regions of similar pixels (like scanned text) and terribly on images with many quantization levels and a little noise. It can also be applied to text with many fields that have been padded by blanks. Issues like how many bits to allocate to the count field and the traversal order converting the two-dimensional image to a stream of pixels can have a surprisingly large impact on the compression ratio.
For serious image and video compression applications, I recommend that you use a lossy coding method and not fool around with implementing it yourself. JPEG is the standard high-performance image compression method, while MPEG is designed to exploit the frame-to-frame coherence of video. Encoders and decoders for both are provided in the implementation section.
Must compression and decompression both run in real time? - For many applications, fast decompression is more important than fast compression, and algorithms such as JPEG exist to take advantage of this. While compressing video for a CD-ROM, the compression will be done only once, while decompression will be necessary anytime anyone plays it. In contrast, operating systems that increase the effective capacity of disks by automatically compressing each file will need a symmetric algorithm with fast compression times as well.

Although there are literally dozens of text compression algorithms available, they are characterized by two basic approaches. In static algorithms, such as Huffman codes, a single coding table is built by analyzing the entire document. In adaptive algorithms, such as Lempel-Ziv, a coding table is built on the fly and adapts to the local character distribution of the document. An adaptive algorithm will likely prove to be the correct answer:

Huffman codes - Huffman codes work by replacing each alphabet symbol by a variable-length code string. ASCII uses eight bits per symbol in English text, which is wasteful, since certain characters (such as `e') occur far more often than others (such as `q'). Huffman codes compress text by assigning `e' a short code word and `q' a longer one.
Optimal Huffman codes can be constructed using an efficient greedy algorithm. Sort the symbols in increasing order by frequency. We will merge the two least frequently used symbols x and y into a new symbol m, whose frequency is the sum of the frequencies of its two child symbols. By replacing x and y by m, we now have a smaller set of symbols, and we can repeat this operation n-1 times until all symbols have been merged. Each merging operation defines a node in a binary tree, and the left or right choices on the path from root-to-leaf define the bit of the binary code word for each symbol. Maintaining the list of symbols sorted by frequency can be done using priority queues, which yields an -time Huffman code construction algorithm.
Although they are widely used, Huffman codes have three primary disadvantages. First, you must make two passes over the document on encoding, the first to gather statistics and build the coding table and the second to actually encode the document. Second, you must explicitly store the coding table with the document in order to reconstruct it, which eats into your space savings on short documents. Finally, Huffman codes exploit only nonuniformity in symbol distribution, while adaptive algorithms can recognize the higher-order redundancy in strings such as 0101010101....
Lempel-Ziv algorithms - Lempel-Ziv algorithms, including the popular LZW variant, compress text by building the coding table on the fly as we read the document. The coding table available for compression changes at each position in the text. A clever protocol between the encoding program and the decoding program ensures that both sides of the channel are always working with the exact same code table, so no information can be lost.
Lempel-Ziv algorithms build coding tables of recently-used text strings, which can get arbitrarily long. Thus it can exploit frequently-used syllables, words, and even phrases to build better encodings. Further, since the coding table alters with position, it adapts to local changes in the text distribution, which is important because most documents exhibit significant locality of reference.
The truly amazing thing about the Lempel-Ziv algorithm is how robust it is on different types of files. Even when you know that the text you are compressing comes from a special restricted vocabulary or is all lowercase, it is very difficult to beat Lempel-Ziv by using an application-specific algorithm. My recommendation is not to try. If there are obvious application-specific redundancies that can safely be eliminated with a simple preprocessing step, go ahead and do it. But don't waste much time fooling around. No matter how hard you work, you are unlikely to get significantly better text compression than with gzip or compress, and you might well do worse.

Implementations: A complete list of available compression programs is provided in the comp.compression FAQ (frequently asked questions) file, discussed below. This FAQ will likely point you to what you are looking for, if you don't find it in this section.

The best general-purpose program for text compression is gzip, which implements a public domain variation of the Lempel-Ziv algorithm. It is distributed under the GNU software licence and can by obtained from ftp://prep.ai.mit.edu/pub/gnu/gzip-1.2.4.tar. Unix compress is another popular compression program based on the patented LZW algorithm. It is available from ftp://wuarchive.wustl.edu/packages/compression/compress-4.1.tar.

A JPEG implementation is available from ftp://ftp.uu.net/graphics/jpeg/jpegsrc.v6a.tar.gz. MPEG can be found at ftp://havefun.stanford.edu/pub/mpeg/MPEGv1.2.2.tar.Z.

Algorithm 673 [Vit89] of the Collected Algorithms of the ACM is a Pascal implementation of dynamic Huffman codes, which is a one-pass, adaptive text compression algorithm. See Section for details on fetching this program.

Notes: Many books on data compression are available, but we highly recoomend Bell, Cleary, and Witten [BCW90] and Storer [Sto88]. Another good source of information is the USENET newsgroup comp.compression. Check out its particularly comprehensive FAQ (frequently asked questions) compendium at location ftp://rtfm.mit.edu/pub/usenet/news.answers/compression-faq.

Good expositions on Huffman codes [Huf52] include [AHU83, BR95, CLR90, Eve79a, Man89]. Expositions on the LZW [Wel84, ZL78] algorithm include [BR95].

There is an annual IEEE Data Compression Conference, the proceedings of which should be studied seriously before attempting to develop a new data compression algorithm. On reading the proceedings, it will become apparent that this is a mature technical area, where much of the current work (especially for text compression) is shooting for fairly marginal improvements on special applications. On a more encouraging note, we remark that this conference is held annually at a world-class ski resort in Utah.

Related Problems: Shortest common superstring (see page ), cryptography (see page ).

Next: Cryptography Up: Set and String Problems Previous: Approximate String Matching

Algorithms
Mon Jun 2 23:33:50 EDT 1997