next up previous contents index CD CD Algorithms
Next: Cryptography Up: Set and String Problems Previous: Approximate String Matching

Text Compression

  

  figure22124

Input description: A text string S.

Problem description: A shorter text string S' such that S can be correctly reconstructed from S'.

Discussion: Secondary storage devices fill up quickly on every computer system, even though their capacity doubles each year. Decreasing storage prices have only increased interest in data compression, since there is now more data to compress than ever before. Data compression is the algorithmic problem of finding alternative, space-efficient encodings for a given data file. With the rise of computer networks, a new mission for data compression has arisen, that of increasing the effective bandwidth of networks by reducing the number of bits before transmission.        

Data compression is a problem for which practical people like to invent ad hoc methods, designed for their particular applications. Sometimes these outperform general methods, but often they do not. The following issues arise in selecting the right data compression algorithm:

Although there are literally dozens of text compression algorithms available, they are characterized by two basic approaches. In static algorithms, such as Huffman codes, a single coding table is built by analyzing the entire document. In adaptive algorithms, such as Lempel-Ziv, a coding table is built on the fly and adapts to the local character distribution of the document. An adaptive algorithm will likely prove to be the correct answer:  

Implementations: A complete list of available compression programs is provided in the comp.compression FAQ (frequently asked questions) file, discussed below. This FAQ will likely point you to what you are looking for, if you don't find it in this section.  

The best general-purpose program for text compression is gzip, which implements a public domain variation of the Lempel-Ziv algorithm. It is distributed under the GNU software licence and can by obtained from ftp://prep.ai.mit.edu/pub/gnu/gzip-1.2.4.tar. Unix compress is another popular compression program based on the patented LZW algorithm. It is available from ftp://wuarchive.wustl.edu/packages/compression/compress-4.1.tar.   

A JPEG implementation is available from ftp://ftp.uu.net/graphics/jpeg/jpegsrc.v6a.tar.gz. MPEG can be found at ftp://havefun.stanford.edu/pub/mpeg/MPEGv1.2.2.tar.Z.   

Algorithm 673 [Vit89] of the Collected Algorithms of the ACM is a Pascal implementation of dynamic Huffman codes, which is a one-pass, adaptive text compression algorithm. See Section gif for details on fetching this program.   

Notes: Many books on data compression are available, but we highly recoomend Bell, Cleary, and Witten [BCW90] and Storer [Sto88]. Another good source of information is the USENET newsgroup comp.compression. Check out its particularly comprehensive FAQ (frequently asked questions) compendium at location ftp://rtfm.mit.edu/pub/usenet/news.answers/compression-faq.

Good expositions on Huffman codes [Huf52] include [AHU83, BR95, CLR90, Eve79a, Man89]. Expositions on the LZW [Wel84, ZL78] algorithm include [BR95].

There is an annual IEEE Data Compression Conference, the proceedings of which should be studied seriously before attempting to develop a new data compression algorithm. On reading the proceedings, it will become apparent that this is a mature technical area, where much of the current work (especially for text compression) is shooting for fairly marginal improvements on special applications. On a more encouraging note, we remark that this conference is held annually at a world-class ski resort in Utah.    

Related Problems: Shortest common superstring (see page gif), cryptography (see page gif).    


next up previous contents index CD CD Algorithms
Next: Cryptography Up: Set and String Problems Previous: Approximate String Matching

Algorithms
Mon Jun 2 23:33:50 EDT 1997