Input description: A set of n records, each identified by one or more key fields.
Problem description: Build and maintain a data structure to efficiently locate, insert, or delete the record associated with any query key q.
Discussion: The abstract data type ``dictionary'' is one of the most important structures in computer science. Dozens of different data structures have been proposed for implementing dictionaries, including hash tables, skip lists, and balanced/unbalanced binary search trees, so choosing the right one can be tricky. Depending on the application, it is also a decision that can significantly impact performance. In practice, it is more important to avoid using a bad data structure than to identify the single best option available.
An essential piece of advice is to carefully isolate the implementation of the dictionary data structure from its interface. Use explicit calls to subroutines that initialize, search, and modify the data structure, rather than embedding them within the code. This leads to a much cleaner program, but it also makes it easy to try different dictionary implementations to see how they impact performance. Do not obsess about the cost of the procedure call overhead inherent in such an abstraction. If your application is so time-critical that such overhead can impact performance, then it is even more essential that you be able to easily experiment with different implementations of your dictionary.
In choosing the right data structure for your dictionary, ask yourself the following questions:
Once you understand what your needs are, try to identify the best data structure from the list below:
A particularly interesting and useful variant is a self-organizing list. Whenever a key is accessed or inserted, always move it to the head of the list. Thus if the key is accessed again in the near future, it will be near the front and so require only a short search to find it. Since most applications exhibit both uneven access frequencies and locality of reference, the average search time for a successful search in a self-organizing list is typically much better than in a sorted or unsorted list. Of course, self-organizing data structures can be built from arrays as well as linked lists.
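The move-to-front discipline can be sketched in a few lines of C. This is only an illustrative fragment (the type and function names are my own, not from any standard library): on every successful search, the matched node is spliced out and relinked at the head.

```c
#include <stdlib.h>
#include <string.h>

/* A minimal self-organizing (move-to-front) linked list. */
typedef struct node {
    const char *key;
    struct node *next;
} node;

/* Search for key; on a hit, splice the node out and move it to the
   head of the list.  Returns 1 if found, 0 otherwise. */
int mtf_search(node **headp, const char *key) {
    node *prev = NULL, *cur = *headp;
    while (cur != NULL) {
        if (strcmp(cur->key, key) == 0) {
            if (prev != NULL) {           /* not already at the head */
                prev->next = cur->next;   /* unlink from current spot */
                cur->next = *headp;       /* relink at the front */
                *headp = cur;
            }
            return 1;
        }
        prev = cur;
        cur = cur->next;
    }
    return 0;
}

/* Insert at the head, on the heuristic that a freshly inserted key
   is likely to be accessed again soon. */
void mtf_insert(node **headp, const char *key) {
    node *n = malloc(sizeof(node));
    n->key = key;
    n->next = *headp;
    *headp = n;
}
```

Note that a search costs one extra pointer splice at most, so the self-organizing behavior comes essentially for free on top of an ordinary linked-list search.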
A well-tuned hash table will likely outperform a sorted array in most applications. However, several design decisions go into creating a well-tuned hash table:
A hash function of the form H(S) = sum_{i=0}^{|S|-1} alpha^{|S|-(i+1)} * char(s_i) should work, where alpha is the size of the alphabet and char(x) is the function that maps each character x to its ASCII character code. For long strings, 8 to 10 characters should be sufficient to hash upon, provided they are unlikely to be padded blanks or some other invariant. Use Horner's rule to implement this hash function computation efficiently, as discussed in Section .
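As a sketch (the function name and the choice of alpha = 256 are my own assumptions), Horner's rule evaluates such a polynomial hash in a single left-to-right pass, multiplying the running value by the alphabet size before adding each character code; reducing modulo the table size m at every step keeps the intermediate value from overflowing:

```c
#include <string.h>

/* Polynomial string hash computed via Horner's rule.  alpha is the
   alphabet size (256 covers all 8-bit character codes), and m is the
   number of buckets in the hash table. */
unsigned long string_hash(const char *s, unsigned long m) {
    const unsigned long alpha = 256;   /* size of the alphabet */
    unsigned long h = 0;
    for (; *s != '\0'; s++)
        h = (h * alpha + (unsigned char) *s) % m;   /* mod each step */
    return h;
}
```

Choosing m to be a prime not too close to a power of two helps this hash spread keys evenly across the buckets.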
Regardless of which hash function you decide to use, print statistics on the distribution of keys per bucket to see how uniform it really is. Odds are the first hash function you try will not prove to be the best. Botching up the hash function is an excellent way to slow down any application.
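One quick way to run such a check is sketched below (the helper names are hypothetical): tally the keys per bucket and report the longest chain, which a uniform hash should keep close to the average load. A deliberately bad hash, keyed on the first character only, is included to show how clustering is exposed.

```c
#include <stdlib.h>

/* Tally keys per bucket and return the longest chain length -- a
   quick uniformity check for a candidate hash function. */
int longest_chain(const char *keys[], int nkeys, int m,
                  unsigned long (*hash)(const char *, unsigned long)) {
    int *count = calloc(m, sizeof(int));
    int i, max = 0;
    for (i = 0; i < nkeys; i++)
        count[hash(keys[i], m) % m]++;   /* bucket each key */
    for (i = 0; i < m; i++)
        if (count[i] > max)
            max = count[i];              /* track the worst bucket */
    free(count);
    return max;
}

/* A deliberately bad hash (first character only), for comparison. */
unsigned long first_char_hash(const char *s, unsigned long m) {
    return (unsigned char) s[0] % m;
}
```

Running this on a sample of real keys, with each candidate hash function in turn, makes the uniformity comparison concrete before any timing is done.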
Balanced search trees use local rotation operations to restructure search trees, moving more distant nodes closer to the root while maintaining the in-order search structure of the tree. Among balanced search trees, AVL and 2-3 trees are now passé, and red-black trees seem to be more popular. A particularly interesting self-organizing data structure is the splay tree, which uses rotations to move any accessed key to the root. Frequently used or recently accessed nodes thus sit near the top of the tree, allowing fast search.
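The single rotation that all of these trees are built from is short enough to sketch in C (the type and function names here are illustrative). A right rotation lifts a node's left child into its place; the in-order sequence is preserved because the child's right subtree, which lies between the two keys, is reattached on the other side of the link.

```c
/* A binary search tree node and the single right rotation used by
   balanced trees and splay trees alike. */
typedef struct tnode {
    int key;
    struct tnode *left, *right;
} tnode;

/* Rotate the left child of *rootp up into the root position.
   The mirror-image left rotation swaps the roles of left and right. */
void rotate_right(tnode **rootp) {
    tnode *x = *rootp, *p = x->left;
    x->left = p->right;   /* p's right subtree moves under x */
    p->right = x;         /* x descends to become p's right child */
    *rootp = p;           /* p takes x's old position */
}
```

Each rotation is a constant-time pointer update, which is why rebalancing schemes built on rotations can afford to perform them on every insertion or access.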
Bottom line: Which binary search tree is best for your application? Probably the balanced tree for which you have the best implementation readily available. See the choices below. Which flavor of balanced tree is probably not as important as how good the programmer was who coded it.
The idea behind a B-tree is to collapse several levels of a binary search tree into a single large node, so that we can make the equivalent of several search steps before another disk access is needed. We can thereafter reference enormous numbers of keys using only a few disk accesses. To get the full benefit from using a B-tree, it is important to understand explicitly how the secondary storage device and virtual memory interact, through constants such as page size and virtual/real address space.
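A back-of-the-envelope calculation shows how large a B-tree node can grow. A node holding k keys stores k keys plus k+1 child pointers, so the largest k that fits in one page satisfies k*key_size + (k+1)*ptr_size <= page_size. The sketch below computes this; the 4096-byte page and 8-byte key and pointer sizes in the usage example are assumptions, not fixed constants.

```c
/* Largest number of keys per B-tree node that fits in one disk page,
   given that a node with k keys also stores k+1 child pointers. */
int btree_keys_per_node(int page_size, int key_size, int ptr_size) {
    return (page_size - ptr_size) / (key_size + ptr_size);
}
```

With roughly 256 children per node under these assumptions, three levels of the tree already index on the order of sixteen million keys, so almost any search touches only a handful of pages.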
Even for modest-sized data sets, unexpectedly poor performance of a data structure may be due to excessive swapping, so listen to your disk to help decide whether you should be using a B-tree.
Implementations: LEDA (see Section ) provides an extremely complete collection of dictionary data structures in C++, including hashing, perfect hashing, B-trees, red-black trees, random search trees, and skip lists. Given all of these choices, their default dictionary implementation is a randomized search tree [AS89], presumably reflecting which structure they expect to be most efficient in practice.
XTango (see Section ) is an algorithm animation system for UNIX and X-windows that includes animations of such dictionary data structures as AVL trees, binary search trees, hashing, red-black trees, and treaps (randomized search trees). Many of these are interesting and quite informative to watch. Further, the C source code for each animation is included.
The 1996 DIMACS implementation challenge focused on elementary data structures like dictionaries. The world's best available implementations were likely to be identified during the course of the challenge, and they are accessible from http://dimacs.rutgers.edu/ .
Bare bones implementations in C and Pascal of a dizzying variety of dictionary data structures appear in [GBY91], among them several variations on hashing and binary search trees, and optimal binary search tree construction. See Section for details.
Implementation-oriented treatments of a variety of dictionary data structures appear in [BR95], including hashing, splay trees, red-black trees, and what looks like a thorough implementation of B-trees. Code in C for these data structures is included in the text and is available on disk for a modest fee.
Notes: Mehlhorn and Tsakalidis [MT90b] give a thorough survey of the state of the art in modern data structures. Knuth [Knu73a] provides a detailed analysis and exposition on fundamental dictionary data structures but misses such modern data structures as red-black and splay trees. Gonnet and Baeza-Yates [GBY91] provide implementations (in C and Pascal), detailed references, and experimental results for a wide variety of dictionary data structures. We defer to these sources to avoid giving original references for each of the data structures described above.
Good expositions on red-black trees [GS78] include [BR95, CLR90, Woo93]. Good expositions on splay trees [ST85] include [Tar83, Woo93]. Good expositions on B-trees [BM72] include [BR95, CLR90]. Good expositions on hashing include [Meh84, Woo93].
Several modern data structures, such as splay trees, have been studied via amortized analysis, where we bound the total amount of time used by any sequence of operations. In an amortized analysis, we show that if a single operation is very expensive, this is because we have already benefited from enough cheap operations before it to pay off the higher cost. A data structure realizing an amortized complexity of O(f(n)) is less desirable than one whose worst-case complexity is O(f(n)) (since a very bad operation might still occur) but better than one with an average-case complexity O(f(n)), since the amortized bound will achieve this average on any input.
Newer dictionary data structures that explicitly incorporate randomization into the construction include randomized search trees [AS89] and skip lists [Pug90].
Related Problems: Sorting (see page ), Searching (see page ).