Input description: A set of n records, each identified by one or more key fields.
Problem description: Build and maintain a data structure to efficiently locate, insert, or delete the record associated with any query key q.
Discussion: The abstract data type ``dictionary'' is one of the most important structures in computer science. Dozens of different data structures have been proposed for implementing dictionaries, including hash tables, skip lists, and balanced/unbalanced binary search trees, so choosing the right one can be tricky. Depending on the application, it is also a decision that can significantly impact performance. In practice, it is more important to avoid using a bad data structure than to identify the single best option available.
An essential piece of advice is to carefully isolate the implementation of the dictionary data structure from its interface. Use explicit calls to subroutines that initialize, search, and modify the data structure, rather than embedding them within the code. This leads to a much cleaner program, but it also makes it easy to try different dictionary implementations to see how they impact performance. Do not obsess about the cost of the procedure call overhead inherent in such an abstraction. If your application is so time-critical that such overhead can impact performance, then it is even more essential that you be able to easily experiment with different implementations of your dictionary.
In choosing the right data structure for your dictionary, ask yourself the following questions:
Once you understand what your needs are, try to identify the best data structure from the list below:
A particularly interesting and useful variant is a self-organizing list. Whenever a key is accessed or inserted, always move it to the head of the list. Thus if the key is accessed again in the near future, it will be near the front and so require only a short search to find it. Since most applications exhibit both uneven access frequencies and locality of reference, the average search time for a successful search in a self-organizing list is typically much better than in a sorted or unsorted list. Of course, self-organizing data structures can be built from arrays as well as linked lists.
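The move-to-front discipline can be sketched in a few lines of C. This is only an illustrative fragment (the type and function names are my own, not from any standard library): on every successful search, the matched node is spliced out and relinked at the head.

```c
#include <stdlib.h>
#include <string.h>

/* A minimal self-organizing (move-to-front) linked list. */
typedef struct node {
    const char *key;
    struct node *next;
} node;

/* Search for key; on a hit, splice the node out and move it to the
   head of the list.  Returns 1 if found, 0 otherwise. */
int mtf_search(node **headp, const char *key) {
    node *prev = NULL, *cur = *headp;
    while (cur != NULL) {
        if (strcmp(cur->key, key) == 0) {
            if (prev != NULL) {           /* not already at the head */
                prev->next = cur->next;   /* unlink from current spot */
                cur->next = *headp;       /* relink at the front */
                *headp = cur;
            }
            return 1;
        }
        prev = cur;
        cur = cur->next;
    }
    return 0;
}

/* Insert at the head, on the heuristic that a freshly inserted key
   is likely to be accessed again soon. */
void mtf_insert(node **headp, const char *key) {
    node *n = malloc(sizeof(node));
    n->key = key;
    n->next = *headp;
    *headp = n;
}
```

Note that a search costs one extra pointer splice at most, so the self-organizing behavior comes essentially for free on top of an ordinary linked-list search.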
A well-tuned hash table will likely outperform a sorted array in most applications. However, several design decisions go into creating a well-tuned hash table:
A hash function of the form H(S) = sum_{i=0}^{|S|-1} alpha^{|S|-(i+1)} * char(s_i) should work, where alpha is the size of the alphabet and char(x) is the function that maps each character x to its ASCII character code. For long strings, 8 to 10 characters should be sufficient to hash upon, provided they are unlikely to be padded blanks or some other invariant. Use Horner's rule to implement this hash function computation efficiently, as discussed in Section .
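As a sketch (the function name and the choice of alpha = 256 are my own assumptions), Horner's rule evaluates such a polynomial hash in a single left-to-right pass, multiplying the running value by the alphabet size before adding each character code; reducing modulo the table size m at every step keeps the intermediate value from overflowing:

```c
#include <string.h>

/* Polynomial string hash computed via Horner's rule.  alpha is the
   alphabet size (256 covers all 8-bit character codes), and m is the
   number of buckets in the hash table. */
unsigned long string_hash(const char *s, unsigned long m) {
    const unsigned long alpha = 256;   /* size of the alphabet */
    unsigned long h = 0;
    for (; *s != '\0'; s++)
        h = (h * alpha + (unsigned char) *s) % m;   /* mod each step */
    return h;
}
```

Choosing m to be a prime not too close to a power of two helps this hash spread keys evenly across the buckets.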
Regardless of which hash function you decide to use, print statistics on the distribution of keys per bucket to see how uniform it really is. Odds are the first hash function you try will not prove to be the best. Botching up the hash function is an excellent way to slow down any application.
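One quick way to run such a check is sketched below (the helper names are hypothetical): tally the keys per bucket and report the longest chain, which a uniform hash should keep close to the average load. A deliberately bad hash, keyed on the first character only, is included to show how clustering is exposed.

```c
#include <stdlib.h>

/* Tally keys per bucket and return the longest chain length -- a
   quick uniformity check for a candidate hash function. */
int longest_chain(const char *keys[], int nkeys, int m,
                  unsigned long (*hash)(const char *, unsigned long)) {
    int *count = calloc(m, sizeof(int));
    int i, max = 0;
    for (i = 0; i < nkeys; i++)
        count[hash(keys[i], m) % m]++;   /* bucket each key */
    for (i = 0; i < m; i++)
        if (count[i] > max)
            max = count[i];              /* track the worst bucket */
    free(count);
    return max;
}

/* A deliberately bad hash (first character only), for comparison. */
unsigned long first_char_hash(const char *s, unsigned long m) {
    return (unsigned char) s[0] % m;
}
```

Running this on a sample of real keys, with each candidate hash function in turn, makes the uniformity comparison concrete before any timing is done.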
Balanced search trees use local rotation operations to restructure search trees, moving more distant nodes closer to the root while maintaining the in-order search structure of the tree. Among balanced search trees, AVL and 2-3 trees are now passé, and red-black trees seem to be more popular. A particularly interesting self-organizing data structure is the splay tree, which uses rotations to move any accessed key to the root. Frequently used or recently accessed nodes thus sit near the top of the tree, allowing fast search.
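The single rotation that all of these trees are built from is short enough to sketch in C (the type and function names here are illustrative). A right rotation lifts a node's left child into its place; the in-order sequence is preserved because the child's right subtree, which lies between the two keys, is reattached on the other side of the link.

```c
/* A binary search tree node and the single right rotation used by
   balanced trees and splay trees alike. */
typedef struct tnode {
    int key;
    struct tnode *left, *right;
} tnode;

/* Rotate the left child of *rootp up into the root position.
   The mirror-image left rotation swaps the roles of left and right. */
void rotate_right(tnode **rootp) {
    tnode *x = *rootp, *p = x->left;
    x->left = p->right;   /* p's right subtree moves under x */
    p->right = x;         /* x descends to become p's right child */
    *rootp = p;           /* p takes x's old position */
}
```

Each rotation is a constant-time pointer update, which is why rebalancing schemes built on rotations can afford to perform them on every insertion or access.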
Bottom line: Which binary search tree is best for your application? Probably the balanced tree for which you have the best implementation readily available. See the choices below. Which flavor of balanced tree is probably not as important as how good the programmer was who coded it.
The idea behind a B-tree is to collapse several levels of a binary search tree into a single large node, so that we can make the equivalent of several search steps before another disk access is needed. We can thereafter reference enormous numbers of keys using only a few disk accesses. To get the full benefit from using a B-tree, it is important to understand explicitly how the secondary storage device and virtual memory interact, through constants such as page size and virtual/real address space.
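A back-of-the-envelope calculation shows how large a B-tree node can grow. A node holding k keys stores k keys plus k+1 child pointers, so the largest k that fits in one page satisfies k*key_size + (k+1)*ptr_size <= page_size. The sketch below computes this; the 4096-byte page and 8-byte key and pointer sizes in the usage example are assumptions, not fixed constants.

```c
/* Largest number of keys per B-tree node that fits in one disk page,
   given that a node with k keys also stores k+1 child pointers. */
int btree_keys_per_node(int page_size, int key_size, int ptr_size) {
    return (page_size - ptr_size) / (key_size + ptr_size);
}
```

With roughly 256 children per node under these assumptions, three levels of the tree already index on the order of sixteen million keys, so almost any search touches only a handful of pages.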
Even for modest-sized data sets, unexpectedly poor performance of a data structure may be due to excessive swapping, so listen to your disk to help decide whether you should be using a B-tree.
Implementations: LEDA (see Section ) provides an extremely complete collection of dictionary data structures in C++, including hashing, perfect hashing, B-trees, red-black trees, random search trees, and skip lists. Given all of these choices, their default dictionary implementation is a randomized search tree [AS89], presumably reflecting which structure they expect to be most efficient in practice.
XTango (see Section ) is an algorithm animation system for UNIX and X-windows that includes animations of such dictionary data structures as AVL trees, binary search trees, hashing, red-black trees, and treaps (randomized search trees). Many of these are interesting and quite informative to watch. Further, the C source code for each animation is included.
The 1996 DIMACS implementation challenge focused on elementary data structures like dictionaries. The world's best available implementations were likely to be identified during the course of the challenge, and they are accessible from http://dimacs.rutgers.edu/ .
Bare bones implementations in C and Pascal of a dizzying variety of dictionary data structures appear in [GBY91], among them several variations on hashing and binary search trees, and optimal binary search tree construction. See Section for details.
Implementation-oriented treatments of a variety of dictionary data structures appear in [BR95], including hashing, splay trees, red-black trees, and what looks like a thorough implementation of B-trees. Code in C for these data structures is included in the text and is available on disk for a modest fee.
Notes: Mehlhorn and Tsakalidis [MT90b] give a thorough survey of the state of the art in modern data structures. Knuth [Knu73a] provides a detailed analysis and exposition on fundamental dictionary data structures but misses such modern data structures as red-black and splay trees. Gonnet and Baeza-Yates [GBY91] provide implementations (in C and Pascal), detailed references, and experimental results for a wide variety of dictionary data structures. We defer to these sources to avoid giving original references for each of the data structures described above.
Good expositions on red-black trees [GS78] include [BR95, CLR90, Woo93]. Good expositions on splay trees [ST85] include [Tar83, Woo93]. Good expositions on B-trees [BM72] include [BR95, CLR90]. Good expositions on hashing include [Meh84, Woo93].
Several modern data structures, such as splay trees, have been studied via amortized analysis, where we bound the total amount of time used by any sequence of operations. In an amortized analysis, we show that if a single operation is very expensive, this is because we have already benefited from enough cheap operations before it to pay off the higher cost. A data structure realizing an amortized complexity of O(f(n)) is less desirable than one whose worst-case complexity is O(f(n)) (since a very bad operation might still occur) but better than one with an average-case complexity O(f(n)), since the amortized bound will achieve this average on any input.
Newer dictionary data structures that explicitly incorporate randomization into the construction include randomized search trees [AS89] and skip lists [Pug90].
Related Problems: Sorting (see page ), Searching (see page ).