Bucketing Techniques

Next: War Story: Stripping Triangulations Up: Approaches to Sorting Previous: Randomization

Bucketing Techniques

If we were sorting names for the telephone book, we could start by partitioning the names according to the first letter of the last name. That will create 26 different piles, or buckets, of names. Observe that any name in the J pile must occur after every name in the I pile but before any name in the K pile. Therefore, we can proceed to sort each pile individually and just concatenate the bunch of piles together.

If the names are distributed fairly evenly among the buckets, as we might expect, the resulting 26 sorting problems should each be substantially smaller than the original problem. Further, by now partitioning each pile based on the second letter of each name, we generate smaller and smaller piles. The names will be sorted as soon as each bucket contains only a single name. The resulting algorithm is commonly called bucketsort or distribution sort.

Bucketing is a very effective idea whenever we are confident that the distribution of data will be roughly uniform. It is the idea that underlies hash tables, kd-trees, and a variety of other practical data structures. The downside of such techniques is that the performance can be terrible whenever the data distribution is not what we expected. Although data structures such as binary trees offer guaranteed worst-case behavior for any input distribution, no such promise exists for heuristic data structures on unexpected input distributions.

Figure: A small subset of Charlottesville Shiffletts

To show that non-uniform distributions occur in real life, consider Americans with the uncommon last name of Shifflett. The 1997 Manhattan telephone directory, with over one million names, contains exactly five Shiffletts. So how many Shiffletts should there be in a small city of 50,000 people? Figure shows a small portion of the two and a half pages of Shiffletts in the Charlottesville, Virginia telephone book. The Shifflett clan is a fixture of the region, but it would play havoc with any distribution sort program, as refining buckets from S to Sh to Shi to Shif to to Shifflett results in no significant partitioning.

Next: War Story: Stripping Triangulations Up: Approaches to Sorting Previous: Randomization

Algorithms
Mon Jun 2 23:33:50 EDT 1997