Median and Selection

Next: Generating Permutations Up: Combinatorial Problems Previous: Searching

Median and Selection

Input description: A set of n numbers or keys, and an integer k.

Problem description: Find the key that is smaller than exactly k of the n keys.

Discussion: Median finding is an essential problem in statistics, where it provides a more robust notion of average than the mean. The mean wealth of people who have published research papers on sorting is significantly affected by the presence of one William Gates [GP79], although his effect on the median wealth is merely to cancel out one starving graduate student.

Median finding is a special case of the more general selection problem, which asks for the ith element in sorted order. Selection arises in several applications:

Filtering outlying elements - In dealing with noisy data samples, it is usually a good idea to throw out the largest and smallest 10% or so of them. Selection can be used to identify the items defining the tenth and ninetieth percentiles, and the outliers are then filtered out by comparing each item to the two selected elements.
Identifying the most promising candidates - In a computer chess program, we might quickly evaluate all possible next moves, and then decide to study the top 25% more carefully. Selection followed by filtering is the way to go.
Order statistics - Particularly interesting special cases of selection include finding the smallest element (i=1), the largest element (i=n), and the median element (i= n/2).

The mean of n numbers can be easily computed in linear time by summing the elements and dividing by n. However, finding the median is a more difficult problem. Algorithms that compute the median can easily be generalized to arbitrary selection.

The most elementary median-finding algorithm sorts the items in time and then returns the item sitting the (n/2)nd position. The good thing is that this gives much more information than just the median, enabling you to select the ith element (for all ) in constant time after the sort. However, there are faster algorithms if all you want is the median.

In particular, there is an O(n) expected-time algorithm based on quicksort. Select a random element in the data set as a pivot, and use it to partition the data into sets of elements less than and greater than the pivot. From the sizes of these sets, we know the position of the pivot in the total order, and hence whether the median lies to the left or right of the pivot. Now we recur on the appropriate subset until it converges on the median. This takes (on average) iterations, with the cost of each iteration being roughly half that of the previous one. This defines a geometric series that converges to a linear-time algorithm, although if you are very unlucky it takes the same time as quicksort, .

More complicated algorithms are known that find the median in worst-case linear time. However, the expected-time algorithm will likely win in practice. Just make sure to select random pivots in order to avoid the worst case.

Beyond mean and median, a third notion of average is the mode, defined to be the element that occurs the greatest number of times in the data set. The best way to compute the mode sorts the set in time, which places all identical elements next to each other. By doing a linear sweep from left to right on this sorted set, we can count the length of the longest run of identical elements and hence compute the mode in a total of time.

In fact, there is no faster worst-case algorithm possible to compute the mode, since the problem of testing whether there exist two identical elements in a set (called element uniqueness) can be shown to have an lower bound. Element uniqueness is equivalent to asking if the mode is . Possibilities exist, at least theoretically, for improvements when the mode is large by using fast median computations.

Implementations: A bare bones implementation in C of the recursive k-selection algorithm appears in [GBY91]. See Section for further details.

XTango (see Section ) is an algorithm animation system for UNIX and X-windows, which includes an animation of the linear-time selection algorithm. This animation is a good one.

Notes: The linear expected-time algorithm for median and selection is due to Hoare [Hoa61]. Floyd and Rivest [FR75] provide an algorithm that uses fewer comparisons on average. Good expositions on linear-time selection include [AHU74, Baa88, CLR90, Raw92], with [Raw92] being particularly enlightening.

A sport of considerable theoretical interest is determining exactly how many comparisons are sufficient to find the median of n items. The linear-time algorithm of Blum et. al. [BFP 72] proves that suffice, but we want to know what c is. A lower bower bound of 2n comparisons for median finding was given by Bent and John [BJ85]. In 1976, Schönhage, Paterson, and Pippenger [SPP76] presented an algorithm using 3n comparisons. Recently, Dor and Zwick [DZ95] proved that 2.95 n comparisons suffice. These algorithms attempt to minimize the number of element comparisons but not the total number of operations, and hence do not lead to faster algorithms in practice.

Tight combinatorial bounds for selection problems are presented in [Aig88]. An optimal algorithm for computing the mode is presented in [DM80].

Related Problems: Priority queues (see page ), sorting (see page ).

Next: Generating Permutations Up: Combinatorial Problems Previous: Searching

Algorithms
Mon Jun 2 23:33:50 EDT 1997