Searching

Next: Median and Selection Up: Combinatorial Problems Previous: Sorting

Searching

Input description: A set S of n keys, a query key q.

Problem description: Where is q in S?

Discussion: Searching means different things to different people. Searching for the global maximum or minimum of a function is the problem of unconstrained optimization and is discussed in Section . Chess playing programs search for the best move to make next by using alpha-beta minimax search, which is an exhaustive search of the possible moves using a variation of backtracking (see Section ).

Here we consider the simple task of searching for a key in a list or in an array, which is a fundamental problem associated with information retrieval. Dictionary data structures maintain efficient access to sets of keys under insertion and deletion and are discussed in Section . Typical dictionaries include binary trees and hash tables.

We treat searching here as a problem distinct from dictionaries because simpler and more efficient solutions emerge when our primary interest is static searching. These little data structures can yield large performance improvements when properly employed in an innermost loop. Also, several of the ideas from list and array searching, such as binary search and self-organization, apply to other problems and justify our attention.

There are two basic approaches to array searching: sequential search and binary search. Both are simple, yet have interesting and subtle variations. In sequential search, we simply start from the front of our list or array of keys and compare each successive item against the key until we find a match or reach the end. In binary search, we start with a sorted array of keys. To search for key q, we compare q to the middle key . If q is before , it must reside in the top half of our set; if not, it must reside in the bottom half of our set. By repeating this process on the correct half, we find the key in a total of comparisons. This is a big win over the n/2 comparisons we expect with sequential search. See Section for more on binary search.

Sequential search is the simplest algorithm, and likely to be fastest on up to 10-20 elements. For such tiny problems, forget about binary search. Beyond 50-100 elements, there is no question that binary search will be more efficient than sequential search, even factoring in the cost of the sorting (assuming multiple queries). Other issues do come into play, however, particularly in identifying the proper variant of the algorithm:

How much time can you spend programming? - Binary search is a notoriously tricky algorithm to program correctly. It took seventeen years after its invention until the first correct version of binary search was published! Don't be afraid to start from one of the implementations described below. Test it completely by writing a driver that searches for every key in the set S as well as between the keys.
Are certain items accessed more often than other ones? - Certain English words (such as ``the'') are much more likely to occur than others (such as ``defenestrate''). We can reduce the number of comparisons in a sequential search by putting the most popular words at the top of the list and the least popular ones at the bottom. Further, nonuniform access is typically the rule, not the exception. Many real-world distributions, such as word use in English, are more accurately modeled by Zipf's law. Under Zipf's law, the ith most frequently accessed key is selected with probability (i-1)/i times the probability of the (i-1)st most popular key, for all .
However, preordering the list to exploit a skewed access pattern requires knowing the access pattern in advance. For many applications, it can be difficult to obtain such information. Far easier are self-organizing lists, where the order of the keys changes in response to the queries. The simplest and best self-organizing scheme is move-to-front; that is, we move the most recently searched-for key from its current position to the front of the list. Popular keys keep getting boosted to the front, while unsearched-for keys drift towards the back of the list. There is no need to keep track of the frequency of access; just move the keys on demand. Self-organizing lists also exploit locality of reference, since accesses to a given key are likely to occur in clusters. Any key will be maintained near the top of the list during a cluster of accesses, even if other keys have proven more popular in the past.
Self-organization can extend the useful size range of sequential search. However, you should switch to binary search beyond 50-100 elements.
Is the key close by? - Suppose we know that the target key is to the right of position p, and we think it is close by. Sequential search is fast if we are correct, but we will be punished severely whenever we guess wrong. A better idea is to test repeatedly at larger intervals (p+1, p+2, p+4, p+8, p+16, ) to the right until we find a key to the right of our target. After this, we have a window containing the target and we can proceed with binary search.
Such a one-sided binary search finds the target at position p+l using at most comparisons, so it is faster than binary search when l < < n, yet it can never be much worse. One-sided binary search is particularly useful in unbounded search problems, such as in numerical root finding.
Is my data structure sitting on external memory? - Once the number of keys grows too large, as in a CD-ROM telephone directory of all the people in the United States, binary search loses its status as the best search technique. Binary search jumps wildly around the set of keys looking for midpoints to compare, and it becomes very expensive to read in a new page from a secondary storage device for each comparison. Much better are data structures such as B-trees (see Section ), which cluster the keys into pages so as to minimize the number of disk accesses per search.
Can I guess where the key should be? - In interpolation search, we exploit our understanding of the distribution of keys to guess where to look next. Interpolation search is probably a more accurate description of how we use a telephone book than binary search. For example, suppose we are searching for Washington, George in a sorted telephone book. We would certainly be safe making our first comparison three-fourths of the way down the list, essentially doing two comparisons for the price of one.
Although interpolation search is an appealing idea, we caution against it for three reasons: First, you have to work very hard to optimize your search algorithm before you can hope for a speedup over binary search. Second, even if you get lucky and beat binary search, it is unlikely to be by enough to have justified the exercise. Third, your program will be much less robust and efficient when the distribution changes, such as when your application gets put to work on French words instead of English.

Implementations: The basic sequential and binary search algorithms are simple enough to implement that you should likely do them yourself. Still, the routines described below may be useful as models.

Gonnet and Baeza-Yates provides code fragments in C and Pascal for sequential, binary, and interpolation search, as well as for related dictionary structures. LEDA (see Section ) provides a sorted array data type in C++ that supports binary search. Many textbooks include implementations of binary search, including [MS91]. See Section for details.

Notes: Mehlhorn and Tsakalidis [MT90b] give a through survey of the state-of-the-art in modern data structures. Knuth [Knu73a] provides a detailed analysis and exposition on all fundamental search algorithms and dictionary data structures but omits such modern data structures as red-black and splay trees. Gonnet and Baeza-Yates [GBY91] provide detailed references and experimental results for a wide variety of search algorithms.

Manber [Man89] provides an interesting discussion of variations of binary search, including one-sided binary search and searching for an index in A where .

In linear interpolation search on an array of sorted numbers, the next position probed is given by

where q is the query numerical key and S the sorted numerical array. If the keys are drawn independently from a uniform distribution, the expected search time is [YY76]. Expositions on interpolation search include [Raw92].

Nonuniform access patterns can be exploited in binary search trees by structuring them so that popular keys are located near the root, thus minimizing search time. Dynamic programming can be used to construct such optimal search trees in time [Knu73b]. Expositions include [AHU74]. Splay trees are self-organizing tree structures, as discussed in Section .

Related Problems: Dictionaries (see page ), sorting (see page ).

Next: Median and Selection Up: Combinatorial Problems Previous: Sorting

Algorithms
Mon Jun 2 23:33:50 EDT 1997