Nearest Neighbor Search

Next: Range Search Up: Computational Geometry Previous: Voronoi Diagrams

Nearest Neighbor Search

Input description: A set S of n points in d dimensions; a query point q.

Problem description: Which point in S is closest to q?

Discussion: The need to quickly find the nearest neighbor to a query point arises in a variety of geometric applications. The classic example in two dimensions is designing a system to dispatch emergency vehicles to the scene of a fire. Once the dispatcher learns the location of the fire, she uses a map to find the firehouse closest to this point so as to minimize transportation delays. This situation occurs in any application mapping customers to service providers.

Nearest-neighbor search is also important in classification. Suppose we are given a collection of data about people (say age, height, weight, years of education, sex, and income level) each of whom has been labeled as Democrat or Republican. We seek a classifier to decide which way a different person is likely to vote. Each of the people in our data set is represented by a party-labeled point in d-dimensional space. A simple classifier can be built by assigning to the new point the party affiliation of its nearest neighbor.

Such nearest-neighbor classifiers are widely used, often in high-dimensional spaces. The vector-quantization method of image compression partitions an image into pixel regions. This method uses a predetermined library of several thousand pixel tiles and replaces each image region by the most similar library tile. The most similar tile is the point in 64-dimensional space that is closest to the image region in question. Compression is achieved by reporting the identifier of the closest library tile instead of the 64 pixels, at some loss of image fidelity.

Issues arising in nearest-neighbor search include:

How many points are you searching? - When your data set contains only a small number of points (say ) or if only few queries are ever destined to be performed, the simple approach is best. Compare the query point q against each of the n data points. Only when fast queries are needed for large numbers of points does it pay to consider more sophisticated methods.
How many dimensions are you working in? - Nearest neighbor search gets slowly but progressively harder as the dimensionality increases. The kd-tree data structure, presented in Section , does a very good job in moderate-dimensional spaces, even the plane. Still, above 20 or so dimensions, you might as well do a linear search through the data points. Search in high-dimensional spaces becomes hard because a sphere of radius r, representing all the points with distance from the center, progressively fills up less volume relative to a cube as the dimensionality increases. Thus any data structure based on partitioning points into enclosing volumes will become less and less effective.
In two dimensions, Voronoi diagrams (see Section ) provide an efficient data structure for nearest-neighbor queries. The Voronoi diagram of a point set in the plane decomposes the plane into regions such that the cell containing each data point consists of the part of the plane that is nearer to that point than any other in the set. Finding the nearest neighbor of query point q reduces to finding which cell in the Voronoi diagram contains q and reporting the data point associated with it. Although Voronoi diagrams can be built in higher dimensions, their size rapidly grows to the point of unusability.
Is your data set static or dynamic? - Will there be occasional insertions or deletions of new data points in your application? If these are just rare events, it might pay to build your data structure from scratch each time. If they are frequent, select a version of the kd-tree that supports insertions and deletions.

The nearest neighbor graph on a set S of n points links each vertex to its nearest neighbor. This graph is a subgraph of the Delaunay triangulation and so can be computed in . This is quite a bargain since it takes time just to discover the closest pair of points in S.

As a lower bound, the closest pair problem in one dimension reduces to sorting. In a sorted set of numbers, the closest pair corresponds to two numbers that lie next to each other in sorted order, so we need only check which is the minimum gap between the n-1 adjacent pairs. The limiting case of this occurs when the closest pair are distance zero apart, meaning that the elements are not unique.

Implementations: Ranger is a tool for visualizing and experimenting with nearest-neighbor and orthogonal-range queries in high-dimensional data sets, using multidimensional search trees. Four different search data structures are supported by Ranger: naive kd-trees, median kd-trees, nonorthogonal kd-trees, and the vantage point tree.

For each of these, Ranger supports queries in up to 25 dimensions under any Minkowski metric. It includes generators for a variety of point distributions in arbitrary dimensions. Finally, Ranger provides a number of features to aid in visualizing multidimensional data, best illustrated by the accompanying video [MS93]. To identify the most appropriate projection at a glance, Ranger provides a matrix of all two-dimensional projections of the data set. Ranger is written in C, using Motif. It runs on Silicon Graphics and HP workstations and is available in the algorithm repository http://www.cs.sunysb.edu/ algorith.

See Section for a complete collection of Voronoi diagram implementations. In particular, LEDA (see Section ) provides an implementation of 2D Voronoi diagrams in C++, as well as planar point location to make effective use of them for nearest-neighbor search.

A Pascal implementation of the divide-and-conquer algorithm for finding the closest pair of points in a set of n points appears in [MS91]. See Section .

Notes: The best reference on kd-trees and other spatial data structures is two volumes by Samet [Sam90b, Sam90a], where all major variants are developed in substantial detail. Good expositions on finding the closest pair of points in the plane [BS76] include [CLR90, Man89]. These algorithms use a divide-and-conquer approach instead of just selecting from the Delaunay triangulation.

A recent development in higher-dimensional nearest-neighbor search is algorithms that quickly produce a point that if not the nearest neighbor lies provably close to the query point. A sparse weighted graph structure is built from the data set, and the nearest neighbor is found by starting at a random point and greedily walking in the graph towards the query point. The closest point found over several random trials is declared the winner. Similar data structures hold promise for other problems in high-dimensional spaces. See [AM93, AMN 94].

Related Problems: Kd-trees (see page ), Voronoi diagrams (see page ), range search (see page ).

Next: Range Search Up: Computational Geometry Previous: Voronoi Diagrams

Algorithms
Mon Jun 2 23:33:50 EDT 1997