Input description: A universe of items and a collection of subsets , where .
Problem description: Represent each subset so as to efficiently (1) test whether , (2) find the union or intersection of and , and (3) insert or delete members of S.
Discussion: In mathematical terms, a set is an unordered collection of objects drawn from a fixed universal set. However, it is usually useful for implementation to represent each set in a single canonical order, typically sorted, so as to speed up or simplify various operations. Sorted order turns the problem of finding the union or intersection of two subsets into a linear-time operation - just sweep from left to right and see what you are missing. It also makes possible element searching in sublinear time. Finally, printing the elements of a set in a canonical order paradoxically reminds us that order really doesn't matter.
We distinguish sets from two other kinds of objects: strings and dictionaries. If there is no fixed-size universal set, a collection of objects is best thought of as a dictionary, as discussed in Section . If the order does matter in a subset, i.e. if is not the same as , then your structure is more profitably thought of as a string, so see Sections and .
When each subset has cardinality exactly two, they form edges in a graph whose vertices are the universal set. A system of subsets with no restrictions on the cardinality of its members is called a hypergraph. It often can be profitable to consider whether your problem has a graph-theoretical analogy, like connected components or shortest path in a hypergraph.
Your primary alternatives for representing arbitrary systems of subsets are:
In many applications, the subsets are all pairwise disjoint, meaning that each element is in exactly one subset. For example, consider maintaining the connected components of a graph or the party affiliations of politicians. Each vertex/hack is in exactly one component/party. Such a system of subsets is called a set partition. Algorithms for constructing partitions of a given set are provided in Section .
For data structures, the primary issue is maintaining a given set partition as things change over time, perhaps as edges are added or party members defect. The queries we are interested in include ``which set is a particular item in?'' and ``are two items in the same set?'' as we modify the set by (1) changing one item, (2) merging or unioning two sets, or (3) breaking a set apart. Your primary options are:
Certain details remain, such as which subset should be the ultimate root of a union, but these are described in most every algorithms text. Union-Find is a fast, extremely simple data structure that every programmer should know about. It does not support breaking up subsets created by unions, but usually this is not an issue.
Neither of these options provides access to all of the items in a particular subset without traversing all the items in the set. However, both can be appropriately augmented with extra pointers if it is important that this operation be fast.
Implementations: LEDA (see Section ) provides dictionary data structures to maintain sets and the union-find data structure to maintain set partitions, all in C++.
LINK is an environment for combinatorial computing that provides special support for hypergraphs, including visualization of hypergraphs. Although written in C++, it provides an additional Scheme language interface for interacting with the graphs. LINK is available from http://dimacs.rutgers.edu/Projects/LINK.html.
Many textbooks contain implementations of the union-find data structure, including [MS91] (see Section ). An implementation of union-find underlies any implementation of Kruskal's minimum spanning tree algorithm. Section contains a selection of minimum spanning tree codes.
Notes: Optimal algorithms for such set operations as intersection and union were presented in [Rei72]. Good expositions on set data structures include [AHU83].
Galil and Italiano [GI91] survey data structures for disjoint set union. Expositions on the union-find data structure appear in most algorithm texts, including [CLR90, MS91]. The upper bound of on m union-find operations on an n-element set is due to Tarjan [Tar75], as is a matching lower bound on a restricted model of computation [Tar79]. The inverse Ackerman function grows notoriously slowly, so this performance is close to linear. Expositions on the Ackerman bound include [AHU74]. An interesting connection between the worst-case of union-find and the length of Davenport-Schintzl sequences, a combinatorial structure that arises in computational geometry, is established in [SA95].
The power set of a set S is the collection of all subsets of S. Explicit manipulation of power sets quickly becomes intractable due to their size. Implicit representations of power sets in symbolic form becomes necessary for nontrivial computations. See [BCGR92] for algorithms for and computational experience with symbolic power set representations.
Related Problems: Generating subsets (see page ), generating partitions (see page ), set cover (see page ), minimum spanning tree (see page ).