Binary array set
Introduction
The binary array set^{[0]} is a very spaceefficient data structure that allows adding elements and testing membership reasonably quickly. It basically works as a collection of sorted arrays with powerof2 sizes.
Adding one element to a BAS takes amortized \(Θ(1)\) time (worstcase \(Θ(n)\)), searching/testing for an element takes worstcase \(Θ((\log n)^2)\) time, and removing an element is not supported at all (although the entire set can be cleared easily). Despite the lack of support for deletion, the data structure is still useful in applications that only add and test but don’t delete – for example, breadthfirst search maintains an evergrowing set of visited nodes that shouldn’t be revisited. To compare time complexities with a popular alternative, a balanced binary search tree takes worstcase \(Θ(\log n)\) time alike for adding, testing, or removing one element.
The binary array set data structure is succinct because it only uses \(Θ(\log n)\) extra space for overhead (i.e. the space not used for storing the element values themselves). For modest sizes of \(n\) (e.g. over 1000), the amount of overhead space used by a BAS is practically negligible. The data structure that best minimizes overhead space would be the single flat array (which uses only \(Θ(1)\) extra space), but it doesn’t support both fast insertion and testing simultaneously. Note that despite the word “array” in the data structure’s name, the set is in fact orderless and can present elements in any order.
By comparison, a balanced BST uses \(Θ(n)\) extra space for pointers to all the tree nodes. (Even Btrees use \(Θ(n)\) extra space, but the scaling constant approaches zero as the degree increases.) This extra space usage for node pointers would be problematic in the situation where the array of raw data values (without pointers and other overheads) barely fit in system memory. This disadvantage of nodebased data structures is especially painful in languages like Java, where each object (i.e. the tree nodes) inherently has tens of bytes of overhead.
Internal details
How this data structure works is that it has a sequence of \(\lfloor \log_2 n \rfloor + 1\) slots, where the slot at index \(k\) (0based counting) is either null or an ascendingsorted array of length \(2^k\). For example, the abstract conceptual set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13} could be represented as a binary array set in one way like this:
Number of slots: \(\lfloor \log_2 13 \rfloor + 1 = 4\).
Slot 0: [5]. (length 1)
Slot 1: Null.
Slot 2: [2, 3, 9, 13]. (length 4)
Slot 3: [1, 4, 6, 7, 8, 10, 11, 12]. (length 8)
Note that the size of the set is 13, whose binary representation is 1101_{2}. The binary representation corresponds with which slots are filled. Also note that the union of all the arrays in the binary array set is exactly equal to the set of elements being represented, no more, no less, without duplicates.
To test if a given value is an member of the set, we simply perform standard binary search in each array (in the nonnull slots) and return whether a result was found.
To add a new element, we wrap the element in a new array of length 1. If slot 0 is empty, we simply store the new array there. Otherwise we merge the new array with the array already at slot 0, which forms a new sorted array of length 2. Next we check if slot 1 is empty; we repeat this merge/check process on increasing slot indexes until we find an empty slot.
Source code
 Java

The class implements the
java.util.Set
interface, so it can be used as a dropin replacement ofHashSet
orTreeSet
as long asremove()
is not called. The iterator is not failfast on concurrent modifications. The maximum set size isInteger.MAX_VALUE
(i.e. \(2^{31}  1\)).  Python

This implementation only supports the methods {
add()
,clear()
}, builtin functions {len()
,iter()
}, and operatorin
– so it only supports a fraction of the methods that the builtinset
type has. Compatible with Python 2 and 3.  C++

The class is modeled after the Java version, supporting the methods
size()
,add()
,contains()
,empty()
, andclear()
. It has far fewer features thanstd::set
. Requires C++11.
License: MIT (open source)
Notes
This data structure is described in Introduction to Algorithms 2nd and 3rd editions, chapter 17 “Amortized Analysis”, problem 172 “Making binary search dynamic” (page 473 in the 3rd edition).
The binarymerge behavior of this data structure is like a simplified version of binomial heaps with flat arrays instead of binary tree nodes.
The binary array set can be readily adapted to a map/dictionary if needed.
[0]: I asked a question on Stack Overflow if there is a name for this data structure, but the result was negative. So “binary array set” is a name I made up. (The name can’t be used to find existing literature about the data structure.)