Project Nayuki

AA tree set

The AA (Arne Andersson) tree is a self-balancing binary tree data structure. It guarantees fast operations in Θ(log n) time, and the implementation code is perhaps the shortest among all the balanced trees. Here is an implementation of a sorted set abstracted data type (ADT) using the AA tree as the basis.

Source code

Java (SE 7+)

The class AaTreeSet implements java.util.Set, making it easy to try this AA tree in existing code that uses standard classes from the Java collections framework. The iterator is guaranteed to output values in sorted order. (This class doesn’t implement java.util.SortedSet because the required methods like subSet() are expected to have lengthy code.) Note that the iterator is not fail-fast unlike the ones in the Java collections framework; take care to not modify the set when iterating over it.

Python (compatible with 2 and 3)

This implementation supports the methods add(), remove(), discard(), len(), and in. Thus the interface mimics the built-in set type.

License: MIT (open source)

Compared to red-black trees

If self-balancing trees (which have Θ(log n) time complexity) are taught in a computer science curriculum at all, the classic choice is the red-black tree. Also, RB trees are deployed pervasively in practice, such as in the Linux kernel (C code) and in the class java.util.TreeSet of the Java collections framework.

AA trees can be viewed as RB trees with two modifications: The left child cannot be red but the right child can be; and instead of using the colors red/black, each node stores an integer denoting its level, and a right child is conceptually considered to be red if it is at the same level. What stays the same are that both are binary search trees, every path from the root to a leaf has the same number of black nodes, and a red node cannot have a red child.

Red-black trees have many cases to consider when rebalancing after a node insertion or deletion; AA trees were designed to cut down the cases significantly. AA trees are superficially similar to left-leaning red-black trees, but LLRB trees still seem to have many cases and to achieve less simplicity than AA trees.

Compared to AVL trees

The first self-balancing binary tree structure to be invented is the AVL tree. It’s probably the second most popular type of binary tree after RB trees. (Binary trees types other than RB, AVL, and splay have very little mindshare.)

The implementation size of AVL trees is only a bit longer than AA trees, to my surprise. Conversely, AA trees turned out to be not as short as I had hoped (when I was reading papers and pseudocode, but before writing a concrete implementation). This makes AVL trees very competitive with AA trees, and makes it hard to pick a decisive winner.

To illustrate with real code written in the same style and measured the same way, is 95 lines of code whereas is 99 lines. Among the two tree types, the outer wrapper classes are identical, the node fields differ by one member, the node add() methods are nearly the same except for the final rebalancing, and the node remove() methods have an identical prefix. For AVL trees, add() and remove() have the same rebalancing logic at the end. Whereas for AA trees, add() has a simple rebalance at the end, but remove() has relatively long and difficult-to-prove rebalancing logic. The AA tree’s skew() and split() are short methods, and are comparable to the AVL tree’s rotateLeft() and rotateRight() in terms of conceptual complexity. The AVL tree has some complexity in the balance() method, though all the tree-balancing logic has left-right symmetry (absent from AA trees), which reduces the cognitive burden.

There is more to an algorithm / data structure than just the implementation code. For any piece of non-trivial logic, it’s necessary to mathematically prove the correctness of said code to ensure that it will always behave as intended. The proof is not always explicitly published alongside the implementation, but it still affects how easy or hard a human would consider the code to be. In this light, analyzing and proving each case for AA tree rebalancing, especially for deletion, was quite difficult based on my experience. By contrast, proving the AVL tree cases was straightforward and involved much less work. So even though the code for AA trees is slightly shorter than AVL trees, the AA tree’s more difficult proof suggests that it should be easier to create a robust AVL tree implementation and to modify it for custom applications.

If a tree is designed so that auxiliary data is maintained in each tree node (e.g. size of the subtree), then a node needs to be updated whenever its children change. Adding this update logic is easy for AVL trees because the tree rotations already call an existing updater method in 6 places, so only one function needs to be changed. Whereas for AA trees, there is no existing updater method, hence it’s necessary to add one method and roughly 6 calls, after which the code becomes longer than AVL trees. (Moreover, figuring out all the correct places to call the updater is a potential source of bugs, and there is no existing test case coverage – unlike my AVL tree code.)

Overall, I recommend AVL trees over AA trees because the code is negligibly longer, the underlying math is easier to prove, and it’s easier to augment nodes with extra information and recompute it correctly as the tree gets rebalanced.

More info