Compact hash map (Java)
Introduction
When storing a large number of objects in memory, Java has poor space efficiency compared to lower-level languages like C and C++. The reasons are many: primitives like int must be boxed as objects when used in the collections framework, strings are stored in UTF-16, every object has a header and must be reached via a pointer, linked list and tree nodes cannot store the payload data inline but must hold a pointer to the object (whereas C++ templates can store data inline), etc.
Because of these internal details of Java and its libraries, using something like HashMap<String,Integer> can consume 4× the theoretical minimum storage requirement. This becomes a problem when millions of objects need to be stored in RAM and this consumption approaches the JVM's or the computer's memory limit.
Here I provide a much more memory-efficient implementation of the map/dictionary abstract data structure based on a hash table. It comes with certain caveats and takes more coding work to use, but it can provide huge wins in memory-constrained situations.
Source code
Compact hash map code:
- CompactMapTranslator.java: The interface that users must implement.
- CompactHashMap.java: The core data structure library classes.
- CompactHashMapTest.java: A JUnit test suite.
- CompactHashMapDemo.java: A runnable main program for benchmarking.
There are a few basic ideas behind how the compact hash map works:
- Every key-value pair is serialized as a byte array (byte[]) and stored in the hash table array. The original objects given to the put(K, V) method are forgotten. The keys and values are deserialized on the fly back into objects when retrieved.
- The translator object is a stateless functor that provides the necessary methods to make the compact hash map work. It serializes a key-value pair into a byte array, deserializes a byte array back to a key and value, and provides miscellaneous required functionality (see the sketches after this list).
- The hash table uses open addressing instead of separate chaining, which saves the cost of node objects and next pointers.
- The hash value of each key is not cached (unlike java.util.HashMap.Entry), which saves space but increases the CPU time when the hash table is resized.
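To make the translator idea concrete, here is a rough sketch of what a translator for 10-character string keys and int values might look like. The real interface is defined in CompactMapTranslator.java; the class and method names below are illustrative assumptions, not the library's actual API.

```java
import java.nio.charset.StandardCharsets;

// Rough sketch of a translator for 10-character string keys and int values.
// The real interface is defined in CompactMapTranslator.java; the class and
// method names here are illustrative assumptions, not the library's actual API.
final class StringIntTranslator {

    // Pack a key-value pair into one byte array: the UTF-8 bytes of the key
    // followed by the value in big-endian order.
    public byte[] serialize(String key, int value) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        byte[] packed = new byte[keyBytes.length + 4];
        System.arraycopy(keyBytes, 0, packed, 0, keyBytes.length);
        packed[keyBytes.length + 0] = (byte)(value >>> 24);
        packed[keyBytes.length + 1] = (byte)(value >>> 16);
        packed[keyBytes.length + 2] = (byte)(value >>>  8);
        packed[keyBytes.length + 3] = (byte)(value >>>  0);
        return packed;
    }

    // Recover the key: everything except the trailing 4 value bytes.
    public String deserializeKey(byte[] packed) {
        return new String(packed, 0, packed.length - 4, StandardCharsets.UTF_8);
    }

    // Recover the value: the trailing 4 bytes, big-endian.
    public int deserializeValue(byte[] packed) {
        int n = packed.length;
        return (packed[n - 4] & 0xFF) << 24
             | (packed[n - 3] & 0xFF) << 16
             | (packed[n - 2] & 0xFF) <<  8
             | (packed[n - 1] & 0xFF);
    }

    // The real translator also supplies the miscellaneous functionality the map
    // needs, such as hashing a key and testing whether a packed entry matches it.
}
```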
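And here is a minimal sketch of the open-addressing idea: a lookup is a linear-probing walk over a flat array of serialized entries, with no node objects or next pointers, and the key's hash is recomputed rather than cached. Again, this is an illustration under assumed names, not the actual CompactHashMap code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of open-addressing (linear probing) lookup over a table of packed
// entries. Illustrative only; CompactHashMap.java contains the real logic.
final class ProbingSketch {
    private byte[][] table = new byte[16][];  // each slot holds one packed entry, or null if empty

    // Look up a key; returns null if absent.
    Integer get(String key) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int index = (hash(keyBytes) & 0x7FFFFFFF) % table.length;
        // Walk forward from the hashed slot until the key or an empty slot is
        // found. There are no node objects and no next pointers to follow.
        while (table[index] != null) {
            byte[] entry = table[index];
            if (Arrays.equals(Arrays.copyOf(entry, entry.length - 4), keyBytes)) {
                // Deserialize the value on the fly (trailing 4 bytes, big-endian).
                return (entry[entry.length - 4] & 0xFF) << 24
                     | (entry[entry.length - 3] & 0xFF) << 16
                     | (entry[entry.length - 2] & 0xFF) <<  8
                     | (entry[entry.length - 1] & 0xFF);
            }
            index = (index + 1) % table.length;
        }
        return null;
    }

    // Hash the raw key bytes; recomputed on every lookup and resize since it is not cached.
    private static int hash(byte[] keyBytes) {
        int h = 0;
        for (byte b : keyBytes)
            h = h * 31 + (b & 0xFF);
        return h;
    }
}
```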
All of these design choices make the data structure more memory-efficient at the expense of consuming more CPU time and having more complex handling logic (more lines of code to verify and maintain).
If the Java language evolves to add features like value types and generics specialization (under the Project Valhalla umbrella) and existing libraries are retrofitted to reduce objects and save space, then my compact data structure implementations become less useful.
As an extra, the compact hash map code can be reduced and specialized to implement a compact hash set.
License: MIT (open source)
Benchmark
The demo program was run in “regular” mode and in “compact” mode in the same system environment, each run continuing until the program terminated with an OutOfMemoryError. This was tested on Windows XP SP3 32-bit, Oracle Java 1.7.0_25, memory limit -mx300M. The data being stored is pairs of (random 10-character ASCII string, random signed 32-bit integer), which has a theoretical minimum size of 14 bytes per pair.
We see that the compact implementation can store 2.4× as much data as the regular implementation while using the same amount of memory. Note that after the first few minutes, the benchmark runs slower and slower as the JVM garbage collector executes more and more frequently.
Quantity | Regular hash map | Compact hash map |
---|---|---|
Count (number of key-value pairs added to the map; an internal measurement reported on standard output) | 2 990 755 | 7 399 340 |
Memory (amount of memory used, as measured internally by the program and reported on standard output) | 290.00 MiB | 290.00 MiB |
Mem usage (an external measurement reported by Windows Task Manager) | 354.42 MiB | 312.85 MiB |
VM size (an external measurement reported by Windows Task Manager) | 367.81 MiB | 326.21 MiB |
CPU time (as reported by Windows Task Manager; this gives an idea of how long the benchmark takes. Over 80% of the time is spent on garbage collection anyway, so these numbers do not reflect the speed of the map implementations) | 44 s | 260 s |
Analysis
Exactly how does CompactHashMap use memory compared to the standard java.util.HashMap? Let’s look at the internal data structures and count the bytes in the data fields.
To make the example concrete, assume that we are storing 10-character ASCII strings as keys and 32-bit integers as values. Also assume that pointers are 4 bytes and each object has 4 bytes of overhead for a class pointer.
Compact hash map
- Map entry:
  - (String s, int k): Serialized into byte[] as {s in UTF-8, k in big endian} = 10 + 4 bytes = 14 bytes (see the sketch after this list).
  - byte[]: 14-byte payload, int length, object header = 14 + 4 + 4 bytes = 22 bytes.
- CompactHashMap:
  - At the recommended load factor of 0.5, the hash table array length is about double the number of entries stored. Each array slot is a 4-byte pointer to the byte[] data. Thus on average 8 bytes are used per entry due to the free space provisioning.
- Total:
  - Storage per entry: 22 (byte[] object) + 8 (pointers in hash table) = 30 bytes.
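The 14-byte payload figure can be checked with a few lines of Java. This is only a sanity check of the format described above (UTF-8 key bytes followed by the value in big-endian order), not code from the library; the class name is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sanity check of the per-entry payload size: a 10-character ASCII key plus a
// 32-bit value serializes to 14 bytes. The class name is illustrative only.
public final class EntrySizeCheck {
    public static void main(String[] args) {
        String key = "abcdefghij";   // 10 ASCII characters -> 10 bytes in UTF-8
        int value = 123456789;       // 32-bit integer -> 4 bytes

        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        byte[] packed = ByteBuffer.allocate(keyBytes.length + 4)
            .put(keyBytes)
            .putInt(value)   // ByteBuffer defaults to big-endian byte order
            .array();

        System.out.println(packed.length);  // Prints 14
    }
}
```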
Standard hash map
- Map entry data:
  - char[] (inside String): 10-char payload, int length, object header = 20 + 4 + 4 bytes = 28 bytes.
  - String (Java 1.7.0_06 or above): Pointer to char[], int hash, int hash32, header = 4 + 4 + 4 + 4 bytes = 16 bytes.
  - Integer: int value, header = 4 + 4 bytes = 8 bytes.
- HashMap.Entry:
  - Pointer to key, pointer to value, int hash, pointer to next entry, header = 4 + 4 + 4 + 4 + 4 bytes = 20 bytes.
- HashMap:
  - At the default load factor of 0.75, the hash table array length is at least 4/3 times the number of entries stored. Each array slot is a 4-byte pointer to a HashMap.Entry object. Thus on average at least 5.33 bytes are used per entry due to the free space provisioning.
- Total:
  - Storage per entry: 28 (char[]) + 16 (String) + 8 (Integer) + 20 (HashMap.Entry) + 5 (pointers in hash table) = 77 bytes (see the arithmetic sketch after this list).
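Putting the two estimates side by side, the byte counts above can be written down as a tiny program. All field sizes are the assumed 32-bit figures used in this analysis, and the resulting 77 / 30 ≈ 2.6× ratio is in the same ballpark as the 2.4× capacity gain observed in the benchmark.

```java
// Per-entry memory estimates from the analysis above (assumed 32-bit JVM sizes).
public final class PerEntryEstimate {
    public static void main(String[] args) {
        // Compact hash map: one byte[] per entry plus its share of table slots.
        int packedArray  = 14 + 4 + 4;       // payload + array length field + object header = 22
        int compactSlots = 2 * 4;            // ~2 table slots of 4 bytes each at load factor 0.5
        int compactTotal = packedArray + compactSlots;             // 30 bytes

        // Standard HashMap<String,Integer>: several objects per entry.
        int charArray = 2 * 10 + 4 + 4;      // UTF-16 chars + length field + header = 28
        int string    = 4 + 4 + 4 + 4;       // char[] pointer + hash + hash32 + header = 16
        int integer   = 4 + 4;               // int value + header = 8
        int entry     = 4 + 4 + 4 + 4 + 4;   // key, value, hash, next fields + header = 20
        int regularSlots = 5;                // ~1.33 table slots of 4 bytes each at load factor 0.75, rounded down
        int regularTotal = charArray + string + integer + entry + regularSlots;  // 77 bytes

        System.out.println("Compact: " + compactTotal + " bytes/entry");   // 30
        System.out.println("Regular: " + regularTotal + " bytes/entry");   // 77
        System.out.printf("Ratio: %.1f%n", (double)regularTotal / compactTotal);  // 2.6
    }
}
```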