Large arrays proposal for Java

Introduction

Java’s native array types use a 32-bit int for length and indexing. This means the maximum length of each array is 2³¹ − 1 ≈ 2 billion elements.

The design decision made sense in the first decade of Java’s life (starting 1995), when 32-bit CPUs were the norm and memory was under 1 GiB. But now in 2016, with 64-bit CPUs ubiquitous in desktops and laptops and even consumer-grade PCs having at least 4 GiB of RAM, this array size restriction makes it difficult to continue writing simple Java code that operates on ever-growing datasets like text and images.

In this article I will sketch a no-frills and practical way to modify the Java programming language to add support for arrays with 64-bit addressing, while maintaining compatibility with existing code.

Current workarounds

If you need to work with huge lists of items today and can’t wait for the Java specification to evolve, you can take one of these approaches:
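For example, one such approach is to split the data across several int-indexed arrays and address the elements with a single long index. Below is a minimal sketch of this idea, assuming a fixed chunk size; the class and method names are purely illustrative:

public final class LongByteArray {
    private static final int CHUNK_SIZE = 1 << 30;  // Number of elements per chunk (2^30 here)
    private final byte[][] chunks;
    private final long length;

    public LongByteArray(long length) {
        this.length = length;
        int numChunks = (int)((length + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new byte[numChunks][];
        for (int i = 0; i < numChunks; i++) {
            // The last chunk may be shorter than CHUNK_SIZE
            long remaining = length - (long)i * CHUNK_SIZE;
            chunks[i] = new byte[(int)Math.min(remaining, CHUNK_SIZE)];
        }
    }

    public long length() {
        return length;
    }

    public byte get(long index) {
        return chunks[(int)(index / CHUNK_SIZE)][(int)(index % CHUNK_SIZE)];
    }

    public void set(long index, byte value) {
        chunks[(int)(index / CHUNK_SIZE)][(int)(index % CHUNK_SIZE)] = value;
    }
}

The obvious drawback of any such wrapper is that it cannot be passed to existing APIs that expect a plain byte[], which is exactly the friction this proposal aims to remove.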


My proposal

Quick summary:

The terminology I will use is as follows:

Changes to Java virtual machine and class bytecode

Firstly, to preserve compatibility with existing code, old-style small arrays must behave exactly the same way as they always have. In particular, the length and index must still be int values.

We would add a new parameterized type for large arrays, to distinguish them from small arrays. We would also add new JVM bytecode instructions to create a large array, get its length, get an element, and set an element. All four of these instructions take a long value as the length or index.

Internally, though, the JVM can use the same memory layout for small and large arrays. For example, the reference to an array can be a raw pointer to its zeroth element, and the 8 bytes in front of this address can hold the 64-bit value that denotes the number of elements in the array (even if the array is an old-style small one). Although the raw representation of large and small arrays is the same, it is extremely important to keep the two segregated at the conceptual level. Existing code that works with small arrays uses int for length and indexing, so it cannot address all the elements of a large array. Therefore it would be wrong to pass a large array into a legacy variable/field/argument that expects a small array.

However, we can relax the strict segregation between small and large arrays with two conversion operations. Firstly, it is always okay to pass a small array into a variable that expects a large array, because the new-style code uses long for indexing and length, so it can correctly address all of the array’s elements. Secondly, we should be able to explicitly cast a large array object to a small array object, by first checking whether its length is at most Integer.MAX_VALUE and then either allowing the conversion or throwing a new kind of runtime exception (e.g. ArrayLengthExceedsIntMaxException). Hence, we can pass large arrays into legacy small-array code if and only if the array length is small enough to fit in an int. This is why it is much preferable to implement small and large arrays with the same memory layout inside the JVM.
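As a rough illustration, the run-time check that the proposed large-to-small cast would perform could look like the following, expressed in today’s Java. The exception class is the one hypothesized above and does not exist in any current JDK; the helper class and method names are likewise illustrative only.

class ArrayLengthExceedsIntMaxException extends RuntimeException {
    ArrayLengthExceedsIntMaxException(long length) {
        super("Array length " + length + " does not fit in an int");
    }
}

final class ArrayNarrowing {
    // The check that a large-to-small array cast would perform at run time.
    // Because small and large arrays share the same memory layout, a successful
    // cast does not need to copy or convert anything.
    static int checkedSmallLength(long largeArrayLength) {
        if (largeArrayLength > Integer.MAX_VALUE)
            throw new ArrayLengthExceedsIntMaxException(largeArrayLength);
        return (int)largeArrayLength;
    }
}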

Miscellaneous notes:

Changes to Java programming language syntax

There are two aspects to changing the Java programming language: One is to add the new syntax for large arrays, and the other is to change the mapping between source code constructs and bytecode constructs. The latter is straightforward given the JVM change description above, so we will only focus on the syntax part.

Let me begin by saying that I don’t know what the best (or even a reasonable) syntax for large arrays is, but I do know that the underlying feature will work correctly even if the syntax happens to be ugly.

To mesh nicely with small array types, we need to consider the three existing major syntax cases: declaring an array type, creating a new array object, and indexing an element.

One idea for syntax is to use double square brackets for types:

byte[[]] a;  // Large array type
a = new byte[[4_000_000_000L]];  // New large array object

// Indexing does not need double square brackets
// because we can define the []-operator to be overloaded
System.out.println(a[3_999_999_999L]);

// Explicitly cast from large to small array for legacy methods.
// Casting may throw ArrayLengthExceedsIntMaxException.
System.in.read((byte[])a);

Alternative syntaxes

The syntax above appears to be a reasonable proposal. The only risk is that if Java ever adopts square brackets for an array literal syntax (like Python and JavaScript), it could create a syntactical ambiguity:

// Hypothetical array literal syntax
byte[] x = [];
byte[] y = [0];
byte[] z = [5, 7];

// Now does this line mean "create small-array byte[] with
// literal array value [0] as the length" (type error anyway),
// or does it mean "create large-array byte[[]] with length 0"?
foo(new byte[[0]]);

So if the array literal ambiguity is a problem, then here are some brief and uglier proposals I imagined:


Aside: Comparison with C/C++

One of the strengths of C and C++ is that they adapt their integer type widths to the native sizes supported by the machine/platform. For example, size_t maps to uint32 on a 32-bit platform or uint64 on a 64-bit platform. This makes it easier to write code that both runs efficiently (e.g. avoiding unnecessary 64-bit indexing on 32-bit machines) and can address the largest arrays available.

But this capability comes at a large burden to the programmer, because code that uses size_t, long, or similar types must behave correctly not only on your current platform but on all possible platforms. You need to carefully consider value ranges, arithmetic overflow, over-long shifts, sizes of fields in a struct, etc. This dramatically increases the effort and attention to detail needed to write completely correct and portable code.

Thus, to keep the Java programming platform sane and non-broken, we cannot and must not change int from 32 to 64 bits. Legacy code will still only be able to access 31-bit (sic) small arrays, and only new code (by explicitly opting in) will be able to access 63-bit (sic) large arrays, using long variables as loop counters and indexes.

Side proposal: Array slicing

I find that the use of offsets and sub-lengths to accomplish array slicing in Java is verbose (the boilerplate int off and int len variables) and error-prone (no implicit bounds checking for slices). Other languages like Python support list slicing natively. Because we are already proposing to add a new native array type to Java with new (but very familiar) semantics, now might be a good time to add proper array slicing support to Java too. Note that the slicing proposal is orthogonal to the large arrays proposal, and it can be implemented whether or not large arrays are implemented.

The syntax for Java array slicing can be like Python – for example:

int[] x = new int[100];
int[] y = x[5 : 25];  // Classic notation: off = 5, len = 20
// Note that taking a slice incurs an immediate
// bounds check to detect problems early

System.out.println(y.length);  // "20", not "100"
// y is bounds-checked, so accessing y[-1] or y[20]
// is illegal, even though the hidden underlying
// array x does have those elements.

As for the JVM internals, let’s assume that currently a reference to an array is simply a native pointer to the zeroth element. To achieve the slicing proposal, we need to change this scalar reference into a tuple of {void *originalBase, void *sliceBase, uint64 sliceLength} (in C notation). Clearly we need sliceBase to access the offsetted array and sliceLength to check against the new bounds. The less obvious part is that we need a hidden pointer to the original array base so that the JVM can track reference marks for garbage collection, do monitor synchronization, invoke methods like getClass(), etc.
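To make that tuple concrete, here is a rough emulation of the proposed slice representation in today’s Java, shown for the int element type only; the class is purely illustrative and not part of the proposal:

public final class IntSlice {
    private final int[] original;  // Plays the role of originalBase: keeps the whole array reachable for GC
    private final int offset;      // Plays the role of sliceBase, as an element offset instead of a pointer
    private final int length;      // Plays the role of sliceLength

    public IntSlice(int[] array, int from, int to) {
        // Taking a slice incurs an immediate bounds check to detect problems early
        if (from < 0 || to > array.length || from > to)
            throw new ArrayIndexOutOfBoundsException("Bad slice [" + from + " : " + to + "]");
        original = array;
        offset = from;
        length = to - from;
    }

    public int length() {
        return length;
    }

    public int get(int index) {
        // Out-of-range indexes are illegal even though the underlying array may have those elements
        if (index < 0 || index >= length)
            throw new ArrayIndexOutOfBoundsException(index);
        return original[offset + index];
    }

    public void set(int index, int value) {
        if (index < 0 || index >= length)
            throw new ArrayIndexOutOfBoundsException(index);
        original[offset + index] = value;
    }
}

For instance, with int[] x = new int[100], the slice new IntSlice(x, 5, 25) has length 20 and rejects any index outside 0 to 19, mirroring the example above.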

Notes:

Related work

I am far from the first person to suggest large / long-based arrays in Java. Here are a handful of other people who have written on the topic: