Project Nayuki


Large arrays proposal for Java

Introduction

Java’s native array types use a 32-bit int for length and indexing. This means the maximum length of each array is 231 − 1 ≈ 2 billion elements.

The design decision made sense in the first decade of Java’s life (starting 1995), where 32-bit CPUs were the norm and memory was under 1 GiB. But in 2016 now that 64-bit CPUs are ubiquitous in desktops and laptops, and even consumer-grade PCs have at least 4 GiB of RAM, this array size restriction makes it difficult to continue writing simple Java code to operate on ever-growing datasets like text and images.

In this article I will sketch a no-frills and practical way to modify the Java programming language to add support for arrays with 64-bit addressing, while maintaining compatibility with existing code.

Current workarounds

If you need to work with huge lists of items today and can’t wait for the Java specification to evolve, you can choose to take one of these approaches:

  • (Safe, low-level) Use an array of arrays. For example, new byte[16][1073741824] will hold just over 16 billion elements. If the inner array has a length that is a power of 2, then you can use masking and shifting to derive the two indices.

  • (Safe, high-level) Write a generic collection class like MyLargeArray<E> that provides the methods long size(), E get(long index), and void set(long index, E value). The underlying data storage can use an array of arrays as noted above, or it can use linked lists, balanced trees, etc.

  • (Unsafe, Java-ish) Use the sun.misc.Unsafe class to allocate a large block of memory, then use direct memory addressing to read and write elements. This is highly unsafe because you need to check array bounds manually and compute memory addresses carefully. Moreover it is not portable and will not work outside of Oracle/Sun JVM implementations. But once you get it up and running, this solution is fast (no double indexing) and concise (not much structural code).

  • (Unsafe, native) A more correct but effortful solution is to use the Java Native Interface. Specifically, you write a wrapper class like MyLargeFloatArray, provide a {constructor, length method, element getter, element setter}, then switch to native C code to implement the memory allocation and array indexing.


My proposal

Quick summary:

  • In the JVM, add large-array bytecode instructions alongside existing small-array instructions.

  • Strictly preserve the existing behavior of small arrays and don’t extend them.

  • Carry over the (unfortunate) type covariance behavior to large arrays for consistency.

  • Internally make all arrays large, and provide new conversion (casting) operations.

  • In the JLS, add a different syntax for large array types, implicit conversion from small to large array, and explicit conversion from large to small array.

The terminology I will use is as follows:

  • “Small array” refers to the current specification of arrays in the JVM and JLS. The length and index are values of type int.

  • “Large array” refers to my proposed improvement to the Java language and virtual machine. The length and index are values of type long. The length of each large array doesn’t need to exceed Integer.MAX_VALUE; the large array types merely allow the length to exceed the int limit.

  • By default, small and large array types are not interchangeable. You cannot give a large array reference to a variable that is declared as a small array, or vice versa. However, these restrictions will be carefully relaxed in the detailed explanation.

Changes to Java virtual machine and class bytecode

Firstly to preserve compatibility with existing code, old-style small arrays must behave exactly the same way as they always did. In particular, the length and index must still be int values.

We would add a new parameterized type for large arrays, to distinguish them from small arrays. We add new JVM bytecode instructions to create such a large array, get its length, get an element, and set an element. All four of these instructions work with a long value as the length or index.

Speaking internally though, the JVM can use the same memory layout for small and large arrays. For example, the reference to an array can be a raw pointer to its zeroth element, and the 8 bytes in front of this address can hold the 64-bit value that denotes the number of elements in the array (even if the array is an old-style small one). Although the raw representation of large and small arrays is the same, it is extremely important to keep the two segregated at the conceptual level. Existing code that works with small arrays will use int for length and indexing, so it cannot address all the elements of a large array. Therefore it would be wrong to pass a large array into a legacy variable/field/argument that expects a small array.

However, we can relax the strict segregation between small and large arrays with two conversion operations. Firstly, it is always okay to pass a small array into a variable that expects a large array, because the new-style code will use long for indexing and length, so it can correctly address all its elements. Secondly, we should be able to explicitly cast a large array object into a small array object by first checking if its length <= Integer.MAX_VALUE, and then either allowing the conversion or throwing a new kind of runtime exception (e.g. Array­Length­Exceeds­Int­Max­Exception). Hence, we can pass large arrays into legacy small-array code if and only if the array size is small enough to fit an int. This is why it is much preferred to internally implement small and large arrays using the same memory layout in the JVM.

Miscellaneous notes:

  • The two conversions are inherently asymmetric – casting from small to large array is a free no-op, but casting from large array to small array incurs a runtime check and a potential exception.

  • This proposal (unlike some other ones) does not attempt to extend small arrays with a .longLength field or something of that nature, and large arrays don’t have a legacy .intLength field. Instead we start off by introducing a completely separate typeclass and then reconcile large and small arrays only to the maximum degree allowed.

  • No special consideration is needed for multidimensional arrays. In particular, it is permissible to mix small and large array types among the different dimensions.

Changes to Java programming language syntax

There are two aspects to changing the Java programming language: One is to add the new syntax for large arrays, and the other is to change the mapping between source code constructs and bytecode constructs. The latter is straightforward given the JVM change description above, so we will only focus on the syntax part.

Let me begin by saying that I don’t know what is the best (or even reasonable) syntax for large arrays, but I do know that the underlying feature will work correctly even if the syntax happens to be ugly.

To mesh nicely with small array types, we need to consider the three existing major syntax cases:

  • Variable declaration:
    byte[] a;
    byte[][] b;
    byte c[];    // An unfortunate legacy
    byte[] d[];  // An unfortunate legacy
    class Foo<E> { Set<E>[] e; }
    static void bar(byte[] f) { }
    // Note: We will keep method varargs as small arrays.
  • Array creation:
    g = new int[5]; // Length given, fill with zeros
    h = new int[]{2, 7, 1}; // No length, fill with values
  • Array indexing:
    g[0] = h[1];

One idea for syntax is to use double square brackets for types:

byte[[]] a;  // Large array type
a = new byte[[4_000_000_000L]];  // New large array object

// Indexing does not need double square brackets
// because we can define the []-operator to be overloaded
System.out.println(a[3_999_999_999L]);

// Explicitly cast from large to small array for legacy methods.
// Casting may throw ArrayLengthExceedsIntMaxException.
System.in.read((byte[])a);

Alternative syntaxes

The syntax above appears to be a reasonable proposal. The only risk is that if Java ever adopts square brackets for an array literal syntax (like Python and JavaScript), it could create a syntactical ambiguity:

// Hypothetical array literal syntax
byte[] x = [];
byte[] y = [0];
byte[] z = [5, 7];

// Now does this line mean "create small-array byte[] with
// literal array value [0] as the length" (type error anyway),
// or does it mean "create large-array byte[[]] with length 0"?
foo(new byte[[0]]);

So if the array literal ambiguity is a problem, then here are some brief and uglier proposals I imagined:

  • Use braces like byte{} a;. This could cause syntactical headaches and ambiguity with the two existing uses of braces, which are lexical scopes and array literals. Also let’s try not to mimic C++11’s syntax too eagerly.

  • Use a decoration symbol inside the brackets, like byte[*] b or byte[!] c. This would work fine for the type specification, but would hurt in the instantiation expression because new byte[*8] and new byte[!2] are ugly and possibly ambiguous.

  • Use a decoration symbol outside, like byte*[] d. But this makes the symbol look detached from the array notation, hurts for multidimensional arrays, and contradicts the meaning in the C programming language (for programmers who also work in C).

  • Use generic class notation like LargeArray<byte> e and new LargeArray<byte>(9). This essentially means no new syntax scheme, but is very bad in numerous ways. These pseudo-generics behave differently from ordinary generics, because arrays are implicitly covariant while generics are invariant, and because the type parameter can take primitive types.

  • Give up on syntax entirely and create a bunch of new classes like LargeByteArray, LargeFloatArray, etc., and LargeObjectArray<E>. This sort of style, creating one class per primitive type, has precedent in examples like java.nio.ShortBuffer and java.util.concurrent.atomic.AtomicInteger. However, it doesn’t address the problem of implicit covariance in array types.


Aside: Comparison with C/C++

One of the strengths of C and C++ is that it adapts its integer type widths to the native sizes supported by the machine/platform. For example, size_t maps to uint32 on a 32-bit platform or uint64 on a 64-bit platform. This makes it easier to write code that both runs efficiently (e.g. avoiding unnecessary 64-bit indexing on 32-bit machines) and can address the largest arrays available.

But this capability comes at a large burden to the programmer. Because now if you write code using size_t or long or similar, you need to ensure that it behaves correctly not only for your current platform but on all possible platforms. You need to carefully consider value ranges, arithmetic overflow, over-long shifts, sizes of fields in a struct, etc. It dramatically increases the effort and attention to detail needed to write completely correct and portable code.

Thus to keep the Java programming platform sane and non-broken, we cannot and must not change int from 32 to 64 bits. Legacy code will still only be able to access 31-bit(sic) small arrays, and only new code (by explicitly opting in) will be able to access 63-bit(sic) large arrays using long variables as loop counters and indexes.

Side proposal: Array slicing

I find that the use of offsets and sub-lengths to accomplish array slicing in Java is verbose (the boilerplate int off and int len variables) and error-prone (no implicit bounds checking for slices). Other languages like Python support list slicing natively. Because here we are proposing to add a new native array type to Java with new (but very familiar) semantics, now might be a good time to add proper array slicing support to Java too. Note that the slicing proposal is orthogonal to the large arrays proposal, and it can be implemented whether or not large arrays are implemented.

The syntax for Java array slicing can be like Python – for example:

int[] x = new int[100];
int[] y = x[5 : 25];  // Classic notation: off = 5, len = 20
// Note that taking a slice incurs an immediate
// bounds check to detect problems early

System.out.println(y.length);  // "20", not "100"
// y is bounds-checked, so accessing y[-1] or y[20]
// is illegal, even though the hidden underlying
// array x does have those elements.

As for the JVM internals, let’s assume that currently a reference to an array is simply a native pointer to the zeroth element. To achieve the slicing proposal, we need to change this scalar reference into a tuple of {void *originalBase, void *sliceBase, uint64 sliceLength} (in C notation). Clearly we need sliceBase to access the offsetted array and sliceLength to check against the new bounds. The less obvious part is that we need a hidden pointer to the original array base so that the JVM can track reference marks for garbage collection, do monitor synchronization, invoke methods like getClass(), etc.

Notes:

  • This slicing proposal is almost fully backward-compatible with legacy code that has no awareness of array slices. When a variable is given an array slice, the code simply sees a smaller array, and it is still bounds-checked precisely.

  • Two incompatibilities I can think of may or may not impact real-world code. When slicing is disallowed, we can be sure that if array x != y then none of their elements overlap. But with slicing this property is violated, and it could affect some niche code that depends on it. This problem is not fixable because no matter how we define the == operator, because some previously implied property will be violated. For example if we only compare the original base pointer instead of the slice base pointer, then x == y does not imply that x[0] == y[0]. Along the same lines, defining hashCode() has the same problem as ==, and the notion of synchronizing on an array slice becomes highly dubious.

  • Slicing by reference is more useful than slicing by value (i.e. cloning a subarray, which is what Python’s list does), because it allows slices to be modified by a subroutine and seen by the caller. But one danger is that after taking a short slice of a big array, this prevents the rest of the array from being garbage-collected. (This is a problem in the real world – the old implementation of java.lang.String.substring() previously used integer slicing, but now it copies the subrange to a new array.)

Related work

I am by far not the first person to suggest large / long-based arrays in Java. Here is a handful of other people who wrote on the topic:

  • Proposal: Large arrays (and take two) by James Lowden in 2009 for Project Coin. Detailed and thorough, and overlaps with my proposal in many places. It explores more about the consequences on the bytecode format and standard library, but doesn’t cover small vs. large array conversions or JVM internal representation.

  • Arrays [ 2.0 64 ] – opportunities and challenges, a slideshow by Oracle that proposes many wonderful improvements to arrays, but unfortunately is complex and unlikely to achieve consensus and implementation.

  • Numerous tickets in the JDK Bug Database with no substantial details provided: #4880587, #8061420, #6292967.