Project Nayuki


Unsigned int considered harmful for Java

Introduction

Every now and then, a developer comes along and complains about the lack of unsigned integer types in Java, demanding that they be implemented in the language. Oftentimes it’s a novice coming from C/C++ or C# who has worked with unsigned types before, or one who wants to guarantee no negative numbers in certain situations. Said novice C/C++/C# programmer typically does not fully understand the semantics and ramifications of unsigned integers in C/C++/C# to begin with.

While on the surface this feature request seems reasonable – extra types will be added but one can choose to ignore them if not explicitly needed – it would actually cause deep problems for all Java developers whether they like it or not. Moreover, arithmetic operations on unsigned integers can already be quite easily and efficiently emulated using signed integers, as proven in practice in numerous low-level libraries. Here I will argue why Java supporting unsigned integer types would be not only unnecessary but also harmful.

Emulating unsigned arithmetic

Let’s assume that for whatever reason, you must perform unsigned integer arithmetic in Java. Maybe you’re writing a cryptographic algorithm, decoding a multimedia stream, or implementing a file format / network protocol where data fields are specified as unsigned.

Straightforward emulation

The conceptually simple way to emulate unsigned arithmetic using only Java’s signed integer types is to widen the number while masking off the new high bits, perform the arithmetic, and narrow down the result.

Suppose we have this C/C++ code:

uint16_t a = (...);  // Input
uint16_t b = (...);  // Input
uint16_t c = a + b;  // Output

Then this is one way to make it work in Java:

short a = (...);           // Treat bits as unsigned value
short b = (...);           // Treat bits as unsigned value
int wideA = a & 0xFFFF;    // This value is now correct
int wideB = b & 0xFFFF;    // This value is now correct
int temp = wideA + wideB;  // Wide result
short c = (short)temp;     // Discard top bits; treat bits as unsigned value

Or more compactly (Java):

short a = (...);  // Input
short b = (...);  // Input
short c = (short)((a & 0xFFFF) + (b & 0xFFFF));

Notes:

  • In a compound expression like (a + b) / (c * d), the widenings and narrowings must be performed per operation, otherwise the result may be incorrect.

  • To operate on two uint32_t values, we need to widen them to Java long using val & 0xFFFFFFFFL. Unfortunately this technique does not allow us to operate on two uint64_t values because there is no wider signed type in Java; we’d need to step up to the clumsy BigInteger library.

  • When we say “treat bits as unsigned value”, it refers to the fact that we can put a fixed-width sequence of n bits into an integer variable without caring about the meaning of its value. At least, this usage is acceptable for unsigned and two’s complement integers.

    (This assumption of an integer being a plain sequence of bits may not be true in sign-magnitude, ones’ complement, and floating-point because of problems like negative zero, trap values, collapsed NaN values, and flushing subnormals to zero.)

Efficient emulation

The Java language specification requires its signed integers to be represented in two’s complement format. Because of this, many basic operations are exactly the same whether the integer type is signed or unsigned. For example, the sum of two 8-bit integers giving an 8-bit result will have the same bit pattern whether all three of these values are treated as signed or unsigned. The same goes for subtraction, multiplication, left shift, equality, and of course bitwise AND/OR/XOR/NOT operations. The only ones that are signedness-sensitive are division, remainder, right shift, inequalities, casting, and string conversion. To summarize:

  • These operations are the same: +, -, *, ~, &, |, ^, <<, ==, !=, down-casting.

  • These operations are different: /, %, >>, <, <=, >, >=, up-casting, parseInt(), toString().

Java’s advantage of having mostly unified signed/unsigned arithmetic operations (from a bit-level standpoint) is not enjoyed by C/C++ users. This is because signed integers can be implemented on the machine in two’s complement, ones’ complement, or sign-magnitude, and the C/C++ standards allow the low-level detail of the bit format to be exposed to the programmer. Furthermore, while signed overflow is silent and well-defined in Java, it is undefined behavior in C/C++ and leads to all sorts of nasty problems. (In fact, if you stretch the argument you could say that if you wanted safe and reliable two’s complement signed overflow in C/C++, you should emulate the signed arithmetic using unsigned arithmetic – the reverse situation of what’s being argued here for Java.)

Continuing the previous example, this is the efficient Java translation:

short a = (...);           // Input, treat bits as unsigned value
short b = (...);           // Input, treat bits as unsigned value
short c = (short)(a + b);  // Output, treat bits as unsigned value

We can ignore the mandatory widening from 16 bits to 32 bits (promotion from short to int with sign extension), and we can ignore the arithmetic that occurs in the high 16 bits because it has no effect at all on the low 16 bits.

This technique is even easier for 32- and 64-bit unsigned arithmetic because no narrowing in Java is needed. C/C++ example:

uint32_t a = (...);  // Input
uint32_t b = (...);  // Input
uint32_t c = a * b;  // Output

Java translation (look ma, no extra effort!):

int a = (...);  // Input, treat bits as unsigned value
int b = (...);  // Input, treat bits as unsigned value
int c = a * b;  // Output, treat bits as unsigned value

As for the operations that behave differently for signed versus unsigned types, let’s discuss how to emulate them without resorting to widening:

  • The Java language already defines signed (>>) and unsigned (>>>) right shift operators. The rationale behind these operators is that for unsigned integers, right shifting is equivalent to flooring division by a power of 2. To emulate this behavior for signed integers, copies of the sign bit (instead of zero) are shifted in from the left side.

  • Unsigned division/remainder has no easy solution. You can either cast to a wider type like long to do the arithmetic, use BigInteger, or implement the grade school division algorithm from scratch. Luckily, Java SE 8 adds standard library functions to take care of this: Integer.divideUnsigned(), remainderUnsigned().

    Otherwise, here is some non-trivial code I wrote for unsigned 32-bit division:

    int unsignedDivide(int x, int y) {
      if (y >= 0) {
        if (x >= 0)
          return x / y;  // Easy case
        else {  // x >= 2^31
          int i = Integer.numberOfLeadingZeros(y);  // At least 1
          if ((x >>> i) < y)
            i--;
          // Perform one step of long division and offload the rest
          return (1 << i) + (x - (y << i)) / y;
        }
      } else  // y >= 2^31
        return unsignedLessThan(x, y) ? 0 : 1;
    }
  • Unsigned inequalities are also covered by Java SE 8 functions. Otherwise, they can be emulated with a clever trick: Flip the top bits of both operands and proceed with the same inequality operator. For example:

    // x and y are treated as uint32
    boolean unsignedLessThan(int x, int y) {
      return (x ^ 0x80000000) < (y ^ 0x80000000);
    }

    (The mathematical proof is left as an exercise for the reader.)

  • To cast down (narrowing) to an unsigned type, nothing special needs to be done. To cast up (widening) from a pseudo-unsigned type to native Java signed type, the upper bits need to be masked off to suppress sign extension. For example:

    byte  a = (...);  // Input, treat as uint8
    short b = (...);  // Input, treat as uint16
    int   c = (...);  // Input, treat as uint32
    long  d = (...);  // Input, treat as uint64
    
    int aInt = a & 0xFF;           // Correct value
    int bInt = b & 0xFFFF;         // Correct value
    long cLong = c & 0xFFFFFFFFL;  // Correct value
    
    int  cIntWrong  = c & 0xFFFFFFFF;  // Wrong, does absolutely nothing!
    long cLongWrong = c & 0xFFFFFFFF;  // Wrong, does not suppress sign extension!
    long dLongWrong = d & 0xFFFFFFFFFFFFFFFFL;  // Wrong, does nothing!
  • To convert a uint32 to a string, this algorithm can be used:

    String unsignedToString(int x) {
      if (x >= 0)  // Easy case
        return Integer.toString(x);
      else {
        int temp = (x >>> 1) / 5;  // Unsigned division by 10
        // Extract the last digit and offload the prefix
        return Integer.toString(temp) + Integer.toString(x - temp * 10);
      }
    }

    The algorithm above is readily adaptable to uint64/long.

Notes

  • This discussion of emulating basic arithmetic operations does not cover special operations like signed/unsigned saturating arithmetic, vector operations, etc.

  • Fancy functions such as these in java.lang.Integer only care about the bits, not signedness: reverse(), reverseBytes(), rotateLeft(), bitCount(), highestOneBit(), numberOfLeadingZeros(), etc.

Mixed-type arithmetic pitfalls

Type conversion rules

When a programming language has both signed and unsigned integer types, it means that they will interact with each other in some way or another. At the very least, there must exist explicit conversions between types and rules that govern these conversions. The rules tend to become complicated when implicit conversions are implemented in the language specification. For example, this is what C has to say about mixed-type integer arithmetic (C11 standard draft, page 71):

  1. If both operands have the same type, then no further conversion is needed.

  2. Otherwise, if both operands have signed integer types or both have unsigned integer types, the operand with the type of lesser integer conversion rank is converted to the type of the operand with greater rank.

  3. Otherwise, if the operand that has unsigned integer type has rank greater than or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

  4. Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.

  5. Otherwise, both operands are converted to the unsigned integer type corresponding to the type of the operand with signed integer type.

Examples:

  • (long OP int) → (long OP long). (Rule 2)

  • (int OP unsigned int) → (unsigned int OP unsigned int). (Rule 3)

  • (unsigned long long OP short) → (unsigned long long OP unsigned long long). (Rule 3)

  • If the compiler implements sizeof(int) < sizeof(long), then:
    (unsigned int OP long) → (long OP long). (Rule 4)

  • Else the compiler implements sizeof(int) == sizeof(long), then:
    (unsigned int OP long) → (unsigned long OP unsigned long). (Rule 5)

(Note: These examples avoid showing the promotion of char and short to int.)

Usage examples

  • Mixed-type computations easily lead to subtle problems. For example:

    unsigned int size = (...);
    for (int i = 0; i < size; i++) { ... }

    If size > INT_MAX, then i will eventually have a signed overflow (undefined behavior). Furthermore, indexing an array with i being negative is also undefined behavior (assuming that the array pointer doesn’t have a positive offset). However by dumb luck, this loop idiom will probably work for large sizes – because even after i overflows, it gets converted to unsigned int for the comparison operation, and after i is incremented sufficiently many times the comparison will eventually yield false and terminate the loop.

    I admit that I’m tempted to declare my loop counters as int instead of unsigned int because it saves typing, most programming tutorials use int for simplicity, and most data sizes are small enough that it doesn’t matter. But when the upper bound is unsigned int, I have to choose either the laziness of using int or the correctness of using unsigned int.

    And speaking of loops, the decreasing loop idiom for (int i = n - 1; i >= 0; i--) does not work if i is an unsigned type.

  • The PVS-Studio C/C++ linter has caught many, many examples of code containing if (x < 0) where x is an unsigned type, in real open-source projects. (This pointless check always yields false, of course.) This gives evidence that programmers make mistakes in carefully distinguishing signed and unsigned integer types in practice.

  • In Java’s number system, there is almost no good reason to perform two casts consecutively. In C/C++ however, there are legitimate examples of this ugly construct emerging from signed vs. unsigned types:

    // 'char' type may be signed or unsigned, implementation-defined
    char *array = (...);
    // First force each byte to be unsigned, then widen to do packing
    uint32_t packed =
          (uint32_t)(uint8_t)array[0] << 24
        | (uint32_t)(uint8_t)array[1] << 16
        | (uint32_t)(uint8_t)array[2] <<  8
        | (uint32_t)(uint8_t)array[3] <<  0;

    (Alternatively we could write array[0] & UINT32_C(0xFF), but this involves mixed-type arithmetic which is harder to reason about.)

  • When you interact with someone else’s API that uses unsigned integer types, it forces you to use unsigned integer types in your own code as well. For example:

    struct Image {
      uint8_t *pixels;
      uint32_t width;
      uint32_t height;
    };

    Now if you want to write loops to perform signal processing on the pixels, you must use unsigned variables to count the ranges [0, width) and [0, height). In other words, you must write: for (uint32_t i = 0; i < img.width; i++), not for (int i = 0; i < img.width; i++).

    This syndrome of unsigned sizes is pervasive in C/C++, appearing in places like sizeof, malloc(size_t), vector.size(), and many others.

Other considerations

  • A wider signed type can always represent narrower unsigned types. For example, int32 can represent all uint31 values. But the reverse is not true; no unsigned type can represent all the values of a signed type (namely the negative ones are always left out). Thus if we have access to sufficiently wide signed types, we don’t need any unsigned types at all. (This is more or less the situation in Java.)

    People have made the observation that programming primarily involves unsigned numbers (starting from 0), which is indeed true. However, one needs to make use of negative numbers occasionally (such as when subtracting two unsigned indexes), so signed types must exist in a programming language too. Unsigned and signed types are both useful, but signed types have strictly more functionality than unsigned types. Thus it is clear that if language simplicity is desired, then unsigned types are the ones that can be safely omitted.

  • Bytes are a different story however. I argue that signed bytes make programming needlessly annoying, and that only unsigned bytes should have been implemented in the first place: Java’s signed byte type is a mistake. And the fact that C/C++ specify char to be signed or unsigned depending on the implementation is even more annoying, because now you need to ensure that your code handles both alternatives correctly.

  • It’s true that using unsigned types gains you 1 bit of extra indexing headroom, i.e. doubling the range. For example, Java array lengths go up to 231 − 1 (~2 billion) because int is signed. If instead unsigned int is used for sizes and indexes, then up to 232 − 1 (~4 billion) elements can be addressed.

    However in light of this small gain, there are some serious drawbacks. First, negative numbers can be used to signal error conditions, such as String.indexOf() returning −1 if not found. They also provide headroom to detect some instances of accidental overflows, such as NegativeArraySizeException when creating a negative-sized array or ArrayIndexOutOfBoundsException when accessing a negative array index. Second, it detracts from the more important long-term goal of upgrading to fully 64-bit addressing.

    (There are some examples of small addressing gains outside of Java. x86 CPUs originally had 32-bit memory addressing, but added PAE to extend to 36 bits. Then this small step was made useless when AMD’s x86-64 implementation was introduced. Another example is that in the 32-bit editions of the Windows NT family of operating systems, you can change the user/kernel address space split so that user programs can use up to 3 GiB of memory instead of the default 2 GiB. But again this is a small gain compared to switching to a fully 64-bit kernel and user processes.)

  • Being able to outlaw negative values can be useful in certain cases. For example, data sizes are never negative, so when putting these values in signed types it takes extra effort to explicitly check that the values are not negative. Also, bounds checking on unsigned types only needs to check the upper bound, whereas signed types also need to check non-negativity. However, I do not believe these small benefits are worth the huge risks of the programmer doing mixed-type arithmetic/comparisons/conversions incorrectly. It’s better to be just a little more verbose than to have subtle bugs hiding implicitly in the code.

  • One legitimate concern is when a file/network format specifies an unsigned data field – how should we represent it in Java? Let’s suppose that the Foo image file format defines unsigned 32-bit fields for width and height. There are a couple of reasonable ways to implement a Foo library in Java:

    class FooImageFileChoice0 {
      // Make it the user's responsibility
      // to treat these fields as uint32
      int width;
      int height;
    }
    
    class FooImageFileChoice1 {
      // Values shall be uint32 only.
      // Values outside [0,0xFFFFFFFF] considered illegal.
      long width;
      long height;
    }
    
    class FooImageFileChoice2 {
      // Design a custom type with its own
      // arithmetic and conversion methods
      Uint32 width;
      Uint32 height;
    }
    
    class FooImageFileChoice3 {
      // To hell with fixed-width integer types!
      BigInteger width;
      BigInteger height;
    }

    None of these alternatives is as correct as using uint32_t in C/C++, but I don’t think this situation comes up often enough to matter. And it’s not like C/C++ is perfect either, because it’s difficult to express more esoteric value constraints like: uint24, range [10, 99], 4k + 3, must be a prime number, etc.

  • Adding unsigned types would increase the already-cluttered Java standard library API. Because Java has method overloading (like C++) but not templates (unlike C++) or unified types (unlike C#), certain library functions must define a separate method for each primitive type. For example, java.util.Arrays defines these method variants for filling an entire array:

    void fill(byte[] arr, byte val)
    void fill(short[] arr, short val)
    void fill(int[] arr, int val)
    void fill(long[] arr, long val)
    void fill(char[] arr, char val)
    void fill(float[] arr, float val)
    void fill(double[] arr, double val)
    void fill(Object[] arr, Object val)

    The set of methods above is already a mouthful. If we hypothetically add unsigned types, we boost the list by 50%:

    void fill(byte[] arr, byte val)
    void fill(ubyte[] arr, ubyte val)
    void fill(short[] arr, short val)
    void fill(ushort[] arr, ushort val)
    void fill(int[] arr, int val)
    void fill(uint[] arr, uint val)
    void fill(long[] arr, long val)
    void fill(ulong[] arr, ulong val)
    void fill(char[] arr, char val)
    void fill(float[] arr, float val)
    void fill(double[] arr, double val)
    void fill(Object[] arr, Object val)

    Other parts of Java that must change include the reflection library, the JVM specification, the Java Native Interface specification, etc.

  • I have successfully implemented in Java numerous low-level libraries that conceptually require unsigned arithmetic. These libraries include cryptographic ciphers and hash functions, random number generators, big integer operations, and data compression. Even though these situations require uint32 arithmetic or something like it, I had no problem whatsoever expressing these operations in terms of Java’s signed integer types. Based on this experience, I expect that a competent programmer (who understands two’s complement binary representation) will have no trouble doing unsigned arithmetic in Java (which only superficially seems to not support unsigned math).

    Examples:

  • At a C++ conference, a question about unsigned vs. signed integer types was asked. Three panelists (including Stroustrup, the inventor of C++) unanimously answered that unsigned types are a bad idea.

    GoingNative 2013 – Interactive Panel: Ask Us Anything

    Video available on: YouTube, MSDN. The relevant excerpt starts at 42min 39sec:

    <Questioner> Second question, to Bjarne: Could you elaborate on not using unsigned ints, unsigned types, unless you want to do bit arithmetic? Because it means I would need to revise my code review policy.

    <Bjarne Stroustrup> Whenever you mix signed and unsigned numbers, you get trouble. The rules are just very surprising, and they turn up in code in strange places; they correlate very strongly with bugs. Now, when people use unsigned numbers, they usually have a reason. And the reason will be something like, “well it can’t be negative”, or “I need an extra bit”. If you need an extra bit, I am very reluctant to believe you that you really need it, and I don’t think that’s a good reason. When you think you can’t have negative numbers, you will have somebody who initialize[sic] your unsigned with −2, and think they get −2, and things like that. It is just highly error-prone. I think one of the sad things about the standard library is that the indices are unsigned whereas array indices are signed, and you’re sort of doomed to have confusion and problems with that. There are far too many integer types, there are far too lenient rules for mixing them together, and it’s a major bug source. Which is why I’m saying, stay as simple as you can, use integers ’til you really really need something else.

    <Herb Sutter> Use int unless you need something different. Then still use something signed until you really need something different. Then resort to unsigned. And yes, it’s unfortunately a mistake in the STL, in the standard library, that we use unsigned indices. If you are writing a very large data structure, the only case where you would really care about the unsigned is if you know it’s an array of characters that’s going to be bigger than half of memory. I don’t know of any other case where it matters. And that is only on a 32-bit system or less, and it just does not come up in practice very much.

    <Chandler Carruth> I’d like to add one other thing. Think about this: How frequently have you written arithmetic and wanted mod-2 behavior? All right, it’ll take you a long time to remember the last time you wanted 232-modular behavior; that’s really unusual. Why would you select a type which gives you that, and prevents any tool from ever finding a bug because it thinks you “need” mod-2 behavior?

Conclusion

Unsigned integer types do not belong in the Java programming language. They add considerable complexity to the type system, type conversion rules, library APIs, and the language and virtual machine specifications. Yet for developers who truly need unsigned arithmetic, this functionality can already be achieved effectively using signed integer types with only a small amount of knowledge and effort. Hence unsigned ints would add almost no expressive power but considerable pain – a rather unwise trade-off, isn’t it?

More info