
Summary of C/C++ integer rules

This is my own collection of hard-earned knowledge about how integers work in C/C++, and how to use them carefully and correctly. In this article, I try to strike a balance between brevity (easing the reader's burden) and completeness (providing absolute correctness and extensive detail).

Whenever I read or write C/C++ code, I need to recall and apply these rules in many situations, such as: choosing an appropriate type for a local variable / array element / structure field, converting between types, and doing any kind of arithmetic or comparison. Note that floating-point number types will not be discussed at all, because that topic mostly deals with analyzing and handling approximation errors that stem from rounding. By contrast, integer math is a foundation of programming and computer science, and all calculations are exact in theory (ignoring implementation issues like overflow).

Data types

Basic integer types

A basic integer type is specified using some legal sequence of keywords drawn from the set {char, short, int, long, signed, unsigned}.

Although the bit width of each basic integer type is implementation-defined (i.e. depends on the compiler and platform), the following properties are guaranteed by the language standards:

  • char: At least 8 bits wide.

  • short: At least 16 bits, and at least as wide as char.

  • int: At least 16 bits, and at least as wide as short.

  • long: At least 32 bits, and at least as wide as int.

  • long long: At least 64 bits, and at least as wide as long.
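
Because these widths are implementation-defined, it can help to print what a given compiler actually provides. A minimal sketch (output varies by compiler and platform; note that this counts any padding bits too):

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        // CHAR_BIT is the number of bits in a char; sizeof() counts in units of char.
        printf("char     : %d bits\n", CHAR_BIT);
        printf("short    : %d bits\n", (int)(sizeof(short) * CHAR_BIT));
        printf("int      : %d bits\n", (int)(sizeof(int) * CHAR_BIT));
        printf("long     : %d bits\n", (int)(sizeof(long) * CHAR_BIT));
        printf("long long: %d bits\n", (int)(sizeof(long long) * CHAR_BIT));
        return 0;
    }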

Signedness

  • Unqualified char may be signed or unsigned, which is implementation-defined.

  • Unqualified short, int, long, and long long are signed. Adding the unsigned keyword makes them unsigned.

  • Signed numbers may be encoded in binary as two’s complement, ones’ complement, or sign-magnitude; this is implementation-defined. Note that ones’ complement and sign-magnitude each have distinct bit patterns for negative zero and positive zero, whereas two’s complement has a unique zero.

  • Character literals (in single quotes) have the type (signed) int in C, but (signed or unsigned) char in C++.
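
A small sketch that probes two of these implementation-defined choices, namely plain char's signedness and the type of a character literal:

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        // Whether plain char can hold negative values is implementation-defined.
        printf("char is %s\n", CHAR_MIN < 0 ? "signed" : "unsigned");
        // Compiled as C this prints sizeof(int); compiled as C++ it prints 1,
        // because 'A' has type int in C but type char in C++.
        printf("sizeof('A') = %d\n", (int)sizeof('A'));
        return 0;
    }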

Additional rules

  • sizeof(char) always equals 1, regardless of the bit width of char.

  • The bit widths need not be distinct. For example, it’s legal to have char, short, and int all be 32 bits wide.

  • The bit widths need not be powers of 2. For example, int could be 36 bits wide.

  • There are various ways to write an integer type. For example, each line is a set of synonyms:

    • int, signed, signed int, int signed.

    • short, short int, short signed, short signed int.

    • unsigned long long, long unsigned int long, int long long unsigned.

Standard library types

  • size_t (defined in stddef.h) is unsigned and at least 16 bits wide. It is not guaranteed to be at least as wide as int.

  • ptrdiff_t (defined in stddef.h) is a signed integer type. Subtracting two pointers will yield this type. Do not assume that subtracting two pointers will yield int.

  • Exact-width types are defined in stdint.h: uint8_t and int8_t, and likewise for 16, 32, and 64 bits. Be careful with arithmetic promotions – for example, uint8_t + uint8_t yields int (which is signed and at least 16 bits wide), not uint8_t as one would reasonably expect.
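
For example, a minimal sketch of that promotion pitfall, assuming the common case where int is wider than 8 bits:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t a = 200, b = 100;
        int sum = a + b;          // both operands are promoted to int, so sum == 300
        uint8_t wrapped = a + b;  // converting 300 back to uint8_t gives 300 mod 256 == 44
        printf("%d %d\n", sum, wrapped);
        return 0;
    }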

Conversions

Suppose that a value of some source integer type needs to be converted to a value of some destination integer type. This situation might arise from an explicit cast expression, an implicit cast upon assignment, or an arithmetic promotion. What happens in the conversion?

A primary principle is that if the destination type can hold the value of the source type, then the value will be semantically preserved. To elaborate:

  • When a source type is widened to a destination type of the same signedness (e.g. signed char → int; unsigned short → unsigned long), then every source value will be preserved after conversion.

  • Even if the source and destination types have different ranges, all values in the overlapping part of their ranges will be preserved. (For example, an int that holds a value in the range [0, 255] will be losslessly converted to unsigned char.)

More specifically, the rules are as follows:

  • When converting to an unsigned type, the new value equals the old value modulo 2^(destination bit width). Explanations:

    • If the source type is unsigned and wider than the destination, then effectively the top bits are discarded.

    • If the source type is signed, then the conversion effectively takes the source value and repeatedly adds/subtracts 2^(destination bit width) until the new value fits in the destination type’s range. Moreover, if the signed number system is two’s complement, then the conversion effectively discards the top bits, just like the unsigned case.

  • When converting to a signed type, these are the cases:

    • If the source value fits in the destination type’s range, then the conversion procedure (e.g. sign extension) produces a destination value that is semantically equal to the source value.

    • Otherwise if it doesn’t fit, then the behavior is implementation-defined, and could raise an exception (e.g. overflow trap).
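
A few concrete cases of these rules, assuming 8-bit char and 32-bit int (the last line's result is implementation-defined, as noted above):

    #include <stdio.h>

    int main(void) {
        unsigned char a = (unsigned char)-1;  // -1 mod 256 == 255; always well-defined
        unsigned int  b = (unsigned int)-1;   // equals UINT_MAX, by the same modulo rule
        int i = 200;
        unsigned char c = i;                  // 200 fits in [0, 255], so the value is preserved
        int d = (int)3000000000U;             // doesn't fit in a 32-bit int: implementation-defined
        printf("%d %u %d %d\n", a, b, c, d);
        return 0;
    }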

Arithmetic

Promotions/conversions

  • A unary arithmetic operator has one operand, e.g. -, ~.

  • A binary arithmetic operator has two operands, e.g. +, *, &, <<.

  • If any operand of an operator has type bool, char, or short (whether signed or unsigned), then it is promoted to (signed) int if int can hold all values of the source type; otherwise it is promoted to unsigned int; the promotion is designed to be lossless. Examples:

    • An implementation has 16-bit short and 24-bit int. Letting variables x and y have the type unsigned short, the expression x & y promotes both operands to signed int.

    • An implementation has 32-bit char and 32-bit int. Letting variables x and y have the type unsigned char, the expression x - y promotes both operands to unsigned int.

  • For binary operators, both operands are implicitly converted to the same common type before the arithmetic happens. Consider the list of conversion ranks in ascending order: int, long, long long. The rank of the common type is the higher of the two operand types’ ranks. If both operands have the same signedness, then the common type has that signedness. Else if the unsigned operand’s type has a rank greater than or equal to that of the signed operand’s type, then the common type is unsigned. Else if the signed operand’s type can represent all values of the other operand’s type, then the common type is signed. Otherwise the common type is unsigned. Examples:

    • (long) + (long) → (long).

    • (unsigned int) * (int) → (unsigned int).

    • (unsigned long) / (int) → (unsigned long).

    • If int is 32-bit and long is 64-bit: (unsigned int) % (long) → (long).

    • If int is 32-bit and long is 32-bit: (unsigned int) % (long) → (unsigned long).
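
One practical consequence of these rules is that comparing a signed value with an unsigned value of equal or greater rank silently converts the signed operand to unsigned. A small sketch:

    #include <stdio.h>

    int main(void) {
        int x = -1;
        unsigned int y = 1;
        // The common type of (int, unsigned int) is unsigned int,
        // so x is converted to UINT_MAX before the comparison.
        printf("%d\n", x < y);  // prints 0, i.e. -1 < 1U is false
        return 0;
    }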

Conversion rules simulator

[Interactive simulator omitted: given the implementation-defined bit widths of char, short, int, long, and long long, plus the types of variables x and y, it reports typeof(+x), typeof(+y), and typeof(x * y) according to the rules above.]

Undefined behavior

Signed overflow:

  • When arithmetic is performed on a signed integer type, overflow is undefined behavior. UB can produce correct, inconsistent, and/or incorrect behavior, now or anytime in the future.

  • When arithmetic is performed on an unsigned integer type (after promotions and conversions), any overflow is guaranteed to wrap around. For example, UINT_MAX + 1 == 0.

  • Performing arithmetic on unsigned fixed-width integer types that are narrower than int can lead to subtle bugs. For example:

    • Let uint16_t = unsigned short and let int be 32-bit. Then uint16_t x=0xFFFF, y=0xFFFF, z=x*y; would have x and y promoted to int, and x * y would overflow int – hence undefined behavior.

    • Let uint32_t = unsigned char and let int be 33-bit. Then uint32_t x=0xFFFFFFFF, y=0xFFFFFFFF, z=x+y; would have x and y promoted to int, and x + y would overflow int – hence undefined behavior.

    • To force safe unsigned arithmetic, either add 0U or multiply by 1U as a no-op. For example: 0U + x + y or 1U * x * y. This ensures that the operands are promoted to at least int rank and still kept as unsigned.
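
For instance, the first example above could be made safe with the 1U trick (a sketch assuming 16-bit unsigned short and 32-bit int):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t x = 0xFFFF, y = 0xFFFF;
        // Without the 1U, both operands would promote to signed int and the product
        // 0xFFFE0001 would overflow a 32-bit int, which is undefined behavior.
        uint32_t z = 1U * x * y;  // operands become unsigned int, so any wraparound is well-defined
        printf("%lu\n", (unsigned long)z);
        return 0;
    }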

Divide/remainder:

  • Division by zero and remainder with a zero divisor are undefined behavior.

  • Unsigned division/remainder has no other special cases.

  • Signed division can overflow – e.g. INT_MIN / -1.

  • Signed remainder with negative operands is tricky. Since C99 and C++11, division truncates toward zero, so the remainder has the same sign as the dividend; older standards left the rounding direction implementation-defined. Also, INT_MIN % -1 is undefined behavior just like INT_MIN / -1, because the quotient overflows.
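
A short sketch of the well-defined cases under C99/C++11 semantics, where division truncates toward zero:

    #include <stdio.h>

    int main(void) {
        printf("%d %d\n", -7 / 2, -7 % 2);  // prints -3 -1: quotient truncates toward zero,
                                            // remainder takes the sign of the dividend
        printf("%d %d\n", 7 / -2, 7 % -2);  // prints -3 1
        // INT_MIN / -1 and INT_MIN % -1 are undefined behavior (the quotient overflows).
        return 0;
    }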

Bit shifts:

  • It is undefined behavior to bit shift (<< and >>) by an amount that is negative, equal to the bit width of the (promoted) left operand, or greater than that bit width.

  • Left-shifting an unsigned operand (after promotion/conversion) is well-defined and behaves normally.

  • Left-shifting a signed operand type containing a non-negative value, such that a 1 bit would go into the sign bit, is undefined behavior.

  • Left-shifting a negative value is undefined behavior.

  • Right-shifting a non-negative value (in a signed or unsigned operand type) is well-defined and behaves normally.

  • Right-shifting a negative value is implementation-defined.
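
To illustrate these rules, a sketch assuming a 32-bit int:

    #include <stdio.h>

    int main(void) {
        unsigned int u = 1;
        unsigned int a = u << 31;  // fine: left-shifting an unsigned operand is well-defined
        // int bad0 = 1 << 31;     // a 1 bit reaches the sign bit of a 32-bit int: undefined behavior
        // int bad1 = u << 32;     // shift amount equals the bit width: undefined behavior
        // int bad2 = u << -1;     // negative shift amount: undefined behavior
        int r = -8 >> 1;           // right-shifting a negative value: implementation-defined result
        printf("%u %d\n", a, r);
        return 0;
    }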

Loop counting

Type choice

Suppose we have an array where we want to process each element sequentially, and the array length is stored in the variable len of type T0. How should we declare our loop counter variable i of type T1?

  • The easiest rule is to use the same type as the length variable. For example:

    uint8_t len = (...);
    for (uint8_t i = 0; i < len; i++) { ... }
  • More generally, a counter variable of type T1 will work correctly if the range of T1 is a (non-strict) superset of the range of T0. For example if len has type uint16_t, then counting using signed long (at least 32 bits) works.

  • But more specifically, the loop counter merely needs to cover all possible actual lengths. For example if len of type int is guaranteed to have a value in the range [3, 50] (due to some application-specific logic), then it is okay to count the loop using a signed or unsigned char (where the range [0, 127] is guaranteed to be representable).

  • It’s a bad idea to use different signedness for the length variable versus the counter variable. The comparison would force an implicit conversion, and signed-unsigned conversions are hard to understand and come with subtle platform-dependent pitfalls. The following example should not be written in real code:

    size_t len = (...);  // Unsigned
    for (int i = 0; i < len; i++) { ... }
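
A safer version simply follows the first rule above and gives the counter the same type as the length:

    size_t len = (...);  // Unsigned
    for (size_t i = 0; i < len; i++) { ... }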

Counting down

For downward-counting loops, using a signed counter is more natural because we can write:

for (int i = len - 1; i >= 0; i--) {
    process(array[i]);
}

Whereas an unsigned counter would require code like:

for (unsigned int i = len; i > 0; i--) {
    process(array[i - 1]);
}

Note: The comparison i >= 0 makes sense if i is a signed type, but always yields true if i is unsigned. So if this expression ever appears in an unsigned context, then the code writer very likely made a mistake in the logic.
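
For example, the following loop (a sketch of the mistake) never terminates, because an unsigned i wraps around to UINT_MAX instead of going below zero:

    for (unsigned int i = len - 1; i >= 0; i--) {  // i >= 0 is always true here
        process(array[i]);  // eventually reads out of bounds once i wraps around
    }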

Misconceptions

All points in the following list are myths. Do not rely on these false beliefs if you want to write correct and portable code!

  • char is always 8 bits wide. int is always 32 bits wide.

  • sizeof(T) represents the number of 8-bit bytes (octets) needed to store a variable of type T. (This is false because if, say, char is 32 bits wide, then sizeof(T) measures in units of 32-bit words.)

  • We can use int everywhere in a program and ignore nuanced types like size_t, uint32_t, etc.

  • Signed overflow is guaranteed to wrap around. (e.g. INT_MAX + 1 == INT_MIN.)

  • Character literals are equal to their ASCII code values, e.g. 'A' == 65. (This is false due to EBCDIC, etc.)

  • Converting a pointer to an int and back to a pointer will be lossless.

  • Converting {a pointer to one integer type} to {a pointer to another integer type} is safe. (e.g. int *p = (...); long *q = (long*)p;.) (See type punning and strict aliasing.)

  • Whenever all the operand(s) of an arithmetic operator (unary or binary) have unsigned types, the arithmetic operation will be performed in unsigned mode (thus never triggering undefined behavior) and the result type will be unsigned. For example: Suppose uint8_t x; uint8_t y; uint32_t z;, then x + y should have a type like uint8_t or unsigned int or something reasonable, and +z would still be uint32_t. (This is false because promotions favor signed types.)

My critique

  • In short, it’s a heavy mental burden to know and use all these rules all the time. Failing to adhere to the rules results in a risk of writing unportable or incorrect code, and such errors might surface immediately or lie latent for days or years.

  • The difficulties begin with the implementation-defined bit widths for basic integer types. For example, int might be 16 bits wide, 32, 64, or some other number. One always needs to choose a type with enough range for the application. But sometimes, having too much range (e.g. an unusual 128-bit int) could cause headaches or introduce vulnerabilities too. To make matters worse, standard library types like size_t bear no guaranteed relationship to other types like unsigned int or uint32_t; the standard allows them to be wider or narrower.

  • The conversion rules are completely insane. Making matters worse, implicit conversions are allowed almost everywhere, which makes human auditing much harder. Unsigned types are straightforward, but signed types have so many possible legal implementations (e.g. ones’ complement, raising exceptions). Types with a lower rank than int are automatically promoted, causing unintuitive behavior with ranges and overflow. When operands have different signedness and ranks, they get converted to a common type in a way that depends on the implementation-defined bit widths. For example, performing arithmetic on two operands, where at least one operand has an unsigned type, will convert both of them to either a signed or unsigned type depending on the implementation.

  • Arithmetic operations abound with undefined behaviors: signed overflow in add/sub/mul/div, division by zero, bit shifts. It’s easy to accidentally trigger these UB conditions, not easy to trigger them deliberately or detect them at run time, and hard to identify their root causes. It takes conscious awareness and effort to design and implement arithmetic code that avoids overflow/UB. It’s difficult to audit and fix existing code that wasn’t written with overflow/UB safety in mind.

  • Having signed and unsigned variants of every integer type essentially doubles the number of options to choose from. This adds to the mental burden, yet has little payoff because signed types can do almost everything that unsigned ones can.

  • No other mainstream language has as many rules and pitfalls regarding integer types as C and C++. For example:

    • Java integers behave the same in every environment. The language defines exactly 5 integer types (unlike C/C++ having at least 10), the types have fixed bit widths, almost all of these types are signed (except for char), signed numbers must be in two’s complement, implicit conversions only allow lossless ones, and all arithmetic and conversions are well-defined (no UB). Java’s fixed-width integer types support fast computations and efficient array packing, compared to languages like Python which only have (variable-width) bigint.

      Java is built heavily around the 32-bit int type, especially for counting arrays. This means Java can’t run efficiently on underpowered 16-bit CPUs (common in embedded microcontrollers), and Java can’t directly address large arrays on 64-bit systems. By comparison, C/C++ makes it possible to write code that runs efficiently on 16-bit, 32-bit, and/or 64-bit CPUs, but this requires a great deal of care from the coder.

    • Python only has one integer type, which is a signed bigint. Compared to C/C++, this renders moot all discussions about bit widths, signedness, and conversions – one type rules all the code. But the price to pay includes slow execution and inconsistent memory usage.

    • JavaScript doesn’t even have an integer type, and instead expresses everything in terms of float64 math (corresponding to C/C++ double). Because of this, the bit width and numerical range are fixed, numbers are always signed, there are no conversions, and overflow is well-defined.

    • The assembly language for any particular machine architecture (x86, MIPS, etc.) defines a set of fixed-width integer types, arithmetic operations, and conversions – with few or no undefined behaviors.

More info