Project Nayuki


My code style

Introduction

I’ve been coding continually since around year 2003, which has given me a lot of time to write many programs, use different languages, and reflect on what code style pleases me the most. My code style is a conscious choice, controlled down to every single character including whitespace. I optimize for clarity, consistency, conciseness, in that order.

Many developers seem to blindly accept whatever behaviors their text editor or IDE produces. Among the tools that I use (e.g. Eclipse for Java), I hit countless situations where the editor generates text that is objectively styled wrong (e.g. mis-indented), as well as situations where the code contradicts my own rules (e.g. wrong spacing, superfluous code). Hence, I always watch the output of my tools and make any needed modifications to the text to fit my style; I don’t just accept the defaults as is. (And of course, I configure my tools to generate a style as close as possible to my desired one, to reduce my manual effort.)

In case you edit my code and contribute it back to me, or you want to add a new piece of code to one of my projects, you must follow my style rules. Contributions that break my rules will be viewed as sloppy, lacking attention to detail, and disrespectful to the author’s existing collection of work. Giving me mis-styled code will force me to spend time and effort to clean it up to match my quality standards. A good starting point is to examine the surrounding code for style examples and then to imitate them.

Character encoding

I firmly support UTF-8 as the one true character encoding for all text file formats and network protocols. All other encodings are bad choices for various reasons:

  • ASCII only defines 128 characters and is barely sufficient for just English. Besides, UTF-8 is a strict superset of ASCII, so all ASCII text is automatically valid UTF-8.

  • Regional character encodings like Latin-1 (ISO-8859-1), GB 2312, Shift JIS, etc. were popular before year 2010, but have numerous problems. Each encoding only supports a small set of scripts/languages rather than the whole Unicode (with the exception of GB 18030). This imposes the burdens of explicitly denoting the encoding of every document and frequently switching encodings when loading different documents. If a document contains characters (e.g. Korean + Arabic) that no single regional encoding supports, then you must use Unicode anyway. And finally, the chance of mishandling encodings (e.g. guessing unmarked documents, or just plain buggy software) has proven to be high in practice, resulting in unwanted mojibake.

  • UTF-16 is also plagued with issues. It doesn’t support constant-time seeking to a character by index (a disadvantage shared by UTF-8) because characters beyond the basic multilingual plane (BMP) need two UTF-16 code units. To make matters worse, programmers might forget to test UTF-16-related logic on non-BMP characters, resulting in latent software bugs; whereas the variable-length nature of UTF-8 is manifestly apparent. UTF-16 generates a lot of NUL bytes, which breaks the vast majority of C/C++ programs because they use null-terminated strings. Also, while sorting UTF-8 strings lexicographically is equivalent to sorting Unicode strings lexicographically, this property does not hold for UTF-16. UTF-16 is not self-synchronizing, so inserting or deleting a byte in a stream will garble all the subsequent text. Another problem is that UTF-16 comes in two versions, big-endian and little-endian – which then leads to using a BOM or autodetection or explicit charset tags. Unfortunately, we can’t completely avoid UTF-16 because it is deeply embedded in the Java programming language, JavaScript language, and Windows APIs / NTFS / system binary files / high-profile application software.

  • UTF-32 / UCS-4 can be useful internally when a program manipulates text in memory, but it has essentially zero usage as a serialization format. It shares some disadvantages that UTF-16 has.
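
Several of the properties listed above can be checked directly. The following is a minimal Python sketch; the sample strings and variable names are arbitrary choices made for illustration only.

```python
# Illustrative checks of encoding properties (sample strings are arbitrary).

# ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "Hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# UTF-16 stores each ASCII character with a NUL byte, which would
# prematurely end a C-style null-terminated string.
utf16_has_nul = b"\x00" in ascii_text.encode("utf-16-le")

# A character outside the BMP needs two UTF-16 code units (a surrogate pair).
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2

# Sorting by code point agrees with sorting by UTF-8 bytes, but not with
# sorting by UTF-16 code units. Big-endian bytes are used here because
# they compare in the same order as the code units themselves.
words = ["\U0001F600", "\uFF5E"]
by_code_point = sorted(words)
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))
```

The two sample characters were picked because U+FF5E lies in the BMP above the surrogate range, while U+1F600 is encoded with surrogates that start at 0xD83D, so the UTF-16 order is reversed relative to the code point order.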

More info: Wikipedia: Character encoding

Byte order mark

I never use the BOM with UTF-8 text. It is unnecessary when we can assume that all incoming and outgoing text is encoded in UTF-8.

The BOM frustrates standard Unix tools like grep, diff, sort, etc., because it requires special treatment for the first line versus subsequent lines, for both the input and the output of the program. For example, sorting the lines of a text file containing a BOM might create a text file with no BOM at the beginning but have a BOM somewhere in the middle.
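
The sorting problem can be reproduced with a short Python sketch. It assumes a naive byte-wise line sort that knows nothing about the BOM, which is roughly how a BOM-unaware tool would treat the file; the sample content is made up.

```python
import codecs

# "utf-8-sig" is Python's codec for UTF-8 with a leading BOM.
original = "bravo\nalpha\n".encode("utf-8-sig")

# Naive line sort that treats the BOM as part of the first line.
lines = original.splitlines(keepends=True)
sorted_bytes = b"".join(sorted(lines))

# The BOM now sits in the middle of the output, glued to "bravo".
bom_position = sorted_bytes.find(codecs.BOM_UTF8)
```

Here `bom_position` ends up greater than zero, meaning the output no longer starts with a BOM but still contains one.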

The UTF-8 BOM is promoted by default in the .NET standard library and by older versions of Windows Notepad. The BOM is important for distinguishing UTF-16BE and UTF-16LE, but we shouldn’t be serializing UTF-16 in the first place.

More info: Wikipedia: Byte order mark

Newline sequence

I prefer Unix newlines in all text files everywhere, even when working on Windows. A reminder:

  • Windows uses the two-character sequence CR (U+0D) + LF (U+0A).

  • Any operating system from the Unix world (which includes Linux, BSD, and Mac OS X / macOS) uses the one-character sequence LF (U+0A).

  • Classic Mac OS (version 9 and below) uses the one-character sequence CR (U+0D).

When I write software that parses text character by character, it’s easy to process text that exclusively uses LF as newlines. It’s harder to write logic that handles CR+LF, because the code needs to remember the previous character, and also signal erroneous sequences (like CR+EOF, CR+other). It’s also needlessly painful to write a universal newline parser (one that behaves the same on any of CR, LF, or CR+LF).
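
To make the difference concrete, here is a rough Python sketch of both approaches. The function names are hypothetical, and the universal version is only one possible design: it accepts CR, LF, and CR+LF alike and needs an extra state flag to remember whether the previous character was a CR.

```python
def split_lines_lf(text):
	"""LF-only parsing: every "\n" terminates a line; no state is needed."""
	lines = text.split("\n")
	if lines[-1] == "":
		lines.pop()  # Text ended with a terminator, so drop the empty tail
	return lines

def split_lines_universal(text):
	"""Universal parsing: CR, LF, and CR+LF each end exactly one line."""
	lines = []
	current = []
	prev_was_cr = False
	for ch in text:
		if ch == "\n" and prev_was_cr:
			# Second half of CR+LF; the line was already emitted at the CR
			prev_was_cr = False
			continue
		prev_was_cr = (ch == "\r")
		if ch == "\r" or ch == "\n":
			lines.append("".join(current))
			current = []
		else:
			current.append(ch)
	if current:
		lines.append("".join(current))  # Final line without a terminator
	return lines
```

A strict CR+LF parser would additionally have to decide what to do with a lone CR or a CR at end of input, which this sketch sidesteps by treating every form as valid.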

Unfortunately, the CR+LF sequence is mandated in a bunch of network protocols like HTTP and IRC. So it’s nearly impossible to avoid text with CR+LF newlines, even if all the actual files coming into the system are in LF format.

More info: Wikipedia: Newline

Trailing newline

I require every non-empty text file to have its last character be a newline (U+0A).

In Windows programs and some Unix and Mac programs, a newline character is treated as something that goes between lines of text (i.e. separating lines), like this:

Alpha
<NL>
Bravo
<NL>
Charlie

Whereas traditional Unix programs interpret a newline character as something that goes at the end of every line of text (i.e. terminating lines), like this:

Alpha<NL>
Bravo<NL>
Charlie<NL>

Treating the newline character as a line terminator simplifies programs. When writing a program that generates lines of text, simply use println() for each line. Whereas with the other convention of treating newline as a separator, we would need to suppress the newline character in the final line.
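
The two conventions might look like this in Python. This is a sketch with hypothetical function names; `io.StringIO` stands in for a real output file.

```python
import io

def write_terminated(lines, out):
	"""Terminator convention: every line gets a newline, no special cases."""
	for line in lines:
		out.write(line + "\n")

def write_separated(lines, out):
	"""Separator convention: must track position to skip one newline."""
	for i, line in enumerate(lines):
		if i > 0:
			out.write("\n")
		out.write(line)

terminated = io.StringIO()
write_terminated(["Alpha", "Bravo", "Charlie"], terminated)

separated = io.StringIO()
write_separated(["Alpha", "Bravo", "Charlie"], separated)
```

The terminated writer has no conditional at all, whereas the separated writer needs to know where it is in the sequence.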

When a non-blank text file doesn’t end in a newline, there are several consequences:

  • The Unix “diff” program (which also gets used in Git) shows the annoying text “\ No newline at end of file”.

  • When concatenating multiple files together (such as with the Unix program “cat”), the first line of a file will be appended directly onto the last line of another file.

  • If the text file is C source code, then compiling it technically results in undefined behavior.
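
The concatenation problem is easy to demonstrate. The following Python sketch models files as plain strings; the helper names and file contents are invented for illustration.

```python
def ends_properly(text):
	"""True if the text is empty or its last character is a newline."""
	return text == "" or text.endswith("\n")

def concatenate(file_contents):
	"""Joins file contents back to back, like the Unix "cat" program."""
	return "".join(file_contents)

# Both files end with a newline: the lines stay separate.
good = concatenate(["Alpha\nBravo\n", "Charlie\n"])

# The first file lacks a trailing newline: two lines get fused.
bad = concatenate(["Alpha\nBravo", "Charlie\n"])
```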


Blank lines

I am deliberate about how many blank lines I use in my code. Generally speaking, I put:

  • 1 line between the header comment and the package/imports/includes.

  • 1 line between the Java package and imports.

  • 2 lines after the imports/includes.

  • 1 line at the start and end of each class. This might be abbreviated to 0 lines for some nested classes.

  • 1 line between groups of related statements within a function. What counts as a group is discretionary, but is usually around 10 lines long.

  • 2 lines between functions. For certain short or repetitive functions, I may shorten the gap to 1 or 0 blank lines.

  • 3 lines between classes. For inner classes, I may shorten the gap to 2 blank lines.

For example:

/* 
 * This is a header comment.
 */
// one
package io.nayuki.example;
// one
import java.lang.Integer;
import java.lang.System;
// two
// two
public class Main {
	// one
	public static void main(String[] args) {
		System.out.println(theAnswer());
		System.out.println(new Helper(378).toString());
	}
	// two
	// two
	private static int theAnswer() {
		return 42;
	}
	// one
}
// three
// three
// three
class Helper {
	// one
	public int value;
	// two
	// two
	public Helper(int val) {
		value = val;
	}
	// two
	// two
	public String toString() {
		return Integer.toString(value);
	}
	// one
}

Indentation

I always indent code with tabs, never spaces. Tabs are superior for the following reasons:

  • The user can change their text editor to show a tab as any number of spaces, whether it’s 2, 4, 7, etc.

  • If for some reason you need to permanently convert tabbed code to spaces, it’s easy to search-and-replace each tab with your preferred number of spaces. Meanwhile, doing the reverse conversion (spaces to tabs) is harder and more error-prone.

  • With tabs, it’s impossible to have a partial indentation level. Whereas with spaces – let’s say you use 4 spaces per indent – it’s common to encounter a sloppy line of code that accidentally has 3 spaces for indentation.

  • There exist basic/dumb text editors (e.g. Windows Notepad, various old Unix programs) that offer no support for indentation. Typing a tab character inserts a literal tab (not a particular number of spaces per indent), and typing a backspace deletes the previous single character (not a group of spaces). Editing tab-indented code is much easier than space-indented code in these text editors.

  • Tabs take up less storage space than multiple spaces. This is just icing on the cake though; there are other situations where I choose to waste some space but gain consistency/readability.
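
The asymmetry between the two conversions can be sketched in Python. The helper names are hypothetical, and the sketch only looks at leading whitespace on a single line.

```python
def tabs_to_spaces(line, width):
	"""Replaces each leading tab with `width` spaces. Always succeeds."""
	body = line.lstrip("\t")
	depth = len(line) - len(body)
	return " " * (width * depth) + body

def spaces_to_tabs(line, width):
	"""Replaces leading spaces with tabs, rejecting partial indent levels."""
	body = line.lstrip(" ")
	count = len(line) - len(body)
	if count % width != 0:
		raise ValueError(f"{count} leading spaces is not a multiple of {width}")
	return "\t" * (count // width) + body
```

The tab-to-space direction has no failure case, while the reverse direction must know the original indent width and still has to reject (or guess at) lines such as one indented by 3 spaces. It also cannot tell indentation apart from spaces that were meant as alignment.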

I indent blank lines so that all lines are indented according to their nesting level. When moving the caret (text cursor) up and down lines, the practice of indenting blank lines prevents the caret from unnecessarily jumping left and right. Also, if I put code in an existing blank line, I don’t need to first indent it. I dislike text editors that delete all the whitespace from blank lines upon saving a file. (That behavior arises from the “delete trailing whitespace” feature, which I mostly agree with except for a few cases.) For example:

class Good {
	int foo;
	
	void bar() {
		alpha();
		
		beta();
	}
}
class Bad {
	int foo;

	void bar() {
		alpha();

		beta();
	}
}

Various notes:

  • I firmly reject the practice of mixing tabs and spaces for indentation. For example, older versions of the Java standard library source code use 4 spaces per indent, but also substitute groups of 8 spaces with 1 tab (the classic definition of a tab). Some C/C++ codebases may exhibit this style too.

  • However, I consider it correct to use tabs for indentation followed by spaces for alignment. For example:

    void myFunc() {
    ―――→return longExpression0
    ―――→·····+ longExpression1
    ―――→·····+ longExpression2;
    }

  • When a string literal needs to contain a tab character (e.g. when outputting human-friendly code), I prefer to use the escape sequence \t rather than a real tab character. For example:

    String good = "\t\t<p>Lorem ipsum</p>";
    String bad = "		<p>Lorem ipsum</p>";

  • If I’m contributing to someone else’s codebase, I will use their indentation rules. For example, the majority of Python code is written using 4 spaces per indent (endorsed by PEP 8), and much Ruby code uses 2 spaces. But if I’m writing my own standalone program/library, I will ignore the language’s recommended/popular indentation style so that I always use tabs everywhere.
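
The rule “tabs for indentation, then optionally spaces for alignment, never a tab after a space” is simple enough to check mechanically. Below is a small Python sketch of such a check; the function name is hypothetical and it inspects one line at a time.

```python
def leading_whitespace_ok(line):
	"""True if the line starts with zero or more tabs followed by zero or more spaces."""
	body = line.lstrip(" \t")
	lead = line[: len(line) - len(body)]
	# After removing the leading tabs, only spaces may remain
	return "\t" not in lead.lstrip("\t")
```

For instance, `"\t     + longExpression1"` passes (one tab, then alignment spaces), whereas `"    \treturn x"` fails because a tab follows spaces.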


Braced blocks

I follow the “one true brace style” (1TBS), which happens to be the standard in Java. I also apply this style to all other brace-based programming languages, like JavaScript/TypeScript, C, C++, C#, Rust, CSS, and POV-Ray. 1TBS makes the most sense to me for three reasons. The opening brace is implied by the first token on the line (e.g. class, for, if), so it doesn’t deserve its own line. The closing brace needs its own line, because otherwise it would look weird after a semicolon and would generate superfluous diffs when the last statement is edited. And the indentation looks the most logical.

In languages (Java, C, etc., but not Rust) where braces are optional for a single statement, I almost always omit them for single-line statements to reduce visual noise. For example:

if (cond)  // Prefer
	xyz = abc;

if (cond) {  // Avoid
	xyz = abc;
}

for (int x : array) {  // Prefer
	if (cond)
		xyz = abc;
	else
		fed = wvu;
}

for (int x : array)  // Avoid
	if (cond)
		xyz = abc;
	else
		fed = wvu;

More info: Wikipedia: Indentation style - Brace placement in compound statements

Comment formatting

In almost all cases, I put a single space after the comment marker. Examples:

Zulu  //Bad
Yank  //·Good
Xray  //··Bad
Whis  //···Bad

An inline comment on its own line will be set at the indentation level (without preceding spaces):

// This is a no-op function
function f() {}

An inline comment on a line of code will have two spaces before the comment:

foo();··// Java
bar()··# Python
qux··-- Haskell

A block comment will have every line indented properly, aligned with spaces, and having a star; remember to keep trailing whitespace for consistency:

interface Sample {
	/* 
	 * Above: For symmetry of top and bottom,
	 * don't put text in starter line
	 * 
	 * Above: Keep trailing whitespace
	 */
}

Don’t start block comments with more than one star (e.g. /****) – unless writing Javadoc, which starts with exactly two stars (/**).

Quoted strings

Many programming languages require double quotation marks for strings: Java, C, C++, C#, Rust, and others. But also many languages allow both single quotes and double quotes for strings: JavaScript, Python, HTML, XML, CSS, etc. For the sake of consistency and reducing cognitive load, I use double quotes wherever possible.

Double quotes are more visually distinctive, but require an extra keystroke (Shift) to type. Single quotes are easier to type, and are popular in the JavaScript and Python communities. But I still prefer double quotes to aid the reader rather than the writer.

When two layers of quotes are needed, I prefer to use double and then single: <button onclick="alert('Hello')"/>. Definitely avoid double and then escaped-double: <button onclick="alert(&quot;Hello&quot;)"/>. Another solution is to use triple-double and then double (Python): htmlcode = """<p class="lorem">ipsum</p>""".

Shell scripting languages like Bash treat quotes differently. Single-quoted strings are literals (which is usually what I want), whereas double-quoted strings are still subjected to variable substitution (which I use less often).

Scoping

I practice the principle of least privilege when it comes to exposing variables/fields/methods/classes. This means:

  • Inside a Java/JavaScript/C++/Python class, a field / method / nested class that doesn’t need a “current object instance” of its own type shall be declared static.

  • When access modifiers are available (e.g. Java, C++, TypeScript, Rust), use private whenever possible.

  • If a variable is not needed between loop iterations, then declare it inside the loop rather than outside.

  • If nested functions are supported (JavaScript, Python, etc.), then sometimes use them as an alternative to declaring everything at the top level.
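
As a rough illustration of three of these points in Python, consider the sketch below. The class and function names are made up. Note that Python has no block scope, so a variable assigned inside a loop technically remains visible afterward; placing the assignment inside the loop still documents the intent.

```python
class Temperature:
	def __init__(self, celsius):
		self.celsius = celsius

	# Needs no instance of Temperature, so it is declared static
	@staticmethod
	def c_to_f(celsius):
		return celsius * 9 / 5 + 32

	def fahrenheit(self):
		return Temperature.c_to_f(self.celsius)


def mean_of_squares(values):
	# Nested helper: nothing outside this function needs it
	def square(x):
		return x * x

	total = 0
	for v in values:
		sq = square(v)  # Only meaningful within a single iteration
		total += sq
	return total / len(values)
```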

Variable declarations

I declare each variable at the point of its first assignment. I reject the old K&R C or C89 practice of declaring all variables at the very top of every function. I fully condone declaring variables in for-loop initializers as well as within conditionals and loops. Examples:

int goodFib(int n) {
	int a = 0;
	int b = 1;
	for (int i = 0; i < n; i++) {
		int c = a + b;
		a = b;
		b = c;
	}
	return a;
}
int badFib(int n) {
	int a = 0;
	int b = 1;
	int c;
	int i;
	for (i = 0; i < n; i++) {
		c = a + b;
		a = b;
		b = c;
	}
	return a;
}

Alignment

I occasionally align syntactic structures that repeat many times:

Mode.NUMERIC      = Mode(0x1, (10, 12, 14))
Mode.ALPHANUMERIC = Mode(0x2, ( 9, 11, 13))
Mode.BYTE         = Mode(0x4, ( 8, 16, 16))
Mode.KANJI        = Mode(0x8, ( 8, 10, 12))
Mode.ECI          = Mode(0x7, ( 0,  0,  0))

But otherwise, I don’t bother with aligning things that are not closely related, because it increases effort but adds no clarity:

void good(
	int a,
	float bcd,
	char *efg);
void bad(
	int    a,
	float  bcd,
	char  *efg);

Name casing

  • In Java, JavaScript, C, C++, I follow mostly camelCase rules: packagename.subpackage.ClassName.methodName(localVariable.fieldName + Class.CONSTANT_NAME)

  • In C#, I follow almost entirely UpperCamelCase rules as per convention (even though I much prefer the Java rules): NamespaceName.ClassName.MethodName(localVariable.FieldName + Class.ConstantName)

  • In Python, Rust, I follow mostly snake_case rules: modulename.submodule.Class.method_name(localvariable.field_name + Class.CONSTANT_NAME)

  • In CSS, I follow kebab case rules: #multi-word-id.class-name

  • In SQL, I follow snake_case, capitalize clause keywords and other important keywords, and decapitalize names and expressions:

    SELECT ifnull(max(id)+1,0), some_field
    FROM table_name JOIN other_table ON table_name.a=other_table.b
    WHERE x > 0 and y = 'thing'
    ORDER BY column_name ASC

Various notes:

  • When I name things in code based on real-life words, I completely ignore the casing of the original word. For example, I would name XMLHttpRequest (“XML HTTP request”) as class XmlHttpRequest, “scrape eBay” would be function scrapeEbay(), and “PowerPoint file” would be File powerpointFile = open(...).

  • More info: Wikipedia: Letter case - Special case styles
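
The “ignore the original casing” rule amounts to normalizing each word before joining. Here is a minimal Python sketch; the function names are hypothetical, and the input is assumed to be already split into words.

```python
def upper_camel(words):
	"""Joins words as UpperCamelCase, discarding each word's original casing."""
	return "".join(w.capitalize() for w in words)

def lower_camel(words):
	"""Joins words as lowerCamelCase, discarding each word's original casing."""
	return words[0].lower() + upper_camel(words[1:])

def snake(words):
	"""Joins words as snake_case, discarding each word's original casing."""
	return "_".join(w.lower() for w in words)
```

With these helpers, `upper_camel(["XML", "HTTP", "request"])` gives `XmlHttpRequest`, `lower_camel(["scrape", "eBay"])` gives `scrapeEbay`, and `lower_camel(["PowerPoint", "file"])` gives `powerpointFile`, matching the examples above.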

Naming

Out of necessity, this is probably the least consistent or rigid aspect of my style rules. This is simply because different programs work with different problems, ideas, and structures. Nevertheless, I strive to create order wherever possible:

  • I often use s to denote a string, x to denote an integer or real number, n to denote the size (unsigned integer) of some list/collection, etc.

  • Loop counter variables are usually i, and sub-loops use j and k. Occasionally, index or count is a more appropriate name.

  • In statically typed languages, I heavily abbreviate local variable names because I can rely on the type: QrCode qr; Hasher h;

  • In dynamically typed languages, I tend to write out long names more often: qrcode = QrCode(); worker = Worker();

  • I usually give shorter names to variables in tighter scopes. For example:

    class Demo {
    	Object object;
    	void demo() {
    		Object obj = getObject();
    		for (Object o : getList()) { ... }
    	}
    }

  • Java doesn’t have array slices, so it’s emulated frequently with this pattern of three variables: void foo(int[] arr, int off, int len).

  • The variable names used on my page Good Java idioms are indeed used in many pieces of my actual Java code.

List trailing comma

Whenever I write a list such that each item gets its own line, I put a comma after the last item as long as the language syntax allows it. This reduces diffs when editing the list, because every item ends with a comma. For example:

// Java
int[] singleLine = {1, 2, 4, 8, 16};
String[] multiLine = {
	"kilo",
	"mega",
	"giga",
	"tera",
};

# Python
stuff = {
	"a": 11,
	"b": 23,
	"c": 58,
}

// JavaScript
acme = [
	0,
	1,
	2,
];

// JSON (disallowed)
[
	0,
	1,
	2
]

-- Haskell (disallowed)
widget = [
	9,
	8,
	7,
	6
]

I also use trailing commas in rare cases of calling a function (if the language allows it), such that the list of arguments is long, repetitive, and/or expected to change:

# Python
result = func(
	fibonacci(0),
	fibonacci(1),
	fibonacci(2),
	fibonacci(3),
)

// JavaScript
let answer = func(
	fibonacci(0),
	fibonacci(1),
	fibonacci(2),
	fibonacci(3),
);
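
The effect on diffs can be measured with Python’s standard difflib module. This sketch appends one item to a list body written with and without a trailing comma, then counts the added/removed lines; the helper name and sample lines are made up.

```python
import difflib

def changed_lines(before, after):
	"""Returns only the added and removed lines of a line-based diff."""
	return [ln for ln in difflib.ndiff(before, after) if ln.startswith(("+ ", "- "))]

# With trailing commas, appending an item touches one line.
with_comma = changed_lines(
	["\t0,", "\t1,"],
	["\t0,", "\t1,", "\t2,"],
)

# Without them, the previous last line must also be edited to gain a comma.
without_comma = changed_lines(
	["\t0,", "\t1"],
	["\t0,", "\t1,", "\t2"],
)
```

In the first case the diff consists of a single added line, while in the second it contains one removal and two additions.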

Temporary code

This topic is more about the code that you don’t see from me, rather than my published content and its permanent history. Like all developers, I frequently add and edit some code temporarily to progressively build a feature, disable things, audit/debug behaviors, and test ideas that I’m unsure of. Note that I allow major style deviations (e.g. misindented lines) for code that I don’t intend to keep. The difference between myself and typical developers is that before committing the changes to version control, I very carefully remove hacky code and revise anything worth keeping.

I primarily use print statements for debugging – both for confirming the expected behavior of a correct program and for revealing problems in an incorrect program. But once I finish gathering insights and fixing the program (if applicable), I delete all the print statements. I’ve seen too many novices litter their programs with prints and rarely remove them, even long after the relevant program feature was developed or debugged. Pursuant to the Unix philosophy, I don’t like needlessly noisy programs that dump cryptic internal details that don’t help the end user. Also, I don’t bother commenting out print statements because I type quickly and accurately enough (even for long phrases like Java’s System.out.println()) that I’m willing to reproduce them on demand. Sporadically, in interactive programs, I choose to print some information (e.g. progress status for long-running programs) that is hopefully informative to the user.

I heavily avoid committing commented-out code, with the exception of occasionally showing usage examples in documentation text. Commented-out code reeks of an inexperienced programmer who is unconfident about their own design choices or their understanding of algorithms/APIs. Code that is no longer needed should be deleted and not commented out, because version control systems are the superior way to record and retrieve old text. Otherwise, commented-out code is syntactical noise that wastes human attention/comprehension, it usually has no documentation to explain why the code is disabled (e.g. wrong, slow, out-of-date, experimental), and it can go out of date as its surroundings change.

Language-specific rules

Although many parts of this page discuss or exemplify somewhat language-specific style rules, this section has a bunch of miscellaneous points that are narrower and don’t generalize well:

  • In C, C++, C#, Java, JavaScript switch statements, I indent the case and further indent its following code:

    switch (value) {
    	case 0:
    	case 1:
    		doFoo();
    		break;
    	default:
    		doBar();
    		break;
    }

  • In Java, C++, C#, I almost never explicitly qualify instance method calls or field accesses with the this keyword, because it’s optional and I feel it doesn’t add clarity. By contrast, using self is mandatory in Python and Rust, and using this is mandatory in JavaScript.

  • In C++, I write out public: or private: in front of every class member, which happens to be mandatory in Java and C#. This practice ensures that the access modifier is readily visible at each member, without needing to scan upward. Also, this reduces diffs when adding, reordering, or removing members. Examples:

    class Good {
    	private: int x;
    	private: int y;
    	public: Good() {}
    	public: int sum();
    	private: static double helper(double z);
    };
    
    class Bad {
    private:
    	int x;
    	int y;
    public:
    	Bad() {}
    	int sum();
    private:
    	static double helper(double z);
    };

  • In C++, I mainly use native C++ language syntax and standard library features, rather than the ones inherited from C. This is discussed on my other page: Near-duplicate features of C++.

  • Some people advocate for writing JavaScript code without semicolons, which does look cleaner. But it’s a mistake in my opinion. JavaScript is a Java-inspired language, and using semicolons lets JavaScript code match Java, C, C++, C# code more closely. JavaScript’s formal syntax definition involves semicolons, so it is canonical to use them. Code that lacks semicolons is conceptually preprocessed by means of “automatic semicolon insertion”, which tries to repair syntax errors at line boundaries by synthesizing virtual semicolons.

    Unfortunately, ASI sometimes doesn’t activate when you want:

    a = b + c
    (d + e).print()
    // ---- becomes ----
    a = b + c(d + e).print();
    // Note that c(...) is a function call

    And sometimes ASI is overzealous when you don’t want it:

    return
    {x: 0, y: 1};
    // ---- becomes ----
    return;  // Value of undefined
    {x: 0, y: 1};  // Unreachable statement which
    // creates an object and immediately discards it

Consistency

  • Overall, I treat my style rules as an abstract entity that can be applied to many programming languages. For example, I will write a + b instead of a+b in any language that supports infix operators and optional whitespace, whether it’s Java or Python or Haskell. And as mentioned earlier, I will use the same indentation rules and brace formatting rules in every language where these constructs exist.

  • Because every language is different (to various extents), I will necessarily have language-specific rules. For example in C++, I am forced to choose whether to put public:/private: in front of every member or only use one modifier for each related block of members.

  • I ignore rules that add no value to my work. Sure, many languages have community conventions that say each indent must be 2 or 4 or whatever number of spaces. But if I’m not contributing to an existing codebase, I’m free to use tabs for my own code.

  • Although I have a large body of published code spanning over a decade, I do my best to keep it all in a consistent style. For the most part, my style hasn’t changed much; I made a bunch of choices early on and still agree with them today. But when my preferences change, I do try to update all existing code to fit new conventions. This kind of ongoing code revision/maintenance has actually happened quite a few times because of new language features and APIs, but rarely due to my aesthetics changing.