Project Nayuki


What are binary and text files?

Introduction

On a computer, every file is a long string of ones and zeros. Specifically, a file is a finite-length sequence of bytes, where each byte is an integer between 0 and 255 inclusive (represented in binary as 00000000 to 11111111). Files can be broadly classified as either binary or text. These categories have different characteristics and need different tools to work with such files. Knowing the differences between binary and text files can save you time and mistakes when reading or writing data.

Here is the primary difference: Binary files have no inherent constraints (can be any sequence of bytes), and must be opened in an appropriate program that knows the specific file format (such as Media Player, Photoshop, Office, etc.). Text files must represent reasonable text (explained later), and can be edited in any text editor program.

Remember that all files, whether binary or text, are composed of bytes. The difference between binary and text files is in how these bytes are interpreted. Every text file is indeed a binary file, but this interpretation gives us no useful operations to work with. The reverse is not true, and treating a binary file as a text file can lead to data corruption. As a method of last resort, a hex editor can always be used to view and edit the raw bytes in any file.

File extensions

We can usually tell if a file is binary or text based on its file extension. This is because by convention the extension reflects the file format, and it is ultimately the file format that dictates whether the file data is binary or text.

Common extensions that are binary file formats:

Common extensions that are text file formats:

Binary file characteristics

Binary file in application (good)

Binary file in hex editor (okay)

Binary file in text editor (bad)

For most software that people use in their daily lives, the software consumes and produces binary files. Examples of such software include Microsoft Office, Adobe Photoshop, and various audio/video/media players. A typical computer user works with mostly binary files and very few text files.

A binary file always needs a matching software to read or write it. For example, an MP3 file can be produced by a sound recorder or audio editor, and it can be played in a music player or audio editor. But an MP3 file cannot be played in an image viewer or a database software.

Some binary formats are popular enough that a wide variety of programs can produce or consume it. Image formats like JPEG are the best example – not only can they be used in image viewers and editors, they can be viewed in web browsers, audio players (for album art), and document software (such as adding a picture into a Word doc). But other binary formats, especially for niche proprietary software, might have only one program in the world that can read and write it. For example, a high-end video editing software might let you save your project to a file, but this software is the only one that can understand its own file format; the binary file will never be useful anywhere else.

If you use a text editor to open a binary file, you will see copious amounts of garbage, seemingly random accented and Asian characters, and long lines overflowing with text – this exercise is safe but pointless. However, editing or saving a binary file in a text editor will corrupt the file, so never do this. The reason corruption happens is because applying a text mode interpretation will change certain byte sequences – such as discarding NUL bytes, converting newlines, discarding sequences that are invalid under a certain character encoding, etc. – which means that opening and saving a binary file will almost surely produce a file with different bytes.

Text file characteristics

Text file in text editor (good)

Text file in hex editor (inconvenient)

By convention, the data in every text file obeys a number of rules:

  • The text looks readable to a human or at least moderately sane. Even if it contains a heavy proportion of punctuation symbols (like HTML, RTF, and other markup formats), there is some visible structure and it’s not seemingly random garbage.

  • The data format is usually line-oriented. Each line could be a separate command, or a list of values could put each item on a different line, etc. The maximum number of characters in each line is usually a reasonable value like 100, not like 1000.

  • Non-printable ASCII characters are discouraged or disallowed. Examples include the NUL byte (0x00), DEL byte (0x7F), and most of the range 0x01 to 0x1F (except tab, carriage return, newline, etc.). Some text editors silently convert or discard these bytes, which is why binary files should never be edited in a text editor.

  • The reading of newline sequences is usually universal – namely, CR (classic Mac OS), LF (Unix), or CR+LF (Windows) all mean the same thing, which is to end the current line and start the next one. The writing of newline sequences usually normalizes to the preferred one on the current platform, regardless of which variant was read. (For example, a text file “Hello CR LF world CR Lorem LF ipsum” read in a Unix text editor would likely be written out as “Hello LF world LF Lorem LF ipsum”.)

  • There is some character encoding that governs how extended-ASCII bytes are handled. Byte values from 0x80 to 0xFF are not covered by the universally accepted ASCII standard, and the interpretation of these bytes depends on the choice of character encoding – such as UTF-8, ISO-8859-1, Shift JIS. Thus the interpretation of a text file depends on the character encoding used (unless the file format is known to be pure-ASCII), whereas a binary file is just a sequence of plain bytes with no inherent notion of character encoding.

  • In most text file formats, some flexibility is given to whitespace characters. For example, using one space or two space might not change the meaning of a command. And in C-like programming languages, one whitespace character has the same meaning as any positive number of whitespace or newline characters (except within strings).

Observations regarding the general computing environment around text files:

  • Any text file can be viewed or edited in any text editor. The creator of a text-based file format doesn’t need to know or care about what editor program a future user will use. This contrasts with binary formats, where each format is generally coupled to a specific program that can handle it.

  • Software developers work with many kinds of text files all the time – program source code, setup scripts, technical documentation, program-generated output and logs, configuration files, you name it. Much of the world of computer programming revolves around text files, since they are easy to create, edit, and consume. This contrasts with how an average computer user mainly works with binary files.

  • Text files are usually viewed and edited in a monospaced font. The historical reason is that the earliest computer terminals could only display monospaced text. A modern reason is that monospaced allows an author to visually align characters in different lines for ASCII art, tabular data, etc.

  • On the Unix platform, text files are so ubiquitous that they often have no file name extension at all. For example, “README” and “config” are the full names of popular files that exist in many places.

  • Beyond text editors, there exist many standard Unix tools to generate, manipulate, or view textual data: ls, more, sort, grep, awk, etc.

  • Version control software (Git, Mercurial, SVN, Perforce, etc.) work best with text files. For binary files they can only tell you whether a file has changed or not. For text files, they can show line-by-line differences, perform automatic merges, and do many other useful things.

Programming considerations

Every practical programming language provides separate facilities for working with binary versus text files. Generally speaking, if you read a binary file in text mode you will get unhelpful data that looks like garbage, if you write a binary file in text mode it will probably be corrupt, if you read a text file in binary mode you can’t perform any useful text operations on the bytes, and if you write a text file in binary mode you will need to manually convert characters to bytes. So it pays to use the right tools for the right job. To illustrate with concrete examples, let’s look briefly at how binary vs. text files work in three popular programming languages.

  • In C, you specify binary or text mode when opening a file stream with fopen(). The only difference this makes is that in text mode, reading a newline sequence always converts from universal newlines on disk to '\n' in memory, and writing '\n' always produces the platform's preferred newline sequence for the file on disk (such as "\r\n"). C only supports 8-bit characters, so text mode does not actually help with interpreting file bytes as Unicode code points.

  • In Python, you specify binary or text mode when opening a file stream with open(). In Python 3, if a file is opened in binary mode then read() and write() work with byte sequences of the bytes type. Otherwise a file as opened in text mode, then read() and write() apply a character encoding to convert the underlying bytes to/from a Unicode string of the str type. (The behavior is somewhat different in Python 2.)

  • In Java, InputStream and OutputStream work with bytes, whereas Reader and Writer work with Unicode characters (as UTF-16). You can open a FileInputStream (which works with bytes), and create an InputStreamReader on top of it with a specific character encoding (such as UTF-8). Using this Reader, you can get the text characters in the file. An analogous process applies to writing text to a file.