Project Nayuki


Subtle ways to lose data

After working with computers, files, and digital devices for decades, I noticed some non-obvious ways of losing data. Even though in theory digital data can be copied perfectly, in reality there are obstacles such as poor specifications, silent translation, incorrect error handling, and other phenomena that can alter data in an undesired way. Here is a non-exhaustive list of subtle data-losing scenarios that I can personally attest to.

Long file paths

In most circumstances, the maximum path length on Windows is 260 UTF-16 code units (including the null terminator). However, some Windows APIs allow programs to manipulate files with paths up to about 32000 characters in length. Surprisingly, Java can easily access files with long paths, even when most Windows applications, C#/.NET, and Python can’t. Older versions of Windows have very few programs that support long paths, but Windows 10 is slowly pushing the adoption of this feature. On Unix there is no consensus on the maximum path length, but a concrete example is that on Linux the maximum path length is 4096 bytes (the PATH_MAX constant, which includes the null terminator).

In Windows Explorer before Windows 10, it is easy to inadvertently move a file such that its path length exceeds 260 characters. Suppose we have two folders and one file, each with a 100-character-long name. We have the empty folder C:\dirA, the empty folder C:\dirB, and the file C:\file.txt. First, move file.txt into dirB, which is a legal operation. Now move dirB into dirA, which Windows Explorer allows because it only checks the immediate items being moved and doesn’t strictly check all its recursive subitems – namely, C:\dirA\dirB is about 200 characters long which is legal. But now the file is at the path C:\dirA\dirB\file.txt, which is over 300 characters long, thus exceeding the limit. When viewing dirB in Windows Explorer, it simply doesn’t list the file because its path is too long. If you use Windows Explorer to copy dirA to another place, then it will skip the file with the long path, thus resulting in silent data loss.
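One way to catch this situation early is to scan a folder tree for paths that exceed the limit. The following Python sketch is illustrative only: the names find_long_paths and utf16_units are made up for this example, and it assumes the scan runs on a system that can itself enumerate the long paths (for instance, Linux reading an NTFS volume).

```python
import os

WINDOWS_MAX_PATH = 260  # UTF-16 code units, including the null terminator


def utf16_units(text: str) -> int:
    """Count UTF-16 code units, the unit in which Windows measures paths."""
    return len(text.encode("utf-16-le", errors="surrogatepass")) // 2


def find_long_paths(root: str, limit: int = WINDOWS_MAX_PATH):
    """Yield each path under root whose length plus terminator exceeds limit."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if utf16_units(path) + 1 > limit:
                yield path
```

Note that the check applies to the path as it would appear on the Windows side, so a tree scanned under a different mount point would need its prefix adjusted before measuring.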

Disallowed file names

On Unix, no file name can contain the characters / (slash) or \0 (NUL). Also, no file/directory/item can be named “.” or “..” because they are special names that perform path navigation. These restrictions are not too onerous, but can create interoperability problems when transferring files to Windows, where file names are more restricted.

On Windows, no file name can contain the characters "*:/<>?\| (double quote, asterisk, colon, slash, less than, greater than, question mark, backslash, vertical bar). No file name can be a device like CON, AUX, NUL, LPT1, etc., not even with an extension (e.g. con.txt, aUx.abc.PNG). Windows Explorer forbids creating files with leading or trailing spaces, or a leading period – but other programs can create such file names. More details can be found in a Microsoft article.
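The rules above can be turned into a rough pre-flight check before copying files to Windows. This Python sketch is not a complete validator: the function name windows_name_problems is hypothetical, and the device list extends the examples in the text with PRN, COM1 to COM9, and LPT2 to LPT9, which is an assumption based on Microsoft's file-naming documentation.

```python
from typing import List

FORBIDDEN_CHARS = set('"*:/<>?\\|')

# CON, AUX, NUL, LPT1 come from the text; the rest are an assumed extension.
RESERVED_DEVICES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_name_problems(name: str) -> List[str]:
    """Return a list of reasons why a single file name is risky on Windows."""
    problems = []
    bad = sorted(FORBIDDEN_CHARS & set(name))
    if bad:
        problems.append("forbidden character(s): " + " ".join(bad))
    # A device name is reserved even when extensions follow it (con.txt).
    stem = name.split(".", 1)[0].strip().upper()
    if stem in RESERVED_DEVICES:
        problems.append("reserved device name: " + stem)
    if name != name.strip(" "):
        problems.append("leading or trailing space")
    if name.startswith("."):
        problems.append("leading period")
    return problems


print(windows_name_problems("aUx.abc.PNG"))  # ['reserved device name: AUX']
```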

File name escaping

When using a command line shell or writing a script, certain characters need to be escaped so that they are not interpreted in a special way – for example, space separates arguments, question mark and asterisk are for globbing, parentheses are for subcommands, et cetera. Moreover, the programs being invoked need to parse the command line argument strings to distinguish between flags (usually starting with hyphen) and argument values – so files whose names start with a hyphen are problematic. It is a well-known Unix prank to ask someone to delete a file literally named - (hyphen) or literally named * (star).

If a file has problematic characters (such as hyphen, space, star, backslash, dollar) and a user or script fails to escape names properly, then this can result in data corruption or loss. Erroneous programs could create files under the wrong name, fail to read or copy every file, fail to delete files that should be deleted, etc.
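As a hedged illustration of defensive escaping, the Python sketch below builds a command line for a POSIX shell. It quotes every file name with shlex.quote and inserts "--", which many (but not all) Unix programs treat as the end of options. The function name safe_shell_command is made up for this example.

```python
import shlex


def safe_shell_command(program: str, filenames) -> str:
    """Build a POSIX shell command in which every file name is quoted and
    "--" separates options from names, so "-" or "*" stays a plain name."""
    parts = [shlex.quote(program), "--"]
    parts += [shlex.quote(name) for name in filenames]
    return " ".join(parts)


names = ["-", "*", "my file.txt", "a$b"]
command = safe_shell_command("rm", names)
print(command)

# Parsing the command the way a POSIX shell would recovers the exact names.
print(shlex.split(command) == ["rm", "--"] + names)  # True
```

When no shell features are needed, passing the argument list directly to subprocess.run (without shell=True) avoids shell parsing altogether, though the "--" separator is still needed to protect names that start with a hyphen.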

Large files and folders

All versions of FAT limit any single file to slightly less than 4 gibibytes (GiB). They also limit any directory to fewer than 65536 items (and far fewer if files have long names). If you attempt to transfer large files or folders from NTFS to FAT, you will either get an error message or silent data loss.
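A pre-transfer check for these limits could look like the following Python sketch. The function names are hypothetical, the file size ceiling assumes FAT's 32-bit size field (2^32 − 1 bytes), and the directory check uses only the 65536 upper bound from the text, so it will not catch directories that overflow earlier because of long names.

```python
import os

FAT_MAX_FILE_SIZE = 2**32 - 1   # bytes; "slightly less than 4 GiB"
FAT_MAX_DIR_ITEMS = 65536       # a directory must hold fewer items than this


def exceeds_fat_file_size(size_in_bytes: int) -> bool:
    """True if a file of this size cannot be stored on FAT."""
    return size_in_bytes > FAT_MAX_FILE_SIZE


def fat_transfer_problems(root: str):
    """Yield (path, reason) for items under root that FAT cannot hold."""
    for dirpath, dirnames, filenames in os.walk(root):
        if len(dirnames) + len(filenames) >= FAT_MAX_DIR_ITEMS:
            yield dirpath, "too many items in directory"
        for name in filenames:
            path = os.path.join(dirpath, name)
            if exceeds_fat_file_size(os.path.getsize(path)):
                yield path, "file too large"
```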

Windows/Unix file name length mismatch

On Windows, a file name can be approximately 255 UTF-16 code units long. On Unix, a file name can be at most 255 bytes long. But for example, a string of 255 Chinese characters in UTF-16 would produce 765 UTF-8 bytes. This means that when reading an NTFS volume on Unix, it is impossible for the operating system to handle these over-long file names, so those files silently disappear from listings. This is a problem when inspecting an NTFS drive on Unix to gather data, and especially when trying to make an accurate and complete copy of all the files.
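The arithmetic is easy to reproduce. This short Python sketch (helper names are made up for illustration) measures the same name in both units and reproduces the 255 versus 765 figures above, using one arbitrary Chinese character repeated 255 times.

```python
def utf16_units(name: str) -> int:
    """Length as Windows counts it: UTF-16 code units."""
    return len(name.encode("utf-16-le")) // 2


def utf8_bytes(name: str) -> int:
    """Length as Unix counts it: bytes (assuming UTF-8 file names)."""
    return len(name.encode("utf-8"))


def fits_unix_limit(name: str, limit: int = 255) -> bool:
    return utf8_bytes(name) <= limit


name = "文" * 255
print(utf16_units(name))      # 255
print(utf8_bytes(name))       # 765
print(fits_unix_limit(name))  # False
```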

Legacy 8.3 file names

On Windows FAT and NTFS, each file and folder has an 8.3-compatible name to support legacy applications from Windows 3, DOS, etc. Those apps can access files using the short path, then read and write the file data, but cannot write new files with long names. Hence, legacy applications cannot faithfully reproduce a folder that contains files with long names. For example, the file at “C:\Program Files\Acme\docuMENT.html” could have an 8.3 path of “C:\PROGRA~1\ACME\DOCUME~1.HTM”, because the four-character extension does not fit the 8.3 pattern and the name therefore gets truncated and numbered.

Truncated download

When transferring a file over a network, premature termination could result in one of several behaviors: a truncated file is saved and the user must manually resume the transfer, the partial file is deleted, or the transfer program automatically tries to resume the download. For example, Windows Explorer deletes partial files when copying between local drives or SMB servers. By contrast, FTP and old HTTP (without the Content-Length header) data transfers can result in silently truncated files, which only the user can detect.

Timestamp semantics

Just because a file timestamp field exists doesn’t mean it behaves the way you want or expect it to. Basic properties of each field:

  • Modification time: This denotes the most recent time a file was modified. For a directory, it denotes when any of its direct subitems (file/subdirectory) was added, renamed, or deleted – but it does not denote when any file content or subfolder was updated.

  • Creation time (Windows only): This denotes when a file or folder was created. When a file/folder is copied to another place, this timestamp is reset to the current time by default, but the modification time is copied. Hence it is entirely possible to have a file whose creation time is after its modification time. The creation timestamp is subject to the tunneling feature, where if a file is deleted and a new file is created with the same name within a few seconds, it will adopt the creation timestamp of the old file instead of resetting to the current time.

  • Access time: In principle, this timestamp should be updated any time a file is opened for reading or writing. This timestamp can be useful for deciding which infrequently accessed files can be compressed, moved to slower media, deleted from cache, etc. But in practice, a system might be configured to not update access times to reduce the I/O workload. Also, this timestamp cannot be updated if the file system is mounted as read-only (in software) or is actually read-only (like a CD-ROM).

File timestamps are usually preserved in archive formats (ZIP, tar, etc.) and are usually restored when unzipping, but the devil is in the details; each format and each tool has its own idiosyncrasies. File timestamps are conveyed by network file system protocols like NFS and SMB, by Unix tools like scp and rsync, but generally not conveyed by popular protocols like HTTP and FTP.
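Even within one toolkit, whether a copy keeps the modification time depends on which function is called. As a small illustration in Python's standard library, shutil.copy copies the content and permission bits, whereas shutil.copy2 also attempts to preserve metadata such as the modification time. The file names and the timestamp value below are arbitrary choices for the demonstration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "original.txt")
    with open(src, "w") as f:
        f.write("hello\n")

    old = 1_000_000_000  # an arbitrary past time (September 2001), in seconds
    os.utime(src, (old, old))  # set access and modification times

    plain = shutil.copy(src, os.path.join(tmp, "plain.txt"))
    faithful = shutil.copy2(src, os.path.join(tmp, "faithful.txt"))

    plain_mtime = int(os.path.getmtime(plain))
    faithful_mtime = int(os.path.getmtime(faithful))

print(plain_mtime == old)     # False: the copy is stamped with the current time
print(faithful_mtime == old)  # True: the modification time was carried over
```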

File timestamps interact poorly with version control systems. For example, Git only stores a timestamp per commit, which applies to an entire snapshot of a tree of directories and files; it does not have per-file timestamps. When retrieving an old version of a file, Git writes out the file contents but sets the modification timestamp to the current time, which seems counterintuitive. But it makes sense in conjunction with a build system like Make, which compares timestamps of source files versus target files to see if anything needs to be rebuilt.

In summary, even though file timestamp fields have simple-sounding goals, their detailed semantics, exceptional behavior, and lossiness mean that their data are not very reliable.

Timestamp granularity

A file or directory on a file system has various properties such as the last modification timestamp. Every timestamp has an inherent resolution/precision/granularity such as 1 second or 1 microsecond.

  • NTFS: 100 nanoseconds for creation, modification, access

  • FAT: 10 milliseconds for creation, 2 seconds for modification, 1 day for access

  • ext2, ext3: 1 second for ctime, mtime, atime

  • ext4, Btrfs: 1 nanosecond for ctime, mtime, atime

When moving/copying files from one file system to another, essentially every piece of software will silently convert timestamps, even if it involves a loss of resolution. For example, a file on NTFS with a timestamp of 01:23:45.6789012 being moved to a FAT32 volume would have its timestamp rounded to 01:23:46. I am not sure about the timestamp granularity in archive formats like ZIP, RAR, 7-Zip, tar, etc., but there are surely more opportunities for mismatches.
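The NTFS-to-FAT example can be worked through with integer arithmetic. This Python sketch represents a time of day as a count of 100-nanosecond ticks and rounds it up to the next 2-second boundary. Rounding up is an assumption chosen to match the example above; other software may truncate instead. The function name is hypothetical.

```python
TICKS_PER_SECOND = 10_000_000  # NTFS resolution: 100 nanoseconds per tick


def round_up_to_fat_mtime(ticks: int) -> int:
    """Round a tick count up to a multiple of 2 seconds (FAT modification time)."""
    step = 2 * TICKS_PER_SECOND
    return -(-ticks // step) * step  # ceiling division using integers only


# 01:23:45.6789012 expressed as ticks since midnight
ticks = (1 * 3600 + 23 * 60 + 45) * TICKS_PER_SECOND + 6_789_012
rounded = round_up_to_fat_mtime(ticks)

print(rounded // TICKS_PER_SECOND)  # 5026 seconds after midnight, i.e. 01:23:46
print(rounded - ticks)              # number of ticks by which the timestamp moved
```

Using integers avoids a second, hidden loss of precision: a 64-bit float cannot represent every 100-nanosecond tick of a present-day timestamp exactly.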

Time zone of timestamps

For files stored on a FAT file system, the timestamps are stored in terms of the local time of the machine that wrote the files. For files stored on NTFS and Unix file systems, timestamps are stored in UTC, independent of the machine’s time zone. When manipulating files under any timestamp scheme, there is no problem if the local time zone of the machine stays constant when creating, updating, and transferring files.

But when the machine’s time zone changes, data loss can occur. If the location experiences daylight saving time (DST), then each year when the clock springs forward there is an invalid hour in local time which does not exist, and when the clock falls back there is a duplicated hour where the time zone is ambiguous. Additionally, if the file timestamp was written in one time zone but the file is transferred from FAT to NTFS on a machine with a different time zone, then the translation from local time to UTC would be incorrect. On Windows, local timestamps are interpreted in the current time zone without regard to DST. So if a file was created in UTC−4, then half a year passes and DST changes the time zone to UTC−5, and the file gets transferred from FAT to NTFS, then the old file timestamp is interpreted as UTC−5, not UTC−4.
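The UTC−4 versus UTC−5 scenario can be reproduced with fixed offsets. In this Python sketch, the function name and the sample date are illustrative assumptions; the point is only that the same stored local time maps to two different UTC instants depending on which offset is applied at conversion time.

```python
from datetime import datetime, timedelta, timezone


def fat_local_to_utc(local_naive: datetime, utc_offset_hours: int) -> datetime:
    """Interpret a zone-less FAT timestamp using one fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc)


stored = datetime(2020, 7, 1, 12, 0, 0)    # wall-clock time written to FAT
true_utc = fat_local_to_utc(stored, -4)    # offset in effect when written
misread_utc = fat_local_to_utc(stored, -5) # offset in effect at transfer time

print(true_utc.isoformat())     # 2020-07-01T16:00:00+00:00
print(misread_utc.isoformat())  # 2020-07-01T17:00:00+00:00
print(misread_utc - true_utc)   # 1:00:00
```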

Generally speaking for any information system (not limited to file systems), storing timestamps in local time easily results in data loss; it causes headaches if the goal is to achieve long-term data accuracy. Every serious system should be designed to work with UTC timestamps, with optional functions to display them in local time.

File creation timestamp

Windows has a notion of creation time for files, but Unix does not. Copying a file from, say, NTFS to ext3 will incur the loss of creation time metadata. Unix has a concept of ctime which reflects the last time that a file’s inode metadata was changed, but I find it nearly useless compared to the concept of file creation time.

Application configuration

Almost every computer application stores some state that carries over to future invocations of the app. The easiest way to test this idea is that if you take your favorite app and install it on a new PC/phone/device, how does its behavior differ from the main copy of the app that you are accustomed to using, and how many configuration knobs do you need to turn?

Examples: Windows Explorer lets you configure how the icons of each folder are sorted and displayed. A music player program has a user-created playlist that persists even after you restart the program. A word processor has a dictionary of custom words that you added. A mail program has a local cache of email messages downloaded from a server. A photo editor maintains a library/catalog of all the image files that you imported. A web browser stores your bookmarks and browsing history.

App configuration can be stored in a variety of places, which usually aren’t documented clearly to the end user. On Windows, an app could store its configuration data somewhere in the current user’s profile folder directly, in the AppData folder of the user’s profile, in the Windows Registry, at an online server, in the program directory (bad practice), or in the Windows system directory (very bad practice). On Unix, a program could store its configuration in the user’s home directory, in /var, in /etc, or possibly other places.

Alternate data streams / forks

A file has a primary stream of bytes, which is addressed by the main path used to access the file. But it can have other named data streams on the side. All these streams move together when a file is renamed, moved, or copied. But the mechanisms of alternate data streams / resource forks are specific to each operating system (Windows vs. Mac vs. Unix) and file system, and generally transfer poorly between different systems. For example, a file’s non-primary data streams are not carried by protocols such as HTTP or FTP, nor by old file systems like FAT (unlike NTFS).

File permissions

On Windows NTFS, every file and directory can have an access control list with many users and many security attributes. On Unix, every file/dir has either a dozen permission bits and a user and a group, or a full ACL like Windows. Many tools for transferring or archiving files/folders discard permission information. But even if the file security information is retained, it is almost certainly meaningless when it is blindly reproduced on a different computer system.

File sparseness

Most file systems support the concept of a file where most of the data is zero but some parts are filled in, and only those parts are stored on disk. For example, it is possible to create a logical 100 GB file where a couple of 1 MB blocks have real data and the rest is zero, and the file system will only consume a few megabytes of actual disk space to represent the file. If a volume has many sparse files, the total logical size could far exceed the real size of the volume. It can be assumed that backup tools and file transfer tools do not archive or restore sparse files accurately, and end up storing the entire logical file contents literally and consuming massive amounts of space.
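On Unix-like systems, a rough way to spot sparse files before a backup is to compare the logical size with the space actually allocated. The Python sketch below is a heuristic under stated assumptions: it relies on the st_blocks field (not available on Windows) counting 512-byte units, and compressed or inline-stored files can also show less allocated space than their logical size. The function names are made up for this example.

```python
import os


def size_report(path: str):
    """Return (logical_bytes, allocated_bytes) for a file on a Unix-like system."""
    st = os.stat(path)
    logical = st.st_size
    allocated = st.st_blocks * 512  # st_blocks is counted in 512-byte units
    return logical, allocated


def looks_sparse(path: str) -> bool:
    """Heuristic: the file occupies less disk space than its logical size."""
    logical, allocated = size_report(path)
    return allocated < logical
```

A backup tool that reads such a file byte by byte will see the full logical size, so summing the logical sizes gives a worst-case estimate of the space the copy may need.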

Line separators in text files

There are at least three conventions for line separators or terminators in text files: LF (Unix, Linux, Mac OS X), CR+LF (Windows, most Internet protocols), CR (Mac OS 9 and below). Mac OS 9 and CR are just a historical note and there are essentially no such text files in the wild today. LF and CR+LF are both very popular, so care is needed to consciously choose a convention. Haphazard editing of a file can result in an unpleasant mixture of LF and CR+LF line separator sequences. Careless editing of a project can leave some files with LF and some files with CR+LF. I advocate for LF instead of CR+LF because it is easier to parse, because CR+LF confers no additional meaning, and because Unix diff tools work natively with LF and could show noisy control codes if CR characters are present.
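Mixed line endings are easy to detect mechanically. This Python sketch counts each convention in a byte string; the function names are invented for illustration, and the input is assumed to be text in an ASCII-compatible encoding such as UTF-8.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CR+LF pairs, lone LF bytes, and lone CR bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF bytes that are not part of CR+LF
    cr = data.count(b"\r") - crlf  # CR bytes that are not part of CR+LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def has_mixed_line_endings(data: bytes) -> bool:
    counts = count_line_endings(data)
    return sum(1 for n in counts.values() if n > 0) > 1


sample = b"first\r\nsecond\nthird\n"
print(count_line_endings(sample))      # {'CRLF': 1, 'LF': 2, 'CR': 0}
print(has_mixed_line_endings(sample))  # True
```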

File hard links

On a file system, it is possible to have multiple file paths pointing to the same piece of mutable file data; each path is called a hard link. But storing these hard links in archives or transferring them to another file system can be difficult. If an immutable file with multiple hard links is transferred naively, then this results in a waste of storage space. But if a mutable file with multiple hard links is transferred without preserving links, then this subtly changes the semantics because now each copy of the file can be changed independently and the changes cannot be seen at the other file paths.
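Before archiving or transferring a tree, it can help to know which files are hard-linked to each other. The following Python sketch groups regular files by device and inode number; the function name is hypothetical, and it only reports groups where at least two of the linked paths lie inside the scanned tree.

```python
import os
import stat
from collections import defaultdict


def find_hard_link_groups(root: str):
    """Return lists of paths under root that share the same file data."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Only regular files with more than one link are of interest.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```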

ASCII mode file transfers

Some protocols like FTP can transfer a file in ASCII mode or binary mode. It never hurts to transfer a file in binary mode because all bytes are preserved literally, but for text files it is sometimes desirable to convert the line separators after the file transfer. Transferring a file in ASCII mode will convert the line separators to the native sequence on the receiving system, but this action noticeably corrupts binary files.
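The corruption is simple to demonstrate. This Python sketch simulates a naive LF to CR+LF translation, as a Windows receiver might apply in ASCII mode, and runs it over the 8-byte PNG file signature, which contains both a CR+LF pair and a lone LF. The function is a simplified stand-in for illustration, not the behavior of any particular FTP client.

```python
def simulate_ascii_mode_to_crlf(data: bytes) -> bytes:
    """Normalize every line ending to CR+LF, as a naive text-mode transfer might."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


png_signature = b"\x89PNG\r\n\x1a\n"
damaged = simulate_ascii_mode_to_crlf(png_signature)

print(len(png_signature), len(damaged))  # 8 9
print(damaged == png_signature)          # False: the file no longer starts as a PNG

text = b"line one\nline two\n"
print(simulate_ascii_mode_to_crlf(text))  # b'line one\r\nline two\r\n'
```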

Git branches

When making a commit in the Git version control system, the branch name is not recorded in the commit. As branch pointers move and as branches merge, there is no preserved history of which commits took place on which branch. This behavior probably differs from a user’s mental model of how a branch should work.

Well-known ways to lose data

These phenomena are less interesting, but they are listed for the sake of completeness:

Recommendations

I have some high-level tips to offer on the topic of avoiding subtle data loss. These points are not specific to any software or brand, but are general principles that should withstand the test of time.

Avoid non-file storage

Computer-based files are simple to understand and essentially universal. Avoid storing precious data in less universal formats such as CD audio, digital video tape, etc. Also avoid storing very important data in format-specific places – such as metadata fields (title, author, date, etc.) that can be supplied when burning a data CD.

Use simple semantics for files

At the most basic level, a file is an unnamed finite sequence of bytes. Everything beyond that is incrementally adding complexity: file names, directories, paths, attributes (such as author and title), timestamps, permissions (such as read-only), compression, encryption, hard links, symlinks, etc.

Define a clear data model

The shorter and clearer a data model is, the easier it is to understand and reason about. A data model that is designed to reduce data loss will define what each field means, what values are and aren’t allowed, and how programs should interpret data, and it will specify every edge case.

As an anti-example, the Windows model of files is very complex; many people do not know about behaviors such as reserved file names like NUL and CON, file name length limits, timestamp granularity, time zones, alternate data streams, etc. Storing or manipulating data that the user is not aware of is a recipe for unintended data loss, because by the time the user understands why that piece of data is important and useful, it may already be gone.

Use cryptographic hashes

Putting files and folders into a cryptographic hash tree is a proven way to uphold strong file integrity. If any file is added, modified, or deleted, then the root hash is essentially guaranteed to change and be noticeable. This sets a very high bar for integrity because every data byte that goes into the hash computation must be well-defined and visible, and it is impossible to change the data silently. The best and most popular example for this model is the Git version control system. It is not hard to write code to compute and verify the hash of every Git blob, tree, and commit. Hence it is possible to verify that the actual stored data exactly matches what the hash claims to cover.
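To give a sense of how little code this takes, here is a Python sketch that computes the hash of a Git blob. It assumes a repository using Git's SHA-1 object format, in which a blob is hashed as the ASCII header "blob", a space, the decimal byte length, a NUL byte, and then the file content. The function name is made up for this example; trees and commits follow the same header scheme with their own content layouts.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the SHA-1 object ID that Git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


print(git_blob_hash(b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

print(git_blob_hash(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The results can be compared against the output of git hash-object for the same content, which is one way to confirm that stored data matches what the hash claims to cover.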