Subtle ways to lose data

After working with computers, files, and digital devices for decades, I noticed some non-obvious ways of losing data. Even though in theory digital data can be copied perfectly, in reality there are obstacles such as poor specifications, silent translation, incorrect error handling, and other phenomena that can alter data in an undesired way. Here is a non-exhaustive list of subtle data-losing scenarios that I can personally attest to, along with explanations of the underlying details.

Long file paths

In most circumstances, the maximum path length on Windows is 260 UTF-16 code units (including the null terminator). However, some Windows APIs allow programs to manipulate files with paths up to about 32000 characters in length. Surprisingly, Java can easily access files with long paths, even when most Windows applications, C#/.NET, and Python can’t. Older versions of Windows have very few programs that support long paths, but Windows 10 is slowly pushing the adoption of this feature. On Unix there is no consensus on the maximum path length, but as a concrete example, Linux defines PATH_MAX as 4096 bytes (including the null terminator).

In Windows Explorer before Windows 10, it is easy to inadvertently move a file such that its path length exceeds 260 characters. Suppose we have two folders and one file, each with a 100-character-long name. We have the empty folder C:\dirA, the empty folder C:\dirB, and the file C:\file.txt. First, move file.txt into dirB, which is a legal operation. Now move dirB into dirA, which Windows Explorer allows because it only checks the immediate items being moved and doesn’t strictly check all its recursive subitems – namely, C:\dirA\dirB is about 200 characters long which is legal. But now the file is at the path C:\dirA\dirB\file.txt, which is over 300 characters long, thus exceeding the limit. When viewing dirB in Windows Explorer, it simply doesn’t list the file because its path is too long. If you use Windows Explorer to copy dirA to another place, then it will skip the file with the long path, thus resulting in silent data loss.
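
One way to catch this situation before copying is to scan a folder tree for over-long paths. The following Python code is only a minimal sketch: the function names are made up for illustration, and the length is counted in UTF-16 code units plus one for the null terminator, following the limit described above.

```python
import os

def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to encode the string."""
    return len(s.encode("utf-16-le", "surrogatepass")) // 2

def find_long_paths(root: str, limit: int = 260) -> list[str]:
    """Return every path under root whose length, plus the null terminator, exceeds limit."""
    too_long = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if utf16_units(full) + 1 > limit:
                too_long.append(full)
    return too_long
```

Note that on Windows the scan itself may be unable to see items behind an over-long path unless long-path support is enabled or the root is given with the \\?\ prefix, so an empty result is not proof that nothing is hidden.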

On the subject of file paths, path strings on Windows are a veritable minefield of complex semantics, edge cases, and opportunities for misunderstandings: Project Zero: The Definitive Guide on Win32 to NT Path Conversion.

Disallowed file names

On Unix, no file name can contain the characters / (slash) or \0 (NUL). Also, no file/directory/item can be named “.” or “..” because they are special names that perform path navigation. These restrictions are not too onerous, but can create interoperability problems when transferring files to Windows, where file names are more restricted.

On Windows, no file name can contain the characters "*:/<>?\| (double quote, asterisk, colon, slash, less than, greater than, question mark, backslash, vertical bar). No file name can be a device like CON, AUX, NUL, LPT1, etc., not even with an extension (e.g. con.txt, aUx.abc.PNG). Windows Explorer forbids creating files with leading or trailing spaces or periods – but other programs can create such file names. More details can be found in these articles: Microsoft, Stack Overflow.
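
The rules above can be turned into a rough pre-flight check before moving files from Unix to Windows. This Python sketch is not an authoritative validator: the function name is hypothetical, and the device list extends the "CON, AUX, NUL, LPT1, etc." in the text with PRN, COM1 to COM9, and LPT1 to LPT9 as an assumption.

```python
import re

# The nine characters listed in the text.
FORBIDDEN_CHARS = set('"*:/<>?\\|')

# Assumed device list; matches the bare name or the name followed by any extension.
RESERVED = re.compile(r"^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\..*)?$", re.IGNORECASE)

def windows_name_problems(name: str) -> list[str]:
    """Return a list of reasons why a single file name may be trouble on Windows."""
    problems = []
    if any(c in FORBIDDEN_CHARS for c in name):
        problems.append("forbidden character")
    if RESERVED.match(name):
        problems.append("reserved device name")
    if name[:1] in (" ", ".") or name[-1:] in (" ", "."):
        problems.append("leading or trailing space or period")
    return problems
```

For example, windows_name_problems("aUx.abc.PNG") reports a reserved device name, while "console.txt" passes because only the exact device name before the first period is reserved.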

File name escaping

When using a command line shell or writing a script, certain characters need to be escaped so that they are not interpreted in a special way – for example, space separates arguments, question mark and asterisk are for globbing, parentheses are for subcommands, et cetera. Moreover, the programs being invoked need to parse the command line argument strings to distinguish between flags (usually starting with hyphen) and argument values – so files whose names start with a hyphen are problematic. It is a well-known Unix prank to ask someone to delete a file literally named - (hyphen) or literally named * (star).

If a file has problematic characters (such as hyphen, space, star, backslash, dollar) and a user or script fails to escape names properly, then this can result in data corruption or loss. Erroneous programs could create files under the wrong name, fail to read or copy every file, fail to delete files that should be deleted, etc.
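
As an illustration of escaping done carefully, here is a small Python sketch that builds a POSIX shell command line from arbitrary file names. The helper names are invented for this example; shlex.quote is the standard library function for shell quoting, and the "--" end-of-options marker is a convention that most, but not all, Unix tools honor.

```python
import shlex

def safe_operand(name: str) -> str:
    """Prefix a name that starts with '-' so that tools do not parse it as a flag."""
    return "./" + name if name.startswith("-") else name

def shell_command(program: str, names: list[str]) -> str:
    """Build a POSIX shell command line with every file name quoted."""
    parts = [shlex.quote(program), "--"]
    parts += [shlex.quote(safe_operand(n)) for n in names]
    return " ".join(parts)

# The two prank file names from above, plus a name containing a space.
print(shell_command("rm", ["-", "*", "my file.txt"]))
# prints: rm -- ./- '*' 'my file.txt'
```

When a script does not actually need a shell, passing the arguments as a list to subprocess.run avoids shell parsing altogether, which leaves only the leading-hyphen problem to handle.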

Large files and folders

All versions of FAT limit any single file to slightly less than 4 gibibytes (GiB). They also limit any directory to have fewer than 65536 items (and far less if files have long names). If you attempt to transfer large files or folders from NTFS to FAT, you will either get an error message or silent data loss.
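
Before copying a tree onto a FAT volume, the two limits can be checked ahead of time. The sketch below uses assumed constants derived from the text (one byte less than 4 GiB, and at most 65535 entries per directory) and a made-up function name; it does not model the extra directory entries consumed by long names.

```python
import os

FAT_MAX_FILE_BYTES = 2**32 - 1   # one byte short of 4 GiB
FAT_MAX_DIR_ENTRIES = 65535      # "fewer than 65536"; long names lower this further

def fat_transfer_problems(root, max_bytes=FAT_MAX_FILE_BYTES,
                          max_entries=FAT_MAX_DIR_ENTRIES):
    """Return (path, reason) pairs for items that would not fit on a FAT volume."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        if len(dirnames) + len(filenames) > max_entries:
            problems.append((dirpath, "too many directory entries"))
        for name in filenames:
            full = os.path.join(dirpath, name)
            # lstat avoids failing on dangling symbolic links.
            if os.lstat(full).st_size > max_bytes:
                problems.append((full, "file too large"))
    return problems
```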

Windows/Unix file name length mismatch

On Windows, a file name can be approximately 255 UTF-16 code units long. On Unix, a file name can be at most 255 bytes long. But for example, a string of 255 Chinese characters in UTF-16 would produce 765 UTF-8 bytes. This means that when reading an NTFS volume on Unix, it is impossible for the operating system to handle these over-long file names, so those files silently disappear from listings. This is a problem when inspecting an NTFS drive on Unix to gather data, and especially when trying to make an accurate and complete copy of all the files.
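
The arithmetic can be checked directly. In this Python sketch the helper names and the 255-byte default are illustrative; the character 中 stands in for any Chinese character in the Basic Multilingual Plane, which takes one UTF-16 code unit but three UTF-8 bytes.

```python
def utf16_units(name: str) -> int:
    """Length of the name as Windows counts it: UTF-16 code units."""
    return len(name.encode("utf-16-le")) // 2

def utf8_bytes(name: str) -> int:
    """Length of the name as a typical Unix system counts it: UTF-8 bytes."""
    return len(name.encode("utf-8"))

def fits_unix_name_limit(name: str, limit: int = 255) -> bool:
    return utf8_bytes(name) <= limit

name = "中" * 255
print(utf16_units(name))           # 255
print(utf8_bytes(name))            # 765
print(fits_unix_name_limit(name))  # False
```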

Legacy 8.3 file names

On Windows FAT and NTFS, each file and folder has an 8.3-compatible name to support legacy applications from Windows 3, MS-DOS, etc. Those apps can access files using the short path, then read and write the file data, but cannot write new files with long names. Hence, legacy applications cannot faithfully reproduce a folder that contains files with long names. For example, the file at “C:\Program Files\Acme\docuMENT.html” could have an 8.3 path of “C:\PROGRA~1\ACME\DOCUME~1.HTM”, because the 4-character extension does not fit the 8.3 pattern and forces a shortened name.

Writable auto-mount

The default behavior on Windows and consumer-friendly Linux distributions is to automatically mount storage devices as writable. For example, upon plugging in a USB flash drive, the computer will mount it and present it as a writable folder. Even if the user doesn’t manually change or write any data, the act of mounting can trigger actions that write data to the drive, such as fsck/chkdsk (especially for FAT), journal recovery (NTFS, ext3, etc.), setting a “volume in use” bit, and potentially even defragmenting files in the background. These alterations are detrimental to forensic and data recovery work, where it is important to preserve the state of the drive at a certain point in time. With pristine data on a drive, it is possible to clone it and experiment with different data extraction and repair heuristics, instead of blindly letting the operating system run its preferred algorithms which could potentially destroy data permanently.

Shortcut auto-update

In the old days of Windows 9x, moving the target of a shortcut would consistently break the shortcut. But somewhere along the way, the Windows NT line gained the ability to automatically update shortcuts when their target is moved. This is achieved by the Distributed Link Tracking services that run in the background. More info: Microsoft Docs, Microsoft Support.

But this mechanism is brittle in the face of drives that come and go. For example, if a shortcut is on a dismounted drive when the target is moved, then the shortcut won’t be updated when the drive is mounted again. False positives are possible too. Suppose Alice creates a shortcut to D:\readme.txt, puts it on a USB drive, and gives the drive to Bob, who mounts it. If Bob happens to have a completely unrelated D:\readme.txt file and renames it, then the shortcut might get updated as well. This type of confusion is especially likely to happen if the shortcuts are created by the same user at different PCs and/or involve old snapshots of the user’s own data.

Timestamp semantics

Just because a file timestamp field exists doesn’t mean it behaves the way you want or expect it to. Basic properties of each field:

Modification time: This denotes the most recent time a file was modified. For a directory, it denotes when any of its direct subitems (file/subdirectory) was added, renamed, or deleted – but it does not denote when any file content or subfolder was updated.
Creation time (Windows only): This denotes when a file or folder was created. When a file/folder is copied to another place, this timestamp is reset to the current time by default, but the modification time is copied. Hence it is entirely possible to have a file whose creation time is after its modification time. The creation timestamp is subject to the tunneling feature, where if a file is deleted and a new file is created with the same name within a few seconds, it will adopt the creation timestamp of the old file instead of resetting to the current time.
Access time: In principle, this timestamp should be updated any time a file is opened for reading or writing. This timestamp can be useful for deciding which infrequently accessed files can be compressed, moved to slower media, deleted from cache, etc. But in practice, a system might be configured to not update access times to reduce the I/O workload. Also, this timestamp cannot be updated if the file system is mounted as read-only (in software) or is actually read-only (like a CD-ROM).
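
Whether the modification time survives a copy depends on the tool, even within one language's standard library. The Python sketch below (the function name and the file names are arbitrary) copies the same file twice: shutil.copy transfers only the contents and permission bits, while shutil.copy2 also attempts to preserve metadata such as the modification time.

```python
import os
import shutil

def demo_copy_timestamps(workdir: str) -> tuple[int, int, int]:
    """Return the mtimes (whole seconds) of a source file and two kinds of copies."""
    src = os.path.join(workdir, "src.txt")
    with open(src, "w") as f:
        f.write("hello")
    old = 1_000_000_000  # 2001-09-09 01:46:40 UTC
    os.utime(src, (old, old))  # set access and modification times

    plain = shutil.copy(src, os.path.join(workdir, "plain.txt"))
    faithful = shutil.copy2(src, os.path.join(workdir, "faithful.txt"))
    return (int(os.stat(src).st_mtime),
            int(os.stat(plain).st_mtime),
            int(os.stat(faithful).st_mtime))
```

On a typical system the first and third values are 1000000000, while the second is the time at which the copy was made.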

File timestamps are usually preserved in archive formats (ZIP, tar, etc.) and are usually restored when unzipping, but the devil is in the details; each format and each tool has its own idiosyncrasies. File timestamps are conveyed by network file system protocols like NFS and SMB, by Unix tools like scp and rsync, but generally not conveyed by popular protocols like HTTP and FTP.

File timestamps interact poorly with version control systems. For example, Git only stores a timestamp per commit, which applies to an entire snapshot of a tree of directories and files; it does not have per-file timestamps. When retrieving an old version of a file, Git writes out the file contents but sets the modification timestamp to the current time, which seems counterintuitive. But it makes sense in conjunction with a build system like Make, which compares timestamps of source files versus target files to see if anything needs to be rebuilt.

In summary, even though file timestamp fields have simple-sounding goals, their detailed semantics, exceptional behavior, and lossiness mean that their data are not very reliable.

Timestamp granularity

A file or directory on a file system has various properties such as the last modification timestamp. Every timestamp has an inherent resolution/precision/granularity such as 1 second or 1 microsecond.

NTFS: 100 nanoseconds for creation, modification, access
FAT: 10 milliseconds for creation, 2 seconds for modification, 1 day for access
ext2, ext3: 1 second for ctime, mtime, atime
ext4, Btrfs: 1 nanosecond for ctime, mtime, atime

When moving/copying files from one file system to another, essentially every piece of software will silently convert timestamps, even if it involves a loss of resolution. For example, a file on NTFS with a timestamp of 01:23:45.6789012 being moved to a FAT32 volume would have its timestamp rounded to 01:23:46. I am not sure about the timestamp granularity in archive formats like ZIP, RAR, 7-Zip, tar, etc., but there are surely more opportunities for mismatches.
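
The loss of resolution can be modeled as snapping a tick count to a coarser grid. In this Python sketch, times are counted in NTFS-style ticks of 100 nanoseconds since midnight for simplicity; the function names are invented, and whether a given tool rounds up or down is an assumption that should be verified for that tool (the FAT32 example above corresponds to rounding up).

```python
TICKS_PER_SECOND = 10_000_000  # one tick = 100 nanoseconds

def quantize(ticks: int, step: int, round_up: bool = False) -> int:
    """Snap a tick count to a multiple of step, rounding down or up."""
    if round_up:
        return -(-ticks // step) * step
    return (ticks // step) * step

def fmt(ticks: int) -> str:
    """Format a tick count since midnight as HH:MM:SS, dropping any fraction."""
    seconds = ticks // TICKS_PER_SECOND
    return "%02d:%02d:%02d" % (seconds // 3600, seconds // 60 % 60, seconds % 60)

# 01:23:45.6789012 expressed in ticks since midnight
ntfs_time = (1 * 3600 + 23 * 60 + 45) * TICKS_PER_SECOND + 6_789_012
fat_step = 2 * TICKS_PER_SECOND   # FAT modification time: 2 seconds
ext3_step = 1 * TICKS_PER_SECOND  # ext2/ext3: 1 second

print(fmt(quantize(ntfs_time, fat_step, round_up=True)))  # 01:23:46
print(fmt(quantize(ntfs_time, fat_step)))                 # 01:23:44
print(fmt(quantize(ntfs_time, ext3_step)))                # 01:23:45
```

Either way, the original fractional value cannot be recovered afterward, and a comparison-based sync tool may consider the two copies to have different timestamps.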

Time zone of timestamps

For files stored on a FAT file system, the timestamps are stored in terms of the local time of the machine that wrote the files. For files stored on NTFS and Unix file systems, timestamps are stored in UTC, independent of the machine’s time zone. When manipulating files under any timestamp scheme, there is no problem if the local time zone of the machine stays constant when creating, updating, and transferring files.

But when the machine’s time zone changes, data loss can occur. If the location experiences daylight saving time (DST), then each year when the clock springs forward there is an invalid hour in local time which does not exist, and when the clock falls back there is a duplicated hour where the time zone is ambiguous. Additionally, if the file timestamp was written in one time zone but the file is transferred from FAT to NTFS on a machine with a different time zone, then the translation from local time to UTC would be incorrect. On Windows, local timestamps are interpreted in the current time zone without regard to DST. So if a file was created in UTC−4, then half a year passes and DST changes the time zone to UTC−5, and the file gets transferred from FAT to NTFS, then the old file timestamp is interpreted as UTC−5, not UTC−4.
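
The size of the error is easy to reproduce. The following Python sketch uses a hypothetical stored value and fixed UTC offsets rather than a real time zone database; it interprets the same local timestamp once under the offset in effect when the file was written and once under the offset in effect at transfer time.

```python
from datetime import datetime, timedelta, timezone

def local_to_utc(naive_local: datetime, utc_offset_hours: int) -> datetime:
    """Interpret a zone-less local timestamp under a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return naive_local.replace(tzinfo=tz).astimezone(timezone.utc)

# Hypothetical FAT timestamp: local wall-clock time with no zone information.
stored = datetime(2020, 7, 1, 12, 0, 0)

true_utc = local_to_utc(stored, -4)  # offset when the file was written
misread = local_to_utc(stored, -5)   # offset when the file is transferred

print(true_utc.isoformat())  # 2020-07-01T16:00:00+00:00
print(misread.isoformat())   # 2020-07-01T17:00:00+00:00
print(misread - true_utc)    # 1:00:00
```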

Generally speaking for any information system (not limited to file systems), storing timestamps in local time easily results in data loss; it causes headaches if the goal is to achieve long-term data accuracy. Every serious system should be designed to work with UTC timestamps, with optional functions to display them in local time.

File creation timestamp

Windows has a notion of creation time for files, but Unix traditionally does not. Copying a file from, say, NTFS to ext3 will incur the loss of creation time metadata. Unix has a concept of ctime, which reflects the last time that a file’s inode metadata was changed, but I find it nearly useless compared to the concept of file creation time.

Robocopy tool

Robocopy (Wikipedia, Microsoft) is a command line program that copies files and folders with a stronger emphasis on fidelity than other tools. For example, robocopy can preserve timestamps, handle long paths, copy attributes, and provide fine control over which metadata fields are copied. By comparison, Windows Explorer will reset creation timestamps, skip files with long paths, and normalize some names – all of which can be unintentional data loss.

Application configuration

Almost every computer application stores some state that carries over to future invocations of the app. The easiest way to test this idea is that if you take your favorite app and install it on a new PC/phone/device, how does its behavior differ from the main copy of the app that you are accustomed to using, and how many configuration knobs do you need to turn?

Examples: Windows Explorer lets you configure how the icons of each folder are sorted and displayed. A music player program has a user-created playlist that persists even after you restart the program. A word processor has a dictionary of custom words that you added. A mail program has a local cache of email messages downloaded from a server. A photo editor maintains a library/catalog of all the image files that you imported. A web browser stores your bookmarks and browsing history.

App config can be stored in a variety of places, and these usually aren’t documented clearly to the end user. On Windows, an app could store its configuration data somewhere in the current user’s profile folder directly, in the AppData folder of the user’s profile, in the Windows Registry, at an online server, in the program directory (bad practice), or in the Windows system directory (very bad practice). On Unix, a program could store its configuration in the user’s home directory, in /var, in /etc, or possibly other places.

Truncated download

When transferring a file over a network, premature termination could result in one of several behaviors: a truncated file is saved and the user must manually resume the transfer, the partial file is deleted, or the transfer program automatically tries to resume the download. For example, Windows Explorer deletes partial files when copying between local drives or SMB servers. By contrast, FTP and old HTTP (without the Content-Length header) data transfers can result in silently truncated files, which only the user can detect.

Alternate data streams / forks

A file has a primary stream of bytes, which is addressed by the main path used to access the file. But it can have other named data streams on the side. All these streams move together when a file is renamed, moved, or copied. But the mechanisms of alternate data streams / resource forks / extended attributes are specific to each operating system (Windows vs. Mac vs. Unix) and file system, and generally transfer poorly between different systems. For example, a file’s non-primary data streams are not carried by protocols such as HTTP or FTP, nor by old file systems like FAT (unlike NTFS).

File permissions

On Windows NTFS, every file and directory can have an access control list with many users and many security attributes. On Unix, every file/dir has either a dozen permission bits and a user and a group, or a full ACL like Windows. Many tools for transferring or archiving files/folders discard permission information. But even if the file security information is retained, it is almost certainly meaningless when it is blindly reproduced on a different computer system.

File sparseness

Most file systems support the concept of a file where most of the data is zero but some parts are filled in, and only those parts are stored on disk. For example, it is possible to create a logical 100 GB file where a couple of 1 MB blocks have real data and the rest is zero, and the file system will only consume a few megabytes of actual disk space to represent the file. If a volume has many sparse files, the total logical size could far exceed the real size of the volume. It can be assumed that backup tools and file transfer tools do not archive or restore sparse files accurately, and end up storing the entire logical file contents literally and consuming massive amounts of space.
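
On Unix-like systems, a file's logical size can be compared with its allocated size to spot likely sparse files before a backup. The Python sketch below relies on the st_blocks field, which is counted in 512-byte units on Linux and is not available on Windows; the function names are invented, and the heuristic can misfire on file systems that compress data or store small files inline.

```python
import os

def allocated_bytes(path: str) -> int:
    """Disk space actually allocated to the file (Unix only)."""
    return os.stat(path).st_blocks * 512

def looks_sparse(logical_size: int, allocated: int) -> bool:
    """Heuristic: less space is allocated than the file's logical length."""
    return allocated < logical_size

def sparse_report(path: str) -> tuple[int, int, bool]:
    """Return (logical size, allocated size, sparse guess) for one file."""
    logical = os.stat(path).st_size
    allocated = allocated_bytes(path)
    return logical, allocated, looks_sparse(logical, allocated)
```

For the example above, a logical 100 GB file with a couple of megabytes allocated would be flagged, which is a hint to use a copy tool with explicit sparse-file support.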

Line separators in text files

There are at least three conventions for line separators or terminators in text files: LF (Unix, Linux, Mac OS X), CR+LF (Windows, most Internet protocols), CR (Mac OS 9 and below). Mac OS 9 and CR are just a historical note and there are essentially no such text files in the wild today. LF and CR+LF are both very popular, so care is needed to consciously choose a convention. Haphazard editing of a file can result in an unpleasant mixture of LF and CR+LF line separator sequences. Careless editing of a project can leave some files with LF and some files with CR+LF. I advocate for LF instead of CR+LF because it is easier to parse, because CR+LF confers no additional meaning, and because Unix diff tools work natively with LF and could show noisy control codes if CR characters are present.
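
A mixture of conventions within one file is easy to detect by counting byte sequences. This Python sketch uses hypothetical function names and works on raw bytes, so it assumes an ASCII-compatible encoding such as UTF-8.

```python
def count_line_endings(data: bytes) -> dict[str, int]:
    """Count CR+LF pairs, lone LF bytes, and lone CR bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}

def has_mixed_line_endings(data: bytes) -> bool:
    """True if more than one line separator convention appears in the data."""
    counts = count_line_endings(data)
    return sum(1 for n in counts.values() if n > 0) > 1

print(count_line_endings(b"a\r\nb\nc\rd\r\n"))
# prints: {'CRLF': 2, 'LF': 1, 'CR': 1}
```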

File hard links

On a file system, it is possible to have multiple file paths pointing to the same piece of mutable file data; each path is called a hard link. But storing these hard links in archives or transferring them to another file system can be difficult. If an immutable file with multiple hard links is transferred naively, then this results in a waste of storage space. But if a mutable file with multiple hard links is transferred without preserving links, then this subtly changes the semantics because now each copy of the file can be changed independently and the changes cannot be seen at the other file paths.
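
Before archiving a tree, it can be useful to list which paths are hard links to the same data. In this Python sketch (the function name is made up), files are grouped by the pair of device number and inode number, which identifies the underlying file on Unix-like systems; links that point outside the scanned tree are not reported.

```python
import os
from collections import defaultdict

def find_hard_link_groups(root: str) -> list[list[str]]:
    """Return groups of paths under root that refer to the same underlying file."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            if st.st_nlink > 1:  # more than one path points at this file
                groups[(st.st_dev, st.st_ino)].append(full)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```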

CD audio

CD is a digital data storage format, and one would reasonably expect to be able to make perfect copies of audio CDs. It is easy to make a copy that sounds flawless to a human ear. But unlike computer files, it is tricky and unreliable to copy CD audio without changing or losing any bits.

One problem is that when ripping a CD audio track, the disc drive will return the audio data with some arbitrary time offset, depending on the hardware model. For example, if the computer requests the drive to read the range of audio samples [0, 588) (which corresponds to sector 0, or 2352 bytes of 16-bit stereo audio), the drive might return the data for samples [100, 688) instead. This problem is due to sloppiness in how drives behave, and is not inherent to the CD audio format.

Another problem is that a drive cannot be expected to read an audio stream perfectly without interruption, and the drive could skip data or read data from the wrong location. Programs like cdparanoia and Exact Audio Copy try to overcome this by rereading a CD many times, often in small segments, and piecing together a coherent track of audio by overlapping and comparing these segments. But it seems that at the end of any song as the audio fades out to absolute digital silence, it is difficult to correlate different pieces of audio to get an accurate reconstruction. This problem stems from the fact that individual CD audio frames do not have timecode information, and even CD sectors only have optional timecodes. So if a drive loses synchronization due to a disc defect or because the host didn’t ask the drive to continue reading, then the drive cannot guarantee an accurate seek back to the exact location at the end of the previous read.

CD metadata

In addition to storing raw PCM audio samples, the CD audio format has 8 subcode channels that each store 7200 bits of data per second of audio. Of particular interest is the Q channel, which contains metadata about the audio stream and timestamps (for accurate seeking). Although these subcode channels exist on every audio CD and are guaranteed by the standard, it seems that the hardware and software support for reading and archiving these subchannels’ data is rarely discussed and generally not implemented.

DV tape recording

DV is a lossy intraframe video codec, whose data can be stored as a file on a computer (e.g. wrapped in a video container format like AVI), or on a variety of tape-based formats (MiniDV, DVCAM, etc.). Working with DV video data on computers is straightforward, and frame-accurate edits can be made with ease.

Working with DV on tape is messy because consumer-grade camcorders have imprecise splicing. Suppose that a new video clip is recorded over the middle of an old clip. We cannot guarantee how many frames at the beginning of the new clip will be corrupted or truncated, nor can we guarantee at what point the old clip becomes corrupted or fully overwritten. Note that truncated DV frames look like blurry low-resolution previews, so at a glance they can be confused for full uncorrupted frames. Also note that the frames near a splice can read out differently over multiple attempts, which demonstrates a lack of consistency. By contrast, computer files suffer from none of these messy data-losing behaviors.

Physical media defects

In the old days when computer games and software were sold on physical media like floppy disks and CDs, some publishers deliberately created defects on the disk/disc as a form of copy protection. For example, a disc could have a sector with an intentionally failing checksum/ECC; the program would verify that the sector is unreadable, while the corrupt sector would generally make a copier program abort because it can’t read the entire disc. As another example, a floppy disk could have physical holes in it so that it is impossible to write to those sector locations; this contrasts with a normal floppy, where all sectors are writable. These low-level defects are difficult to reproduce on consumer-grade hardware.

An example of efforts toward low-level archival is the Applesauce project. Their floppy drive can read raw magnetic signals, which can cope with deliberately malformed bits that a normal drive could reject or misbehave on. Also, it seems their disk drive can make a physically accurate geometric map of where the bits are stored, instead of normal drives which only return the logical data structure.

ASCII mode file transfers

Some protocols like FTP can transfer a file in ASCII mode or binary mode. It never hurts to transfer a file in binary mode because all bytes are preserved literally, but for text files it is sometimes desirable to convert the line separators after the file transfer. Transferring a file in ASCII mode will convert the line separators to the native sequence on the receiving system, but this action noticeably corrupts binary files.
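
The corruption is easy to demonstrate. The Python sketch below is a deliberately naive simulation of a line-separator conversion toward a Windows receiver, not the behavior of any specific FTP client; it is applied to the 8-byte PNG file signature, which contains both a CR+LF pair and a lone LF, so that this kind of damage shows up in the very first bytes.

```python
def simulate_ascii_mode_to_windows(data: bytes) -> bytes:
    """Naively rewrite every LF byte as CR+LF, as a text-mode transfer might."""
    return data.replace(b"\n", b"\r\n")

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

damaged = simulate_ascii_mode_to_windows(PNG_SIGNATURE)
print(len(PNG_SIGNATURE), len(damaged))  # 8 10
print(damaged == PNG_SIGNATURE)          # False
```

The same conversion applied to compressed or encrypted data inserts a byte at every position that happens to hold the value 0x0A, so the damage is scattered throughout the file.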

Git branches

When making a commit in the Git version control system, the branch name is not recorded in the commit. As branch pointers move and as branches merge, there is no preserved history of which commits took place on which branch. This behavior probably differs from a user’s mental model of how a branch should work.

Well-known ways to lose data

These phenomena are less interesting, but they are listed for the sake of completeness:

Hardware, firmware, and software design errors (a.k.a. bugs)
Storage error (corruption in RAM, flash memory, magnetic disk, optical disc)
Transmission error (radio noise, truncated transmission, protocol error, malicious alteration)
Computation error (cosmic ray, electrical interference, overclocking, defective logic gate)
Wikipedia: Data corruption
User error (accidental deletion, running wrong tool, incomplete understanding of tools/system/data, interpreting data or feedback incorrectly)
Disasters and accidents (fire, flood, earthquake, power surge, truck strike)
Lossy multimedia compression (e.g. MP3 audio, JPEG images, MPEG videos, etc.)
Incorrect character encoding conversion (a.k.a. mojibake)
Online service shutdown (e.g. Dropbox file storage, Facebook social media)
Obsolescence of media formats (e.g. floppy disks) or file formats (e.g. from proprietary software)
Undetected corruption in primary data copy, then copying it to a backup copy and deleting old backups
Malware, viruses, encryption ransomware
Forgetting cryptographic keys

Recommendations

I have some high-level tips to offer on the topic of avoiding subtle data loss. These points are not specific to any software or brand, but are general principles that should withstand the test of time.

Avoid non-file storage

Computer-based files are simple to understand and essentially universal. Avoid storing precious data in less universal formats such as CD audio, digital video tape, etc. Also avoid storing very important data in format-specific places – such as metadata fields (title, author, date, etc.) that can be supplied when burning a data CD.

Use simple semantics for files

At the most basic level, a file is an unnamed finite sequence of bytes. Everything beyond that is incrementally adding complexity: file names, directories, paths, attributes (such as author and title), timestamps, permissions (such as read-only), compression, encryption, hard links, symlinks, etc.

Define a clear data model

The shorter and clearer a data model is, the easier it is to understand and reason about. A data model that is designed to reduce data loss will define what each field means, what values are and aren’t allowed, how programs should interpret data, and specify every edge case.

As an anti-example, the Windows model of files is very complicated; many people do not know about behaviors such as reserved file names like NUL and CON, file name length limits, timestamp granularity, time zones, alternate data streams, etc. Storing or manipulating data that the user is not aware of is a recipe for unintended data loss, which the user discovers only after eventually understanding why that piece of data is important and useful.

Use cryptographic hashes

Putting files and folders into a cryptographic hash tree is a proven way to uphold strong file integrity. If any file is added, modified, or deleted, then the root hash is essentially guaranteed to change and be noticeable. This sets a very high bar for integrity because every data byte that goes into the hash computation must be well-defined and visible, and it is impossible to change the data silently. The best and most popular example of this model is the Git version control system. It is not hard to write code to compute and verify the hash of every Git blob, tree, and commit. Hence it is possible to verify that the actual stored data exactly matches what the hash claims to cover.
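
As a small demonstration of how little code is needed, the following Python sketch computes the object ID of a Git blob in the classic SHA-1 object format: the hash covers a short header ("blob", a space, the decimal content length, and a NUL byte) followed by the raw file contents. Trees and commits follow the same header scheme with a different type name and body, which is not shown here; repositories using the newer SHA-256 object format would need a different hash function.

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """Return the SHA-1 object ID that Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known ID of the empty blob:
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The result can be compared against the output of git hash-object for the same file to confirm that the stored data matches what the hash claims to cover.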