Avoiding data corruption in backups

Integrity Check

© Lead Image © Maksym Yemelyanov, 123RF.com

Article from Issue 258/2022

A backup policy can protect your data from malware attacks and system crashes, but first you need to ensure that you are backing up uncorrupted data.

Most home users, and I dare say some system administrators, lack a backup policy. Their family pictures, music collections, and customer data files live on their hard drives and are never backed up to offline storage, only to be lost when the hard drive eventually crashes. A few users know to keep backups and regularly copy their files over to a safe storage medium. But even these conscientious people may find their strategy lacking when the time comes to recover from a system crash and they discover corrupted backup data. A successful backup strategy must involve checking for corrupted data.

Silent Data Corruption

The small number of users who do keep copies of their important files often keep only a single backup. Typically, they use an external storage system, such as Tarsnap or a Nextcloud instance, periodically or continuously synchronizing the important files on their computers with the cloud. While comfortable for end users, a single backup suffers from a number of problems. Most importantly, single backups are vulnerable to silent data corruption.

Take, for example, a folder called Foals, which is full of pictures of happy young horses. My backup strategy consists of copying the entire folder to USB mass storage once a week with a tool such as rsync [1]:

$ rsync -a --delete --checksum Foals/ /path/to/usb/

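The rsync tool synchronizes the contents of /path/to/usb with the contents of Foals. The -a switch copies the folder recursively and preserves file metadata, --delete removes files from the USB copy that no longer exist in Foals, and --checksum makes rsync compare files by checksum instead of by size and modification time.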

This strategy works until one of the pictures in Foals gets corrupted. Files get damaged for a number of reasons, such as a filesystem failing to recover properly after an unclean shutdown. Files also may be lost because of human error – you intend to delete Foals/10_foal.jpg but end up removing Foals/01_foal.jpg instead without realizing the mistake. If a file gets corrupted or lost and you don't detect the issue before the next backup cycle, rsync will overwrite the good copy in USB storage with bad data. At this point, all the good copies of the data cease to exist, destroyed by the very backup system intended to protect them.

To mitigate this threat, you can establish a long-term storage policy for backups that involves saving your backup to a different folder on the USB mass storage each week. I could therefore keep a current backup of Foals in a folder called Foals_2022-01-30, an older backup in Foals_2022-01-23, and so on. When the backup storage becomes full, I could just delete the oldest folders to make room for newer ones. With this strategy, if data corruption happens and it takes me a week to discover it, I may be able to dig up good copies of the files from an older snapshot (Figure 1). See the boxout "The rsync Time Machine" for instructions on how to set up this multi-week backup system.

Figure 1: A damaged image in the Foals folder results in corrupted data in the current backup directory (Foals_2022-01-30). Luckily, an undamaged version can be retrieved from an earlier backup (Foals_2022-01-23).

The rsync Time Machine

With rsync, you can save backups to a directly attached drive or over a network. As an added convenience, the snapshot of the folder that rsync takes does not take much space on your storage device.

Suppose I have an external drive mounted under /mnt. The first snapshot would be saved with a regular invocation of rsync:

$ mkdir /mnt/Foals_2022-01-23
$ rsync -a Foals/ /mnt/Foals_2022-01-23

The first command creates a directory with a name reflecting the date. The second command copies Foals to the newly created directory. The -a switch instructs rsync to work in "archive" mode, recursively descending into subdirectories and preserving symlinks, time metadata, file permissions, and file ownership data.

When the time comes to make another weekly backup, I create a different backup folder (which references the new current date) and copy Foals to it. However, rsync has a trick up its sleeve: The --link-dest switch tells rsync to transfer only the changes since the last backup:

$ mkdir /mnt/Foals_2022-01-30
$ rsync -a --link-dest /mnt/Foals_2022-01-23 Foals/ /mnt/Foals_2022-01-30

As a result, rsync copies any new file to the new backup directory, alongside any file that has been modified since the last backup. Files that have been deleted from the source directory are not copied. For files that exist in the source directory but have not been modified since the last backup, rsync creates a hard link to the unmodified files' respective copies in the old backup directory rather than copying them to the new backup directory.
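To check that unchanged files really are hard links rather than second copies, you can compare the two snapshots file by file. The following is a minimal sketch: The function name report_linked is my own invention, and it assumes both snapshot paths are absolute and that your shell's test command supports the -ef operator (Bash and Dash do).

```shell
# report_linked OLD_SNAPSHOT NEW_SNAPSHOT   (hypothetical helper name)
# For every regular file in NEW_SNAPSHOT, print "linked" if it is the same
# inode as the file with the same path in OLD_SNAPSHOT, otherwise "copied".
# Assumption: OLD_SNAPSHOT is an absolute path, because the function
# changes directory before comparing.
report_linked() (
    old=$1; new=$2
    cd "$new" || exit 1
    find . -type f | while IFS= read -r f; do
        # -ef is true when both names refer to the same device and inode
        if [ "$f" -ef "$old/$f" ]; then
            echo "linked  $f"
        else
            echo "copied  $f"
        fi
    done
)
```

Calling report_linked /mnt/Foals_2022-01-23 /mnt/Foals_2022-01-30 should list unchanged pictures as linked and new or modified ones as copied.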

The end result is that Foals_2022-01-23 contains a copy of Foals as it was on that date, while Foals_2022-01-30 contains a current snapshot of Foals. Because only modified or new files are added to the storage medium, the new snapshot barely takes up any extra space. Everything else is included in the new backup folder via hard links.
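The weekly routine described in this box can be wrapped in a small script so that you do not have to type the dates by hand. What follows is only a sketch under a few assumptions: The function names latest_snapshot and make_snapshot are hypothetical, snapshot folders follow the PREFIX_YYYY-MM-DD naming used above, the backup root is given as an absolute path, and date -I is available (it is part of GNU coreutils).

```shell
# latest_snapshot ROOT PREFIX   (hypothetical helper name)
# Print the newest PREFIX_YYYY-MM-DD directory under ROOT, or nothing.
# ISO dates sort chronologically when sorted as plain text.
latest_snapshot() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name "${2}_????-??-??" \
        | sort | tail -n 1
}

# make_snapshot SOURCE ROOT PREFIX   (hypothetical helper name)
# Create today's snapshot of SOURCE under ROOT, hard-linking unchanged
# files to the previous snapshot if one exists.
# Assumption: ROOT is an absolute path, so --link-dest is unambiguous.
make_snapshot() (
    src=$1; root=$2; prefix=$3
    previous=$(latest_snapshot "$root" "$prefix")
    target="$root/${prefix}_$(date -I)"
    mkdir -p "$target" || exit 1
    if [ -n "$previous" ] && [ "$previous" != "$target" ]; then
        rsync -a --link-dest="$previous" "$src/" "$target"
    else
        # First snapshot (or a rerun on the same day): plain copy
        rsync -a "$src/" "$target"
    fi
)
```

With the drive mounted under /mnt, a weekly run would then be make_snapshot Foals /mnt Foals.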

Unfortunately, long-term storage only works if the data corruption is discovered in time. If your storage medium only has room for four snapshots, a particular version of a file will only exist in the backup for four weeks. In the fifth week, the oldest snapshot will be deleted to make room for new copies. If the data corruption is not detected within this time window, the good copies of the data will be gone, and you will no longer be able to retrieve them from a backup.
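Deleting the oldest snapshots can also be scripted. The sketch below keeps a fixed number of snapshots and removes the rest, oldest first. The function name prune_snapshots is hypothetical, and the code assumes the PREFIX_YYYY-MM-DD folder naming used in this article. Because it calls rm -rf, try it on a scratch directory before pointing it at real backups.

```shell
# prune_snapshots ROOT PREFIX KEEP   (hypothetical helper name)
# Delete the oldest PREFIX_YYYY-MM-DD directories under ROOT until only
# KEEP of them remain. Does nothing if there are KEEP or fewer.
prune_snapshots() (
    root=$1; prefix=$2; keep=$3
    list_snapshots() {
        # Oldest first: ISO dates sort chronologically as plain text
        find "$root" -mindepth 1 -maxdepth 1 -type d \
            -name "${prefix}_????-??-??" | sort
    }
    total=$(list_snapshots | wc -l)
    excess=$((total - keep))
    if [ "$excess" -gt 0 ]; then
        list_snapshots | head -n "$excess" | while IFS= read -r dir; do
            echo "Removing $dir"
            rm -rf -- "$dir"
        done
    fi
)
```

For the four-week window described above, the call would be prune_snapshots /mnt Foals 4. Keep in mind that pruning is exactly what closes the recovery window, so it is worth checking your data before the oldest good copy disappears.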

Solving for Silent Data Corruption

The first step in guaranteeing a good backup is to verify that you are backing up only uncorrupted data, which is easier said than done. Fortunately, a number of tools exist to help you preserve your data integrity.

Filesystems with checksum support (such as ZFS) offer a reasonable degree of protection against corruption derived from hardware errors. A checksum function takes data, such as a message or a file, and generates a string of text from it. As long as the function is passed the same data, it will generate the same string. If the data gets corrupted in the slightest, the generated string will be different.
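You can observe this property from the command line with md5sum, which is part of GNU coreutils. In the following sketch, the helper name checksum_of and the sample file names are made up for illustration: Two files with identical content produce the same checksum, while a file that differs by a single letter does not.

```shell
# checksum_of FILE   (hypothetical helper name)
# Print only the MD5 hash of FILE, without the file name column.
checksum_of() {
    md5sum < "$1" | cut -d ' ' -f 1
}

workdir=$(mktemp -d)
printf 'happy foal\n' > "$workdir/original.txt"
printf 'happy foal\n' > "$workdir/copy.txt"
printf 'happy goal\n' > "$workdir/damaged.txt"   # one letter changed

# Same data, same checksum
[ "$(checksum_of "$workdir/original.txt")" = "$(checksum_of "$workdir/copy.txt")" ] \
    && echo "copy matches original"

# Slightly different data, different checksum
[ "$(checksum_of "$workdir/original.txt")" != "$(checksum_of "$workdir/damaged.txt")" ] \
    && echo "damaged differs from original"

rm -rf "$workdir"
```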

ZFS [2], in particular, can verify whether a data block is correct upon reading it. If it is not (e.g., as a result of a hard drive defect), ZFS either repairs the data block from a redundant copy, if one is available, or throws an error for the user to see.

However, ZFS cannot protect data against human error: If you delete a file by accident with

rm Foals/01_foal.jpg

ZFS has no way of knowing whether this is a mistake or a legitimate operation. Likewise, if a buggy image editor damages the picture using valid system calls, ZFS cannot differentiate changes caused by software bugs from changes intended by the user. While ZFS is often praised as the ultimate guarantee for data integrity, its impressive capabilities fall short in my opinion.

Protection from Userspace

To verify that the data being backed up is correct, I suggest relying on userspace utilities. While many userspace programs are superb at locating damaged files, they are not easily executable from an arbitrary recovery environment. In a system crash scenario, you may find yourself using something like an obsolete SystemRescue DVD (perhaps from an old Linux Magazine) instead of your normal platform. In keeping with the KISS principle, you should choose userspace tools that are portable and easy to use from any platform.

If your distribution includes the GNU coreutils package (which the vast majority do), you need no fancy tooling.

Ideally, you should verify the files' integrity immediately before the backup is performed. The simplest way of ensuring a given file has not been modified, accidentally or otherwise, is to calculate its checksum and compare the result with the checksum it produced in a known good state (Figure 2). Thus, the first step towards protecting a given folder against corruption is to calculate the checksum of every file in the folder:

$ cd Foals
$ find . -type f ! -name '*.md5' -print0 | xargs -0 md5sum | sort -k 2 > md5sums_`date -I`.md5

Figure 2: Checksums can be used to locate files that have been modified, intentionally or otherwise.

(See the "Creating a Checksum" box for a more detailed explanation.)

Creating a Checksum

Calculating a checksum is not intuitive, so I will break down the command and explain how it works its magic.

The find command locates any file (but not directories) in the current folder and its subfolders, excluding files with the .md5 extension. It prints a list of the found files to the standard output. The path of each file is null-terminated in order to avoid security issues (which could arise from piping paths with special characters into the next command):

find . -type f ! -name '*.md5' -print0

Then xargs just accepts the list provided by the find command and passes it to the md5sum program, which generates a checksum for every entry in the list. The -0 switch tells xargs that find is passing null-terminated paths to it:

xargs -0 md5sum

The sort command orders the list (because find is not guaranteed to deliver sorted results). The output of md5sum has two columns: The first column contains the checksum of each file; the second contains its path. Therefore, I pass the -k 2 switch to sort in order to sort the list using the path names as the criterion:

sort -k 2

Together, these commands create a list of all the files in the Foals directory, alongside their MD5 checksums, and place it under Foals. The file will have a name dependent on the current date (such as md5sums_2022-01-23.md5).

If a week later I want to verify that the files are fine, I can issue the same command to generate a new list. Then, it would be easy to check the differences between the state of the Foals folder on the previous date and the state of the Foals folder on the current date with the following command:

$ diff md5sums_2022-01-23.md5 md5sums_2022-01-30.md5

The diff command generates a list of differences between the two files, which will make it easy to spot which files have been changed, added, or removed from Foals (Figure 3). If a file has been damaged, this command will expose the difference.

Figure 3: The diff utility will compare the checksum files, but the output will be messy if there have been multiple changes between the current and the last backups.

Using diff is only practical if the dataset is small. If you are backing up a large number of files, there are better ways to check that your data is not corrupted. For instance, you can use grep to list the entries that exist in the old checksum file but not in the new one. In other words, grep will list the files that have been modified or removed since the last time you performed a check:

$ grep -Fvf md5sums_2022-01-30.md5 md5sums_2022-01-23.md5

The -f md5sums_2022-01-30.md5 option instructs grep to treat every line of md5sums_2022-01-30.md5 as a target pattern. Any line in md5sums_2022-01-23.md5 that contains any of these patterns will be regarded as a match. The -F option forces grep to treat the patterns as fixed strings instead of as regular expressions, so characters such as the dots in file names are matched literally. Finally, -v inverts the matching: Only lines from md5sums_2022-01-23.md5 that match no pattern will be printed.

You can also list the files that have been added since the check was last run with the shell magic in Listing 1.

Listing 1

Newly Added Files

# Paths start at column 35 of md5sum output (32-digit hash plus two separator characters)
cut -c 35- md5sums_2022-01-30.md5 | while IFS= read -r file; do
    if ! grep -qF -- "$file" md5sums_2022-01-23.md5; then
      echo "$file is new."
    fi
done

With these tools, an integrity verification policy falls into place. In order to ensure you don't populate your backups with corrupted files, you must do the following:

  • Generate a list of the files in the dataset and their checksums before initiating the backup.
  • Verify this list against the list you generated at the last known good state.
  • Identify which changes have happened between the last known good state and the current state, and check if they suggest data corruption.
  • If the data is good, back up your files.
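These steps could be tied together roughly as follows. This is a sketch, not a finished backup script: The function names checksum_list and changed_since are hypothetical, and the code only tells you whether any entry from the last known good list has been modified or removed. Deciding whether such a change is corruption or an intended edit is still up to you.

```shell
# checksum_list DIR   (hypothetical helper name)
# Print "hash  path" lines for every regular file in DIR except *.md5 lists.
checksum_list() (
    cd "$1" || exit 1
    # -r (GNU xargs) skips md5sum entirely if find returns nothing
    find . -type f ! -name '*.md5' -print0 | xargs -0 -r md5sum | sort -k 2
)

# changed_since LAST_GOOD_LIST DIR   (hypothetical helper name)
# Print the entries of LAST_GOOD_LIST that are modified or missing in DIR.
# Returns 0 if there are none, 1 if there are some, 2 if grep itself failed.
changed_since() {
    current=$(mktemp) || return 2
    checksum_list "$2" > "$current"
    rc=0
    # -x requires whole-line matches, so only identical hash/path pairs count
    grep -Fvxf "$current" "$1" || rc=$?
    rm -f "$current"
    case $rc in
        0) return 1 ;;  # grep printed entries: something changed or vanished
        1) return 0 ;;  # nothing printed: every known good entry is intact
        *) return 2 ;;  # grep could not run, e.g., unreadable list
    esac
}
```

A backup could then be gated on the result, for example with changed_since Foals/md5sums_2022-01-23.md5 Foals followed by && and the rsync command of your choice. Newly added files do not block the backup, because they cannot overwrite a good copy.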

A great advantage of this method is that the checksum files can be used to verify the integrity of the backups themselves. For example, if you dumped the backup to /mnt/Foals_2022-01-23, you could just use a command such as:

$ cd /mnt/Foals_2022-01-23
$ md5sum --quiet -c md5sums_2022-01-23.md5

If any file was missing from the backup or had been modified, this command would reveal the issue right away.
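If you keep several snapshots, the same check can be wrapped in a small function. The sketch below assumes that the checksum lists were copied along with the pictures (as they are when they live under Foals) and that the newest md5sums_YYYY-MM-DD.md5 list in a snapshot is the one describing it. The function name verify_snapshot is hypothetical.

```shell
# verify_snapshot DIR   (hypothetical helper name)
# Verify the files in DIR against the newest md5sums_*.md5 list found there.
# Returns non-zero if no list exists or if any file is missing or altered.
verify_snapshot() (
    cd "$1" || exit 1
    list=
    # Globs expand in sorted order, so the last match has the latest date
    for candidate in md5sums_*.md5; do
        [ -e "$candidate" ] && list=$candidate
    done
    if [ -z "$list" ]; then
        echo "No checksum list found in $1" >&2
        exit 1
    fi
    # --quiet prints only the files that fail verification
    md5sum --quiet -c "$list"
)
```

A loop such as for dir in /mnt/Foals_*; do verify_snapshot "$dir" || echo "$dir needs attention"; done would then check every snapshot on the drive.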
