Your NAS isn't enough – you still need to back up your data!
Not All NAS
Some users trust their data to powerful file servers that advertise enterprise data protection, but your Network Attached Storage system might not be as safe as you think it is.
There is a point in the life of a compulsive data hoarder when a regular computer is not enough to contain a burgeoning file collection. As the collection relentlessly grows, the first step a home user takes to extend storage capacity is to purchase an external USB hard drive. The hard drive will buy the user some time, but eventually this solution will fall short. A data hoarder who is dedicated enough will eventually have to invest in a Network Attached Storage (NAS) server.
A NAS is a dedicated server optimized to store large amounts of information. NAS servers are commonly available as commercial appliances, but many power users prefer to build their own from spare parts. Serious NAS servers are scalable and allow you to increase their capacity by adding hard drives as needed. Better yet, they often offer enterprise features that come in very handy, and they promise mitigations for the most common threats to the long-term survival of your files.
NAS vendors often advertise fault tolerance and profess the immunity of their systems from disaster, which causes users to treat this sort of storage as bulletproof, dumping their data and then skipping the step of making backups. But rarely do these consumer-grade storage systems provide a complete solution. This article describes some of the things that can go wrong – and why you still need to perform backups to ensure that your data is safe.
The Features of a Quality NAS
A wide range of NAS options is available for home users. These options vary in quality from desktop toys to quasi-enterprise systems trying to pass as domestic appliances (Figure 1).
With the exception of the low-end models, NAS boxes are designed to offer the highest possible availability. In this context, a high-availability machine is one that can keep serving its users under adverse conditions. Such a server needs to be able to keep functioning if a hard drive fails, if the power grid blacks out, or if its power supply malfunctions.
Servers mitigate hard drive failures by using a Redundant Array of Independent Disks (RAID). A RAID group is just a set of hard drives that the operating system recognizes as a single virtual drive. (See the box entitled "Popular RAID Levels" for more information on some common RAID scenarios.) In a domestic NAS context, these drives will most often be grouped in the so-called RAID 5 level. RAID 5 distributes the data within the array evenly across every device, with some extra parity components. Should one of the drives fail, the server will keep functioning in a degraded state by keeping the remaining drives running and using the parity data to reconstruct lost information.
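The parity idea can be illustrated with a few lines of Python. This is only a minimal sketch, assuming one small block per drive and plain XOR parity; the block contents and the `xor_blocks` helper are made up for the example, and a real RAID 5 implementation also rotates the parity blocks across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks (one per drive) plus one parity block.
data_blocks = [b"NAS-", b"DATA", b"1234"]
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks and the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block. The same property explains why a second simultaneous failure is fatal: with two blocks missing, the single parity equation no longer has a unique solution.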
Popular RAID Levels
RAIDs can be built in multiple ways, depending on the purpose they serve. The most popular traditional RAID levels are:
- RAID 0 stripes data across all the drives in the set for increased performance (Figure 2). The total capacity of the array is the sum of the capacities of the individual drives. A single disk failure kills the array, making it a dangerous RAID level to use. RAID 0 has better read and write throughput than a single hard drive of the same size as the array, because the workload is evenly distributed over the individual drives in the RAID.
- RAID 1 mirrors the data across all the drives in the array (Figure 3). Since every drive has a full copy of all the data, a RAID 1 can keep working as long as one of its drives is still operational. RAID 1 is good for maintaining uptime, but it is not very cost effective, because it takes at least twice as many drives for the same storage capacity.
- RAID 5 is among the most popular in small deployments. This form of RAID is known as disk striping with parity. The disks are striped (as with RAID 0), but the equivalent of one drive's capacity is set aside for parity data, which is distributed across all the drives and ensures that the array can keep working if one of the drives fails (Figure 4). RAID 6 does much the same thing, except that it stores twice as much parity data and can keep working after two hard drive failures.
- RAID 10 is a combination of RAID 0 and RAID 1. Drives are deployed in pairs in which each unit mirrors the other. Then all the pairs are placed in a RAID 0 (Figure 5). RAID 10 can keep functioning as long as at least one drive in each pair is in working order.
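The trade-off between capacity and fault tolerance in the levels above can be put into numbers. The following Python sketch assumes identical drives and uses a hypothetical `usable_capacity` helper; it reports the usable space and the number of drive failures each level is guaranteed to survive in the worst case.

```python
def usable_capacity(level, drives, size_tb):
    """Return (usable TB, drive failures survived in the worst case)."""
    if level == 0:
        return drives * size_tb, 0
    if level == 1:
        return size_tb, drives - 1
    if level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (drives - 1) * size_tb, 1
    if level == 6:
        if drives < 4:
            raise ValueError("RAID 6 needs at least four drives")
        return (drives - 2) * size_tb, 2
    if level == 10:
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least four")
        # Worst case: the second failure hits the partner of the first failed drive.
        return drives // 2 * size_tb, 1
    raise ValueError(f"unsupported RAID level: {level}")


# Example: four 4TB drives in each layout that supports four drives.
for level in (0, 5, 6, 10):
    capacity, failures = usable_capacity(level, 4, 4)
    print(f"RAID {level}: {capacity}TB usable, survives {failures} failure(s)")
```

Note that the RAID 10 figure is deliberately pessimistic: the array may survive more failures as long as no pair loses both of its drives.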
A server can survive blackouts by using an Uninterruptible Power Supply (UPS), which is just a fancy term for a battery that kicks in when the power grid goes down (Figure 6). A modern UPS can communicate with the server over USB or Ethernet to let the operating system know how much power is left in the battery, which is useful for forcing the machine to shut down in an orderly way when the supply is about to run dry.
About ECC
Good NAS hardware will often feature Error Correction Code (ECC) RAM. ECC RAM is capable of checking itself for consistency against random errors in memory, which are more frequent than you might think [1]. RAM errors are considered dangerous for the survival of a dataset and the continued operation of a server. A botched bit in RAM could cause the operating system to malfunction or cause a file to get corrupted. ECC is intended to reduce the risk of such an event and keep the system running after a memory error.
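To get a feel for how an error-correcting code can repair a flipped bit, consider the toy example below. It is a sketch of the classic Hamming(7,4) code, which protects four data bits with three parity bits; it is used here purely for illustration and is much smaller than the codes found in real ECC memory modules. The function names are made up for the example.

```python
def encode(data_bits):
    """Encode four data bits into a seven-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    w = list(codeword)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit
    if position:
        w[position - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], position


stored = encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a random bit flip in memory
recovered, position = decode(stored)
print(recovered, "error at position", position)
```

The three parity checks together spell out the position of the damaged bit, so the decoder can flip it back without knowing the original data. A code like this corrects only a single flipped bit per codeword; two flips in the same word would be miscorrected.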
A theory holds that a bit error in RAM could cause a chain reaction, resulting in massive data corruption within a ZFS filesystem. It is therefore argued that the only safe way of running a ZFS server is with ECC RAM, and that doing otherwise is borderline suicidal.
ZFS uses no pre-mount consistency checker and lacks filesystem repair tools at the time of this writing. ZFS was conceived as a self-healing filesystem, capable of repairing data corruption on the go. Should ZFS try to read a data block that has been corrupted by, let's say, a hard drive defect, the filesystem would be able to identify the issue and attempt to repair it on the fly from parity data. Such self-healing features do, in theory, eliminate the need for recovery tools. The FreeNAS project (now TrueNAS) used to warn that a botched memory operation could cause permanent damage to the filesystem, and since there are no recovery tools available, data could end up being unrecoverable [2].
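The self-healing principle can be sketched in Python as follows. This is a deliberately simplified illustration and not ZFS's actual on-disk logic: it assumes a two-way mirror held in a plain list, a SHA-256 checksum stored separately from the data, and a hypothetical `self_healing_read` helper.

```python
import hashlib


def checksum(block):
    """Checksum stored apart from the data block it protects."""
    return hashlib.sha256(block).digest()


def self_healing_read(copies, expected):
    """Return an intact copy of the block and overwrite any damaged copies."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no intact copy left: the block is unrecoverable")
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good  # repair the damaged copy in place
    return good


original = b"family photos"
expected = checksum(original)

# Two mirrored copies; the first one has silently rotted on disk.
mirror = [b"family phxtos", original]
data = self_healing_read(mirror, expected)
print(data, mirror[0] == original)
```

The sketch also shows the limit of the approach: if every copy fails its checksum, there is nothing left to heal from, and the read fails.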
However, opinions differ on whether ZFS is more susceptible to failure than other filesystems. Matthew Ahrens, cofounder of Sun's ZFS project, argues that using ZFS with non-ECC RAM is about as risky as running a regular filesystem without it [3]. In his view, ECC RAM is not strictly necessary but is highly recommended.
RAID Issues
A good NAS promises excellent uptime and looks indestructible on the surface. It would seem like files should be able to survive indefinitely in such a server. After all, if a NAS is capable of withstanding a hard drive failure (the most common hardware malfunction [4]), there is not much incentive to spend the large amount of money required to set up another server and keep a backup of the original one.
The problem is that there is only so much a file server can do to protect your data, especially outside of an enterprise environment. Quality server hardware is designed to guarantee good uptime in the face of trouble, but not necessarily the integrity of your information. There are a number of reasons why a NAS may still fail.
If a hard drive fails within a NAS's RAID 5 set, the whole array will work at a degraded level. From the user's viewpoint, the array is still operational, but it has ceased to offer fault tolerance. Should another drive fail before a new one is added and the array is rebuilt, the information contained in the array will be lost. Many a RAID array has failed due to owner procrastination – or due to the long wait for the attention of an overworked sysadmin.
But tardy repair is just one of the reasons why some experts are wary of depending on RAID. A casual search on the Internet will find countless opinions regarding the unsuitability of RAID 5 for modern file servers [5]. Storage media are not perfect and may suffer random read failures. Hard drives are reliable enough for most purposes [6], but every now and then they will throw an Unrecoverable Read Error (URE). A URE occurs when the hard drive tries to access a block of data and fails to do so. Modern drives are estimated to suffer one URE for every 10^14 bits read on average (roughly 12.5TB), which means errors are rare on any single read.
The bigger a disk array, the higher the chance that a defective sector exists somewhere. The argument of RAID 5 detractors is that disk arrays are becoming so big that the probability of triggering a URE is becoming too high to be acceptable. This is so because the more bits are managed by the RAID, the more likely it is that at least one block of information is problematic.
If a RAID 5 loses a drive to hardware failure, a new drive can be plugged in, and the RAID 5 may be rebuilt from the data existing in the remaining disks. However, if any of the remaining disks throws a URE during this process, the consequences may range from losing the data existing in that sector to being unable to rebuild the whole RAID (depending on the quality of the RAID controller and drives).
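The detractors' argument boils down to simple arithmetic, which can be sketched in Python. The calculation below assumes that read errors are independent and occur at exactly the quoted rate of one per 10^14 bits, and that a rebuild reads every bit of every surviving drive; these are pessimistic assumptions, and the function name and example sizes are hypothetical.

```python
import math


def p_ure_during_rebuild(surviving_drives, drive_tb, bits_per_ure=1e14):
    """Probability of at least one URE while reading all surviving drives."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # 1 - (1 - 1/bits_per_ure) ** bits_read, computed in a numerically stable way
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_ure))


# Example: a four-drive RAID 5 of 4TB disks that has lost one drive.
probability = p_ure_during_rebuild(surviving_drives=3, drive_tb=4)
print(f"Chance of hitting a URE during the rebuild: {probability:.0%}")
```

Under this model, the risk climbs quickly with array size, which is why the numbers look so alarming on paper. The result depends entirely on the assumed error rate and on the assumption that errors are spread evenly, so it is better read as an upper-bound estimate than as a prediction.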
Experience suggests that the fear of being unable to rebuild big arrays is blown out of proportion. Nevertheless, it is important to remember that RAID 5 is a tool for guaranteeing uptime rather than the integrity of your files.
There are RAID levels with better fault tolerance than RAID 5 (such as RAID 6 or RAID 10), but using these alternative RAID levels in a small system is comparatively expensive.
Nearly as bad as this is the fact that many RAID controllers are proprietary and don't offer a good migration path. If you are using a proprietary solution and want to move your hard drives from an old server – maybe because the old one finally bit the dust! – you might discover that your data is unreadable in its destination machine.
Software issues can also destroy your files just as quickly as a hardware-level malfunction, and using an enterprise-grade server won't do much for you if you are hit by a bug. For example, QNAP's NAS appliances were massively affected by a vulnerability that caused many users to be preyed on by the DeadBolt ransomware [7][8].