Improving RAID performance with filesystem and RAID parameters
Optimum RAID
You can improve performance up to 20% by using the right parameters when you configure the filesystems on your RAID devices.
Creating a software RAID using the Linux kernel is becoming easier and easier. With a call to mdadm and pvcreate, you can be well on your way to using LVM on top of a RAID 5 or RAID 10 device. In fact, the procedure for setting up a RAID system has gotten so simple that many users routinely click through the commands without too much consideration for how the various settings might affect performance.
When it comes to RAID, however, the default settings aren't always a prescription for optimum performance. As you will learn in this article, tailoring your filesystem parameters to the details of the RAID system can improve performance by up to 20%.
Details
For this article, I focus on the XFS and the venerable ext3 filesystems. I have no particular connection to either filesystem and use them both in different circumstances.
My tests used a RAID 5 across four 500GB disks. To be exact, I used three Samsung HD501LJ 500GB drives and a single Samsung HD642JJ 640GB drive. One of the joys of Linux software RAID is that you can spread your RAID over disks of different sizes, as long as the partition you use on the larger disk is the same size as on the smaller. The RAID used 343,525,896 blocks on each disk, resulting in a usable filesystem size of just under 1TB when created on the RAID. Testing was performed on an Intel Q6600 running on a P35 motherboard.
The Linux kernel allows its filesystems to use barriers to protect one sequence of write commands from the next. When the filesystem requests a barrier, the kernel will ensure that everything the filesystem has written up to that point is physically on disk. This is a very slow operation because it often involves flushing the entire disk cache, along with similarly expensive actions.
By default, ext4 and XFS use filesystem barriers if they can. Because ext3 does not use barriers by default, I disabled barriers in the XFS tests so that I would be closer to comparing apples to apples. Another reason I disabled barriers is that LVM does not support them, so I needed to remove barriers to compare XFS performance accurately with and without LVM.
Lining up the RAID
Filesystems are deceptively complex pieces of software. At first, a filesystem might appear rather simple: Save a file and make sure you can get it back again. Unfortunately, even writing a small file presents issues, because the system caches data in many places along the way to the disk platter. The kernel maintains caches, your disk controller might have a small memory cache, and the drive itself has many megabytes of volatile cache. So, when a filesystem wants to make sure 100KB is really on the disk, a fairly complex and often slow operation takes place.
When you use a RAID 5 system on four disks, the Linux kernel writes three blocks of real information to three different disks and then puts the parity information for those three blocks on the fourth. This way, you can lose any one of the four disks (from disk crash) and still have enough information to work out what the original three blocks of information were. Because filesystems and disk devices use "blocks," RAID experts call these three data blocks "chunks" to avoid confusion.
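The recovery property described above can be sketched with shell arithmetic. This is only an illustration, assuming the conventional RAID 5 scheme in which parity is the bitwise XOR of the data chunks; each "chunk" is shrunk to a single made-up byte, and the variable names are hypothetical.

```shell
# Three data chunks, each reduced to one byte for illustration (made-up values).
d1=$((0x5A)); d2=$((0xC3)); d3=$((0x0F))

# The parity chunk is the XOR of all data chunks in the stripe.
parity=$(( d1 ^ d2 ^ d3 ))

# Suppose the disk holding d2 fails: XOR the surviving data with the parity
# to work out what the missing chunk contained.
recovered=$(( d1 ^ d3 ^ parity ))

printf 'parity=0x%02X recovered=0x%02X original=0x%02X\n' \
    "$parity" "$recovered" "$d2"
```

The same trick works whichever of the four chunks goes missing, because XOR-ing any three of them yields the fourth.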
With four disks, you have three data chunks and one parity chunk. So the stripe size is three chunks and the parity stripe size is one chunk. The stripe size is very important, because the filesystem should try to write all the chunks in a stripe at the same time so that the parity chunk can be calculated from the three chunks that are already in RAM and written to disk. You might be wondering what happens if a program only updates a single chunk out of the three data chunks.
To begin, I'll call the chunk being written chunk-X and the parity for that stripe chunk-P. One option is to read the other data chunks for the stripe, write out chunk-X, calculate a new parity chunk-P, and write the new chunk-P to disk. The other option is for the RAID to read chunk-P and the existing, old, chunk-X value off the disk and work out a way to change chunk-P to reflect the changes made to chunk-X.
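A minimal sketch of the two options, again assuming XOR parity and using one made-up byte per chunk, shows that both arrive at the same new chunk-P. The variable names are hypothetical; what differs in practice is which chunks have to be read from disk first.

```shell
# Existing stripe: two untouched data chunks, the old chunk-X, and its parity.
d1=$((0x5A)); d3=$((0x0F))
old_x=$((0xC3))
old_p=$(( d1 ^ old_x ^ d3 ))

# The program overwrites chunk-X with new contents.
new_x=$((0x3C))

# Option 1: read the other data chunks and recompute parity from scratch.
p_full=$(( d1 ^ new_x ^ d3 ))

# Option 2: read the old chunk-X and old chunk-P, then patch the parity
# by XOR-ing out the old value and XOR-ing in the new one.
p_rmw=$(( old_p ^ old_x ^ new_x ))

printf 'full recompute=0x%02X read-modify-write=0x%02X\n' "$p_full" "$p_rmw"
```

With four disks, both options cost two reads before the two writes; on wider arrays the second option reads fewer chunks.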
As you can see, things become a little bit more complicated for the RAID when you are not just sequentially writing a large contiguous range of data to disk. Now consider that the filesystem itself has to keep metadata on the "disk" on which you create it. For example, when you create or delete a file, the metadata has to change, which means that some little pieces of data have to be written to disk.
Depending on how the filesystem is designed, reading a directory also calls for reading many little pieces of data from disk. Therefore, if the filesystem knows about your chunk size and stripe size, it can try to arrange the metadata into chunks that will make life easier for the RAID and thus result in improved performance.
The key to aligning the parameters of your filesystem and RAID device is to set the stripe and chunk size correctly to begin with when creating the filesystem with the mkfs command. If you are using XFS and creating the filesystem directly on the RAID device, then mkfs.xfs takes care of that step for you. Unfortunately, if you use LVM on top of your RAID and then create an XFS filesystem on an LVM logical volume, mkfs.xfs does not align the parameters for optimum performance.
RAID 5
For this article, I assume you have some familiarity with the concepts of RAID and RAID configuration in Linux. If you are new to Linux RAID, you'll find several useful discussions online [1].
Redundant Array of Inexpensive Disks (RAID) is a collection of techniques for fault-tolerant data storage. The term first appeared in the 1988 landmark paper "A Case for Redundant Arrays of Inexpensive Disks," which was written by David Patterson, Garth Gibson, and Randy Katz [2].
Fault-tolerant data storage systems provide a means for preserving data in the case of a hard disk failure. The easiest way to protect the system from a disk failure (at least conceptually) is simply to write all the data twice to two different disks. This approach is roughly equivalent to what is often called disk mirroring, and it is known within the RAID system as RAID 1.
Although disk mirroring certainly solves the fault tolerance problem, it isn't particularly elegant or efficient. Fifty percent of the available disk space is devoted to storing redundant information. The original paper by Patterson, Gibson, and Katz (as well as later innovations and scholarship) explored alternative techniques for offering more efficient fault tolerance. A favorite fault-tolerance method used throughout the computer industry is Disk Striping with Parity, which is also called RAID 5.
RAID 5 requires an array of three or more disks. For an array of N disks, each stripe holds N – 1 chunks of data plus one chunk of parity information that will be used to restore the original data in case one of the disks fails. RAID 5 rotates the parity chunk from disk to disk rather than dedicating a single drive to it.
The fraction of disk space in the array devoted to redundant information is thus 1/N, and as the number of disks in the array increases, the penalty associated with providing fault tolerance diminishes.
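To put numbers on the 1/N relationship, the following sketch prints the parity overhead for a few array sizes. The function name parity_percent is hypothetical, and the result is an integer percentage rounded down.

```shell
# Percentage of raw capacity spent on parity in an N-disk RAID 5 (rounded down).
parity_percent() {
    echo $(( 100 / $1 ))
}

for n in 3 4 5 8; do
    echo "$n disks: $(( n - 1 )) disks' worth of data, $(parity_percent "$n")% parity"
done
```

For comparison, mirroring (RAID 1) always spends 50 percent of the raw capacity on redundancy.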
Making it Happen
The command below creates a RAID device. The default chunk size of 64KB was used in a RAID 5 configuration on four disks. Creating the RAID with the links in /dev/disk/by-id leaves a little less room for the error of accidentally using the wrong disk.
cd /dev/disk/by-id
mdadm --create --auto=md --verbose --chunk=64 --level=5 --raid-devices=4 \
  /dev/md/md-alignment-test \
  ata-SAMSUNG_HD501LJ_S0MUxxx-part4 \
  ata-SAMSUNG_HD501LJ_S0MUxxx-part4 \
  ata-SAMSUNG_HD501LJ_S0MUxxx-part4 \
  ata-SAMSUNG_HD642JJ_S1AFxxx-part7
For both ext3 and XFS, I ran two tests: one with a fairly routine mkfs command and one in which I tried to get the filesystem to align itself properly to the RAID. The result is four filesystems: ext3, ext3 aligned to the RAID (ext3align), XFS (xfs), and XFS aligned to the RAID (xfsalign). The only extra option to mkfs.xfs I used for the XFS filesystem was lazy-count=1, which relieves some contention on the filesystem superblock under load and is highly recommended as a default option.
The mkfs command to create the aligned ext3 filesystem is shown below. I have used environment variables to illustrate what each calculation is achieving. The stride parameter tells ext3 how large each RAID chunk is in 4KB disk blocks. The stripe-width parameter tells ext3 how large a single stripe is in data blocks, which effectively becomes three times stride for the four-disk RAID 5 configuration.
export RAID_DEVICE=/dev/md/md-alignment-test
export CHUNK_SZ_KB=64
export PARITY_DRIVE_COUNT=1
export NON_PARITY_DRIVE_COUNT=3
mkfs.ext3 -E stripe-width=$((NON_PARITY_DRIVE_COUNT*CHUNK_SZ_KB/4)),stride=$((CHUNK_SZ_KB/4)) $RAID_DEVICE
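If you want to check the arithmetic before formatting anything, the values can be worked out on their own. The sketch below assumes the 4KB ext3 block size used in the text; FS_BLOCK_KB is a name introduced here for illustration, and nothing is written to disk.

```shell
CHUNK_SZ_KB=64
NON_PARITY_DRIVE_COUNT=3
FS_BLOCK_KB=4   # assumed ext3 block size; adjust if you format with another size

# stride: one RAID chunk expressed in filesystem blocks.
stride=$(( CHUNK_SZ_KB / FS_BLOCK_KB ))

# stripe-width: all data chunks of one stripe expressed in filesystem blocks.
stripe_width=$(( NON_PARITY_DRIVE_COUNT * stride ))

echo "stride=$stride stripe-width=$stripe_width"
```

For the 64KB chunk, four-disk RAID 5 used here, this gives a stride of 16 and a stripe-width of 48.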
The command to create the aligned XFS filesystem is shown in the code below. To begin with, notice that the sunit and swidth parameters closely mirror stride and stripe-width of the ext3 case, although for XFS, you specify the values in terms of 512-byte blocks instead of the 4KB disk blocks of ext3.
mkfs.xfs -f -l lazy-count=1 -d sunit=$(($CHUNK_SZ_KB*2)),swidth=$(($CHUNK_SZ_KB*2*$NON_PARITY_DRIVE_COUNT)) $RAID_DEVICE
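Because mkfs.xfs does not work out the alignment by itself on an LVM logical volume, the same values have to be supplied by hand in that case. The sketch below derives them and only prints the resulting command rather than running it. The helper name xfs_align_opts and the logical volume path /dev/vg_test/lv_test are hypothetical placeholders, not names from my test setup.

```shell
# Print the -d alignment options for a given chunk size (KB) and data disk count.
xfs_align_opts() {
    local chunk_kb=$1 data_disks=$2
    # sunit and swidth are expressed in 512-byte blocks.
    local sunit=$(( chunk_kb * 1024 / 512 ))
    local swidth=$(( sunit * data_disks ))
    echo "sunit=${sunit},swidth=${swidth}"
}

TARGET=/dev/vg_test/lv_test   # hypothetical logical volume; substitute your own

# Show the command instead of executing it, so nothing is formatted by accident.
echo mkfs.xfs -f -l lazy-count=1 -d "$(xfs_align_opts 64 3)" "$TARGET"
```

For the configuration in this article, the helper yields sunit=128 and swidth=384.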
And They're Off …
I used both bonnie++ and IOzone for benchmarking. Many folks will already be familiar with bonnie++, which provides benchmarks for per-char and per-block read/write as well as rewrite, seek, and file metadata operations like creation and deletion.
IOzone performs many tests with different sizes for data read and write and different file sizes, then it shows the effect of file size and read/write size on performance in the form of three-dimensional graphs.
Looking first at the bonnie++ results, Figure 1 shows the block read and write performance. Note that because mkfs.xfs can detect the RAID configuration properly when you create the filesystem directly on the RAID device, the performance of xfs and xfsalign is the same, so only xfs is shown.
As you can see, setting the proper parameters for ext3 RAID alignment (see the ext3align column in Figure 1) provides a huge performance boost. An extra 10MB/sec block write performance for free is surely something you would like to have.
The bonnie++ rewrite performance is a very important metric if you are looking to run a database server on the filesystem. Although many desktop operations, such as saving a file, will replace the entire file in one operation, databases also tend to overwrite information in place to update records.
The rewrite performance is shown in Figure 2. Although the results for character I/O are less important than for block I/O, the additional performance for block rewrite gained by aligning the ext3 filesystem properly amounts to a 2MB/sec difference, or 6% additional performance for free.
Although the IOzone results for ext3 and ext3align are very similar, a few anomalies are worth reporting. The read performance is shown in Figure 3 and Figure 4. Notice that ext3align does a better job reading small chunks from smaller files than ext3 does. Strangely, performance drops for ext3align with 16MB files.