Speeding up slow disks with SSD caching
Afterburner
Flash memory is fast but also expensive. Caching with Flash provides a way out: A smaller and cheaper SSD can speed up the disk.
Hard disks are inexpensive, and they have huge capacities, but they are also slow. Solid state disks (SSDs) are fast, but smaller and more expensive. If you combine the advantages of a hard disk with an SSD-based cache, you pick up a large performance gain at a reasonable cost.
An application generally does not need all of its data at once; most of the data sits untouched most of the time. Caching lets you move the most frequently requested data to a dedicated, fast medium and leave the less frequently accessed data on the cheaper but slower background medium.
The Linux environment has several tools that provide the necessary software to support hard-disk caching. Does it help to use an SSD-based flash drive as a cache for a traditional hard disk? We decided to find out. This article explores the possibilities for caching with the Linux caching tools Enhance IO and dm-cache. If you are new to the topic of caching, and you would like some additional information on choices you might have to make, see the boxes titled "A Little Cache Theory" and "How Flash Works."
How Flash Works
The precursors of today's Flash memory appeared as early as the 1970s. The devices at the time stored computer microcode in ROM chips (read-only memory). These ROM chips could be neither erased nor overwritten. Thus, an update meant replacing the chips.
To simplify the procedure, scientists developed erasable programmable read-only memory, EPROM for short. This memory typically had a covered transparent window on the silicon chip. If you removed the label on the window and irradiated the module for around a quarter of an hour with UV light, you would erase the chip, and you would be able to rewrite it. This solution was significantly less expensive than a throw-away ROM, but it was still cumbersome.
The next generation was a further improvement in the form of electrically erasable programmable read-only memory (EEPROM). This type of memory was erased by applying a voltage. Like its predecessors, EEPROM was used to store small amounts of data that needed to be preserved without power and did not frequently change. Like today's Flash memory, these memory modules already belonged to the non-volatile random access memory (NVRAM) class.
Flash memory, which followed EEPROMs, had a much higher storage density, but relied on the same principle: It contains a floating gate transistor for each bit. The floating gate is an electrically isolated connection to which a voltage can be applied.
The presence of a voltage keeps the source-drain line of the transistor in a high-impedance state, that is, the transistor is non-conductive and blocks (Figure 1). Without voltage at the floating gate, the transistor conducts electricity between the source and drain instead. These two states distinguish the 0 and 1 binary bits.
A Little Cache Theory
If you want a cache to handle the most important data, you also need to define what is important. The cache's decision strategy sets the priority. Several models exist for setting priorities. All of these models define what data the cache needs to forget in favor of new entries. The cache delivers the data automatically and completely transparently whenever data is requested from the background medium. The most important decision strategies are:
- FIFO (First in, first out): The entry that was written first to the cache drops out of it again first. This approach is disadvantageous if the cache is small. In this case, data needs to be deleted constantly to make room.
- LFU (least frequently used): Whatever is least frequently requested is forgotten. This strategy is more efficient than FIFO when applications actually require certain entries significantly more often than others.
- LRU (least recently used): This strategy keeps the entries in the cache that have been used recently and removes the oldest. This technique usually requires a number of bits to remember how old a particular entry is. Each hit in the cache updates the age of all the other entries. Variations on LRU include Pseudo-LRU (PLRU, which only needs one age bit) or segmented LRU (SLRU, which includes a protected segment from which the cache is not allowed to remove any entries).
- MRU (most recently used): The opposite of LRU is also useful, if the likelihood that data will be accessed increases with the age of the data. This scenario occurs, for example, in sequential parsing of a data file. If the use case lends itself to a scenario where data that is just read won't be accessed again in the near future, it makes sense to forget the most recent entries in the cache first.
- MQ (multi-queue): This technique maintains different queues with the LRU strategy, where each queue is associated with a particular access frequency. A history buffer remembers the access frequency of the last entries to have been removed for a certain time. Stochastic multi-queue (SMQ) is a variety of MQ.
- RR (random replacement): Ditches an entry at random.
- Application specific: The cache learns from the application, operating system, hypervisor, or database what is worth keeping and adjusts to patterns of user behavior.
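To make the difference between two of these strategies concrete, here is a minimal Python sketch. It is not taken from any of the tools discussed in this article; the class name BlockCache and the access pattern are made up for illustration. It compares FIFO and LRU on a tiny cache in which one block is requested far more often than the others.

```python
from collections import OrderedDict


class BlockCache:
    """Toy block cache with FIFO or LRU replacement (illustrative only)."""

    def __init__(self, capacity, policy="lru"):
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.capacity = capacity
        self.policy = policy
        self.entries = OrderedDict()  # oldest entry sits at the front
        self.hits = 0
        self.misses = 0

    def access(self, block):
        """Request a block; return True on a cache hit."""
        if block in self.entries:
            self.hits += 1
            if self.policy == "lru":
                # A hit makes this block the most recently used one.
                self.entries.move_to_end(block)
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            # Evict from the front: the oldest insertion (FIFO)
            # or the least recently used block (LRU).
            self.entries.popitem(last=False)
        self.entries[block] = True
        return False


def run(policy, pattern, capacity=3):
    """Replay an access pattern and return (hits, misses)."""
    cache = BlockCache(capacity, policy)
    for block in pattern:
        cache.access(block)
    return cache.hits, cache.misses


# Block 1 is "hot": it is requested between all other accesses.
pattern = [1, 2, 3, 1, 4, 1, 5, 1, 6, 1]
fifo_result = run("fifo", pattern)
lru_result = run("lru", pattern)
print("FIFO hits/misses:", fifo_result)
print("LRU  hits/misses:", lru_result)
```

On this pattern, LRU keeps the hot block 1 in the cache and scores four hits, whereas FIFO pushes it out once and manages only three.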
In addition to the decision-making strategy, each cache also selects a write strategy. Write options include:
- Writethrough: The system immediately stores the block to be written in the cache, as well as on the background medium. However, the process may have to wait for the write to the slower medium.
- Writeback: The block to be written is first stored only in the cache, not on the background medium. The block only moves to the slow hard disk when the entry is displaced from the cache. This strategy avoids waiting times, but at the cost of temporary inconsistency. The medium behind the cache contains outdated data at times. The cache must be battery-buffered for this strategy; otherwise, a power failure almost inevitably leads to data loss.
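The two write strategies can be sketched in a few lines of Python. This is a toy model under simple assumptions, not how any of the tools in this article are implemented: two dictionaries stand in for the SSD and the hard disk, and the class and method names are hypothetical.

```python
class CachedDisk:
    """Toy model of a cache in front of a slow disk (both are dicts)."""

    def __init__(self, mode):
        if mode not in ("writethrough", "writeback"):
            raise ValueError("unknown mode: " + mode)
        self.mode = mode
        self.cache = {}     # block number -> data on the fast medium
        self.disk = {}      # block number -> data on the slow medium
        self.dirty = set()  # blocks that are newer in the cache than on disk

    def write(self, block, data):
        self.cache[block] = data
        if self.mode == "writethrough":
            # The caller waits until the slow medium has the data too.
            self.disk[block] = data
        else:
            # Writeback: only note that the disk copy is now stale.
            self.dirty.add(block)

    def evict(self, block):
        # A dirty block must reach the disk before it leaves the cache.
        if block in self.dirty:
            self.disk[block] = self.cache[block]
            self.dirty.discard(block)
        self.cache.pop(block, None)


wt = CachedDisk("writethrough")
wt.write(7, "new")

wb = CachedDisk("writeback")
wb.write(7, "new")
stale_before_evict = 7 not in wb.disk  # the disk has not seen the write yet
wb.evict(7)

print(wt.disk, stale_before_evict, wb.disk)
```

The window between write() and evict() in writeback mode is exactly the period in which the background medium is out of date.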
Another distinguishing feature for caches is how the cache addresses its entries. In direct-mapped caches, the address in the cache is derived directly from the address on the main storage medium, such as by using its least significant bits. Associative caches, on the other hand, use an algorithm to determine the location in the cache, for example, via a hash function. Direct mapping is faster, but two blocks can displace each other even if the remaining cache is completely empty. Associative mapping is more flexible, but the computational effort is higher.
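The displacement effect is easy to reproduce in a short Python sketch. The cache size of eight slots and the function names are assumptions chosen for illustration, and a simple linear search for a free slot stands in for whatever lookup algorithm a real associative cache would use.

```python
NUM_SLOTS = 8  # hypothetical cache size in blocks


def direct_mapped_insert(slots, block_addr):
    """Store the block in the slot given by its least significant bits."""
    index = block_addr % NUM_SLOTS
    evicted = slots[index]
    slots[index] = block_addr
    return evicted


def associative_insert(slots, block_addr):
    """Store the block in any free slot; evict only when the cache is full."""
    for index, occupant in enumerate(slots):
        if occupant is None:
            slots[index] = block_addr
            return None
    evicted = slots[0]  # cache full: slot 0 is a placeholder victim
    slots[0] = block_addr
    return evicted


direct = [None] * NUM_SLOTS
associative = [None] * NUM_SLOTS

# Blocks 3 and 11 share the same three least significant bits.
direct_evictions = [direct_mapped_insert(direct, a) for a in (3, 11)]
assoc_evictions = [associative_insert(associative, a) for a in (3, 11)]

print("direct-mapped evictions:", direct_evictions)
print("associative evictions:  ", assoc_evictions)
```

In the direct-mapped case, block 11 displaces block 3 although seven slots are still empty; the associative variant keeps both, at the price of searching for a free slot.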
Caching Solutions on Linux
Linux offers a variety of solutions for hard disk caching. This article only considers caches for block devices, which aren't affected by the filesystem and know nothing about the nature of the applications. For simplicity, the tests in this article do not consider the case where the same blocks are cached in other parts of the I/O stack, say, by the hard disk itself or in RAM when using a buffer cache.
One family of possible caching solutions for Linux centers around Flashcache [1]. Flashcache implements an associative cache with a writeback policy and uses FIFO (by default) or LRU as a replacement strategy. For this article, I tested Enhance IO, developed by STEC Inc. [2], which is based on Flashcache. Unlike Flashcache, Enhance IO does not use the device mapper. Enhance IO can transparently set up caching for mounted block devices. The Enhance IO environment supports three write strategies: Read-only, Writethrough, and Writeback.
In Read-only mode, all write operations are fed directly to the hard disk. Reads first transfer the data from the disk to the SSD; if access to the same block occurs again, the block is then read from the SSD.
In Writethrough mode, read operations are treated as in Read-only mode, but writes go to the HDD and SSD in parallel. Subsequent reads only access the SSD. Writeback mode performs all read and write operations on the SSD in the usual way. The operations reach the disk asynchronously.
The other caching solution I tested for this article is dm-cache [3], which is directly connected to the device mapper. The dm-cache method creates an LVM hybrid volume from three devices – the actual cache, a small device for metadata (both on SSD), and the hard disk. The caching strategy is multi-queue (MQ); the write strategy can be Writeback, Writethrough, or Passthrough.
Installation
Installing dm-cache or Enhance IO is not exactly rocket science. For Enhance IO, you can follow the example in Listing 1. First clone the Git repository, copy the command file for the CLI to /sbin, and copy the manpage to the right place (lines 1 to 5). Then, copy the directory containing the driver sources and rename it (lines 7-10). Next, you need to install the Dynamic Kernel Module Support (DKMS) framework. Before doing so, add the following line to the DKMS configuration file, dkms.conf (line 14):
PACKAGE_VERSION="0.1"
Now the installer can draw on DKMS to compile and install the driver module (line 16).
Listing 1
Enhance IO Installation
jcb@localhost:~$ git clone https://github.com/STEC-inc/EnhanceIO
jcb@localhost:~$ cd EnhanceIO/
jcb@localhost:~/EnhanceIO$ sudo cp CLI/eio_cli /sbin/
jcb@localhost:~/EnhanceIO$ chmod 700 CLI/eio_cli
jcb@localhost:~/EnhanceIO$ sudo cp ./CLI/eio_cli.8 /usr/share/man/man8/
jcb@localhost:~/EnhanceIO$ cd Driver
jcb@localhost:~/EnhanceIO/Driver$ sudo cp -r enhanceio /usr/src
jcb@localhost:~/EnhanceIO/Driver$ sudo mv /usr/src/enhanceio /usr/src/enhanceio-0.1
jcb@localhost:~/EnhanceIO/Driver$ cd /usr/src/enhanceio-0.1
jcb@localhost:/usr/src/enhanceio-0.1$ sudo vi dkms.conf
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dnf install dkms
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dkms add -m enhanceio -v 0.1
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dkms build -m enhanceio -v 0.1
jcb@localhost:/usr/src/enhanceio-0.1$ sudo dkms install -m enhanceio -v 0.1
[root@graphite enhanceio-0.1]# sudo eio_cli create -d /dev/mapper/testvol-data1 -s /dev/nvme0n1p2 -m wb -c enhanceio_cache
The final step sets up the cache (line 20). In this example, /dev/mapper/testvol-data1 is the LVM volume you wish to accelerate and /dev/nvme0n1p2 is the SSD. Intel kindly provided a fast 750 series NVMe PCI Express SSD (with a capacity of 1.2 TB) for the tests.
dm-cache is also easy to install. Because the device mapper framework is part of the kernel, you won't need any extra software. To prepare for installation, partition the SSD into a large partition for the cache and a small one for the metadata device. You can calculate the size of the metadata partition in bytes with:
Metadata = 4194304 + (16 * cache size/block size)
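As a quick plausibility check, the formula can be evaluated in Python. The function name is made up for illustration, and the input values are assumptions: 4194304 cache blocks of 512 sectors each, as the status output in Listing 2 reports them, with the device mapper's usual 512-byte sectors.

```python
SECTOR_BYTES = 512  # device mapper counts in 512-byte sectors


def metadata_bytes(cache_bytes, block_bytes):
    """Metadata size in bytes: 4194304 plus 16 bytes per cache block."""
    return 4194304 + 16 * (cache_bytes // block_bytes)


block_bytes = 512 * SECTOR_BYTES     # block size of 512 sectors (256 KiB)
cache_bytes = 4194304 * block_bytes  # assumed: 4194304 cache blocks (1 TiB)

size = metadata_bytes(cache_bytes, block_bytes)
print(f"{size} bytes = {size / 2**20:.0f} MiB")
```

With these inputs, the result is 71303168 bytes, or 68 MiB.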
In this example, the metadata partition is around 70MB. You can set up the special LVM device with the dmsetup command:
dmsetup create dmcache --table '0 1366552543 cache /dev/nvme0n1p2 /dev/nvme0n1p1 /dev/sdb2 512 1 writeback default 0'
This cryptic command line lists the following: the start sector and the length in sectors of the new device, the target type (cache), the device name for the metadata device, the cache device, the data device, then the block size in sectors, the number of feature arguments, and the write strategy feature argument (Writeback, in this case). Then, it lists the caching policy and the number of policy arguments (here: zero).
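If you prefer to assemble the table string from named parts rather than typing it by hand, a small helper can do the counting for you. The following Python sketch is only an illustration; the function name cache_table and its parameters are hypothetical, and it assumes that the first two numbers are the start sector and the length in sectors, as is usual for device mapper tables.

```python
def cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                block_sectors, features=("writeback",),
                policy="default", policy_args=()):
    """Build a dm-cache table line in the field order used in the text."""
    fields = [
        start, length, "cache",
        metadata_dev, cache_dev, origin_dev,
        block_sectors,
        len(features), *features,        # feature count, then the features
        policy,
        len(policy_args), *policy_args,  # argument count, then the arguments
    ]
    return " ".join(str(field) for field in fields)


table = cache_table(0, 1366552543,
                    "/dev/nvme0n1p2", "/dev/nvme0n1p1", "/dev/sdb2", 512)
print(table)
```

The printed string matches the table passed to dmsetup create in the command above.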
If this command fails with the hard-to-understand error message "Invalid or incomplete multi-byte or wide characters", it is probably because the cache or the metadata partition contains old data. dmsetup does not like that. A remedy is:
dd if=/dev/zero of=/dev/nvme0n1p2
dd if=/dev/zero of=/dev/nvme0n1p1
To check on the status, you can call up the cache statistics for the two solutions after performing a number of writes and reads. For dm-cache, the figures are output without formatting, and the meaning of the values is only documented in the source code. The output will look like Listing 2.
Listing 2
dmsetup status
[root@graphite jcb]# dmsetup status
dmcache: 0 1366552543 cache 8 12468/17920 512 4653/4194304 1488021 200791 2189199 41931 0 4650 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8
fedora-home: 0 199393280 linear
In the example shown in this listing, the first slash-separated pair of numbers is Used Metadata Blocks/Total Metadata Blocks; this pair is followed by the block size in the cache and Used Cache Blocks/Total Cache Blocks. After this second pair of slash-separated numbers, you see the values for Read Hits, Read Misses, Write Hits, and Write Misses. Things are easier with Enhance IO. The Enhance IO statistics are located in a file on the Proc filesystem and formatted in a table (see Listing 3).
Listing 3
Enhance IO Statistics (Excerpt)
[root@graphite jcb]# cat /proc/enhanceio/enhanceio_cache/stats
reads                1962
writes               6268272
read_hits            346
read_hit_pct         17
write_hits           1870824
write_hit_pct        29
dirty_write_hits     167399
dirty_write_hit_pct  2
cached_blocks        17664
rd_replace           65
wr_replace           394196
[...]
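Unlike Enhance IO, dm-cache does not compute hit percentages for you. A short Python sketch can pull the counters out of the raw status line from Listing 2 and do the math. It assumes the field order described above for the dmsetup status output; the function names are made up for illustration.

```python
STATUS = (
    "dmcache: 0 1366552543 cache 8 12468/17920 512 4653/4194304 "
    "1488021 200791 2189199 41931 0 4650 0 1 writeback "
    "2 migration_threshold 2048 mq 10 random_threshold 4 "
    "sequential_threshold 512 discard_promote_adjustment 1 "
    "read_promote_adjustment 4 write_promote_adjustment 8"
)


def parse_cache_status(line):
    """Extract the leading counters from one 'dmsetup status' cache line."""
    name, rest = line.split(":", 1)
    fields = rest.split()
    if fields[2] != "cache":
        raise ValueError("not a cache target: " + name)
    used_meta, total_meta = (int(n) for n in fields[4].split("/"))
    used_cache, total_cache = (int(n) for n in fields[6].split("/"))
    read_hits, read_misses, write_hits, write_misses = (
        int(n) for n in fields[7:11]
    )
    return {
        "name": name,
        "metadata_blocks": (used_meta, total_meta),
        "cache_block_sectors": int(fields[5]),
        "cache_blocks": (used_cache, total_cache),
        "read_hits": read_hits,
        "read_misses": read_misses,
        "write_hits": write_hits,
        "write_misses": write_misses,
    }


def hit_ratio(hits, misses):
    """Share of requests served from the cache (0.0 if there were none)."""
    total = hits + misses
    return hits / total if total else 0.0


stats = parse_cache_status(STATUS)
read_ratio = hit_ratio(stats["read_hits"], stats["read_misses"])
write_ratio = hit_ratio(stats["write_hits"], stats["write_misses"])
print(f"read hit ratio:  {read_ratio:.1%}")
print(f"write hit ratio: {write_ratio:.1%}")
```

For the values in Listing 2, this works out to roughly 88 percent of reads and 98 percent of writes being served from the cache.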
Benchmarks
You might be wondering what the reward is for all this effort. To study the benefits of caching, we ran various benchmarks. First, we successively migrated the hard disk files of a virtual machine to a regular disk, a RAID device, devices with dm-cache or Enhance IO, and a plain vanilla SSD. We then booted the VM and measured the time in each case.
Figure 2 shows the results. The data comes from the log of the Bootchart Tools [4], showing the number of seconds from the beginning of the boot process to starting Xorg. dm-cache's bad performance is explained by the fact that the cache needs a long time to warm up. A few boot attempts are not enough to accurately measure the performance. For the FIO benchmark, we had to repeat the measurement more than 70 times before dm-cache produced stable results.
In a second benchmark, we used Flexible I/O Tester (FIO, [5]) and let it work with a read-only, random-access workload. The tester first created 15 files with a size between 10 and 100GB (total size 96GB, which was eight times the size of the available RAM) and then read arbitrary 4KB blocks with up to 16 threads for several minutes. This test shows the impressive superiority of the SSD-based devices compared to hard drives (Figure 3).
The fact that a cache achieves results a few percent better than the standard SSD was not expected, but it results from normal fluctuations in the results and the fact that the influences on the complex I/O stack are diverse. Other caches in faster RAM at the filesystem level also play a role. The result of the single hard disk is so poor, at less than 1MB/s, that it disappears into the Y-axis. A RAID is significantly faster but still several orders of magnitude slower than devices that do without time-consuming repositioning of the read head.
For a third benchmark, we used Sysbench [6], which processed a read-write online transaction processing (OLTP) mix in a MySQL database. The MySQL data directory was stored successively on the devices. Each measurement was repeated at least three times, and a mean value was computed. The number of database threads working in parallel grew in the course of the benchmark process.
As you would expect, the SSD is the most expensive, but it is also the fastest solution. The two caches come pretty close to a peak of their power curve with 64 threads. The RAID's performance was passable but much slower. Finally, the hard disk drive was mercilessly outclassed (Figure 4).