This post explores the question: “how can Gluster utilize SSDs?” It does this by reviewing three tests done by the Red Hat performance group, each using SSDs in a different configuration. The configurations vary in cost of ownership and in how much control they give over caching behavior.
The systems in Red Hat’s performance labs were equipped with the LSI Nytro MegaRAID 8110-4e card, which can be configured for different RAID levels across its disk drives and has an attached SSD that can be used in several ways. Other cards could have been tested; this one was chosen because of its availability in the lab.
Here are the tests:
- Replace every disk with SSDs (expensive)
- Use SSDs as a cache at the LSI controller level (less expensive, no control over cache settings).
- Use SSDs as a cache at the kernel level using dm-cache (less expensive, some control over cache settings).
We would expect SSDs to perform best, relative to disks, on random I/O workloads. The SSD should also be larger than the host’s RAM: the Linux buffer cache already serves data that fits in RAM, so an SSD cache mainly pays off once the working set exceeds it.
How can users realize the benefit of SSDs at the least cost?
Replacing disks with SSDs
The most obvious deployment method is to simply replace all the disks in Gluster with SSDs. This may be prohibitive from a cost perspective, but it suggests an upper bound on what is achievable.
The experiments met expectations: SSDs performed much better than disks in most cases, particularly with small files.
To compare the Nytro SSD to traditional spindles, 3 modes of accessing storage were tried:
- pure SSD – just put XFS on top of the boot drive
- traditional – XFS on a concatenated LVM volume consisting of 8 RAID1 disk drive pairs, with disk-local write-back caching DISABLED (WCE=0 for SCSI people), tested in two controller-cache variants:
  - write-back caching – fsync’ed writes complete as soon as they reach NVRAM on the Nytro
  - write-thru caching – fsync’ed writes do NOT complete until the disk drive has sent them to the platter
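The “traditional” layout can be assembled along these lines (a sketch only; the device names are illustrative, and the RAID1 pairs are assumed to already exist on the controller):

```shell
# Disable each drive's local write-back cache (WCE=0), so durability
# depends only on the controller cache setting (illustrative device list)
for d in /dev/sd{b..q}; do
    sdparm --clear WCE "$d"
done

# Concatenate the 8 RAID1 pairs into one linear LVM volume and format it
pvcreate /dev/mapper/r1pair{0..7}
vgcreate bricks /dev/mapper/r1pair{0..7}
lvcreate -n brick0 -l 100%FREE bricks   # lvcreate is linear (concatenation) by default
mkfs.xfs /dev/bricks/brick0
```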
These tests used 2 workload types, with varying file sizes and thread counts; files were placed into directories at random, with file sizes drawn from a random exponential distribution.
- create — opens brand new file, writes data to it, fsyncs it (so it persists in event of crash/power-fail), closes it
- read – opens existing file, reads it, closes it
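In shell terms, each operation is roughly the following (a sketch using dd to stand in for the benchmark’s open/write/fsync/close sequence; the path is illustrative):

```shell
# create: open a brand-new file, write 64 KB to it, fsync it
# (dd's conv=fsync forces an fsync before dd exits), then close it
dd if=/dev/zero of=/tmp/smallfile_demo bs=64k count=1 conv=fsync 2>/dev/null

# read: open the existing file, read it, close it
dd if=/tmp/smallfile_demo of=/dev/null bs=64k 2>/dev/null
```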
| operation type | threads | file size | SSD files/sec | write-back files/sec | write-thru files/sec |
|----------------|---------|-----------|---------------|----------------------|----------------------|
Note the numbers in bold: pure-SSD create throughput was 5 times the write-back numbers, and reads were 10 times what was seen with write-back.
Using SSDs as a cache at the controller level
By default, the LSI Nytro card utilizes its SSD as a cache in front of the disks, the idea being that frequently used data will be quickly accessible from the SSD rather than the disks. The caching policies are internal to the hardware – in effect the cache is a “black box”. This experiment tries to show how well those caching policies work.
The tests appeared to show the SSDs had some benefit, but not nearly as significant as when the disks were completely replaced. For example, the best results showed a 70% improvement, while replacing the disks completely in some cases yielded a 500% improvement.
To isolate the effect of the SSD, I/O was run directly against RAID-6 volumes, without Gluster. It was generated using the smallfile tool, which produces a “small file” workload of random I/O operations.
- Run swift (object protocol) over XFS
- fsync after every create
- extended attributes written to every file
- A deep directory tree was generated with few files/directory
- 20 workload generator threads
- average file size is 64 KB
- an exponential file size distribution featuring mostly files smaller than chosen file size with a few of the files much larger than the chosen file size
- 200,000 separate files accessed per thread
- 5 extended attributes of 32 bytes each accessed per file for swift-put/swift-get operations
- files randomly sprayed across the directory tree (default is to access directories one at a time)
- fsync issued by thread after each file is written.
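A smallfile invocation along these lines would reproduce the write phase of this workload (the option names follow the current smallfile CLI and may differ from the version used in these tests; the mount point is illustrative):

```shell
# swift-put phase: 20 threads x 200,000 files, 64 KB average size with an
# exponential size distribution, 5 xattrs of 32 bytes per file, fsync after
# each file, files hashed at random across the directory tree
./smallfile_cli.py --operation swift-put \
    --threads 20 --files 200000 \
    --file-size 64 --file-size-distribution exponential \
    --xattr-count 5 --xattr-size 32 \
    --fsync Y --hash-into-dirs Y \
    --top /mnt/raid6/smf
```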
200,000 x 20 = 4 million files were written as part of each test, with a total of 256 GB of data, two times the amount of RAM available on the host, so reads could not be served entirely from the page cache.
3 passes of each test were done. The write tests are done using the swift-put operation, and the read tests are done using the swift-get operation.
- swift-put — create and write the files
- swift-get – read the files after dropping cache
- swift-get2 — read the files again after dropping cache; a 2nd read will detect whether Nytro is starting to cache portions of data/metadata in SSD
- swift-getcached — read the files without dropping the cache, to see how fast reads are when host RAM is allowed to buffer the data beforehand
- swift-getnomem — read the files after restricting memory usage severely (this simulates the effect of having the Nytro configured with far more SSD than RAM)
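The cache-dropping and memory-restriction steps above map to standard Linux knobs, roughly as follows (both require root; the cgroup path and limit are illustrative v1 settings, not the values used in the tests):

```shell
# Before swift-get and swift-get2 (but not swift-getcached): flush dirty
# pages, then drop the page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches

# For swift-getnomem: rerun the read workload inside a memory-limited
# cgroup so host RAM cannot buffer much data
mkdir /sys/fs/cgroup/memory/smf
echo 2G > /sys/fs/cgroup/memory/smf/memory.limit_in_bytes
```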
Using SSDs as a cache at the kernel level (dm-cache)
The Nytro card’s default SSD caching configuration did not generate very impressive improvements. Linux introduced an alternative in the 3.9 kernel called “dm-cache”, aka “dynamic block level storage caching”. It caches blocks at the device mapper level within the kernel; the cached blocks reside on a “cache device”, typically an SSD. A related project is called bcache.
The dm-cache module has a tunable policy (e.g. LRU, MFU). The file system can send hints to dm-cache to “encourage” blocks to be cached or not cached.
For the test, the Nytro card’s SSD was re-purposed to act as the caching device for the dm-cache. The test compared using dm-cache with using the normal RAID write-back cache on the Nytro controller.
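With a 3.9-era kernel, a dm-cache device is assembled directly with dmsetup; the cache target line below follows the kernel’s dm-cache documentation (device paths are illustrative, and in this test the metadata device sat on a disk rather than the SSD):

```shell
ORIGIN=/dev/mapper/vg-raid      # RAID volume behind the cache
CACHE=/dev/nytro_ssd            # Nytro SSD, repurposed as the cache device
META=/dev/sdx1                  # small metadata device (on a disk here)

# table: start len cache <meta> <cache> <origin> <block size in 512B sectors>
#        <#features> <features> <policy> <#policy args>
dmsetup create cached --table \
  "0 $(blockdev --getsz $ORIGIN) cache $META $CACHE $ORIGIN 512 1 writeback default 0"
```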
The results showed dm-cache performed well when there was no caching on the controller level (write through was set on Nytro). When write-back was set, little benefit was observed in most cases, with the exception of the small file workload.
The test was preliminary: for example, RAID-10 rather than RAID-6 was used, and the metadata device used by dm-cache was housed on a disk.
That said, dm-cache appears promising. The Nytro’s cache helps performance, but many users prefer JBODs to expensive controllers, and such JBOD users would see worse performance without that cache. They may be able to recover the lost performance by using dm-cache with SSDs.