This post explores the question: “how can Gluster utilize SSDs?” It does this by reviewing three tests done by the Red Hat performance group, each using SSDs in a different configuration. The configurations vary in cost of ownership and in how much control they give over caching behavior.
The systems in Red Hat’s performance labs were equipped with the LSI Nytro MegaRAID 8110-4e card, which can be configured for different RAID levels across its disk drives and has an attached SSD that can be used in several ways. Other cards could have been tested; this one was chosen because of its availability in the lab.
Here are the tests:
- Replace every disk with SSDs (expensive)
- Use SSDs as a cache at the LSI controller level (less expensive, no control over cache settings).
- Use SSDs as a cache at the kernel level using dm-cache (less expensive, some control over cache settings).
We would expect SSDs to perform best, relative to disks, on random I/O workloads. The SSD should also be larger than the host’s RAM: the Linux buffer cache already serves data that fits in RAM, so an SSD cache mainly pays off once the working set exceeds it.
How can users realize the benefit of SSDs at the least cost?
Replacing disks with SSDs
The most obvious deployment method is to simply replace all the disks in Gluster with SSDs. This may be prohibitive from a cost perspective, but it suggests an upper bound on what is achievable.
The experiments met expectations: SSDs performed much better than disks in most cases, particularly with small files.
To compare the Nytro SSD to traditional spindles, 3 modes of accessing storage were tried:
- pure SSD – just put XFS on top of the boot drive
- traditional – XFS on a concatenated LVM volume consisting of 8 RAID1 disk drive pairs, with disk-local write-back caching DISABLED (WCE=0 for SCSI people), tested in two controller-cache variants:
  - write-back caching – fsync’ed writes complete as soon as they reach NVRAM on the Nytro
  - write-thru caching – fsync’ed writes do NOT complete until the disk drive has sent them to the platter
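The “traditional” layout can be assembled along these lines (a sketch only; the device names are illustrative, and the RAID1 pairs are assumed to already exist on the controller):

```shell
# Disable each drive's local write-back cache (WCE=0), so durability
# depends only on the controller cache setting (illustrative device list)
for d in /dev/sd{b..q}; do
    sdparm --clear WCE "$d"
done

# Concatenate the 8 RAID1 pairs into one linear LVM volume and format it
pvcreate /dev/mapper/r1pair{0..7}
vgcreate bricks /dev/mapper/r1pair{0..7}
lvcreate -n brick0 -l 100%FREE bricks   # lvcreate is linear (concatenation) by default
mkfs.xfs /dev/bricks/brick0
```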
These tests used 2 workload types, with varying file sizes and thread counts; files were placed into directories at random, with file sizes drawn from a random exponential distribution.
- create — opens brand new file, writes data to it, fsyncs it (so it persists in event of crash/power-fail), closes it
- read – opens existing file, reads it, closes it
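In shell terms, each operation is roughly the following (a sketch using dd to stand in for the benchmark’s open/write/fsync/close sequence; the path is illustrative):

```shell
# create: open a brand-new file, write 64 KB to it, fsync it
# (dd's conv=fsync forces an fsync before dd exits), then close it
dd if=/dev/zero of=/tmp/smallfile_demo bs=64k count=1 conv=fsync 2>/dev/null

# read: open the existing file, read it, close it
dd if=/tmp/smallfile_demo of=/dev/null bs=64k 2>/dev/null
```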
| operation type | threads | file size | SSD files/sec | write-back files/sec | write-thru files/sec |
|----------------|---------|-----------|---------------|----------------------|----------------------|
Note the numbers in bold: pure-SSD create throughput was 5 times the write-back numbers, and reads were 10 times what was seen with write-back.
Using SSDs as a cache at the controller level
By default, the LSI Nytro card utilizes its SSD as a cache in front of the disks, the idea being that frequently used data will be quickly accessible from the SSD rather than the disks. The caching policies are internal to the hardware – in effect the cache is a “black box”. This experiment tries to show how well those caching policies work.
The tests appeared to show the SSDs had some benefit, but not nearly as significant as when the disks were completely replaced. For example, the best results showed a 70% improvement, while replacing the disks completely in some cases yielded a 500% improvement.
To isolate the effect of the SSD, I/O was run directly against RAID-6 volumes, without Gluster. It was generated using the smallfile tool, which produces a “small file” workload of random I/O operations.
- Run swift (object protocol) over XFS
- fsync after every create
- extended attributes written to every file
- A deep directory tree was generated with few files/directory
- 20 workload generator threads
- average file size is 64 KB
- an exponential file size distribution featuring mostly files smaller than chosen file size with a few of the files much larger than the chosen file size
- 200,000 separate files accessed per thread
- 5 extended attributes of 32 bytes each accessed per file for swift-put/swift-get operations
- files randomly sprayed across the directory tree (default is to access directories one at a time)
- fsync issued by thread after each file is written.
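A smallfile invocation along these lines would reproduce the write phase of this workload (the option names follow the current smallfile CLI and may differ from the version used in these tests; the mount point is illustrative):

```shell
# swift-put phase: 20 threads x 200,000 files, 64 KB average size with an
# exponential size distribution, 5 xattrs of 32 bytes per file, fsync after
# each file, files hashed at random across the directory tree
./smallfile_cli.py --operation swift-put \
    --threads 20 --files 200000 \
    --file-size 64 --file-size-distribution exponential \
    --xattr-count 5 --xattr-size 32 \
    --fsync Y --hash-into-dirs Y \
    --top /mnt/raid6/smf
```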
200,000 x 20 = 4 million files were written as part of each test, with a total of 256 GB of data, two times the amount of RAM available on the host, so reads could not be served entirely from the page cache.
3 passes of each test were done. The write tests are done using the swift-put operation, and the read tests are done using the swift-get operation.
- swift-put — create and write the files
- swift-get – read the files after dropping cache
- swift-get2 — read the files again after dropping cache; a 2nd read will detect whether Nytro is starting to cache portions of data/metadata in SSD
- swift-getcached — read the files without dropping the cache, to see how fast reads are when host RAM is allowed to buffer the data beforehand
- swift-getnomem — read the files after restricting memory usage severely (this simulates the effect of having the Nytro configured with far more SSD than RAM)
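The cache-dropping and memory-restriction steps above map to standard Linux knobs, roughly as follows (both require root; the cgroup path and limit are illustrative v1 settings, not the values used in the tests):

```shell
# Before swift-get and swift-get2 (but not swift-getcached): flush dirty
# pages, then drop the page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches

# For swift-getnomem: rerun the read workload inside a memory-limited
# cgroup so host RAM cannot buffer much data
mkdir /sys/fs/cgroup/memory/smf
echo 2G > /sys/fs/cgroup/memory/smf/memory.limit_in_bytes
```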
Using SSDs as a cache at the kernel level (dm-cache)
The Nytro card’s default SSD caching configuration did not generate very impressive improvements. Linux introduced an alternative in the 3.9 kernel called “dm-cache”, aka “dynamic block level storage caching”. It caches blocks at the device mapper level within the kernel; the cached blocks reside on a “cache device”, typically an SSD. A related project is called bcache.
The dm-cache module has a tunable policy (e.g. LRU, MFU). The file system can send hints to dm-cache to “encourage” blocks to be cached or not cached.
For the test, the Nytro card’s SSD was re-purposed to act as the caching device for the dm-cache. The test compared using dm-cache with using the normal RAID write-back cache on the Nytro controller.
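With a 3.9-era kernel, a dm-cache device is assembled directly with dmsetup; the cache target line below follows the kernel’s dm-cache documentation (device paths are illustrative, and in this test the metadata device sat on a disk rather than the SSD):

```shell
ORIGIN=/dev/mapper/vg-raid      # RAID volume behind the cache
CACHE=/dev/nytro_ssd            # Nytro SSD, repurposed as the cache device
META=/dev/sdx1                  # small metadata device (on a disk here)

# table: start len cache <meta> <cache> <origin> <block size in 512B sectors>
#        <#features> <features> <policy> <#policy args>
dmsetup create cached --table \
  "0 $(blockdev --getsz $ORIGIN) cache $META $CACHE $ORIGIN 512 1 writeback default 0"
```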
The results showed dm-cache performed well when there was no caching on the controller level (write through was set on Nytro). When write-back was set, little benefit was observed in most cases, with the exception of the small file workload.
The test was preliminary: for example, RAID-10 rather than RAID-6 was used, and the metadata device used by dm-cache was housed on a disk.
That said, dm-cache appears promising. The Nytro’s cache helps performance, but many users prefer JBODs to expensive controllers, and such JBOD users would see worse performance without that cache. They may be able to recover the lost performance by using dm-cache with SSDs.