Gluster can have trouble delivering good performance for small file workloads. This problem is acute for features such as tiering and RDMA, which employ expensive hardware such as SSDs or infiniband. In such workloads the hardware’s benefits are unrealized, so there is little return on the investment.
A major contributing factor to this problem has been excessive network overhead in fetching file and directory metadata. Their aggregated costs exceed the benefits of the hardware’s accelerated data transfers. This fetch is called a LOOKUP. Note that for larger file sizes, the picture changes. For large files the improved transfer times exceed the LOOKUP costs, so in those cases RDMA and tiering features work well.
The chart below depicts the problem with RDMA. Large read-file workloads perform well, small read-file workloads perform poorly.
The following examples use the “smallfile”  utility as a workload generator. I run a large 28 brick tiered volume “vol1”. The configuration’s hot tier is a 2×2 ram disk, and the cold tier is a 2 x (8 + 4) HDD. I run from a single client, mounted over FUSE. The entire working set of files resides on the hot tier. The experiments using tiering can also be found in the SNIA SDC presentation here .
Running Gluster’s profile against a tiered volume generates a count of the number of LOOKUPs and depicts the problem.
.. 20K LOOKUPs are sent to each brick, on the first run.
The purpose behind most LOOKUPs is to confirm the existence and permissions of a given directory and file. The client sends such LOOKUPs for each level of the path. This phenomena has been dubbed the “path traversal problem.” It is a well known issue with distributed storage systems . The round trip time for each LOOKUP is not small and the cumulative effect is big. Alas, Gluster has suffered from it for years.
The smallfile_cli.py utility opens a file, does an IO, and then closes it. The path is 4 levels deep (p66/file_srcdir/gprfc066/thrd_00/<file>).
The 20K figure can be derived. There are 5000 files, and 4 levels of directories. 5000*4=20K.
The DHT and tier translators must validate on which brick the file resides. To do this, the first LOOKUPs received are sent to all subvolumes. The brick that has the file is called the “cached subvolume”. Normally, it is predicted by the distributed hash’s algorithm, unless the set of bricks has recently changed. Subsequent LOOKUPs are sent only to the cached subvolume.
Regardless of this phenomenon, the cached subvolume still receives as many LOOKUPs as the path length, due to the path traversal problem. So when the test is run a second time, gluster profile still shows 20K LOOKUPs, but only on bricks on the hot tier (the tier translator’s cached subvolume), and nearly none on the cold tier. The round trips are still there, and the overall problem persists.
To cope with this “lookup amplification”, a project has been underway to improve Gluster’s meta-data cache translator (md-cache), so the stat information LOOKUP requests could be cached indefinitely on the client. This solution requires client side cache entries to be invalidated if another client modified a file or directory. The invalidation mechanism is called an “upcall.” It is complex and has taken time to be written. But as of October 2016 this new functionality is largely code complete and available in Gluster upstream.
Enabling upcall in md-cache:
$ gluster volume set <volname> features.cache-invalidation on
$ gluster volume set <volname> features.cache-invalidation-timeout 600
$ gluster volume set <volname> performance.stat-prefetch on
$ gluster volume set <volname> performance.cache-samba-metadata on
$ gluster volume set <volname> performance.cache-invalidation on
$ gluster volume set <volname> performance.md-cache-timeout 600
$ gluster volume set <volname> network.inode-lru-limit: <big number here>
In the example, I used 90000 for the inode-lru-limit.
At the time of this writing, a cache entry will expire after 5 minutes. The code will eventually be changed to allow an entry to never expire. That functionality will come once more confidence is gained in the upcall feature.
With this enabled, gluster profile shows the number of LOOKUPs drops to a negligible number on all subvolumes. As reported by the smallfile_cli.py benchmark, this translates directly to better throughput for small file workloads. YMMV, but in my experiments, I saw tremendous improvements and the SSD benefits were finally enjoyed.
The number of UPCALLs and FORGETs is now visible using Gluster’s profiler.
The md-cache hit/miss statistics are visible this way:
$ kill -USR1 `pgrep gluster`
# wait a few seconds for the dump file to be created
The md-cache solution requires client side memory, something not all users can dedicate.
The “automated” part of gluster tiering is slow. Files are moved between tiers in a single threaded engine, and the SQL query operates in time linear to the number of files. So the set of files residing on the hot tier must be stable.
This post describes recent tests done by Red Hat on an 84 node gluster volume. Our experiments measured performance characteristics and management behavior. To our knowledge, this is the largest performance test ever done under controlled conditions within the organization (we have heard of larger clusters in the community but do not know any details about them).
Red Hat officially supports up to 64 gluster servers in a cluster and our tests exceed that. But the problems we encounter are not theoretical. The scalability issues appeared to be related to the number of bricks, not the number of servers. If a customer was to use just 16 servers, but have 60 drives on each, they would have 960 bricks and likely see similar issues to what we found.
Summary: With one important exception, our tests show gluster scales linearly on common I/O patterns. The exception is on file create operations. On creates, we observe network overhead increase as the cluster grows. This issue appears to have a solution and a fix is forthcoming.
We also observe that gluster management operations become slower as the number of nodes increases.bz 1044693 has been opened for this. However, we were using the shared local disk on the hypervisor, rather than the disk dedicated to the VM. When this was changed, performance of the commands increased, e.g. 8 seconds.
Configuring an 84 node volume is easier said than done. Our intention was to build a methodology (tools and procedures) to spin up and tear down a large cluster of gluster servers at will.
We do not have 84 physical machines available. But our lab does have very powerful servers (described below). They can run multiple gluster servers at a time in virtual machines. We ran 12 such VMs on each physical machine. Each virtual machine was bound to its own disk and CPU. Using this technique, we are able to use 7 physical servers to test 84 nodes.
Tools to setup and manage clusters of this many virtual machines are nascent. Much configuration work must be done by hand. The general technique is to create a “golden copy” VM and “clone” it many times. Care must be taken to keep track of IP addresses, host names, and the like. If a single VM is misconfigured , it can be difficult to locate the problem within a large cluster.
Puppet and Chef are good candidates to simplify some of the work, and vagrant can create virtual machines and do underlying resource management, but everything still must be tied together and programmed. Our first implementation did not use the modern tools. Instead, crude but effective bash, expect, and kickstart scripts were written. We hope to utilize puppet in the near term with the help from gluster configuration management guru James Shubin. If you like ugly scripts, they may be found here.
One of the biggest problem areas in this setup was networking. When KVM creates a Linux VM, a hardware address and virtual serial port console exist, and an IP address can be obtained using DHCP. But we have a limited pool of IP addresses on our public subnet- and our lab’s system administrator frowns upon 84 new IP addresses being allocated out of the blue. Worse, the public network is 1GbE ethernet – too slow for performance testing.
To workaround those problems, we utilized static IP addresses on a private 10GbE ethernet network. This network has its own subnet and is free from lab restrictions. It does not have a DHCP server. To set the static IP address, we wrote an “expect” script which logs into the VM over the serial line, and modifies the network configuration files.
At the hypervisor level, we manually set up the virtual bridge, the disk configurations, and set the virtual-host “tuned” profile.
Once the system was built, it quickly became apparent that another set of tools would be needed to manage the running VMs. For example, it is sometimes necessary to run the same command across all 84 machines. Bash scripts was written to that end, though other tools (pdsh) could have been used.
With that done, we were ready to do some tests. Our goals were:
To confirm gluster “scales linearly” for large and small files- as new nodes are added, performance increases accordingly
To examine behavior on large systems. Do all the management commands work?
Large files tests: gluster scales nicely.
Small file tests: gluster scales on reads, but not write-new-file.
Oops. Small file writes are not scaling linerally. Whats going on here?
Looking at wireshark traces, we observed many LOOKUP calls sent to each of the nodes for every file create operation. As the number of nodes increased, so did the number of LOOKUPs. It turns out that the gluster client was doing a multicast lookup on every node on creates. It does this to confirm the file does not already exist.
The gluster parameter “lookup-unhashed” forces DHT hashing to be used. This will send the LOOKUP to the node where the new file should reside, rather than doing a multicast to all nodes. Below are the results when this setting is enabled.
write-new-file test results with the parameter set (red line). Much better!
This parameter is dangerous. If the cluster’s brick topography has changed and the rebalancing was aborted, gluster may find itself in a situation believing a file does not exist, when it really does. In other words, the LOOKUP existence test would generate a false negative because DHT would have the client look to the wrong nodes. This could result in two GFIDs being accessible by the same path.
A fix is being written. It will assign generation counts to bricks. By default DHT will be used on lookups. But if the generation counts indicated a topography change had taken place on the target bricks, the client will revert to the slower broadcast mode of operation.
We observed any management commands dealing with the volume took as long as a minute. For example, the “gluster import” command on the oVirt UI takes more then 30 seconds to complete. Bug 1044693 was opened for this. In all cases the management command worked, but was very slow. See note above in red.
Some additional tests we could do were suggested by gluster engineers. This would be future work:
object enumeration – how well does “ls” scale for large volumes?
What is the largest number of small objects (files) that a machine can handle before it makes sense to add a new node
Snapshot testing for scale-out volumes
Openstack behavior – what happens when the number of VMs goes up? We would look at variance and latency for the worse case.
Proposals to do larger scale-out tests:
We could present to gluster volumes partitions of disks. For example, a single 1TB drive could be divided into 10 100GB drives. This could boost the cluster size by an order of magnitude. Given the disk head would be shared by multiple servers, this technique would only make sense for random I/O tests (where the head is already under stress).
Experiment with running gluster servers within containers.
Gluster volumes are constructed out out a varying number of bricks embedded within separate virtual machines. Each virtual machine has:
dedicated 7200-RPM SAS disk for Gluster brick
a file on hypervisor system disk for the operating system image
2 Westmere or Sandy Bridge cores
4 GB RAM
The KVM hosts are 7 standard Dell R510/R720 servers with these attributes:
2-socket Westmere/Sandy Bridge Intel x86_64
48/64 GB RAM
1 10-GbE interface with jumbo frames (MTU=9000)
12 7200-RPM SAS disks configured in JBOD mode from a Dell PERC H710 (LSI MegaRAID)
For sequential workloads, we use only 8 out of 12 guests in each host so that aggregate disk bandwidth never exceeds network bandwidth.
Clients are 8 standard Dell R610/R620 servers with:
2-socket Westmere/Sandy Bridge Intel x86_64
64 GB RAM
1 10-GbE Intel NIC interface with jumbo frames (MTU=9000)
This post shares some experiences I’ve had in simulating a block device in gluster.
The block device is a file based image, which acts as the backend for the Linux SCSI target. The file resides in gluster, so enjoys gluster’s feature set. But the client only sees a block device. The Linux SCSI target presents it over iSCSI.
Some more information on how to set this up is at the bottom of this post.
But where to mount gluster?
There are three options to consider. Call them client, server, and gateway.
The configurations are depicted in the diagram below:
Configuring the block device at the server means the client only requires an iSCSI initiator.
Configuring the block device at the client allows I/O to fan out to different nodes.
Configuring the block device at a gateway allows I/O to fan out to different nodes without changing the client.
These options have their pros and cons. For example, a gateway minimizes client overhead while providing fan-out. OTOH, the customer must provide an additional node.
In my case, the objective was to minimize customer burden. Server-side configuration is probably the best choice. After chatting with some colleagues here at Red Hat, thats what we settled on.
I ran some simple performance tests using the “fio” tool to generate I/O.
Up to 10 fio processes were started.
The queue depth for each was 32.
Each process sends I/O to its own slice of the volume.
client and server block cache is flushed between tests
echo 3 > /proc/sys/vm/drop_caches
50G volume replicated over two nodes
Gluster version: 22.214.171.124
Only two nodes were used in performance testing. Ideally more nodes would be used, but that equipment is not readily available.
There are numerous options to tune, including the queue depth, number of streams, number of paths, gluster volume configuration, block size, number of targets etc. Parameters were chosen based on trial and error rather than formal methodology.
Last Saturday on 14th September’13 I gave on GlusterFS presentation at LSPE-IN. The title for the presentation was Performance Characterization in Large distributed file system with GlusterFS . Few days before the talk I looked at the attendee list to get the feel of the audience. I felt that not many of them have used GlusterFS by now. So I decided rather than covering lots of performance part I should cover the concepts of GlusterFS and talk about the performance challenges which can be there by design like file-system in user-space etc.
I am glad I did that as there were only 4 folks from ~70 attendees who have used GlusterFS before.
By going through the concepts the audience were able to compare GlusterFS with other solutions in the market. They asked interesting question about hashing algorithm, replication etc.
I attended the meetup till lunch. Other than giving the presentation I attended a keynote on Challenges in scaling cloud storage by Srivibhavan (Vibhav) Balaram in which he talked about the challenges in cloud storage and QoS in cloud environment. I also attended “Scaling using event driven programming with Perl, A tutorial” Aveek Mishra.
This was again a very informative and well managed event. I’ll look forward to attend and present in future meet-ups.
Problem: VERY slow performance when using ‘bedtools’ and other apps that write zillions of small output chunks.
If this was a self-writ app or an infrequently used one, I wouldn’t bother writing this up, but ‘bedtools’ is a fairly popular genomics app and since many installations use gluster to host Next-Gen sequencing data and analysis, I thought I’d document it a bit more.
A sophisticated user complained that when using the gluster filsystem, his ‘bedtools’ performance decreased horribly relative to using a local filesystem (fs) or NFS-mounted fs. The ‘bedtools’ utility ‘genomeCoverageBed’ reads in several GB of ‘bam’ file, compares it to a reference genome, and then writes out millions of small coordinates. I originally thought that this was due to reads and writes from the gluster fs conflicting at some point because if the output was directed to another fs, it ran normally.
It turned out that it’s not that simultaneous reads and writes dramatically decrease perf, but
that the /type of writes/ being done by ‘bedtools’ kills performance.
Solution: the short version:
Insert ‘gzip’ to compress and stream the data before sending it to the gluster fs. The improvement in IO (and application) performance is dramatic.
ie (all files on a gluster fs) genomeCoverageBed -ibam RS_11261.bam -g ref/dmel-all-chromosome-r5.1.fasta \
-d |gzip > output.cov.gz
inserting the ‘| gzip’ increased the overall app speed by more than 30X, relative to not using it on a gluster fs. It even improved the wall clock speed of the app relative to running on a local
filesystem by ~1/3, decreased the gluster CPU utilization by ~99% and reduced the output size by 80%. So, wins all round.
Solution: the long version:
The type of writes that bedtools does is also fairly common in bioinformatics and especially in self-writ scripts and logs – lots of writes of tiny amounts of data.
As I understand it (which may be wrong; please correct) the gluster
native client which we’re using does not buffer IO as well as the
NFS client, which is why we frequently see complaints about gluster vs
NFS performance. The apparent problem for bedtools is that these zillions of tiny
writes are being handled separately or at least not cached well enough to be
consolidated in a large write. To present the data to gluster as a
continuous stream instead of these tiny writes, they have to be
converted to such a stream and ‘gzip’ is a nice solution because it
compresses as it converts.
Apparently anything that takes STDIN, buffers it appropriately, and then spits it out on STDOUT will work. Even piping the data thru ‘cat’ will work to allow bedtools to continue to run at 100%, tho it will cause the gluster CPU use to increase to ~90%. ‘cat’ of course uses less CPU itself (~14%) while gzip will use more (~60%) tho decreasing gluster’s use enormously.
Using ‘gzip’ seems to provide the better tradeoff since it decreases gluster use to ~1% and reduces the output dramatically as well. Note that this will obviously increase the total CPU use to more than 1 CPU, so you may have to modify your scheduler/CPU allocation to accomodate this.
Gluster does provide some performance options that seem to address this problem and I did try them, setting them to:
However, they did not seem to help at all and I’d still like an explanation
of what they’re supposed to do.
The upshot is that this seems like, if not a gluster bug, then at least an
opportunity to improve gluster performance considerably by consolidating small writes into larger ones.