by on May 9, 2014

Gluster scale-out tests: an 84 node volume

This post describes recent tests done by Red Hat on an 84 node gluster volume.  Our experiments measured performance characteristics and management behavior. To our knowledge, this is the largest performance test ever done under controlled conditions within the organization (we have heard of larger clusters in the community but do not know any details about them).

Red Hat officially supports up to 64 gluster servers in a cluster and our tests exceed that. But the problems we encounter are not theoretical. The scalability issues appeared to be related to the number of bricks, not the number of servers. If a customer was to use just 16 servers, but have 60 drives on each, they would have 960 bricks and likely see similar issues to what we found.

Summary: With one important exception, our tests show gluster scales linearly on common I/O patterns. The exception is on file create operations. On creates, we observe network overhead increase as the cluster grows. This issue appears to have a solution and a fix is forthcoming.

We also observe that gluster management operations become slower as the number of nodes increases. bz 1044693 has been opened for this.  However, we were using the shared local disk on the hypervisor, rather than the disk dedicated to the VM. When this was changed, performance of the commands increased, e.g. 8 seconds.

Configuration

Configuring an 84 node volume is easier said than done. Our intention was to build a methodology (tools and procedures) to spin up and tear down a large cluster of gluster servers at will.

We do not have 84 physical machines available. But our lab does have very powerful servers (described below). They can run multiple gluster servers at a time in virtual machines.  We ran 12 such VMs on each physical machine. Each virtual machine was bound to its own disk and CPU. Using this technique, we are able to use 7 physical servers to test 84 nodes.

Tools to setup and manage clusters of this many virtual machines are nascent. Much configuration work must be done by hand. The general technique is to create a “golden copy” VM and “clone” it many times. Care must be taken to keep track of IP addresses, host names, and the like. If a single VM is misconfigured , it can be difficult to locate the problem within a large cluster.

Puppet and Chef are good candidates to simplify some of the work, and vagrant can create virtual machines and do underlying resource management, but everything still must be tied together and programmed. Our first implementation did not use the modern tools. Instead, crude but effective bash, expect, and kickstart scripts were written. We hope to utilize puppet in the near term with the help from gluster configuration management guru James Shubin. If you like ugly scripts, they may be found here.

One of the biggest problem areas in this setup was networking. When KVM creates a Linux VM, a hardware address and virtual serial port console exist, and an IP address can be obtained using DHCP. But we have a limited pool of IP addresses on our public subnet- and our lab’s system administrator frowns upon 84 new IP addresses being allocated out of the blue.  Worse, the public network is 1GbE ethernet – too slow for performance testing.

To workaround those problems, we utilized static IP addresses on a private 10GbE ethernet network. This network has its own subnet and is free from lab restrictions. It does not have a DHCP server. To set the static IP address, we wrote an “expect” script which logs into the VM over the serial line, and modifies the network configuration files.

At the hypervisor level, we manually set up the virtual bridge, the disk configurations, and set the virtual-host “tuned” profile.

Once the system was built, it quickly became apparent that another set of tools would be needed to manage the running VMs. For example, it is sometimes necessary to run the same command across all 84 machines. Bash scripts was written to that end, though other tools (pdsh) could have been used.

Test results

With that done, we were ready to do some tests. Our goals were:

  1. To confirm gluster “scales linearly” for large and small files- as new nodes are added, performance increases accordingly
  2. To examine behavior on large systems. Do all the management commands work?

Large files tests: gluster scales nicely.

scaling

Small file tests: gluster scales on reads, but not write-new-file.

smf-scaling

Oops. Small file writes are not scaling linerally. Whats going on here?

Looking at wireshark traces, we observed many LOOKUP calls sent to each of the nodes for every file create operation. As the number of nodes increased, so did the number of LOOKUPs. It turns out that the gluster client was doing a multicast lookup on every node on creates. It does this to confirm the file does not already exist.

The gluster parameter “lookup-unhashed” forces DHT hashing to be used. This will send the LOOKUP to the node where the new file should reside, rather than doing a multicast to all nodes. Below are the results when this setting is enabled.

write-new-file test results with the parameter set (red line). Much better!

lookup-unhashed

 

This parameter is dangerous. If the cluster’s brick topography has changed and the rebalancing was aborted, gluster may find itself in a situation believing a file does not exist, when it really does. In other words, the LOOKUP existence test would generate a false negative because DHT would have the client look to the wrong nodes. This could result in two GFIDs being accessible by the same path.

A fix is being written. It will assign generation counts to bricks. By default DHT will be used on lookups. But if the generation counts indicated a topography change had taken place on the target bricks, the client will revert to the slower broadcast mode of operation.

We observed any management commands dealing with the volume took as long as a minute. For example, the “gluster import” command on the oVirt UI takes more then 30 seconds to complete. Bug 1044693 was opened for this. In all cases the management command worked, but was very slow. See note above in red.

Future

Some additional tests we could do were suggested by gluster engineers. This would be future work:

  1. object enumeration – how well does “ls” scale for large volumes?
  2. What is the largest number of small objects (files) that a machine can handle before it makes sense to add a new node
  3. Snapshot testing for scale-out volumes
  4. Openstack behavior – what happens when the number of VMs goes up? We would look at variance and latency for the worse case.

Proposals to do larger scale-out tests:

  • We could present to gluster volumes partitions of disks. For example, a single 1TB drive could be divided into 10 100GB drives.  This could boost the cluster size by an order of magnitude. Given the disk head would be shared by multiple servers, this technique would only make sense for random I/O tests (where the head is already under stress).
  • Experiment with running gluster servers within containers.

Hardware

Gluster volumes are constructed out out a varying number of bricks embedded within separate virtual machines.   Each virtual machine has:

  • dedicated 7200-RPM SAS disk for Gluster brick
  • a file on hypervisor system disk for the operating system image
  • 2 Westmere or Sandy Bridge cores
  • 4 GB RAM

The KVM hosts are 7 standard Dell R510/R720 servers with these attributes:

  • 2-socket Westmere/Sandy Bridge Intel x86_64
  • 48/64 GB RAM
  • 1 10-GbE interface with jumbo frames (MTU=9000)
  • 12 7200-RPM SAS disks configured in JBOD mode from a Dell PERC H710 (LSI MegaRAID)

For sequential workloads, we use only 8 out of 12 guests in each host so that aggregate disk bandwidth never exceeds network bandwidth.

Clients are 8 standard Dell R610/R620 servers with:

  • 2-socket Westmere/Sandy Bridge Intel x86_64
  • 64 GB RAM
  • 1 10-GbE Intel NIC interface with jumbo frames (MTU=9000)

5 Comments

  1. James says:

    This is a really good article! Thanks for the nod :)

  2. Keith says:

    Interesting paper. I have a few followup questions.

    1) as the number of bricks increased, were they eveninly distributed across the physical servers?
    2) what parameters where used to create the volume? Was it recreated each time?
    3) what was the backend file system?

    Thanks,
    Keith

  3. 1) yes, also the ratio of one brick per server was maintained

    2) created a volume with a single brick, then added bricks to the volume one at a time. So the test did not create the volume all at once for 84 servers.

    3) xfs

  4. Liang Zhou says:

    Amazing! I like your sharing. We are also doing some benchmark testing on GlusterFS. We found the bottle neck is the network. Your virtual machine idea truly inspried me.
    I have some questions
    1.What is the volume strategy in your test? and for the production environment, which strategy would you prefer?
    2.I suppose the bandwidth result is the plus of all clients. and what is the result if we only run one client at a time.

  5. We also had network bottlenecks, but switching to 10GbE fixed that.

    We have a single volume for all 84 nodes. This parallelized the I/O as much as possible across nodes. We did not replicate the volume for these performance test. In a production environment we would want to replicate the volume.

    I do not know what the result would be for one client at a time. We would not be initiating as much I/O so I presume would be 1/8th as large.

Leave a Reply

Your email address will not be published.