
Gluster tiering and small file performance

Gluster
2016-10-31

Gluster can have trouble delivering good performance for small-file workloads. The problem is acute for features such as tiering and RDMA, which employ expensive hardware such as SSDs or InfiniBand. In such workloads the hardware’s benefits go unrealized, so there is little return on the investment.

A major contributing factor to this problem has been excessive network overhead in fetching file and directory metadata. The aggregated cost of these fetches exceeds the benefit of the hardware’s accelerated data transfers. Each such fetch is called a LOOKUP. Note that for larger files the picture changes: the improved transfer times outweigh the LOOKUP costs, so in those cases the RDMA and tiering features work well.

The chart below depicts the problem with RDMA: large-file read workloads perform well, while small-file read workloads perform poorly.

[Chart: RDMA read performance for large-file versus small-file workloads]

The following examples use the “smallfile” [1] utility as a workload generator. I run a large 28-brick tiered volume, “vol1”. The configuration’s hot tier is a 2 × 2 set of RAM-disk bricks, and the cold tier is a 2 × (8 + 4) set of HDDs. I run from a single client, mounted over FUSE. The entire working set of files resides on the hot tier. The experiments using tiering can also be found in the SNIA SDC presentation [3].
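For orientation, here is a rough sketch of how a volume of that shape could be created and mounted. The hostnames and brick paths are placeholders rather than my actual test rig, and the tier-attach syntax shown is the Gluster 3.8-era form, so treat this as an illustration rather than a record of the exact commands used.

# Cold tier: 2 x (8 + 4) dispersed, i.e. two disperse sets of 12 bricks
# (8 data + 4 redundancy each). Placeholder hosts/paths; a real deployment
# would interleave bricks across servers.
$ gluster volume create vol1 disperse 12 redundancy 4 \
      server{1..4}:/bricks/hdd{1..6}/cold
$ gluster volume start vol1

# Hot tier: 2 x 2 replicated RAM-disk bricks (4 bricks total).
$ gluster volume tier vol1 attach replica 2 \
      server{1..4}:/bricks/ramdisk/hot

# FUSE mount on the client running the benchmark.
$ mount -t glusterfs <server>:/vol1 /mnt/p66.b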

Running Gluster’s profiler against the tiered volume gives a count of the LOOKUPs and illustrates the problem.
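Profiling has to be enabled on the volume before “info” returns statistics; assuming it is not already running:

$ gluster volume profile vol1 start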

$ ./smallfile_cli.py --top /mnt/p66.b --host-set gprfc066 --threads 8 \
  --files 5000 --file-size 64 --record-size 64 --fsync N --operation read
$ gluster volume profile vol1 info cumulative | grep -E 'Brick|LOOKUP'
...
Brick: gprfs018:/t4
     93.29     386.48 us     100.00 us    2622.00 us          20997      LOOKUP
...

Roughly 20K LOOKUPs are sent to each brick on the first run.
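To total the LOOKUP traffic across all 28 bricks instead of reading each brick’s line, the call count (the second-to-last column of the profile output above) can be summed with a one-liner along these lines:

$ gluster volume profile vol1 info cumulative | \
      awk '$NF == "LOOKUP" {sum += $(NF-1)} END {print sum, "LOOKUP calls in total"}'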

The purpose of most LOOKUPs is to confirm the existence and permissions of a given directory or file. The client sends such a LOOKUP for each level of the path. This phenomenon has been dubbed the “path traversal problem.” It is a well-known issue with distributed storage systems [2]. The round-trip time for each LOOKUP is not small, and the cumulative effect is large. Alas, Gluster has suffered from it for years.

The smallfile_cli.py utility opens a file, does an I/O, and then closes it. The path is 4 levels deep (p66/file_srcdir/gprfc066/thrd_00/<file>).

The 20K figure can be derived: there are 5000 files and 4 path levels per file, so 5000 × 4 = 20K LOOKUPs.

The DHT and tier translators must determine which brick a file resides on. To do this, the first LOOKUP for a file is sent to all subvolumes. The brick that has the file is called the “cached subvolume”. Normally it is the one predicted by the distributed hash algorithm, unless the set of bricks has recently changed. Subsequent LOOKUPs are sent only to the cached subvolume.
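From the client, one way to see which brick currently holds a file (its cached subvolume) is the pathinfo virtual extended attribute exposed on the FUSE mount; the file name below is just a placeholder:

$ getfattr -n trusted.glusterfs.pathinfo \
      /mnt/p66.b/file_srcdir/gprfc066/thrd_00/<file>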

Regardless of this, the cached subvolume still receives as many LOOKUPs as there are path levels, due to the path traversal problem. So when the test is run a second time, gluster profile still shows 20K LOOKUPs, but only on the hot-tier bricks (the tier translator’s cached subvolume) and nearly none on the cold tier. The round trips are still there, and the overall problem persists.
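To look at just the second run rather than cumulative totals, the profiler’s interval view can help if your release supports it; the sub-command name here is an assumption worth checking against the gluster CLI help on your version:

$ gluster volume profile vol1 info incremental | grep -E 'Brick|LOOKUP'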

To cope with this “lookup amplification”, a project has been underway to improve Gluster’s metadata cache translator (md-cache) so that the stat information returned by LOOKUPs can be cached indefinitely on the client. This solution requires client-side cache entries to be invalidated if another client modifies a file or directory. The invalidation mechanism is called an “upcall.” It is complex and has taken time to write, but as of October 2016 this new functionality is largely code complete and available in Gluster upstream.

Enabling upcall in md-cache:

$ gluster volume set <volname> features.cache-invalidation on
$ gluster volume set <volname> features.cache-invalidation-timeout 600
$ gluster volume set <volname> performance.stat-prefetch on
$ gluster volume set <volname> performance.cache-samba-metadata on
$ gluster volume set <volname> performance.cache-invalidation on
$ gluster volume set <volname> performance.md-cache-timeout 600
$ gluster volume set <volname> network.inode-lru-limit <big number here>

In the example, I used 90000 for the inode-lru-limit.
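To confirm which options actually took effect, the values can be read back from the volume on recent releases; for example:

$ gluster volume get <volname> all | \
      grep -E 'cache-invalidation|stat-prefetch|cache-samba-metadata|md-cache-timeout|inode-lru-limit'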

At the time of this writing, a cache entry will still expire after 5 minutes. The code will eventually be changed to allow an entry to never expire; that functionality will come once more confidence is gained in the upcall feature.

With this enabled, gluster profile shows the number of LOOKUPs dropping to a negligible level on all subvolumes. As reported by the smallfile_cli.py benchmark, this translates directly into better throughput for small-file workloads. Your mileage may vary, but in my experiments I saw tremendous improvements, and the SSDs’ benefits were finally realized.

[Chart: small-file throughput improvement with md-cache invalidation enabled]

Tuning notes

  • The number of UPCALLs and FORGETs is now visible using Gluster’s profiler (see the example after this list).
  • The md-cache hit/miss statistics are visible via a statedump; since md-cache is a client-side translator, run this on the client:

$ kill -USR1 `pgrep gluster`

# wait a few seconds for the dump file to be created

$ find /var/run/gluster -name \*dump\* -exec grep -E 'stat_miss|stat_hit' {} \;
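The profiler counts mentioned in the first bullet can be pulled out with a grep; the exact FOP names that appear may vary between releases, so adjust the pattern if nothing matches:

$ gluster volume profile vol1 info cumulative | grep -E 'Brick|UPCALL|FORGET'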

Some caveats

  • The md-cache solution requires client-side memory, something not all users can dedicate.
  • The “automated” part of Gluster tiering is slow. Files are moved between tiers by a single-threaded engine, and the SQL query it uses runs in time linear in the number of files. So the set of files residing on the hot tier must be stable (see the status check after this list for one way to watch migration activity).
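The tier status command reports counts of promoted and demoted files (the output fields vary by release):

$ gluster volume tier vol1 status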

[1] The “smallfile” benchmark utility

[2] “Ceph: Reliable, Scalable, and High-Performance Distributed Storage”, section 4.1.2.3

[3] SNIA SDC 2016, “Challenges with persistent memory in distributed storage systems”

 
