GlusterFS 3.7.0 saw the release of the sharding feature, among several others. The feature was tagged as “experimental” as it was still in the initial stages of development back then. Here is some introduction to the feature:
Why shard translator?
GlusterFS’ answer to very large files (those which can grow beyond a single brick) had never been clear. There is a stripe translator which allows you to do that, but it comes at the cost of flexibility: you can add servers only in multiples of stripe-count x replica-count, and mixing striped and unstriped files is not possible in an “elegant” way. This also happens to be a big limiting factor for the big data/Hadoop use case, where super large files are the norm (and where you want to split a file even if it could fit within a single server). The proposed solution for this is to replace the current stripe translator with a new “shard” translator.
What?
Unlike stripe, shard is not a cluster translator; it is placed on top of DHT. Initially, all files are created as normal files, up to a certain configurable size. The first block (default 4MB) is stored like a normal file under its parent directory. Further blocks are each stored as a separate file in a hidden namespace, named after the original file’s GFID and the block index (like /.shard/GFID1.1, /.shard/GFID1.2 … /.shard/GFID1.N). File IO to a particular offset is directed to the appropriate “piece file”, creating it if necessary. The aggregated file size and block count are stored in an xattr on the original file.
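To make the layout concrete: the block that a given byte offset falls in is simply the offset divided by the configured block size (integer division), with block 0 being the original file under its parent directory. As a purely illustrative check (this is not the translator’s actual code, which lives in C inside the shard xlator), with a 16MB block size a write at byte offset 50MB would land in /.shard/GFID1.3:
# echo $(( 52428800 / 16777216 ))
3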
Usage:
Here I have a 2×2 distributed-replicated volume.
# gluster volume info
Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 96001645-a020-467b-8153-2589e3a0dee3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/1
Brick2: server2:/bricks/2
Brick3: server3:/bricks/3
Brick4: server4:/bricks/4
Options Reconfigured:
performance.readdir-ahead: on
To enable sharding on it, this is what I do:
# gluster volume set dis-rep features.shard on
volume set: success
Now, to configure the shard block size to 16MB, this is what I do:
# gluster volume set dis-rep features.shard-block-size 16MB
volume set: success
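Assuming the two ‘volume set’ operations succeeded as above, both options should now be listed under “Options Reconfigured” in the volume info output, alongside performance.readdir-ahead. A quick way to confirm (output omitted here):
# gluster volume info dis-rep | grep shard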
How files are sharded:
Now I write 84MB of data into a file named ‘testfile’.
# dd if=/dev/urandom of=/mnt/glusterfs/testfile bs=1M count=84
84+0 records in
84+0 records out
88080384 bytes (88 MB) copied, 13.2243 s, 6.7 MB/s
Let’s check the backend to see how the file was sharded to pieces and how these pieces got distributed across the bricks:
# ls /bricks/* -lh
/bricks/1:
total 0
/bricks/2:
total 0
/bricks/3:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile
/bricks/4:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile
So the file hashed to the second replica set (brick3 and brick4, which form a replica pair) and is 16M in size. Where did the remaining 68MB worth of data go? To find out, let’s check the contents of the hidden directory .shard on all the bricks:
# ls /bricks/*/.shard -lh
/bricks/1/.shard:
total 37M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5
/bricks/2/.shard:
total 37M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5
/bricks/3/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4
/bricks/4/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4
So, the file was basically split into 6 pieces: 5 of them residing in the hidden directory “/.shard”, distributed across the replica sets based on disk space availability and each shard file’s name hash, and the first block residing in its native parent directory. Notice how shards 1 through 4 are all 16M in size and the last shard (block 5) is 4M in size.
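As noted earlier, the aggregated file size and block count are maintained as extended attributes on the base file. If you want to verify this on the backend, a getfattr on the brick copy of ‘testfile’ should show them; the attribute names to look for (at the time of writing, and subject to change) are trusted.glusterfs.shard.block-size and trusted.glusterfs.shard.file-size. Output is omitted here since the values are hex-encoded:
# getfattr -d -m . -e hex /bricks/3/testfile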
Now let’s do some math to see how ‘testfile’ was “sharded”:
The total size of the write was 84MB, and the configured block size in this case is 16MB. So 84MB divided by 16MB gives 5, with a remainder of 4MB.
So the file was broken into 6 pieces in all, with the last piece holding 4MB of data and the rest 16MB each.
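Put differently, the number of pieces is the file size divided by the block size, rounded up (a ceiling division). A quick sanity check of the numbers above:
# echo $(( (84 + 16 - 1) / 16 ))
6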
Now, when we view the file from the mount point, it appears as one single file:
# ls -lh /mnt/glusterfs/
total 85M
-rw-r--r--. 1 root root 84M Dec 24 12:36 testfile
Notice how the file is shown to be 84MB in size at the mount point. Similarly, when the file is read by an application, the different pieces or ‘shards’ are stitched together and presented to the application as if no chunking had been done at all.
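The same applies to reads at arbitrary offsets: the shard translator works out which piece holds the requested range and serves the data from there. For example, reading 1MB starting at the 70MB mark of ‘testfile’ would be served from shard 4 (70 divided by 16 rounds down to 4), without the application ever knowing about the piece files:
# dd if=/mnt/glusterfs/testfile of=/dev/null bs=1M skip=70 count=1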
Advantages of sharding:
The advantages of sharding a file over striping it across a finite set of bricks are: