All posts tagged: qemu


Posted on November 26, 2013

GlusterFS Block Device Translator

Block device translator

Block device translator (BD xlator) is a new translator, recently added to GlusterFS, that provides a block backend for GlusterFS. It replaces the existing bd_map translator, which provided similar but very limited functionality. GlusterFS normally expects the underlying brick to be formatted with a POSIX-compatible file system. BD xlator changes that and allows bricks to be raw block devices, such as LVM logical volumes, which need not have any file system on them. With BD xlator, it therefore becomes possible to build a GlusterFS volume comprising bricks that are logical volumes (LVs).


BD xlator maps underlying LVs to files, so the LVs appear as files to GlusterFS clients. Though a BD volume externally appears very similar to a usual POSIX volume, not all operations are supported or possible for the files on a BD volume. Only those operations that make sense for a block device are supported, and the exact semantics are described in the subsequent sections.

While a POSIX volume takes a file system directory as its brick, a BD volume needs a volume group (VG) as its brick. In the usual use case, a file created on a BD volume results in an LV being created in the brick VG. In addition to a VG, a BD volume also needs a file system directory, which must be specified at volume creation time. This directory is necessary for supporting the notion of directories and directory hierarchy for the BD volume. Metadata about the LVs (size, mapping info) is stored in this directory.

BD xlator was mainly developed to use block devices directly as VM images when GlusterFS is used as storage for KVM virtualization. Some of the salient points of BD xlator are

  • Since BD supports file level snapshots and clones by leveraging the snapshot and clone capabilities of LVM, it can be used to fully off-load snapshot and cloning operations from QEMU to the storage (GlusterFS) itself.
  • BD understands dm-thin LVs and hence can support files that are backed by thinly provisioned LVs. This capability of BD xlator translates to having thinly provisioned raw VM images.
  • BD enables thin LVs from a thin pool to be used from multiple nodes that have visibility to GlusterFS BD volume. Thus thin pool can be used as a VM image repository allowing access/visibility to it from multiple nodes.
  • BD supports true zerofill by using the BLKZEROOUT ioctl on the underlying block devices. Thus BD allows SCSI WRITE SAME to be used on the underlying block device if the device supports it.

Though BD xlator is primarily intended to be used with block devices, it does provide full POSIX xlator compatibility for files that are created on a BD volume but are not backed by or mapped to a block device. Such files, which don't have a block device mapping, live in the POSIX directory that is specified during BD volume creation.

Availability

BD xlator, developed by M. Mohan Kumar, was committed to the GlusterFS git repository in November 2013 and is expected to be part of the upcoming GlusterFS-3.5 release.

Compiling BD translator

BD xlator needs the lvm2 development library. The --enable-bd-xlator option can be used with the ./configure script to explicitly enable the BD translator. The following snippet from the configure script output shows that the BD xlator is enabled for compilation.

GlusterFS configure summary
===================

Block Device xlator  : yes
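A minimal build sketch, assuming the GlusterFS sources are checked out; the package name for the lvm2 development headers and the exact steps may vary by distribution:

# Install the LVM2 development headers (package name is an example)
yum install lvm2-devel

# From the top of the GlusterFS source tree
./autogen.sh
./configure --enable-bd-xlator
make && make install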

Creating a BD volume

BD supports hosting both linear LVs and thin LVs within the same volume. However, I will show them separately in the following instructions. As noted above, the prerequisite for a BD volume is a VG, which I am creating here from a loop device, but any other block device would do as well.

1. Creating BD volume with linear LV backend

- Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop

- Prepare a brick by creating a VG

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg /dev/loop0

- Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta

It is recommended that this directory is created on an LV in the brick VG itself so that both data and metadata live together on the same device.
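For example, one way to follow that recommendation is to carve a small LV out of the brick VG and mount it on the metadata directory created above; the LV name and size below are illustrative, not from the original setup:

# Create a small LV in the brick VG to hold the BD metadata directory (example name/size)
lvcreate -L 64M -n bd-meta-lv bd-vg
mkfs.ext4 /dev/bd-vg/bd-meta-lv
# /bd-meta was already created above with mkdir
mount /dev/bd-vg/bd-meta-lv /bd-meta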

Create and mount the volume
[root@bharata ~]# gluster volume create bd bharata:/bd-meta?bd-vg force

The general syntax for specifying the brick is host:/posix-dir?volume-group-name where “?” is the separator.

[root@bharata ~]# gluster volume start bd
[root@bharata ~]# gluster volume info bd

Volume Name: bd
Type: Distribute
Volume ID: cb042d2a-f435-4669-b886-55f5927a4d7f
Status: Started
Xlator 1: BD
Capability 1: offload_copy
Capability 2: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta
Brick1 VG: bd-vg

[root@bharata ~]# mount -t glusterfs bharata:/bd /mnt

- Create a file that is backed by an LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Since the volume is empty now, so is the underlying VG.
[root@bharata ~]# lvdisplay bd-vg
[root@bharata ~]#

Creating a file that is mapped to an LV is a 2-step operation: first the file is created on the mount point, and then a specific extended attribute is set to map the file to an LV.

[root@bharata ~]# touch /mnt/lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "lv" /mnt/lv

Now an LV has been created in the brick VG, and the file /mnt/lv maps to this LV. Any read/write to this file ends up as a read/write to the underlying LV.
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                      available
# open                          0
LV Size                    4.00 MiB
Current LE                   1
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

The file gets created with the default LV size of 1 LE (logical extent), which is 4 MB in this case.
[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 4.0M Nov 26 16:15 /mnt/lv

truncate can be used to set the required file size.
[root@bharata ~]# truncate /mnt/lv -s 256M
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                       available
# open                           0
LV Size                     256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 256M Nov 26 16:15 /mnt/lv

The size of the file/LV can be specified during creation/mapping time itself like this:
setfattr -n "user.glusterfs.bd" -v "lv:256MB" /mnt/lv

2. Creating BD volume with thin LV backend

- Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop-thin count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop-thin

- Prepare a brick by creating a VG and thin pool

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg-thin /dev/loop0

Create a thin pool
[root@bharata ~]# lvcreate --thin bd-vg-thin -L 1000M
Rounding up size to full physical extent 4.00 MiB
Logical volume "lvol0" created

lvdisplay shows the thin pool
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name                       lvol0
VG Name                      bd-vg-thin
LV UUID                        HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID  0
LV Pool metadata          lvol0_tmeta
LV Pool data                  lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                          0
LV Size                          1000.00 MiB
Allocated pool data     0.00%
Allocated metadata     0.88%
Current LE                   250
Segments                     1
Allocation                     inherit
Read ahead sectors     auto
- currently set to         256
Block device                253:9

- Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta-thin

Create and mount the volume
[root@bharata ~]# gluster volume create bd-thin bharata:/bd-meta-thin?bd-vg-thin force
[root@bharata ~]# gluster volume start bd-thin
[root@bharata ~]# gluster volume info bd-thin

Volume Name: bd-thin
Type: Distribute
Volume ID: 27aa7eb0-4ffa-497e-b639-7cbda0128793
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta-thin
Brick1 VG: bd-vg-thin
[root@bharata ~]# mount -t glusterfs bharata:/bd-thin /mnt

- Create a file that is backed by a thin LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Creating a file that is mapped to a thin LV is a 2-step operation: first the file is created on the mount point, and then a specific extended attribute is set to map the file to a thin LV.

[root@bharata ~]# touch /mnt/thin-lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "thin:256MB" /mnt/thin-lv

Now /mnt/thin-lv is a thin provisioned file that is backed by a thin LV.
[root@bharata ~]# lvdisplay bd-vg-thin
--- Logical volume ---
LV Name                        lvol0
VG Name                       bd-vg-thin
LV UUID                         HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID 1
LV Pool metadata         lvol0_tmeta
LV Pool data                 lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                         0
LV Size                         1000.00 MiB
Allocated pool data    0.00%
Allocated metadata    0.98%
Current LE                  250
Segments                    1
Allocation                    inherit
Read ahead sectors   auto
- currently set to        256
Block device               253:9

--- Logical volume ---
  LV Path                     /dev/bd-vg-thin/081b01d1-1436-4306-9baf-41c7bf5a2c73
LV Name                        081b01d1-1436-4306-9baf-41c7bf5a2c73
VG Name                       bd-vg-thin
LV UUID                         coxpTY-2UZl-9293-8H2X-eAZn-wSp6-csZIeB
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:43:19 +0530
LV Pool name                 lvol0
LV Status                       available
# open                           0
  LV Size                     256.00 MiB
Mapped size                  0.00%
Current LE                    64
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
- currently set to          256
Block device                 253:10

As can be seen above, creating the file resulted in a thin LV being created in the brick.

Snapshots and clones

BD xlator uses LVM snapshot and clone capabilities to provide file-level snapshots and clones for files on a GlusterFS volume. Snapshots and clones work only for those files that have already been mapped to an LV. In other words, snapshots and clones aren't available for POSIX-only files that exist on a BD volume.

Creating a snapshot

Say we are interested in taking a snapshot of a file /mnt/file that already exists and has been mapped to an LV.

[root@bharata ~]# ls -l /mnt/file
-rw-r--r--. 1 root root 268435456 Nov 27 10:16 /mnt/file

[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                        /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                      abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

Snapshot creation is a two step process.

- Create a snapshot destination file first
[root@bharata ~]# touch /mnt/file-snap

- Then take the actual snapshot
In order to create the actual snapshot, we need to know the GFID of the snapshot file.

[root@bharata ~]# getfattr -n glusterfs.gfid.string  /mnt/file-snap
getfattr: Removing leading '/' from absolute path names
# file: mnt/file-snap
glusterfs.gfid.string="bdf74e38-dc96-4b26-94e2-065fe3b8bcc3"

Use this GFID string to create the actual snapshot
[root@bharata ~]# setfattr -n snapshot -v bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 /mnt/file
[root@bharata ~]# lvdisplay bd-vg
--- Logical volume ---
LV Path                         /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                       abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV snapshot status   source of bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 [active]
LV Status                       available
# open                           0
LV Size                           256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

--- Logical volume ---
LV Path                        /dev/bd-vg/bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
LV Name                      bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
VG Name                      bd-vg
LV UUID                        9XH6xX-Sl64-uNhk-7OiH-f91m-DaMo-6AWiBD
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:20:35 +0530
LV snapshot status  active destination for abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
COW-table size             4.00 MiB
COW-table LE               1
Allocated to snapshot  0.00%
Snapshot chunk size    4.00 KiB
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
- currently set to          256
Block device                 253:7

As can be seen from the lvdisplay output, /mnt/file-snap is now a snapshot of /mnt/file.

Creating a clone

Creating a clone is similar to creating a snapshot, except that the "clone" attribute name should be used instead of "snapshot".

setfattr -n clone -v <gfid-of-clone-file> <path-to-source-file>

A clone in a BD volume is essentially a server-offloaded full copy of the file.
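Mirroring the snapshot steps above, a clone would be taken roughly like this; the clone file name is only an example:

# Create the clone destination file and read its GFID
touch /mnt/file-clone
getfattr -n glusterfs.gfid.string /mnt/file-clone

# Use the GFID printed above as the value of the "clone" attribute on the source file
setfattr -n clone -v <gfid-of-clone-file> /mnt/file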

Concerns

As you have seen, creating a block-device-backed file on a BD volume, as well as creating snapshots and clones, involves non-standard steps, including setting extended attributes. These steps can be cumbersome for an end user, and there are plans to encapsulate all of this in friendlier APIs that users can consume easily.


Posted on August 20, 2013

Troubleshooting QEMU-GlusterFS

As described in my previous blog post, QEMU supports talking to GlusterFS using libgfapi, which is a much better way of using GlusterFS to host VM images than accessing GlusterFS volumes via a FUSE mount. However, due to some bugs in GlusterFS-3.4, any invalid specification of a GlusterFS drive on the QEMU command line can result in completely non-obvious error messages from QEMU. The aim of this blog post is to document some of these known error scenarios in QEMU-GlusterFS so that users can figure out what could potentially be wrong in their QEMU-GlusterFS setup.

As of this writing (Aug 2013), I know about two bugs in GlusterFS that cause all this confusion in the failure messages:

  • A bug that results in GlusterFS log messages not reaching stderr because of which no meaningful failure messages are shown from QEMU. This bug has been fixed, and it should eventually make it to some release (probably 3.4.1) of GlusterFS.
  • A bug in a libgfapi routine called glfs_init() that results in returning failure with errno set to 0. QEMU depends on errno here and this leads to unpleasant segmentation faults in QEMU.

Environment

I am using Fedora 19 with distro-provided QEMU and GlusterFS rpms for all the experiments here.

# rpm -qa | grep qemu
qemu-system-x86-1.4.2-5.fc19.x86_64
ipxe-roms-qemu-20130517-2.gitc4bce43.fc19.noarch
qemu-kvm-1.4.2-5.fc19.x86_64
qemu-guest-agent-1.4.2-3.fc19.x86_64
libvirt-daemon-driver-qemu-1.0.5.1-1.fc19.x86_64
qemu-img-1.4.2-5.fc19.x86_64
qemu-common-1.4.2-5.fc19.x86_64

# rpm -qa | grep gluster
glusterfs-3.4.0-8.fc19.x86_64
glusterfs-libs-3.4.0-8.fc19.x86_64
glusterfs-fuse-3.4.0-8.fc19.x86_64
glusterfs-server-3.4.0-8.fc19.x86_64
glusterfs-api-3.4.0-8.fc19.x86_64
glusterfs-debuginfo-3.4.0-8.fc19.x86_64
glusterfs-cli-3.4.0-8.fc19.x86_64

Error scenarios

1. glusterd service not started

On Fedora 19, the glusterd service fails to start during boot, and if you don't notice that, you can end up using QEMU with GlusterFS while the glusterd service isn't running. In this scenario, you will encounter the following kind of failure:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available

2. GlusterFS volume not started

If QEMU is used to boot a VM image on GlusterFS volume which is not in started state, the following kind of error is seen:

[root@localhost ~]# gluster volume status test
Volume test is not started
[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available

3. QEMU run as non-root user

As of this writing (Aug 2013), GlusterFS doesn't allow non-root users to access GlusterFS volumes via libgfapi without some manual settings. This is a very typical scenario for most users who don't use QEMU directly but use it via libvirt and oVirt. In this scenario, the following kind of error is seen:

[bharata@localhost ~]$ qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available

As you can see, 3 different failure scenarios produced similar error messages. This will improve once error logs from GlusterFS start reaching QEMU when GlusterFS with the fix is used.

4. Specifying non-existing GlusterFS volume or server

Specifying invalid or wrong server or volume names can cause nasty failures like this:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test-xxx image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster-xxx port=0 volume=test image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

This should get fixed when the bug in libgfapi is resolved.


Posted on August 7, 2013

UNMAP/DISCARD support in QEMU-GlusterFS

In my last blog post on QEMU-GlusterFS, I described the integration of QEMU with GlusterFS using libgfapi. In this post, I give an overview of the recently added discard support in QEMU's GlusterFS back-end and how it can be used. Newer SCSI devices support the UNMAP command, which is used to return unused/freed blocks back to the storage. This command is typically useful when the storage is thin provisioned, such as a thin-provisioned SCSI LUN. In response to a file deletion, the host device driver can send down an UNMAP command to the SCSI target and instruct it to free the relevant blocks from the thin-provisioned LUN. This leads to much better utilization of the storage.

Linux support for discard

In Linux, SCSI UNMAP is supported via the generic discard framework, which I believe is also used to support the ATA TRIM command. The ATA TRIM command, typically used with SSDs, isn't the topic of this blog. There are multiple ways in which discard functionality is invoked or used in Linux.

  • For direct block devices, one can use the BLKDISCARD ioctl to release unused blocks.
  • File systems like EXT4 support file-level discard via the FALLOC_FL_PUNCH_HOLE flag of the fallocate system call.
  • For releasing unused blocks at the file-system level, the fstrim command can be used.
  • Finally, file systems like EXT4 also support a 'discard' mount option that controls whether the file system issues discard requests to the underlying block device as blocks are freed. (Illustrative commands for these interfaces follow below.)
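A rough sketch of how these interfaces are exercised from the command line; the device names, paths and sizes are examples only, and blkdiscard in particular destroys data:

# Discard every block of a raw block device (destroys data - example only)
blkdiscard /dev/sdX

# Punch a hole (discard) over the first 1 MiB of a file, where the file system supports it
fallocate --punch-hole --offset 0 --length 1048576 /path/to/file

# Trim all currently unused blocks of a mounted file system in one batch
fstrim -v /mnt

# Mount EXT4 with online discard so frees are propagated to the device as UNMAP/TRIM
mount -o discard /dev/sdX /mnt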

QEMU support for discard

UNMAP is primarily useful in two ways for KVM virtualization.

  • When a file is deleted in the VM, the resulting UNMAP in the guest is passed down to the host, which then sends the discard request to the thin-provisioned SCSI device. As a result, the blocks consumed by the deleted file are returned to the SCSI storage. The effect is the same when there is an explicit discard request from the guest using either the ioctl or the fallocate method listed in the previous section.
  • When a VM image is deleted, there is a potential to return the freed blocks back to the storage by sending the UNMAP command to the SCSI storage.

Guest UNMAP requests will end up in QEMU only if the guest uses a scsi or virtio-scsi device, not a virtio-blk device. QEMU will forward the request further down to the host (device or file system) only if the 'discard=on' or 'discard=unmap' drive flag is used for the device on the QEMU command line.

Example1: qemu-system-x86_64 -drive file=/images/vm.img,if=scsi,discard=on
Example2: qemu-system-x86_64 -device virtio-scsi-pci -drive if=none,discard=on,id=rootdisk,file=gluster://host/volume/image -device scsi-hd,drive=rootdisk

The way a discard request is passed further down in the host is determined by the block driver inside QEMU that serves the disk image type. While QEMU uses fallocate(FALLOC_FL_PUNCH_HOLE) for raw file back-ends and ioctl(BLKDISCARD) for block device back-ends, other back-ends use their own interfaces to pass down the discard request.

GlusterFS support for discard

GlusterFS, starting from version 3.4, supports discard functionality through a new API called glfs_discard(glfs_fd, offset, size), which is available as part of the libgfapi library. On the GlusterFS server side, a discard request is handled differently for the posix back-end and the Block Device (BD) back-end.

For the posix back-end, fallocate(FALLOC_FL_PUNCH_HOLE) is used to eventually release the blocks to the file system. If the posix brick has been mounted with the '-o discard' option, then the discard request will eventually reach the SCSI storage, provided the storage device supports UNMAP.

Support for the BD back-end is planned to come as soon as the ongoing development work on the newer and feature-rich BD translator lands upstream. The BD translator should use ioctl(BLKDISCARD) to UNMAP the blocks.

Discard support in GlusterFS back-end of QEMU

As described in my earlier blog, QEMU starting from version 1.3 supports a GlusterFS back-end using libgfapi, which is the non-FUSE way of accessing GlusterFS volumes. Recently I added discard support to the GlusterFS block driver in QEMU, and this support should be available from QEMU-1.6 onwards. This work involved using the glfs_discard() API from the QEMU GlusterFS driver to send the discard request to the GlusterFS server.

With this, the entire KVM virtualization stack using GlusterFS back-end is enabled to take advantage of SCSI UNMAP command.

Typical usage

This section describes a typical use case with QEMU-GlusterFS where a file deleted from inside the VM results in discard requests for the host storage.

Step 1: Prepare a block device that supports UNMAP

Since I don't have a real SCSI device that supports UNMAP, I am going to use a loop device to host my GlusterFS volume. Loop devices support discard.

[root@llmvm02 bharata]# dd if=/dev/zero of=discard-loop count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.63255 s, 1.7 GB/s

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 2097152    IO Block: 4096   regular file
Device: fc13h/64531d    Inode: 15740290    Links: 1

[root@llmvm02 bharata]# losetup /dev/loop0 discard-loop

[root@llmvm02 bharata]# cat /sys/block/loop0/queue/discard_max_bytes
4294966784

Step 2: Prepare the brick directory for GlusterFS volume

[root@llmvm02 bharata]# mkfs.ext4 /dev/loop0
[root@llmvm02 bharata]# mount -o discard /dev/loop0 /discard-mnt/
[root@llmvm02 bharata]# mount  | grep discard
/dev/loop0 on /discard-mnt type ext4 (rw,discard)

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 3: Create a GlusterFS volume

[root@llmvm02 bharata]# gluster volume create discard llmvm02:/discard-mnt/ force
volume create: discard: success: please start the volume to access data

[root@llmvm02 bharata]# gluster volume start discard
volume start: discard: success

[root@llmvm02 bharata]# gluster volume info discard
Volume Name: discard
Type: Distribute
Volume ID: ed7a6f8a-9cb8-463d-a948-61974cb64c99
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: llmvm02:/discard-mnt

Step 4: Create a sparse file in the GlusterFS volume

[root@llmvm02 bharata]# glusterfs -s llmvm02 --volfile-id=discard /mnt
[root@llmvm02 bharata]# touch /mnt/file
[root@llmvm02 bharata]# truncate -s 450M /mnt/file

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 0          IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 5: Use this sparse file on GlusterFS volume as a disk drive with QEMU

[root@llmvm02 bharata]# qemu-system-x86_64 --enable-kvm -nographic -m 8192 -smp 2 -device virtio-scsi-pci -drive if=none,cache=none,id=F17,file=gluster://llmvm02/test/F17 -device scsi-hd,drive=F17 -drive if=none,cache=none,id=gluster,discard=on,file=gluster://llmvm02/discard/file -device scsi-hd,drive=gluster -kernel /home/bharata/linux-2.6-vm/kernel1 -initrd /home/bharata/linux-2.6-vm/initrd1 -append "root=UUID=d29b972f-3568-4db6-bf96-d2702ec83ab6 ro rd.md=0 rd.lvm=0 rd.dm=0 SYSFONT=True KEYTABLE=us rd.luks=0 LANG=en_US.UTF-8 console=tty0 console=ttyS0 selinux=0"

The file appears as a SCSI disk in the guest.

[root@F17-kvm ~]# dmesg | grep -i sdb
sd 0:0:1:0: [sdb] 921600 512-byte logical blocks: (471 MB/450 MiB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: 63 00 00 08
sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Step 6: Generate discard requests from the guest

Format a FS on the SCSI disk, mount it and create a file on it

[root@F17-kvm ~]# mkfs.ext4 /dev/sdb
[root@F17-kvm ~]# mount -o discard /dev/sdb /guest-mnt/
[root@F17-kvm ~]# dd if=/dev/zero of=/guest-mnt/file bs=1M count=400
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 1.07204 s, 391 MB/s

Now we can see the blocks count growing in the host like this:

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 846880     IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 917288     IO Block: 4096   regular file

Remove the file; this should generate discard requests that get passed down to the host, eventually resulting in the discarded blocks being released from the underlying loop device in the host.

[root@F17-kvm ~]# rm -f /guest-mnt/file

In the host,

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file'
Size: 471859200     Blocks: 46600      IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop'
Size: 1073741824    Blocks: 112960     IO Block: 4096   regular file

Thus we saw discard requests from the guest eventually resulting in blocks being released on the host block device. I was using a loop device in the host, but this should work for any SCSI device that supports UNMAP. A SCSI device with UNMAP support is not strictly necessary to get the space-saving benefit: if the underlying device is a thin logical volume (dm-thin) coming from a thin pool dm device, the space saving can be realized at the thin pool level itself.
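For example, if the brick file system sat on a thin LV carved from a thin pool (hypothetical names vg0/pool0 below), the reclaimed space could be observed at the pool level:

# Data% of the thin pool should drop once the discards reach dm-thin
lvs -o lv_name,lv_size,data_percent vg0/pool0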

Concerns with discard mount option

There have been concerns about the cost of the UNMAP operation on the storage and its detrimental effect on I/O throughput, so it is unclear whether everyone will want to turn on the discard mount option by default. I wish I had access to an UNMAP-capable storage array to really test the effect of UNMAP on I/O performance.
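One possible middle ground, which is my suggestion rather than something from the original post, is to leave the online 'discard' mount option off and trim in batches from inside the guest, for example from a periodic job:

# Inside the guest: release all currently unused blocks of the file system in one pass
fstrim -v /guest-mnt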


Posted on October 28, 2012

QEMU-GlusterFS native integration

GlusterFS is a distributed file system implemented in user space. It is not strictly a native file system in itself but rather an aggregator of different file systems. GlusterFS can aggregate individual file system mount points or directories (called bricks in gluster terminology) to provide a single unified file system namespace. In addition to NFS and CIFS, the most common way to access the GlusterFS namespace is via the FUSE-based Gluster native client.


More information on creating and mounting GlusterFS volumes can be obtained from the GlusterFS website.

GlusterFS for virtualization

Until recently, using GlusterFS volumes to host VM images and data was sub-optimal due to the FUSE overhead involved in accessing gluster volumes via the GlusterFS native client. However, this has now changed with two specific enhancements:

- A new library called libgfapi is now available as part of GlusterFS that provides POSIX-like C APIs for accessing gluster volumes. libgfapi support will be available from the GlusterFS-3.4 release.
- QEMU (starting from QEMU-1.3) will have a GlusterFS block driver that uses libgfapi, and hence there is no FUSE overhead any longer when QEMU works with VM images on gluster volumes.

GlusterFS, with its pluggable translator model, can serve as a flexible storage backend for QEMU. QEMU just has to talk to GlusterFS, and GlusterFS will hide the different file systems and storage types underneath. Various GlusterFS storage features like replication and striping will automatically be available to QEMU. Efforts are also underway to add a block device backend in Gluster via the Block Device (BD) translator, which will expose underlying block devices as files to QEMU. This allows GlusterFS to be a single storage backend for both file- and block-based storage types.

QEMU with native GlusterFS block driver

GlusterFS specification in QEMU

A VM image residing on a gluster volume can be specified on the QEMU command line using the following URI format:

gluster[+transport]://[server[:port]]/volname/image[?socket=...]

gluster is the protocol.

transport specifies the transport type used to connect to gluster management daemon (glusterd). Valid transport types are tcp, unix and rdma. If a transport type isn’t specified, then tcp type is assumed.

server specifies the server where the volume file specification for the given volume resides. This can be either hostname, ipv4 address or ipv6 address. ipv6 address needs to be within square brackets [ ]. If transport type is unix, then server field should not be specified. Instead the socket field needs to be populated with the path to unix domain socket.

port is the port number on which glusterd is listening. This is optional, and if not specified, QEMU will send 0, which will make gluster use the default port. If the transport type is unix, then port should not be specified.

volname is the name of the gluster volume which contains the VM image.

image is the path to the actual VM image that resides on gluster volume.

Examples:

gluster://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4/testvol/a.img
gluster+tcp://1.2.3.4:24007/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]/testvol/dir/a.img
gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
gluster+tcp://server.domain.com:24007/testvol/dir/a.img
gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
gluster+rdma://1.2.3.4:24007/testvol/a.img

(GlusterFS URI description and above examples are taken from QEMU documentation)

Configuring QEMU with GlusterFS backend

While building QEMU from source, in addition to the normal configuration options, ensure that the --enable-uuid and --enable-glusterfs options are specified explicitly with the ./configure script. (Update Feb 2013: A fix in the QEMU-1.3 time frame makes --enable-uuid unnecessary for GlusterFS support in QEMU.)

Update Aug 2013: Starting with QEMU-1.6, pkg-config is used to configure the GlusterFS backend in QEMU. If you are using GlusterFS compiled and installed from sources, then the GlusterFS package config file (glusterfs-api.pc) might not be present at the standard path and you will have to explicitly add the path by executing this command before running the QEMU configure script:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/

Without this, GlusterFS driver will not be compiled into QEMU even when GlusterFS is present in the system.
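Putting the above together, a rough sketch of configuring a from-source QEMU (1.6 or later) against a from-source GlusterFS might look like this; the paths are examples:

# Make the GlusterFS pkg-config file (glusterfs-api.pc) visible to QEMU's configure
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/

# From the QEMU source tree; the configure summary should report GlusterFS support as yes
./configure --enable-glusterfs
make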

Creating a VM image on GlusterFS backend

The qemu-img command can be used to create VM images on the gluster backend. The general syntax for image creation looks like this:

qemu-img create gluster://server/volname/path/to/image size

Example:

To create a raw image, qemu-img create gluster://1.2.3.4/testvol/dir/a.img 5G
To create a qcow2 image, qemu-img create -f qcow2 gluster://server.domain.com:24007/testvol/a.img 5G

Booting VM image from GlusterFS backend

A VM image a.img residing on gluster volume testvol can be booted using QEMU like this:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio

In addition to VM images, gluster drives can also be used as data drives:

qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,if=virtio -drive file=gluster://1.2.3.4/datavol/a-data.img,if=virtio

Here, a-data.img from the datavol gluster volume appears as a second drive for the guest.

Performance numbers

The following numbers from the FIO benchmark show the performance advantage of using QEMU's GlusterFS block driver instead of the usual FUSE mount when accessing the VM image.

Test setup

Host: Dual core x86_64 system running Fedora 17 kernel (3.5.6-1.fc17.x86_64)
Guest: Fedora 17 image, 4-way SMP, 2 GB RAM, using virtio and cache=none QEMU options

QEMU options

FUSE mount: qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/mnt/F17,if=virtio,cache=none (/mnt is the GlusterFS FUSE mount point)
GlusterFS block driver in QEMU (FUSE bypass): qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=gluster://bharata/test/F17,if=virtio,cache=none
Base (VM image accessed directly from the brick): qemu-system-x86_64 --enable-kvm --nographic -smp 4 -m 2048 -drive file=/test/F17,if=virtio,cache=none (/test is the brick directory)

FIO load files

Sequential read direct IO
; Read 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=read
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16
Sequential write direct IO
; Write 4 files with aio at different depths
[global]
ioengine=libaio
direct=1
rw=write
bs=128k
size=512m
directory=/data1
[file1]
iodepth=4
[file2]
iodepth=32
[file3]
iodepth=8
[file4]
iodepth=16


FIO READ numbers

                                              aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                           15219          3804          5792
QEMU's GlusterFS block driver (FUSE bypass)          39357          9839         12946
Base                                                 43802         10950         12918

FIO WRITE numbers

                                              aggrb (KB/s)   minb (KB/s)   maxb (KB/s)
FUSE mount                                           24579          6144          8423
QEMU's GlusterFS block driver (FUSE bypass)          42707         10676         17262
Base                                                 42393         10598         15646

Updated numbers

Here are more recent FIO numbers, averaged over 5 runs, using the latest QEMU (git commit: 03a36f17d77) and GlusterFS (git commit: cee1b62d01). The test environment remains the same as above with the following two changes:

  • The GlusterFS volume has write-behind translator turned off
  • The host kernel is upgraded to 3.6.7-4.fc17.x86_64

FIO READ numbers

                                              aggrb (KB/s)   % Reduction from Base
Base                                                 44464                       0
FUSE mount                                           21637                     -51
QEMU's GlusterFS block driver (FUSE bypass)          38847                   -12.6

FIO WRITE numbers

                                              aggrb (KB/s)   % Reduction from Base
Base                                                 45824                       0
FUSE mount                                           40919                   -10.7
QEMU's GlusterFS block driver (FUSE bypass)          45627                   -0.43

GlusterFS support in oVirt

While I have described how to use GlusterFS as a storage backend for QEMU manually, there have also been efforts to enable QEMU-GlusterFS native support from libvirt, VDSM and oVirt. GlusterFS is now fully enabled in oVirt, which allows users to use oVirt's self-help portal to create a GlusterFS volume and use it as a storage backend to host VM images. The GlusterFS storage domain work in VDSM, and its enablement from oVirt, allows oVirt to exploit the QEMU-GlusterFS native integration rather than using FUSE to access GlusterFS volumes.

Deepak C Shetty has created a nice video demo of how to use oVirt to create a GlusterFS storage domain and boot VMs off it.

UNMAP/Discard support in QEMU-GlusterFS

UNMAP support in QEMU-GlusterFS is explained here.