

by on May 9, 2014

Gluster scale-out tests: an 84 node volume

This post describes recent tests done by Red Hat on an 84 node gluster volume.  Our experiments measured performance characteristics and management behavior. To our knowledge, this is the largest performance test ever done under controlled conditions within the organization (we have heard of larger clusters in the community but do not know any details about them).

Red Hat officially supports up to 64 gluster servers in a cluster, and our tests exceed that. But the problems we encountered are not theoretical. The scalability issues appeared to be related to the number of bricks, not the number of servers. If a customer were to use just 16 servers but with 60 drives on each, they would have 960 bricks and would likely see issues similar to what we found.

Summary: With one important exception, our tests show gluster scales linearly on common I/O patterns. The exception is on file create operations. On creates, we observe network overhead increase as the cluster grows. This issue appears to have a solution and a fix is forthcoming.

We also observed that gluster management operations become slower as the number of nodes increases; bz 1044693 has been opened for this. However, we were using the shared local disk on the hypervisor rather than the disk dedicated to each VM. When this was changed, the commands ran much faster, e.g. completing in about 8 seconds.

Configuration

Configuring an 84 node volume is easier said than done. Our intention was to build a methodology (tools and procedures) to spin up and tear down a large cluster of gluster servers at will.

We do not have 84 physical machines available. But our lab does have very powerful servers (described below). They can run multiple gluster servers at a time in virtual machines.  We ran 12 such VMs on each physical machine. Each virtual machine was bound to its own disk and CPU. Using this technique, we are able to use 7 physical servers to test 84 nodes.
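For illustration only (the domain name, core numbers, and device path below are hypothetical, not taken from our scripts), binding a guest to its own host cores and physical disk can be done with virsh along these lines:

# pin the guest's two vCPUs to two dedicated host cores
virsh vcpupin gluster-vm01 0 2
virsh vcpupin gluster-vm01 1 3
# hand the guest its own physical disk to use as the brick device
virsh attach-disk gluster-vm01 /dev/sdc vdb --persistent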

Tools to set up and manage clusters of this many virtual machines are nascent. Much configuration work must be done by hand. The general technique is to create a “golden copy” VM and “clone” it many times. Care must be taken to keep track of IP addresses, host names, and the like. If a single VM is misconfigured, it can be difficult to locate the problem within a large cluster.

Puppet and Chef are good candidates to simplify some of the work, and vagrant can create virtual machines and do underlying resource management, but everything still must be tied together and programmed. Our first implementation did not use the modern tools. Instead, crude but effective bash, expect, and kickstart scripts were written. We hope to utilize puppet in the near term with help from gluster configuration management guru James Shubin. If you like ugly scripts, they may be found here.

One of the biggest problem areas in this setup was networking. When KVM creates a Linux VM, a hardware address and virtual serial port console exist, and an IP address can be obtained using DHCP. But we have a limited pool of IP addresses on our public subnet, and our lab’s system administrator frowns upon 84 new IP addresses being allocated out of the blue. Worse, the public network is 1GbE Ethernet, too slow for performance testing.

To work around those problems, we utilized static IP addresses on a private 10GbE Ethernet network. This network has its own subnet and is free from lab restrictions. It does not have a DHCP server. To set the static IP addresses, we wrote an “expect” script which logs into each VM over the serial line and modifies the network configuration files.

At the hypervisor level, we manually set up the virtual bridge, the disk configurations, and set the virtual-host “tuned” profile.
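For example, the tuned profile is applied with a single command, and the bridge setup is roughly as follows (the bridge and interface names here are hypothetical):

tuned-adm profile virtual-host
brctl addbr br10g
brctl addif br10g eth2
ip link set br10g up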

Once the system was built, it quickly became apparent that another set of tools would be needed to manage the running VMs. For example, it is sometimes necessary to run the same command across all 84 machines. Bash scripts were written to that end, though other tools (pdsh) could have been used.
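A minimal sketch of such a script is just a loop over the node list (the hostnames here are made up); pdsh collapses the same thing into one command:

#!/bin/bash
# run the supplied command on every gluster VM in parallel
for node in gluster{001..084}; do
    ssh -o StrictHostKeyChecking=no root@$node "$*" &
done
wait

# roughly equivalent with pdsh:
# pdsh -w gluster[001-084] 'gluster --version'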

Test results

With that done, we were ready to do some tests. Our goals were:

  1. To confirm gluster “scales linearly” for large and small files: as new nodes are added, performance increases accordingly
  2. To examine behavior on large systems. Do all the management commands work?

Large file tests: gluster scales nicely.

scaling

Small file tests: gluster scales on reads, but not write-new-file.

smf-scaling

Oops. Small file writes are not scaling linearly. What’s going on here?

Looking at wireshark traces, we observed many LOOKUP calls sent to each of the nodes for every file create operation. As the number of nodes increased, so did the number of LOOKUPs. It turns out that the gluster client was doing a multicast lookup on every node on creates. It does this to confirm the file does not already exist.

The gluster parameter “lookup-unhashed” forces DHT hashing to be used. This will send the LOOKUP to the node where the new file should reside, rather than doing a multicast to all nodes. Below are the results when this setting is enabled.

write-new-file test results with the parameter set (red line). Much better!

lookup-unhashed
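For reference, an option like this is toggled per volume with the gluster CLI; a sketch, assuming a volume named myvol (the exact option name and accepted values may differ between gluster releases, so check the documentation for your version):

gluster volume set myvol cluster.lookup-unhashed off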

 

This parameter is dangerous. If the cluster’s brick topology has changed and a rebalance was aborted, gluster may believe a file does not exist when it really does. In other words, the LOOKUP existence test would generate a false negative because DHT would have the client look at the wrong nodes. This could result in two GFIDs being accessible by the same path.

A fix is being written. It will assign generation counts to bricks. By default, DHT will be used on lookups. But if the generation counts indicate a topology change has taken place on the target bricks, the client will revert to the slower broadcast mode of operation.

We observed that any management command dealing with the volume took as long as a minute. For example, the “gluster import” command in the oVirt UI takes more than 30 seconds to complete. Bug 1044693 was opened for this. In all cases the management commands worked, but they were very slow. See the note above about the shared hypervisor disk.

Future

Gluster engineers suggested some additional tests that we could do as future work:

  1. object enumeration – how well does “ls” scale for large volumes?
  2. What is the largest number of small objects (files) that a machine can handle before it makes sense to add a new node?
  3. Snapshot testing for scale-out volumes
  4. OpenStack behavior – what happens when the number of VMs goes up? We would look at variance and latency for the worst case.

Proposals to do larger scale-out tests:

  • We could present partitions of disks to gluster as bricks. For example, a single 1TB drive could be divided into 10 100GB partitions. This could boost the cluster size by an order of magnitude. Given that the disk head would be shared by multiple servers, this technique would only make sense for random I/O tests (where the head is already under stress). A sketch of this idea follows the list.
  • Experiment with running gluster servers within containers.
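A sketch of the disk-partitioning idea above, using LVM to carve one drive into ten brick-sized logical volumes (the device name, sizes, and mount points are hypothetical):

pvcreate /dev/sdX
vgcreate brickvg /dev/sdX
for i in $(seq 1 10); do
    lvcreate -L 100G -n brick$i brickvg
    mkfs.xfs -i size=512 /dev/brickvg/brick$i
    mkdir -p /bricks/brick$i
    mount /dev/brickvg/brick$i /bricks/brick$i
done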

Hardware

Gluster volumes are constructed out of a varying number of bricks embedded within separate virtual machines. Each virtual machine has:

  • dedicated 7200-RPM SAS disk for Gluster brick
  • a file on hypervisor system disk for the operating system image
  • 2 Westmere or Sandy Bridge cores
  • 4 GB RAM

The KVM hosts are 7 standard Dell R510/R720 servers with these attributes:

  • 2-socket Westmere/Sandy Bridge Intel x86_64
  • 48/64 GB RAM
  • 1 10-GbE interface with jumbo frames (MTU=9000)
  • 12 7200-RPM SAS disks configured in JBOD mode from a Dell PERC H710 (LSI MegaRAID)

For sequential workloads, we use only 8 out of 12 guests in each host so that aggregate disk bandwidth never exceeds network bandwidth.

Clients are 8 standard Dell R610/R620 servers with:

  • 2-socket Westmere/Sandy Bridge Intel x86_64
  • 64 GB RAM
  • 1 10-GbE Intel NIC interface with jumbo frames (MTU=9000)
by on February 28, 2014

New Linux Container Virtualization Technology from Docker


Docker, a new container-based virtualization startup, has entered the server virtualization industry with the newest version of its software, Docker 0.8. The company has been known for producing a faster alternative to running virtual machines on hypervisors.

by on November 26, 2013

GlusterFS Block Device Translator

Block device translator

Block device translator (BD xlator) is a new translator recently added to GlusterFS which provides a block backend for GlusterFS. It replaces the existing bd_map translator in GlusterFS that provided similar but very limited functionality. GlusterFS expects the underlying brick to be formatted with a POSIX-compatible file system. BD xlator changes that and allows bricks that are raw block devices, such as LVM volumes, which needn’t have any file system on them. Hence with BD xlator, it becomes possible to build a GlusterFS volume comprising bricks that are logical volumes (LVs).

bd

BD xlator maps underlying LVs to files and hence the LVs appear as files to GlusterFS clients. Though BD volume externally appears very similar to the usual Posix volume, not all operations are supported or possible for the files on a BD volume. Only those operations that make sense for a block device are supported and the exact semantics are described in subsequent sections.

While Posix volume takes a file system directory as brick, BD volume needs a volume group (VG) as brick. In the usual use case of BD volume, a file created on BD volume will result in an LV being created in the brick VG. In addition to a VG, BD volume also needs a file system directory that should be specified at the volume creation time. This directory is necessary for supporting the notion of directories and directory hierarchy for the BD volume. Metadata about LVs (size, mapping info) is stored in this directory.

BD xlator was mainly developed to use block devices directly as VM images when GlusterFS is used as storage for KVM virtualization. Some of the salient points of BD xlator are

  • Since BD supports file level snapshots and clones by leveraging the snapshot and clone capabilities of LVM, it can be used to fully off-load snapshot and cloning operations from QEMU to the storage (GlusterFS) itself.
  • BD understands dm-thin LVs and hence can support files that are backed by thinly provisioned LVs. This capability of BD xlator translates to having thinly provisioned raw VM images.
  • BD enables thin LVs from a thin pool to be used from multiple nodes that have visibility to GlusterFS BD volume. Thus thin pool can be used as a VM image repository allowing access/visibility to it from multiple nodes.
  • BD supports true zerofill by using BLKZEROOUT ioctl on underlying block devices. Thus BD allows SCSI WRITESAME to be used on underlying block device if the device supports it.

Though BD xlator is primarily intended to be used with block devices, it does provide full Posix xlator compatibility for files that are created on BD volume but are not backed by or mapped to a block device. Such files which don’t have a block device mapping exist on the Posix directory that is specified during BD volume creation.

Availability

BD xlator developed by M. Mohan Kumar was committed into GlusterFS git in November 2013 and is expected to be part of upcoming GlusterFS-3.5 release.

Compiling BD translator

BD xlator needs the lvm2 development library. The --enable-bd-xlator option can be used with the ./configure script to explicitly enable the BD translator. The following snippet from the output of the configure script shows that the BD xlator is enabled for compilation.

GlusterFS configure summary
===================

Block Device xlator  : yes

Creating a BD volume

BD supports hosting of both linear LV and thin LV within the same volume. However I will be showing them separately in the following instructions. As noted above, the prerequisite for a BD volume is VG which I am creating here from a loop device, but it can be any other device too.

1. Creating BD volume with linear LV backend

- Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop

- Prepare a brick by creating a VG

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg /dev/loop0

- Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta

It is recommended that this directory is created on an LV in the brick VG itself so that both data and metadata live together on the same device.

Create and mount the volume
[root@bharata ~]# gluster volume create bd bharata:/bd-meta?bd-vg force

The general syntax for specifying the brick is host:/posix-dir?volume-group-name where “?” is the separator.

[root@bharata ~]# gluster volume start bd
[root@bharata ~]# gluster volume info bd

Volume Name: bd
Type: Distribute
Volume ID: cb042d2a-f435-4669-b886-55f5927a4d7f
Status: Started
Xlator 1: BD
Capability 1: offload_copy
Capability 2: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta
Brick1 VG: bd-vg

[root@bharata ~]# mount -t glusterfs bharata:/bd /mnt

- Create a file that is backed by an LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Since the volume is empty now, so is the underlying VG.
[root@bharata ~]# lvdisplay bd-vg
[root@bharata ~]#

Creating a file that is mapped to an LV is a 2 step operation. First the file should be created on the mount point and a specific extended attribute should be set to map the file to LV.

[root@bharata ~]# touch /mnt/lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "lv" /mnt/lv

Now an LV got created in the VG brick and the file /mnt/lv maps to this LV. Any read/write to this file ends up as read/write to the underlying LV.
[root@bharata ~]# lvdisplay bd-vg
— Logical volume —
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                      available
# open                          0
LV Size                    4.00 MiB
Current LE                   1
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

The file gets created with the default LV size, which is 1 LE (4 MB in this case).
[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 4.0M Nov 26 16:15 /mnt/lv

truncate can be used to set the required file size.
[root@bharata ~]# truncate /mnt/lv -s 256M
[root@bharata ~]# lvdisplay bd-vg
— Logical volume —
LV Path                          /dev/bd-vg/6ff0f25f-2776-4d19-adfb-df1a3cab8287
LV Name                        6ff0f25f-2776-4d19-adfb-df1a3cab8287
VG Name                       bd-vg
LV UUID                         PjMPcc-RkD5-RADz-6ixG-UYsk-oclz-vL0nv6
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:15:45 +0530
LV Status                       available
# open                           0
LV Size                     256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

[root@bharata ~]# ls -lh /mnt/lv
-rw-r--r--. 1 root root 256M Nov 26 16:15 /mnt/lv

The size of the file/LV can be specified during creation/mapping time itself like this:
setfattr -n "user.glusterfs.bd" -v "lv:256MB" /mnt/lv

2. Creating BD volume with thin LV backend

- Create a loop device

[root@bharata ~]# dd if=/dev/zero of=bd-loop-thin count=1024 bs=1M
[root@bharata ~]# losetup /dev/loop0 bd-loop-thin

- Prepare a brick by creating a VG and thin pool

[root@bharata ~]# pvcreate /dev/loop0
[root@bharata ~]# vgcreate bd-vg-thin /dev/loop0

Create a thin pool
[root@bharata ~]# lvcreate --thin bd-vg-thin -L 1000M
Rounding up size to full physical extent 4.00 MiB
Logical volume "lvol0" created

lvdisplay shows the thin pool
[root@bharata ~]# lvdisplay bd-vg-thin
— Logical volume —
LV Name                       lvol0
VG Name                      bd-vg-thin
LV UUID                        HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID  0
LV Pool metadata          lvol0_tmeta
LV Pool data                  lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                          0
LV Size                          1000.00 MiB
Allocated pool data     0.00%
Allocated metadata     0.88%
Current LE                   250
Segments                     1
Allocation                     inherit
Read ahead sectors     auto
– currently set to         256
Block device                253:9

- Create the BD volume

Create a POSIX directory first
[root@bharata ~]# mkdir /bd-meta-thin

Create and mount the volume
[root@bharata ~]# gluster volume create bd-thin bharata:/bd-meta-thin?bd-vg-thin force
[root@bharata ~]# gluster volume start bd-thin
[root@bharata ~]# gluster volume info bd-thin

Volume Name: bd-thin
Type: Distribute
Volume ID: 27aa7eb0-4ffa-497e-b639-7cbda0128793
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bharata:/bd-meta-thin
Brick1 VG: bd-vg-thin
[root@bharata ~]# mount -t glusterfs bharata:/bd-thin /mnt

- Create a file that is backed by a thin LV

[root@bharata ~]# ls /mnt
[root@bharata ~]#

Creating a file that is mapped to a thin LV is a 2 step operation. First the file should be created on the mount point and a specific extended attribute should be set to map the file to a thin LV.

[root@bharata ~]# touch /mnt/thin-lv
[root@bharata ~]# setfattr -n "user.glusterfs.bd" -v "thin:256MB" /mnt/thin-lv

Now /mnt/thin-lv is a thin provisioned file that is backed by a thin LV.
[root@bharata ~]# lvdisplay bd-vg-thin
— Logical volume —
LV Name                        lvol0
VG Name                       bd-vg-thin
LV UUID                         HVa3EM-IVMS-QG2g-oqU6-1UxC-RgqS-g8zhVn
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:39:06 +0530
LV Pool transaction ID 1
LV Pool metadata         lvol0_tmeta
LV Pool data                 lvol0_tdata
LV Pool chunk size       64.00 KiB
LV Zero new blocks     yes
LV Status                      available
# open                         0
LV Size                         1000.00 MiB
Allocated pool data    0.00%
Allocated metadata    0.98%
Current LE                  250
Segments                    1
Allocation                    inherit
Read ahead sectors   auto
– currently set to        256
Block device               253:9

— Logical volume —
  LV Path                     /dev/bd-vg-thin/081b01d1-1436-4306-9baf-41c7bf5a2c73
LV Name                        081b01d1-1436-4306-9baf-41c7bf5a2c73
VG Name                       bd-vg-thin
LV UUID                         coxpTY-2UZl-9293-8H2X-eAZn-wSp6-csZIeB
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-26 16:43:19 +0530
LV Pool name                 lvol0
LV Status                       available
# open                           0
  LV Size                     256.00 MiB
Mapped size                  0.00%
Current LE                    64
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
– currently set to          256
Block device                 253:10

As can be seen from above, creation of a file resulted in creation of a thin LV in the brick.

Snapshots and clones

BD xlator uses LVM snapshot and clone capabilities to provide file-level snapshots and clones for files on a GlusterFS volume. Snapshots and clones work only for those files that have already been mapped to an LV. In other words, snapshots and clones are not available for Posix-only files that exist on a BD volume.

Creating a snapshot

Say we are interested in taking snapshot of a file /mnt/file that already exists and has been mapped to an LV.

[root@bharata ~]# ls -l /mnt/file
-rw-r--r--. 1 root root 268435456 Nov 27 10:16 /mnt/file

[root@bharata ~]# lvdisplay bd-vg
— Logical volume —
LV Path                        /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                      abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
Segments                     1
Allocation                     inherit
Read ahead sectors    0
Block device                253:6

Snapshot creation is a two step process.

- Create a snapshot destination file first
[root@bharata ~]# touch /mnt/file-snap

- Then take the actual snapshot
In order to create the actual snapshot, we need to know the GFID of the snapshot file.

[root@bharata ~]# getfattr -n glusterfs.gfid.string  /mnt/file-snap
getfattr: Removing leading ‘/’ from absolute path names
# file: mnt/file-snap
glusterfs.gfid.string="bdf74e38-dc96-4b26-94e2-065fe3b8bcc3"

Use this GFID string to create the actual snapshot
[root@bharata ~]# setfattr -n snapshot -v bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 /mnt/file
[root@bharata ~]# lvdisplay bd-vg
— Logical volume —
LV Path                         /dev/bd-vg/abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Name                       abf93bbd-2c78-4612-8822-c4e0a40c4626
VG Name                      bd-vg
LV UUID                        HwSRTL-UdPH-MMz7-rg7U-pU4a-yS4O-59bDfY
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:16:54 +0530
LV snapshot status   source of bdf74e38-dc96-4b26-94e2-065fe3b8bcc3 [active]
LV Status                       available
# open                           0
LV Size                           256.00 MiB
Current LE                   64
Segments                      1
Allocation                      inherit
Read ahead sectors     0
Block device                 253:6

— Logical volume —
LV Path                        /dev/bd-vg/bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
LV Name                      bdf74e38-dc96-4b26-94e2-065fe3b8bcc3
VG Name                      bd-vg
LV UUID                        9XH6xX-Sl64-uNhk-7OiH-f91m-DaMo-6AWiBD
LV Write Access            read/write
LV Creation host, time bharata, 2013-11-27 10:20:35 +0530
LV snapshot status  active destination for abf93bbd-2c78-4612-8822-c4e0a40c4626
LV Status                      available
# open                          0
LV Size                          256.00 MiB
Current LE                   64
COW-table size             4.00 MiB
COW-table LE               1
Allocated to snapshot  0.00%
Snapshot chunk size    4.00 KiB
Segments                      1
Allocation                      inherit
Read ahead sectors     auto
– currently set to          256
Block device                 253:7

As can be seen from the lvdisplay output, /mnt/file-snap now is the snapshot of /mnt/file.

Creating a clone

Creating a clone is similar to creating a snapshot, except that the “clone” attribute name should be used instead of “snapshot”.

setfattr -n clone -v <gfid-of-clone-file> <path-to-source-file>

Clone in BD volume is essentially a server off-loaded full copy of the file.
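Putting that together, a clone operation would look something like this, mirroring the snapshot steps above (use whatever GFID getfattr actually returns for your clone file):

[root@bharata ~]# touch /mnt/file-clone
[root@bharata ~]# getfattr -n glusterfs.gfid.string /mnt/file-clone
[root@bharata ~]# setfattr -n clone -v <gfid-of-file-clone> /mnt/file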

Concerns

As you have seen, creating a block-device-backed file on a BD volume, as well as creating snapshots and clones, involves non-standard steps, including setting extended attributes. These steps could be cumbersome for an end user, and there are plans to encapsulate all of this into nice APIs that users can use easily.


by on September 26, 2013

Alternative Design to VMware VSAN with GlusterFS

Shortly before VMware’s VSAN was released, I had designed my new lab using GlusterFS across 2 to 4 nodes on my Dell C6100. Since this server did not have a proper RAID card and had 4 nodes total, I needed to design something semi-redundant in case a host were to fail.

Scaling:

You have a few options on how you want to scale this, the simplest being 2 nodes with GlusterFS replicating the data. This only requires 1 VM on each host, with VMDKs or RDMs for storage, which is then shared back to the host via NFS as described later.

If you wish to scale beyond 2 nodes and only replicate the data twice instead of across all 4 nodes, you’ll just need to set up the volume as a distributed-replicate; this should keep 2 copies of each file across the 4 or more hosts. What I mistakenly found out previously was that if you use the same folder across all the nodes, it replicates the data to all 4 of them instead of just 2. You can see a sample working layout below, followed by a sketch of the corresponding volume-create commands:

Volume Name: DS-01
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 172.16.0.21:/GFS-1/Disk1
Brick2: 172.16.0.22:/GFS-2/Disk1
Brick3: 172.16.0.23:/GFS-3/Disk1
Brick4: 172.16.0.24:/GFS-4/Disk1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.1.1
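For reference, a volume like the one above could be created roughly as follows (a sketch based on the brick layout shown; with replica 2, gluster pairs the bricks in the order they are listed, so .21/.22 and .23/.24 form the two replica sets):

gluster peer probe 172.16.0.22
gluster peer probe 172.16.0.23
gluster peer probe 172.16.0.24
gluster volume create DS-01 replica 2 transport tcp 172.16.0.21:/GFS-1/Disk1 172.16.0.22:/GFS-2/Disk1 172.16.0.23:/GFS-3/Disk1 172.16.0.24:/GFS-4/Disk1
gluster volume set DS-01 nfs.rpc-auth-allow 192.168.1.1
gluster volume start DS-01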

Networking:

After trying several different methods of making a so called FT NFS server, using things like UCARP and HeartBeat and failing, I thought about using a vSwitch with no uplink and using the same IP address across all of the nodes and their storage VM’s. Since the data is replicated and the servers are aware of where the data is, it theoretically should be available where needed. This also ends up tricking vSphere into thinking this IP address is actually available across the network and is really “shared” storage.

Networking on a host ended up looking similar to this:

vSwitch0 — vmnic0
Management: 10.0.0.x

vSwitch1 — No Uplink
GlusterFS Server: 192.168.1.2
GlusterFS Client(NFS Server): 192.168.1.3
VMKernel port: 192.168.1.4

vSwitch2 — vmnic1
GlusterFS Replication: 172.16.0.x

Screen Region 2013-10-04 at 10.18.43

Then you can go ahead and setup a vSphere cluster and add the datastores with the same IP address across all hosts.

Testing:

I will admit I did not have enough time to properly test things like performance before moving to VSAN, but what I did test worked. I was able to do vMotions across the hosts in this setup and validate HA failover on a hypervisor failure. There are obviously some design problems with this, because if one of the VMs were to have issues, it would break on that host. I had only designed this to account for a host failing, which I thought would be the issue I’d face most often.

Thoughts, concerns, ideas?

by on September 11, 2013

Up and Running with oVirt 3.3

The oVirt Project is now putting the finishing touches on version 3.3 of its KVM-based virtualization management platform. The release will be feature-packed, including expanded support for Gluster storage, new integration points for OpenStack’s Neutron networking and Glance image services, and a raft of new extensibility and usability upgrades.

oVirt 3.3 also sports an overhauled All-in-One (AIO) setup plugin, which makes it easy to get up and running with oVirt on a single machine to see what oVirt can do for you.

Prerequisites

  • Hardware: You’ll need a machine with at least 4GB RAM and processors with hardware virtualization extensions. A physical machine is best, but you can test oVirt effectively using nested KVM as well.
  • Software: oVirt 3.3 runs on the 64-bit editions of Fedora 19 or Red Hat Enterprise Linux 6.4 (or on the equivalent version of one of the RHEL-based Linux distributions such as CentOS or Scientific Linux).
  • Network: Your test machine’s domain name must resolve properly, either through your network’s DNS, or through the /etc/hosts files of your test machine itself and through those of whatever other nodes or clients you intend to use in your installation. On Fedora 19 machines with a static IP address (dhcp configurations appear not to be affected), you must disable NetworkManager for the AIO installer to run properly [BZ]:
    $> sudo systemctl stop NetworkManager.service
    $> sudo systemctl mask NetworkManager.service
    $> sudo service network start
    $> sudo chkconfig network on

    Also, check the configuration file for your interface (for instance, /etc/sysconfig/network-scripts/ifcfg-eth0) and remove the trailing zero from “GATEWAY0”, “IPADDR0”, and “NETMASK0”, as this syntax appears only to work while NetworkManager is enabled. [BZ]

  • All parts of oVirt should operate with SELinux in enforcing mode, but SELinux bugs do surface. At the time that I’m writing this, the Glusterization portion of this howto requires that SELinux be put in permissive mode. Also, the All in One install on CentOS needs SELinux to be in permissive mode to complete. You can put SELinux in permissive mode with the command:
    sudo setenforce 0

    To make the shift to permissive mode persist between reboots, edit “/etc/sysconfig/selinux” and change SELINUX=enforcing to SELINUX=permissive.


Install & Configure oVirt All in One

  1. Run one of the following commands to install the oVirt repository on your test machine.
    1. For Fedora 19:
      $> sudo yum localinstall http://ovirt.org/releases/ovirt-release-fedora.noarch.rpm -y
    2. For RHEL/CentOS 6.4 (also requires EPEL):
       $> sudo yum localinstall http://resources.ovirt.org/releases/ovirt-release-el6-8-1.noarch.rpm -y
      sudo yum localinstall http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm -y
  2. Next, install the oVirt All-in-One setup plugin:
    $> sudo yum install ovirt-engine-setup-plugin-allinone -y
  3. Run the engine-setup installer. When asked whether to configure VDSM on the host, answer yes. You should be fine accepting the other default values.
    $> sudo engine-setup
    engine-setup33.png

    Once the engine-setup script completes, you’ll have a working management server that doubles as a virtualization host. The script sets up a local storage domain for hosting VM images, and an iso domain for storing iso images for installing operating systems on the VMs you create.

  4. Before we leave the command line and fire up the oVirt Administrator Portal, we’re going to create one more storage domain: an export domain, which oVirt uses for ferrying VM images and templates between data centers. We can do this by creating the export domain mount point, setting the permissions properly, copying and tweaking the configuration files that engine-setup created for the iso domain, and reloading nfs-server:
    $> sudo mkdir /var/lib/exports/export
    $> sudo chown 36:36 /var/lib/exports/export
    1. For Fedora:
      $> sudo cp /etc/exports.d/ovirt-engine-iso-domain.exports /etc/exports.d/ovirt-engine-export-domain.exports

      In ovirt-engine-export-domain.exports Change “iso” to “export”

      $> sudo vi /etc/exports.d/ovirt-engine-export-domain.exports
      $> sudo service nfs-server reload
    2. For RHEL/CentOS:
      $> sudo vi /etc/exports

      In /etc/exports append the line:

      /var/lib/exports/export    0.0.0.0/0.0.0.0(rw)
      $> sudo service nfs reload
  5. Now, fire up your Web browser, visit the address of your oVirt engine machine, and click the “Administrator Portal” link. Log in with the user name “admin” and the password you entered during engine-setup.
    admin-portal-login33.png
    admin-portal-login-a33.png

    Once logged into the Administrator Portal, click the “Storage” tab, select your ISO_DOMAIN, and visit the “Data Center” tab in the bottom half of the screen. Next, click the “Attach” link, check the check box next to “local_datacenter,” and hit “OK.” This will attach the storage domain that houses your ISO images to your local datacenter.

    storage-tab33.png
    attach-iso33.png

    Next, we’ll create and activate our export domain. From the “Storage” tab, click “New Domain,” give the export domain a name (I’m using EXPORT_DOMAIN), choose “local_datacenter” in the Data Center drop-down menu, choose “Export / NFS” from the “Domain Function / Storage Type” drop-down menu, enter your oVirt machine’s IP or FQDN followed by :/var/lib/exports/export in the Export Path field, and click OK.

    new-export-domain33.png
  6. Before we create a VM, let’s head back to the command line and upload an iso image that we can use to install an OS on the VM we create. Download an iso image:
    $> curl -O http://mirrors.kernel.org/fedora/releases/19/Fedora/x86_64/iso/Fedora-19-x86_64-netinst.iso

    Upload the image into your iso domain (the password is the same as for the Administrator Portal):

    $> engine-iso-uploader upload -i ISO_DOMAIN Fedora-19-x86_64-netinst.iso
  7. Now we’re ready to create and run a VM. Head back to the oVirt Administrator Portal, visit the “Virtual Machines” tab, and click “New VM.” In the resulting dialog box, give your new instance a name and click “OK.”
    new-VM33.png

    In the “New Virtual Machine – Guide Me” dialog that pops up next, click “Configure Virtual Disks,” enter a disk size, and click “OK.” Hit “Configure Later” to dismiss the Guide Me dialog.

    add-disk33.png

    Next, select your newly-created VM, and click “Run Once.” In the dialog box that appears, expand “Boot Options,” check the “Attach CD” check box, choose your install iso from the drop down, and hit “OK” to proceed.

    run-once33.png

    After a few moments, the status of your new vm will switch from red to green, and you can click on the green monitor icon next to “Migrate” to open a console window.

    run-VM33.png

    oVirt defaults to the SPICE protocol for new VMs, which means you’ll need the virt-viewer package installed on your client machine. If a SPICE client isn’t available to you, you can opt for VNC by stopping your VM, clicking “Edit,” “Console,” “Show Advanced Options,” and choosing VNC from the “Protocol” drop down menu.

That’s enough for this blog post, but stay tuned for more oVirt 3.3 how-to posts. In particular, I have walkthroughs in the works for making use of oVirt’s new and improved Gluster storage support, and for making oVirt and OpenStack play nicely together.

If you’re interested in getting involved with the project, you can find all the mailing list, issue tracker, source repository, and wiki information you need here.

On IRC, I’m jbrooks, ping me in the #ovirt room on OFTC or write a comment below and I’ll be happy to help you get up and running or get pointed in the right direction.

Finally, be sure to follow us on Twitter at @redhatopen for news on oVirt and other open source projects in the Red Hat world.

by on August 20, 2013

Troubleshooting QEMU-GlusterFS

As described in my previous blog post, QEMU supports talking to GlusterFS using libgfapi, which is a much better way to use GlusterFS to host VM images than accessing GlusterFS volumes through a FUSE mount. However, due to some bugs that exist in GlusterFS-3.4, any invalid specification of a GlusterFS drive on the QEMU command line can result in completely non-obvious error messages from QEMU. The aim of this blog post is to document some of these known error scenarios in QEMU-GlusterFS so that users can figure out what could potentially be wrong in their QEMU-GlusterFS setup.

As of this writing (Aug 2013), I know about two bugs in GlusterFS that cause all this confusion in the failure messages:

  • A bug that results in GlusterFS log messages not reaching stderr because of which no meaningful failure messages are shown from QEMU. This bug has been fixed, and it should eventually make it to some release (probably 3.4.1) of GlusterFS.
  • A bug in a libgfapi routine called glfs_init() that results in returning failure with errno set to 0. QEMU depends on errno here and this leads to unpleasant segmentation faults in QEMU.

Environment

I am using Fedora 19 with distro-provided QEMU and GlusterFS rpms for all the experiments here.

# rpm -qa | grep qemu
qemu-system-x86-1.4.2-5.fc19.x86_64
ipxe-roms-qemu-20130517-2.gitc4bce43.fc19.noarch
qemu-kvm-1.4.2-5.fc19.x86_64
qemu-guest-agent-1.4.2-3.fc19.x86_64
libvirt-daemon-driver-qemu-1.0.5.1-1.fc19.x86_64
qemu-img-1.4.2-5.fc19.x86_64
qemu-common-1.4.2-5.fc19.x86_64

# rpm -qa | grep gluster
glusterfs-3.4.0-8.fc19.x86_64
glusterfs-libs-3.4.0-8.fc19.x86_64
glusterfs-fuse-3.4.0-8.fc19.x86_64
glusterfs-server-3.4.0-8.fc19.x86_64
glusterfs-api-3.4.0-8.fc19.x86_64
glusterfs-debuginfo-3.4.0-8.fc19.x86_64
glusterfs-cli-3.4.0-8.fc19.x86_64

Error scenarios

1. glusterd service not started

On Fedora 19, the glusterd service fails to start during boot, and if you don’t notice it you can end up using QEMU with GlusterFS while the glusterd service isn’t running. In this scenario, you will encounter the following kind of failure:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
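The fix is to make sure glusterd is running (and enabled so it comes up on boot); on Fedora 19 with systemd that is something like:

[root@localhost ~]# systemctl start glusterd
[root@localhost ~]# systemctl enable glusterd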

2. GlusterFS volume not started

If QEMU is used to boot a VM image on GlusterFS volume which is not in started state, the following kind of error is seen:

[root@localhost ~]# gluster volume status test
Volume test is not started
[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
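The remedy here is simply to start the volume before pointing QEMU at it:

[root@localhost ~]# gluster volume start test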

3. QEMU run as non-root user

As of this writing (Aug 2013), GlusterFS doesn’t allow non-root users to access GlusterFS volumes via libgfapi without some manual settings. This is a very typical scenario for most users who don’t use QEMU directly but use it via libvirt and oVirt. In this scenario, the following kind of error is seen:

[bharata@localhost ~]$ qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test image=F17-qcow2 transport=tcp
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test/F17-qcow2,if=virtio: could not open disk image gluster://kvm-gluster/test/F17-qcow2: No data available
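For completeness, the “manual settings” mentioned above usually amount to allowing connections from non-privileged client ports; to the best of my knowledge this is done roughly as follows (verify the option names against your GlusterFS version):

[root@localhost ~]# gluster volume set test server.allow-insecure on
# add "option rpc-auth-allow-insecure on" to /etc/glusterfs/glusterd.vol, then:
[root@localhost ~]# systemctl restart glusterd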

As you can see, 3 different failure scenarios produced similar error messages. This will improve once error logs from GlusterFS start reaching QEMU when GlusterFS with the fix is used.

4. Specifying non-existing GlusterFS volume or server

Specifying invalid or wrong server or volume names can cause nasty failures like this:

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster/test-xxx/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster port=0 volume=test-xxx image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

[root@localhost ~]# qemu-system-x86_64 --enable-kvm -nographic -smp 2 -m 1024 -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio
qemu-system-x86_64: -drive file=gluster://kvm-gluster-xxx/test/F17-qcow2,if=virtio: Gluster connection failed for server=kvm-gluster-xxx port=0 volume=test image=F17-qcow2 transport=tcp
Segmentation fault (core dumped)

This should get fixed when the bug in libgfapi is resolved.


by on August 7, 2013

UNMAP/DISCARD support in QEMU-GlusterFS

In my last blog post on QEMU-GlusterFS, I described the integration of QEMU with GlusterFS using libgfapi. In this post, I give an overview of the recently added discard support in QEMU’s GlusterFS back-end and how it can be used. Newer SCSI devices support the UNMAP command, which is used to return unused/freed blocks back to the storage. This command is typically useful when the storage is thin provisioned, such as a thin provisioned SCSI LUN. In response to a file deletion, the host device driver can send down an UNMAP command to the SCSI target and instruct it to free the relevant blocks from the thin provisioned LUN. This leads to much better utilization of the storage.

Linux support for discard

In Linux, SCSI UNMAP is supported via the generic discard framework, which I believe is also used to support the ATA TRIM command. The ATA TRIM command, typically used with SSDs, isn’t the topic of this blog. There are multiple ways in which discard functionality is invoked or used in Linux.

  • For direct block devices, one could use BLKDISCARD ioctl to release the unused blocks.
  • File systems like EXT4 support a file level discard using FALLOC_FL_PUNCH_HOLE option of fallocate system call.
  • For releasing the unused blocks at the file systems level, fstrim command can be used.
  • Finally, file systems like EXT4 also support a ‘discard’ mount option that controls whether the file system should issue UNMAP requests to the underlying block device when blocks are freed. (These mechanisms are illustrated in the short examples after this list.)
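Illustrative commands for these mechanisms (the paths and device names are placeholders, not from the original setup):

# punch a hole in an existing file (file-level discard)
fallocate --punch-hole --offset 0 --length 1M /path/to/file
# trim free blocks of a mounted file system once
fstrim -v /mnt
# or mount with continuous discard so frees are passed down automatically
mount -o discard /dev/sdX /mnt
# discard all blocks of a raw block device (destructive!)
blkdiscard /dev/sdX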

QEMU support for discard

UNMAP is primarily useful in two ways for KVM virtualization.

  • When a file is deleted in the VM, the resulting UNMAP in the guest is passed down to host which will result in host sending the discard request to the thin provisioned SCSI device. This results in the blocks consumed by the deleted file to be returned back to the SCSI storage. The effect is same when there is an explicit discard request from the guest using either ioctl or fallocate methods listed in the previous section.
  • When a VM image is deleted, there is a potential to return the freed blocks back to the storage by sending the UNMAP command to the SCSI storage.

Guest UNMAP requests will end up in QEMU only if the guest is using scsi or virtio-scsi device and not virtio-blk device. QEMU will forward this request further down to the host (device or file system) only if ‘discard=on’ or ‘discard=unmap’ drive flag is used for the device on the QEMU command line.

Example1: qemu-system-x86_64 -drive file=/images/vm.img,if=scsi,discard=on
Example2: qemu-system-x86_64 -device virtio-scsi-pci -drive if=none,discard=on,id=rootdisk,file=gluster://host/volume/image -device scsi-hd,drive=rootdisk

The way discard request is further passed down in the host is determined by the block driver inside QEMU which is serving the disk image type. While QEMU uses fallocate(FALLOC_FL_PUNCH_HOLE) for raw file backends and ioctl(BLKDISCARD) for block device back-end, other backends use their own interfaces to pass down the discard request.

GlusterFS support for discard

GlusterFS, starting from version 3.4, supports discard functionality through a new API called glfs_discard(glfs_fd, offset, size) that is available as part of libgfapi library.  On the GlusterFS server side, discard request is handled differently for posix back-end and Block Device(BD) back-end.

For the posix back-end, fallocate(FALLOC_FL_PUNCH_HOLE) is used to eventually release the blocks to the filesystem. If the posix brick has been mounted with ‘-o discard’ option, then the discard request will eventually reach the SCSI storage if the storage device supports UNMAP.

Support for BD back-end is planned to come up as soon as the ongoing development work on the newer and feature rich BD translator becomes upstream. BD translator should be using ioctl(BLKDISCARD) to UNMAP the blocks.

Discard support in GlusterFS back-end of QEMU

As described in my earlier blog, QEMU starting from version 1.3 supports GlusterFS back-end using libgfapi which is the non-FUSE way of accessing GlusterFS volumes. Recently I added discard support to GlusterFS block driver in QEMU and this support should be available in QEMU-1.6 onwards. This work involved using the glfs_discard() API from QEMU GlusterFS driver to send the discard request to GlusterFS server.

With this, the entire KVM virtualization stack using GlusterFS back-end is enabled to take advantage of SCSI UNMAP command.

Typical usage

This section describes a typical use case with QEMU-GlusterFS where a file deleted from inside the VM results in discard requests for the host storage.

Step 1: Prepare a block device that supports UNMAP

Since I don’t have a real SCSI device that supports UNMAP, I am going to use a loop device to host my GlusterFS volume. Loop device supports discard.

[root@llmvm02 bharata]# dd if=/dev/zero of=discard-loop count=1024 bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.63255 s, 1.7 GB/s

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 2097152    IO Block: 4096   regular file
Device: fc13h/64531d    Inode: 15740290    Links: 1

[root@llmvm02 bharata]# losetup /dev/loop0 discard-loop

[root@llmvm02 bharata]# cat /sys/block/loop0/queue/discard_max_bytes
4294966784

Step 2: Prepare the brick directory for GlusterFS volume

[root@llmvm02 bharata]# mkfs.ext4 /dev/loop0
[root@llmvm02 bharata]# mount -o discard /dev/loop0 /discard-mnt/
[root@llmvm02 bharata]# mount  | grep discard
/dev/loop0 on /discard-mnt type ext4 (rw,discard)

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 3: Create a GlusterFS volume

[root@llmvm02 bharata]# gluster volume create discard llmvm02:/discard-mnt/ force
volume create: discard: success: please start the volume to access data

[root@llmvm02 bharata]# gluster volume start discard
volume start: discard: success

[root@llmvm02 bharata]# gluster volume info discard
Volume Name: discard
Type: Distribute
Volume ID: ed7a6f8a-9cb8-463d-a948-61974cb64c99
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: llmvm02:/discard-mnt

Step 4: Create a sparse file in the GlusterFS volume

[root@llmvm02 bharata]# glusterfs -s llmvm02 --volfile-id=discard /mnt
[root@llmvm02 bharata]# touch /mnt/file
[root@llmvm02 bharata]# truncate -s 450M /mnt/file

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file’
Size: 471859200     Blocks: 0          IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 66368      IO Block: 4096   regular file

Step 5: Use this sparse file on GlusterFS volume as a disk drive with QEMU

[root@llmvm02 bharata]# qemu-system-x86_64 --enable-kvm -nographic -m 8192 -smp 2 -device virtio-scsi-pci -drive if=none,cache=none,id=F17,file=gluster://llmvm02/test/F17 -device scsi-hd,drive=F17 -drive if=none,cache=none,id=gluster,discard=on,file=gluster://llmvm02/discard/file -device scsi-hd,drive=gluster -kernel /home/bharata/linux-2.6-vm/kernel1 -initrd /home/bharata/linux-2.6-vm/initrd1 -append "root=UUID=d29b972f-3568-4db6-bf96-d2702ec83ab6 ro rd.md=0 rd.lvm=0 rd.dm=0 SYSFONT=True KEYTABLE=us rd.luks=0 LANG=en_US.UTF-8 console=tty0 console=ttyS0 selinux=0"

The file appears as a SCSI disk in the guest.

[root@F17-kvm ~]# dmesg | grep -i sdb
sd 0:0:1:0: [sdb] 921600 512-byte logical blocks: (471 MB/450 MiB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: 63 00 00 08
sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn’t support DPO or FUA

Step 6: Generate discard requests from the guest

Format a FS on the SCSI disk, mount it and create a file on it

[root@F17-kvm ~]# mkfs.ext4 /dev/sdb
[root@F17-kvm ~]# mount -o discard /dev/sdb /guest-mnt/
[root@F17-kvm ~]# dd if=/dev/zero of=/guest-mnt/file bs=1M count=400
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 1.07204 s, 391 MB/s

Now we can see the blocks count growing in the host like this:

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file’
Size: 471859200     Blocks: 846880     IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 917288     IO Block: 4096   regular file

Remove the file and this should generate discard requests which should get passed down to host eventually resulting in the discarded blocks getting released from the underlying loop device in the host.

[root@F17-kvm ~]# rm -f /guest-mnt/file

In the host,

[root@llmvm02 bharata]# stat /mnt/file
File: `/mnt/file’
Size: 471859200     Blocks: 46600      IO Block: 131072 regular file

[root@llmvm02 bharata]# stat discard-loop
File: `discard-loop’
Size: 1073741824    Blocks: 112960     IO Block: 4096   regular file

Thus we saw the discard requests from the guest eventually resulting in the blocks getting released in the host block device. I was using a loop device in the host, but this should work for any SCSI device that supports UNMAP. It is not necessary to have a SCSI device with UNMAP support to get the benefit of space saving. In fact, if the underlying device is a thin logical volume (dm-thin) coming from a thin pool dm device, the space saving can be realized at the thin pool level itself.

Concerns with discard mount option

There have been concerns about the cost of UNMAP operation in the storage and its detrimental effect on the IO throughput. So it is unclear if everyone will want to turn on the discard mount option by default. I wish I had access to an UNMAP-capable storage array to really test the effect of UNMAP on IO performance.


by on July 22, 2013

Nutanix vs. GlusterFS or Projects are not Products

Anyone who knows me knows, that I’ve been a VMware user for a long time. I’ve spent a large chunk of my career building virtualization solutions for different companies based on VMware tech. I’ve been active in the VMware community, and I’ve got to say it’s one of the healthiest I’ve seen in a long time. There’s a ton of interesting things going on and a robust and solid ecosphere of partners developing really great products with VMware as a core component of their products.

Those of you who REALLY know me also know that I’m a passionate proponent for Open Source and Free Software wherever it makes sense. Open Source empowers customers, users, and developers to participate more intimately with how their software evolves. I’m also a fan of organizations paying for the Free Software they consume as well, which not only addresses support requirements for most organizations, but it also encourages organizations to “have a stake” in how the products are being built, and how those projects evolve.

That last sentence is critical. Projects are not products. As worlds collide, and as proprietary companies not normally exposed to or involved with traditional open source projects start getting involved, it’s easy for them to occasionally say things that will cause those of us in the community to scratch our heads. But what do we mean when we say “projects aren’t products”?

What does that even mean?
First, take a look at this:
rht-lifecycle

As an example, let’s say a Red Hat engineer (or a customer, or a Fedora contributor, or a partner vendor like IBM, or ANYONE really) wants to add a cool new feature specifically to RHEL, Red Hat’s flagship product. The new feature would first be added to Fedora, tested, hardened, and when stable, if selected, rolled into the downstream product, RHEL. This “upstream first” methodology keeps the “community of developers” front and center, and it doesn’t hold things back from what is contributed to the Open Source project.[1]

Similarly, new features and functionality for Red Hat Enterprise Storage get added to Gluster first. Gluster is the project, whereas Red Hat Enterprise Storage is the product. This isn’t stopping companies from deploying projects into production; many companies do (I’m guessing to the chagrin of sales folks everywhere), but overall it helps the community with users and developers of all types, as feedback and bug reports are public for EVERYONE. Remember, projects like Gluster can’t exist behind a wall, because contributors come from lots of different companies and backgrounds. Everyone is welcome to submit code and patches, as well as file bug reports. This (IMHO) is what makes these Open Source projects great, and also what tends to drive the most confusion with proprietary companies trying to interact with them in the wild.

nutanix_tweet

This was a recent post by Binny Gill, the Director of Engineering for Nutanix. I won’t get into the stream of back and forth that happened after this was posted; I just wanted to share my thoughts about this from someone who has chosen to live in both worlds for a long time. [2]

It’s easy to bash on Open Source projects. With all of the mailing lists for users and developers public and if you treat them as competitive, you’ve got a target rich environment from which to pull all the ammo you need to smash them into the dirt. Every single one. That’s by design though, as most projects are interested in developing in the open. The beauty of this is, if you want to engage the community, even if you feel like you work at a “competitor”, you can do so! Join the mailing list, grab the source, watch, learn. That’s what it’s there for. I’d actually encourage Nutanix employees to download and test out the new libgfapi and kvm-qemu integration against NDFS. I know Nutanix has solid KVM support and I know their engineers are rock stars, so it would actually be pretty awesome to see a side by side comparison. Data data data!

Comparing products with projects comes off like a cheap shot: for example, grabbing a single bug report from an outdated version of Gluster and claiming Gluster itself is not enterprise ready. If NDFS were an open source project, with an upstream project and publicly available user and developer mailing lists where all new patches and bugs were reported, I’m willing to bet there would be plenty of “non-enterprise ready” commentary available. But it isn’t an open source project. The support logs aren’t public. It only gets to put its best foot forward in the public sphere.

Personally, I’d love to see Nutanix add some real gasoline to their support of Open Source by contributing NDFS back to the community as an upstream project. Especially, if it is the best of the best. Then we’d ALL be able to move beyond twitfights and one-sided performance testing (although I’m still interested in what the numbers would look like). It would also add another solid option for open source enterprise storage product offerings. The more the merrier. With the adoption rates of Nutanix within the VMware scope, I don’t doubt it’s awesome-sauce with a side of amazing. I can’t imagine how Nutanix contributing an Open Source storage component would impact the converged hardware sales and support space they’re currently rocking. While I can’t see it happening any time soon, I’d love to rock some Open Source NDFS love in my home lab.

Personally, I’m thrilled to see more closed source vendors consuming and supporting open source virtualization projects. I think it is a safe way to get started. More involvement should be encouraged. Companies like Nutanix already understand the value prop of Open Source in their space and are focused on making things like KVM and OpenStack rock with NDFS and the Nutanix platform. It’s a hell of a start and it makes sense. Some of the most exciting innovative technologies evolving today are Open Source projects. And most of them are occurring within upstream Open Source projects.

In the meantime, I’ll just stick with KVM and Gluster in my home lab and do what I can to improve the upstream projects and look forward to the conversations at VMworld this year.

[1] I’m aware there are TONS of different ways to do open source, and lots of other projects do them differently. For Red Hat, the “Upstream First” mantra means that everyone can contribute, and everyone can get a seat at the table if they want it. I understand this is overly simplistic, but I hope you get the idea.

[2] Thankfully, all the right folks are now talking on Twitter about technical differences and explaining feature functionality and architecture. It’s unfortunate that, with all of the latest and greatest features of Gluster available for Nutanix to review and tinker with, they don’t appear to have a “competitive” lab set up with Gluster for testing. (Hey Nutanix, give me a ring; I’d love to set up a geo-replicated Gluster cluster for you with zero software costs.)

by on February 26, 2013

Converged Infrastructure prototyping with Gluster 3.4 alpha and QEMU 1.4.0

I just wrapped up my presentation at the Gluster Workshop at CERN, where I discussed Open Source advantages in tackling converged infrastructure challenges. Here is my slidedeck. Just a quick heads up: some animation is lost in the PDF export, as is the color commentary that went with almost every slide.

During the presentation I demo’ed out the new QEMU/GlusterFS native integration leveraging libgfapi. For those of you wondering what that means, in short, there’s no need for FUSE anymore and QEMU leverages GlusterFS natively on the back end. Awesome.
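To make that a little more concrete, here’s a rough sketch of the difference in how a guest disk gets specified (hostnames and paths here are placeholders; the real commands show up later in the post):

# the old way: QEMU reads the image through a FUSE mount of the Gluster volume
qemu-system-x86_64 --enable-kvm -m 1024 -drive file=/mnt/vmstor/test01,if=virtio

# the new way: QEMU opens the image directly over libgfapi, no FUSE in the path
qemu-system-x86_64 --enable-kvm -m 1024 -drive file=gluster://ci01/vmstor/test01,if=virtio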

So for my demo I needed two boxes running QEMU/KVM/GlusterFS. This would provide the compute and storage hypervisor layers. As I only have a single laptop to tour Europe with, I obviously needed a nested KVM environment.

If you’ve got enough hardware, feel free to skip the Enable Nested Virtualization section and jump ahead to installing the VM OS.

This wasn’t an easy environment to get up and running; this is alpha code, boys and girls, so expect to roll your sleeves up. OK, with that out of the way, I’d like to walk through the steps I took to get my demo environment up and running. These instructions assume you have Fedora 18 installed and updated, with virt-manager and KVM installed.

Enable Nested Virtualization

Since we’re going to want to install an OS on our VM running on our Gluster/QEMU cluster that we’re building, we’ll need to enable Nested Virtualization.

Let’s first check and see if nested virtualization is already enabled. If the command below responds with an N, it’s not. If it returns Y, you can skip this section and jump ahead to the VM OS install.

$ cat /sys/module/kvm_intel/parameters/nested
N

If it’s not, we’ll need to load the KVM module with the nested option enabled. The easiest way to do this is with a modprobe configuration file:

$ echo "options kvm-intel nested=1" | sudo tee /etc/modprobe.d/kvm-intel.conf

Reboot your machine once the changes have been made and check again to see if the feature is enabled:

$ cat /sys/module/kvm_intel/parameters/nested
Y
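If you’d rather not reboot, reloading the module may also pick up the new option, provided nothing is currently using KVM (shut any running VMs down first):

sudo modprobe -r kvm_intel
sudo modprobe kvm_intel

Then re-run the check above.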

That’s it; we’re done prepping the host.

Install the VM OS

Starting with my base Fedora laptop, I’ve installed virt-manager for VM management. I wanted to use Boxes, but it’s not designed for this type of configuration. So: create your new VM. I selected the "Fedora http install" option, as I didn’t have an ISO lying around. Also, http install = awesome.
gluster01

To do this, select the http install option and enter the nearest available location.

gluster02

For me this was the Masaryk University, Brno (where I happened to be sitting during Dev Days 2013)

http://ftp.fi.muni.cz/pub/linux/fedora/linux/releases/18/Fedora/x86_64/os/


I went with an 8 GB base disk to start (we’ll add another one in a bit), gave the VM 1 GB of RAM, and left it at the default single vCPU. Start the VM build and install.
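For the command-line inclined, a roughly equivalent virt-install invocation would look something like this (an untested sketch; I did all of this through virt-manager):

virt-install --name gluster01 --ram 1024 --vcpus 1 --disk size=8 --os-variant fedora18 --location http://ftp.fi.muni.cz/pub/linux/fedora/linux/releases/18/Fedora/x86_64/os/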

gluster03

The install will take a bit longer as it’s downloading the install files during the initial boot.

gluster04

Select the language you want to use and continue to the installation summary screen. Here we’ll want to change the software selection option.

gluster05

and select the minimal install:

gluster06

During the installation, go ahead and set the root password:

gluster07

Once the installation is complete, the VM will reboot. Once done, power it down. Although we’ve enabled Nested Virtualization on the host, we still need to pass the CPU virtualization flags through to the VM.

In the virt-manager window, right-click on the VM and select Open. In the VM window, select View > Details. Rather than guessing the CPU architecture, select the option to copy the CPU configuration from the host and click OK.

gluster08

While you’re here, go ahead and add an additional 20 GB virtual drive. Make sure you select virtio for the drive type!

gluster09

Boot your VM up and let’s get started.

Base installation components

You’ll need to install some base components before you get started installing GlusterFS or QEMU.

After logging in as root,

yum update


yum install net-tools wget xfsprogs binutils

Now we’re going to create the mount point and format the additional drive we just added. The -i size=512 option below gives XFS larger inodes, leaving room for the extended attributes Gluster stores on every file; it’s the setting commonly recommended for bricks.

mkdir -p /export/brick1


mkfs.xfs -i size=512 /dev/vdb

We’ll also need to add an entry to /etc/fstab so the mount persists across reboots.

Add the following line to /etc/fstab:

/dev/vdb /export/brick1 xfs defaults 1 2
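If you’d rather append it in one shot (assuming there’s nothing else you need to edit in there):

echo "/dev/vdb /export/brick1 xfs defaults 1 2" >> /etc/fstab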

Once you’re done with this, let’s go ahead and mount the drive.

mount -a && mount
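A quick sanity check that the brick filesystem landed where we expect it:

df -h /export/brick1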

Firewalls. YMMV

It may be just me (I’m sure it is), but I struggled to get Gluster working with firewalld on Fedora 18. This is not recommended in production environments, but for our all-VMs-on-a-laptop deployment, I just disabled and removed firewalld.

yum remove firewalld
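If you’d rather keep firewalld, opening the Gluster ports should do the trick instead; roughly something like the following (24007 and 24008 for the gluster daemons plus one port per brick, and the brick port range has moved between releases, so check gluster volume status to see what yours actually use):

firewall-cmd --permanent --add-port=24007-24008/tcp
firewall-cmd --permanent --add-port=49152-49160/tcp
firewall-cmd --reload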

Gluster 3.4.0 Alpha Installation

The first thing we’ll need to do on our VM is configure and enable the Gluster repo.

wget http://download.gluster.org/pub/gluster/glusterfs/qa-releases/3.4.0alpha/Fedora/glusterfs-alpha-fedora.repo

and move it to /etc/yum.repos.d/

mv glusterfs-alpha-fedora.repo /etc/yum.repos.d/

Now we enable the repo and install glusterfs:

yum update

yum install glusterfs-server glusterfs-devel

It’s important to note here that we need the glusterfs-devel package for the QEMU integration we’ll be testing. Once that’s done, we’ll start the glusterd service and verify that it’s working.

break break 2nd VM

Ok folks, if you’ve made it here, get a coffee and do the install again on a second VM. You’ll need that second VM as a replication target before you proceed.

break break Network Prepping both VMs

As we’re on the private NAT’d network that virt-manager is managing on our laptop, we’ll need to assign static addresses to the VMs we created, and edit the /etc/hosts file on each so both servers can resolve each other’s addresses. We’re not proud here, people; this is a test environment. If you want to use proper DNS, go for it; I won’t judge if you don’t.

1) Change both VMs to use static addresses in the NAT range.
2) Change the VMs’ hostnames (I used ci01 and ci02).
3) Update /etc/hosts on both VMs to include both nodes; see the example below. This is hacky but expedient.
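For reference, here’s roughly what my /etc/hosts additions looked like on both VMs. The 192.168.122.x addresses are just examples from virt-manager’s default NAT range; use whatever static addresses you assigned, and keep the hostnames consistent with the ones used in the volume create command below.

192.168.122.101 ci01.local ci01
192.168.122.102 ci02.local ci02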

back to Gluster

Start and verify the gluster service on both VMs.
service glusterd start
service glusterd status
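Before the volume create will succeed, the two nodes also need to be joined into a trusted storage pool. From ci01, probe the second node and check that it shows up:

gluster peer probe ci02.local
gluster peer status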

On either host, we’ll need to create the gluster volume and set it for replication.
gluster volume create vmstor replica 2 ci01.local:/export/brick1 ci02.local:/export/brick1

Now we’ll start the volume we just created
gluster volume start vmstor

Verify that everything is good; if this returns cleanly, you’re up and running with GlusterFS!
gluster volume info
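The output should look something like this (field names vary slightly between releases):

Volume Name: vmstor
Type: Replicate
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: ci01.local:/export/brick1
Brick2: ci02.local:/export/brick1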

building QEMU dependencies

Let’s get some prereqs installed for getting the latest QEMU up and running:

yum install lvm2-devel git gcc-c++ make glib2-devel pixman-devel

Now we’ll download QEMU:

git clone git://git.qemu-project.org/qemu.git

The rest is pretty standard compiling from source. You’ll start by configuring your build. I’ll trim the target list to save time, as I know I’m not going to use most of the QEMU-supported architectures.
./configure --enable-glusterfs --target-list=i386-softmmu,x86_64-softmmu,x86_64-linux-user,i386-linux-user
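Configure only sets the build up; you still need to compile and install. Nothing exotic here, just the usual (bump the -j value to however many cores your VM has):

make -j2
make install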

With the build finished, everything on this host is done, and we’re ready to start building VMs that use GlusterFS natively, bypassing FUSE and leveraging thin provisioning. W00!

Creating Virtual Disks on GlusterFS

qemu-img create gluster://ci01:0/vmstor/test01?transport=socket 5G

Breaking this down, we’re using qemu-img to create a five gig disk image natively on GlusterFS. I’m still looking for more information about what the transport=socket option actually does; expect an answer soonish.
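If you want to double-check that the image actually landed on the volume, the freshly built qemu-img should be able to read it back over the same URI:

qemu-img info gluster://ci01/vmstor/test01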

Build a VM and install an OS onto the GlusterFS mounted disk image

At this point you’ll want something to actually install on your image. I went with TinyCore because, as it is, I’m already pushing up against the limits of this laptop with nested virtualization. You can download TinyCore Linux here.

qemu-system-x86_64 --enable-kvm -m 1024 -smp 4 -drive file=gluster://ci01/vmstor/test01,if=virtio -vnc 192.168.122.209:1 --cdrom /home/theron/CorePlus-current.iso

This is the quickest way to get things moving: I skipped virsh for the demo and am assigning the VNC IP and port manually. Once the VM starts up, you should be able to connect to it from your external host with a VNC client and start the install process.

gluster10

To get the install going, select the hard drive that was built with qemu-img and follow the OS install procedure.

finish

At this point you’re done and you can start testing and submitting bugs!

I’d expect to see some interesting things with OpenStack in this space as well as tighter oVirt integration moving forward.

Let me know what you think about this guide and if it was useful.

side note

Also, something completely related. I’m pleased to announce that I’ve joined the Open Source and Standards team at Red Hat, working to promote and help make upstream projects wildly successful. If you’re unsure what that means or you’re wondering why Red Hat cares about upstream projects, PLEASE reach out and say hello.

References:

nested KVM
KVM VNC
Using QEMU to boot VM on GlusterFS
QEMU downloads
QEMU Gluster native file system integration