all posts tagged Storage


April 16, 2014

New Style Replication

This afternoon, I’ll be giving a talk about (among other things) my current project at work – New Style Replication. For those who don’t happen to be at Red Hat Summit, here’s some information about why, what, how, and so on.

First, why. I’m all out of tact and diplomacy right now, so I’m just going to come out and say what I really think. The replication that GlusterFS uses now (AFR) is unacceptably prone to “split brain” and always will be. That’s fundamental to the “fan out from the client” approach. Quorum enforcement helps, but the quorum enforcement we currently have sacrifices availability unnecessarily and still isn’t turned on by default. Even worse, once split brain has occurred we give the user very little help resolving it themselves. It’s almost like we actively get in their way, and I believe that’s unforgivable. I’ve submitted patches to overcome both of these shortcomings, but for various reasons those have been almost completely ignored. Many of the arguments about NSR vs. AFR have been about performance, which I’ll get into later, but that’s really not the point. In priority order, my goals are:

  • More correct behavior, particularly with respect to split brain.

  • More flexibility regarding tradeoffs between performance, consistency, and availability. At the extremes, I hope that NSR can be used for a whole continuum from fully synchronous to fully asynchronous replication.

  • Better performance in the most common scenarios (though our unicorn-free reality dictates that in return it might be worse in others).

To show the most fundamental difference between NSR and AFR, I’ll borrow one of my slides.

[Slide: the “fan out” (AFR) flow vs. the “chain” (NSR) flow]

The “fan out” flow is AFR. The client sends data directly to both servers, and waits for both to respond. The “chain” flow is NSR. The client sends data to one server (the temporary master), which then sends it to the others, then the replies have to propagate back through that first server to the client. (There is actually a fan-out on the server side for replica counts greater than two, so it’s technically more splay than chain replication, but bear with me.) The master is elected and re-elected via etcd, in case people were wondering why I’d been hacking on that.
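
For those wondering what the etcd part looks like in practice, here is a minimal sketch of term-style leader election against etcd’s v2 keys API. The endpoint, key name, node ID, and TTL are all assumptions made up for this illustration, and this is not the actual NSR election code; it only shows the create-if-absent / refresh-if-still-mine pattern that etcd makes easy:

import time
import requests   # plain HTTP against etcd's v2 keys API

ETCD = "http://127.0.0.1:4001/v2/keys"   # assumed local etcd endpoint
NODE = "server-a"                        # this replica's ID (illustrative)
KEY  = ETCD + "/nsr-demo/leader"         # made-up key for one replica set

def try_to_lead(ttl=10):
    # Create the key only if nobody holds it; whoever wins is master
    # until the TTL expires or the lease gets renewed.
    r = requests.put(KEY, params={"prevExist": "false"},
                     data={"value": NODE, "ttl": ttl})
    return r.ok

def renew_lease(ttl=10):
    # Refresh the TTL, but only if we are still the recorded leader.
    r = requests.put(KEY, params={"prevValue": NODE},
                     data={"value": NODE, "ttl": ttl})
    return r.ok

while True:
    leading = try_to_lead() or renew_lease()
    # leading == False means some other replica is master for this term
    time.sleep(3)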

Using a master this way gives us two advantages. First, the master is key to how “reconciliation” (data repair after a node has left and returned) works: NSR recovery is log-based and precise, unlike AFR, which marks files as needing repair and then has to scan the file contents to find the parts that differ between replicas. Second, the master gives us a complete ordering of operations. Masters serve for terms; the order of requests between terms is recorded as part of the leader-election process, and the order within a term is implicit in that term’s log. Together, that is all of the information we need to do reconciliation across any set of operations without having to throw up our hands and say we don’t know what the correct final state should be.
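
To make the contrast with scan-based repair concrete, here is a toy, term-ordered replay in Python. Every name and structure in it is invented for illustration and the real recovery logic is far more involved; the point is only that per-term logs tell a returning replica exactly which writes it missed, and in what order:

def reconcile(replica_data, logs_by_term, last_term_applied):
    # logs_by_term: {term_number: [(offset, chunk), ...]} in issue order
    for term in sorted(logs_by_term):
        if term <= last_term_applied:
            continue                      # replica already has this term's writes
        for offset, chunk in logs_by_term[term]:
            replica_data[offset] = chunk  # re-apply exactly the missed writes
        last_term_applied = term
    return last_term_applied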

There’s a lot more about the “what” and the “how” that I’ll leave for a later post, but that should do as a teaser while we move on to the flexibility and performance parts. In its most conservative default mode, the master forwards writes to all other replicas before performing them locally and doesn’t report success to the client until all writes are done. Either of those “all” parts can be relaxed to achieve better performance and/or asynchronous replication at some small cost in consistency.

  • First we have an “issue count,” which can range from zero to N-1 (for N replicas). This is the number of non-leader replicas to which a write must be issued before the master issues it locally.

  • Second we have a “completion count,” which can range from one to N. This is the number of writes that must be complete (including the master’s own) before success is reported to the client.

The defaults are Issue=N-1 and Completion=N for maximum consistency. At the other extreme, Issue=0 means that the master can issue its local write immediately and Completion=1 means it can report success as soon as one write – almost certainly that local one – completes. Any other copies are written asynchronously but in order. Thus, we have both sync and async replication under one framework, merely tweaking parameters that affect small parts of the implementation instead of having to use two completely different approaches. This is what “unified replication” in the talk is about.
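
To make those knobs concrete, here is a rough sketch of the write path in Python. The object names, the thread pool, and the error handling are all invented for illustration rather than taken from the real implementation, and cross-request ordering is ignored entirely:

from concurrent.futures import ThreadPoolExecutor, as_completed

def replicated_write(write_op, leader, followers, issue_count, completion_count):
    # issue_count (0 .. N-1): non-leader copies the write is issued to
    #                         before the leader performs it locally.
    # completion_count (1 .. N): copies (leader included) that must finish
    #                            before success is reported to the client.
    pool = ThreadPoolExecutor(len(followers) + 1)

    early = [pool.submit(f.write, write_op) for f in followers[:issue_count]]
    local = pool.submit(leader.write, write_op)             # leader's own copy
    late  = [pool.submit(f.write, write_op) for f in followers[issue_count:]]

    finished = 0
    for fut in as_completed(early + [local] + late):
        fut.result()                       # surfaces any replica failure
        finished += 1
        if finished >= completion_count:
            return "success"               # remaining copies finish in the background

With the defaults, every copy is synchronous before the client hears back; with Issue=0 and Completion=1, the client hears back as soon as the leader’s own write lands and the other copies trail behind.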

OK, on to performance. The main difference here is that the client-fan-out model splits the client’s outbound bandwidth. If you have N replicas, a client with bandwidth BW can never achieve more than BW/N write throughput. In the chain/splay model, the client can use its full bandwidth and the server can use its own BW/(N-1) simultaneously. This means increased throughput in most cases, and that’s not just theoretical: I’ve observed and commented on exactly that phenomenon in head-to-head comparisons with more than one alternative to GlusterFS. Yes, if enough clients gang up on a server then that server’s networking can become more of a bottleneck than with the client-fan-out model, and if the server is provisioned similarly to the clients, and if we’re not disk-bound anyway, then that can be a problem. Likewise, the two-hop latency with this approach can be a problem for latency-sensitive and insufficiently parallel applications (remember that this is all within one replica set among many active simultaneously within a volume). However, these negative cases are much – much – less common in practice than the positive cases. We did have to sacrifice some unicorns, but the workhorses are doing fine.
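
A quick back-of-the-envelope comparison, with assumed numbers (10Gbit/s NICs everywhere, three-way replication), shows why the chain usually wins for a single writer:

client_bw = server_bw = 10.0    # Gbit/s, assumed for illustration
n = 3                           # replica count

fan_out = client_bw / n                        # client pushes all N copies itself
chain   = min(client_bw, server_bw / (n - 1))  # client pushes one copy; the
                                               # leader forwards the other N-1
print("fan-out limit: %.1f Gbit/s" % fan_out)  # 3.3
print("chain limit:   %.1f Gbit/s" % chain)    # 5.0

Even with the leader’s NIC as the second bottleneck, a single writer does better under the chain model; the caveats above are about what happens when many clients converge on the same leader at once.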

That’s the plan to (almost completely) eliminate the split-brain problems that have been the bane of our users’ existence, while also adding flexibility and improving performance in most cases. If you want to find out more, come to one of my many talks or find me online, and I’ll be glad to talk your ear off about the details.

January 9, 2014

Configuring OpenStack Havana Cinder, Nova and Glance to run on GlusterFS

Configuring Glance, Cinder and Nova for OpenStack Havana to run on GlusterFS is actually quite simple, assuming that you’ve already got GlusterFS up and running.

So let’s first look at my Gluster configuration. As you can see below, I have a Gluster volume defined for Cinder, Glance and Nova.…


December 19, 2013

Installing GlusterFS on RHEL 6.4 for OpenStack Havana (RDO)

The OpenCompute systems are the ideal hardware platform for distributed filesystems. Period. Why? Cheap servers with 10GbE NICs and a boatload of locally attached cheap storage!

In preparation for deploying Red Hat RDO on RHEL, the distributed filesystem I chose was GlusterFS.…


September 26, 2013

Alternative Design to VMware VSAN with GlusterFS

Shortly before VMware’s VSAN was released, I had designed my new lab using GlusterFS across 2 to 4 nodes on my Dell C6100. Since this server did not have a proper RAID card and had 4 nodes total, I needed to design something semi-redundant in case a host were to fail.

Scaling:

You have a few options for how to scale this, the simplest being 2 nodes with GlusterFS replicating the data. This only requires 1 VM on each host with VMDKs or RDMs for storage, which is then shared back to the host via NFS, as described later.

If you wish to scale beyond 2 nodes and only replicate the data twice instead of across all 4 nodes, you’ll just need to set up the volume as a distributed-replicate volume; this keeps 2 copies of each file across the 4 or more hosts. What I mistakenly found out previously was that if you use the same folder across all the nodes, it replicates the data to all 4 of them instead of just 2. You can see a sample working layout below, followed by a quick note on how the bricks pair up:

Volume Name: DS-01
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 172.16.0.21:/GFS-1/Disk1
Brick2: 172.16.0.22:/GFS-2/Disk1
Brick3: 172.16.0.23:/GFS-3/Disk1
Brick4: 172.16.0.24:/GFS-4/Disk1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.1.1
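
For reference, with “replica 2” GlusterFS pairs the bricks in the order they are listed on the volume-create command line and then distributes files across those pairs. A quick illustration of that grouping, using the bricks above (the snippet itself is purely illustrative):

bricks = ["172.16.0.21:/GFS-1/Disk1", "172.16.0.22:/GFS-2/Disk1",
          "172.16.0.23:/GFS-3/Disk1", "172.16.0.24:/GFS-4/Disk1"]
replica = 2

# Consecutive bricks form a replica (mirror) set; files are then distributed
# across the resulting sets -- the "2 x 2 = 4" shown in the volume info above.
replica_sets = [bricks[i:i + replica] for i in range(0, len(bricks), replica)]
for num, rset in enumerate(replica_sets, 1):
    print("replica set %d: %s" % (num, "  <->  ".join(rset)))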

Networking:

After trying several different methods of building a so-called FT NFS server, using things like UCARP and Heartbeat, and failing, I thought about using a vSwitch with no uplink and using the same IP address across all of the nodes and their storage VMs. Since the data is replicated and the servers are aware of where the data is, it theoretically should be available where needed. This also ends up tricking vSphere into thinking this IP address is actually available across the network and is really “shared” storage.

Networking on a host ended up looking similar to this:

vSwitch0 — vmnic0
Management: 10.0.0.x

vSwitch1 — No Uplink
GlusterFS Server: 192.168.1.2
GlusterFS Client(NFS Server): 192.168.1.3
VMkernel port: 192.168.1.4

vSwitch2 — vmnic1
GlusterFS Replication: 172.16.0.x


Then you can go ahead and set up a vSphere cluster and add the datastores with the same IP address across all hosts.

Testing:

I will admit I did not have enough time to properly test things like performance before moving to VSAN, but what I did test worked. I was able to do vMotions across the hosts in this setup and validate HA failover on a hypervisor failure. There are obviously some design problems here, because if one of the storage VMs has issues, storage breaks on that host. I had only designed this to account for a host failing, which I figured would be the failure I’d face most often.

Thoughts, concerns, ideas?

August 27, 2013

Creating An NFS-Like Standalone Storage Server With GlusterFS 3.2.x On Debian Wheezy


This tutorial shows how to set up a standalone storage server on Debian Wheezy. Instead of NFS, I will use GlusterFS here. The client system will be able to access the storage as if it were a local filesystem.

GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over InfiniBand RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86_64 servers with SATA-II RAID and an InfiniBand HBA.

June 13, 2013

GlusterFS portability on full view – ARM 64

Today at Red Hat Summit, Jon Masters, Red Hat’s chief ARM architect, demonstrated GlusterFS replicating across two ARM 64 servers while streaming a video. This marks the first successful demo of a distributed filesystem running on ARM 64.

Video and podcast to come soon.

January 10, 2013

Creating An NFS-Like Standalone Storage Server With GlusterFS 3.2.x On Ubuntu 12.10


This tutorial shows how to set up a standalone storage server on Ubuntu 12.10. Instead of NFS, I will use GlusterFS here. The client system will be able to access the storage as if it were a local filesystem.

GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over InfiniBand RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86_64 servers with SATA-II RAID and an InfiniBand HBA.

January 6, 2013

Distributed Replicated Storage Across Four Storage Nodes With GlusterFS 3.2.x On CentOS 6.3


This tutorial shows how to combine four single storage servers (running CentOS 6.3) into one distributed, replicated storage volume with GlusterFS. Nodes 1 and 2 (replication1) as well as nodes 3 and 4 (replication2) will mirror each other, and replication1 and replication2 will be combined into one larger storage volume (distribution). Basically, this is RAID10 over the network.

If you lose one server from replication1 and one from replication2, the distributed volume continues to work. The client system (CentOS 6.3 as well) will be able to access the storage as if it were a local filesystem.

GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over InfiniBand RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86_64 servers with SATA-II RAID and an InfiniBand HBA.

December 17, 2012

Creating An NFS-Like Standalone Storage Server With GlusterFS 3.2.x On CentOS 6.3


This tutorial shows how to set up a standalone storage server on CentOS 6.3. Instead of NFS, I will use GlusterFS here. The client system will be able to access the storage as if it were a local filesystem.

GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over InfiniBand RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware, such as x86_64 servers with SATA-II RAID and an InfiniBand HBA.