

April 16, 2014

New Style Replication

This afternoon, I’ll be giving a talk about (among other things) my current project at work – New Style Replication. For those who don’t happen to be at Red Hat Summit, here’s some information about why, what, how, and so on.

First, why. I’m all out of tact and diplomacy right now, so I’m just going to come out and say what I really think. The replication that GlusterFS uses now (AFR) is unacceptably prone to “split brain” and always will be. That’s fundamental to the “fan out from the client” approach. Quorum enforcement helps, but the quorum enforcement we currently have sacrifices availability unnecessarily and still isn’t turned on by default. Even worse, once split brain has occurred we give the user very little help resolving it themselves. It’s almost like we actively get in their way, and I believe that’s unforgivable. I’ve submitted patches to overcome both of these shortcomings, but for various reasons those have been almost completely ignored. Many of the arguments about NSR vs. AFR have been about performance, which I’ll get into later, but that’s really not the point. In priority order, my goals are:

  • More correct behavior, particularly with respect to split brain.

  • More flexibility regarding tradeoffs between performance, consistency, and availability. At the extremes, I hope that NSR can be used for a whole continuum from fully synchronous to fully asynchronous replication.

  • Better performance in the most common scenarios (though our unicorn-free reality dictates that in return it might be worse in others).

To show the most fundamental difference between NSR and AFR, I’ll borrow one of my slides.

[Slide: client-side fan-out (AFR) vs. chained/splayed (NSR) replication flows]

The “fan out” flow is AFR. The client sends data directly to both servers and waits for both to respond. The “chain” flow is NSR. The client sends data to one server (the temporary master), which then sends it to the others; the replies then propagate back through that first server to the client. (There is actually a fan-out on the server side for replica counts greater than two, so it’s technically more splay than chain replication, but bear with me.) The master is elected and re-elected via etcd, in case people were wondering why I’d been hacking on that.
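The post doesn’t describe the election mechanism itself, so here is only a rough sketch of what TTL-based master election against etcd’s v2 keys API could look like. The key name, TTL, port, and status-code handling below are illustrative assumptions, not NSR’s actual choices.

```python
# Minimal sketch of TTL-based leader election via etcd's v2 HTTP API.
# Key name, TTL, and endpoint are illustrative assumptions, not NSR's code.
import requests

ETCD = "http://127.0.0.1:4001"                 # default etcd client port at the time (assumed)
KEY = "/v2/keys/nsr/replica-set-0/master"      # hypothetical per-replica-set key
TTL = 10                                       # seconds before the term expires

def try_become_master(my_id):
    # prevExist=false makes this an atomic "create if absent": only one
    # contender wins (201 Created); the rest see 412 and follow the winner.
    r = requests.put(ETCD + KEY,
                     params={"prevExist": "false"},
                     data={"value": my_id, "ttl": TTL})
    return r.status_code == 201

def renew_term(my_id):
    # Refresh the TTL, but only if we still hold the key (compare-and-swap).
    r = requests.put(ETCD + KEY,
                     params={"prevValue": my_id},
                     data={"value": my_id, "ttl": TTL})
    return r.status_code == 200
```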

Using a master this way gives us two advantages. First, the master is key to how “reconciliation” (data repair after a node has left and returned) works. NSR recovery is log-based and precise, unlike AFR which marks files as needing repair and then has to scan the file contents to find parts that differ between replicas. Masters serve for terms. The order of requests between terms is recorded as part of the leader-election process, and the order within a term is implicit in the log for that term. Thus, we have all of the information we need to do reconciliation across any set of operations without having to throw up our hands and say we don’t know what the correct final state should be.
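To make the term/log idea concrete, here is a rough sketch, with invented names, of how the recorded order of terms plus the implicit order within each term’s log could yield a single replay order for reconciliation. None of this is NSR’s actual data layout.

```python
# Rough sketch (names and structures are assumptions, not NSR's actual code)
# of how per-term logs plus the recorded term sequence give a total order of
# writes, so a stale replica can be repaired precisely.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WriteRecord:
    seq: int          # position within the term's log (implicit order)
    file_id: str
    offset: int
    length: int

@dataclass
class Term:
    number: int       # recorded as part of the leader-election process
    master: str
    log: List[WriteRecord] = field(default_factory=list)

def total_order(terms: List[Term]) -> List[WriteRecord]:
    # Order between terms comes from the election record; order within a
    # term comes from the log itself. Replaying this sequence repairs exactly
    # the regions that were written, with no whole-file scan for differences.
    ordered = []
    for term in sorted(terms, key=lambda t: t.number):
        ordered.extend(sorted(term.log, key=lambda r: r.seq))
    return ordered
```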

There’s a lot more about the “what” and the “how” that I’ll leave for a later post, but that should do as a teaser while we move on to the flexibility and performance parts. In its most conservative default mode, the master forwards writes to all other replicas before performing them locally and doesn’t report success to the client until all writes are done. Either of those “all” parts can be relaxed to achieve better performance and/or asynchronous replication at some small cost in consistency.

  • First we have an “issue count” which might be from zero to N-1 (for N replicas). This is the number of non-leader replicas to which a write must be issued before the master issues it locally.

  • Second we have a “completion count” which might be from one to N. This is the number of writes that must be complete (including on the master) before success is reported to the client.

The defaults are Issue=N-1 and Completion=N for maximum consistency. At the other extreme, Issue=0 means that the master can issue its local write immediately and Completion=1 means it can report success as soon as one write – almost certainly that local one – completes. Any other copies are written asynchronously but in order. Thus, we have both sync and async replication under one framework, merely tweaking parameters that affect small parts of the implementation instead of having to use two completely different approaches. This is what “unified replication” in the talk is about.
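To make the two knobs concrete, here is an illustrative write-path sketch. The function names and structure are hypothetical, not NSR code; it only shows where an issue count and a completion count would come into play, from fully synchronous (Issue=N-1, Completion=N) to fully asynchronous (Issue=0, Completion=1).

```python
# Illustrative sketch only: hypothetical names, not NSR's actual write path.
import asyncio

async def send_to_replica(replica, write):
    await asyncio.sleep(0.01)        # stand-in for a network round trip
    return f"{replica}: done"

async def write_locally(write):
    await asyncio.sleep(0.005)       # stand-in for the master's local write
    return "master: done"

async def wait_for_n(tasks, n):
    # Return once any n of the given tasks have completed.
    for i, fut in enumerate(asyncio.as_completed(tasks), start=1):
        await fut
        if i >= n:
            return

async def replicate_write(write, replicas, issue_count, completion_count):
    # Issue the write to `issue_count` non-master replicas before the master
    # performs it locally; the rest are issued alongside the local write.
    # (Here "issued" just means the send task has been started.)
    first = [asyncio.create_task(send_to_replica(r, write)) for r in replicas[:issue_count]]
    await asyncio.sleep(0)           # let those sends get under way first
    rest = [asyncio.create_task(send_to_replica(r, write)) for r in replicas[issue_count:]]
    local = asyncio.create_task(write_locally(write))

    # Report success to the client once `completion_count` writes (local one
    # included) are done; any remaining copies complete asynchronously.
    await wait_for_n(first + rest + [local], completion_count)
    return "ok"

# Defaults for a 3-way replica set: issue to both others first, wait for all.
asyncio.run(replicate_write("w1", ["replica-1", "replica-2"],
                            issue_count=2, completion_count=3))
```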

OK, on to performance. The main difference here is that the client-fan-out model splits the client’s outbound bandwidth. If you have N replicas, a client with bandwidth BW can never achieve more than BW/N write throughput. In the chain/splay model, the client can use its full bandwidth while the master simultaneously forwards to the other replicas at BW/(N-1) of its own. This means increased throughput in most cases, and that’s not just theoretical: I’ve observed and commented on exactly that phenomenon in head-to-head comparisons with more than one alternative to GlusterFS. Yes, if enough clients gang up on a server, and if that server is provisioned similarly to the clients, and if we’re not disk-bound anyway, then its networking can become more of a bottleneck than it would be with the client-fan-out model. Likewise, the two-hop latency of this approach can be a problem for latency-sensitive and insufficiently parallel applications (remember that this is all within one replica set among many active simultaneously within a volume). However, these negative cases are much – much – less common in practice than the positive cases. We did have to sacrifice some unicorns, but the workhorses are doing fine.
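To put rough numbers on that, here is a back-of-the-envelope illustration using the formulas above, with an assumed 10 Gbit/s link on every node and a 3-way replica set (the link speed is purely illustrative).

```python
# Back-of-the-envelope illustration of the bandwidth argument above.
N = 3                      # replicas
LINK = 10.0                # Gbit/s on both client and servers (assumption)

# Client fan-out (AFR-style): the client's uplink is split N ways.
fan_out_limit = LINK / N                    # ~3.3 Gbit/s

# Chain/splay (NSR-style): the client uses its full uplink while the master
# forwards to the other N-1 replicas over its own link.
chain_limit = min(LINK, LINK / (N - 1))     # 5.0 Gbit/s

print(fan_out_limit, chain_limit)
```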

That’s the plan to (almost completely) eliminate the split-brain problems that have been the bane of our users’ existence, while also adding flexibility and improving performance in most cases. If you want to find out more, come to one of my many talks or find me online, and I’ll be glad to talk your ear off about the details.