all posts tagged Howtos


October 22, 2013

Keeping your VMs from going read-only when encountering a ping-timeout in GlusterFS

GlusterFS communicates over TCP. This allows for stateful handling of file descriptors and locks. If, however, a server fails completely (kernel panic, power loss, some idiot with a reset button...), the client will wait ping-timeout seconds (42 by default) before abandoning that TCP connection. This is important because re-establishing FDs and locks can be a very expensive operation. As glusterbot says in #gluster:

Allowing a longer time to reestablish connections is logical, unless you have servers that frequently die.

When you're hosting VM images on GlusterFS, that 42-second wait will cause the ext4 filesystems inside your VMs to error and remount read-only. You have two options:

  • Shorten the ping-timeout
    You can shorten the ping-timeout by setting the volume option network.ping-timeout
  • Change ext4's error behavior
    You can change ext4's error behavior with the mount option "errors=continue", or by changing the default in the superblock using tune2fs (see the examples below)
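For example (a minimal sketch; the volume name and guest device are hypothetical, and the ext4 commands are run inside the VM):

gluster volume set myvol network.ping-timeout 10   # shorten the timeout to 10 seconds
tune2fs -e continue /dev/vda1                      # change the superblock's default error behavior
mount -o remount,errors=continue /                 # or change it for the currently mounted filesystem
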
August 26, 2013

How To Check (And Fix) Indexes For Elasticsearch/Logstash

I had a computer with some bad RAM that created a corrupt index in Elasticsearch. After trying all weekend and half of Monday to figure out how to get Elasticsearch running again, with no response on the IRC channel, I eventually figured it out with the help of some obscure references buried in old mailing list threads.

Elasticsearch was failing to boot with no errors, critical or not, just some warnings:

[2013-08-23 21:31:32,527][WARN ][index.shard.service      ] [Iron Fist] [logstash-2013.08.24][3] Failed to perform scheduled engine refresh
[2013-08-23 21:43:59,264][WARN ][index.merge.scheduler    ] [Iron Fist] [logstash-2013.08.24][3] failed to merge
[2013-08-23 21:43:59,267][WARN ][index.engine.robin       ] [Iron Fist] [logstash-2013.08.24][3] failed engine
[2013-08-23 21:43:59,340][WARN ][cluster.action.shard     ] [Iron Fist] sending failed shard for [logstash-2013.08.24][3], node[TFt4zNl4QjWO9dyDhwnwkA], [P], s[STARTED], reason [engine failure, message [MergeException[java.lang.RuntimeException: Invalid vInt detected (too many bits)]; nested: RuntimeException[Invalid vInt detected (too many bits)]; ]]
[2013-08-23 21:43:59,340][WARN ][cluster.action.shard     ] [Iron Fist] received shard failed for [logstash-2013.08.24][3], node[TFt4zNl4QjWO9dyDhwnwkA], [P], s[STARTED], reason [engine failure, message [MergeException[java.lang.RuntimeException: Invalid vInt detected (too many bits)]; nested: RuntimeException[Invalid vInt detected (too many bits)]; ]]
[2013-08-24 04:47:10,230][WARN ][index.merge.scheduler    ] [Iron Fist] [logstash-2013.08.24][2] failed to merge
[2013-08-24 04:47:10,236][WARN ][index.engine.robin       ] [Iron Fist] [logstash-2013.08.24][2] failed engine
[2013-08-24 04:47:10,302][WARN ][cluster.action.shard     ] [Iron Fist] sending failed shard for [logstash-2013.08.24][2], node[TFt4zNl4QjWO9dyDhwnwkA], [P], s[STARTED], reason [engine failure, message [MergeException[java.lang.RuntimeException: Invalid vLong detected (negative values disallowed)]; nested: RuntimeException[Invalid vLong detected (negative values disallowed)]; ]]
[2013-08-24 04:47:10,302][WARN ][cluster.action.shard     ] [Iron Fist] received shard failed for [logstash-2013.08.24][2], node[TFt4zNl4QjWO9dyDhwnwkA], [P], s[STARTED], reason [engine failure, message [MergeException[java.lang.RuntimeException: Invalid vLong detected (negative values disallowed)]; nested: RuntimeException[Invalid vLong detected (negative values disallowed)]; ]]

The problem was actually in shard 0's index, which was never mentioned in the logs. The solution was to check the indexes with Lucene's CheckIndex tool (paths based on the install locations for the RPM provided from http://www.elasticsearch.org/download/):

ES_HOME=/usr/share/elasticsearch
ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-0.90.3.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*
INDEXPATH=/data/logstash/data/elasticsearch/nodes/0/indices/logstash-2013.08.24/0/index/
sudo -u logstash java -cp $ES_CLASSPATH -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $INDEXPATH

Once the problem index was identified, and if the fix it suggests is acceptable, run the command again with the "-fix" switch.
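For example, using the same variables as above (note that CheckIndex -fix drops any segments it can't read, so the documents in them are lost):

sudo -u logstash java -cp $ES_CLASSPATH -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $INDEXPATH -fix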

July 1, 2013

CentOS 6 upstart overrides: How to make something happen before something else

I have generic users that are logged in automatically since they're not allowed user prefs anyway, and it would be more of a hindrance in their environment.

I use puppet to manage a staging directory that has the home directory as I want it to be, and use tmpfs for the user's actual $HOME. In order for this to work, the staged default home needs to be copied onto the tmpfs mount before Xwindows starts.

All this means that I can't do it in rc.local, as that's run simultaneously with the prefdm upstart job, creating a race condition.

To resolve the problem, I learned about upstart override configuration. It turns out you can create a <jobname>.override and override any stanza in the packaged upstart job configuration. In my case I created the following:

/etc/init/prefdm.override
# homedir - sets up the home directory of the auto-logged-in user
#

pre-start script
	/usr/bin/puppet agent --onetime --no-daemonize >/var/log/prestart.log
	/usr/bin/rsync -av /home/thinguest.orig/ /home/thinguest/ >>/var/log/prestart.log
	/sbin/restorecon -rv /home/thinguest >> /var/log/prestart.log
end script

This overrides the non-existent pre-start stanza in prefdm.conf with that script, ensuring those things are done before /etc/X11/prefdm -nodaemon is executed.

This applies, of course, to all the EL6-based distros, and also to Ubuntu/Debian.

May 30, 2013

Don't Get Stuck Micro Engineering for Scale

Somebody today asked if GlusterFS could be made as fast as a local filesystem. My answer just came out without a ton of thought behind it, but I found it rather profound.

Comparing (any) clustered filesystem to a local filesystem is like comparing apples and orchards

You can reach up and quickly grab an apple, eat it, and its purpose is served. But then you look at the other apples in the orchard and they're not nearly as easy to use. If you wanted an apple from that tree over there, it might require considerable walking (increased latency). The aggregate performance of picking all the apples in the orchard will most certainly not be the same as reaching up and picking the apple on your local branch.

However...

If your goal is not to feed just yourself but a thousand people, you look at the ability to complete the whole job. If you had to feed them all from your local tree, just picking and distributing the apples from that one tree would take a very long time.

In the orchard, though, you could have them dispersed among a multitude of trees. Each person could reach up and pick an apple. The scaled performance would far exceed the performance of just one local tree.

Consider your total workload

The performance of one thread reading and writing one file is not going to be as fast as it would be on a local filesystem. But what about thousands of simultaneous file accesses? Millions? Scale must be thought of all at once. Don't get stuck micro engineering.

January 30, 2013

GlusterFS volumes not mounting in Debian Squeeze at boot time

Some users, with mixed results, have been reporting issues mounting GlusterFS volumes at boot time. I spun up a VM at Rackspace to see what I could see.

For my volume I used the following fstab entry. The host is defined in /etc/hosts:

server1:testvol /mnt/testvol glusterfs _netdev 0 0

The error listed in the client logs tells me that the fuse module isn't loaded when the volume tries to mount:

[2013-01-30 17:14:05.307253] E [mount.c:598:gf_fuse_mount] 0-glusterfs-fuse: cannot open /dev/fuse (No such file or directory)
[2013-01-30 17:14:05.307348] E [xlator.c:385:xlator_init] 0-fuse: Initialization of volume 'fuse' failed, review your volfile again

There are no logs with usable timestamps. The init scripts in /etc/rcS.d show that networking is being started before fuse. networking calls any scripts in /etc/network/if-up.d when the network comes up. Of these, the inaptly named mountnfs mounts all the fstab entries with _netdev set, using the command

mount -a -O_netdev

The fuse init script was designed with the expectation that all the remote filesystems should already be mounted (for the case of nfs mounted /usr). This means that it's scheduled after networking to allow those remote mounts to occur.

Solution

Since I don't really care if remote filesystems are mounted before the fuse module is loaded, I worked around this by editing /etc/init.d/fuse, replacing $remote_fs with $local_fs in the Required-Start line:

# Required-Start:    $local_fs

Then re-order the init processes:

update-rc.d fuse start 34 S . stop 41 0 6 .
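A quick way to confirm the resulting order (assuming the stock Squeeze rcS.d layout):

ls /etc/rcS.d | grep -E 'fuse|networking'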

PS:

People often ask us to document troubleshooting steps. Because it's not supposed to fail, there are seldom fixed troubleshooting steps. If there were, we'd file bug reports and get them fixed.

Here's the process I used:

Check the client log. That's actually one that's documented everywhere. If something goes wrong, check the log.

Fuse isn't loaded. Where's it supposed to get loaded from? I'm out of my expertise with Debian, so I ran grep fuse /etc/init.d/* to see what all might have an effect. Looks like /etc/init.d/fuse is it.

fuse's Default-Start is "S", so I looked in /etc/rcS.d and saw the boot order. Thinking that mountnfs.sh (S17mountnfs.sh) was the likely script that was supposed to mount the gluster volume, I manually moved fuse earlier in the start order (mv S19fuse S16fuse). Rebooting still didn't mount the volume.

I decided to see for sure where the mount was being triggered, so in /sbin/mount.glusterfs I added "ps axf >>/tmp/mounttimeps". Rebooted.

Looking in my new file I saw:

  103 hvc0     Ss+    0:00 init boot 
  104 hvc0     S+     0:00  \_ /bin/sh /etc/init.d/rc S
  107 hvc0     S+     0:00      \_ startpar -p 4 -t 20 -T 3 -M boot -P N -R S
  399 hvc0     S      0:00          \_ startpar -p 4 -t 20 -T 3 -M boot -P N -R S
  400 hvc0     S      0:00              \_ /bin/sh -e /etc/init.d/networking start
  402 hvc0     S      0:00                  \_ ifup -a
  490 hvc0     S      0:00                      \_ /bin/sh -c run-parts  /etc/network/if-up.d
  491 hvc0     S      0:00                          \_ run-parts /etc/network/if-up.d
  492 hvc0     S      0:00                              \_ /bin/sh /etc/network/if-up.d/mountnfs
  502 hvc0     S      0:00                                  \_ mount -a -O _netdev
  503 hvc0     S      0:00                                      \_ /bin/sh /sbin/mount.glusterfs server1:testvol /mnt/testvol -o rw,_netdev

This pretty clearly showed that "networking" was responsible for causing the mount attempt. Since networking clearly happens before $remote_fs, I changed the requirements and reordered. The new order in /etc/rcS.d showed that fuse was going to start before networking and subsequent reboots proved that to work correctly.

I'll be working with the package maintainer for gluster-client to see if a proper solution can be implemented.

January 25, 2013

Linked List Topology with GlusterFS

Here's a nice post about creating a linked list topology for a distributed-replicated setup. The idea is that it's easier to add a single server to a replicated volume if you spend a bit of extra time prepping a linked list of bricks. The default topology would have left the author needing to add a pair of servers at a time:

The drawback to this setup is when servers are added, they must be added in pairs. You cannot have an odd number of servers in this topology.

Read the post to learn more about how (and why) he implemented a linked list topology.

October 17, 2012

Replacing a GlusterFS Server: Best Practice

Last month, I received two new servers to replace two of our three (replica 3) GlusterFS servers. My first inclination was to just down the server, move the hard drives into the new server, re-install the OS (moving from 32 bit to 64 bit), and voila, done deal. It probably would have been okay if I hadn't used a kickstart file that formatted all the drives. Oops. Since the drives were now blank, I decided to just put the new server in place, using the same peer UUID, and let it self-heal everything back over.

This idea sucked. I have 15 volumes, and 4 bricks per server. Self-healing 60 bricks brought the remaining 32 bit server to its knees (and I filed multiple bugs against 3.3.0, including that the load for self-heal doesn't balance between the healthy servers). After a day (luckily I don't have that much data) of having everyone in the company mad at me, the heal was completed and I was a bit wiser.

Today I installed the other new server. I installed CentOS 6.3, created the LVs (I use LVM to partition the disks, which makes resizing volumes easier should the need arise and lets me take snapshots before any major changes), and added one new hard drive (my drives aren't that old; no need to replace them all).

I then added the new server to the trusted pool and used replace-brick to migrate one brick at a time to the new server. I also changed my placement of bricks to fit our newer best-practices.

volname=home                            # the volume this brick belongs to; repeat these steps for each volume
oldserver=ewcs4
newserver=ewcs10
oldbrickpath=/var/spool/glusterfs/a_home
newbrickpath=/data/glusterfs/home/a
gluster peer probe $newserver
gluster volume replace-brick ${volname} ${oldserver}:${oldbrickpath} ${newserver}:${newbrickpath} start

I monitored the migration.

watch gluster volume replace-brick ${volname} ${oldserver}:${oldbrickpath} ${newserver}:${newbrickpath} status

Then committed the change after all the files were finished moving.

gluster volume replace-brick ${volname} ${oldserver}:${oldbrickpath} ${newserver}:${newbrickpath} commit

Repeat as necessary.

As for performance, it met my requirements: nobody calling or emailing me to say that anything's not working or is too slow. My VMs continued without interruption, as did MySQL - both hosted on their own volumes. As long as nobody noticed, I'm happy.

October 8, 2012

How to expand GlusterFS replicated clusters by one server

This has come up several times in the last week. "I have 2n servers with 2 or 4 bricks each and I want to add 1 more server. How do I ensure the new server isn't a replica of itself?"

This isn't a simple thing to do. When you add bricks, they are grouped into replica sets of replica-count bricks in the order they're listed. For "replica 2", every 2 new bricks added form a replica pair (see the example below). To prevent your two new bricks on the new server from being replicas of each other, you'll need to move an old brick to the new server first. This is done with the replace-brick command.
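To make the pairing rule concrete, here's a hypothetical create command; with replica 2, each consecutive pair of bricks on the command line forms one replica set:

# pair 1: server1:/data/brick1 + server2:/data/brick1
# pair 2: server1:/data/brick2 + server2:/data/brick2
gluster volume create myvol replica 2 server1:/data/brick1 server2:/data/brick1 server1:/data/brick2 server2:/data/brick2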

Replica Expansion Diagram

So first we move server1:/data/brick2 to server3:/data/brick2

volname=myvol1
from=server1:/data/brick2
to=server3:/data/brick2
gluster volume replace-brick $volname $from $to start

Monitor for completion

watch gluster volume replace-brick $volname $from $to status

Once it's completed, commit the change

gluster volume replace-brick $volname $from $to commit

Check your data to ensure it's all working right. If not, panic! Well, I suppose you could come join us in the IRC channel to help you figure out why, but it really should just work. First thing we're going to tell you is to check the logs, so might as well do that too.

Ok, now that your data's all moved, your volume is completely operational and all its circuits are functioning perfectly, you're ready to add your two new bricks.

Save yourself some time and just format the brick store that's mounted at server1:/data/brick2. You'll have to wipe it and its xattrs anyway, so that's much quicker.
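A minimal sketch of re-formatting the freed brick, assuming it's an xfs filesystem on a hypothetical logical volume that's already in fstab:

umount /data/brick2
mkfs.xfs -f -i size=512 /dev/vg0/brick2    # -f overwrites the old filesystem, taking the data and xattrs with it
mount /data/brick2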

gluster volume add-brick $volname server1:/data/brick2 server3:/data/brick1
gluster volume rebalance $volname start

And you're all set.

September 17, 2012

Howto: Using UFO (swift) — A Quick Setup Guide

This sets up a GlusterFS Unified File and Object (UFO) server on a single node (single brick) Gluster server using the RPMs contained in my YUM repo at http://repos.fedorapeople.org/repos/kkeithle/glusterfs/. This repo contains RPMs for Fedora 16, Fedora 17, and RHEL 6. Alternatively you may use the glusterfs-3.4.0beta1 RPMs from the GlusterFS YUM repo at http://download.gluster.org/pub/gluster/glusterfs/qa-releases/3.4.0beta1/

1. Add the repo to your system. See the README file there for instructions. N.B. If you’re using CentOS or some other RHEL clone you’ll want (need) to add the Fedora EPEL repo — see http://fedoraproject.org/wiki/EPEL.

2. Install glusterfs and UFO (remember to enable the new repo first):

  • glusterfs-3.3.1 or glusterfs-3.4.0beta1 on Fedora 17 and Fedora 18: `yum install glusterfs glusterfs-server glusterfs-fuse glusterfs-swift glusterfs-swift-account glusterfs-swift-container glusterfs-swift-object glusterfs-swift-proxy glusterfs-ufo`
  • glusterfs-3.4.0beta1 on Fedora 19, RHEL 6, and CentOS 6: `yum install glusterfs glusterfs-server glusterfs-fuse openstack-swift openstack-swift-account openstack-swift-container openstack-swift-object openstack-swift-proxy glusterfs-ufo`

3. Start glusterfs:

  • On Fedora 17, Fedora 18: `systemctl start glusterd.service`
  • On Fedora 16 or RHEL 6: `service glusterd start`
  • On CentOS 6.x: `/etc/init.d/glusterd start`

4. Create a glusterfs volume:
`gluster volume create $myvolname $myhostname:$pathtobrick`

5. Start the glusterfs volume:
`gluster volume start $myvolname`

6. Create a self-signed cert for UFO:
`cd /etc/swift; openssl req -new -x509 -nodes -out cert.crt -keyout cert.key`

7. Fix up some files in /etc/swift:

  • `mv swift.conf-gluster swift.conf`
  • `mv fs.conf-gluster fs.conf`
  • `mv proxy-server.conf-gluster proxy-server.conf`
  • `mv account-server/1.conf-gluster account-server/1.conf`
  • `mv container-server/1.conf-gluster container-server/1.conf`
  • `mv object-server/1.conf-gluster object-server/1.conf`
  • `rm {account,container,object}-server.conf`

8. Configure UFO (edit /etc/swift/proxy-server.conf):

  • add your cert and key to the [DEFAULT] section:
    bind_port = 443
    cert_file = /etc/swift/cert.crt
    key_file = /etc/swift/cert.key
  • add one or more users of the gluster volume to the [filter:tempauth] section:
    user_$myvolname_$username=$password .admin
  • add the memcache address to the [filter:cache] section:
    memcache_servers = 127.0.0.1:11211

9. Generate builders:
`/usr/bin/gluster-swift-gen-builders $myvolname`

10. Start memcached:

  • On Fedora 17: `systemctl start memcached.service`
  • On Fedora 16 or RHEL 6: `service memcached start`
  • On CentOS 6.x: `/etc/init.d/memcached start`

11. Start UFO:

`swift-init main start`

» This has bitten me more than once. If you ssh -X into the machine running swift, it’s likely that sshd will already be using ports 6010, 6011, and 6012, and will collide with the swift processes trying to use those ports «
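A quick way to see whether anything is already listening on those ports before starting swift:

netstat -tlnp | grep -E ':601[012] '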

12. Get authentication token from UFO:
`curl -v -H 'X-Storage-User: $myvolname:$username' -H 'X-Storage-Pass: $password' -k https://$myhostname:443/auth/v1.0`
(authtoken similar to AUTH_tk2c69b572dd544383b352d0f0d61c2e6d)
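If you'd rather capture the token into a shell variable than copy it by hand, something like this works (tempauth returns the token in the X-Auth-Token response header):

authtoken=$(curl -s -i -k -H "X-Storage-User: $myvolname:$username" -H "X-Storage-Pass: $password" \
    https://$myhostname:443/auth/v1.0 | awk '/^X-Auth-Token:/ {print $2}' | tr -d '\r')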

13. Create a container:
`curl -v -X PUT -H 'X-Auth-Token: $authtoken' https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername -k`

14. List containers:
`curl -v -X GET -H 'X-Auth-Token: $authtoken' https://$myhostname:443/v1/AUTH_$myvolname -k`

15. Upload a file to a container:

`curl -v -X PUT -T $filename -H 'X-Auth-Token: $authtoken' -H 'Content-Length: $filelen' https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k`
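One way to fill in $filelen (GNU stat):

filelen=$(stat -c %s "$filename")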

16. Download the file:

`curl -v -X GET -H 'X-Auth-Token: $authtoken' https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k > $filename`

More information and examples are available from

=======================================================================

N.B. We (Red Hat, Gluster) generally recommend using xfs for brick volumes; or if you’re feeling brave, btrfs. If you’re using ext4 be aware of the ext4 issue* and if you’re using ext3 make sure you mount it with -o user_xattr.

* http://joejulian.name/blog/glusterfs-bit-by-ext4-structure-change/
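For reference, creating an xfs brick filesystem typically looks something like this; the larger inode size leaves room for GlusterFS's extended attributes (the device name is hypothetical):

mkfs.xfs -i size=512 /dev/vg0/brick1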

August 31, 2012

GlusterFS replication do's and don'ts

GlusterFS spreads load using a distribute hash translation (DHT) of filenames to its subvolumes. Those subvolumes are usually replicated to provide fault tolerance as well as some load handling. The advanced file replication translator (AFR) departs from the traditional understanding of RAID and often causes confusion (especially when marketing people try to call it RAID to make it easier to sell). Hopefully, this should help clear that up.

When should I use replication?

Fault Tolerance

The traditional filesystem handled fault tolerance with RAID. Often, that storage was shared between servers using NFS to allow multiple hosts to access the same files. This would leave the design with several single points of failure: the server CPU, power supply, RAID controller, NIC, motherboard, memory, software, the network cable, and the switch/router.

Traditionally, many web host implementations overcame this limitation by using some eventually consistent method (i.e. rsync on a cron job) to keep a copy of the entire file tree on the local storage of every server. Though this is a valid option, it comes with some limitations: wasted disk space is becoming an increasing problem as more and more data is collected and analyzed, synchronization failures sometimes go unchecked, user-provided content is not immediately available on every server, and so on.

Replacing the fault tolerance of RAID with replication allows you to spread that vulnerability across 2 or more complete servers. With multipath routing using two or more network cards, you can eliminate most single points of failure between your client and your data.

This works because the client connects directly to every server in the volume1. If a server or network connection goes down, the client will continue to operate with the remaining server or servers (depending on the level of replication). When the missing server returns, the self-heal daemon or, if you access a stale file, your own client will update the stale server with the current data.

The downside to this method is that, in order to assure consistency and that the client is not getting stale data, it needs to request metadata from each replica. This is done during the lookup() portion of establishing a file descriptor (FD). This can be a lot of overhead if you're opening thousands of small files, or even if you're trying to open thousands of files that don't exist. This is what makes most PHP applications slow on GlusterFS.

Load Balancing

By having your files replicated between multiple servers, in a large-file read-heavy environment such as a streaming music or video provider, you have the ability to spread those large reads among multiple replicas. The replica translator works on a first-to-respond basis, so if a specific file becomes popular enough that it starts to cause load on one server, the less loaded server will respond first, ensuring optimal performance2 for the end-user.

What's a poor way to use replication?

A copy on every server

Some admins, stuck in the thought process that if it's not on the server it won't be available when a server goes down, increase the number of replicas with each new server. This increases the number of queries and responses necessary to complete the lookup and open an FD, which makes it take even longer. Additionally, writes go to every replica simultaneously, so your write bandwidth is divided by the number of replicas.

You probably don't really want that behavior. What you want is to have the file available to every server despite hardware failure. This does require some prediction. Decide what your likelihood of simultaneous failure is between your replicas. If you think that out of N machines, you're only likely to have 1 machine down at any one time, then your replica count should be 2. If you consider 2 simultaneous failures to be likely you only need to have replica 3.

When determining failure probability, look at your system as a whole. If you have 100 servers and you predict that you might have as many as 3 failures at one time, what's the likelihood that all three servers will be part of the same replica set?

Most admins that I've spoken with use a simple replica count of 2. Ed Wyse & Co, Inc. has fewer servers so a replica count of 3 is warranted as the likelihood of simultaneous failures within the replica set is higher.

Across high-latency connections

GlusterFS is latency dependent. Since self-heal checks are done when establishing the FD, and the client connects to all the servers in the volume simultaneously, high-latency (multi-zone) replication is not normally advisable. Each lookup will query both sides of the replica. A simple directory listing of 100 files across a 200ms connection would require 400 round trips, totaling 80 seconds. A single Drupal page could take around 20 minutes.

To replicate read-only data3 across zones, use geo-replication.

Summary

Do

  • Use replica 2 or 3 for most cases

  • Replicate across servers to minimize SPOF

Don't

  • Require that your clients are also servers (they may be, but that should be a decision that's independently made)

  • Replicate to every server just to ensure data availability

  • Replicate across zones

These are, of course, general practices. There are reasons to break these rules, but in doing so you'll find other complications. Like every bit of advice I offer on this blog, feel free to break the rules but know why you're breaking the rules.

1 Only when using the FUSE client
2 There are further improvements slated for this routine to more actively spread the load
3 As of 3.3, geo-replication is one-way