Sunday, June 10, 2012

Through the Rabbit Hole - RabbitMQ Clustering


Paulo Coelho - Waiting is painful. Forgetting is painful. But not knowing which to do is the worse kind of suffering.


RabbitMQ is a messaging broker. Someone publishes messages to a queue and 170px-Alice_par_John_Tenniel_02someone else consumes them and acts upon them. You can read all about it on the interwebs. It's the ultimate decoupling mechanism, which is your best ally in the fight against complexity. There are several things I like about RabbitMQ:
  1. It's fast
  2. It's cross-platform
  3. It supports multiple workflows (work queues, RPC, pub-sub)
  4. It has a nice admin interface (web UI, REST API, command-line)
  5. It has a vibrant community
RabbitMQ is designed to scale out. It supports clustering of multiple nodes. You can use it to distribute the load across multiple nodes and have redundancy. There is a very good clustering guide here, but managing and administering a RabbitMQ cluster is anything but trivial.
In this post I'll take you on a journey through all the problems, issues and gotchas I ran into and how I solved them. I put up a public GitHub project called elmer with a full-fledged cluster control and administration script that embodies all the lessons I learned.

Background

Some context first - I'm a member of the services team at Roblox. We decided to use RabbitMQ to provide a buffer for an internal SOLR-based search service (and other services in the future). The idea is to queue update requests coming from bazillion web site front ends and the service (distributed on multiple boxes) can consume and process them when it feels like it.
I was responsible to create a somewhat generic component that the publishers (web site instances) and the consumers (search service instances) can both use. Our glorious Ops team provisioned Three Red Hat Enterprise boxes and installed Erlang and a recent version of RabbitMQ (2.8.2) on all boxes.

The Goal

My goal was to create a simple cluster that consists of two disc nodes and one RAM node. As long as one disc node is up your cluster configuration (exchanges, queues, bindings) persists and you can freely add/remove nodes to/from your cluster (although if you don't use highly available queues you might lose messages in queues on crashed nodes). I had already installed RabbitMQ locally and ran a bunch of tests on it. I read through the clustering guide and everything seemed very straight forward. I happily followed the step by step instructions and guess what? It worked! No problem whatsoever! So, what is this post about? Couldn't I just twit the link to the lustering guide and be done with it? Well, I didn't stop there. I wanted to test what happens when things go south. What happens if part of the cluster is not up? What happens if the entire cluster is down? What if it's too slow?. To do that I needed a way to remotely control the cluster, put it in various broken states and resurrect it for the next test.

How to Build a Cluster

I'm somewhat of an old Python hand, so I went for a Fabric/Python based solution. Fabric is a Python library and command-line tool for talking to remote machines over SSH. I created a project on GitHub called elmer. It’s a fabfile that also doubles up as a regular Python module that lets you fully manage and administer a RabbitMQ cluster. I will not discuss Python, Fabric or the specifics of Elmer in this post. The code is pretty short and reasonably documented.
Let's start with the ingredients:
  1. 3 remote Unix boxes (you can try VMs) accessible via SSH
  2. Latest Erlang and RabbitMQ installed in the default location and configuration
  3. The management plugin (http://www.rabbitmq.com/management.html)
  4. Python 2.6+ (for the Python rabbitmqadmin script) 
  5. The management command-line tool (http://www.rabbitmq.com/management-cli.html)
You talk to RabbitMQ through 3 scripts: rabbitmq-server, rabbitmqctl and rabbitmqadmin. Don't ask me why three and not one. They are used correspondingly to start a RabbitMQ (launch it), to control it (stop, reset, cluster nodes together and get status) and to manage it (declare vhosts, users, exchanges and queues). Creating a cluster involves just rabbitmq-server and rabbitmqctl. Suppose the hosts are called rabbit-1, rabbit-2 and rabbit3. Here is the official transcript:
On each host start the RabbitMQ server (starts both the Erlang VM and the RabbitMQ application if the node is down):
rabbit-1$ rabbitmq-server -detached
rabbit-2$ rabbitmq-server -detached
rabbit-3$ rabbitmq-server -detached

This works perfectly on a pristine cluster with all the nodes down. However, if the nodes were already part of a cluster then the last disc node to go down must be started first because all the other nodes look up to it as the ultimate truth of the cluster state. This is important if you try to recover a cluster with existing state. However, if you want to do a hard reset and just build your cluster from scratch it is a major PITA. In theory, restarted nodes will suspend for 30 seconds waiting for the last disc node to start. In practice... you know what they say about practice and theory (also who wants to wait 30 seconds?). I had to learn it the hard way, because the error messages are not great. I will tell you later how to restart your cluster nodes safely.

Moving on... it's time to cluster your nodes together. You pick one node and keep it running (let's say rabbit-1). Then you stop the RabbitMQ application on another node (let's say rabbit-2), reset it and cluster it to rabbit-1:

rabbit-2$ rabbitmqctl stop_app
rabbit-2$ rabbitmqctl reset
rabbit-2$ rabbitmqctl cluster rabbit@rabbit-1

You can repeat the same procedure for rabbit-3. This will create a cluster where rabbit-1 is the only disc node and rabbit-2 and rabbit-3 are RAM nodes. If you want rabbit-2 to be a disc node too then add it at the end of the cluster command for both rabbit-2 and rabbit-3:

rabbit-2$ rabbitmqctl cluster rabbit@rabbit-1 rabbit@rabbit-2
rabbit-3$ rabbitmqctl cluster rabbit@rabbit-1 rabbit@rabbit-2

That's all. Pretty simple, right? Right, except if you want to change your cluster configuration. You’ll have to use surgical precision when adding and removing nodes from the cluster. What happens if a node is not restarted yet, but you try to go on with stop_app, reset and start_app? Well, the stop_app command will ostensibly succeed returning "done." even if the target node is down. However, the subsequent reset command will fail with a nasty . I spent a lot of time scratching my head trying to figure it out, because I assumed the problem is some configuration option that affects only reset.

Another gotcha is that if you want to reset the last disc node you have to use force_reset. Trying to figure out in the general case what node was the last disc node is not trivial.

RabbitMQ also supports clustering via configuration files. This is great when your disc nodes are up, because restarted RAM nodes will just cluster based on the config file without having to cluster them explicitly. Again, it doesn't fly when you try to recover a broken cluster.

Reliable RabbitMQ clustering

It comes down to this: You don't know what was the last disc node to go down. You don't know the clustering metadata of each node (may it went down while doing reset). To start all the nodes I use the following algorithm:
  • Start all nodes (at least the last disc node should be able to start)
  • If not even a single node can start you're hosed. just bail out.
  • Keep track of all nodes that failed to start
  • Try to start all the failed nodes
  • If some nodes failed to start the second time you're hosed. just bail out.
This algorithm will work as long as your last disc node is physically Ok.

Once all the cluster nodes are up you can re-configure them (remember you are not sure what is clustering metadata of each node). The key is to force_reset EVERY node. This ensures that any trace of previous cluster configuration is erased from all nodes. First do it for one disc node:

stop_app
force_reset
start_app

Then for every other node (either disc or RAM):

stop_app
force_reset
cluster [list of disc nodes]
start_app

Controlling a Cluster Remotely

You can log in to every box and perform the abovementioned steps on each box manually. That works, but it gets old really fast. Also, it is impractical if you want to build and tear down a cluster as part of an automated test, which is exactly what I needed. I wanted to make sure the components that publish and consume messages can handle node and/or cluster shutdown.

The solution I picked was Fabric. One serious gotcha I ran into is that when I performed the build cluster algorithm manually it worked perfectly, but when I used Fabric it failed mysteriously. After some debugging I noticed that the nodes start successfully, but by the time I try to stop_app the nodes are down. This turned out to be a Fabric newbie mistake on my part. When you issue a remote command using Fabric it starts a new shell on the remote machine. When the command is finished the shell is closed sending a SIGHUP (Hang up signal) to all its sub-processes including the Erlang node. One of my co-workers figured it out and suggested to use nohup to let the node keep running.

Administering a Cluster Programmatically

Administration means creating virtual hosts, users, exchanges and queues, setting permissions and binding queues to exchanges. The first thing you should do if you didn't already is install the management plugins. I'm not sure why you have to enable it yourself. It should be enabled by default. The web UI is fantastic and you should definitely familiarize yourself with it. However, to administer a cluster remotely there is an REST-full management API () you can use. There is also a Python command-line tool called rabbitmqadmin that requires Python 2.6. Using rabbitmqadmin is pretty simple. The only issue I found is that I could use only the default guest account to administer the cluster. I created another administrator user called admin, set its permissions to all (configure/read/write) and gave it a tag of "administrator" (additional requirement of the management API), but I kept getting permission errors. I'm still looking into this one.

The Elmer project allows you to specify a cluster configuration as a Python data structure (see the sample_config.py) and will set up everything for you.

What's Next?

In a future post I'll talk about setting up a local cluster on Windows. Local clusters are awesome for testing. The step by step instructions on the RabbitMQ web site are for Unix and there are some subtle points when you try to do it on Windows.

Maybe I’ll also talk about the RabbitMQ client-side and how to write a robust client (either a producer or consumer) that can survive temporary cluster failures.

Take Home Points:

1. RabbitMQ is cool
2. The cluster admin story is not air-tight
3. Programmatic administration is key
4. Fabric is an awesome tool to remotely control multiple Unix boxes

3 comments:

  1. Nice! The clustering management business is a bit brittle at the moment, but we're working right now to make it much nicer.

    ReplyDelete
  2. I'm glad you like it. I'm looking forward to your improvements.

    ReplyDelete
  3. Nice use case. We are also using Rabbit for our project with celery. We are looking forward to distribute the tasks to multiple machines.

    Your experience is really helpful to us.

    Thank you,
    Haridas N.

    ReplyDelete