Managing Your RabbitMQ Cluster

RabbitMQ is a great distributed message broker, but it is not so easy to administer programmatically. In this tutorial I’ll show you how to create a cluster, add and remove nodes, and start and stop them. As a bonus I’ll share a Fabric file that lets you take total control. The code is available on GitHub.

Quick Introduction to RabbitMQ

RabbitMQ is a very popular message queue. You can have multiple producers sending messages, and consumers can consume these messages in a totally decoupled way. It owes its popularity to several things:

It’s fast and robust.
It’s open source, but there is commercial support if you want it.
It runs on all major operating systems.
It is actively developed.
It is battle tested.

RabbitMQ is implemented in Erlang, which is a bit unusual, but is one of the reasons it is so reliable.

Prerequisites

For the purpose of this tutorial I’ll use a local Vagrant cluster of three nodes. If you already have three available machines (virtual or not), you may use them instead. Pay attention to the ports and networking.

Install VirtualBox

Follow the instructions to install VirtualBox.

Install Vagrant

Follow the instructions to install Vagrant.

Create a RabbitMQ Cluster

Here is a Vagrantfile that will create a local three-node cluster on your machine. The OS is Ubuntu 14.04 (Trusty).


    # -*- mode: ruby -*-
    # vi: set ft=ruby :

    # Node names mapped to their private-network IP addresses.
    hosts = {
      "rabbit-1" => "192.168.77.10",
      "rabbit-2" => "192.168.77.11",
      "rabbit-3" => "192.168.77.12"
    }

    Vagrant.configure("2") do |config|
      config.vm.box = "trusty64"
      hosts.each_with_index do |(name, ip), i|
        # Each node gets its own host ports for AMQP (5672) and the management UI (15672).
        rmq_port = 5672 + i
        admin_port = 15672 + i
        config.vm.define name do |machine|
          # Use the per-machine object here; settings on `config` would apply to every VM.
          machine.vm.network :private_network, ip: ip
          machine.vm.hostname = name
          machine.vm.network :forwarded_port, guest: 5672, guest_ip: ip, host: rmq_port
          machine.vm.network :forwarded_port, guest: 15672, guest_ip: ip, host: admin_port
          machine.vm.provider "virtualbox" do |v|
            v.name = name
          end
        end
      end
    end

To create an empty cluster, type: vagrant up.

Configuring SSH

To make it easy to ssh into the cluster nodes, type: vagrant ssh-config >> ~/.ssh/config.

If you type: cat ~/.ssh/config, you should see entries for rabbit-1, rabbit-2 and rabbit-3.

Now you can ssh into each virtual machine by name: ssh rabbit-1.

Make Sure the Nodes Are Reachable by Name

The easiest way is to edit the /etc/hosts file. For example, for rabbit-1 add the addresses of rabbit-2 and rabbit-3.

    192.168.77.11 rabbit-2
    192.168.77.12 rabbit-3

Repeat the process for all nodes.
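If you would rather generate these entries than type them, here is a minimal Python sketch. The `NODES` mapping and the `hosts_entries` helper are hypothetical names of mine; the addresses simply mirror the Vagrantfile above.

```python
# Hypothetical helper: node names and addresses mirror the Vagrantfile.
NODES = {
    "rabbit-1": "192.168.77.10",
    "rabbit-2": "192.168.77.11",
    "rabbit-3": "192.168.77.12",
}


def hosts_entries(node, nodes=NODES):
    """Return the /etc/hosts lines that `node` needs to reach its peers."""
    return [
        "{} {}".format(ip, name)
        for name, ip in sorted(nodes.items())
        if name != node  # a node does not need an entry for itself
    ]


for node in sorted(NODES):
    print("# append to /etc/hosts on " + node)
    print("\n".join(hosts_entries(node)))
```

The script only prints the lines; appending them to /etc/hosts on each machine is still up to you or your configuration management tool.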

Install RabbitMQ

I will use apt-get here for Debian/Ubuntu operating systems. If your cluster runs on a different OS, please follow the instructions on the RabbitMQ installation page.

Note that sometimes a rather out-of-date version of RabbitMQ is available by default. If you want to install the latest and greatest, you may download a .deb package directly or add RabbitMQ’s apt-repository, using these instructions.

The current version of RabbitMQ on Ubuntu 14.04 is 3.2, which is good enough for our purposes. Verify for yourself by typing: apt-cache show rabbitmq-server.

Let’s go ahead and install it on each machine:

    sudo apt-get update
    sudo apt-get install rabbitmq-server -y

Feel free to use your favorite configuration management tool like Chef or Ansible if you prefer.

Note that Erlang will be installed first as a prerequisite.

Enable the RabbitMQ Management Plugin

The management plugin is really cool. It gives you an HTTP-based API as well as a web GUI and a command-line tool to manage the cluster. Here is how to enable it:

    sudo rabbitmq-plugins enable rabbitmq_management

Get the Management Command-Line Tool

Download it from http://192.168.77.10:15672/cli/rabbitmqadmin. Note that the RabbitMQ documentation is incorrect and tells you to download from http://:15672/cli/.

This is a Python-based HTTP client for the RabbitMQ management HTTP API. It is very convenient for scripting RabbitMQ clusters.

Basic RabbitMQ Concepts

RabbitMQ implements the AMQP 0.9.1 standard (Advanced Message Queuing Protocol). Note that there is already an AMQP 1.0 standard and RabbitMQ has a plugin to support it, but it is considered a prototype due to insufficient real-world use.

In the AMQP model, publishers send messages to a message broker (RabbitMQ is the message broker in this case) via an exchange. The message broker distributes the messages to queues based on metadata associated with the message. Consumers consume messages from queues. Messages may or may not be acknowledged. RabbitMQ supports a variety of programming models on top of these concepts such as work queues, publish-subscribe and RPC.
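To make the routing idea concrete, here is a toy in-memory model of a direct exchange. It is not a RabbitMQ client and does not talk to a broker; the class name, queue names and routing keys are all invented for illustration.

```python
from collections import deque


class ToyDirectExchange:
    """In-memory stand-in for a direct exchange; not a RabbitMQ client."""

    def __init__(self):
        self.bindings = {}  # routing key -> list of bound queues

    def bind(self, queue, routing_key):
        self.bindings.setdefault(routing_key, []).append(queue)

    def publish(self, routing_key, body):
        # Copy the message to every queue bound with a matching key.
        queues = self.bindings.get(routing_key, [])
        for queue in queues:
            queue.append(body)
        return len(queues)  # number of queues that received the message


exchange = ToyDirectExchange()
orders, audit = deque(), deque()
exchange.bind(orders, "order.created")
exchange.bind(audit, "order.created")

delivered = exchange.publish("order.created", "order #1")
unrouted = exchange.publish("order.cancelled", "order #2")
```

The publisher never names a queue. It only supplies a routing key, and the bindings decide which queues, if any, receive a copy.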

Managing Your Cluster

Three scripts are used to manage the cluster. rabbitmq-server starts a RabbitMQ server. rabbitmqctl controls the cluster (stop, reset, cluster nodes together and get status). rabbitmqadmin, which you downloaded earlier, configures and administers the cluster (declare vhosts, users, exchanges and queues). Creating a cluster involves just rabbitmq-server and rabbitmqctl.

First, let’s start the rabbitmq-server as a service (daemon) on each of our hosts rabbit-1, rabbit-2 and rabbit-3.

    sudo service rabbitmq-server start

This will start both the Erlang VM and the RabbitMQ application if the node is down. To verify it is running properly, type:

    sudo rabbitmqctl cluster_status

The output should be (for rabbit-1):


    Cluster status of node 'rabbit@rabbit-1' ...
    [{nodes,[{disc,['rabbit@rabbit-1']}]},
     {running_nodes,['rabbit@rabbit-1']},
     {partitions,[]}]
    ...done.    

This means the node is not clustered with any other nodes yet and it is a disc node. It is also running, as you can see from its presence in the running_nodes list.

To stop the RabbitMQ application (the Erlang VM keeps running), issue the following command:

    sudo rabbitmqctl stop_app

Then if you check the cluster status:

    sudo rabbitmqctl cluster_status

The output should be:

    Cluster status of node 'rabbit@rabbit-1' ...
    [{nodes,[{disc,['rabbit@rabbit-1']}]}]
    ...done.

No more running nodes.

You can repeat the process for the other nodes (rabbit-2 and rabbit-3) and see that they know only themselves.
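When scripting, it helps to extract the running nodes from this output instead of reading it by eye. The following is a rough sketch that assumes the Erlang-term format shown above; other RabbitMQ versions may print the status differently, and `running_nodes` is a hypothetical helper name.

```python
import re


def running_nodes(status_output):
    """Extract node names from the running_nodes entry of cluster_status output.

    Returns an empty list when the entry is absent, as it is after stop_app.
    """
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", status_output, re.DOTALL)
    if match is None:
        return []
    # Node names are single-quoted Erlang atoms, e.g. 'rabbit@rabbit-1'.
    return re.findall(r"'([^']+)'", match.group(1))


STARTED = """Cluster status of node 'rabbit@rabbit-1' ...
[{nodes,[{disc,['rabbit@rabbit-1']}]},
 {running_nodes,['rabbit@rabbit-1']},
 {partitions,[]}]
...done."""

STOPPED = """Cluster status of node 'rabbit@rabbit-1' ...
[{nodes,[{disc,['rabbit@rabbit-1']}]}]
...done."""

print(running_nodes(STARTED))
print(running_nodes(STOPPED))
```

A regular expression is enough for this one field. A script that needs the whole structure would be better served by a proper Erlang-term parser.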

The Erlang Cookie

Before you can create a cluster, all the nodes in the cluster must have the same cookie. The cookie is a file that the Erlang runtime uses to authenticate nodes to each other. It is located at /var/lib/rabbitmq/.erlang.cookie. Just copy the contents from rabbit-1 to rabbit-2 and rabbit-3.

Clustering Nodes Together

To group these separate nodes into a cohesive cluster takes some work. Here is the procedure:

Have a single node running (e.g. rabbit-1).
Stop another node (e.g. rabbit-2).
Reset the stopped node (rabbit-2).
Join the stopped node to the running node.
Start the stopped node.

Let’s do this. ssh into rabbit-2 and run the following commands:

    sudo rabbitmqctl stop_app
    sudo rabbitmqctl reset
    sudo rabbitmqctl join_cluster rabbit@rabbit-1

Now type: sudo rabbitmqctl cluster_status.

The output should be:


    Cluster status of node 'rabbit@rabbit-2' ...
    [{nodes,[{disc,['rabbit@rabbit-1','rabbit@rabbit-2']}]}]
    ...done.

As you can see, both nodes are now clustered. If you repeat this on rabbit-1, you'll get the following output:

    Cluster status of node 'rabbit@rabbit-1' ...
    [{nodes,[{disc,['rabbit@rabbit-1','rabbit@rabbit-2']}]},
     {running_nodes,['rabbit@rabbit-1']},
     {partitions,[]}]
    ...done. 

Now, you can start rabbit-2.

    sudo rabbitmqctl start_app

If you check the status again, both nodes will be running:


    Cluster status of node 'rabbit@rabbit-2' ...
    [{nodes,[{disc,['rabbit@rabbit-1','rabbit@rabbit-2']}]},
     {running_nodes,['rabbit@rabbit-1','rabbit@rabbit-2']},
     {partitions,[]}]
    ...done.   

Note that both nodes are disc nodes, which means they store their metadata on disc. Let’s add rabbit-3 as a RAM node. ssh to rabbit-3 and issue the following commands:


    sudo rabbitmqctl stop_app
    sudo rabbitmqctl reset
    sudo rabbitmqctl join_cluster --ram rabbit@rabbit-2
    sudo rabbitmqctl start_app

Checking the status shows:


    Cluster status of node 'rabbit@rabbit-3' ...
    [{nodes,[{disc,['rabbit@rabbit-2','rabbit@rabbit-1']},
             {ram,['rabbit@rabbit-3']}]},
     {running_nodes,['rabbit@rabbit-1','rabbit@rabbit-2','rabbit@rabbit-3']},
     {partitions,[]}]
    ...done.

All cluster nodes are running. The disc nodes are rabbit-1 and rabbit-2, and the RAM node is rabbit-3.

Congratulations! You have a working RabbitMQ cluster.
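The join sequence used for rabbit-2 and rabbit-3 is the same apart from the `--ram` flag, so it is easy to generate. Here is a small sketch; `join_commands` is a hypothetical helper, and it only builds the argument lists rather than running them.

```python
def join_commands(seed_node, ram=False):
    """Return the rabbitmqctl calls that join the local node to seed_node's cluster."""
    join = ["sudo", "rabbitmqctl", "join_cluster"]
    if ram:
        join.append("--ram")  # join as a RAM node instead of a disc node
    join.append("rabbit@" + seed_node)
    return [
        ["sudo", "rabbitmqctl", "stop_app"],
        ["sudo", "rabbitmqctl", "reset"],
        join,
        ["sudo", "rabbitmqctl", "start_app"],
    ]


for command in join_commands("rabbit-2", ram=True):
    print(" ".join(command))
```

On the node being joined, each list could be passed to `subprocess.check_call` so that a failing step stops the sequence.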

Real-World Complications

What happens if you want to change your cluster configuration? You’ll have to use surgical precision when adding and removing nodes from the cluster.

What happens if a node is not restarted yet, but you try to go on with stop_app, reset and start_app? Well, the stop_app command will ostensibly succeed, returning “done.” even if the target node is down. However, the subsequent reset command will fail with a nasty message. I spent a lot of time scratching my head trying to figure it out, because I assumed the problem was some configuration option that affected only reset.

Another gotcha is that if you want to reset the last disc node, you have to use force_reset. Trying to figure out in the general case which node was the last disc node is not trivial.

RabbitMQ also supports clustering via configuration files. This is great when your disc nodes are up, because restarted RAM nodes will just cluster based on the config file without you having to cluster them explicitly. Again, it doesn’t fly when you try to recover a broken cluster.

Reliable RabbitMQ Clustering

It comes down to this: You don’t know which was the last disc node to go down. You don’t know the clustering metadata of each node (maybe it went down while doing reset). To start all the nodes, I use the following algorithm:

Start all nodes (at least the last disc node should be able to start).
If not even a single node can start, you’re hosed. Just bail out.
Keep track of all nodes that failed to start.
Try to start all the failed nodes.
If some nodes failed to start the second time, you’re hosed. Just bail out.

This algorithm will work as long as your last disc node is physically OK.
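Here is one way to sketch that algorithm in Python. The `start_node` callable is an assumption: you supply it (for example, a wrapper around a remote `service rabbitmq-server start`), and it must return True when the node came up.

```python
def start_cluster(nodes, start_node):
    """Start every node, retrying the failures once.

    start_node(node) is a caller-supplied callable (hypothetical here) that
    returns True when the node started. Returns the nodes that needed a
    second attempt.
    """
    # First pass: try every node and remember the ones that failed.
    failed = [node for node in nodes if not start_node(node)]
    if len(failed) == len(nodes):
        raise RuntimeError("no node could start; bailing out")

    # Second pass: nodes that were waiting for the last disc node get another try.
    still_failed = [node for node in failed if not start_node(node)]
    if still_failed:
        raise RuntimeError("nodes failed twice: " + ", ".join(still_failed))
    return failed
```

The second pass is what makes the order of the node list irrelevant. Whichever node was the last disc node comes up in the first pass, and the nodes that depended on it succeed in the second.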

Once all the cluster nodes are up, you can re-configure them (remember, you are not sure what the clustering metadata of each node is). The key is to force_reset every node. This ensures that any trace of previous cluster configuration is erased from all nodes. First do it for one disc node:

    stop_app
    force_reset
    start_app

Then for every other node (either disc or RAM):


    stop_app
    force_reset
    join_cluster [list of disc nodes]
    start_app
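The two recipes above can be combined into a single ordered plan. In this sketch, `rebuild_plan` is a hypothetical helper, and joining every node to the first disc node (rather than passing a list of disc nodes) is my simplifying assumption.

```python
def rebuild_plan(disc_nodes, ram_nodes=()):
    """Return an ordered list of (node, rabbitmqctl subcommands) pairs.

    Assumption: the first disc node acts as the seed and every other node
    joins it.
    """
    if not disc_nodes:
        raise ValueError("at least one disc node is required")
    seed = disc_nodes[0]
    # The seed is reset and restarted on its own, forming a one-node cluster.
    plan = [(seed, ["stop_app", "force_reset", "start_app"])]
    for node in list(disc_nodes[1:]) + list(ram_nodes):
        join = "join_cluster rabbit@" + seed
        if node in ram_nodes:
            join = "join_cluster --ram rabbit@" + seed
        plan.append((node, ["stop_app", "force_reset", join, "start_app"]))
    return plan


for node, steps in rebuild_plan(["rabbit-1", "rabbit-2"], ["rabbit-3"]):
    print(node + ": " + ", ".join(steps))
```

The plan is a list rather than a dictionary because order matters: the seed must be running before any other node tries to join it.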

Controlling a Cluster Remotely

You can SSH into every box and perform the above-mentioned steps on each box manually. That works, but it gets old really fast. Also, it is impractical if you want to build and tear down a cluster as part of an automated test.

One solution is to use Fabric. One serious gotcha I ran into is that when I performed the build cluster algorithm manually it worked perfectly, but when I used Fabric it failed mysteriously. After some debugging I noticed that the nodes started successfully, but by the time I tried to stop_app, the nodes were down. This turned out to be a Fabric newbie mistake on my part. When you issue a remote command using Fabric, it starts a new shell on the remote machine. When the command is finished, the shell is closed, sending a SIGHUP (Hang up signal) to all its sub-processes, including the Erlang node. Using nohup takes care of that. Another more robust option is to run RabbitMQ as a service (daemon).
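For the nohup route, the remote command has to be detached from the shell that Fabric opens. Here is a minimal sketch of a wrapper that builds such a command string; `detached` is a hypothetical helper, and the redirections are the usual shell idiom rather than anything specific to RabbitMQ.

```python
def detached(command):
    """Wrap a shell command so it survives the closing of its parent shell.

    nohup makes the process ignore SIGHUP; redirecting stdin, stdout and
    stderr stops it from holding the remote session open.
    """
    return "nohup {} > /dev/null 2>&1 < /dev/null &".format(command)


print(detached("rabbitmq-server"))
```

With Fabric 1.x, the resulting string would typically be passed to `run` or `sudo` with `pty=False`. Treat that as an assumption to check against the Fabric version you use.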

Administering a Cluster Programmatically

Administration means creating virtual hosts, users, exchanges and queues, setting permissions, and binding queues to exchanges. The first thing you should do, if you haven’t already, is enable the management plugin. I’m not sure why you have to enable it yourself; it should be enabled by default.

The web UI is fantastic and you should definitely familiarize yourself with it. However, to administer a cluster remotely there is a RESTful management API you can use. There is also a Python command-line tool called rabbitmqadmin that requires Python 2.6+. Using rabbitmqadmin is pretty simple. The only issue I found is that I could use only the default guest account to administer the cluster. I created another administrator user called ‘admin’, set its permissions to all (configure/read/write) and gave it a tag of “administrator” (additional requirement of the management API), but I kept getting permission errors.
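If you want to call the management API directly, the standard library is enough. The sketch below is written for Python 3 and builds, without sending, the request that grants a user full permissions on a vhost. The function name and the example credentials are placeholders of mine; the `/api/permissions/<vhost>/<user>` path and the configure/write/read body follow the management plugin's HTTP API as I understand it.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request


def permissions_request(base_url, vhost, user, auth_user, auth_password):
    """Build (without sending) a PUT that grants `user` full rights on `vhost`."""
    url = "{}/api/permissions/{}/{}".format(
        base_url.rstrip("/"),
        quote(vhost, safe=""),  # the default vhost "/" must be sent as %2F
        quote(user, safe=""),
    )
    body = json.dumps({"configure": ".*", "write": ".*", "read": ".*"})
    credentials = "{}:{}".format(auth_user, auth_password).encode("utf-8")
    return Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        },
    )


request = permissions_request(
    "http://192.168.77.10:15672", "/", "admin", "guest", "guest"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the URL encoding easy to check before touching a live cluster.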

The Elmer project allows you to specify a cluster configuration as a Python data structure (see the sample_config.py) and will set up everything for you.

Take-Home Points

RabbitMQ is cool.
The cluster admin story is not air-tight.
Programmatic administration is key.
Fabric is an awesome tool to remotely control multiple Unix boxes.