Saturday, June 16, 2012

Silver Bullets - Interface-Based Programming

COM is Love - Don Box

This is another post in the Silver Bullets series. This series presents a set of best practices, design and implementation principles to combat software complexity.

Interface-based programming takes information hiding and applies it to object-oriented programming. In this post I'll talk a lot about C++, but it applies to other statically typed programming languages like Java and C#. I will not discuss dynamically typed languages today.

In the Beginning there was COM...

It all started (for me) in the previous millennium. I was coding desktop applications on Windows and using the awesome COM (Component Object Model) technology. I liked the concept of component-based software. I was fascinated by the strong boundaries a COM object established and by the precision that went into declaring the interface in a separate language (MIDL). It worked really well, but it was all very mysterious to me. I didn't understand exactly what was going on or how it all mapped to the C++ concepts I was familiar with.

When I joined a fresh startup armed with my beloved COM concepts, I pushed very hard to convert a complex code base to a clean COM-based design. After a lot of persuasion I managed to get most of the developers on board and we stopped development for a week to refactor the code base. Exhausted yet proud, I unveiled the shiny new component-based design to my fellow developers. Then I tried to build it and all hell broke loose. My über-sophisticated refactoring didn't play very well with the Visual Studio build system, and what I got was strange build errors galore. I got error messages from build steps I didn't even know existed. It took me and another developer two more weeks of terrible hacks to get to a stable state. Note that it was totally my fault. The COM technology itself was flawless and the Visual Studio build system did exactly what it was supposed to do.

The Enlightenment

Then I got my hands on the excellent book "Essential COM" by Don Box and started reading. Chapter 1 opened my eyes to a new world. In it, Don Box explained the problem COM was trying to solve and basically developed COM from scratch in plain C++. No exotic CoCreateInstance() calls, no strange thread apartments. The basic idea is to physically separate interface from implementation. When you inherit from a concrete class in C++, or call a method through a pointer or reference to a concrete class, your code must be linked against that concrete class and, in cascading fashion, against all the concrete classes it knows about (which often means the whole world).

Check out the following code for a fictional yet very realistic example. Class A wants to call the foo() method of B. It gets a pointer to a concrete instance of B and as a result now depends on B's implementation, which depends on a factory, a database and some fancy service. Each one of them probably depends on many other classes.

class B
{
public:
    void foo()
    {
        // B depends on a factory, a database and a fancy service
        DB& db = Factory::GetDB();
        db.UpdateFooTable();
        FancyService::ExecuteFancyAction();
    }
};

class A
{
public:
    // A is now linked against B and, transitively, against everything B uses
    A(B* b) { b->foo(); }
};

Consider an alternative interface-based design. Instead of a pointer to a concrete instance of B, the A class now accepts a pointer to an interface IB. The B class implements IB, but A has no idea and doesn't have to be linked with B and its dependencies. At runtime the B class may be loaded from a dynamic/shared library and passed to A.

struct IB
{
    virtual void foo() = 0;
};

class A
{
public:
    // A knows only the IB interface, not any concrete implementation
    A(IB* b) { b->foo(); }
};

C++ doesn't have real interfaces, but it's got pure virtual functions, which can be used to the same effect. I declare C++ interfaces as structs that contain only pure virtual functions. The reason I use a struct and not a class is that struct members are public by default, so it saves me the trouble of writing "public:" and it also communicates the fact that everything here is public. I prefer not to specify a destructor, because object lifetime management usually doesn't belong on an interface. Remember, the calling code doesn't know anything about the implementation behind the interface, so it shouldn't know when or how to destroy it. I will discuss object lifetime management in a future post.

What Else Is Cool About Interfaces? 

Interfaces enable many other important best practices like dependency injection, mock-based testing, the strategy pattern, interception and granular access to selected features.

Dependency Injection


Dependency injection sounds scary and I actually prefer the term Third Party Binding. What it means is that a component can't instantiate its own dependencies or even look them up via some lookup service. This way the component is really decoupled from the implementation of its dependencies, and once tested you know it will function as designed regardless of the environment it operates in.

// The "third party" (e.g. main or some bootstrap code) creates B and hands it to A
IB* b = new B();
A* a = new A(b);

Mock-Based Testing


Interface-based programming makes it possible to thoroughly test a component by replacing the concrete dependencies with mock objects that implement the same interfaces. This is powerful stuff. It allows you to simulate ANYTHING: database failures, out of memory, network failures, network delays, other buggy components, accelerated time, etc. Let's say we want to test a class that fetches some data over the network through an interface called IDataFetcher. To make the example realistic, the interface will be asynchronous and return an IFutureData interface that has an IsReady() method the caller can query.

struct IFutureData
{
    virtual bool IsReady() = 0;
    virtual std::string GetError() = 0;
    virtual const std::string& GetData() = 0;
};

struct IDataFetcher
{
   virtual IFutureData& FetchData(std::string url) = 0;
};

This is a real-world dependency that's really difficult to test with a concrete DataFetcher, but with the following mock object it becomes a piece of cake.

struct MockDataFetcher :
    public IDataFetcher,
    public IFutureData
{
    // IDataFetcher implementation
    IFutureData& FetchData(std::string url) { return *this; }

    // IFutureData implementation
    bool IsReady() { return isReady; }
    std::string GetError() { return error; }
    const std::string& GetData() { return data; }

    // Knobs that let the test control what the object under test sees
    bool isReady;
    std::string error;
    std::string data;
};

Suppose the object under test is an analyzer that aggregates data coming in from the data fetcher:

class DataAnalyzer
{
public:
    DataAnalyzer(IDataFetcher& fetcher);
    AnalysisResult Analyze();
};

You want to test the Analyze() method, which fetches data through the IDataFetcher. By providing the MockDataFetcher and setting its isReady, error and data members, you can fully control what the DataAnalyzer encounters when it tries to fetch data.
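
For example, a test might look roughly like this (a sketch only: it assumes the DataAnalyzer declared above, leaves AnalysisResult abstract just like the declaration does, and uses made-up names; plug in your own assertion framework):

void TestAnalyzeSurvivesFetchErrors()
{
    MockDataFetcher fetcher;
    fetcher.isReady = true;                      // the "future" resolves immediately...
    fetcher.error = "simulated network failure"; // ...but with an error
    fetcher.data = "";

    DataAnalyzer analyzer(fetcher);
    AnalysisResult result = analyzer.Analyze();
    // Assert here that the analyzer handled the failure gracefully
    // (e.g. returned an empty result instead of crashing).
}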

Strategy Pattern


The strategy pattern is a common design pattern that shows up in many large systems. The gist of it is that some algorithm embedded in a bigger process has several alternative implementations and you want to select the algorithm dynamically. For example, suppose you write a data compression class and, depending on the nature of the data to encode (text, audio, video), you want to select a proper compression algorithm. Have all compression algorithms implement a simple unified interface like the following:

struct ICompressionAlgorithm
{
    virtual void* compress(void* buff) = 0;
};

Then you can have a compressor class that performs all the common work for any type of data but differs only in the compression algorithm it uses:

class Compressor
{
public:
    Compressor(ICompressionAlgorithm& algorithm);
    void* Compress(std::string filename);
private:
    ICompressionAlgorithm& _algorithm;
};

Now you can implement different algorithms for the different types of data and instantiate the single Compressor class with different concrete compression algorithms:

class TextCompressionAlgorithm : public ICompressionAlgorithm { ... };
class VideoCompressionAlgorithm : public ICompressionAlgorithm { ... };
class AudioCompressionAlgorithm : public ICompressionAlgorithm { ... };

And finally, instantiate dedicated compressor objects for different types of data:

// Instantiate a concrete compression algorithm
VideoCompressionAlgorithm vca;
// Instantiate a compressor, passing it the compression strategy
Compressor videoCompressor(vca);

Remember to be diligent about who's in charge of cleaning up all these objects. In this case I use a reference for the strategy, which puts the burden on the code that instantiates the Compressor class.

Interception


Interception is another interesting use case where interfaces shine. Interception allows you to intercept (duh!) method calls and do something before/after each call. Aspect-oriented programming is all about interception, of course, but normally people associate it with special frameworks like PostSharp, whole languages like AspectJ, or at least language features like decorators in Python.

But this is a lot of magic, and many people don't like magic. With interfaces you can implement explicit interception by wrapping the object that does the actual work with a wrapper that exposes the same interface, performs some actions before the call, calls the main object, and then performs some actions after the call.

The classic example is logging. Suppose you want to log every call to a certain class. You could of course add the logging code inside the class methods, but that would require the target class to be aware of the log (is it a file? print to the console? send something over the network?) or to accept a logger object that knows all about that. But even a logger object may be too complicated and would still require modifying the target class. Here, interception via an interface comes to the rescue:

struct ILogger
{
    virtual void Log(const std::string& message) = 0;
};

struct IWorker
{
    virtual void DoWork() = 0;
};

class RealWorker : public IWorker
{
public:
    void DoWork() { /* do some serious work here */ }
};

class LoggingWorker : public IWorker
{
public:
    LoggingWorker(IWorker& worker, ILogger& logger) :
        _worker(worker), _logger(logger) {}

    void DoWork()
    {
        _logger.Log("Before DoWork()");
        _worker.DoWork();
        _logger.Log("After DoWork()");
    }

private:
    IWorker& _worker;
    ILogger& _logger;
};

By passing a LoggingWorker as an IWorker to code that used a RealWorker before, you get logging without modifying either the RealWorker or the calling code. This works particularly well if you have a WorkerFactory that instantiates workers in one central place, but that's outside the scope of this post.
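
To make it concrete, here is a minimal usage sketch (ConsoleLogger is a made-up ILogger implementation):

ConsoleLogger logger;    // hypothetical ILogger implementation
RealWorker realWorker;
LoggingWorker loggingWorker(realWorker, logger);

// Client code only sees IWorker and gets logging for free
IWorker& worker = loggingWorker;
worker.DoWork();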

More about COM and interfaces

COM goes much further than the C++ interfaces I discussed here. It is a binary component technology, usable in-process and out-of-process (and, to some flaky degree, even across machines using DCOM). It takes threading into account and it is interoperable across languages. It is very powerful, but it can be very complicated to use and requires run-time support as well as some tooling to use effectively. All these capabilities were fine-tuned later and served as the foundations of the .NET framework. Interface-based programming is just one facet of COM, but it is the single most important one.

COM also served as the inspiration for XPCOM, a cross-platform COM implementation used as the foundation of Mozilla/Firefox and other successful applications like ActiveState Komodo.

Take Home Points:

1. COM is/was cool
2. Interface-based programming is a mechanism for information hiding and loose coupling
3. Interfaces are the most important design feature for in-process component-based architecture
4. Model C++ interfaces as structs of pure virtual functions WITHOUT a destructor.

Sunday, June 10, 2012

Through the Rabbit Hole - RabbitMQ Clustering


Paulo Coelho - Waiting is painful. Forgetting is painful. But not knowing which to do is the worst kind of suffering.


RabbitMQ is a messaging broker. Someone publishes messages to a queue and someone else consumes them and acts upon them. You can read all about it on the interwebs. It's the ultimate decoupling mechanism, which is your best ally in the fight against complexity. There are several things I like about RabbitMQ:
  1. It's fast
  2. It's cross-platform
  3. It supports multiple workflows (work queues, RPC, pub-sub)
  4. It has a nice admin interface (web UI, REST API, command-line)
  5. It has a vibrant community
RabbitMQ is designed to scale out. It supports clustering of multiple nodes. You can use it to distribute the load across multiple nodes and have redundancy. There is a very good clustering guide here, but managing and administering a RabbitMQ cluster is anything but trivial.
In this post I'll take you on a journey through all the problems, issues and gotchas I ran into and how I solved them. I put up a public GitHub project called elmer with a full-fledged cluster control and administration script that embodies all the lessons I learned.

Background

Some context first - I'm a member of the services team at Roblox. We decided to use RabbitMQ to provide a buffer for an internal SOLR-based search service (and other services in the future). The idea is to queue update requests coming from a bazillion web site front ends so the service (distributed on multiple boxes) can consume and process them when it feels like it.
I was responsible for creating a somewhat generic component that both the publishers (web site instances) and the consumers (search service instances) can use. Our glorious Ops team provisioned three Red Hat Enterprise Linux boxes and installed Erlang and a recent version of RabbitMQ (2.8.2) on all of them.

The Goal

My goal was to create a simple cluster that consists of two disc nodes and one RAM node. As long as one disc node is up, your cluster configuration (exchanges, queues, bindings) persists and you can freely add/remove nodes to/from your cluster (although if you don't use highly available queues you might lose messages in queues on crashed nodes). I had already installed RabbitMQ locally and run a bunch of tests on it. I read through the clustering guide and everything seemed very straightforward. I happily followed the step-by-step instructions and guess what? It worked! No problem whatsoever! So, what is this post about? Couldn't I just tweet the link to the clustering guide and be done with it? Well, I didn't stop there. I wanted to test what happens when things go south. What happens if part of the cluster is not up? What happens if the entire cluster is down? What if it's too slow? To do that I needed a way to remotely control the cluster, put it in various broken states and resurrect it for the next test.

How to Build a Cluster

I'm somewhat of an old Python hand, so I went for a Fabric/Python-based solution. Fabric is a Python library and command-line tool for talking to remote machines over SSH. I created a project on GitHub called elmer. It's a fabfile that also doubles as a regular Python module and lets you fully manage and administer a RabbitMQ cluster. I will not discuss Python, Fabric or the specifics of elmer in this post. The code is pretty short and reasonably documented.
Let's start with the ingredients:
  1. 3 remote Unix boxes (you can try VMs) accessible via SSH
  2. Latest Erlang and RabbitMQ installed in the default location and configuration
  3. The management plugin (http://www.rabbitmq.com/management.html)
  4. Python 2.6+ (for the Python rabbitmqadmin script) 
  5. The management command-line tool (http://www.rabbitmq.com/management-cli.html)
You talk to RabbitMQ through three scripts: rabbitmq-server, rabbitmqctl and rabbitmqadmin. Don't ask me why three and not one. They are used, respectively, to start a RabbitMQ node (launch it), to control it (stop, reset, cluster nodes together and get status) and to administer it (declare vhosts, users, exchanges and queues). Creating a cluster involves just rabbitmq-server and rabbitmqctl. Suppose the hosts are called rabbit-1, rabbit-2 and rabbit-3. Here is the official transcript:
On each host start the RabbitMQ server (starts both the Erlang VM and the RabbitMQ application if the node is down):
rabbit-1$ rabbitmq-server -detached
rabbit-2$ rabbitmq-server -detached
rabbit-3$ rabbitmq-server -detached

This works perfectly on a pristine cluster with all the nodes down. However, if the nodes were already part of a cluster, then the last disc node to go down must be started first, because all the other nodes look to it as the ultimate truth about the cluster state. This is important if you try to recover a cluster with existing state. However, if you want to do a hard reset and just build your cluster from scratch, it is a major PITA. In theory, restarted nodes will suspend for 30 seconds waiting for the last disc node to start. In practice... you know what they say about theory and practice (also, who wants to wait 30 seconds?). I had to learn it the hard way, because the error messages are not great. I will tell you later how to restart your cluster nodes safely.

Moving on... it's time to cluster your nodes together. You pick one node and keep it running (let's say rabbit-1). Then you stop the RabbitMQ application on another node (let's say rabbit-2), reset it and cluster it to rabbit-1:

rabbit-2$ rabbitmqctl stop_app
rabbit-2$ rabbitmqctl reset
rabbit-2$ rabbitmqctl cluster rabbit@rabbit-1

You can repeat the same procedure for rabbit-3. This will create a cluster where rabbit-1 is the only disc node and rabbit-2 and rabbit-3 are RAM nodes. If you want rabbit-2 to be a disc node too then add it at the end of the cluster command for both rabbit-2 and rabbit-3:

rabbit-2$ rabbitmqctl cluster rabbit@rabbit-1 rabbit@rabbit-2
rabbit-3$ rabbitmqctl cluster rabbit@rabbit-1 rabbit@rabbit-2

That's all. Pretty simple, right? Right, except if you want to change your cluster configuration. You'll have to use surgical precision when adding and removing nodes from the cluster. What happens if a node is not restarted yet, but you try to go on with stop_app, reset and start_app? Well, the stop_app command will ostensibly succeed, returning "done.", even if the target node is down. However, the subsequent reset command will fail with a nasty error. I spent a lot of time scratching my head trying to figure it out, because I assumed the problem was some configuration option that affects only reset.

Another gotcha is that if you want to reset the last disc node you have to use force_reset. Figuring out, in the general case, which node was the last disc node is not trivial.

RabbitMQ also supports clustering via configuration files. This is great when your disc nodes are up, because restarted RAM nodes will just cluster based on the config file without having to cluster them explicitly. Again, it doesn't fly when you try to recover a broken cluster.

Reliable RabbitMQ clustering

It comes down to this: you don't know which disc node was the last to go down, and you don't know the clustering metadata of each node (maybe it went down in the middle of a reset). To start all the nodes I use the following algorithm:
  • Start all nodes (at least the last disc node should be able to start)
  • If not even a single node can start, you're hosed; just bail out
  • Keep track of all the nodes that failed to start
  • Try to start the failed nodes again
  • If some nodes fail to start the second time, you're hosed; just bail out
This algorithm works as long as your last disc node is physically OK. A Python sketch follows.
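
Here is a minimal Python sketch of that two-pass start loop. The run_on() helper is made up for illustration (it just shells out over SSH); the real elmer fabfile does this with Fabric.

import subprocess

def run_on(host, cmd):
    # Hypothetical helper: run a shell command on a remote host over SSH.
    # Returns True on success. elmer uses Fabric for this instead.
    return subprocess.call(['ssh', host, cmd]) == 0

def start_all_nodes(hosts):
    start_cmd = 'rabbitmq-server -detached'
    failed = [h for h in hosts if not run_on(h, start_cmd)]
    if len(failed) == len(hosts):
        raise RuntimeError('Not even a single node could start. Bailing out.')
    # Second pass: nodes that were waiting for the last disc node may start now
    still_failed = [h for h in failed if not run_on(h, start_cmd)]
    if still_failed:
        raise RuntimeError('Nodes failed to start twice: %s' % ', '.join(still_failed))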

Once all the cluster nodes are up you can re-configure them (remember, you are not sure what the clustering metadata of each node is). The key is to force_reset EVERY node. This ensures that any trace of the previous cluster configuration is erased from all nodes. First do it for one disc node:

stop_app
force_reset
start_app

Then for every other node (either disc or RAM):

stop_app
force_reset
cluster [list of disc nodes]
start_app
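
For example, with the rabbit-1, rabbit-2 and rabbit-3 hosts from earlier (keeping rabbit-1 and rabbit-2 as the disc nodes), the sequence would look roughly like this:

rabbit-1$ rabbitmqctl stop_app
rabbit-1$ rabbitmqctl force_reset
rabbit-1$ rabbitmqctl start_app

rabbit-2$ rabbitmqctl stop_app
rabbit-2$ rabbitmqctl force_reset
rabbit-2$ rabbitmqctl cluster rabbit@rabbit-1 rabbit@rabbit-2
rabbit-2$ rabbitmqctl start_app

rabbit-3$ rabbitmqctl stop_app
rabbit-3$ rabbitmqctl force_reset
rabbit-3$ rabbitmqctl cluster rabbit@rabbit-1 rabbit@rabbit-2
rabbit-3$ rabbitmqctl start_app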

Controlling a Cluster Remotely

You can log in to every box and perform the steps above on each box manually. That works, but it gets old really fast. It is also impractical if you want to build and tear down a cluster as part of an automated test, which is exactly what I needed. I wanted to make sure the components that publish and consume messages can handle node and/or cluster shutdown.

The solution I picked was Fabric. One serious gotcha I ran into is that when I performed the build-cluster algorithm manually it worked perfectly, but when I used Fabric it failed mysteriously. After some debugging I noticed that the nodes started successfully, but by the time I tried to stop_app, the nodes were down. This turned out to be a Fabric newbie mistake on my part. When you issue a remote command using Fabric, it starts a new shell on the remote machine. When the command is finished the shell is closed, sending a SIGHUP (hang-up signal) to all its sub-processes, including the Erlang node. One of my co-workers figured it out and suggested using nohup to let the node keep running.

Administering a Cluster Programmatically

Administration means creating virtual hosts, users, exchanges and queues, setting permissions and binding queues to exchanges. The first thing you should do, if you haven't already, is install the management plugin. I'm not sure why you have to enable it yourself; it should be enabled by default. The web UI is fantastic and you should definitely familiarize yourself with it. However, to administer a cluster remotely there is a RESTful management API you can use. There is also a Python command-line tool called rabbitmqadmin that requires Python 2.6. Using rabbitmqadmin is pretty simple. The only issue I found is that I could use only the default guest account to administer the cluster. I created another administrator user called admin, set its permissions to all (configure/read/write) and gave it a tag of "administrator" (an additional requirement of the management API), but I kept getting permission errors. I'm still looking into this one.
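
To give a flavor, here are a few typical rabbitmqadmin invocations (a sketch only - the exchange and queue names are made up, and you should check the exact flags against your rabbitmqadmin version):

# List the queues on a specific node, authenticating explicitly
rabbitmqadmin -H rabbit-1 -u guest -p guest list queues

# Declare an exchange, a durable queue and a binding between them
rabbitmqadmin declare exchange name=search type=fanout
rabbitmqadmin declare queue name=search-updates durable=true
rabbitmqadmin declare binding source=search destination=search-updates destination_type=queue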

The Elmer project allows you to specify a cluster configuration as a Python data structure (see the sample_config.py) and will set up everything for you.

What's Next?

In a future post I'll talk about setting up a local cluster on Windows. Local clusters are awesome for testing. The step by step instructions on the RabbitMQ web site are for Unix and there are some subtle points when you try to do it on Windows.

Maybe I’ll also talk about the RabbitMQ client-side and how to write a robust client (either a producer or consumer) that can survive temporary cluster failures.

Take Home Points:

1. RabbitMQ is cool
2. The cluster admin story is not air-tight
3. Programmatic administration is key
4. Fabric is an awesome tool to remotely control multiple Unix boxes