With the rise of the cloud, in particular AWS, how does ‘traditional’ IBM MQ High Availability (HA) compare against its AWS alternatives—and is there still a point in using the traditional methods in the cloud?
I'll explore this topic in this post.
Traditional MQ High Availability Methods
Multi-Instance Queue Managers
When people explore IBM MQ High Availability, the first thing they’re likely to come across is ‘Multi-Instance queue managers’—the active/standby setup that relies on a network file system with lease-based file locking. This works well but does come with some overhead; for example, managing client reconnection and ensuring the NFS store is available and performant.
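As a sketch of how this is set up (the paths, queue manager name and mount points are illustrative, not prescriptive), the queue manager is created on the shared NFS mount and started with the `-x` flag on each node:

```shell
# Node A: create the queue manager with its data and logs on the shared NFS mount
crtmqm -md /MQHA/qmgrs -ld /MQHA/logs QM1
strmqm -x QM1              # becomes the active instance

# Node B: make the queue manager known locally, then start the standby
addmqinf -s QueueManager -v Name=QM1 -v Directory=QM1 \
         -v Prefix=/var/mqm -v DataPath=/MQHA/qmgrs/QM1
strmqm -x QM1              # becomes the standby instance

# Either node: check which instance is active and which is standby
dspmq -x -m QM1
```

On the client-reconnection side, applications typically connect with the automatic reconnect option and a connection name list covering both hosts, so a standby takeover is transparent to them.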
Replicated Data Queue Managers (RDQM)
With the release of IBM MQ v9.1 (Long Term Support; first introduced in the v9.0.4 Continuous Delivery release), IBM introduced Replicated Data Queue Managers (RDQM)—a high availability solution for Linux platforms that uses three servers in an HA group, each with a queue manager (three nodes are used to avoid split-brain scenarios).
One server runs the primary queue manager and synchronously replicates its data to the two secondary queue managers. In the event of a failure, one of the secondary instances takes over as primary.
RDQM also comes with the optional addition of a floating IP address, shared between the three nodes, which goes a long way towards simplifying client reconnection. Three servers is often two servers too many in a cloud environment, where we are looking to minimise infrastructure cost; however, it does allow for a near-100%-available solution for those situations that require it.
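To give a feel for the setup (hostnames and addresses below are placeholders), each node in the HA group carries the same `rdqm.ini` describing the three nodes and their replication interfaces:

```ini
# /var/mqm/rdqm.ini — illustrative three-node HA group definition
Node:
  Name=mqha1.example.com
  HA_Replication=10.0.1.10
Node:
  Name=mqha2.example.com
  HA_Replication=10.0.1.11
Node:
  Name=mqha3.example.com
  HA_Replication=10.0.1.12
```

From there, `rdqmadm -c` configures the HA group on each node and `crtmqm -sx QM1` creates the replicated queue manager; the optional floating IP is then attached to the queue manager so clients always target one address.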
The AWS Alternative
Auto Scaling EC2 Instances/EFS
A popular HA configuration for IBM MQ within AWS is to run Amazon EC2 instances in an Auto Scaling group, with Amazon Elastic File System (EFS) as the underlying file storage for the queue managers’ data. Auto Scaling group policies are configured to ensure the correct number of queue managers are running, and can even be set to span multiple Availability Zones (AZs) for extra resiliency. EFS provides automatic replication across multiple AZs, and the ability to grow and shrink on demand, as built-in features.
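In practice this means each instance mounts the same EFS file system over the MQ data directory at boot, typically from the Auto Scaling launch configuration's user data. A minimal sketch, assuming the `amazon-efs-utils` mount helper and a placeholder file system ID:

```shell
# Install the EFS mount helper and mount the file system over the MQ data path
# (fs-0123456789abcdef0 is a placeholder file system ID)
sudo yum install -y amazon-efs-utils
sudo mount -t efs -o tls fs-0123456789abcdef0:/ /var/mqm

# Or persist the mount across reboots via /etc/fstab:
# fs-0123456789abcdef0:/ /var/mqm efs _netdev,tls 0 0
```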
As with the other solutions, we still have one issue to solve: client reconnection in the event of a failover. By employing an AWS Elastic IP address, we can ensure a single, static IP is always assigned to the instance we desire, effectively hiding queue manager failures from clients.
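The Elastic IP is allocated once and then re-associated with whichever instance currently hosts the active queue manager, for example from a failover script (instance and allocation IDs below are placeholders):

```shell
# Allocate an Elastic IP once, then move it to the instance that
# currently hosts the active queue manager
aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
    --instance-id i-0abcd1234efgh5678 \
    --allocation-id eipalloc-0123456789abcdef0 \
    --allow-reassociation
```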
What About Containers?
So, where do containers fit into all of this? Containers, with their lightweight and portable nature, lend themselves well to MQ HA. Their rapid start-up and scalability are great, but when we’re talking about MQ there are things to consider. We need to:
- Ensure shared data volumes are constantly available to prevent message loss
- Configure our containers in a way that masks any failure or movement from connecting applications
- Understand when the queue manager within a container is ‘unhealthy’ and needs to be respawned
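The third point can be sketched as a container health check. The script below is illustrative: it assumes the image provides the standard `dspmq` command and a `QMGR_NAME` environment variable, and simply parses the STATUS field that `dspmq` reports for a queue manager:

```shell
# qm_state: extract the STATUS field from a line of `dspmq` output,
# e.g. "QMNAME(QM1)  STATUS(Running)" yields "Running"
qm_state() {
  printf '%s\n' "$1" | sed -n 's/.*STATUS(\([^)]*\)).*/\1/p'
}

# In a real container health check you might then run (assumption — this
# depends on the image shipping dspmq and setting QMGR_NAME):
#   [ "$(qm_state "$(dspmq -m "$QMGR_NAME")")" = "Running" ] || exit 1
```

A check like this answers the question posed above: it tells the orchestrator that the queue manager itself is unhealthy, not merely that the container is up.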
Elastic Container Service (ECS)
Amazon Elastic Container Service (ECS) is a managed container orchestration service which allows you to intelligently manage and scale your container applications through the use of task definitions and a programmatic API. Task definitions allow us to define dependencies between containers and shared volumes, which is vital when we need to persist data beyond the lifetime of the stateless containers that host our queue managers.
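A trimmed-down task definition might look like the fragment below. The image name, file system ID and health-check command are assumptions (the `chkmqhealthy` script ships with IBM's MQ container image); the point is the pairing of an EFS-backed volume with a container-level health check:

```json
{
  "family": "mq-qmgr",
  "volumes": [
    {
      "name": "mqdata",
      "efsVolumeConfiguration": { "fileSystemId": "fs-0123456789abcdef0" }
    }
  ],
  "containerDefinitions": [
    {
      "name": "qmgr",
      "image": "ibmcom/mq:latest",
      "essential": true,
      "mountPoints": [
        { "sourceVolume": "mqdata", "containerPath": "/mnt/mqm" }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "chkmqhealthy || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
```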
Proactively monitoring the state of our queue managers—whether they be on containers, VMs or bare metal—is the most essential piece in our high availability puzzle. How do we know the actual queue manager is unhealthy rather than just the platform it is running on?
Simple operating system and server resource monitoring may not be enough to tell us that our listener has failed to start, or even whether the queue manager has started correctly. We must inspect queue manager state at a more granular, MQ-process level to be confident about the health of our queue managers.
We can use AWS CloudWatch alarms within ECS to determine whether a queue manager is unhealthy within a container and should be purged. We can also utilize CloudWatch events to monitor key queue manager processes within our EC2 instances and terminate and scale on demand.
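For example, a monitoring script on each instance could publish a custom health metric, with an alarm that fires when the queue manager has been unhealthy for several periods (the namespace, metric name and SNS topic below are placeholders, not AWS-provided names):

```shell
# Alarm when the custom health metric reports unhealthy for 3 minutes
aws cloudwatch put-metric-alarm \
    --alarm-name qm1-unhealthy \
    --namespace Custom/MQ \
    --metric-name QueueManagerHealthy \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:mq-alerts
```

The alarm action could equally target an Auto Scaling policy or a Lambda function that terminates and replaces the unhealthy instance.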
In summary, I’ve found that while traditional IBM MQ HA methods will always be considered when designing a highly available solution in the cloud, the elasticity, resilience and cost-optimization offered by AWS services make for a compelling, cloud-centric alternative.
Implementing proactive monitoring—whether it be by using proprietary AWS solutions or otherwise—is essential to ensuring a highly available queue manager estate.
Multi-instance queue managers give us a reliable, easy-to-use, well-known solution for MQ HA that doesn’t rely on additional monitoring. RDQM gives us far more robustness, at the added expense of running three queue managers and their associated infrastructure.
Within an AWS environment, the likes of Auto Scaling groups give us basic HA without the need for too much thought—however, additional monitoring needs to be put in place. Likewise, containers are an excellent way to reduce the number of VMs running at any one time, but more monitoring needs to be implemented to achieve the level of sophistication that multi-instance gives us for free.
As with most things, which solution you choose is often a compromise. Below is a table that compares four key components of high availability and how each option fares:
To learn more about enterprise-grade business-critical messaging and its various deployment options in private, public or hybrid cloud, check out the white paper below.
About Connor Smith
Connor Smith is an Integration Consultant working for Lightwell in the UK.
With expertise in IBM MQ and a strong focus on cloud technologies, Connor has a wealth of experience helping customers from a range of industries design and implement their integration solutions.