Utilization and High Availability Analysis: Containers for Microservices

Written by: Ali Kheyrollahi

This article was originally published on Byte Rot by Ali Kheyrollahi, and we are sharing it here for Codeship readers.

Microservices? Are these not the same SOA principles repackaged and sold under a different label? Not this time; I'll attend to that question in another post. But if you are considering microservices for your architecture, beware of cost and availability concerns. In this post, I'll look at how using containers (such as Docker) can help you improve your cloud utilization, decrease costs, and above all, improve availability.

Most Cloud Resources Are Under-Utilized

We almost universally underestimate how long it takes to build a software feature. I'm not sure if it's because our time is more precious than money, but with hardware the reverse is almost always true: We consistently overestimate the hardware requirements of our systems.

Historically, this made some sense. Commissioning hardware in enterprises is usually a long and painful process, and the overestimation accounted for business growth over the years and planned contingency for spikes. But in an elastic environment such as the cloud? Well, it seems we still do it. In the UK alone, £1 billion is wasted on unused or under-utilized cloud resources.

Some of this is avoidable, by making use of the elasticity of the cloud and scaling up or down as needed. Many cloud vendors provide such functionality out of the box with little or no coding. But many companies already do that, so why is waste so high?

From personal experience, I can give you a few reasons why my own systems have ended up over-provisioned...

Instance redundancy

Redundancy is one of the biggest killers when it comes to computing costs, and moving to the cloud doesn't change that much: Vendors' availability SLAs are usually defined in the context of redundancy, and to be frank, some of that redundancy is purely cloud related.

For example, on Azure you need to have your VMs in an "availability set" to qualify for the VM SLA. In other words, at least two VMs are needed, because any one of your VMs could be taken out for patching at any time. Within an availability set, however, this is guaranteed not to happen to all machines at the same time.

The problem is, unless you're a company with a massive number of customers, even a small instance VM could suffice for your needs. Even in a big company with many internal services, some services might not need big resource allocation.

Looking from another angle, adopting microservices will mean you can iterate your services more quickly, releasing more often. The catch is that your clients will not be able to upgrade at the same time, and you have to be prepared to run multiple versions of the same service/microservice. Old versions of the API cannot be decommissioned until all clients are weaned off the old one and moved to the newer versions. Translation? Well, some of your versions will have to run on a shoestring budget to justify their existence.

Containerization helps you tap into this resource, reducing your cost by running multiple services on the same VM. A system usually requires at least two or three active instances, allowing for redundancy. Small services loaded into containers can be co-located on the same instances, allowing for higher utilization of the resources and reduction of cost.

Improved utilization by service co-location

This ain't rocket science...
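To put rough numbers on the saving, here is a minimal sketch. The service counts, container density, and prices are made-up assumptions for illustration, not real cloud pricing:

```python
# Rough, illustrative cost comparison: dedicated small VMs per service
# versus co-locating containers on a shared pool of larger VMs.
# All numbers below are made-up assumptions, not real cloud prices.

num_services = 10              # small services, each needing redundancy
instances_per_service = 2      # minimum instances to qualify for an SLA
small_vm_monthly_cost = 30.0   # hypothetical price of a small VM
large_vm_monthly_cost = 120.0  # hypothetical price of a larger VM
containers_per_large_vm = 8    # how many small containers fit on one large VM

# Option 1: every service gets its own pair of small VMs.
dedicated_cost = num_services * instances_per_service * small_vm_monthly_cost

# Option 2: pack the same containers onto a shared pool of large VMs,
# keeping at least 3 VMs so redundancy is preserved.
total_containers = num_services * instances_per_service
pool_size = max(3, -(-total_containers // containers_per_large_vm))  # ceiling division
colocated_cost = pool_size * large_vm_monthly_cost

print(f"Dedicated VMs: ${dedicated_cost:.0f}/month")
print(f"Co-located containers on {pool_size} VMs: ${colocated_cost:.0f}/month")
```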

Resource redundancy

Most services have different resource requirements. Whether network, disk, CPU, or memory, some resources are used more heavily than others. A service encapsulating an algorithm will be mainly CPU-heavy, while an HTTP API could benefit from local caching of resources. While cloud vendors provide different VM setups geared for memory, disk I/O, or CPU, a system still usually leaves a lot of resources unused.

This is possibly best explained in the pictures below. No rocket science here either, but mixing services that have different resource allocation profiles gives us the best utilization.

Co-location of microservices with different resource allocation profiles
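As a concrete illustration of that mixing, here is a minimal first-fit packing sketch. The service names, their CPU/memory profiles, and the VM size are all made-up assumptions:

```python
# A minimal first-fit sketch of packing services with different resource
# profiles (CPU-heavy vs. memory-heavy) onto identical VMs.
# Service profiles and VM sizes are made up for illustration.

services = [
    {"name": "pricing-algo",  "cpu": 3.0, "mem": 2.0},  # CPU-heavy
    {"name": "http-api",      "cpu": 1.0, "mem": 6.0},  # memory-heavy (caching)
    {"name": "pdf-render",    "cpu": 3.0, "mem": 2.0},
    {"name": "session-store", "cpu": 1.0, "mem": 6.0},
]
VM_CPU, VM_MEM = 4.0, 8.0  # hypothetical VM size

vms = []  # each VM tracks remaining capacity and hosted services
for svc in services:
    for vm in vms:
        if vm["cpu"] >= svc["cpu"] and vm["mem"] >= svc["mem"]:
            vm["cpu"] -= svc["cpu"]
            vm["mem"] -= svc["mem"]
            vm["hosted"].append(svc["name"])
            break
    else:
        # No existing VM has room; provision a new one.
        vms.append({"cpu": VM_CPU - svc["cpu"], "mem": VM_MEM - svc["mem"],
                    "hosted": [svc["name"]]})

for i, vm in enumerate(vms, 1):
    used_cpu, used_mem = VM_CPU - vm["cpu"], VM_MEM - vm["mem"]
    print(f"VM{i}: {vm['hosted']} "
          f"(CPU {used_cpu}/{VM_CPU}, MEM {used_mem}/{VM_MEM})")
```

Pairing a CPU-heavy service with a memory-heavy one fills both dimensions of each VM; packing like with like would have needed twice as many machines.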

What's that got to do with microservices?

Didn't you just see it?! Building smaller services pushes you towards building and deploying more services, many of which need the high availability provided by the redundancy but not the price tag associated with it.

Docker is absolutely a must-have if you're doing microservices; otherwise you'll be paying through the nose for your cloud costs. At QCon London 2015, John Wilkes from Google explained how they "start over 2 billion containers per week."

In fact, to take advantage of the spare resources on the VMs, they tend to mix their production and batch processes. The difference is that the live processes get locked, reserved resource allocations, while the batch processes take whatever is left. They have analyzed the optimum percentages, minimizing errors while keeping utilization high.
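A toy sketch of that idea is below: live services hold locked reservations on a VM, and batch work gets whatever headroom remains. The VM size and reservation figures are made-up assumptions, not Google's actual numbers:

```python
# Toy sketch of "locked allocation for live, leftovers for batch".
# VM capacity and reservations are made-up numbers for illustration.

VM_CPU_CORES = 16.0

# Live (latency-sensitive) services hold locked reservations on this VM.
live_reservations = {"checkout-api": 4.0, "search-api": 6.0}

reserved = sum(live_reservations.values())
batch_headroom = VM_CPU_CORES - reserved

print(f"Reserved for live services: {reserved} cores")
print(f"Available to batch workloads: {batch_headroom} cores "
      f"({batch_headroom / VM_CPU_CORES:.0%} of the VM)")
# Batch jobs run in this headroom and get throttled or evicted whenever
# the live services need their reserved share back.
```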

Containerization and availability

As we discussed, optimizing utilization becomes a big problem when you have many, many services -- and their multiple versions -- to run. But what would that mean in terms of availability? Does containerization improve or hinder your availability metrics?

I haven't been able to find much in the literature, but as I'll explain below, even if you don't have small services requiring VM co-location, you're better off co-locating and spreading the service onto more machines. It even helps you achieve higher utilization.

By spreading your architecture across more microservices, the availability of your overall service (the one the customer sees) becomes the product of the availability of each microservice. For instance, if you have 10 microservices each with an availability of four 9s (99.99 percent), the overall availability drops to three 9s (99.9 percent). And if you have 100 microservices, which is not uncommon, it drops to only two 9s (99 percent). Given this, you need to strive for very high availability in each microservice.
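The compounding is just multiplication of per-service availabilities, assuming failures are independent and that a request touches every service. A small sketch of the arithmetic:

```python
# Overall availability is the product of per-service availabilities,
# assuming independent failures and that a request touches every service.

def overall_availability(per_service: float, num_services: int) -> float:
    return per_service ** num_services

for n in (10, 100):
    a = overall_availability(0.9999, n)  # each microservice at four 9s
    downtime_min_per_year = (1 - a) * 365 * 24 * 60
    print(f"{n:>3} services: {a:.4%} overall "
          f"(~{downtime_min_per_year:.0f} min downtime/year)")
```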

Hardware failure is very common. For many components, it goes above 1 percent (Annualised Failure Rate). Defining hardware and platform availability with respect to system availability is not easy, but for simplicity and the purpose of this study, let's assume a failure risk of 1 percent. At the end of the day, the resulting downtime will scale accordingly.

If Service A is deployed onto three VMs and one VM goes down (1 percent), the other two instances will have to bear the extra load until another instance is spawned -- which will take some time. Capacity planning can leave enough spare resources to deal with this situation, but if two VMs go down at once (0.01 percent), it will most likely bring down the service, as it would not be able to cope with the extra load.

If the Mean Time to Recovery is 20 minutes, a single such outage will dent your microservice availability by around half of your four-9s budget, which allows only about 53 minutes of downtime per year! If you've worked hard in this field, you know how difficult it is to gain those 9s. Losing them like that is not an option.
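Here is the back-of-the-envelope arithmetic behind that claim, as a sketch rather than a precise availability model:

```python
# Back-of-the-envelope: how much of a four-9s downtime budget does a
# single 20-minute outage consume?

minutes_per_year = 365 * 24 * 60                      # 525,600
four_nines_budget = minutes_per_year * (1 - 0.9999)   # ~52.6 minutes/year

mttr_minutes = 20.0
print(f"Four-9s budget: {four_nines_budget:.1f} min/year")
print(f"One 20-minute outage: {mttr_minutes / four_nines_budget:.0%} of the budget")
```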

So what's the solution? This diagram should speak better than words:

Services A and B, co-located in containers, can tolerate more VM failures

By using containers and co-locating services, we spread instances more thinly and can tolerate more failures. In the example above, our services can tolerate two or maybe even three VM failures at the same time.
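A toy calculation makes the point: keep the total provisioned capacity for Service A fixed, but spread it over more VMs as smaller containers. The capacity numbers below are made-up assumptions:

```python
# Toy comparison: the same total provisioned capacity for Service A,
# either as 3 dedicated instances or spread thinly as containers across
# more shared VMs. Capacity numbers are made up for illustration.

REQUIRED_CAPACITY = 2.0   # capacity Service A needs to serve its load
TOTAL_PROVISIONED = 3.0   # same headroom in every layout

def tolerated_vm_failures(num_instances: int) -> int:
    """How many VMs can fail before surviving capacity drops below what's needed."""
    per_instance = TOTAL_PROVISIONED / num_instances
    tolerated = 0
    while (num_instances - (tolerated + 1)) * per_instance >= REQUIRED_CAPACITY:
        tolerated += 1
    return tolerated

for n in (3, 6, 9):
    print(f"Spread over {n} VMs: tolerates "
          f"{tolerated_vm_failures(n)} simultaneous VM failure(s)")
```

With the same headroom, three dedicated instances survive only one VM failure, while the same capacity spread as containers over six or nine shared VMs survives two or three.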

Conclusion

Containerization (or Docker, if you will) is a must if you're considering microservices. It helps with increasing utilization, bringing down cloud costs, and above all, improving your availability.
