Deploying to production can be risky. Despite all the mitigation strategies we put in place (QA specialists, automated test suites, monitoring and alerts, code reviews and static analysis), systems are getting more complex every day. This makes it more likely that a new feature will ripple into other areas of your app, causing unforeseen bugs and eroding the trust customers have in you. Alluding to the “canary in the coal mine,” a canary deployment is when software developers release a new version of software to just a subset of users or systems. By enabling new software for just part of the user base, developers can monitor any problems it creates without causing major disruption. This lets organizations keep general customer trust high while freeing developers to focus on innovation and delivering excellent new features to customers.
The History of Canary Deployments
The term “canary deployment” comes from an old coal mining technique. Coal mines often contained carbon monoxide and other dangerous gases that could kill the miners. Canaries are more sensitive to airborne toxins than humans, so miners would use them as early detectors. The birds would often fall victim to the gas before it reached the miners. This approach helped ensure the miners’ safety: one bird dying or falling ill could save multiple human lives. In a canary deployment, the small part of our user base that receives a new version acts as the canary. It surfaces potential bugs and disruption without affecting the rest of the system. Canary releases are typically short-lived and used to validate whether a release meets the requirements it set out to meet.
OK, But How Do I Make This Magic Happen?
The idea itself is straightforward, but there are a lot of nuances in how we should approach deploying software this way. Often, we must know ahead of time that we’ll be canary releasing.
Does the Release Need It?
Canary deployments have a cost. They add complexity to your processes and to your systems. They also make your product harder to support, because customers reporting issues may be running different versions of the software. Before using a canary deployment for a release, you need to determine whether it is worth that cost. New releases introduce new code, new components and new services, all of which can carry some risk. Some questions you should have answers to before considering a canary deployment approach include:
Does the release represent a significant refactor of existing code?
How does this new version scale and perform in the real world?
Have you tested all the edge cases for functional issues?
Simple code changes are relatively easy to deploy and are not a good fit for a canary deployment. Reserve canaries for larger-scale changes and updates to code. For example, you may decide that changing the location of a button is not worth a canary, but that introducing a new reporting module is.
What Will Be Your Canary?
It is important to know what you can use in your system to split traffic between the old and new versions. Two areas commonly make great partitions: users and instances. Whilst creating a two-way partition is a good start, a many-way partition is much better, as it allows you to increase your exposure incrementally whilst gaining confidence in the release.
Almost all applications have some concept of a “user,” and you may choose to group users by factors such as location and timezone. You could partition by geographical region, for example, deploying a canary release to your Chinese customers during their night. This keeps the business impact small whilst you still receive some feedback from the night owls. Alternatively, you could partition on pure percentage, showing only five percent of users the new version and watching whether your error counts spike or your responsiveness slows down. Or you could partition based on membership of your early adopter program, if you have one (the Microsoft Windows Insider Program is a great example of an early adopter program). Try to choose a partition where trust is high or where the loss of customer trust will have a low impact.
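As a rough illustration of the percentage approach, the sketch below hashes each user ID into one of 100 stable buckets and treats the lowest buckets as the canary. The function names, the salt value and the use of SHA-256 are assumptions made for this example, not a prescribed method. Hashing keeps each user on the same side of the partition between requests, and raising the percentage only ever adds users to the canary group.

```python
import hashlib


def bucket_for(user_id: str, salt: str = "reporting-module-release") -> int:
    """Map a user ID to a stable bucket from 0 to 99.

    The salt is a hypothetical per-release string, so that different
    releases do not always pick the same users as canaries.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_canary(user_id: str, percent: int,
              salt: str = "reporting-module-release") -> bool:
    """Return True if the user falls inside the first `percent` buckets."""
    return bucket_for(user_id, salt) < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    exposed = sum(in_canary(u, 5) for u in users)
    print(f"{exposed} of {len(users)} users would see the new version")
```

Because the bucket depends only on the salt and the user ID, moving from five percent to twenty percent keeps every existing canary user in the canary.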
Separating by user information is the most powerful way to partition your users. If you don’t have a method for doing so, another way to approach it is to use your application and service instances as canaries. If you have multiple instances of your application, you can configure only a subset of them to have the new software.
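A minimal sketch of the instance-based approach might look like the following. The instance names and the `split_instances` helper are hypothetical, and the example assumes a load balancer that spreads traffic roughly evenly across instances, so the canary's share of traffic is about the canary count divided by the total.

```python
from typing import List, Tuple


def split_instances(instances: List[str],
                    canary_count: int) -> Tuple[List[str], List[str]]:
    """Split instances into (canary, stable) groups.

    Instances are sorted first so the same ones are chosen on every run.
    """
    ordered = sorted(instances)
    # Clamp the count so it never goes below zero or above the fleet size.
    count = max(0, min(canary_count, len(ordered)))
    return ordered[:count], ordered[count:]


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3", "web-4"]  # hypothetical names
    canary, stable = split_instances(fleet, 1)
    print("deploy new version to:", canary)
    print("keep current version on:", stable)
```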
What Infrastructure Do I Need?
If you want to implement the ability to canary deploy in your system, there are many options. The system needs to be able to partition the user base somehow. You also want to ensure you can change this partition at runtime. This can be done in code (e.g. using a switch statement), or you can use your load balancers to route traffic based on request headers. You can also save some development time and purchase tooling that will make it easy to manage your user partitions.
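For the in-code option, one possible shape is a small routing function driven by a configuration object that can be changed while the application is running. Everything named here (`CanaryConfig`, `choose_version`, the `X-Canary` header and the region names) is an assumption for illustration. In practice the configuration would probably be loaded from a feature-flag service or a config store rather than mutated in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CanaryConfig:
    """Partition settings that can be edited at runtime."""
    enabled: bool = True
    regions: Set[str] = field(default_factory=set)
    header_name: str = "X-Canary"  # hypothetical opt-in header


def choose_version(headers: Dict[str, str], region: str,
                   config: CanaryConfig) -> str:
    """Return "canary" or "stable" for a single request."""
    if not config.enabled:
        return "stable"  # kill switch: everyone gets the current version
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    opt_in = normalised.get(config.header_name.lower(), "")
    if opt_in.strip().lower() == "true":
        return "canary"
    if region in config.regions:
        return "canary"
    return "stable"


if __name__ == "__main__":
    config = CanaryConfig(regions={"eu-west"})
    print(choose_version({}, "eu-west", config))
    print(choose_version({}, "us-east", config))
    config.regions.add("us-east")  # widen the partition at runtime
    print(choose_version({}, "us-east", config))
    config.enabled = False         # switch the canary off
    print(choose_version({"X-Canary": "true"}, "eu-west", config))
```

The `enabled` flag acts as a kill switch, which gives you one place to turn the canary off without redeploying.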
How Do I Know if Something Goes Wrong?
Canary deployments will only be useful to you if you can track their impact on your system. You’ll want to have some level of monitoring or analytics in place in your application. These analytics must line up with how you’re partitioning your features. For example, if you’re partitioning by users in a region, you should be able to see traffic volume and latency for each region. Some useful analytics are latency, internal error count, volume, memory usage and CPU usage. Fortunately, it’s easy these days to wire in analytics and monitoring. You can grab open source options with no upfront cost, or you can get great capabilities through purchasing commercial products. If you’re on a cloud platform, many of these metrics are built in. It’s usually not worth building monitoring yourself; you will more likely want to tweak an existing package according to your needs. It’s also a good idea to consider how you will roll back if your analytics reveal the canary release doesn’t meet your goals.
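To make the comparison concrete, here is a small sketch that compares the canary partition's metrics against the rest of the system and suggests whether to roll back. The `Metrics` fields, the `should_roll_back` helper and all of the thresholds are illustrative assumptions. Sensible limits depend on your own traffic and goals, and the numbers themselves would come from whichever monitoring tool you use.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Aggregated figures for one partition over the same time window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(baseline: Metrics, canary: Metrics,
                     max_error_rate_increase: float = 0.01,
                     max_latency_ratio: float = 1.2,
                     min_requests: int = 500) -> bool:
    """Return True if the canary looks clearly worse than the baseline.

    All thresholds are hypothetical defaults, not recommendations.
    """
    if canary.requests < min_requests:
        return False  # too little data to judge; keep observing
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return True
    if (baseline.p95_latency_ms > 0 and
            canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio):
        return True
    return False


if __name__ == "__main__":
    baseline = Metrics(requests=10000, errors=50, p95_latency_ms=200.0)
    canary = Metrics(requests=1000, errors=40, p95_latency_ms=210.0)
    print("roll back?", should_roll_back(baseline, canary))
```

Returning False when the canary has seen too few requests is a deliberate simplification. A fuller version might report "not enough data" as its own outcome.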
When Do I Release the New Version to Everyone?
As we discussed earlier, canary releases are ideal for validating that a change meets its requirements. You want to keep the canary short-lived (a maximum of a few hours). Once it has done its job, retire the canary partition and roll the new version out to everyone. You shouldn’t be maintaining two different branches of the code as a result of creating a canary; the canaried release should be cut from your mainline.
Focus on Achieving Excellence, Not Avoiding Risk
If you implement canary deployments for your releases, you’ll feel a significant mental weight lift off you. You’ll find yourself thinking less about production outages and disruptions. Instead, you’ll think more about how to push that next exciting feature to your customers.