Recently at Codeship, we've been experiencing issues with our Docker registry provider. Over the past few months, we have unfortunately passed some of Quay.io's stability issues on to our customers. Because our system lacked resilience in the face of dependency downtime, we have failed to start some of our customers' builds.
Back in July, we published a postmortem of our most impactful incident caused by these outages. In this postmortem, we briefly touched on the corrective actions we had planned to address these problems.
This blog post explains how we decided upon that course of action and then how we delivered and validated an implementation.
Understanding the Problem
First off, why does downtime for Quay.io lead to builds not being started? We use Quay.io to store an important part of the Codeship Pro system. This component is a Docker image that contains our build agent.
jet, our CLI tool for running Pro builds on your local machine, runs on the same engine as this build agent during a hosted Codeship Pro build.
When your build hook arrives, we allocate an appropriately sized machine for it. This machine has a copy of the build agent on it; however, the agent is loaded when we create the machine image itself. We build our Codeship Pro build machines with Amazon Machine Images (AMIs), and when provisioning our fleet, we boot from this base AMI.
At Codeship, we like to be nimble and ship updates to our build agent regularly. Our release cadence for our Docker host AMI is too slow to keep up with the agent's release cycle. Also building AMIs is a slow process, certainly slower than building and shipping a Docker image. So we like to pull the latest build agent image the moment we pair your build hook with an instance.
Keeping these components decoupled is useful both for frequent updates to our build agent and the ability to build and test new Docker updates in isolation.
The downside of this process is that we depend on our Docker registry provider being fully operational at the start of every build. The first failsafes we put in place were retries with exponential back-off. These served us well on the Pro platform for a long time. However, degraded performance from Quay.io became more frequent, sometimes lasting a relatively long time.
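That first failsafe can be sketched as a simple retry loop with exponential back-off. This is a minimal illustration rather than our actual agent code; the `pull_image` callable, the attempt count, and the timing constants are all hypothetical.

```python
import time

def pull_with_backoff(pull_image, max_attempts=5, base_delay=1.0):
    """Retry a registry pull, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return pull_image()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the failure
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, ...
```

A loop like this absorbs brief blips, but as the chart below shows, it cannot absorb an outage that lasts longer than the retry budget.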
[caption id="attachment_5587" align="aligncenter" width="1999"]
The number of API (HTTP 500) errors that we experienced from quay.io over the last 60 days.[/caption]
Fixing the Problem
We started with identifying our desired outcomes and success criteria, and then did some high-level implementation discovery before launching into execution.
We mitigate the impact of degraded performance from our registry providers on our customers as much as possible. Builds should not fail to start because a single registry provider is not operational.
We retain our ability to ship changes to our build agent regularly.
We retain our ability to test-drive canary releases of our build agent.
High-level implementation details
We would rely upon more than one registry provider.
We would identify and promote a new registry provider as our primary provider and Quay.io would become our failover provider.
We wanted to isolate this new behavior and responsibility to a single system (previously it existed in two places).
We would still be able to ship a canary release.
If we purposefully degrade the performance of our primary registry provider, then our customers' builds should not be impacted.
We then set to planning out the specifics and fleshing out more implementation details. Who was our new primary registry provider going to be? Where was this logic going to live? Where was it not going to live? Which parts of the system needed refactoring to accommodate this change? Did we have, and could we leverage, any existing components in our system? And so on...
We eventually settled on something concrete.
Amazon ECR would become our new primary Docker registry provider. Why?
We could be certain the registry would be close to the build machines (we run our instances in AWS us-east-1 and we could put our registry there).
We had just implemented a brand-new Pro caching system backed by ECR. This meant we had experience and components to aid in our development.
This experience also gave us the data we needed to feel confident in ECR's stability.
We identified where this logic would live. Multiple factors contributed to this decision.
The existing components we had for ECR were already in this system.
This system allocated builds to machines, and it had all the information needed to orchestrate canary releases.
A bit of "horizon scanning" led us to believe that implementing it in this system meant we could better leverage this mechanism again for future endeavors.
We would instrument the heck out of it!
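To make the canary requirement concrete: at allocation time, the system can choose an agent image tag per build, routing a small slice of builds to a canary tag. The sketch below is purely illustrative; the tag names, the percentage, and the function itself are assumptions, not our actual allocation code.

```python
import random

def choose_agent_image(canary_tag=None, canary_percent=5):
    """Pick the build-agent image tag for a new build.

    When a canary tag is set, a small percentage of builds run it;
    everything else runs the stable tag.
    """
    if canary_tag and random.randrange(100) < canary_percent:
        return canary_tag
    return "stable"
```

Keeping this choice in the same system that allocates builds to machines means one place knows both which machine a build landed on and which agent version it ran.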
Then we executed. It looked something like this.
resilient_system = (two_engineers * one_week_development) / (plan_of_action) - (a_few_minutes * ecr_availability)
The last part of the formula is quite interesting. This is the bit where we purposefully break something, pause for dramatic effect, and then hopefully sigh in relief as everything resumes as normal in a quiet anticlimax.
To create this moment of chaos, my colleague did something simple. He deleted the build agent from our ECR registry. Knowing this error would be classified as a failure, we expected a failover to occur. Once we found the courage to squint through our tightly clenched eyes, we saw everything still operating as we had hoped.
[caption id="attachment_5588" align="aligncenter" width="600"]
Blue: the number of attempted pulls from ECR. Red: the number of attempted pulls from quay.io[/caption]
The metrics above show the behavior of a real failover test in production. The blue stacked line represents an attempt to pull from ECR, while the red stacked line is an attempt to pull from quay.io.
Just after 15:30, I deleted the image from ECR and let the system deal with the issue for 15 minutes before restoring the image. We continued to attempt pulls from ECR, but the moment one failed, we fell back to Quay.io. That's why you see a mixture of blue and red while the “outage” is occurring. I performed this failover just for the blog post, as the previous metrics we had just weren't pretty enough.
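The failover behavior in that chart boils down to: always try the primary first, on any failure fall through to the next provider, and count every attempt so it can be graphed. A hedged sketch of that logic, assuming hypothetical provider names and a made-up metrics hook:

```python
def pull_agent_image(registries, record_attempt):
    """Try each registry in priority order; count every attempt.

    `registries` is a list of (name, pull_fn) pairs, primary first.
    `record_attempt` feeds the attempted-pull metrics shown above.
    """
    last_error = None
    for name, pull in registries:
        record_attempt(name)  # every attempt is instrumented, success or not
        try:
            return pull()
        except IOError as err:
            last_error = err  # degraded provider; fall through to the next
    raise last_error  # every provider failed
```

Because the primary is attempted on every build, the system recovers on its own the moment the primary is healthy again, with no manual failback step.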
Going forward at Codeship, it will now become policy to add a bit of “self-inflicted chaos” to all of our releases, so that we can feel confident we're doing our utmost to keep the chaos away from your builds.