AWS and CloudBees brought us this live webinar on building resilient infrastructure. Venkatesh Aravamudan, partner solutions architect at AWS, and Tim Hosey, CloudBees’ senior DevOps consultant, reminded us that we must plan for eventual failures, both large and small.
Categories of Failure
The categories of failure Venkatesh outlined included:
Code deployments and configuration.
Data and state.
Highly unlikely scenarios.
Do you have threat detection? What happens if your data gets corrupted? What if your consumers overload your services? You need to plan for these scenarios, which means building resilience into the system.
What Is Resilience?
Resilience is the ability to respond to and recover from failures quickly. Achieving this involves using high availability deployments, implementing disaster recovery methods, and using continuous improvement techniques.
Resilience Is a Shared Responsibility
Cloud platforms and customers share a responsibility for creating resilience. Cloud platforms must ensure resilience for hardware and infrastructure of the platform. Cloud vendor customers must pay attention to “resilience in the cloud.” In other words, the customers are responsible for designing and building resilient solutions that run on cloud services.
How Does AWS Provide Resilience of the Cloud?
AWS has 102 availability zones across 32 regions. Zones are physically separated, which provides resiliency against power interruptions, hardware issues, and other types of failures. AWS connects this infrastructure over its own network.
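To get a feel for why physical separation matters, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes zones fail independently of one another, which is an idealization, and the 99% per-zone figure is an illustrative number, not one quoted in the webinar. The function name `combined_availability` is hypothetical.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` zones is up.

    Assumes every zone has the same availability and that zone
    failures are independent (an idealization for illustration).
    """
    if not 0.0 <= per_zone <= 1.0:
        raise ValueError("per_zone must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The workload is down only if every zone is down at the same time.
    return 1.0 - (1.0 - per_zone) ** zones


if __name__ == "__main__":
    # Illustrative per-zone availability of 99% (an assumption).
    for n in (1, 2, 3):
        print(f"{n} zone(s): {combined_availability(0.99, n):.4%}")
```

Under these assumptions, a workload spread over two zones that are each up 99% of the time would be reachable roughly 99.99% of the time. Real failures are not perfectly independent, so treat the result as an upper bound.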
AWS has a culture of resilience. From their ownership model to their deployment and resolution processes, they maintain responsiveness and coordination across the organization.
How Can You Build Resilience in the Cloud?
AWS provides best practices and tooling to help you decide what best fits your business’s needs.
The framework includes:
Defining and measuring resilience goals set at the workload level, not the organization level.
Identifying and mitigating risks by prioritizing and fixing.
Continuous code refinement to catch issues early and often.
Continuous integration/continuous deployment through automation.
Continuous testing of modeled failures to practice responding.
Continuous observability so you can see issues.
Recovering quickly via planning and building for resiliency.
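One way to make a workload-level resilience goal measurable is to turn an availability target into a downtime budget. The sketch below does that arithmetic. The function name `downtime_budget_minutes`, the 30-day period, and the sample targets are assumptions chosen for illustration, not figures from the webinar.

```python
def downtime_budget_minutes(availability_target: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in a period for a given availability target.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    if period_days < 1:
        raise ValueError("period_days must be at least 1")
    total_minutes = period_days * 24 * 60
    # Whatever fraction of the period is not covered by the target is the budget.
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    # Hypothetical targets for two different workloads.
    for name, target in (("checkout", 0.9999), ("reporting", 0.99)):
        budget = downtime_budget_minutes(target)
        print(f"{name}: {budget:.1f} minutes per 30 days")
```

A 99.9% target over 30 days leaves about 43 minutes of downtime, which is why different workloads usually deserve different goals.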
Keeping in mind that everything is prone to failure, you can combine the resiliency built into AWS with the resiliency you build into your own design to respond to and recover from these failures.
Resilience in CloudBees CI
From here, Tim Hosey spoke about how CloudBees can help you build resiliency into deployment. Their enterprise CloudBees CI service is designed to deliver high availability using Jenkins running on Kubernetes. That includes:
Active-active, which provides rolling restarts.
HA, which means recovering easily and quickly from a node failure.
Balanced loads across replicated controllers.
Jobs getting redeployed when the node running them fails.
Rolling updates to the CI servers, coming soon!
The major components of CloudBees CI include pod templates, agents, and controllers. Pod templates define how to run the jobs, the jobs run on agents, and the controllers manage the jobs and agents. Because controllers and agents run separately, they can be more resilient.
Because agents are running the jobs, they should have their own resource pools to prevent interference with controllers. Running agents and controllers on the same hardware can result in resource bottlenecks—high CPU, disk, or memory use—that can cause controllers to fail. So, it’s best to keep them on separate hardware.
Tim did a walkthrough of setting up an HA CloudBees CI configuration. Some points to note from the demo:
The HA configuration requires EFS due to the need for concurrent read/write.
The CloudBees operations center UI gives you a number of Kubernetes options, including secrets.
CloudBees CI UI shows you progress while things are being created.
The Jenkins controller page becomes available once everything is spun up.
A job that was killed was resumed on another agent automatically.
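The last point, a killed job resuming on another agent, can be pictured with a toy model. This is not CloudBees CI code or its API. It is a minimal sketch of the general idea of moving work off a failed agent, and every name in it (`Agent`, `reschedule_jobs`, the job names) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    """A toy stand-in for a build agent (not a real CloudBees type)."""
    name: str
    healthy: bool = True
    jobs: list[str] = field(default_factory=list)


def reschedule_jobs(agents: list[Agent]) -> dict[str, str]:
    """Move jobs off unhealthy agents onto the least-loaded healthy agent.

    Returns a mapping of job name -> new agent name for every job moved.
    """
    healthy = [a for a in agents if a.healthy]
    if not healthy:
        raise RuntimeError("no healthy agents available")
    moved: dict[str, str] = {}
    for agent in agents:
        if agent.healthy:
            continue
        while agent.jobs:
            job = agent.jobs.pop(0)
            # Pick the healthy agent currently running the fewest jobs.
            target = min(healthy, key=lambda a: len(a.jobs))
            target.jobs.append(job)
            moved[job] = target.name
    return moved


if __name__ == "__main__":
    pool = [Agent("agent-1", jobs=["build-app"]), Agent("agent-2")]
    pool[0].healthy = False  # simulate the agent being killed
    print(reschedule_jobs(pool))
```

In the demo this hand-off happened automatically. The sketch only shows the bookkeeping involved, not how a job's state is preserved and resumed.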
In this webinar, we learned that there are two main parties involved in building resiliency: the cloud provider and the customer. As a cloud provider, AWS does its part by having a culture built around resilience. AWS customers can use the AWS Well-Architected Framework to help plan for and design their resilience in the cloud. CloudBees CI is an enterprise HA solution that provides resilient Jenkins CI for your pipelines. CI/CD is one of the main pillars of resilience because it allows us to respond quickly to issues.
You can use what Venkatesh and Tim taught us to start thinking about how your own organization might improve its resilience against inevitable failures.