Creating a Disaster-proof, Multi-environment Jenkins Infrastructure

Written by: Hannah Inman

Disaster Proof

It is Wednesday morning. You just received a notification that Jenkins is down. You pull over, whip out the jet-pack and sure enough, you are getting an Nginx error. You try to SSH into Jenkins Master and nothing happens. You log into AWS and find that the EC2 instance running Jenkins Master has terminated. It is gone. That email AWS sent you 10 days ago must have been swallowed whole by your black hole of a filter. Unfortunately for you, it is release day TODAY… What do you do?

This has been a favorite interview question of ours for quite some time. We have heard everything from a long series of manual steps a team must perform, to a few automated steps with manual steps sprinkled sporadically in between, and of course the infamous "I don't know, I never thought about that."

There are many problems that can arise from this scenario and sometimes it can be hard to consider them all:

  • Can the jobs be recreated quickly, manually or at all?
  • Is the job history (approvals, release days, build numbers) gone forever? Auditors will have a field day.
  • Do you know what versions all of your plugins were at? (Not all versions play well with each other.)
  • Are the artifacts still around to deploy? Must you recreate all artifacts?
  • Do you have a configuration management tool that spells out all the steps to recreate your instance once it is up in AWS? Will a new instance come up in its place? I hope there were no manual steps...
  • Many more!

Our talk at Jenkins World is designed to help you answer these and related questions. The ideas discussed will encourage you to take the necessary precautions and ensure that a concrete recovery plan for Jenkins mishaps exists.

The goal is for everyone to be able to answer the original question... Jenkins is down, what do you do? Easy. My Jenkins infrastructure is designed to automatically remediate during a disaster. This is how…


Our team started off with a single Jenkins Master setup. This made sense for us, as I am sure it makes sense for many shops out there. A single master that every single team visits to run their jobs and manage each workflow.

Some problems we were facing:

  • As our team was releasing changes to Jenkins several times a day at least (Jenkins as code), we found that we were breaking it more often, too. A broken workflow on production release day is not fun for anyone.
  • Teams wanted changes made to the Jenkins environment at odd hours of the day (past merge freeze), and the DevOps team was not available 24 hours a day just to make sure their QA jobs were working. Because making changes to a job could give developers access to sensitive areas and let them change production sites, Jenkins was kept locked down.

However, as Jenkins becomes a service provided to the developers via pull-requests and release cycles, you may consider splitting up Jenkins further to reap even more benefits. Some of you have been brave enough to venture into a Jenkins Master setup for each team!

There are several benefits from splitting up Jenkins into team environments.

  • Smaller blast radius: if you (or a team) make a mistake, you decrease the risk of ruining every team's jobs, workflows and artifacts.
  • Teams can enhance their Jenkins setup as they please (via pull-requests of course).
  • Condensed list of jobs: if your organization has hundreds of jobs that keep it afloat, how do you find the ones that matter to you and your team? Jenkins lets you create filters, but filters waste even more screen space and are little more than a band-aid. A separate Jenkins for your team means the jobs you care about are all that you see.

Splitting up Jenkins, one per team, is nice. What if we split Jenkins up into environments, QA and PRD? That is exactly what we did and there were some benefits:

  • Developers were empowered to make changes to their jobs in QA. Because QA could only affect non-production applications and infrastructure, the doors were opened (slightly). Developers could make quick changes through the UI to their QA jobs. These would of course be overwritten by our automated Jenkins setup, but a quick test (and manual changes lost each time our Jenkins clean-up job ran) allowed, and forced, developers to write the necessary Groovy to make their changes permanent.
  • An even smaller blast radius. Changes were automatically applied to our QA Jenkins environment, so if it was release day for application teams, our changes would not affect their workflow of getting code into production. DevOps and the teams could keep enhancing their Jenkins environment (in QA), and vetted changes would make their way into the production version of Jenkins. There was no longer a need to roll back a change because a critical hotfix had to go to production NOW.
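A clean-up job like the one described above can take many forms; here is a minimal sketch of the idea, assuming job definitions are kept as config.xml files in source control and re-applied through the Jenkins REST API. The Jenkins URL, job name and directory layout are illustrative assumptions, not the authors' actual setup, and authentication is omitted for brevity.

```shell
#!/bin/sh
# Hypothetical clean-up job: re-apply job definitions from source control,
# overwriting any manual UI edits. Dry-run by default; set APPLY=1 to push.
set -eu

JENKINS_URL="${JENKINS_URL:-http://jenkins-qa.example.com:8080}"
JOBS_DIR="${JOBS_DIR:-jobs}"
PLAN_FILE="jenkins-apply-plan.txt"
: > "$PLAN_FILE"

# Fixture so the sketch runs standalone: one job config as Jenkins stores it.
mkdir -p "$JOBS_DIR"
cat > "$JOBS_DIR/qa-smoke-test.xml" <<'EOF'
<?xml version='1.0' encoding='UTF-8'?>
<project>
  <builders>
    <hudson.tasks.Shell><command>./run-smoke-tests.sh</command></hudson.tasks.Shell>
  </builders>
</project>
EOF

for cfg in "$JOBS_DIR"/*.xml; do
  job="$(basename "$cfg" .xml)"
  # POSTing config.xml to an existing job replaces its definition in place.
  cmd="curl -sf -X POST -H 'Content-Type: text/xml' --data-binary @$cfg $JENKINS_URL/job/$job/config.xml"
  if [ "${APPLY:-0}" = "1" ]; then
    eval "$cmd"
  else
    echo "DRY RUN: $cmd" | tee -a "$PLAN_FILE"
  fi
done
```

Run on a schedule, a job like this guarantees that what is in the repository, not what someone clicked together in the UI, is what survives.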

Topics that will be covered in our Jenkins World talk:

  • Using Docker to manage Jenkins master and agent infrastructure.
  • AWS UserData and cloud-init to configure the instances.
    • Groovy scripts are responsible for configuring the entirety of Jenkins. From setting up users, installing plugins, adding all credentials, configuring the ECS Plugin and more.
  • AWS automation to provision and maintain AWS resources
    • All AWS resources are managed programmatically. Pull-requests allow one to change/enhance the resources necessary to keep Jenkins running and self-remediating: EC2, Auto Scaling Groups, Elastic Load Balancers, Launch Configurations, ECS, VPC, Security Groups, Route53, Lambda, AWS Certificate Manager, etc.
  • Separating Jenkins into multiple environments e.g. QA / PRD
    • Because Jenkins is a service the team provides, managed by release cycles, it makes sense to treat the CI/CD environment with CI/CD as well. You would never deploy an application to production without testing first... right? Separating Jenkins into QA and PRD environments lets you apply changes to the CI environment quickly without negatively affecting production.
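The combination of Docker, UserData and Auto Scaling listed above is what makes the opening scenario survivable: if the master's EC2 instance terminates, the Auto Scaling group launches a replacement that configures itself on boot. Below is a minimal sketch of that pattern, generating the UserData such an instance would run; the S3 bucket, EBS volume id, device name and paths are illustrative assumptions, not the authors' actual values.

```shell
#!/bin/sh
# Generate the cloud-init UserData for a self-healing Jenkins master that
# lives in an Auto Scaling group of size one (hypothetical names throughout).
set -eu

cat > user-data.sh <<'EOF'
#!/bin/bash
# Runs once at boot via cloud-init.
set -e

# Reattach the persistent JENKINS_HOME volume so job history, build numbers
# and approvals survive the instance itself.
INSTANCE_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
aws ec2 attach-volume --volume-id vol-0abc123 --instance-id "$INSTANCE_ID" --device /dev/xvdf
sleep 10 && mount /dev/xvdf /var/jenkins_home

# Pull the Groovy init scripts that configure users, plugins, credentials, etc.
aws s3 sync s3://example-jenkins-config/init.groovy.d /var/jenkins_home/init.groovy.d

# The master itself is just a container; an upgrade is an image tag change.
docker run -d --restart unless-stopped -p 8080:8080 -p 50000:50000 \
  -v /var/jenkins_home:/var/jenkins_home jenkins/jenkins:lts
EOF

echo "wrote user-data.sh ($(wc -l < user-data.sh) lines)"
```

Baked into the launch configuration, this is the difference between the nightmare in the opening paragraph and a few minutes of automated recovery.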

This is a guest post by Steven Braverman, Manager of DevOps at ReachLocal, and Adam Keller, Platform Engineer at CloudPassage. Attend their talk at Jenkins World. Still need to register? Apply code JWHINMAN at checkout for 20% off your conference pass.


Learn More

Want to learn the latest in Jenkins? Subscribe to the Jenkins newsletter, Continuous Information. This monthly newsletter contains all of the latest interesting and useful happenings in the Jenkins community, sent directly to your inbox.





