Web-scale Enterprise Jenkins using CloudBees Jenkins Platform - Private SaaS Edition
Jenkins World 2016 is now over and boy was it a blast! We had a huge turnout, great speakers, informative tech sessions, certifications, a great keynote by Kohsuke Kawaguchi and another by Sacha Labourey, CEO of CloudBees, Inc. For Sacha’s keynote demo, we wanted to shine the spotlight on the flagship product from CloudBees, CloudBees Jenkins Platform - Private SaaS Edition, and showcase its key capabilities and how it enables enterprises to achieve CI/CD, even at extreme scale. Our goal was to set up the world’s largest enterprise Jenkins cluster using Private SaaS Edition and demonstrate how enterprises can “go from code check-in to production in under an hour.”
Setting the Stage
During the keynote demo, Sacha highlighted the following use cases with Private SaaS Edition, showing how enterprises can go from “code check-in to production in under an hour”:
Set up an enterprise Jenkins cluster on AWS within minutes, using Private SaaS Edition
Onboard a new project team on the cluster with the click of a button (provision a new Jenkins controller)
Provision CI jobs of an entire GitHub org and spin up agents on-demand using the power of the GitHub Organization Folder plugin
Showcase the auto-healing capability of Private SaaS Edition - in the event a Jenkins controller or agent were to crash, the cluster should automatically provision a new controller/agent without loss of data, and in a matter of minutes
We also showed a live “world’s largest enterprise Jenkins cluster” using Private SaaS Edition, running real CI/CD workloads.
In this blog we want to talk about how we got Private SaaS Edition to set up and manage the above cluster, and also share key lessons we learned along the way. Read on and enjoy!
“Go Big, or Go Home!”
We wanted to showcase the above use-cases on a Private SaaS Edition cluster running on AWS with 2000 Jenkins controllers and 10,000 executors, at any given point in time.
Let’s set some context here for our readers - why 2000 controllers? Is that even important?
Here’s why it makes sense - with Private SaaS Edition, enterprises can spin up a Jenkins controller for every single active development project. So, each project could get its own CI/CD workspace, with Jenkins controllers and agents to handle the project’s CI/CD workloads. Isn’t that awesome?
“It’s a Marathon, not a Sprint!”
We wanted the audience to get a real-time visual representation of the above use cases as Sacha walked through each step. So, we needed the following:
A dashboard (Blue Ocean-based) to show the size and health of the cluster in real-time
Additional health metrics from the cluster persisted in Elasticsearch (ES), that would be displayed on the dashboard
We built a prototype fulfilling the above requirements in less than four weeks. The next step was to scale this prototype to the world’s largest enterprise Jenkins cluster. We spent the next week determining the budget and appropriate EC2 instances to run the different components.
As we got the cluster to ~1400 Jenkins controllers, a couple hundred “worker” VMs and a few thousand executors, we seemed to hit a dead end: we were unable to scale the cluster any higher. Marathon became unresponsive, which prevented us from spinning up additional Jenkins controllers or agents. After a good deal of troubleshooting we discovered (quick shout-out to Dario for confirming our suspicions on a crazy Saturday morning!) that Marathon was stuffing a lot of state data (our state data) into ZooKeeper znodes. Once that data outgrows the znode size limit, Marathon gets very confused and becomes unresponsive. Fortunately, we could make a relatively minor tweak to ZooKeeper’s configuration and increase the znode limit significantly. This fix did the trick. The pressure valve was opened and we quickly scaled the cluster to 2000 Jenkins controllers.
Side note: There is good news on this front. The very next release of Marathon (1.4.0) is supposed to address the limitation referenced above.
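We have not spelled out the exact setting we changed, so treat the following as a sketch under assumptions rather than our actual configuration. In a stock Apache ZooKeeper install, the znode size ceiling is governed by the Java system property jute.maxbuffer (a little under 1 MB by default), and one common way to raise it is through the conf/java.env file that zkEnv.sh sources. The 4 MiB value and the ZNODE_MAX_BYTES variable name are illustrative only; ZooKeeper's documentation also notes the property has to be set consistently on servers and clients.

```shell
# conf/java.env - sourced by zkEnv.sh in a stock Apache ZooKeeper install.
# Assumption: 4 MiB is an illustrative value, not the limit used for the demo cluster.
ZNODE_MAX_BYTES=$((4 * 1024 * 1024))

# Prepend the property to any JVM flags that are already set.
export JVMFLAGS="-Djute.maxbuffer=${ZNODE_MAX_BYTES} ${JVMFLAGS:-}"
```

Larger znodes put more pressure on ZooKeeper itself, so a setting like this is a pressure valve rather than a cure for storing too much state.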
“Houston, We Have a Problem!”
Two days before Sacha’s keynote demo, disaster struck! The cluster we had so painstakingly set up was destroyed inadvertently. Here’s what happened - we had spun up a bunch of test clusters on AWS as part of this effort, and in our haste to clean up these test clusters, our primary demo cluster was destroyed instead! To destroy a Private SaaS Edition cluster, you would type the following command in the CLI:
Except this time, someone accidentally also did this:
bees-pse destroy -f
The -f here is the same as the -f in the quintessential “rm -rf /”. In other words, if you have administrative privileges, you can delete the whole cluster without any verification.
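One way to make this kind of slip harder is to wrap destructive commands in a small guard that insists on a typed confirmation. The sketch below is purely hypothetical: confirm_then_run is not part of the bees-pse CLI, and the idea of confirming against a cluster name is our assumption for illustration.

```shell
# Hypothetical guard (not part of bees-pse): run a command only after the
# operator types the expected confirmation string, e.g. the cluster name.
confirm_then_run() {
  _expected="$1"
  shift

  # Prompt on stderr so stdout carries only the wrapped command's output.
  printf 'Type "%s" to confirm: ' "$_expected" >&2
  IFS= read -r _answer || return 1

  if [ "$_answer" != "$_expected" ]; then
    echo "Confirmation did not match; aborting." >&2
    return 1
  fi

  # Confirmation matched: run the wrapped command with its arguments.
  "$@"
}
```

With this in place, an operator would run something like `confirm_then_run my-demo-cluster bees-pse destroy -f` (my-demo-cluster being a made-up label) and would have to retype the name before anything is deleted.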
This was a disaster scenario in every sense of the word. We had to get the entire cluster up and running for rehearsals the next day. Well, as it turns out, we had designed Private SaaS Edition with these types of situations in mind from the get-go:
In a Private SaaS Edition cluster, the $JENKINS_HOME data is continually backed up (snapshots stored in EBS)
Customers can recreate their cluster from a snapshot, in the event of a disaster
“Bright, Sunshiny Day!”
Two hours after the above disaster struck, we had the cluster back up and running - 300+ VMs, 2000 Jenkins controllers, 9000+ executors, Elasticsearch and our prototype monitoring system! A few hours later we were able to scale the cluster even more, and hit our goal of running 10,000 executors on this cluster. Here are some key stats from this cluster:
12 TB RAM (total cluster)
~320 EC2 instances
~2M jobs run in a given day
TBs of data in Elasticsearch
So in conclusion, if you are an enterprise looking for a turnkey solution to set up a Jenkins cluster at scale on your private cloud, do take Private SaaS Edition out for a spin and let us know your thoughts.
In the immortal words of Gordon Gekko - “Greed is good,” so maybe for next year’s CloudBees Keynote Demo at Jenkins World we can showcase the world’s largest enterprise Jenkins cluster that can span multiple regions and multiple cloud service providers, in a matter of minutes!
Finally, a quick plug for Stephen Connolly’s great blog post where he presents a blueprint to scale Jenkins to an even larger scale. We highly recommend it.
Kal Vissa, Senior Product Manager
John Pampuch, Engineering Manager