Web-scale Enterprise Jenkins using CloudBees Jenkins Platform - Private SaaS Edition

Introduction

Jenkins World 2016 is now over and boy was it a blast! We had a huge turnout, great speakers, informative tech sessions, certifications, a great keynote by Kohsuke Kawaguchi and another by Sacha Labourey, CEO of CloudBees, Inc. For Sacha’s keynote demo, we wanted to shine the spotlight on the flagship product from CloudBees: CloudBees Jenkins Platform - Private SaaS Edition, and showcase the key capabilities of the product and how it enables enterprises to achieve CI/CD, even at extreme scale. Our goal was to set up the world’s largest enterprise Jenkins cluster using Private SaaS Edition and demonstrate how enterprises can “go from code check-in to production in under an hour.”

Setting the Stage

During the keynote demo, Sacha highlighted the following use-cases with Private SaaS Edition, showing how enterprises can go from “code check-in to production in under an hour” -

  • Set up an enterprise Jenkins cluster on AWS within minutes, using Private SaaS Edition

  • Onboard a new project team on the cluster with the click of a button (provision a new Jenkins master)

  • Provision CI jobs of an entire GitHub org and spin up agents on-demand using the power of the GitHub Organization Folder plugin

  • Showcase the auto-healing capability of Private SaaS Edition - in the event a Jenkins master or agent were to crash, the cluster should automatically provision a new master/agent without loss of data, and in a matter of minutes


We also showed a live “world’s largest enterprise Jenkins cluster” using Private SaaS Edition, running real CI/CD workloads.


 

In this blog we want to talk about how we got Private SaaS Edition to set up and manage the above cluster, and also share key lessons we learned along the way. Read on and enjoy!

“Go Big, or Go Home!”

We wanted to showcase the above use-cases on a Private SaaS Edition cluster running on AWS with 2000 Jenkins masters and 10,000 executors, at any given point in time.  

[Image: Jenkins World demo cluster with 2000 masters and 9000 executors]

 

Let’s set some context here for our readers - why 2000 masters? Is that even important?

Here’s why it makes sense - with Private SaaS Edition, enterprises can essentially spin up a Jenkins master for every single active development project. So, each project could essentially get its own CI/CD workspace - with Jenkins masters and agents to handle the project’s CI/CD workloads. Isn’t that awesome!

“It’s a Marathon, not a Sprint!”

We wanted the audience to get a visual representation in real-time of the above use cases as Sacha walked through each step. So, we needed the following -

  • A dashboard (Blue Ocean-based) to show the size and health of the cluster in real-time

  • Additional health metrics from the cluster persisted in Elasticsearch (ES), that would be displayed on the dashboard

We built a prototype fulfilling the above requirements in less than four weeks. The next step was to scale this prototype to the world’s largest enterprise Jenkins cluster. We spent the next week determining the budget and appropriate EC2 instances to run the different components.

As we got the cluster to ~1400 Jenkins masters, a couple hundred “worker” VMs and a few thousand executors, we seemed to have hit a dead end. We were unable to scale the cluster any higher: Marathon became unresponsive, which prevented us from spinning up additional Jenkins masters or agents. After a good deal of troubleshooting we discovered (quick shout-out to Dario for confirming our suspicions on a crazy Saturday morning!) that Marathon was stuffing a lot of state data (our state data) into ZooKeeper znodes. Once that data outgrows ZooKeeper’s znode size limit, Marathon gets very confused and becomes unresponsive. Fortunately, we could make a relatively minor tweak to ZooKeeper’s configuration and increase the znode limit significantly. This fix did the trick. The pressure valve was opened and we quickly scaled the cluster to 2000 Jenkins masters.

Side note: There is good news on this front. The very next release of Marathon (1.4.0) is supposed to address the limitation referenced above.
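For readers who want to picture the znode limit, here is a minimal sketch. The post does not name the exact setting that was changed. We assume here that it is ZooKeeper's documented `jute.maxbuffer` Java system property, which caps znode data at 0xfffff bytes (just under 1 MiB) by default. The helper names and the `headroom` allowance are hypothetical and only illustrate the size check.

```python
# Assumption: the limit discussed above is ZooKeeper's jute.maxbuffer,
# whose documented default is 0xfffff bytes (just under 1 MiB).
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF


def fits_in_znode(payload: bytes,
                  max_buffer: int = DEFAULT_JUTE_MAXBUFFER,
                  headroom: int = 1024) -> bool:
    """Return True if payload should fit in a single znode.

    headroom is a hypothetical allowance for the znode path and request
    framing that travel in the same packet as the data.
    """
    return len(payload) + headroom <= max_buffer


def jvm_flag(max_buffer: int) -> str:
    """Build the JVM system-property flag that sets the limit."""
    return f"-Djute.maxbuffer={max_buffer}"


state = b"x" * (2 * 1024 * 1024)  # 2 MiB of stand-in state data
print(fits_in_znode(state))                               # False
print(fits_in_znode(state, max_buffer=4 * 1024 * 1024))   # True
print(jvm_flag(4 * 1024 * 1024))                          # -Djute.maxbuffer=4194304
```

If this is indeed the setting involved, ZooKeeper's documentation advises setting it to the same value on all servers and clients. The 4 MiB figure is only an example, not the value used on the demo cluster.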

“Houston, We Have a Problem!”

Two days before Sacha’s keynote demo, disaster struck! The cluster we had so painstakingly set up was destroyed inadvertently. Here’s what happened - we had spun up a bunch of test clusters on AWS as part of this effort and, in our haste to clean up these test clusters, our primary demo cluster was destroyed instead! To destroy a Private SaaS Edition cluster, you would type the following command in the CLI -

bees-pse destroy

Except this time, someone accidentally also did this -

bees-pse destroy -f

The -f here is the same as the -f in the quintessential “rm -rf /”. In other words, if you have administrative privileges, you can delete the whole cluster without any verification.
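To make the effect of a force flag concrete, here is a toy sketch of a destructive command that normally asks for verification and skips it when -f is passed. This is not the bees-pse implementation. The `toy-cluster` tool name, the "type the cluster name" prompt, and both functions are assumptions made purely for illustration.

```python
import argparse


def destroy(cluster: str, force: bool, ask=input) -> bool:
    """Return True if the (hypothetical) cluster should be destroyed."""
    if force:
        # -f skips the verification step entirely.
        return True
    # Without -f, require the operator to type the exact cluster name.
    answer = ask(f"Type the cluster name to destroy '{cluster}': ")
    return answer.strip() == cluster


def main(argv=None) -> bool:
    parser = argparse.ArgumentParser(prog="toy-cluster")  # hypothetical tool
    parser.add_argument("command", choices=["destroy"])
    parser.add_argument("cluster")
    parser.add_argument("-f", "--force", action="store_true")
    args = parser.parse_args(argv)
    return destroy(args.cluster, args.force)


# A mistyped name aborts the destroy; -f never asks at all.
print(destroy("demo", force=False, ask=lambda _prompt: "test-1"))  # False
print(main(["destroy", "demo", "-f"]))                             # True
```

The `ask` parameter is injectable only so the prompt can be exercised without a terminal.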

This was a disaster scenario in every sense of the word. We had to get the entire cluster up and running for rehearsals the next day. Well, as it turns out, we had designed Private SaaS Edition with these types of situations in mind from the get-go -

  • In a Private SaaS Edition cluster, the $JENKINS_HOME data is continually backed up (snapshots stored in EBS)

  • Customers can recreate their cluster from a snapshot, in the event of a disaster
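The recovery idea in the two points above can be sketched as "pick the newest snapshot taken before the disaster, per master." The snippet below is only an illustration of that selection step. The `Snapshot` record, its fields, and the sample identifiers are hypothetical and do not reflect how Private SaaS Edition actually stores or names its snapshots.

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class Snapshot(NamedTuple):
    snapshot_id: str     # hypothetical identifier
    master: str          # which Jenkins master's $JENKINS_HOME it holds
    taken_at: datetime


def latest_before(snapshots: Iterable[Snapshot], master: str,
                  cutoff: datetime) -> Optional[Snapshot]:
    """Return the newest snapshot of `master` taken at or before `cutoff`."""
    candidates = [s for s in snapshots
                  if s.master == master and s.taken_at <= cutoff]
    return max(candidates, key=lambda s: s.taken_at, default=None)


snaps = [
    Snapshot("snap-a", "master-1", datetime(2016, 9, 12, 8, 0)),
    Snapshot("snap-b", "master-1", datetime(2016, 9, 12, 9, 0)),
    Snapshot("snap-c", "master-1", datetime(2016, 9, 12, 11, 0)),
]
disaster = datetime(2016, 9, 12, 10, 0)  # made-up time for the example
print(latest_before(snaps, "master-1", disaster).snapshot_id)  # snap-b
```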

“Bright, Sunshiny Day!”

Two hours after the above disaster struck, we had the cluster back up and running - 300+ VMs, 2000 Jenkins masters, 9000+ executors, Elasticsearch and our prototype monitoring system! A few hours later we were able to scale the cluster even more, and hit our goal of running 10,000 executors on this cluster. Here are some key stats from this cluster -

  • 12 TB RAM (total cluster)

  • ~320 EC2 instances

  • ~2M jobs run in a given day

  • TBs of data in Elasticsearch
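To put these numbers in perspective, here is some back-of-the-envelope arithmetic using only the figures quoted in this post. It assumes decimal units (1 TB = 1000 GB) and treats the approximate counts as exact, so read the results as rough averages rather than measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60

jobs_per_day = 2_000_000     # ~2M jobs run in a given day
total_ram_gb = 12 * 1000     # 12 TB total, assuming decimal units
instances = 320              # ~320 EC2 instances
masters = 2000
executors = 10_000

print(round(jobs_per_day / SECONDS_PER_DAY, 1))  # 23.1 jobs per second
print(total_ram_gb / instances)                   # 37.5 GB RAM per instance
print(executors / masters)                        # 5.0 executors per master
```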

 

Conclusion

If you are an enterprise looking for a turnkey solution to set up a Jenkins cluster at scale on your private cloud, do take Private SaaS Edition out for a spin and let us know your thoughts.

In the immortal words of Gordon Gekko - “Greed is good,” so maybe for next year’s CloudBees Keynote Demo at Jenkins World we can showcase the world’s largest enterprise Jenkins cluster that can span multiple regions and multiple cloud service providers, in a matter of minutes!

Finally, a quick plug for Stephen Connolly’s great blog post where he presents a blueprint to scale Jenkins to an even larger scale. We highly recommend it.

Kal Vissa, Senior Product Manager
John Pampuch, Engineering Manager
CloudBees

 
