Does this sound familiar? You start out your Jenkins controller with a small number of users and a small number of builds. Everything is going great. Word gets around about how amazing it is, and soon other people in your organization want in on the action. The jobs start to get more complex. Then your company starts integrating with more external services, adding even more users. It’s all good; everything is still fun…until one day, out of nowhere, you have an outage and your users complain that your system is painfully slow.
Jenkins is celebrated for how easy it is to get started with, but as demands on Jenkins grow, its performance can degrade. Fortunately, the most common reasons your Jenkins instance is slow are easy to diagnose and correct:
Non-performant plugins
Poorly tuned JVM arguments
Non-optimal garbage collection
Let’s look at each of these in turn.
1. Non-performant Plugins
One of the most common ways to accidentally introduce load is through well-intentioned plugins. Many plugins are fine when your Jenkins instance is small, but they tend to unleash problems at larger scales.
Recently a CloudBees customer thought they had an authentication issue with their Jenkins controller. Every time they logged in, the response was incredibly slow, and they couldn’t figure out why. To diagnose what was going on, we opened up the slow requests log. (You can find this little-known but incredibly useful subdirectory under the Jenkins home directory.) The slow requests log records every web request to Jenkins that takes more than a few seconds to run (including API requests). It also records where each request came from, what it was for, what was in the stack trace, and what Jenkins was trying to do in response.
When we looked at the slow requests for this customer, the log showed that the request to load the Jenkins home page took 87 seconds, an insane amount of loading time. The log also showed why: Jenkins was trying to load data from a plugin that calculated how often every job had succeeded throughout its history and added that information to the home page. The customer had several years’ worth of build history (which they were not rotating). As the history and the number of jobs grew, the plugin began choking the system with the enormous amount of data it wanted for every GET request. With a little research, we determined that the plugin had only one release, six years prior, and a tiny number of installs. This information told us two important things: a) no one was maintaining this plugin, and b) no one cared about the data it was producing. Simply disabling this plugin eliminated the customer’s load time problem.
Maintaining and periodically cleaning up your inventory of plugins to keep your Jenkins instances lean is considered a best practice for system stability and optimal performance. Having fewer plugins also typically means less upgrade risk and less verification effort during an upgrade. (CloudBees customers can use the handy Plugin Usage Analyzer to help.)
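The red flags from the story above (a single release years ago, very few installs) can be turned into a rough screening pass over a plugin inventory. The sketch below is only illustrative: the PluginInfo record, its fields, the plugin names, and the thresholds are all assumptions made for this example, not something Jenkins or CloudBees provides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PluginInfo:
    # Hypothetical record; populate it from whatever inventory you keep.
    name: str
    last_release: date
    installs: int


def stale_plugins(plugins, today, max_age_years=3, min_installs=1000):
    """Return names of plugins that look both unmaintained and rarely used.

    The thresholds are arbitrary defaults chosen for illustration.
    """
    flagged = []
    for p in plugins:
        age_years = (today - p.last_release).days / 365.25
        if age_years > max_age_years and p.installs < min_installs:
            flagged.append(p.name)
    return flagged


inventory = [
    PluginInfo("job-success-stats", date(2015, 4, 1), 120),    # made-up plugin
    PluginInfo("well-maintained", date(2021, 1, 15), 250000),  # made-up plugin
]
print(stale_plugins(inventory, today=date(2021, 6, 1)))
```

A flagged plugin is a candidate for review, not automatic removal; someone may still depend on it.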
2. Poorly Tuned JVM Arguments
Another culprit for slow Jenkins instances is poorly optimized JVM settings. Every JVM argument has overhead and therefore must earn its place. Your goal is to optimize your arguments so that Jenkins works within your parameters but otherwise is left alone to work its mojo.
Take the case of a customer we'll call Big Bank. This Fortune 50 bank contacted us because their users had to wait several minutes to log into their application. And once users finally logged in, navigating the UI was painfully slow. Throughput for the application was 92%, meaning that a full 8% of system time was spent just in garbage collection cycles.
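For readers unfamiliar with the throughput figure, here is a minimal sketch of the arithmetic, assuming throughput means the share of wall-clock time the application spends outside garbage collection pauses. The function name and the sample numbers are ours, chosen only to reproduce a 92% result.

```python
def gc_throughput(pause_seconds, elapsed_seconds):
    """Percentage of wall-clock time the application spent outside GC pauses."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    paused = sum(pause_seconds)
    return 100.0 * (elapsed_seconds - paused) / elapsed_seconds


# Illustrative numbers: 8 seconds of pauses in a 100-second window.
print(gc_throughput([3.0, 5.0], 100.0))  # prints 92.0
```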
To understand what was going on, we used our internal enterprise version of GCeasy (a great garbage collection log analyzer) to analyze Big Bank’s situation. The results showed several JVM arguments that were forcing the Garbage First (G1) garbage collection algorithm to work overtime to stay within the constraints of the arguments’ limitations. When it comes to optimizing Jenkins, the KISS approach works best. In Big Bank’s case, we kept it simple by removing the arguments that were mucking up the works. We also updated their JDK.
The results were electric. Application throughput jumped from 92% to 99.5%, and garbage collection over a 72-hour period went from 42,000 cycles of 20 seconds each down to 2,800 cycles of less than 1 second. Achievement unlocked!
(Of course, some JVM arguments are necessary to tune Jenkins for each production system. We recommend reading this article on preparing Jenkins for support to brush up on the arguments we find useful, along with other best practices.)
3. Non-optimal Garbage Collection
Another situation that commonly causes Jenkins slowness is using the wrong garbage collection settings. Garbage collection is voodoo science; when it comes to cleaning up the system efficiently, the algorithm knows what it’s doing far better than any third-party add-on does. For an example, let’s look at the story of a customer we’ll call Big Shipping Company.
Big Shipping contacted us because their high availability (HA) failover had become an every-other-day, if not daily, occurrence. Using GCeasy to analyze the data, we found that the company’s HA failover was set to 10 seconds, yet their garbage collection pauses typically ranged between 12 and 23 seconds. Whenever a stop-the-world pause outlasted the failover threshold, the system saw an instance that was not responding in the time it expected, and the result was multiple production outages and user downtime.
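The mismatch is easy to check once you have pause durations from a garbage collection log. The following is a minimal sketch, assuming you have already extracted the pauses as a list of seconds; the function name and sample values are invented for illustration.

```python
def pauses_over_timeout(pause_seconds, failover_timeout_seconds):
    """Return the pauses long enough to outlast the failover timeout."""
    return [p for p in pause_seconds if p > failover_timeout_seconds]


# Made-up pause lengths; the long ones sit in the 12 to 23 second range
# described in the story.
sample_pauses = [0.4, 12.0, 17.5, 23.0, 1.2]
risky = pauses_over_timeout(sample_pauses, failover_timeout_seconds=10)
print(risky)  # prints [12.0, 17.5, 23.0]
```

Any non-empty result means either the pauses need to shrink or the timeout is too tight for the current pause profile.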
But HA failover was not Big Shipping Company’s only problem. In addition to the low failover setting, the company was using two JVM arguments they didn’t need: -XX:G1NewSizePercent=20 and -XX:MaxMetaspaceExpansion=64M. We also found that System.gc() method calls in their system were explicitly invoking garbage collection cycles outside the control of the JVM’s garbage collection algorithm. Altogether, the system was moving about as smoothly as a teenager’s first time driving a manual transmission.
To stop Jenkins from constantly stalling, we had to fine-tune it. We removed the JVM arguments the company didn’t need and added one it did, -XX:+DisableExplicitGC, which makes the JVM ignore explicit System.gc() calls so they can no longer force a collection. With these adjustments, Big Shipping’s maximum garbage collection pause time went from a range of 12 to 23 seconds down to 660 ms, with an average pause time of 89 ms: a 3500% increase in performance!
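As a rough illustration of that cleanup, the sketch below filters a JVM argument list the same way: drop the two unneeded flags and make sure explicit GC is disabled. The flags are written in the standard HotSpot -XX: form. The starting argument list, including the heap size, is hypothetical and is not Big Shipping’s actual configuration.

```python
REMOVE_PREFIXES = ("-XX:G1NewSizePercent=", "-XX:MaxMetaspaceExpansion=")
REQUIRED_FLAG = "-XX:+DisableExplicitGC"


def tidy_jvm_args(args):
    """Drop the two unneeded flags and make sure explicit GC is disabled."""
    kept = [a for a in args if not a.startswith(REMOVE_PREFIXES)]
    if REQUIRED_FLAG not in kept:
        kept.append(REQUIRED_FLAG)
    return kept


before = [
    "-Xmx16g",  # hypothetical heap setting
    "-XX:+UseG1GC",
    "-XX:G1NewSizePercent=20",
    "-XX:MaxMetaspaceExpansion=64M",
]
print(tidy_jvm_args(before))
# prints ['-Xmx16g', '-XX:+UseG1GC', '-XX:+DisableExplicitGC']
```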
Optimize Jenkins Periodically for Best Results
Like anything written in Java, Jenkins is not a “set it and forget it” application. With every release of the JDK come bug fixes, memory leak fixes and updates to the garbage collection algorithm. But that doesn’t mean that optimizing your Jenkins setup requires constant maintenance.
After helping hundreds of customers with slow Jenkins, we’ve learned that the best approach is to stay out of its way. If you disable non-performant plugins, remove all but the essential JVM arguments and let the garbage collection algorithm do its magic, your Jenkins instance will continue to run smoothly and provide a satisfying user experience from installation onward.