Troubleshooting Jenkins Performance: Kubernetes Edition - Part 2
If you didn’t read Part 1 of Troubleshooting Jenkins Performance Issues: Kubernetes Edition, go for it! Part 1 covers the basics about:
The most common APM tools in the Jenkins ecosystem
The basics of performance management for applications running on Kubernetes
Data capture approaches: traditional (during a service outage) vs modern (during peaks of resource consumption)
The Use Case: Environments without any production-ready application performance monitoring tool
Unfortunately, the use case we are detailing in this section is quite common in our Support channel. A Jenkins admin opens a Support case and they don’t have access to the application performance monitoring (APM) reports or they don’t have permission to configure alerts to trigger the data capture. Those of us in CloudBees Support don’t want to let them down, so we've come up with the following workaround:
Configure CloudBees Monitoring to trigger a pipeline job that runs cbsupport to collect performance data on each build and archives it for reference. The data can then be analyzed internally by the company or shared with CloudBees Support.
Configure CloudBees Monitoring to extract stats based on Java Melody to assess the evolution of the performance incident (See all the options offered by Java Melody in this demo.)
Additionally, if the performance issue is managed by the CloudBees Support Team, once the performance issue is solved, the support engineer usually creates a probe for Jenkins Health Advisor in order to notify other Jenkins admins about this performance incident and provide them with a solution or workaround.
Figure 7: Diagram representing the troubleshooting process in the modern approach using CloudBees CI features
This approach has its pros and cons, as listed below.
Pros:
It works for CloudBees CI in both the traditional and modern approaches
The performance scripts are executed using Pipeline Nodes and Processes, so tests and checks can be applied to the controller and to build nodes according to the agent definition
It is not dependent on the APM vendor, and it works even if you don’t have any APM tool in place yet
Cons:
It only works while the controller is still able, at least to some extent, to run the pipeline responsible for collecting data. In other words, monitoring the performance of an application from inside the same application couples the two
It needs to be configured per Jenkins instance
Next, we will detail the steps taken to apply this approach as tested in real support cases. Once upon a time, a couple of CloudBees customers (Cool Software Organization and Big Retailing Company) opened a case reporting that navigation across their Jenkins controllers' UI was slower than usual (pages were taking several minutes to fully load).
They wanted to understand the reason and guess what? Neither company had any APM tool that was production-ready at that moment.
1. Assess the initial situation
Once the Monitoring plugin is installed, it may be a matter of hours, days or weeks before end users report the symptoms again. In the meantime, Monitoring is fed by Jenkins metrics.
With that data, we can understand how the application performance is affected and how the observed symptoms relate to the consumption of computing resources.
In the case of Big Retailing Company, the first five hours of data were enough to confirm that they were suffering from high CPU consumption, with the 95th percentile at 100 percent of CPU, so we continued with the troubleshooting process.
Figure 8: Java Melody reports % CPU in one day from Jenkins Monitoring for Big Retailing Company impacted instance. The violet color line divides the area with data vs. no data.
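To make the percentile figure quoted above concrete, here is a minimal sketch of a nearest-rank percentile over CPU samples. The function name `percentile` and the nearest-rank method are assumptions for illustration only; the post does not say how Java Melody computes its percentiles. A 95th percentile of 100 means that at least 5 percent of the samples were at 100 percent of CPU.

```shell
#!/bin/sh
# Hypothetical helper: nearest-rank percentile of numeric samples read from stdin.
# $1 is the percentile as an integer between 1 and 100 (for example 95).
percentile() {
  p="$1"
  sort -n | LC_ALL=C awk -v p="$p" '
    { v[NR] = $1 }                      # samples arrive sorted in ascending order
    END {
      if (NR == 0) exit 1               # no samples, nothing to report
      idx = int((p * NR + 99) / 100)    # integer ceiling of p * NR / 100
      if (idx < 1) idx = 1
      print v[idx]
    }'
}
```

For example, `printf '%s\n' 40 100 100 65 100 | percentile 95` prints the sample below which 95 percent of the readings fall.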
The case of Cool Software Organization was a different story. After a full month of data collection, we could conclude that its performance issue was not related to CPU consumption. Then, one of the Jenkins admins in this organization remembered that he had recently made changes to the NFS storage, and the new storage server was not fully aligned with the recommendations in our NFS Guide. So, we did not continue with the following steps of this troubleshooting process.
Figure 9: Java Melody report % CPU in one month from Jenkins Monitoring for Cool Software Organization impacted instance. The violet color line points to the time the organization increased the CPU resources available for this instance. In this case, the same symptoms were reported by end users during that month.
2. Create data collector jobs
At this point, we prepare data collector jobs that capture thread dumps for jenkins.big-retailing.com when CPU spikes last for a significant amount of time. Ensure that Pipeline Utility Steps is installed because it is required for this job.
The pipeline logic (demo: performance.groovy) and script resources (demo: jenkinshangWithJstack.sh and jenkinsmemory.sh) are stored with the Jenkins admin’s Shared Libraries. Note that this implementation is easy to extend for other tests like Disk Input/Output.
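As a rough idea of what a thread dump collector script can look like, here is a minimal sketch. It is not the content of jenkinshangWithJstack.sh: the function name `collect_thread_dumps`, its arguments, the file names and the `DUMP_CMD` override are all assumptions made for illustration. It relies on `jstack -l` from the JDK and, where available, on `top -H` from procps.

```shell
#!/bin/sh
# Hypothetical sketch: take several thread dumps of a JVM, a few seconds apart.
# DUMP_CMD is an assumption added for illustration; it defaults to "jstack -l".
collect_thread_dumps() {
  pid="$1"              # PID of the Jenkins controller JVM
  out_dir="$2"          # directory where the dumps are written
  count="${3:-5}"       # number of dumps to take
  interval="${4:-3}"    # seconds to wait between dumps
  dump_cmd="${DUMP_CMD:-jstack -l}"

  mkdir -p "$out_dir" || return 1
  i=1
  while [ "$i" -le "$count" ]; do
    # $dump_cmd is left unquoted on purpose so that flags such as -l are passed through.
    $dump_cmd "$pid" > "$out_dir/threaddump-$i.txt" 2>&1 \
      || echo "dump $i failed for pid $pid" >&2
    # Per-thread CPU usage helps to match busy threads with entries in the dump.
    if command -v top >/dev/null 2>&1; then
      top -H -b -n 1 -p "$pid" > "$out_dir/top-$i.txt" 2>/dev/null || true
    fi
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}
```

A call such as `collect_thread_dumps 1234 /tmp/dumps 5 3` would write five dumps three seconds apart. Note that jstack has to run as the same user that owns the JVM process.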
At the time of writing this post, cbsupport does not implement the batch mode (CE-3948), and that’s why it is not used in this particular context.
We created a pipeline job to collect CPU data like this:
We need to run the data collection inside the controller node, so at least one executor is needed on the controller node (remove it after troubleshooting because running builds on the controller node is not recommended). It is not possible to configure it via groovy (JENKINS-23534). Note that you might need to disable the Operations Center option that enforces the number of executors for controllers by going to Manage Jenkins > Configure Global Security > Client controller on-controller executors.
Once you have configured the Pipelines, build them manually to validate that they work correctly. The following signatures need to be approved in Script Security:
Send artifacts to your favorite Artifact Repository Manager (e.g. Google Storage Plugin for GKE), adapting the post > success > zip step accordingly. This is especially recommended for heap dumps. Otherwise, you could get a
Include a notification to your favorite channel.
3. Create Alarms
With the CloudBees Monitoring plugin installed in the affected Jenkins application, go to Manage Jenkins > Configure System > Alerts. (Alarms are only available for CloudBees CI.) Then, we create an alarm titled “CPU over 85% for 3 minutes.”
Local Metric Gauge: vm.cpu.load - The rate of CPU time usage by the JVM per unit time on the Jenkins controller. It is proportional to the number of CPU cores being used by the Jenkins controller JVM. (This value goes from zero to one for one CPU unit, zero to two for two CPU units …)
Alert if above: 0.85 if the controller has been assigned one CPU core
Alert after (sec): 180
Recipients: Trigger a build for a remote/local job link to the cpu data collector (demo: Jenkinsfile)
Figure 10: Alarms configuration “CPU over 85 percent for three minutes”
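Because vm.cpu.load grows with the number of CPU units, the "Alert if above" value has to be scaled for controllers with more than one core. The sketch below shows that arithmetic. The helper name `cpu_alert_threshold` is hypothetical, and multiplying the ratio by the core count is an assumption derived from the gauge description above rather than a documented rule.

```shell
#!/bin/sh
# Hypothetical helper: scale an 85 percent CPU alert to the number of cores,
# assuming vm.cpu.load ranges from zero to the number of CPU units.
# $1 = CPU cores assigned to the controller, $2 = ratio (defaults to 0.85).
cpu_alert_threshold() {
  cores="$1"
  ratio="${2:-0.85}"
  LC_ALL=C awk -v c="$cores" -v r="$ratio" 'BEGIN { printf "%.2f\n", c * r }'
}
```

For a controller with two CPU cores, `cpu_alert_threshold 2` gives the value to type into "Alert if above" under that assumption.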
For memory troubleshooting, the alert configuration would be something like “Memory over 85 percent for three minutes.”
Local Metric Gauge: vm.memory.heap.usage - The ratio of vm.memory.heap.used to vm.memory.heap.max. (This is a value between zero and one inclusive)
Alert if above (ratio): 0.85
Alert after (sec): 180
Recipients: Trigger a build for a remote/local job link to the memory data collector (demo: Jenkinsfile)
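The two settings above, a ratio threshold plus an "alert after" duration, can be illustrated with a small sketch. The function names `heap_usage_ratio` and `sustained_above`, and the idea of evaluating evenly spaced samples, are assumptions for illustration; the post does not describe how the plugin evaluates alerts internally.

```shell
#!/bin/sh
# Hypothetical helper: ratio of used heap to maximum heap, as in vm.memory.heap.usage.
# $1 = used heap, $2 = max heap (same unit for both).
heap_usage_ratio() {
  LC_ALL=C awk -v u="$1" -v m="$2" 'BEGIN { if (m <= 0) exit 1; printf "%.2f\n", u / m }'
}

# Hypothetical helper: reads one sample per line from stdin and succeeds when the
# value stayed above the threshold for at least the required number of seconds.
# $1 = threshold, $2 = required duration in seconds, $3 = seconds between samples.
sustained_above() {
  LC_ALL=C awk -v t="$1" -v need="$2" -v step="$3" '
    {
      if ($1 > t) run += step; else run = 0   # a sample at or below the threshold resets the run
      if (run >= need) hit = 1
    }
    END { exit (hit ? 0 : 1) }'
}
```

With one sample per minute, `sustained_above 0.85 180 60` succeeds only when three consecutive samples are above 0.85, which mirrors the "over 85 percent for three minutes" rule.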
4. Analyze the Data Package
When CPU was over 85 percent for three minutes, the pipeline data collector job was triggered and it prepared the data package.
Once we received the CPU data package via CloudBees Support Uploads, we loaded the thread dumps into our enterprise version of fastthread.io. (Note that you could integrate this analysis within performanceMemory.groovy via the FastThread API.) Additionally, we passed the threads package to Advisor.
In the case of memory data packages, the heap dump is loaded into our own heaphero.io service. (See the HeapHero API for integration with performanceMemory.groovy.) Other DSEs prefer desktop solutions like Eclipse MAT.
5. Apply Measures
After analyzing the thread dumps from jenkins.big-retailing.com, we observed that, due to the large number of items (including folders), this instance was highly impacted by the weather column.
The short-term measure was removing the health metrics from all folders. The long-term solution, however, is to scale the job load horizontally across more controllers (see CloudBees controller sizing guidelines and Calculating how many jobs, controllers, and executors are needed for Jenkins).
There is an existing probe to detect this issue, but it was not raised in Advisor because the probe looks at the slow request folder, and these types of records were not enabled during the Support bundle generation.
Once we applied the mentioned workaround, I requested another APM report from the Monitoring plugin the following day. The graph shows how the 95th percentile dropped from 100 percent to 85 percent, so we are on the right track.
Figure 11: Java Melody report % CPU in one day from Jenkins Monitoring for Big Retailing Company before applying mitigation measures (on the left side of the violet color line) and after (on the right side of the violet color line)
Once the root cause is found and mitigated, it is advisable to remove the troubleshooting performance alarms described in this post from the Managed controller/Team. From that moment on, the performance of the instance can be monitored by using standard health checks instead. A failed health check will notify the Jenkins admins that the check requires their attention (and we might need to repeat the full process). Thanks to the Operations Center Monitoring Plugin, health checks can be configured for all the controllers from the Operations Center by setting controller Health Checks.
The inspiration for this post came from my Squad colleague Darío Villadiego when he talked about a groovy script to be launched in the Jenkins Script Console to collect performance data (jstack), applying some logic on top of the Metrics API. Thus, kudos to this man.
Finally, thanks to Owen Mehegan, Joost van der Griendt, Viktor Farcic, Arnaud Heritier, Pierre Beitz and Allan Burdajewicz for their reviews and comments.
About Carlos Rodriguez Lopez
Carlos is a Lead DevOps Support Engineer at CloudBees with a background in Java web development and tooling for data analysis. "Never stop learning" is his professional mantra. Check out his GitHub Profile.