Here at CloudBees, we’re eating our own dog food. On our CloudBees DevOptics team, we ran our own DevOps metrics through DevOptics, and what we found gave us insight into our own processes.
We all know it’s not easy to implement DevOps best practices at scale. It’s a complicated task. Too often, unnecessary automation steps increase pipeline complexity and slow down feedback cycles. I’ve seen many instances where teams failed to catch defects early in the process. How do you stop that from happening?
Over the years, I’ve realized that routine, well-defined measurement and monitoring of the DevOps process has tangible benefits. It not only helps catch defects proactively, but also helps you ascertain whether your fixes have the desired effect. To illustrate these benefits, I’ll describe our own best practices for DevOps continuous monitoring and how they proved beneficial for us.
Monitoring our own DevOps environment
We use CloudBees DevOptics to monitor our own DevOps environment. Below are some of the screenshots from our DevOptics dashboard. These screenshots show the variation in the change failure rate (CFR) of the end-to-end acceptance tests that we run against staging and production. You can see that the CFR at production decreased from 13.6% to 7.7% and then 3.0% over a period of 90 days.
Similarly, the CFR at staging dropped from 24.3% to 7.3% and then 3.7% during the same period. By monitoring the CFR, we were able to catch issues in the staging environment and address them before they reached production.
What does change failure rate mean?
If you don’t know what CFR is, I’ll take a step back and explain. The CFR is one of the four key DevOps metrics identified in the 2018 Accelerate State of DevOps Report; the other three are Deployment Frequency (DF), Mean Lead Time (MLT) and Mean Time to Recover (MTTR). Ideally, organizations seeking rapid, frequent deployments with DevOps should see the CFR decline over time. Any abrupt spike or consistent increase in the CFR indicates a process issue. When we ran diagnostics, our team found that most of our own acceptance tests were failing.
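The metric itself is simple: the fraction of runs at a gate that failed, expressed as a percentage. DevOptics computes this from your pipeline data, but a minimal sketch of the calculation (the function name and signature here are illustrative, not the product's API) looks like this:

```python
def change_failure_rate(failed_runs: int, total_runs: int) -> float:
    """Percentage of runs at a gate that failed over some period."""
    if total_runs == 0:
        return 0.0  # no runs means nothing failed
    return 100.0 * failed_runs / total_runs

# e.g., 3 failed runs out of 100 at the production gate:
print(change_failure_rate(3, 100))  # 3.0
```

Tracking this number per gate (staging vs. production) over a rolling window is what makes trends like the drops described above visible.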
To explain further, the CI/CD pipeline for CloudBees DevOptics Value Streams uses ephemeral Docker test environments. Each environment contains the Docker images for the CloudBees DevOptics UI, a Jenkins master and Selenium for executing the acceptance tests.
These test environments are created and torn down during each execution of the pipeline that runs the tests against staging or production, so that every pipeline run is autonomous and self-contained. The issue, however, is that a network problem while building the images (e.g., an external resource needed to build them is unreachable) or between the containers during the test run (e.g., a container dies or becomes unresponsive) produces test failures that are false positives.
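The create-and-tear-down lifecycle described above fits naturally into a context-manager shape: start the containers, run the tests, and always clean up, whether the tests pass or fail. Here is a hedged sketch of that pattern; `start_container` and `stop_container` are hypothetical stubs standing in for the real Docker orchestration in our pipeline:

```python
from contextlib import contextmanager

started, stopped = [], []  # recorded here only so the stubs are observable

def start_container(image):
    # Stub standing in for `docker run`; the real pipeline starts actual containers.
    started.append(image)
    return image

def stop_container(container):
    # Stub standing in for `docker stop` / `docker rm`.
    stopped.append(container)

@contextmanager
def ephemeral_test_environment(images):
    """Spin up the environment, yield it for the test run, always tear it down."""
    containers = [start_container(i) for i in images]
    try:
        yield containers  # acceptance tests run inside this block
    finally:
        for c in reversed(containers):  # teardown runs even if tests raise
            stop_container(c)

# One self-contained pipeline run:
with ephemeral_test_environment(["devoptics-ui", "jenkins-master", "selenium"]) as env:
    pass  # run the Selenium acceptance tests against `env` here
```

The `finally` block is what makes each run self-contained: no containers leak between runs, so one run's failure can't contaminate the next.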
Also, Selenium (the framework we use to run our acceptance tests) is notoriously brittle compared to unit or API integration tests, because any slowness in the UI, change to the UI layout or unexpected behavior can cause a test to fail. All of these factors were producing ‘failures’ that weren’t actually failures in the system, which led developers to ignore the test failures and, eventually, to have little or no confidence in the tests.
Steps I took
To address these issues, I added retries to the pipeline steps that build and start the Docker test environment. If, for example, an external resource is unreachable when building the images, hopefully it is available on the next retry. Or, if the environment fails to start, we retry spinning it up. I also wrapped these retries in timeouts so the pipeline doesn’t just retry forever.
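In our case this was done with Jenkins Pipeline's built-in `retry` and `timeout` steps, but the shape of the pattern is language-agnostic: retry a flaky step a bounded number of times, but never past an overall deadline. A minimal sketch (the helper name and defaults are illustrative, not our actual pipeline code):

```python
import time

def retry_with_deadline(action, attempts=3, deadline_seconds=60, pause_seconds=1):
    """Run `action` until it succeeds, retrying on failure, but give up
    after `attempts` tries or once `deadline_seconds` has elapsed."""
    start = time.monotonic()
    last_error = None
    for _ in range(attempts):
        if time.monotonic() - start > deadline_seconds:
            break  # overall timeout: stop retrying even with attempts left
        try:
            return action()
        except Exception as exc:  # e.g., external resource unreachable
            last_error = exc
            time.sleep(pause_seconds)
    raise TimeoutError("step did not succeed within retry budget") from last_error
```

Bounding the retries both ways matters: retries absorb transient network blips, while the deadline guarantees a genuinely broken environment still fails the pipeline promptly instead of hanging it.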
I also updated the tests themselves to make them more resilient. This involved adding better timeouts and confirming that elements were present in the UI before trying to interact with them.
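"Confirm the element is present, then interact" is what Selenium calls an explicit wait (`WebDriverWait` with an expected condition). The underlying idea is just polling with a deadline, which can be sketched independently of the browser; `condition` here is any callable that returns the element (or a truthy value) once the UI is ready:

```python
import time

def wait_for(condition, timeout_seconds=10.0, poll_seconds=0.25):
    """Poll until `condition()` returns a truthy value, or raise after
    `timeout_seconds` -- the same idea as Selenium's explicit waits."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result  # element appeared: safe to interact with it
        time.sleep(poll_seconds)  # UI not ready yet, poll again
    raise TimeoutError("condition not met before timeout")
```

Waiting on a condition rather than sleeping a fixed interval is what removes the brittleness: a slow page gets the full timeout, while a fast page proceeds immediately.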
Overall, these efforts led to more stable tests and a more stable environment, which we tracked through the reduction in the CFR of these gates. This is important because, by proving we could decrease our own CFR, we increased developer confidence in the tests and removed pipeline failures that had frequently blocked our own CI/CD process. And it took only minimal testing and training.
In the end, I was thrilled to see how monitoring key DevOps metrics gave us a clear picture of our own DevOps environment. It helped the team home in on our specific problems for continuous improvement.
What else can DevOptics do for you?
CloudBees DevOptics also splits lead time into Mean Queue Time, Mean Processing Time and Mean Idle Time at the gate level. You can also measure Run Activity, the mean number of runs at a gate per day. It’s important to note that individual metrics can only say so much. By comparing multiple metrics over time, you can better understand what’s throttling the velocity of value flow in your delivery process. This allows you to see the impact of your actions and confirm you’re on the right track. If you need to drill down further, you can export the metrics to .csv.
In addition to the benefits explained above, the product also helps you improve collaboration by acting as a single source of truth and providing real-time actionable intelligence to teams. You can gauge the productivity of teams and track resource allocations.
CloudBees DevOptics helps you stay on top of your value stream with a holistic view of the software delivery process to monitor, measure and manage DevOps performance. It helps you map various activities in your software delivery process and reduce waste by monitoring value streams. To learn more, check out our additional resources or start using CloudBees DevOptics for Free.