Why am I writing this?
The other day I heard a quote from W. Edwards Deming that rings very true: “Without data you are just another person with an opinion.” It reminded me of how hard, and at the same time how important, it is to measure quality.
But let’s step back first and ask ourselves, why do we want to track metrics? Metrics help us:
- Understand if we are getting better, version after version
- Understand the pain our customers suffer
- Drive our roadmap to harden the areas with the most customer pain
- Understand the impact of the quality activities on our deliverables
In that spirit, I put together a set of CloudBees Jenkins Enterprise metrics to track:
- Bugs affecting customers, per release version, per component: these are effectively the issues escalated by support that affect customers and require an emergency fix. Our main goal right now is to reduce these to zero.
- Bugs internally reported, per release line: this metric captures how many issues we catch internally during development, testing and dogfooding before they affect customers.
- Bugs reported by the professional services team, per release line: since 1.6.0, the professional services (PS) team engages with customers for installs. This means that customers are unlikely to report installation issues themselves; the PS team reports them instead. We decided to track this to make sure we were not masking install/CLI issues that may be getting worse over time (in the end, these issues affect customers directly).
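As a rough illustration of how the three metrics above could be tallied, here is a minimal Python sketch. It assumes the issues have already been exported from the tracker into plain records; the field names (`affected_version`, `component`, `source`), the `source` values and the sample records are all hypothetical and made up for the example, not real data.

```python
from collections import Counter

# Hypothetical, made-up issue records (not real data). Field names are assumptions.
issues = [
    {"affected_version": "1.5.0", "component": "install", "source": "support"},
    {"affected_version": "1.5.1", "component": "install", "source": "support"},
    {"affected_version": "1.5.1", "component": "upgrade", "source": "support"},
    {"affected_version": "1.6.0", "component": "elasticsearch", "source": "support"},
    {"affected_version": "1.6.0", "component": "install", "source": "ps"},
    {"affected_version": "1.6.1", "component": "cli", "source": "internal"},
]


def release_line(version):
    """Map a version such as '1.6.2' to its release line, '1.6.x'."""
    major, minor, *_ = version.split(".")
    return f"{major}.{minor}.x"


def count_by(issues, source):
    """Count issues from one reporting source per (release line, component)."""
    counts = Counter()
    for issue in issues:
        if issue["source"] == source:
            key = (release_line(issue["affected_version"]), issue["component"])
            counts[key] += 1
    return counts


# One tally per metric: escalated by support, reported by PS, found internally.
escalated = count_by(issues, "support")
ps_reported = count_by(issues, "ps")
internal = count_by(issues, "internal")
```

Grouping by release line rather than by exact version is what makes a 1.5.x versus 1.6.x comparison possible, and it only works if every issue carries the correct affected version and component.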
Process and period of time
I went back to 1.5.0, which was released in November 2016, and made sure that every issue up to the latest release, 1.7.0, had the correct affected version and the right component assigned in JIRA. This way we could compare pears with pears and apples with apples.
Results
The following charts show the evolution of the releases from 1.5.0 to 1.7.0. We will be comparing 1.5.x to 1.6.x.
As a summary:
- We have reduced installation issues from six to one, and upgrade issues from four to zero. Upgrades are not scary for customers anymore.
- Customer-facing issues in 1.6.x are less than half of what they were in 1.5.x: escalated issues dropped from 20 to nine. This is a significant achievement, given that twice as many issues were resolved in 1.6.x as in 1.5.x.
- ElasticSearch is the next hot topic, with three issues affecting customers. This has to be our focus area.
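To make the size of these changes explicit, the counts quoted in the summary can be turned into percentage reductions. This is only a small sketch; `percent_reduction` is a helper name introduced here for illustration.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`, rounded to one decimal place."""
    return round(100 * (before - after) / before, 1)


# Counts taken from the summary above (1.5.x versus 1.6.x).
escalated_drop = percent_reduction(20, 9)  # escalated customer-facing issues
install_drop = percent_reduction(6, 1)     # installation issues
upgrade_drop = percent_reduction(4, 0)     # upgrade issues

print(escalated_drop, install_drop, upgrade_drop)  # 55.0 83.3 100.0
```

Nine out of 20 is 45% of the earlier count, which is where the "less than half" figure comes from.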
How we got there
Before the CloudBees Jenkins Enterprise launch it was very clear that the biggest customer problems were around the installation process and upgrades. So we decided to focus our efforts on hardening those two areas. How?
- Automation audit - we reviewed how we were continuously testing upgrades and installs, which cases were missing, and how to automate them. After this effort, all the upgrade scenarios are covered.
- Installation testing - we performed several bug bashes, which are collective testing sessions, to cover installations on all the supported platforms. Automation was improved on that front as well.
- Release testing - after defining the P1 CloudBees Jenkins Enterprise cases for the launch, and while those are getting automated, we are consistently testing them release over release in a manual/automated fashion.
- Internal dogfooding of CloudBees Jenkins Enterprise prior to public releases - we use our software internally before releasing it to customers, so that we experience real usage ourselves first and prevent issues from impacting our customers.
- Release blockers and escalated issues analysis - for every release, we analyze the issues that blocked the release as well as the ones escalated by customers, to make sure we have actions in the backlog to prevent them from happening again. This way we close the circle of failing, fixing, analyzing, coming up with preventive actions and learning from our mistakes. Then we incorporate that new knowledge into our daily development.
And then what? Continuous improvement
Now that ElasticSearch has become the hot potato, we are working on better understanding what the problems are and putting together a quality improvement plan to harden testing in that area. So far, we have executed a full regression of the ES cases and found some issues. We are also starting to invest in monitoring to better understand ES sizing and crashes.
But these metrics still miss all the cases of customer friction where a workaround, a patch or a restart fixes the problem and the issue never gets escalated to engineering. To get visibility on that front, the engineering and support teams are working together to start reporting on those cases in Zendesk, so that we can have another chart telling us the real support story. Stay tuned for the data analysis!