“You’ll be stone dead in a minute,” from Monty Python and the Holy Grail. Twickenham Film Studios (1975).
As an infrastructure guy, I’m always interested in learning about behind-the-scenes problems that plague other companies in the X-as-a-Service industry. There can be plenty to learn from how a company handles a customer-facing issue, both in how they communicate during the problem and in the level of detail they provide in their postmortem. This transparency also helps everyone up their own game in an industry whose overall growth ultimately depends on trust.
The recap itself was terse and contained barely enough information to explain the events that occurred, along with some new things they will be doing to prevent another DDoS in the future. But, in my opinion, it was missing a key piece: a mea culpa. Not once during or after the event did Zerigo apologize to their customers or show remorse for the downtime those customers experienced. To be fair, the outage wasn’t directly Zerigo’s fault: they were being attacked by an outside group. However, because Zerigo failed to satisfy their customers’ appetite for understanding during the outage and because their postmortem was very weak, the majority of their customers did not have nice things to say during and after the outage.
Everybody Does it in Their Own Way
Monitoring service Boundary recently had two incidents that impacted its service, and Boundary technicians blogged about both (http://blog.boundary.com/2012/08/01/riak-kobayashi-post-mortem/ and http://blog.boundary.com/2012/08/01/streaming-post-mortem-731/). The level of technical explanation was quite high, and they were to the point about how they would fix the issue. These were well-written postmortems.
A recent report by Pinboard (http://blog.pinboard.in/2012/07/a_bad_privacy_bug/) highlighted a problem that didn’t impact service, but instead exposed sensitive information to some customers. Again, the level of detail as to how the problem occurred was very high, and the communication was well done. Notice toward the end of the blog, the developer who wrote the post apologized and personalized the experience, even opening up discussion directly for anyone who wanted even more details. This action creates a huge level of trust.
Even the usually tight-lipped Microsoft has been exceedingly open about their outages around the Azure platform, as noted on their blog. They even offer an apology, which I think may be a first for Microsoft!
One thing you don’t want to do is let someone else write your postmortem for you. Recently, Knight Capital lost quite a bit of money with some software trading errors. While none of their customers were impacted, the story was big enough in the media that a timely response would likely have helped stem any customer fear resulting from the event. The major analysis of the event, however, came from other groups like Nanex Research, with Knight Capital remaining very tight-lipped about the situation.
Infrastructure developers have come to expect a higher level of technical explanation in outages. Twitter’s outage on July 26, 2012 was notable to me because they apologized for being down (good!) but explained away the issue with very little detail. While they don’t owe anyone a deeper technical explanation, the vague one they did give is akin to hand waving and doesn’t sit well with a lot of the technical community.
How We Operate at CloudBees
The CloudBees engineering team has been hard at work to ensure better communication and visibility into our operations as we grow. In particular, we have been steadily working to improve our communication during events that impact customers, even minor events. We have recently overhauled our status website to give better feedback during outages, and to maintain content about postmortems. We have also enhanced the CloudBees toolbar with a status indicator. While these features still need improvement, it takes time to implement features that allow for utmost transparency. However, we are committed to making our Platform as a Service more transparent than your internal IT team would ever even consider.
In summary, your customers are your allies. They’ve given you their data, their time and their trust. More importantly with PaaS, your customers are often betting their business and their own customers on you. In return they expect good service, and if your service is not available, they expect good customer service in letting them know what the issue is and when the service will be available again. It’s hard to provide too much information in an outage. However, it’s really easy to alienate your customers, and have them pack up and leave, by not providing enough information about it. And when the outage is over, a simple apology, an explanation of what happened, and a well thought-out analysis of what you’re doing to prevent it from happening again are what it takes to maintain your customers’ trust.
Elite Developer and Architect
Caleb is an infrastructure guru with a deep background in building and managing large-scale systems. He’s been involved directly with a number of open source projects, including KDE and KDevelop, Ice and OpenStack.
Follow Caleb’s adventures on Twitter.