|"You'll be stone dead in a minute," from Monty Python|
and the Holy Grail. Twickenham Film Studios (1975).
As an infrastructure guy, I'm always interested in learning about behind-the-scenes problems that plague other companies in the X-as-a-Service industry. There can be plenty to learn from how a company handles a customer-facing issue, both in how they communicate during the problem and the level of detail they provide in their postmortem. This transparency also helps everyone up their own game in an industry whose overall growth ultimately depends on trust.
Because of the real-time nature of the web, in particular with a service like Twitter, the minute a service is not available, the story can spread quickly. Incidents that used to be isolated and localized are now communicated far and wide, demanding a greater level of scrutiny and explanation to meet customer expectations.
AWS: They’ve Come a Long Way
The first such service that comes to mind is Amazon Web Services. AWS has helped redefine the face of computing infrastructure over the past few years. They've also created new ways of explaining when problems occur. Those who depend on AWS know that when one of their services begins to have issues, AWS's classic explanation is "the service is experiencing increased latencies." This ambiguous and very passive explanation has become a running joke for many system administrators.
AWS has had a couple of major service disruptions over the years, and has gotten better each time about explaining exactly what went wrong. A glance through one of their more recent major outages from July 2012 shows a very detailed explanation, far beyond the original days of their cloud services, when they provided very few details to their customer community.
The Human Side: “I’m Sorry” Means a Lot – Along with Transparency
Offering a thorough technical overview of how things worked, why they failed, and what steps will be taken to remediate is the best way to ensure your impacted customers feel good about continuing to use your services. AWS's explanation referenced above got high marks across the board, but above all I think the "In Conclusion" section is most important. In that section, they set aside the technical piece and get to the human side. They apologize, and they personalize directly with the customer. The importance of doing that cannot be overstated. At the end of the day, the people impacted by an outage are just that: people. They want to be made to feel better, and an explanation and apology are the most effective way of doing that.
How Not to Handle Outages (ahem, Zerigo)
A more recent outage occurred with DNS provider Zerigo (note: this outage impacted CloudBees services). Zerigo was under a denial of service attack, which rendered many parts of their DNS infrastructure unusable. During this time, however, updates about what was happening from Zerigo were few and far between, and updates that did occur didn't offer much detail on when to expect further information or changes Zerigo would make to prevent another such attack in the future. In my experience, customers are willing to stand by your side if they know you're working hard to fix the problem - but you have to tell them that you are on the case. Zerigo did a poor job of keeping its customers in the loop as to what was going on.
Five days after the outage, Zerigo posted a recap.
The recap itself was terse and contained barely enough information to explain the events that occurred, as well as some new things they will be doing to prevent another DDoS in the future. But, in my opinion, it was missing a key piece: a mea culpa. Not once during or after the event did Zerigo ever apologize to their customers, or show remorse for the downtime their customers experienced. To be fair, the outage wasn't directly Zerigo's fault - they were being attacked by an outside group. However, because Zerigo failed to keep their customers' appetite for understanding satiated during the outage and because their postmortem was very weak, the majority of their customers did not have nice things to say during and after the outage.
Everybody Does it in Their Own Way
Monitoring service Boundary had two incidents that impacted its service recently and Boundary technicians blogged about them (http://blog.boundary.com/2012/08/01/riak-kobayashi-post-mortem/) and (http://blog.boundary.com/2012/08/01/streaming-post-mortem-731/). The level of technical explanation was quite high, and they were to the point with how they would fix the issue. These were well written postmortems.
A recent report by Pinboard (http://blog.pinboard.in/2012/07/a_bad_privacy_bug/) highlighted a problem that didn't impact service, but instead exposed sensitive information to some customers. Again, the level of detail as to how the problem occurred was very high, and the communication was well done. Notice toward the end of the blog, the developer who wrote the post apologized and personalized the experience, even opening up discussion directly for anyone who wanted even more details. This action creates a huge level of trust.
Even the usually tight-lipped Microsoft has been exceedingly open about outages on the Azure platform, as noted on their blog. They even offer an apology, which I think may be a first for Microsoft!
One thing you don't want to do is let someone else write your postmortem for you. Recently, Knight Capital lost quite a bit of money with some software trading errors. While none of their customers were impacted, the story was big enough in the media that a timely response would likely have helped stem any customer fear resulting from the event. The major analysis of the event, however, came from other groups like Nanex Research, with Knight Capital remaining very tight-lipped about the situation.
Infrastructure developers have come to expect a higher level of technical explanation for outages. Twitter's outage on July 26, 2012 was notable to me for a few reasons: they apologized for being down (good!), but explained away the issue with very little detail. While they don't owe anyone a deeper technical explanation, the vague one they gave is akin to hand waving and doesn't sit well with much of the technical community.
How We Operate, at CloudBees
The CloudBees engineering team has been hard at work to ensure better communication and visibility into our operations as we grow. In particular, we have been steadily working to improve our communication during events that impact customers, even minor events. We have recently overhauled our status website to give better feedback during outages, and to maintain content about postmortems. We have also enhanced the CloudBees toolbar with a status indicator. While these features still need improvement, it takes time to implement features that allow for utmost transparency. However, we are committed to making our Platform as a Service more transparent than your internal IT team would ever even consider.
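A status indicator like the one described above usually boils a page full of per-component states down to a single roll-up. As a minimal sketch - the component names, state labels, and severity ordering here are my own assumptions, not CloudBees' actual status API - the logic might look like this:

```python
# Hypothetical roll-up logic for a status indicator: given per-component
# states, report the worst one, since a toolbar widget can only show a
# single overall status. State names and ordering are illustrative.

SEVERITY = {"operational": 0, "degraded": 1, "outage": 2}

def overall_status(components: dict) -> str:
    """Return the worst (highest-severity) state across all components."""
    worst = "operational"
    for state in components.values():
        if SEVERITY.get(state, 0) > SEVERITY[worst]:
            worst = state
    return worst

# Example: one degraded component drags the whole indicator down.
components = {"web": "operational", "api": "degraded", "dns": "operational"}
print(overall_status(components))  # degraded
```

The point of the design is that the summary never looks better than the worst component - understating an incident on your own status page is exactly the kind of opacity that erodes customer trust.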
In summary, your customers are your allies. They've given you their data, their time and their trust. More importantly with PaaS, your customers are often betting their business and their own customers on you. In return they expect good service, and if your service is not available, they expect good customer service in letting them know what the issue is and when the service will be available again. It's hard to provide too much information during an outage. However, it's really easy to alienate your customers, and have them pack up and leave, by not providing enough. And when the outage is over, a simple apology, an explanation of what happened, and a well thought-out analysis of what you're doing to prevent it from happening again is what it takes to maintain your customers' trust.
Elite Developer and Architect
Caleb is an infrastructure guru with a deep background in building and managing large-scale systems. He's been involved directly with a number of open source projects, including KDE and KDevelop, Ice and OpenStack.
Follow Caleb's adventures on Twitter.