In the last two weeks, CloudBees faced two outages: one related to its main infrastructure provider, Amazon, and the other to a bug in the Linux kernel. This report describes in greater detail how users of CloudBees services were impacted and how CloudBees responded to fix those problems.
Note: You can check the status of CloudBees’ services at any time by visiting CloudBees’ status page.
(1) AWS Outage
A few days ago, Amazon Web Services (AWS) faced an outage in one of its data centers (a zone) in its US-EAST region. You can read the details of what happened on AWS’s RSS feed or on ZDNet, among others.
The most notable aspect of this outage is that only one zone was physically impacted (a region is composed of multiple independent zones), yet an API endpoint responsible for EBS storage running in that zone didn’t fail cleanly. This in turn prevented the equivalent API endpoints in the other zones from handling any mutating request. Consequently, while servers that had already been started before the outage in the three other zones kept functioning, most AWS API calls requesting a new resource (such as starting a new server backed by an EBS store) failed in all zones. Since all 4 zones are supposed to work independently (i.e., a problem in one zone should not break the API in all 4 zones), this outage was severe, and a number of cloud vendors that had relied on a multi-zone architecture to improve the resiliency of their services were caught off guard. This multi-zone single point of failure is clearly a weakness that AWS has to fix urgently.
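To make this failure mode concrete, here is a minimal Python sketch of it. It is a toy model, not a description of how AWS is actually built: the `Region` class, the zone names, and the `shared_api_healthy` flag are all hypothetical, and the sketch simply assumes that launch requests in every zone depend on one shared component.

```python
from dataclasses import dataclass, field


class LaunchError(Exception):
    """Raised when a request for a new resource cannot be served."""


@dataclass
class Region:
    # Hypothetical zone names; the report only says the region had 4 zones.
    zones: tuple = ("zone-a", "zone-b", "zone-c", "zone-d")
    failed_zones: set = field(default_factory=set)
    # Assumption: one flag stands in for the API that did not fail cleanly.
    shared_api_healthy: bool = True
    running: dict = field(default_factory=dict)  # server name -> zone

    def launch(self, name, zone):
        """Mutating call: needs a healthy zone AND a healthy shared API."""
        if zone in self.failed_zones:
            raise LaunchError(f"{zone} is down")
        if not self.shared_api_healthy:
            raise LaunchError(f"API unavailable, cannot launch in {zone}")
        self.running[name] = zone

    def is_serving(self, name):
        """Servers that are already running depend only on their own zone."""
        return self.running[name] not in self.failed_zones

    def fail_zone(self, zone):
        """One zone fails and takes the shared API down with it."""
        self.failed_zones.add(zone)
        self.shared_api_healthy = False


region = Region()
region.launch("web-1", "zone-a")
region.launch("web-2", "zone-b")
region.fail_zone("zone-a")

print(region.is_serving("web-1"))  # server in the failed zone
print(region.is_serving("web-2"))  # server in a healthy zone
try:
    region.launch("web-3", "zone-c")  # healthy zone, but the API is down
except LaunchError as err:
    print("launch failed:", err)
```

In this model, a multi-zone deployment keeps its existing capacity in the healthy zones but cannot replace what it lost, which is the situation described for the services below.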
What impact did it have on CloudBees users and customers? It depends on the services and SLA they had subscribed to. Let’s go through the list.
Our Git, Subversion and Maven repositories weren’t impacted by the AWS outage. Those services are relatively “static” (i.e., they don’t require much elasticity from the underlying IaaS), so they make few calls to the AWS API (which was down, for the most part) and kept running normally. No repository data was lost.
Our Jenkins as a Service offering is spread across multiple zones. For customers whose master was hosted in the impacted zone, we would normally have restarted those instances in another zone, but, as discussed above, the AWS API was unable to serve the required requests for any of the other zones (not just the impacted one), which prevented us from doing that migration. For Jenkins masters that were not hosted in the impacted zone, some builds were slower to start, since we couldn’t start any new build machines in any of the 4 zones (again, while only one zone was impacted, the API went down in all 4 zones). Over the last few months, we have worked to greatly reduce our dependency on the AWS API for our Jenkins as a Service offering, and this effort limited our exposure to the outage even though the AWS API was not functional. No Jenkins data was lost.
On our PaaS deployment platform, customers who happened to run in the impacted zone and hadn’t set their application as “highly available” within CloudBees may have had their instance impacted. Free applications were the most vulnerable to downtime since they are not clustered. Customers operating under an HA setup were able to keep running throughout the outage. However, since the AWS API was down for all 4 zones, we weren’t able to start new nodes in a healthy zone to bring those clusters back to their normal size. Customers running in Europe or in another data center (HP, etc.) weren’t impacted at all by the AWS outage: CloudBees core PaaS servers, which are fully HA, remained up and running at all times [*]. It is worth noting that CloudBees is able to replicate applications not only across multiple zones but also across multiple regions, which offers a very high level of availability. No application data was lost.
Concerning the CloudBees MySQL service, customers who had standalone MySQL instances running in the impacted zone could no longer access their data (though the application accessing it was probably impacted as well). Customers who had opted for the CloudBees clustered database offering and had their master node in the impacted zone couldn’t write or update data (we needed access to a working EBS API to perform the switch). Clustered customers with their master in another zone weren’t impacted. Finally, customers with a database running in a healthy zone weren’t impacted by this outage. No databases were lost.
We are constantly working to improve our resilience to IaaS outages. As a result, the customers who opted for CloudBees’ HA features didn’t suffer from AWS’s recent outage. If anything, this outage should remind our customers that they can decide, on an application-by-application basis, what type of SLA they want by selecting the appropriate service level among those CloudBees already offers to deliver high availability.
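The MySQL cases above can be summarized as a small decision function. This is only a restatement of the paragraph in Python, not CloudBees code: the function name `mysql_impact`, its parameters, and the result labels are made up for illustration.

```python
def mysql_impact(clustered, primary_in_impacted_zone):
    """Return the impact described in this report for one database setup.

    `primary_in_impacted_zone` refers to the single instance of a
    standalone database, or to the master node of a clustered one.
    """
    if not primary_in_impacted_zone:
        return "not impacted"
    if clustered:
        # Switching to another node needed the EBS API, which was down.
        return "writes blocked"
    return "data inaccessible"


# Print the outcome for each of the four combinations.
for clustered in (False, True):
    for in_zone in (False, True):
        kind = "clustered" if clustered else "standalone"
        where = "impacted zone" if in_zone else "healthy zone"
        print(f"{kind:10} / {where:13} -> {mysql_impact(clustered, in_zone)}")
```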
(2) Leap Second Linux Bug
A little after midnight GMT on July 1, CloudBees’ monitoring alerts indicated unusually high CPU levels across a number of servers. We narrowed the problem down to an apparent Linux kernel issue that resulted in CPU exhaustion after the leap second took effect. We responded by restarting the affected EC2 instances, which restored normal operations for most of our users’ applications.
There were cases, however, where some applications needed to be moved to new AWS EC2 instances. The migration to the new set of EC2 instances was rolled out gradually over a three-hour window to minimize the impact on our users. At this time, our environment is operating normally, but we continue to monitor it closely.
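As an illustration of what a gradual rollout over a fixed window can look like, here is a minimal Python sketch. It does not describe the procedure CloudBees actually followed: the helper `plan_batches`, the batch size, and the application names are all hypothetical, and the only figure taken from this report is the three-hour window.

```python
from datetime import timedelta


def plan_batches(apps, window, batch_size):
    """Split `apps` into batches and space their start offsets over `window`."""
    batches = [apps[i:i + batch_size] for i in range(0, len(apps), batch_size)]
    if not batches:
        return []
    # Spread the starts evenly so that the last batch begins inside the window.
    step = window / len(batches)
    return [(i * step, batch) for i, batch in enumerate(batches)]


apps = [f"app-{n}" for n in range(1, 8)]  # hypothetical application names
plan = plan_batches(apps, timedelta(hours=3), batch_size=3)
for offset, batch in plan:
    print(f"start + {offset}: {batch}")
```

Moving a few applications at a time keeps most of them serving traffic at any given moment, at the cost of a longer overall migration.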
**[Update: We have determined that some applications with clustered (HA) configurations experienced downtime when they were restarted. Our investigation found that this was due to a router configuration problem that prevented requests from reaching the running application instances. Subsequent app restarts resolved the issue, but this did result in downtime for some “HA” configured apps. Customers whose apps were configured with New Relic uptime monitoring were properly alerted when these apps became inaccessible.]**