Recently our DNS (Domain Name System) provider, Zerigo, came under a distributed denial of service (DDoS) attack. DNS is a critical service, and the attack caused outages of applications and services for some users. Because DNS is distributed, the effect was not observed evenly (what you experienced was probably different from what someone else experienced). DDoS attacks are hard to combat because they are conducted from many compromised desktop machines, usually without the knowledge of the machines' owners - you can read more about them here.
We use a service like Zerigo because they have servers on many continents, allowing for high availability of DNS. However, this attack was targeted and sustained, and clearly this global distribution was not able to handle it. We were forced to make changes that take time to propagate to your local DNS service (DNS updates always take some amount of time due to TTL, and even then, some DNS servers cache entries for longer than they are meant to).
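To illustrate why DNS changes take time to take effect, here is a minimal sketch (in Python, using hypothetical record names, addresses, and TTLs - not our actual configuration) of how a caching resolver honors TTL: once a record is cached, the resolver will not re-query the authoritative source until the TTL expires, so a change made at the provider is invisible to that resolver until then.

```python
import time

class CachingResolver:
    """Toy model of a caching DNS resolver that honors per-record TTLs."""

    def __init__(self, now=time.time):
        self._now = now          # injectable clock, so the sketch is testable
        self._cache = {}         # name -> (address, expires_at)

    def resolve(self, name, authoritative_lookup):
        """Return a cached answer if its TTL has not expired;
        otherwise query the authoritative source and cache the result."""
        entry = self._cache.get(name)
        if entry is not None and entry[1] > self._now():
            return entry[0]                       # cached answer, possibly stale
        address, ttl = authoritative_lookup(name)
        self._cache[name] = (address, self._now() + ttl)
        return address

# Hypothetical scenario: the provider switches the address midway, but a
# client that cached the old answer keeps seeing it until the TTL expires.
clock = [0.0]
resolver = CachingResolver(now=lambda: clock[0])

old = lambda name: ("192.0.2.10", 3600)    # old provider's record, 1-hour TTL
new = lambda name: ("198.51.100.7", 300)   # new provider's record, 5-minute TTL

assert resolver.resolve("app.example.com", old) == "192.0.2.10"
clock[0] = 1800   # 30 minutes later: TTL not yet expired, old answer persists
assert resolver.resolve("app.example.com", new) == "192.0.2.10"
clock[0] = 3601   # past the 1-hour TTL: the new record is finally fetched
assert resolver.resolve("app.example.com", new) == "198.51.100.7"
```

A long TTL therefore directly sets a floor on how quickly any provider migration can complete.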
Following is a sequence of events and what is being planned to mitigate this issue:
On Monday, July 23rd, at 3:30AM UTC, Cloudbees engineers in the APAC region were alerted by our infrastructure monitoring to a higher-than-normal number of accessibility alerts within the infrastructure. While these types of alerts are not uncommon given the large number of moving pieces within the Cloudbees system, above certain thresholds they typically indicate larger-scale problems such as network or connectivity issues.
Over the next 90 minutes, specific accessibility checks performed by the engineering team indicated that systems were still reachable, and that the problem seemed more likely to be a networking issue with the monitoring system itself.
Around 5:00AM UTC, engineers began to suspect that the problem was DNS-related. At that time our engineers were able to confirm that our DNS provider, Zerigo, was experiencing an outage caused by a DDoS attack.
Over the next few hours, engineers moved pieces of infrastructure around and reconfigured DNS entries to work around the attack. DNS availability was intermittent, but working. Engineers in different parts of the APAC region experienced different levels of availability of Cloudbees services depending on where they were.
At 10:00AM UTC, as engineers in the US-EAST region came online, it became evident that the DNS outage Zerigo was experiencing would be long-term. At this point we began planning a migration of DNS to a new provider. Over the next 60 minutes we manually copied over all production-facing DNS entries in order to bring systems back online as quickly as possible. At around 11:00AM UTC we switched over to the new DNS provider and watched as propagation began.
There are a number of aspects of this event we want to address:
1) It took a long time for our engineering team to recognize the issue. This stems from the fact that our monitoring alerted that it was unable to reach certain services, but when we accessed those services it was clear they were running fine. However, in many cases we reach services through a Virtual Private Network (VPN), which uses a separate name resolution system from our production-facing one. Thus, when monitoring said services were down and we verified they were, in fact, up, the initial inclination was that monitoring was reporting faulty status data.
Because we had not experienced a DNS outage before, we were not initially looking at the DNS service as a potential point of failure. To better handle this scenario, we will be adding checks of our DNS system's health specifically to our internal and external monitoring systems.
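One way to monitor DNS health independently of any single resolution path is to query the same name through several resolvers and alert when any of them fails or returns an unexpected answer. Here is a minimal sketch of that idea (the resolver callables and record names are hypothetical stand-ins; in production each callable would query a distinct DNS server):

```python
def check_dns_health(name, resolvers, expected):
    """Query every resolver for `name`; report any resolver that fails,
    or that returns an answer outside the expected set of addresses.
    `resolvers` maps a label to a callable name -> address."""
    problems = []
    for label, resolve in resolvers.items():
        try:
            answer = resolve(name)
        except Exception as exc:
            problems.append(f"{label}: lookup failed ({exc})")
            continue
        if answer not in expected:
            problems.append(f"{label}: unexpected answer {answer}")
    return problems

# Hypothetical resolution paths: one healthy, one serving a stale
# record, and one not responding at all.
def healthy(name): return "203.0.113.5"
def stale(name):   return "192.0.2.10"
def dead(name):    raise TimeoutError("no response")

issues = check_dns_health(
    "app.example.com",
    {"internal-vpn": healthy, "external-1": stale, "external-2": dead},
    expected={"203.0.113.5"},
)
for issue in issues:
    print(issue)
# An empty list would mean every resolution path agrees with expectations.
```

Running such a check against both the VPN-side and public resolvers would have flagged the divergence between "monitoring says down" and "the service is up" as a DNS problem rather than a monitoring fault.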
2) Our choice of DNS provider, Zerigo, was made a few years ago, when our infrastructure needs were different than they are today. Zerigo was a great choice at the time and gave us easily scaled DNS with global capacity. However, today demonstrated some of the limitations of their infrastructure, and also of their public status updates. While we have no doubt today's events will make Zerigo's DNS infrastructure even stronger, we feel it's time to move our DNS to a provider that has greater resources and capacity to handle large-scale DDoS attacks. We will be assessing these providers over the next few weeks.
3) The TTL settings on our cloudbees.com and cloudbees.net domains are set to very long expiry times, and are currently not configurable with our domain registrar. This created problems today: after we switched our DNS to a new provider, some customers still retained cached DNS data and were not able to retrieve the records from the new provider for extended periods of time. We also intend to fix this over the next few weeks.
4) Our status site, status.cloudbees.com, was unavailable for most users due to the DNS outage. We have gone to great lengths to decouple this site from our running infrastructure so that it's available even when other pieces are not. However, today exposed another weakness we need to address.
5) Certain aspects of our RUN@Cloud infrastructure remained available during the DNS outage, because DNS for certain critical pieces is still served by the stax.net domain, which is hosted by a different DNS provider. Some customers may not have been impacted as a result, and we will use the knowledge we gained today about this separation of services and DNS providers as we move forward with hardening our setup.
We are very sorry for the inconvenience this caused. Thankfully, this is a problem that can be addressed, and it will be.
— Caleb Tennis, Elite Developer