We want to apologize for the service outage that happened on Thursday, 7/31, starting at 6:30PM UTC. We caused you a lot of trouble, and we are really sorry!
After digging into our logs, we reconstructed the series of events:
It started with poor database performance around 6:30PM UTC, which resulted in a growing backlog of events in our Sidekiq queues. As a result, we hit the memory limit of our Redis instance. This caused dropped jobs, since Sidekiq wasn't able to enqueue more jobs.
After we noticed that our Redis instance was completely full, we started discarding some jobs to allow Sidekiq to connect some new workers. But this didn't resolve the issue. Sidekiq grabbed some jobs but was still unable to process them, because they hung during database operations. The majority of the jobs in the queue were log updates, which are INSERT statements on our log table. Browsing through the query list in Postgres revealed a lot of hanging INSERT statements. We had to terminate all hanging queries to allow Postgres to accept new queries. This resolved the issue.
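To illustrate the last step, here is a minimal sketch of how long-running INSERT statements could be picked out of the Postgres query list. It is not the tooling we used during the incident. It assumes Postgres 9.2 or later, where the built-in pg_stat_activity view exposes pid, state, query and query_start, and it uses the built-in pg_terminate_backend function. The helper name hanging_insert_pids and the 5-minute cutoff are hypothetical. During the outage we terminated all hanging queries, not only INSERTs.

```python
from datetime import datetime, timedelta

# Lists currently executing statements; pg_stat_activity is a built-in Postgres view.
ACTIVE_QUERIES_SQL = """
SELECT pid, query, query_start
FROM pg_stat_activity
WHERE state = 'active'
"""

# pg_terminate_backend(pid) ends the backend process that runs a statement.
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"


def hanging_insert_pids(rows, now, max_age=timedelta(minutes=5)):
    """Return the pids of INSERT statements running longer than max_age.

    rows: iterable of (pid, query, query_start) tuples, for example the
    result of ACTIVE_QUERIES_SQL. The 5-minute default is an arbitrary
    assumption, not a value from the incident.
    """
    pids = []
    for pid, query, query_start in rows:
        is_insert = query.lstrip().upper().startswith("INSERT")
        if is_insert and now - query_start > max_age:
            pids.append(pid)
    return pids
```

Each returned pid would then be passed to TERMINATE_SQL through whichever database driver is in use.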
What caused the outage?
Multiple failures combined to cause the long outage.
Our monitoring/alerting failed. We use Librato to visualize key infrastructure metrics. Librato is also responsible for observing the metrics and alerting PagerDuty if key metrics exceed their thresholds. Due to a configuration error, we did not send all metrics to Librato and, therefore, couldn't receive alerts from PagerDuty on our phones. This caused us to notice the incident 45 minutes after it began. About 40 minutes into the incident, NewRelic began to trigger PagerDuty as more key metrics started to exceed thresholds. After receiving PagerDuty alerts from NewRelic, we immediately started taking action. We have since added more alerts to Librato, which trigger PagerDuty if key metrics are missing. In addition, we adjusted the thresholds to include not only upper boundaries but also lower boundaries, so that alerts fire if metrics exceed or undercut these thresholds.
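The alerting rules described above (upper boundary, lower boundary, and missing data) can be sketched roughly as follows. This is only an illustration of the logic, not our actual Librato or PagerDuty configuration. The function name check_metric, the message format, and the metric names in the usage note are hypothetical.

```python
def check_metric(name, value, lower, upper):
    """Return an alert message, or None if the metric looks healthy.

    value is None when no data point arrived in the check window,
    which is treated as an alert condition of its own.
    """
    if value is None:
        return f"{name}: no data received"
    if value > upper:
        return f"{name}: {value} is above upper threshold {upper}"
    if value < lower:
        return f"{name}: {value} is below lower threshold {lower}"
    return None
```

For example, check_metric("queue_size", None, 0, 1000) reports missing data, while check_metric("jobs_per_min", 2, 10, 1000) reports a value below the lower boundary, a case that an upper-only threshold would never catch.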
Postgres couldn't process INSERT/UPDATE statements. We pushed Heroku's monitoring data into NewRelic and Librato, and from the point of view of our data, nothing looked odd. We are currently in contact with Heroku to get more data and figure out what happened under the hood.
We thought this was an issue with Sidekiq not being able to connect to Redis, but discovered the Postgres issue after we resolved the memory issue in Sidekiq. This resulted in a longer outage. This was a human error during debugging.
Again, we are sorry for causing issues on your side.
Ben from the Codeship
Bad database performance (writes got stuck)
Log update queue gets filled up
Redis memory full
Free memory in Redis
Terminate all currently running database queries
Up and running again
Status Update Timeline
Resolved - Builds continue running fine. We'll keep monitoring and write a post-mortem! Jul 31, 2014 - 9:15PM UTC
Monitoring - The builds are running fine again. We will keep monitoring. Jul 31, 2014 - 8:49PM UTC
Update - We've resolved our database issue. We are currently restarting our build infrastructure to resume work on the latest builds. Jul 31, 2014 - 8:40PM UTC
Identified - We've traced the issue to our database and are currently looking to fix the issue. Jul 31, 2014 - 8:21PM UTC
Update - We're having memory issues with our queuing system and are working on a fix. Jul 31, 2014 - 7:46PM UTC
Investigating - We are currently seeing problems with builds not running on our test servers. We are investigating and will keep you up-to-date. Jul 31, 2014 - 7:39PM UTC