On this blog I hope we can talk about how some of the interesting bits work. Some posts will be an approximation (as the detail is boring), or perhaps misinformation (in case the person writing the blog forgets exactly how a system works that they didn’t build themselves!). This post is one of these.
Any reasonably complicated system has a bunch of background tasks, work queues and long-running workflows/jobs going on.
Like anyone else, we used various application frameworks to build this into our apps. However, as we grew over time we outgrew this disparate approach. We had workflows/tasks that spanned systems, servers and sometimes even data centres.
We needed a tool to co-ordinate this in a more homogeneous fashion, deal with retries and timeouts and, most importantly of all, not cause accidental “denial of service attacks”. These accidental DoS attacks (let's call them ADOS) happen after a system outage (planned downtime, a failure, or otherwise): things suddenly come back up, and then the background tasks flood the system with something like a death ray of requests all at once.
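To make the ADOS problem a bit more concrete, here is a minimal sketch of one common way to soften it: retrying with exponential backoff plus random jitter, so tasks that all failed during the same outage don't all retry at the same instant. This is an illustration only, not how TaskMaster does it. The names `backoff_delay` and `run_with_retries`, and the default values for `base`, `cap` and `max_attempts`, are made up for this example.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomised delay in seconds for a zero-based retry attempt.

    "Full jitter": pick uniformly between 0 and min(cap, base * 2**attempt),
    so tasks that failed at the same moment spread their retries out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def run_with_retries(task, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Sleeps for a jittered, exponentially growing delay between attempts
    and re-raises the last error if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            # Out of attempts: let the caller see the final failure.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))
```

Something like `run_with_retries(lambda: provision_account(account_id))` would wrap a single task, where `provision_account` is a hypothetical function of your own. Jitter only spreads retries out in time; a real co-ordinator would probably also want a cap on how many tasks run concurrently.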
So we looked around and trialled tools like Celery, Octobot, Quartz and lots more (I forget really). All had strengths, and all kind of did what we needed. But we decided to go with a bespoke solution we called “TaskMaster”, which we would make fit our exact needs (this is kind of an example of the “Not Invented Here” syndrome in action; however, we really did do due diligence).
This system is built in Erlang, has an HTTPS/REST interface, and has been very, very reliable. It has improved a lot of things for us, recoverability among them.
Some examples of the tasks we use this for:
- Hibernating unneeded applications
- Cleaning up stray instances
- Ensuring clean deploys/undeploys
- Provisioning accounts
- Provisioning repositories