Background Task Scheduling - How We Do It

On this blog I hope we can talk about how some interesting bits work. Some posts will be an approximation (as the detail is boring) or perhaps misinformation (in case the person writing the post forgets exactly how a system works that they didn’t themselves build!). This post is one of these... see also 

Any reasonably complicated system has a bunch of background tasks, work queues and long-running workflows/jobs going on.

Like anyone else, we used various application frameworks to build this into our apps. However, as we grew over time we outgrew this disparate approach. We had workflows/tasks that spanned systems, servers and sometimes even data centres.

We needed some tool to co-ordinate this in a more homogeneous fashion, deal with retries and timeouts, and most importantly of all, not cause accidental “denial of service attacks”. These accidental DoS attacks (let’s call them ADOS) happen after a system outage (planned downtime, a failure, or otherwise): things suddenly come up again, and the background tasks flood the system with something like a death ray of requests all at once.

TaskMaster!

So we looked around and trialled tools like Celery, Octobot, Quartz and lots more (I forget really). All had strengths, and all kind of did what we needed. But we decided to go with a bespoke solution we called “TaskMaster”, which we would make fit our exact needs (this is kind of an example of the “Not Invented Here” syndrome in action - however, we really did do due diligence).

This system is built in Erlang, has an HTTPS/REST interface, and has been very reliable. It has improved things a lot for us, including recoverability.

[Figure: TaskMaster]

[Figure: TaskMaster deploy]

[Figure: TaskMaster scheduling]

Examples of the types of tasks we use this for:

  • hibernating unneeded applications
  • cleaning up stray instances
  • ensuring clean deploys/undeploys
  • provisioning accounts
  • provisioning repositories

Some of these are periodic tasks and some are one-shot, with limited retries (a periodic task is really a one-shot task that causes itself to be requested again at the end). All of them work within the concept of a “gate”, which limits the throughput to a given endpoint. Each task is really a request to hit some endpoint at some future point in time, in accordance with certain rules.
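To make the gate idea a bit more concrete, here is a minimal sketch in Python. It is not the real TaskMaster code (that is Erlang, and the details differ). The names Gate, Scheduler, min_interval and retry_delay are made up for illustration, time is simulated rather than real, and a gate is assumed to mean "at most one request per interval to an endpoint".

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(order=True)
class Task:
    due: float                                    # earliest time the task may run
    seq: int                                      # tie-breaker for equal due times
    endpoint: str = field(compare=False)
    retries_left: int = field(compare=False)
    period: Optional[float] = field(compare=False)  # None means one-shot


class Gate:
    """Hypothetical gate: at most one request per min_interval seconds."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self.next_free = 0.0

    def reserve(self, now: float) -> float:
        """Return the earliest time a request may pass, and book that slot."""
        slot = max(now, self.next_free)
        self.next_free = slot + self.min_interval
        return slot


class Scheduler:
    def __init__(self, gates: Dict[str, Gate], call: Callable[[str], bool],
                 retry_delay: float = 1.0) -> None:
        self.gates = gates            # endpoint -> gate
        self.call = call              # stands in for the HTTP request; True = success
        self.retry_delay = retry_delay
        self.queue: List[Task] = []
        self.seq = 0
        self.history: List[Tuple[float, str, bool]] = []  # (sent_at, endpoint, ok)

    def submit(self, endpoint: str, due: float, retries: int = 2,
               period: Optional[float] = None) -> None:
        self.seq += 1
        heapq.heappush(self.queue, Task(due, self.seq, endpoint, retries, period))

    def run_until(self, end: float) -> None:
        """Process every task that becomes due at or before `end` (simulated time)."""
        while self.queue and self.queue[0].due <= end:
            task = heapq.heappop(self.queue)
            # The gate may push the request later than its due time.
            sent_at = self.gates[task.endpoint].reserve(task.due)
            ok = self.call(task.endpoint)
            self.history.append((sent_at, task.endpoint, ok))
            if not ok and task.retries_left > 0:
                # Limited retries: try again later with a smaller budget.
                self.submit(task.endpoint, sent_at + self.retry_delay,
                            task.retries_left - 1, task.period)
            elif ok and task.period is not None:
                # A periodic task is a one-shot task that requests itself again.
                self.submit(task.endpoint, sent_at + task.period,
                            task.retries_left, task.period)


# Five tasks pile up during an "outage" and all become due at time 0.
sched = Scheduler({"/provision": Gate(min_interval=2.0)}, call=lambda ep: True)
for _ in range(5):
    sched.submit("/provision", due=0.0)
sched.run_until(10.0)
print([sent_at for sent_at, _, _ in sched.history])  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

The backlog goes out one request every two seconds instead of all at once, which is the ADOS protection in miniature. In this sketch a task that runs out of retries is simply dropped, and a periodic task keeps whatever retry budget it has left. Both are simplifications for the example, not a statement of how TaskMaster behaves.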
 
So implementing a task is easy - it is a webhook (an HTTPS endpoint that acts on the task)! Simple, yet effective. We also have a central place to look where we can see bottlenecks, where things are building up, and a history of what went on. 
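For a feel of what "a task is a webhook" can look like on the receiving side, here is a small sketch using only the Python standard library. Everything specific in it is an assumption for illustration: the JSON payload with a "name" field, the response bodies, and the TaskHook and make_server names are all hypothetical. It also serves plain HTTP on localhost to stay short, where a real endpoint would sit behind HTTPS.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

handled = []  # names of tasks this service has acted on (demo only)


class TaskHook(BaseHTTPRequestHandler):
    """Hypothetical webhook: the scheduler POSTs a task, we act on it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            task = json.loads(self.rfile.read(length) or b"{}")
            name = task["name"]  # assumed payload shape: {"name": "..."}
        except (ValueError, KeyError, TypeError):
            # A non-2xx status tells the caller the task did not succeed.
            self._reply(400, {"error": "bad task payload"})
            return
        handled.append(name)  # the real work would happen here
        self._reply(200, {"done": name})

    def _reply(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), TaskHook)
```

To try it, run make_server(8080).serve_forever() and POST a small JSON body to it. The point is that the service owning the work only needs to expose an endpoint, while retries, timing and throttling stay with the scheduler.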

The not invented here syndrome

This syndrome (NIH) is well documented, and something that is often on developers’ minds. There is always a temptation to reinvent (it is fun!), but that is not to say that all reinvention is bad - there are times when it makes sense. This great article by my friend Dhanji sums this dichotomy up nicely, I think.
 
“Why then does this NIH attitude proliferate throughout software development companies, both new and experienced, young and old? I believe there are a couple of reasons.” more.