Background Task Scheduling - How We Do It

Written by: Michael Neale

On this blog I hope we can talk about how some interesting bits work. Some posts will be an approximation (as the detail is boring), or perhaps misinformation (in case the person writing forgets exactly how a system works that they didn't build themselves!). This post is one of these... see also

Any reasonably complicated system has a bunch of background tasks, work queues and long-running workflows/jobs going on.

Like anyone else, we used various application frameworks to build this into our apps. However, as we grew over time we outgrew this disparate approach. We had workflows/tasks that spanned systems, servers and sometimes even data centres.

We needed some tool to co-ordinate this in a more homogeneous fashion, deal with retries and timeouts, and most importantly of all, not cause accidental "denial of service attacks". These accidental DOS attacks (let's call them ADOS) happen after a system outage, whether planned downtime, a failure, or otherwise: things come up again, and the background tasks then flood the system with something like a death ray of requests all at once.


So we looked around and trialled tools like Celery, Octobot, Quartz and lots more (I forget really). All had strengths, and all kind of did what we needed. But we decided to go with a bespoke solution we called "Taskcontroller", which we would make fit our exact needs (this is kind of an example of the "Not Invented Here" syndrome in action - however, we really did do due diligence).

This system is built in Erlang, has an HTTPS/REST interface, and has been very reliable. It has improved things a lot for us, including recoverability.

Examples of the tasks we use this for:

  • Hibernating unneeded applications

  • Cleaning up stray instances

  • Ensuring clean deploys/undeploys

  • Provisioning accounts

  • Provisioning repositories

Some of these are periodic tasks, some are one-shot with limited retries (a periodic task is really a one-shot task that causes itself to be requested again at the end). All work within the concept of a "gate": a gate limits the throughput to a given endpoint. Each task is really a request to hit some endpoint at some future point in time, in accordance with certain rules.
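To make the gate idea concrete, here is a rough sketch in Python (Taskcontroller itself is Erlang, and this is not its code). The names `Task`, `Gate`, `tick` and `retry_delay`, and the "at most N dispatches per time window" rule, are assumptions made for illustration only; the real rules may well differ.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(order=True)
class Task:
    """A request to hit `url` at or after `due` (hypothetical shape)."""
    due: float                                        # only field used for ordering
    url: str = field(compare=False)
    retries_left: int = field(default=3, compare=False)
    period: Optional[float] = field(default=None, compare=False)  # None = one-shot


class Gate:
    """Limits throughput to one endpoint: at most `limit` dispatches per `window` seconds."""

    def __init__(self, limit: int, window: float = 1.0, retry_delay: float = 1.0):
        self.limit = limit
        self.window = window
        self.retry_delay = retry_delay   # assumed fixed delay before a retry
        self.sent: List[float] = []      # timestamps of recent dispatches
        self.queue: List[Task] = []      # heap of pending tasks, earliest due first

    def submit(self, task: Task) -> None:
        heapq.heappush(self.queue, task)

    def tick(self, now: float, call: Callable[[str], bool]) -> List[Task]:
        """Dispatch due tasks until the gate is full; return the tasks dispatched."""
        # Forget dispatches that have fallen out of the current window.
        self.sent = [t for t in self.sent if now - t < self.window]
        dispatched = []
        while (self.queue and self.queue[0].due <= now
               and len(self.sent) < self.limit):
            task = heapq.heappop(self.queue)
            self.sent.append(now)
            dispatched.append(task)
            ok = call(task.url)
            if ok and task.period is not None:
                # A periodic task is a one-shot task that requests itself again.
                self.submit(Task(now + task.period, task.url,
                                 task.retries_left, task.period))
            elif not ok and task.retries_left > 0:
                # Limited retries: try again later with one fewer attempt left.
                self.submit(Task(now + self.retry_delay, task.url,
                                 task.retries_left - 1, task.period))
            # A failed task with no retries left is simply dropped in this sketch.
        return dispatched
```

With `Gate(limit=2, window=1.0)` and a backlog of five tasks that all became due during an outage, calling `tick` at times 0.0, 0.5, 1.0 and 2.0 dispatches 2, 0, 2 and 1 tasks respectively. The backlog drains at the gate's pace rather than arriving at the endpoint all at once, which is the point of the gate.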

So implementing a task is easy - it is a webhook (an HTTPS endpoint that acts on the task). Simple, yet effective. We also have a central place to look where we can see bottlenecks, where things are building up, and a history of what went on.
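As an illustration of what such a webhook might look like, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption: the `/tasks/hibernate` path, the JSON body with an `app_id` field, the `hibernate_app` helper, and the convention that a 2xx response means success. It also serves plain HTTP for brevity, whereas a real endpoint would sit behind HTTPS.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def hibernate_app(app_id):
    # Hypothetical stand-in for the real work the task performs.
    return {"app": app_id, "state": "hibernated"}


def handle_task(path, raw_body):
    """Pure request logic: returns (HTTP status, JSON-serialisable body)."""
    if path != "/tasks/hibernate":          # assumed task path
        return 404, {"error": "unknown task"}
    try:
        payload = json.loads(raw_body or b"{}")
        app_id = payload["app_id"]           # assumed payload field
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON body with app_id"}
    return 200, hibernate_app(app_id)


class TaskHandler(BaseHTTPRequestHandler):
    """The webhook: the scheduler POSTs here when the task is due."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_task(self.path, self.rfile.read(length))
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    # Blocks forever; call serve() from your own entry point to run the webhook.
    HTTPServer(("127.0.0.1", port), TaskHandler).serve_forever()
```

Keeping the logic in `handle_task` separate from the HTTP plumbing means the task can be exercised without starting a server, and the scheduler only needs to know a URL.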

The not invented here syndrome

This syndrome (NIH) is well documented, and is often on developers' minds. There is always a temptation to reinvent (it is fun!), but that is not to say that all re-invention is bad: there are times when it makes sense. This great article by my friend Dhanji sums this dichotomy up nicely, I think.

"Why then does this NIH attitude proliferate throughout software development companies, both new and experienced, young and old? I believe there are a couple of reasons."
