On this blog I hope we can talk about how some interesting bits work - some of it will be an approximation (as the detail is boring) or perhaps misinformation (in case the person writing the blog forgets exactly how a system works that they didn’t themselves build!). This post is one of these...
At the core of the CloudBees platform is the agent - sadly he doesn’t look anything like Hugo Weaving (that is Agent Smith), but more like a whole lot of Erlang modules (each module is really a self-contained program).
When a server is provisioned, it is added to a pool (which pool depends on the server’s purpose - the largest pools of course contain the application servers). The server is also provided with a special configuration file, “metadata.json”, which essentially provides the environment settings - and also tells the server what role it will take on. The file could tell it that it is a “genapp server” (i.e. one of the main types of servers that make up applications) or perhaps an “autoscale” service (an internal service). Generally, the servers are not terribly different up until the point that this metadata is provided - and then the server continues starting itself up.
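To make the role idea concrete, here is a minimal sketch of reading a metadata file and dispatching on the role it names. The field names ("role", "env", "environment") and the handler functions are assumptions for illustration - the real metadata.json layout isn’t shown in this post, and the real agent is Erlang, not Python.

```python
import json

# Hypothetical start-up routines, one per role. The role names come from the
# post; everything else here is made up for illustration.
def start_genapp(env):
    return f"genapp server ready in {env.get('environment', 'unknown')}"

def start_autoscale(env):
    return f"autoscale service ready in {env.get('environment', 'unknown')}"

ROLE_HANDLERS = {"genapp": start_genapp, "autoscale": start_autoscale}

def boot_from_metadata(raw_json):
    """Parse the metadata document and continue start-up for its role."""
    metadata = json.loads(raw_json)
    role = metadata["role"]  # assumed field name
    handler = ROLE_HANDLERS.get(role)
    if handler is None:
        raise ValueError(f"unknown role: {role!r}")
    return handler(metadata.get("env", {}))  # assumed field name
```

The point is only that an otherwise generic server image becomes a specific kind of server once the metadata arrives.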
One of the first things the server does is connect to the broker (mentioned previously) and register itself - this means it can then commence receiving messages (create an app, create a database etc). In the case of the application servers - that means it is ready to be allocated applications as needed. In the case of the autoscale system above - it will receive stat data and then make calculations/recommendations which are then published out via the message bus (in that sense it is a service - however it talks only via a message bus, not the web).
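As a rough illustration of “a service that only talks via the message bus”, here is a toy in-memory broker with an autoscale-style subscriber. The topic names, message fields and the 0.8 load threshold are all assumptions - the real broker and message formats aren’t described here.

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy stand-in for the message broker: topics map to lists of callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Copy the list so callbacks may register new subscribers safely.
        for callback in list(self._subscribers[topic]):
            callback(message)

def make_autoscale_service(broker, high_load=0.8):
    """Subscribe to stats and publish a recommendation when load is high."""
    def on_stats(stats):
        if stats["load"] > high_load:
            broker.publish(
                "recommendations",
                {"app": stats["app"], "action": "scale_up"},
            )
    broker.register("stats", on_stats)

broker = InMemoryBroker()
received = []
broker.register("recommendations", received.append)
make_autoscale_service(broker)
broker.publish("stats", {"app": "demo", "load": 0.95})  # triggers a recommendation
broker.publish("stats", {"app": "demo", "load": 0.10})  # quiet, nothing published
```

Note the autoscale service never calls anyone directly - it consumes stats and publishes recommendations, and whoever is registered for those gets them.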
Each agent module may do other preparation in setup - but generally the server provisioning is simple and quick (each server is based on an image with most of what it needs already installed on it - including the agent itself).
The agent also checks that it itself is up to date.
So why do we call this “the agent” (actually “the stax agent” or “stax-agent”)? Autonomy. Each agent has almost complete control (within its parameters) of its server - including tracking the server’s health.
This pattern comes up again and again: the agent receives a message - that message is an instruction - but really it is more like a recommendation to take an action - the agent may take action, or it may ignore it.
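A tiny sketch of “instruction as recommendation”: the agent below checks its own state before acting and simply declines when it doesn’t make sense. The message shape and the capacity rule are assumptions picked for illustration, not how the stax agent actually decides.

```python
class Agent:
    """Toy agent that treats incoming messages as recommendations."""

    def __init__(self, max_apps):
        self.max_apps = max_apps  # hypothetical local limit
        self.apps = []

    def handle(self, message):
        """Return True if the agent acted on the message, False if it declined."""
        if message.get("action") != "create_app":
            return False  # not something this agent knows how to do
        if len(self.apps) >= self.max_apps:
            return False  # server is full, so ignore the recommendation
        self.apps.append(message["name"])
        return True
```

The sender doesn’t get to override the agent - the decision is made on the server, with the server’s own view of its health and capacity.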
When things go wrong - the agent can “panic” - this brings it to our attention. This usually means either 1) the agent isn’t able to take care of the system any more, 2) a bug - this never happens - who would write bugs? or 3) the “heartbeat has stopped”. Heartbeats are a simple and effective form of monitoring - part of the agent’s setup (based on the metadata.json) is to register itself with the monitor - the monitor - on seeing a heartbeat - then latches on to it. Once it has latched on - it is not unlike a cardiac monitor - it expects a regular heartbeat - and if the heartbeat fails (for whatever reason - it doesn’t care) - page the doctors! Call the crash cart!
This heartbeat failure is simply a lack of heartbeat. Should we want it to go away permanently - we say “bye” to the latch. As to why it fails - the agent could have a problem, the server may have terminated/be unavailable - or the network may have an issue - we don’t really care too much. What we do care about is either restarting the heartbeat or replacing the server (this is where the analogy of a crash cart breaks down!) as fast as possible. These heartbeats serve as a handy early warning system (we can pre-empt issues) - but in reality - they are a last resort. Redundancy and other monitoring systems generally have done a fair bit of work before a “panic”.
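The latch idea fits in a few lines. This is a sketch only: the class and method names are made up, the timeout is arbitrary, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Latches on to an agent at its first heartbeat and flags silence."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence tolerated (assumed unit)
        self.last_seen = {}     # agent id -> time of most recent heartbeat

    def beat(self, agent_id, now):
        # The first beat creates the latch; later beats refresh it.
        self.last_seen[agent_id] = now

    def bye(self, agent_id):
        # Drop the latch on purpose, so silence no longer raises an alarm.
        self.last_seen.pop(agent_id, None)

    def overdue(self, now):
        """Agents whose heartbeat is missing. The monitor does not care why."""
        return sorted(
            agent_id
            for agent_id, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Notice there is no “failure” message anywhere - the alarm is purely the absence of a beat, which is why it catches agent, server and network problems alike.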
When we start a server, and the agent is initialised - what is really running is an Erlang process which boots up the main Erlang supervisor process. A typical Erlang application is made up of many processes, and should one die - it is the job of the supervisor to restart it and keep it running (it is actually a supervision tree - some processes watch other processes and so on). The key here is that the application setup and supervision code is very light and minimal - there is less to go wrong the “higher up” the supervision tree you are.
We actually use the e2 framework here - it offers some nice, simple ways to implement common application patterns - you can read more about it in the e2 documentation.
Note that an Erlang process is an internal concept - the actual applications and services are run as separate OS processes, of course - for that we use the “runit” supervision utility (that has been through various iterations, but it is currently the tool of choice).
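To give a feel for the restart idea (and only the idea - this is a synchronous Python sketch, not Erlang processes or the OTP supervisor), here is a loop that reruns a failing worker up to a limit and then gives up so that something higher up can deal with it. The function names and the restart limit are assumptions.

```python
class RestartLimitExceeded(Exception):
    """Raised when the child keeps failing, so a parent can take over."""

def supervise(child, max_restarts=3):
    """Run child(); if it raises, restart it, up to max_restarts times.

    Returns (result, number_of_restarts_needed).
    """
    restarts = 0
    while True:
        try:
            return child(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise RestartLimitExceeded(
                    f"gave up after {restarts} restarts"
                )
            restarts += 1

# A worker that crashes on its first two runs and then succeeds.
attempts = {"count": 0}

def flaky_worker():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("worker crashed")
    return "ok"

result, restarts = supervise(flaky_worker)
```

The supervising code stays tiny and dumb - all it knows is “run it again” - which is the same reason the top of a supervision tree has so little that can go wrong.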
Genapp and ClickStacks
You may have heard recently about ClickStacks - these are essentially plugins that talk to the agent via a library we internally call “genapp” - I hope there will be more of this code to share with you shortly. You can see the actual ClickStacks in the CloudBees GitHub group. I also have some experimental ones on my GitHub account. You can think of genapp as the plugin contract/interface used when stacks are “plugged in” to the agent at runtime. There will be more on this in future posts.
I hope this has shed just a tiny bit of light.