Monitoring with Bosun

Written by: Barry Jones

Bosun is a monitoring and alerting system developed by the good folks at Stack Exchange, then open sourced for the rest of us. It's written in Go, meaning its monitoring agents can run anywhere that Go can drop a binary… which is just about everywhere. So what exactly does it do and how does it compare to the likes of New Relic, CloudWatch, Nagios, Splunk Cloud, Server Density, and other monitoring tools?

What Is Bosun?

At its simplest, Bosun is a system for receiving measurements from many different sources, organizing and tracking them, and triggering alerts based on changes over time.

The system uses an agent written in Go called scollector to send these metrics to the receiving host. The metrics available are limited only by what scollector can send, not by any limitation in Bosun itself. That means there are ample opportunities to expand on the available data: numerous external collectors are already available, and writing custom collectors via your own shell scripts isn't particularly complicated (we'll write one in section 5).

Out of the gate, this setup and structure is already very appealing. It's free, deployable virtually anywhere, its agents are compilation-free drop-in binaries, and it's heavily customizable. That's a solid formula.

It also leaves this article very open-ended unless we narrow the scope. With that in mind, this overview and comparison will focus on out-of-the-box functionality.

What's In the Box?

Baked right into scollector is the ability to monitor CPU, disk, network, memory, and processes on Windows, Mac, and Linux. From there, it adds some more OS-specific features like .NET processes, AppDomains, Active Directory, and SQL Server on Windows, or Yum on Linux. VMware, ICMP (ping), SNMP, and AWS EC2 and ELB monitors are also built in. There's additional support included for Apache, Cassandra, Chef, CloudFlare, Elasticsearch, HAProxy, Puppet, and Redis, just to name a few. These, of course, can be expanded on.

How Do I Use It?

First, we’ll set up a server to receive information from the various remote monitors that we might be using. That server will also act as our web host to view, track, and alert on these values. Next, we'll set up a remote agent to send information about another machine back to our main monitoring server. Finally, we’ll look at how to set up an alert, some of the advanced metrics that you can track, and how to collect custom information that the agent might not already know about.

1. Set up a server to collect information

To get started, you'll need to set up a Bosun Monitor server/container to receive and aggregate information.

You can try this out quickly with Bosun's Docker setup, although Stack Exchange itself says they DO NOT use this setup in production. They don't say that you shouldn't, only that they do not. At the moment, I'd tend to agree with that advice: I left the Docker-based monitor running against three to four servers over the course of a few days and eventually ended up with a hung monitoring server. The Bosun GitHub repo does have code to relay the data in case the monitor is down, though. For a detailed production setup, check out this blog post.

To run the Docker version to try things out, just pull up a publicly accessible server/VM, make sure Docker is installed, and then...

docker run -d -p 4242:4242 -p 8070:8070 stackexchange/bosun

That will install the monitor with all of its dependencies, including OpenTSDB and HBase; the web interface listens on port 8070 and OpenTSDB on port 4242.
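Once the container is up, you can sanity-check that the web interface is answering before moving on (a minimal sketch using curl from the Docker host; expect a 200 back):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8070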

From here, you can grab your VM's IP address and open up a browser to IP:8070 to see the basic web interface, including hosts that it's receiving information from on the right-hand side. Currently, it should be collecting only from itself, but you can flip through the metrics to get an idea of the level of granularity and detail you're able to get from the system. You might need to wait a few minutes for it to gather enough information to display.

I found that I needed a 1GB VM for the Docker image to run properly, BUT that's running the monitor, HBase, and OpenTSDB all on one machine. In a production setup, you're probably looking at just the monitor itself storing the data in an outside OpenTSDB host, just as you'd expect with any other production database setup.

2. Set up clients to send information

Now that you've got your monitoring server set up, hop over to another machine/VM to set up scollector, a portable Go binary. Download the correct build for your platform, pull up a console, and run...

scollector -h <BOSUN MONITOR IP>:8070

And that's it. The agent will immediately begin collecting and sending in all of the out-of-the-box information it can for that host.
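On a Linux host, the whole process can look something like this (a minimal sketch, assuming an amd64 machine; the download URL pattern and version are illustrative, so grab the actual binary from the Bosun releases page):

# download the prebuilt agent binary (URL and version are placeholders)
wget -O scollector https://github.com/bosun-monitor/bosun/releases/download/<VERSION>/scollector-linux-amd64
chmod +x scollector
# point it at the monitoring server set up in step 1
./scollector -h <BOSUN MONITOR IP>:8070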

3. Create an alert

I'm not going to duplicate Bosun's own alert creation quickstart, which provides a great example of how to set up alerts on different metrics. Alerts boil down to selecting a metric and a host (or hosts), using an expression to monitor the metric, creating a rule on that expression to trigger an alert, and then sending the alert.

This is where you start to see Bosun's strongest point: the expressions. Expressions are a language of their own, essentially map/reduce-style aggregation calls, including average, change, count, standard deviation (very helpful), median, etc. A simple example from the quickstart shows a readable...

avg(q("sum:rate{counter,,1}:os.cpu{host=your-system-here}", "1h", ""))

We can see that it's taking the average (avg) of the result of querying the past hour ("1h") of the os.cpu metric for the specified host (host=your-system-here).

Rules are equally clear...

alert cpu.is.too.high {
    template = test
    $metric = q("sum:rate{counter,,1}:os.cpu{host=your-system-here}", "1h", "")
    $avgcpu = avg($metric)
    crit = $avgcpu > 80
    warn = $avgcpu > 60
}

A warning will trigger when the average exceeds 60 percent, and a critical status will trigger once it exceeds 80 percent. The alert will use the template 'test', which uses Bosun's own template language and can be customized to your needs.

template test {
    subject = {{.Last.Status}}: {{.Alert.Name}} on {{.Group.host}}
    body = `<p>Alert: {{.Alert.Name}} triggered on {{.Group.host}}
    <hr>
    <p><strong>Computation</strong>
    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>
    <hr>
    {{ .Graph .Alert.Vars.metric }}
    <hr>
    <p><strong>Relevant Tags</strong>
    <table>
        {{range $k, $v := .Group}}
            <tr><td>{{$k}}</td><td>{{$v}}</td></tr>
        {{end}}
    </table>`
}

In order to send those alerts, Bosun requires an SMTP server. Currently that is the only means of sending alerts, although a number of incoming mail handling systems from providers (Sendgrid, Mailgun, Cloudmailin, etc.) make it fairly trivial to direct those emails to an HTTP endpoint to trigger other actions. In that circumstance, you could set up the templates to send JSON or XML data.
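For reference, wiring an alert up to email looks roughly like this in the Bosun config (a minimal sketch; the hostnames, addresses, and notification name are assumptions for illustration):

# global SMTP settings
smtpHost = smtp.example.com:25
emailFrom = bosun@example.com

notification ops_email {
    # deliver rendered alert templates to this address
    email = ops@example.com
}

The alert from earlier would then reference it by adding critNotification = ops_email (and a warnNotification, if you want warnings mailed as well).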

4. Track advanced metrics

Bosun provides a lot of advanced usage examples that cover things like:

  • Tracking sessions on each HAProxy frontend:

$current_sessions = max(q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", ""))
$session_limit = max(q("sum:haproxy.frontend.slim{host=*,pxname=*,tier=*}", "5m", ""))
$q = ($current_sessions / $session_limit) * 100
  • Alerting on forecasted future disk space:

$days_to_zero = (forecastlr(q("avg:6h-avg:os.disk.fs.percent_free{$filter}", "7d", ""), 0) / 60 / 60 / 24)
  • Using macros to avoid duplicating alert definitions:

macro host_based {
    warnNotification = lookup("host_base_contact", "main_contact")
    critNotification = lookup("host_base_contact", "main_contact")
    warnNotification = lookup("host_base_contact", "chat_contact")
    critNotification = lookup("host_base_contact", "chat_contact")
}
  • Alerting on metrics that deviate from the norm without having to hard-code a threshold:

$history = band($metric, $duration, $period, $lookback)
$past_dev = dev($history)
$past_median = percentile($history, .5)
$current_median = percentile(q($metric, $duration, ""), .5)
$diff = $current_median - $past_median
warn = $current_median > ($past_median + $past_dev*2) && abs($diff) > 10 && $hit_percent > 1
  • Conditional alerts, for when an alert is expected in certain circumstances, such as not alerting on swap usage while the mail queue is high:

$mail_q = nv(max(q("sum:exim.mailq_count{host=*}", "2h", "") > 5000), 1)
$metric = "sum:rate{counter,,1}:linux.mem.pswp{host=*,direction=in}"
$q = (median(q($metric, "2h", "")) > 1) && ! $mail_q

5. Track your own metrics with external collectors

Scollector can also send in information gathered from virtually any type of script to be aggregated, monitored, and acted on by the Bosun Monitor. These scripts are called external collectors. Their data can be emitted in one of two simple formats: simple data output and JSON data.

Simple data output format

Just use the standard output stream like so:

// metric timestamp value tag-key=tag-value tag-key=tag-value tag-key=tag-value
twitter.tweet_count 1441406996 0 query=stackoverflow-down
twitter.follower_count 1441406996 1337 account=stackoverflow

You'll notice no unit of measure is specified with the metric; it's just a number as far as Bosun is concerned. Units can be added to graphs or expressions later.
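An external collector can be as simple as a shell script dropped into scollector's external collector directory; scollector runs it and forwards whatever it prints to stdout. Here's a minimal sketch (the metric name is a made-up example):

#!/bin/sh
# hypothetical collector: report how many users are logged in to this box
# output format: metric timestamp value tag-key=tag-value
echo "site.logged_in_users $(date +%s) $(who | wc -l | tr -d ' ') host=$(hostname)"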

JSON data format

This format lets you include additional data with your metrics; each line is just a serialized instance of the opentsdb.DataPoint or metadata.Metasend struct.

{"metric":"exceptional.exceptions.count","timestamp":1438788720,"value":5,"tags":{"application":"Careers","machine":"ny-web03","source":"NY_Status"}}
{"metric":"exceptional.exceptions.count","timestamp":1438788720,"value":0,"tags":{"application":"AdServer","machine":"ny-web03","source":"NY_Status"}}

How Does It Compare?

We've now got a monitoring server that can receive any metric we want, collect those metrics from anywhere we choose in a simple format, apply a bevy of reduction tools to any of those measures, and trigger multilevel alerts from a custom template to anywhere we choose. And it's free, except for the price of the server(s) it sits on. That's a pretty solid offering.

Here's a general overview of how it compares to some other offerings.

New Relic

New Relic makes its living with application monitoring, although it does include some server monitoring.

The application monitors are built to be integrated into the running apps so that they can gather and report stack-level information, identify bottlenecks, and alert you to problems. This is a highly sophisticated system with pricing that varies from free to $149/month per host and higher. It generally earns its price, as it provides invaluable information regardless of where your application is hosted.

Bosun certainly does not replace that. Even if you wrote your own application-level collector to output data in a way that Bosun could consume, you still wouldn't quite have the trace-level details that go along with the New Relic data. New Relic is a great, extensible service that is difficult to imagine replacing.

The edge that Bosun has on New Relic is that you get to determine your own data retention with unlimited hosts, its alerting system is tremendously more customizable and flexible, and New Relic compatibility doesn't become a restriction on your monitoring system. Many people I know end up turning off New Relic's alerts, especially pager integration, because they often turn out to be false alarms, self-resolving issues, or expected behavior, with no way of telling that to New Relic.

CloudWatch

CloudWatch provides monitoring and alerting on the metrics AWS already tracks for each type of AWS resource.

The alerting system is fairly flexible, although not as much so as Bosun's. CloudWatch is a great system for monitoring AWS resources but little beyond that: anything AWS isn't already tracking has to be pushed in yourself as a custom metric. If you're using AWS, CloudWatch is a piece of your monitoring infrastructure, but not the whole.

Nagios

Nagios is the comprehensive grandfather of monitoring systems. It monitors everything, it's been doing it for years, and it has a fairly extensive ecosystem of plugins and solutions around it, covering everything from UI customization to monitors. It's got open source and commercial solutions for just about everything.

It's intimidating just to look at, and that's really where Bosun stands apart: Bosun is tremendously simpler than Nagios. That's one part design and one part lack of age. If I were to compare the two tools at a simple level, I'd consider Nagios the IT admin solution and Bosun the DevOps solution.

Splunk Cloud

People who have Splunk love Splunk. Splunk is a very comprehensive solution with a formidable price tag. I know many people in the IT world who swear by everything Splunk has to offer as a monitoring solution; however, the price tag has prevented me from ever having an opportunity to thoroughly evaluate it. As such, I can't provide much of a comparison.

Server Density

In the "As A Service" world, Server Density probably provides the best comparison to Bosun. As a host, it will consume literally anything, including Nagios data.

You can write your own plugins for the agent. Metrics can be piped in from an assortment of different cloud providers, so there's no monitoring lock-in associated with it. The web interface gives you a complete metric picture of what's going on with a server at any given time. There are even iPhone and Android apps. It also has fairly customizable alerts based on the variety of metrics you might pass in.

As for pricing, it's much more comparable to cloud offerings than the other paid players I’ve mentioned.

Conclusion

Bosun's a great option in the open source monitoring space and has a lot of room to grow. As an ecosystem expands around it, it will certainly become a better and better option. Right now, stability of the monitor seems to be a known concern, but as soon as the production monitor setup is simplified, I would expect it to start gaining serious traction.
