CloudBees Accelerator Agent Metrics, part 1

Written by: Electric Bee

Welcome to the CloudBees Blog! My name is Eric Melski, and I'm a Senior Software Engineer at CloudBees (employee #1, by a whisker). I've worked on every component of CloudBees Accelerator, and I'm now Technical Lead for the product. I'm also the guy behind ElectricInsight. In this and future posts, I will take you on deep technical dives into Accelerator and Insight, with a focus on performance, scalability and analysis.
Understanding the performance of parallel, distributed systems can be difficult. Fortunately, Accelerator provides a wealth of data to facilitate performance analysis. In this first installment, continuing the material from my presentation at the 2008 Customer Summit, I'll show you how to collect agent performance metrics, and how to use them to get a quick overview of the performance characteristics of your build.

Collecting Agent Metrics

Every agent in the cluster automatically tracks performance metrics for every build that it participates in, so you don't have to do anything special to enable them. All you need to do is download them from the agents after your build has finished. Note that by default, the agent keeps performance data for the previous 100 builds that it participated in, so you don't have to grab the metrics immediately after your build completes.
The easiest way to get the metrics from the cluster is via cmtool:

cmtool --cm=... runAgentCmd "session performance buildId" > buildId.agentraw

This will write a dump of performance metrics from each agent that participated in the specified build to the file named buildId.agentraw. Next, you'll want to use the (unsupported) agentsummary utility to produce an aggregate summary of the metrics across all agents:

tclsh agentsummary buildId.agentraw > buildId.agentsum

Now we have something that we can dig into.
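If you collect these metrics regularly, it may be convenient to wrap both steps in a small script. Here's a minimal sketch; the script name is my own invention, and it assumes cmtool is on your PATH and that the agentsummary script is in the current directory:

#!/bin/sh
# fetch-agent-metrics.sh -- download and summarize agent metrics for one build.
# Hypothetical helper; assumes cmtool is on PATH and agentsummary is in the
# current directory.
if [ $# -ne 2 ]; then
    echo "usage: $0 <cm-host> <buildId>" >&2
    exit 1
fi
CM=$1
BUILD=$2

# Dump the raw per-agent performance data for the specified build
cmtool --cm="$CM" runAgentCmd "session performance $BUILD" > "$BUILD.agentraw"

# Aggregate the raw metrics across all agents that participated
tclsh agentsummary "$BUILD.agentraw" > "$BUILD.agentsum"

echo "Wrote $BUILD.agentsum"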

Overall time usage

The first thing to look at in the agent performance metrics is the "Overall time usage" section. This divides the time used by the agents into several coarse-grained buckets:

Overall time usage:
  Startup:           4.65s ( 0.0)       20 intvls, avg   232.5ms
  Cmd setup:       253.15s ( 0.4)    42281 intvls, avg     6.0ms
  Command:       36479.41s (61.7)    82823 intvls, avg   440.5ms
  Emake request:  7732.26s (13.1)   763879 intvls, avg    10.1ms
  A2A request:    1968.67s ( 3.3)    52112 intvls, avg    37.8ms
  Command end:     448.59s ( 0.8)    82823 intvls, avg     5.4ms
  Return:         4874.27s ( 8.2)    42281 intvls, avg   115.3ms
  Idle:           6726.62s (11.4)    42261 intvls, avg   159.2ms
  End:             671.13s ( 1.1)

These timers correspond to the following activities:

Startup: Waiting for the first commands from emake after the connection is established.
Cmd setup: Preparing the execution context (environment, working directory, etc.) for a build command.
Ver updates: Processing file version updates from emake.
Command: Running build commands (compiles, links, etc.).
Emake request: Waiting for emake to service a request for file data or metadata.
A2A request: Waiting for file data transferred from other agents via P2P.
Command end: Waiting for the instructions from emake after finishing execution of a build command.
Return: Reporting file usage to emake, including the contents of any files created.
Idle: Waiting for emake to provide more commands to run.
End: A special case of Idle time, this is the amount of time between the last build command and the end of the agent's participation in the build.

For each timer, the metrics show:

  • The total time spent in the timer across all agents;

  • The portion of time spent in the timer as a percentage of the total time across all agents;

  • The number of times the timer was active;

  • The average amount of time the timer was active each time it was in use.
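For example, in the sample above the Command timer accumulated 36479.41 seconds across 82823 intervals, or about 440.5ms each time it ran; that 36479.41s works out to 61.7% of the roughly 59,159 agent-seconds accounted for by all of the timers combined.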

Note that not all timers are available in all versions of Accelerator.
Of these timers, Command is perhaps the most interesting, because it represents the actual work performed during your build. The other timers all correspond to inefficiency of some sort, whether due to the overhead introduced by running a distributed build across the network, or due to serializations in the architecture of the build itself.
One interesting factoid about the Command timer is that you can use its percentage to estimate the X factor, or speedup relative to a serial build. Simply multiply the Command percentage by the number of agents that participated in your build. For example, if the Command timer represents 65% of the total time, and your build used 8 agents, then the X factor is about 5.2x:

0.65 * 8 == 5.2x better than serial

Two caveats about this value: first, it's only an estimate. In my experience it seems to be reasonably accurate in the back-of-the-envelope sense, but it's by no means perfect. Second, it only really works if the agents are running on hardware comparable to the system you would have used for a serial build. If the agents are slower or faster than the baseline host, that will skew the number correspondingly.
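You can pull that number straight out of the summary file rather than eyeballing it. Here's a rough sketch using awk; the script name is my own, and it assumes the Command line in your .agentsum file looks like the sample output above, with the percentage in the parenthesized field:

# estimate-xfactor.sh -- estimate the X factor from an agentsummary file.
# Hypothetical helper; usage: estimate-xfactor.sh <buildId.agentsum> <number-of-agents>
FILE=$1
AGENTS=$2

awk -v agents="$AGENTS" '
    # Match the Command line and pull out the parenthesized percentage, e.g. "(61.7)"
    /^ *Command:/ && match($0, /\( *[0-9.]+\)/) {
        pct = substr($0, RSTART + 1, RLENGTH - 2) + 0
        printf "Command: %.1f%% of total time => estimated X factor: %.1fx\n", pct, pct / 100 * agents
    }
' "$FILE"

For instance, running it against the sample summary above and plugging in a hypothetical 8-agent cluster would report roughly 0.617 * 8 = 4.9x.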

Are my values "good"?

Of course the whole point of looking at these timers is to determine if your build is performing well, or if there is room for improvement. Therefore, we need to establish some guidelines for evaluating the values in these timers. To that end, here are some very rough guides:

Timer            Good     Acceptable   Warning!
Command          > 60%    50% - 60%    < 50%
Emake request    < 10%    10% - 15%    > 15%
Idle/End         < 10%    10% - 20%    > 20%
Return           < 10%    10% - 20%    > 20%

Remember, Command time represents all the time spent doing the actual work in your build. Ideally, we'd like to see that at 100%, but of course that target is unattainable -- if nothing else, there is always overhead introduced simply because we have to move data across the network. Still, it's not uncommon to see builds with Command time representing 80% or more of the total time. Anything over 60% is pretty good, really. Less than 50% means there's probably room for improvement. That's not a guarantee that you can do better, of course. It may simply be the nature of the beast, or you may find that the cost of improving the performance outweighs the benefit. If you do see a low percentage for Command time, the thing to do is look at the other timers to determine where the time is going instead. That will in turn point you toward the root cause of the problem.
For example, if the Emake request timer is high (greater than 15%), that indicates a problem with the communication between emake and the agents. The things to check next in this case are:

  • Network bandwidth between emake and the agents.

  • System load on the emake host.

  • Disk performance on the emake host.

  • Number of agents participating in the build (too many agents can cause emake to thrash).

Another possibility is that the Idle and End timers are high (more than 20% combined). This usually indicates that your build has significant serializations, or that it had a large number of conflicts. Use ElectricInsight to investigate further.
Finally, if the Return timer is high (greater than 20%), you'll want to check:

  • Network bandwidth between emake and the agents.

  • Disk performance on the emake host.

  • Disk performance on the agent host.

It's important to note that in general, any one metric will not exactly pinpoint a specific problem; rather, each metric should be thought of as a diagnostic tracer. The more tracers you combine, the more precise your diagnosis will be.
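If you check these numbers for many builds, it can help to automate the rules of thumb. Here's a rough sketch that scans a .agentsum file and flags any timer in the warning zone; the thresholds come from the table above, and the parsing assumes the "Overall time usage" lines look like the sample output shown earlier, so treat it as a starting point rather than a supported tool:

#!/bin/sh
# check-agent-metrics.sh -- flag warning-level timers in an agentsummary file.
# Hypothetical helper; usage: check-agent-metrics.sh <buildId.agentsum>
awk '
    # Pull the percentage out of the parenthesized field, e.g. "(61.7)" or "( 3.3)"
    function pct() {
        if (match($0, /\( *[0-9.]+\)/))
            return substr($0, RSTART + 1, RLENGTH - 2) + 0
        return 0
    }
    /^ *Command:/       { command = pct(); seen = 1 }
    /^ *Emake request:/ { emake   = pct() }
    /^ *Return:/        { ret     = pct() }
    /^ *Idle:/          { idle    = pct() }
    /^ *End:/           { end     = pct() }
    END {
        if (seen && command < 50) print "WARNING: Command time is only " command "% (good is > 60%)"
        if (emake > 15)           print "WARNING: Emake request time is " emake "% (good is < 10%)"
        if (idle + end > 20)      print "WARNING: Idle/End time is " (idle + end) "% (good is < 10%)"
        if (ret > 20)             print "WARNING: Return time is " ret "% (good is < 10%)"
    }
' "$1"

Running it against the sample summary above would flag nothing: Command at 61.7%, Emake request at 13.1%, Idle plus End at 12.5%, and Return at 8.2% all fall within the good or acceptable ranges.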

Next time

There is lots of other interesting data in the agent performance metrics which you can use to further analyze build performance. In my next post, we'll look at the Usage records section.
