CloudBees Accelerator vs distcc - samba reloaded
In an earlier post I compared the performance of ElectricAccelerator and distcc by building samba using each tool in turn on the same cluster. In that test I found that Accelerator bested distcc at suitably high levels of parallelism, but that distcc narrowly beat Accelerator at lower levels of parallelism. At the time I chalked the difference up to slightly higher overhead associated with Accelerator. But you must have known I couldn't just leave it at that. I had to know where the overhead was coming from, and eliminate it, if possible. The exciting conclusion is after the break.
To recap, I previously found that samba is a very CPU-intensive build, despite being written entirely in C. This fact was demonstrated empirically by examining the performance on one dual-core host using just gmake -j at varying levels of parallelism. Past -j 2, the performance degraded sharply. For the distcc and emake tests, I used a cluster of 12 dual-core hosts. Eleven served as workers, for a total of 22 CPUs. The remaining host was used as the build host (and cluster manager, for emake tests). Here are the original results:
Until we got to about 11 CPUs, distcc appeared to have a slight edge on emake. I had lots of theories that could explain the difference: maybe our Electric File System (EFS) was slower than ext3fs, or maybe the Electric Agent was sluggish in supplying metadata to the EFS, or in processing file usage data from it. Maybe lock contention in emake itself was causing the problem.
Of course, before I could test any of these theories I had to make sure I could reproduce the original behavior. I set up a five-node cluster using the same dual-core hosts I used previously, plus one additional node to serve as the build host (unfortunately the other half of the cluster was reserved for other tests -- I'm not the only person doing work here at CloudBees, after all!). This gave me a total of 10 worker CPUs. After installing the latest version of Accelerator (4.5.0), I fired off a series of three builds each with Accelerator and distcc, using 10 workers. When those builds completed, I computed the average build time for each tool -- and found that Accelerator beat distcc by a small margin.
Deeper Into the Rabbit Hole
This result was wholly unexpected, given the results from the previous tests. The next step was to run a series of builds with varying numbers of workers, from 1 to 10. Here are the results:
Now the results are more in line with my expectation: with low levels of parallelism distcc appears to perform better, but Accelerator catches up and finally surpasses distcc once enough resources are engaged. The breakeven point has moved though, from about 11 CPUs to about 9 CPUs. In addition, there was an outlier in both sets of results: with just one worker, Accelerator was consistently faster than distcc. That didn't fit well with my theories -- if the EFS was slow, for example, then Accelerator would have been slower than distcc with one worker.
A New Theory
As I puzzled over this new data, something clicked that caused me to remember a subtle difference between distcc and Accelerator: they use different strategies to determine how to allocate jobs to workers in the cluster. Accelerator prefers to fully load one host before running jobs on another host; distcc prefers to spread the load across as many hosts as possible before doubling up on any one host. The following images illustrate the result obtained when using these different strategies to assign five parallel jobs to a cluster of ten workers on five hosts:
This realization led to a new theory: perhaps the performance difference observed with low numbers of workers was simply an artifact of this difference in worker allocation strategies. We've already seen that this particular build is especially CPU-intensive. Two jobs on one host have just two CPUs that they must share; two jobs on two hosts have four total CPUs available. This theory would also explain why emake "catches up" with distcc -- as more and more jobs are run in parallel, distcc is forced to assign multiple jobs to a single host. Eventually, both systems have fully loaded all the available cluster nodes, so the difference in allocation strategies becomes moot.
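The arithmetic behind this theory can be sketched in a few lines of Python. This is only a toy model of the two preferences described above, not the actual scheduler of either tool: the function names (`pack_first`, `spread_first`, `cpus_reachable`) and the host names are made up for illustration, and it assumes each host offers two worker slots and two CPUs, as on the dual-core hosts in this cluster.

```python
def _check_capacity(jobs, hosts, slots_per_host):
    # Both strategies need enough worker slots for the requested jobs.
    if jobs > len(hosts) * slots_per_host:
        raise ValueError("more jobs than worker slots in the cluster")


def pack_first(jobs, hosts, slots_per_host=2):
    """Fill every slot on one host before using the next host
    (the preference attributed to Accelerator in the text)."""
    _check_capacity(jobs, hosts, slots_per_host)
    return [hosts[job // slots_per_host] for job in range(jobs)]


def spread_first(jobs, hosts, slots_per_host=2):
    """Put one job on each host before doubling up on any host
    (the preference attributed to distcc in the text)."""
    _check_capacity(jobs, hosts, slots_per_host)
    return [hosts[job % len(hosts)] for job in range(jobs)]


def cpus_reachable(placement, cpus_per_host=2):
    """Total CPUs on the hosts that received at least one job."""
    return len(set(placement)) * cpus_per_host


if __name__ == "__main__":
    # Hypothetical host names standing in for the five dual-core nodes.
    hosts = ["node1", "node2", "node3", "node4", "node5"]
    for jobs in range(1, 11):
        packed = cpus_reachable(pack_first(jobs, hosts))
        spread = cpus_reachable(spread_first(jobs, hosts))
        print(f"{jobs:2d} jobs: pack-first {packed:2d} CPUs, "
              f"spread-first {spread:2d} CPUs")
```

In this model, two jobs have two CPUs to share under the pack-first preference and four under the spread-first preference. The gap narrows as jobs are added and disappears at ten jobs, when both strategies have loaded every host.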
Armed with this theory, I altered my benchmark so that emake would use the same allocation strategy as distcc, by explicitly enabling and disabling agents via the cluster manager. For example, for a trial with two agents, I enabled one agent each on two cluster nodes. This technique allowed me to better compare the relative performance of distcc and Accelerator. Here are the results from this test:
With the allocation strategy out of the equation, Accelerator actually has a small, consistent edge over distcc (about 2%) on small clusters. And the previous test showed that Accelerator scales better than distcc, so on large clusters the difference is even more pronounced (about 15%).
Should We Change Accelerator?
An obvious question is whether we would consider changing the allocation strategy in Accelerator. The answer is probably no. The strategy we use, although suboptimal for this particular build, actually works very well across a wide variety of builds. One of the key advantages of this strategy is that it allows Accelerator to minimize network overhead, since agents on a single host can share various kinds of data directly. There are relatively few builds that skew so heavily towards CPU utilization, so changing the strategy to benefit those special cases at the expense of the more common case seems unwise.
A Champion Vindicated
Although we previously declared Accelerator the victor versus distcc when building samba, it was not without some reservations. With the new results shown here, I'm satisfied that we made the correct decision: CPU-for-CPU, Accelerator is more efficient and scales better than distcc, at all cluster sizes.