CloudBees Accelerator Agent Metrics, part 2

Written by: Electric Bee
7 min read

CloudBees Accelerator agent metrics provide a tremendous amount of data that you can use to analyze and improve the performance of your builds. Last time we saw how to collect the metrics and we explored the data presented in the Overall time usage section. This time, we'll look at the data in the Usage records section.

Usage records

As a build is executed by Accelerator, the Electric File System tracks all filesystem and registry accesses made by commands in the build. This usage log is the means by which modifications made on cluster nodes are propagated back to the emake host, and it also forms the basis for our conflict detection algorithm. Because it is such a core piece of the system, we track numerous metrics specifically related to the usage log. Those metrics are reported in the Usage records section of the agent performance metrics.
After your build completes, use cmtool and agentsummary to retrieve the agent metrics, then find the Usage records section. It looks something like this:

Usage records:
  Append                 0 (  0.0),    0.0 per job;       25 raw (  0.0)
  Blind create           0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Blind truncate         0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Create              4835 (  2.5),    3.7 per job;     5425 raw ( 89.1)
  Create key             0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Create(dir)          380 (  0.2),    0.3 per job;      476 raw ( 79.8)
  Delete key             0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Delete value           0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Failed lookup     112444 ( 59.3),   85.8 per job;        0 raw (  0.0)
  Link                 760 (  0.4),    0.6 per job;        0 raw (  0.0)
  Lookup             19715 ( 10.4),   15.0 per job;  7507140 raw (  0.3)
  Lookup key          1426 (  0.8),    1.1 per job;     8385 raw ( 17.0)
  Modify                33 (  0.0),    0.0 per job;       42 raw ( 78.6)
  Modify atts            1 (  0.0),    0.0 per job;     1714 raw (  0.1)
  New name            4080 (  2.2),    3.1 per job;   116978 raw (  3.5)
  Read               43476 ( 22.9),   33.2 per job;    85315 raw ( 51.0)
  Read key               0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Read value          2274 (  1.2),    1.7 per job;     2274 raw (100.0)
  Rename                 0 (  0.0),    0.0 per job;       11 raw (  0.0)
  Rename(dir)            0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Set value              0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Set value if           0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Truncate               7 (  0.0),    0.0 per job;       32 raw ( 21.9)
  Unlink                 0 (  0.0),    0.0 per job;        0 raw (  0.0)
  Unlink(last)         225 (  0.1),    0.2 per job;      910 raw ( 24.7)
  Total             189656 (100.0),  144.7 per job;  7728727 raw (100.0)
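(As an aside: if you want to track these numbers across builds, the block is easy to post-process. Here's a minimal Python sketch; the regex assumes the exact column layout shown above, and the sample lines are copied from this output:)

```python
import re

# Parse "Usage records" lines from agentsummary output into a dict.
# Assumes the column layout shown in the block above.
LINE_RE = re.compile(
    r"^\s*(?P<op>\S.*?)\s+(?P<count>\d+)\s+\(\s*(?P<pct>[\d.]+)\),"
    r"\s*(?P<per_job>[\d.]+) per job;\s*(?P<raw>\d+) raw \(\s*(?P<raw_pct>[\d.]+)\)"
)

def parse_usage_records(text):
    records = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            records[m.group("op")] = {
                "count": int(m.group("count")),
                "pct": float(m.group("pct")),
                "per_job": float(m.group("per_job")),
                "raw": int(m.group("raw")),
                "raw_pct": float(m.group("raw_pct")),
            }
    return records

sample = """\
  Create              4835 (  2.5),    3.7 per job;     5425 raw ( 89.1)
  Failed lookup     112444 ( 59.3),   85.8 per job;        0 raw (  0.0)
  Total             189656 (100.0),  144.7 per job;  7728727 raw (100.0)
"""
print(parse_usage_records(sample)["Create"]["raw"])  # 5425
```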

As you can see, Accelerator tracks a wide variety of file and registry operations:

Append: Adding data to the end of an existing file.
Blind create: Creating a new file (distinguished from Create only by the flags used in the creating system call).
Blind truncate: Completely overwriting an existing file.
Create: Creating a new file.
Create key: Creating a registry key (Windows only).
Create(dir): Creating a directory.
Delete key: Removing a registry key (Windows only).
Delete value: Removing a registry value (Windows only).
Failed lookup: A stat() on a non-existent file.
Link: Creating a hard link to an existing file.
Lookup: A stat() on an existing file.
Lookup key: Looking up (or attempting to look up) a registry key (Windows only).
Modify: Rewriting a portion of the contents of an existing file.
Modify atts: Changing the attributes of a file, such as access permissions.
New name: Does not correspond to a specific operation; a placeholder used to facilitate reporting usage to emake.
Read: Reading the contents of a file.
Read value: Reading (or attempting to read) the contents of a registry value (Windows only).
Rename: Renaming a file.
Rename(dir): Renaming a directory.
Set value: Setting a registry value (Windows only).
Set value if: Setting a registry value, with additional constraints on the value (Windows only).
Truncate: Overwriting the contents of an existing file.
Unlink: Removing a hard link to a file or directory.
Unlink(last): Deleting a file or directory.

For each type of operation, the agent reports the following data:

  • The total number of records of that type reported to emake.

  • That type's share of all usage records reported, as a percentage.

  • The average number of records of that type per job in the build.

  • The total number of raw records of that type reported by the EFS.

  • The percentage of those raw records that the agent actually reported to emake.
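These columns are related by simple arithmetic, which you can verify against the sample output. For the Create row (the per-job figure additionally requires the number of jobs in the build, which this output doesn't show directly):

```python
# Verify the column arithmetic for the "Create" row of the sample output:
#   4835 records reported out of 189656 total, from 5425 raw EFS records.
count, total_reported, raw = 4835, 189656, 5425

share_of_reported = 100.0 * count / total_reported  # the "(  2.5)" column
share_of_raw = 100.0 * count / raw                  # the "( 89.1)" column

print(round(share_of_reported, 1))  # 2.5
print(round(share_of_raw, 1))       # 89.1
```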

Why are there so many types of operations?

As of Accelerator 4.3.0, the EFS and agent track 25 different types of operations, almost double the variety that we tracked in Accelerator 3.5.0. Many of the different types are fairly subtle variations on one another. For example, the distinction between Create and Blind create is simply the flags used in the system call that created the file. Similarly, Create could be considered a special case of Modify. The question inevitably arises: wouldn't it be simpler to track fewer types of operations?
The answer, of course, is yes, it would be simpler, but that simplicity would come at the cost of efficiency. For example, although Create is a special case of Modify usage, it has one very significant difference: Modify usage implies that a command used the previous contents of the file before making changes, while Create implies that the command did not use the previous contents. This seemingly trivial distinction is the difference between commands that must be serialized and commands that can run in parallel: two commands that modify the same file must be serialized (think of compile steps that each update a shared .PDB file), while two commands that each completely overwrite the same file can run in parallel.
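To make the serialization rule concrete, here is an illustrative sketch -- my own simplification for exposition, not Accelerator's actual conflict-detection code:

```python
# Illustrative sketch only -- not Accelerator's real conflict detector.
# Two jobs that touch the same file must serialize if either job's usage
# implies it consumed the file's previous contents.

USES_PRIOR_CONTENTS = {"Modify", "Append", "Read"}

def must_serialize(usage_a, usage_b):
    return usage_a in USES_PRIOR_CONTENTS or usage_b in USES_PRIOR_CONTENTS

# Two compiles each updating a shared .PDB file record Modify usage:
print(must_serialize("Modify", "Modify"))  # True  -- must run one at a time
# Two commands that each completely overwrite the same file:
print(must_serialize("Create", "Create"))  # False -- safe to run in parallel
```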

Reported Usage versus Raw Usage

You may have noted that there is sometimes a significant difference between the raw usage reported by the EFS to the agent, and the usage ultimately reported by the agent to emake. The simple explanation is that the agent has some smarts to eliminate redundant usage records, in order to minimize the number of records emake must manage and examine for conflicts.
A trivial example is eliminating redundant Read operations from the usage: the first time a file is read during a given job, the agent logs the usage. The second time the file is read during the same job, the agent discards the usage because it is redundant with the usage already logged for the file.
A more sophisticated example is eliminating Read operations on new files: when a file is created during a job, the agent logs the usage. If the file is later read during the same job, the agent discards the usage -- the read cannot impact the dependency analysis that emake will perform, since it is reading data that was generated in the same job, rather than data generated during a different job in the build.
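Both filtering rules can be sketched in a few lines (again, a simplification for illustration; the real agent operates at a much lower level):

```python
# Simplified sketch of the agent-side filtering described above.
# raw_usage is the ordered list of (operation, path) pairs from one job.
def filter_usage(raw_usage):
    reported, seen_reads, created = [], set(), set()
    for op, path in raw_usage:
        if op == "Create":
            created.add(path)
        elif op == "Read":
            if path in seen_reads or path in created:
                continue  # redundant repeat read, or job-local data
            seen_reads.add(path)
        reported.append((op, path))
    return reported

raw = [("Read", "a.h"), ("Read", "a.h"), ("Create", "a.o"), ("Read", "a.o")]
print(filter_usage(raw))  # [('Read', 'a.h'), ('Create', 'a.o')]
```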
By volume, the most significant application of this technique is in the reduction of lookup operations. It's astonishing just how many lookup operations occur in the course of an average build, and even more so how many of them are redundant. In the sample shown above, only a tiny fraction of the lookups were actually significant. The rest were discarded by the agent before ever being sent to emake, dramatically reducing the amount of data emake must manage. If anything, the sample shown here is on the low end in terms of the number of lookups generated -- most builds generate far more. One customer's build generated about 200 million lookups over the course of a 50-minute build.
Some types of usage, such as failed lookups, are not actually directly reported by the EFS. Instead, these types of records are synthesized by the agent to replace longer sequences of operations -- again, in order to reduce the amount of data that emake must track.

What's normal?

The usage profile of most builds is surprisingly similar: the vast majority of usage is lookups (both failed and successful); reads are the next most common, followed by creates. Just about everything else occurs so infrequently that it's not worth calling out explicitly. To put concrete values on it, as a percentage of all reported usage records:

Failed lookup: 50 - 60
Read: 25 - 30
Lookup: 10 - 15
Create: 2 - 10 (NB -- usually 1 or 2 per job in the build)
Everything else: 2 - 5
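Those ranges lend themselves to an automated gut check. A sketch, using the ranges above as soft thresholds (the cutoffs are rough guides from this article, not hard limits):

```python
# Flag usage types whose share of reported records falls outside the
# typical ranges above (percentages; rough guides, not hard limits).
TYPICAL = {
    "Failed lookup": (50, 60),
    "Read": (25, 30),
    "Lookup": (10, 15),
    "Create": (2, 10),
}

def flag_anomalies(profile):
    return [(op, profile.get(op, 0.0))
            for op, (lo, hi) in TYPICAL.items()
            if not lo <= profile.get(op, 0.0) <= hi]

# Percentages from the sample build shown earlier:
build = {"Failed lookup": 59.3, "Read": 22.9, "Lookup": 10.4, "Create": 2.5}
print(flag_anomalies(build))  # [('Read', 22.9)] -- just below the usual range
```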

How can I use this data?

There are two ways you can use the usage metrics to improve build performance. First, you can use the usage profile as a simple "gut check" to spot anomalous behavior. For example, a build that logs significantly more Create usage than others may be doing unnecessary additional work, such as building targets repeatedly. A build that logs excessive Read usage may indicate problems like badly factored header files which cause every compile to read every header. It's hard to predict exactly what the problems might be; the key is to keep your eyes open for anything out of the ordinary.
The second consideration relates specifically to failed lookup usage. As you've seen, failed lookups account for the majority of all usage. Earlier I noted that failed lookups correspond to stat()-like operations on non-existent files, but what does that really mean?
The most common source of failed lookups is include path searches. Suppose your build specifies an include search path to the compiler, for example:

... -I/tools/megawidgets/include -I/tools/apache/include 
  -I/tools/expat/include -I/tools/tcl/include -I/tools/perl/include ...

Now, imagine your source files contain #include directives like this:

#include "math.h"

When the compiler looks for math.h, it will search each directory in the include search path in turn. Of course, the file exists in only one directory, so each attempt to find it in the other directories generates failed lookup usage. Now imagine a system in which the include search path contains dozens of directories. Every compile step in the build will search those directories for every header file it includes. You can see how quickly it adds up:

      500 compiles
  *    20 headers per compile
  *    50 dirs in include path
  ---------
  500,000 failed lookups
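You can model this search yourself. A sketch (directory names and header locations below are invented for illustration; a real compiler's search also involves quoted vs. bracketed includes and other subtleties):

```python
# Count the failed lookups generated by searching an include path in order.
# Paths below are made up for illustration.
def failed_lookups(include_path, locations, headers):
    """locations maps each header to the one directory that contains it."""
    failures = 0
    for h in headers:
        for d in include_path:
            if locations.get(h) == d:
                break        # found -- a successful Lookup ends the search
            failures += 1    # a stat() on a directory without the file
    return failures

path = ["/tools/a/include", "/tools/b/include", "/tools/c/include"]
locs = {"math.h": "/tools/c/include", "io.h": "/tools/a/include"}
print(failed_lookups(path, locs, ["math.h", "io.h"]))  # 2

# Moving the directory that satisfies the most includes earlier cuts misses:
better = ["/tools/c/include", "/tools/a/include", "/tools/b/include"]
print(failed_lookups(better, locs, ["math.h", "io.h"]))  # 1
```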

Depending on the build, there may be nothing you can do about this. But if you are able to, you could investigate optimizing the include search path:

  • Eliminate stale entries.

  • Reorder the include path so that more commonly used headers are found earlier.

  • Customize the include path per target. Some builds use a single include path for all compile steps, but if the megawidgets headers (for example) are used only by some targets, adding that directory to the include path for the other targets only bloats the search space.

Unfortunately, it's hard to predict exactly how much impact changes like these could have on your build, although I can say that my experience has consistently shown that reducing the size of your build in any dimension often results in surprising performance benefits due to second-order effects. I encourage you to go for the "low hanging fruit" -- make those changes that are easy to make and measure the impact. If the modifications pay off, find the next lowest hanging fruit and repeat the process. Keep going as long as the cost of making the changes is less than the benefit you see.
