Data Mining CloudBees Accelerator Annotation: Bill of Materials

Written by: Electric Bee

CloudBees Accelerator annotation files contain a gold mine of information about your build, such as the dependencies between jobs in the build, the time required to run each job, the exact command-line and environment used to invoke each command in the build, and even every file read and written by each job in the build. Many people have correctly speculated that they could use the file access data in annotation to create a bill of materials for the build, similar to so-called "configuration records" in ClearCase. In this post, we'll look at how we can do that using the annolib library.

Getting Started

The first step is to generate an annotation file with an appropriate level of detail. Annotation is not enabled by default, and even when it is enabled, not all information is included, because the more verbose types of annotation carry a performance cost. In this case, we need only file-level annotation, which incurs less than a 2% performance penalty. To enable file-level annotation, add the following options to your emake command line:

--emake-annodetail=file --emake-annofile=emake.xml

When your build completes, you'll find the file emake.xml contains an XML annotated build log. Each job is identified by a <job> tag, which includes an <opList> tag that contains an <op> tag for each file read or written by the job. For example:

<opList>
<op type="read" file="/blog/subdir" filetype="dir" isdir="1"/>
<op type="create" file="/blog/baz" found="0"/>
<op type="read" file="/blog/foo"/>
<op type="read" file="/blog/symlink_to_bar" filetype="symlink"/>
<op type="read" file="/blog/bar"/>
</opList>

Each operation is tagged with a type attribute that tells you the nature of the operation; in addition to the read and create operations in this example, you may see lookup, modify, unlink, rename, link, modifyAttrs, and append operations. These are very closely related to the usage types described in the agent performance metrics that we've previously explored. For purposes of generating a simple bill of materials, we're not so much concerned about the differences between the types of operations except to group them into two buckets: read operations, and everything else.

Besides the type of the operation, the <op> tag gives the full path to the file and the type of the file, one of dir, symlink or file. Many operations do not explicitly declare the file type, because the default value is file according to the annotation DTD, so we save a little bit of space by not including that attribute when it is redundant.
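
To make the two-bucket grouping concrete, here is a minimal Tcl sketch that classifies the <op> tags from the example above without annolib. The proc name classifyOps and the regular-expression parsing are assumptions made purely for illustration; a real annotation file is better handled by an XML-aware tool such as annolib.

```tcl
# Hypothetical helper: split <op> tags into "read" and "other" buckets.
# Regex parsing is only adequate for simple one-tag-per-line input like this sample.
proc classifyOps {xml} {
    set buckets [dict create read {} other {}]
    foreach line [split $xml \n] {
        # Pull out the type and file attributes; skip lines that are not <op> tags.
        if {![regexp {<op type="([^"]*)" file="([^"]*)"} $line -> type file]} {
            continue
        }
        # filetype defaults to "file" when the attribute is omitted.
        set filetype file
        regexp {filetype="([^"]*)"} $line -> filetype
        set bucket [expr {$type eq "read" ? "read" : "other"}]
        dict lappend buckets $bucket [list $file $filetype]
    }
    return $buckets
}

set sample {<opList>
<op type="read" file="/blog/subdir" filetype="dir" isdir="1"/>
<op type="create" file="/blog/baz" found="0"/>
<op type="read" file="/blog/foo"/>
<op type="read" file="/blog/symlink_to_bar" filetype="symlink"/>
<op type="read" file="/blog/bar"/>
</opList>}

set buckets [classifyOps $sample]
foreach entry [dict get $buckets read] {
    lassign $entry file filetype
    puts "read  $filetype  $file"
}
```

Note that /blog/foo and /blog/bar come back with a file type of file even though the tags never say so, which mirrors the DTD default described above.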

Constructing the bill of materials

The term bill of materials has many different definitions in the build space, but here we are specifically defining it as the list of source files read in the process of generating the final output of the build, subject to the following constraints:

  1. Directories and symlinks should be excluded.

  2. Files created during the build should be excluded, even if they are read by a later job in the build.

  3. Makefiles should be excluded, for compatibility with ClearCase.

You may find that you can do a lot of analysis with standard utilities like grep, but I find that beyond relatively simple tasks, it's easier to use annolib, an annotation processing library created for this purpose. Annolib is implemented as a loadable Tcl extension, so using it means writing a short Tcl script leveraging the facilities in the library. Based on the requirements above, I wrote the following annolib script:

switch -glob -- $tcl_platform(os) {
    Windows* {set InstallDir C:/ECloud/i686_win32   ; set ext dll}
    Linux    {set InstallDir /opt/ecloud/i686_Linux ; set ext so }
    SunOS    {set InstallDir /opt/ecloud/sun4u_SunOS; set ext so }
}

load $InstallDir/bin/annolib.$ext

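set anno [anno create]
set xml  [open [lindex $argv 0] r]
$anno load $xml

set bom {}

foreach file [$anno files] {
    if { [$anno file type $file] != "file" } {
        continue
    }

    foreach op [$anno file operations $file] {
        foreach {job type file} $op { break }
        if { $type != "read" } {
            break
        }

        if { [$anno job type $job] == "parse" } {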
            continue
        }

        lappend bom $file

        break
    }
}

foreach file $bom {
    catch {puts $file}
}

You can run the script like this: tclsh bom.tcl emake.xml. It will print to standard output a list of all the source files read by the build. Here's how it works, line by line:
Lines 01-05: Make a guess about the install directory where annolib will be found, based on the platform that the script is running on. If you have installed ElectricInsight to a different location, you'll need to change these paths.
Line 07: Load the annolib library, using the install directory we determined previously.
Lines 09-11: Create an anno object which will hold the data extracted from the annotation file, then open the annotation file specified on the command line and instruct the anno object to load it.
Line 13: Initialize our result set to empty.
Line 15: Iterate through the files used by the build. The $anno files command returns an unsorted list of all the files referenced in all of the <op> tags in the annotation file.
Lines 16-18: Constraint #1: check the type of the file. If the type is anything but a regular file, skip to the next file.
Line 20: Iterate through the operations performed on the current file. The $anno file operations command returns a list of all the operations that refer to the specified file, in order of occurrence in the build, so earlier operations in the list occurred earlier in the build.
Line 21: Each operation in the list returned by $anno file operations is formatted as a tuple consisting of the job identifier for the job that owns the operation; the type of the operation; and the name of the file. This line extracts those three fields into separate variables for use.
Lines 22-24: Constraint #2: if the type of the operation is not "read", skip to the next file. This trick works because the operations are given in order. If we see any other type of operation before we see a read, then we can conclude that this file is one that was created during the build.
Lines 26-28: Constraint #3: if the type of the job is "parse", skip to the next operation on this file. Parse jobs are the ones that read makefiles, so ignoring their reads keeps makefiles out of the result.
Line 30: If we get to this point, then we must have a read operation from a non-parse job that is not preceded by any write operations. Therefore, this file should be included in the bill of materials.
Line 32: Since we've already made a decision about whether or not to include this file in the result, we need not look at any other operations on the file.
Lines 36-38: Print each file in the bill of materials to standard output, one per line.
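
The order-dependent filtering described above can be tried out without an annotation file or annolib. The following sketch applies the same three constraints to hand-written data. The proc name selectSources, the sample paths and job identifiers, and the dictionaries standing in for annolib's file-type, operation, and job-type queries are all hypothetical.

```tcl
# Hypothetical stand-in for the annolib-driven loop; all inputs are plain dicts.
#   fileTypes: file name -> dir | symlink | file
#   fileOps:   file name -> ordered list of {jobId opType} pairs
#   jobTypes:  jobId     -> job type (sample values only)
proc selectSources {fileTypes fileOps jobTypes} {
    set bom {}
    dict for {file ops} $fileOps {
        # Constraint 1: regular files only.
        if {[dict get $fileTypes $file] ne "file"} { continue }
        foreach op $ops {
            lassign $op job type
            # Constraint 2: a non-read seen before any qualifying read means
            # the build itself wrote this file, so it is not a source.
            if {$type ne "read"} { break }
            # Constraint 3: reads by parse jobs do not count; keep looking.
            if {[dict get $jobTypes $job] eq "parse"} { continue }
            lappend bom $file
            break
        }
    }
    return $bom
}

set fileTypes {/src/foo.c file /src/Makefile file /out/foo.o file /src dir}
set jobTypes  {J1 parse J2 rule J3 rule}
set fileOps {
    /src/Makefile {{J1 read}}
    /src/foo.c    {{J2 read}}
    /out/foo.o    {{J2 create} {J3 read}}
    /src          {{J2 read}}
}
puts [selectSources $fileTypes $fileOps $jobTypes]
```

With this sample data only /src/foo.c survives: the makefile is read solely by a parse job, the object file is created before it is read, and /src is a directory.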

An exercise for the reader

This simple script just scratches the surface of what you could do with a bill of materials report. For example, after generating the list of input files, you could query your SCM system for version information on each file and include that in the output. Or you could restrict the output to only those inputs that contribute to a specific output of the build, rather than all the outputs as this script does. What can you do with the gold in your annotation files?
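
As one possible starting point for the first idea, the sketch below pairs each file in a bill of materials with a version string obtained from a caller-supplied command prefix. The procs annotateBom and fakeLookup are hypothetical, as is the sample data; a real lookup would call your SCM's own query command, which is not shown here.

```tcl
# Hypothetical: attach a version to each BOM entry via a pluggable lookup.
# lookupCmd is a command prefix that is invoked with one file name appended.
proc annotateBom {bom lookupCmd} {
    set report {}
    foreach file [lsort -unique $bom] {
        # A failed lookup should not abort the whole report.
        if {[catch {{*}$lookupCmd $file} version]} {
            set version unknown
        }
        lappend report [list $file $version]
    }
    return $report
}

# Stand-in lookup used only for demonstration; replace with a real SCM query.
proc fakeLookup {table file} {
    return [dict get $table $file]
}

set versions {/src/foo.c r42 /src/bar.h r17}
set report [annotateBom {/src/foo.c /src/bar.h /src/new.c} [list fakeLookup $versions]]
foreach entry $report {
    lassign $entry file version
    puts "$file\t$version"
}
```

Passing the lookup as a command prefix keeps the report logic independent of any particular SCM, so the same proc can be reused by swapping in a different lookup.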
