Due to the nature of CloudBees products, we frequently field questions about how best to configure a large number of machines. I'll touch on a few points that we have found particularly interesting, both when helping customers and when setting up our own internal machines.
Pick the Right Hardware for the Task
There is an interesting balance between cost and performance. Of course, we can all come up with a super high-end machine that outperforms anything out there, for a mere $50,000 (for a slightly outdated benchmark of current systems, see Extreme Tech's article Intel Quad-Core Performance, Top to Bottom). With a cluster of 100 machines, that sort of setup quickly becomes prohibitively expensive. At the other extreme, using netbooks for your high-performance tasks would probably result in unhappy users. As always, you will have to trade off speed against cost when choosing hardware, and the choices you make will determine your build speed and scalability. Choosing the right system is an individualized process, but consider these factors when trying to identify the right hardware:
Central Build Machines (emake) - As much RAM as possible; four or more cores; RAID 0 if possible.
Cluster Manager Machine - As much RAM as possible; two or more cores; large disk.
Build Agent Machines - One GB of RAM per agent; one core per agent (on less CPU-intensive builds you can oversubscribe a little; we often see two cores for three agents); RAID 0 for very I/O-intensive builds; disk space equal to the size of the largest build for each agent, plus enough headroom to keep disk utilization below seventy percent.
Network Setup - Even if you make every machine fast enough to do a lot of work in parallel, keep in mind that if you need to push a lot of data over the network, you need a switch big enough to handle that data. From a topology perspective, you probably want machines that work together to be on a single switch. That way, inter-machine communication won't affect the overall network load.
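The build agent rule of thumb above can be turned into a quick sizing calculation. The following is only a minimal sketch: the function name, the `agents_per_core` parameter, and the reading of the disk rule as "largest build size times number of agents, divided by the target utilization" are my assumptions, not part of any CloudBees tool.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentHostSpec:
    agents: int
    ram_gb: int
    cores: int
    disk_gb: float


def size_agent_host(agents, largest_build_gb, agents_per_core=1.0,
                    max_disk_utilization=0.70):
    """Estimate hardware for one build agent machine (hypothetical helper).

    agents_per_core=1.0 is the default rule (one core per agent);
    1.5 models the "two cores for three agents" oversubscription.
    """
    if agents < 1:
        raise ValueError("need at least one agent")
    if not 0 < max_disk_utilization <= 1:
        raise ValueError("max_disk_utilization must be in (0, 1]")

    ram_gb = agents  # one GB of RAM per agent
    cores = math.ceil(agents / agents_per_core)
    # Assumption: every agent may hold the largest build at the same time,
    # and the total should stay at or under the utilization target.
    disk_gb = agents * largest_build_gb / max_disk_utilization
    return AgentHostSpec(agents, ram_gb, cores, round(disk_gb, 1))


if __name__ == "__main__":
    print(size_agent_host(3, largest_build_gb=10, agents_per_core=1.5))
```

For three agents, a 10 GB largest build, and 1.5 agents per core, this suggests 3 GB of RAM, 2 cores, and roughly 42.9 GB of disk. Treat the result as a starting point and round up to whatever hardware you can actually buy.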
A special note on the cluster manager machine: unlike the other components, the performance requirements of this machine are highly dependent on the number of users and builds that will be supported. If you run five hundred builds a day with fifty users, the above recommendation should be fine. If you run ten thousand builds a day with five thousand users, you will have to scale the system:
Use a machine with four cores minimum.
Run the database server on a separate machine.
Ensure a Consistent Environment across All Machines
It is critical to have a consistent setup across your machines, especially when they are used in a grid or cluster. Apply changes made to one machine to all machines (after first testing the effect on one). While most aspects of the build environment can be, and often are, virtualized using the CloudBees Accelerator product, you may still have some tools that are installed locally. The main reasons for that are:
Network performance - less data transfer by not having to transfer the tools.
Tool constraints - some tools need to be installed locally using their respective installers (like Visual Studio if you plan to use solutions and projects). This is mainly due to licensing schemes, as well as the way these tools store their settings in the registry.
To mirror our tool chain, we run scheduled tasks that sync the tool chain on the relevant machines. We happen to use rsync, but pretty much any command-line-driven syncing tool will work. We also run some internal tools to compare the configuration and setup of machines. The checks range from looking for specific tools and their versions and configuration, all the way to checking for installed hotfixes and comparing DLL versions and specific system settings.
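The internal comparison tools mentioned above are not public, but the core idea is simple to sketch: collect an inventory per machine and diff it against a known-good baseline. In the sketch below, the function name, the inventory format (a plain mapping of item name to version string), and the machine and tool names are all hypothetical; how you gather the inventory on each machine is left out.

```python
def find_drift(baseline, machines):
    """Compare each machine's inventory against a baseline inventory.

    baseline: dict mapping item name -> expected version string
    machines: dict mapping machine name -> inventory dict of the same shape
    Returns {machine: {item: (expected, actual)}} for every mismatch;
    a missing item shows up as None on the corresponding side.
    """
    drift = {}
    for name, inventory in machines.items():
        diffs = {}
        for item in sorted(set(baseline) | set(inventory)):
            expected = baseline.get(item)
            actual = inventory.get(item)
            if expected != actual:
                diffs[item] = (expected, actual)
        if diffs:
            drift[name] = diffs
    return drift


if __name__ == "__main__":
    # Hypothetical inventories, for illustration only.
    baseline = {"compiler": "9.0", "perl": "5.8.8"}
    machines = {
        "agent01": {"compiler": "9.0", "perl": "5.8.8"},
        "agent02": {"compiler": "8.0", "perl": "5.8.8", "hotfix-123": "1"},
    }
    for machine, diffs in find_drift(baseline, machines).items():
        print(machine, diffs)
```

Machines that match the baseline are omitted from the result, so an empty dictionary means the pool is consistent for the items you chose to track.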
Create a Step-by-Step Instruction Set to Get from a Blank Disk to the Final Machine
Having step-by-step instructions for setting the machines up, from the basic OS install all the way through the tool chain, including information on licenses and so on, allows us to get new machines online reproducibly and quickly. I find that the more detail we put into the instructions, the better the resulting machines are. Our instructions cover everything from which device driver to use to which shortcuts to add to the desktop.
Machine Duplication Can Be Your Friend
Cloning can help with deploying and maintaining large numbers of identical machines. There are several solutions out there, some free, some commercial (no claim that this list is complete... but you get what I mean):
And many more listed at Wikipedia on Disk Cloning
As long as the hardware is fairly identical, this model works pretty well. As time progresses and newer machines enter the machine pool, the number of images you need tends to increase. The makers of these cloning tools keep getting better at dealing with varying hardware, and different operating systems have different degrees of sensitivity to hardware changes (guess which one is least sensitive). Another important factor to consider is the network load that you will incur if you clone regularly. If you do the cloning once during initial machine setup, using a private network or an isolated switch, you can avoid that issue.
So What About Virtualization?
Virtual machines are another approach to allowing deployment of large numbers of machines from a common baseline. There are various offerings out there that allow the creation/management of virtual machines, such as:
These solutions provide a great way to deal with a variety of issues:
Varying machine demand
Diverse machine configurations
Reduced hardware/operational cost
Quick crash recovery/consistent machine state
There are, however, some drawbacks. When sizing the virtual host machines, a number of issues have to be considered in the equation:
Will the use of the virtual machines be CPU bound or I/O bound?
Will all machines stress the system at the same time?
Are there peak demands that need to be met?
Log When You Make Changes to Machines
Whenever anyone needs to make a change to a shared machine, ensure that the change is logged in an easily accessible location. This may feel like a lot of overhead (and a difficult thing to get people to do), but when a critical machine blue screens all of a sudden and nobody knows what has changed on it recently, it will seem like a cheap investment. Use a blog, a mailing list, a tweet or a wiki page to track that information. It doesn't have to be fancy or even super detailed to be effective; something that says "I am planning to change this on this machine" is enough. Besides a certain level of accountability and traceability, it also generates a nice set of clues and tips for others to see as you navigate the wonderful waters of machine configuration.
Avoid Tools that are Not Parallel/Distribution Safe
One of the more common support issues CloudBees receives is that a particular tool is not working: it works in the original build, but when run with the CloudBees Accelerator product the build breaks. There are a number of possible explanations for this, but in this section I want to focus on the tool chain being used. We have found that a number of tools out there will not work reliably when run multiple times simultaneously on one machine, or in a distributed fashion on multiple machines. One good way to find out what is causing trouble is to reduce the number of active agents on each agent machine to one. You can do that on the Cluster Manager Agents page by disabling the extra agents. Another way is to log into a machine and run the same command multiple times in parallel in different shells. Often this alone will demonstrate the tool's inability to run stably under those conditions (oh, the good old times when developers didn't have to worry about concurrency).
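The "same command in several shells" test can also be scripted. The following is a minimal sketch using only the Python standard library; the function name and its parameters are my own, and the command you pass in is whatever tool invocation you suspect.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(command, copies=4, timeout=300):
    """Start `copies` instances of `command` at roughly the same time.

    command: argument list, e.g. ["mytool", "--flag"] (hypothetical)
    Returns a list of exit codes, one per instance; None marks an
    instance that did not finish within `timeout` seconds.
    """
    def run_once(_index):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
            return result.returncode
        except subprocess.TimeoutExpired:
            # A hang (for example a busy loop) is reported as None.
            return None

    # One worker thread per copy, so all instances overlap in time.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(run_once, range(copies)))


if __name__ == "__main__":
    # Pass the command to test on the command line.
    codes = run_in_parallel(sys.argv[1:] or [sys.executable, "-c", "pass"])
    print("exit codes:", codes)
```

If a tool exits cleanly when run alone but produces nonzero or missing exit codes here, that is a strong hint that it is not safe to run concurrently on one machine. A clean run does not prove the opposite, since races may only show up occasionally, so repeat the test a few times.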
Visual Basic 6
Visual Basic 6 is one common tool that does not perform well when run in parallel on the same machine. The main issue we have seen is that VB6 seems to use the registry to store references to OCX files, and these references can get corrupted if two similar builds run at the same time (both builds write their own values, potentially leaving the registry in an inconsistent state). CloudBees wrote a tool to force the serialization of VB6 steps on the agent machines, which has resolved this issue at a small performance penalty for these steps.
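The CloudBees serialization tool itself is not shown here. As a rough illustration of the general idea only, the sketch below wraps a command in a machine-wide lock file so that only one instance runs at a time per machine. The lock path, function name, and polling interval are all assumptions of mine, and this is not how the CloudBees tool is necessarily implemented.

```python
import os
import subprocess
import sys
import tempfile
import time

# Hypothetical lock location shared by all agents on one machine.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "vb6-build.lock")


def run_serialized(command, lock_path=LOCK_PATH, poll_seconds=0.5):
    """Run `command` while holding an exclusive lock file.

    Creating the file with O_CREAT | O_EXCL fails if it already exists,
    so only one caller at a time gets past the loop.
    """
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll_seconds)  # another step holds the lock; wait
    try:
        return subprocess.run(command).returncode
    finally:
        # Release the lock even if the command fails.
        os.close(fd)
        os.remove(lock_path)


if __name__ == "__main__":
    # Usage: python serialize.py <command> [args...]
    sys.exit(run_serialized(sys.argv[1:] or [sys.executable, "-c", "pass"]))
```

One limitation of this simple approach is that a wrapper killed while holding the lock leaves a stale lock file behind, so a production version would need stale-lock detection or an operating-system-level mutex instead.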
Cygwin
There is a recurring issue with the initialization code in cygwin.dll. A fix in Cygwin 1.5.19 partially resolved it, but the issue still seems to occur in more recent versions, although less frequently. See this help request for an independent description of the issue, and this link for a description of the fix. Basically, starting many Cygwin processes at the same time (without any preexisting Cygwin processes) may result in fork errors, or send the started processes into a busy loop. One workaround is to start a Cygwin shell in the background on that machine and keep it running all the time. The Accelerator product does this automatically starting with the 4.5.0 release.
I hope these general guidelines will help you make a first pass at deciding what to set up to achieve your goals. In the next part of this article I will present some more concrete tips on setting up machines.