At CloudBees, we've identified some best practices while setting up a virtual infrastructure to support our own engineering needs. If you've already gone through this exercise, the following may seem trivial. Or perhaps you have other nuggets of wisdom to share. (If so, please do!) Either way, I hope these tips will improve the experience of setting up a virtual build infrastructure.
As a first step toward success, establish reasonable goals for what can/should be done in a virtual environment. Also, make sure you have a rollout plan that allows for proper testing (both functional and performance).
In this part, I will describe what we have and what we learned from our deployment. The second part will wrap up with some additional findings and relate them to our CloudBees Accelerator product.
What Did We Look For?
In case you haven't already built your own list of benefits to justify a virtual infrastructure, here are some of the main reasons we decided to go with a virtualized environment (you can also check out this previous post for more):
An ever-increasing number of configurations/operating systems made it hard to scale the available rack space.
A desire to reduce the large energy consumption of a growing number of physical machines, and to reduce the complexity of managing them.
The need to deploy resources quickly, on demand.
The ability to take an existing machine image and generate 20 identical new machines from it in a few minutes.
What We Got
We are using virtual machines primarily for testing and debugging:
Continuous integration testing: Every check-in launches a build that runs a series of fully automated tests on all supported platforms (currently seven Windows platforms and five Linux platforms); a rough sketch of this flow appears after this list.
Release qualification tests: During release qualification, QA runs a series of partially automated tests that provision virtual machines for each supported platform and perform installation tests as well as large-scale performance and stress tests.
Manual testing: QA deploys virtual machines in a variety of configurations as needed for targeted testing, as well as for verifying fixes.
Debugging: Developers deploy clusters of virtual machines for system analysis and debugging.
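To make that continuous integration flow a bit more concrete, here is a minimal sketch of how a per-check-in test matrix could be fanned out across platforms. The platform names and the provision_vm/run_tests helpers are hypothetical placeholders for whatever virtualization and test tooling you actually use; this is an illustration, not our real harness.

```python
# Hypothetical sketch: run the automated test suite for one check-in
# on a freshly provisioned VM per supported platform.

PLATFORMS = ["windows-xp", "windows-7", "rhel-5", "ubuntu-10.04"]  # example names only

def provision_vm(platform):
    # Placeholder: clone the template image for this platform and boot it.
    print("provisioning VM for %s" % platform)
    return {"platform": platform}

def run_tests(vm, changeset):
    # Placeholder: install the build for this changeset and run the tests.
    print("running tests for %s on %s" % (changeset, vm["platform"]))
    return "PASS"

def test_checkin(changeset):
    results = {}
    for platform in PLATFORMS:
        vm = provision_vm(platform)
        results[platform] = run_tests(vm, changeset)
    return results

if __name__ == "__main__":
    print(test_checkin("r12345"))
```

In a real setup the provisioning step would talk to your virtualization API, and the virtual machines would be destroyed or recycled after each run.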
We also maintain a small set of physical hosts to serve as a fallback for our virtual machines. This protects against a catastrophic failure in our virtual infrastructure, and it ensures we have a way to analyze and debug issues that stem from timing or device drivers, or that are otherwise hard to work on in a virtual environment.
Note that we are not running our build machines as virtual machines. The main reason is that our in-house build load (parallel builds using CloudBees Accelerator) can easily be handled by a few physical machines that have been in place for some time now (if it ain't broke, don't fix it...).
What Did We Learn?
It is important to figure out what you want to do in your virtual environment. Virtual infrastructures may be touted as a cure-all, but in reality there are areas where they work well and areas where they simply don't.
The three main constraints are usually disk I/O, CPU utilization, and memory utilization. There is no one-size-fits-all formula for the best setup, but some basic guidelines help.
Disk I/O
If your machines will do a lot of disk I/O, make sure you have a good infrastructure in place to support it. Examples of high disk I/O usage include:
File servers with local storage in the virtual machine.
Database servers with local storage in the virtual machine.
Running compilers/linkers (sorry... that would probably include all CloudBees Accelerator customers, so please pay close attention to this).
Large numbers of file copy operations, for example during an installer build.
Less data-intensive uses of virtual machines include:
Web servers
DNS servers
DHCP servers
Database servers with remote storage
File servers with remote storage
There are two fundamentally different approaches to storing your virtual machines, each with distinct advantages and disadvantages.
Local Disk
A local physical disk in the virtual machine server allows you to avoid putting stress on your network. Of course, you could establish a separate backbone for your filer access, but that requires purchasing and maintaining more infrastructure. A purely local disk on the virtual machine server will, however, limit your virtual machines' disk performance. As a general rule, a single hard disk spindle will not do well when multiple virtual machines hit it at the same time. You may be able to offset that by using very fast disks, using multiple disks in one machine, or even using RAID.
Distributed storage of the machine templates has its trade-offs as well. On the one hand, you get backups almost for free; on the other, you have the extra overhead of managing multiple copies of your virtual machines.
Remote Disk
Remote disk storage (a disk accessed through the network) means that you invest in disks only once. In general, a central solution offers niceties such as redundancy, RAID, sophisticated volume management, snapshots, etc. (all things you can get from a SAN). On the downside, it introduces a potential bottleneck for the disk access done inside the virtual machines: every virtual machine has to go through the network to the central file server to perform any access to its "local" disk.
We do use a SAN in-house with a few specific adjustments to the infrastructure that make this a very fine solution for us:
High-speed network connection: We have a private backbone between the virtual machine servers and the SAN.
Small disk partitions: The SAN is partitioned into small pieces, each storing a maximum of five virtual machine images. This does introduce some redundancy (using up more disk space), but it greatly improves disk I/O in the virtual machines.
No snapshots: We are not backing up the virtual machines' storage space with the SAN's built-in snapshot mechanism. While snapshots are great, we have found that in a recovery there is a risk of getting out of sync with the virtual machine management software we are using, so the restored disk images don't usually do us much good. Instead, we keep full manual backups of all virtual machines that are regularly refreshed (a minimal sketch of such a backup job follows this list).
iSCSI, not NFS or CIFS: We are using iSCSI. We have found that NFS or CIFS does not give good performance and may also affect stability.
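To illustrate the "full manual backups, regularly refreshed" approach mentioned above, here is a minimal sketch of a recurring backup job. The paths and directory layout are hypothetical assumptions, and a real job would shut down or quiesce each virtual machine before copying its image.

```python
# Minimal sketch of a recurring full-copy backup of VM images.
# Assumes each VM's image files live in their own subdirectory and
# that the VM is shut down (or quiesced) before copying.
import os
import shutil
import time

VM_IMAGE_ROOT = "/san/vm-images"     # hypothetical SAN mount point
BACKUP_ROOT = "/backup/vm-images"    # hypothetical backup location

def backup_all_vms():
    stamp = time.strftime("%Y%m%d")
    for vm_dir in sorted(os.listdir(VM_IMAGE_ROOT)):
        src = os.path.join(VM_IMAGE_ROOT, vm_dir)
        dst = os.path.join(BACKUP_ROOT, stamp, vm_dir)
        if os.path.isdir(src):
            shutil.copytree(src, dst)   # full copy; refreshed on every run
            print("backed up %s -> %s" % (src, dst))

if __name__ == "__main__":
    backup_all_vms()
```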
CPU Utilization
In our environment, the virtual machines are generally very busy. In real terms, that translates into a ratio of 1.5 virtual cores to 1 physical core. On an eight-core virtual machine server, we generally run about 12 single-core virtual machines. If we use multi-core virtual machines, we reduce the overall number of virtual machines deployed on that server accordingly, to maintain our desired virtual-core to physical-core ratio. While we could deploy more, doing so generally results in poor performance for all virtual machines on that server.
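As a worked example of that ratio, here is a tiny calculation using the numbers above (the helper function is just an illustration):

```python
# How many single-core VMs fit on a host at a 1.5 : 1
# virtual-core to physical-core ratio.
def max_virtual_cores(physical_cores, ratio=1.5):
    return int(physical_cores * ratio)

print(max_virtual_cores(8))  # 12 -> about 12 single-core VMs on an 8-core host
# A two-core VM "costs" two of those 12 virtual cores, so deploying
# multi-core VMs lowers the total VM count accordingly.
```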
Memory Utilization
We have adopted a memory utilization strategy similar to our CPU utilization approach: we make sure our virtual machines have enough memory available. Based on the tasks we run on the virtual machines, that means at least 2 GB of memory for each virtual machine.
The maximum number of virtual machines on a virtual machine server is directly related to the memory available on the server and the memory used by the virtual machines. So, if you have four virtual machines configured with 2 GB of memory each, your virtual machine server should have at least 9 GB of memory (i.e., 2 GB for each of the virtual machines, plus some for the virtual machine server itself). The amount of memory dedicated to the virtual server OS varies between products.
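The same arithmetic can be captured in a small sizing helper. The 1 GB of host overhead below is an assumption for illustration; as noted, the real figure varies between products:

```python
# Rough host-memory sizing: VM memory plus host/hypervisor overhead.
def host_memory_needed_gb(vm_count, gb_per_vm=2, host_overhead_gb=1):
    return vm_count * gb_per_vm + host_overhead_gb

print(host_memory_needed_gb(4))  # 9 -> at least 9 GB for four 2 GB VMs
```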
Please also read Part 2 of this article.