Lessons Learned from CloudBees Testing Teams

This blog is co-authored by Romain Verduci, Quality Engineer, CloudBees, and Terry Martin, Customer Success Architect, CloudBees.

Over hundreds of DevOps implementations, CloudBees has identified some lessons learned and best practices which we are delighted to share with the wider CloudBees and CD Foundation communities.

This article focuses on end-to-end testing and associated tools, where the objective is to test a real user scenario from start to finish. This not only validates the system under test but also ensures that its subsystems work and behave as expected, provided that real subsystems are available to test against.

Testing Strategy

CloudBees uses an implementation of the agile test automation pyramid model from the book Succeeding with Agile by Mike Cohn. At the bottom are the unit or component tests, typically written by the developer. As one moves up the pyramid, one gets closer to end-to-end testing, where user scenarios and dependencies are introduced. Near the middle of the pyramid, there are more stubs and mocks to emulate real systems. Approaching the top of the pyramid, the tests become more real-world, with the application tested against real subsystems.

Another way to read the figure below is that tests at the base of the pyramid are typically more numerous and less costly. The closer you get to the UI tests (such as end-to-end testing and black box testing), the greater the potential for increased costs, depending on the type and number of supporting systems that make up the top of the pyramid. As the figure notes, these tests are “slower” and more time-consuming.

Figure: The importance of test automation

Source: https://martinfowler.com/articles/practical-test-pyramid.html#TheImportanceOftestAutomation

Areas of Instability

End-to-end tests (top of the pyramid) are more likely to be flaky, for a variety of reasons. Stability is a requirement for automation and a critical objective to achieve. Instability leads to test failures and a slowdown in the software development lifecycle (SDLC).

Testing with a real environment

Take a simple, hypothetical web application made up of networked servers: a web server, a database and an authentication service. We want to test the user experience, i.e. what the user will see when using the application with a web browser. Now imagine a test fails because a page does not load. Where is the problem? Is it because of a slow network? Is it the database query? Is it the storage infrastructure? Flakiness in one or more of these components of the technology stack can cause failures in the end-to-end test. To further complicate things, the problem can be with one of the dependencies and not the application itself.

Race conditions in the code

Per the Open Web Application Security Project:

“A race condition is a flaw that produces an unexpected result when the timing of actions impact other actions. An example may be seen on a multithreaded application where actions are being performed on the same data. Race conditions, by their very nature, are difficult to test for.”

Race conditions typically don’t show up in unit tests, but can sometimes show up as one ascends the test pyramid. One place this can happen is in multithreaded code, where end-to-end testing triggers multiple threads that haven’t been tested at this level. This is typically a case where the dependencies can expose flaws in the application through user scenario testing. These flaws can be, and often are, the proverbial “needle in the haystack.” One way to avoid them is to design them out, but that is much easier said than done. Thorough logging and monitoring can also be a big help in getting to the cause of the condition.
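
As a minimal sketch of what “designing out” a race can look like, the Python example below guards a shared counter with a lock so that concurrent increments cannot interleave. The Counter class, the run_workers helper and the thread counts are hypothetical, chosen only for illustration; shared state in a real application is usually far subtler than this.

```python
import threading


class Counter:
    """Hypothetical shared counter; the lock makes each increment atomic."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the read-modify-write below could interleave
        # between threads and silently lose updates.
        with self._lock:
            self.value += 1


def run_workers(counter, n_threads=8, n_increments=10_000):
    """Increment the counter from several threads and wait for them all."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run_workers(Counter())
```

With the lock in place, the final total always equals the number of threads multiplied by the increments per thread, which gives a test something deterministic to assert on.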

Timing issues

There are multiple situations where timing issues can pop up; one of them is dynamic content. Take Google Maps, for example. You may have noticed that the map loads before the plus/minus zoom controls for that map. Testing the zoom feature without the zoom controls is problematic, since the controls obviously depend on the map being there. Your test code must be intelligent enough to interact with page elements only once they are ready. Like the other issues mentioned above, this can cause flakiness or instability in your testing.


For applications that depend on different components, such as a database where several versions (or even vendors) are supported, it helps to know which customers use what. Testing, like everything else, operates in a world of resources that need to be managed. Testing time and resources are best directed at the configurations used by the majority, to cover the most potential users. Of course, the more usage data there is, the better the targeting.


Imagine a big, complex, monolithic database feeding many different applications worked on by separate teams. Due to the complexity and ingrained habits among the developers, unit test development was difficult if not impossible. The test lead decided to implement a set of end-to-end tests, which grew over the years to almost 1,000 tests. In a period of approximately three years, there was not a single time when all of the tests passed. The tests that failed were a small percentage of the total. In the end, however, those failures exposed real issues with the application that later showed up in production.

Even a small number of flaky tests leads engineers to lose confidence in the test coverage, often resulting in the failures being ignored. Flakiness is almost inevitable, but the risk can be mitigated by following a set of best practices that help reduce random failures and expedite test execution.

Selenium Best Practices

At CloudBees, we use Selenium. Selenium comes with a browser plugin interactive development environment, Selenium IDE, that offers a convenient way to record actions for repeatability. We have had mixed results with some features, such as the automatically generated code and some of the associated selectors. Below are some of our observations:

  • Per Selenium’s website, Selenium’s components include Selenium WebDriver and Selenium IDE. We have found that Selenium IDE can introduce challenges if your system is complex, mainly due to Selenium-generated code. In fairness, Selenium is actively developed, so if you are interested, try the latest version.
  • When you want to interact with an element, prefer element ID selectors, or rely on parts of the page markup that rarely change.
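  • To address potential timing issues, use “explicit waits,” where you wait for a specific condition, such as a particular element being loaded, and add a timeout. Never use hard-coded sleeps (Thread.sleep), where you just wait a fixed amount of time. Note that a hard-coded sleep is not the same thing as Selenium’s “implicit wait,” which is a driver-wide setting that polls for elements up to a timeout; Selenium’s documentation advises against mixing implicit and explicit waits.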
  • Implement independent tests that can run in parallel. A test should never depend on another test. For example, if your tests rely on a login performed by a separate login test, then whenever the login test fails, every dependent test fails too.
  • Implement the machinery to run your tests in parallel in your CI system so you can save some time.
  • Treat your test code the same way as your production code: apply good design and all of the good practices that accompany production code development.
  • As the test pyramid above shows, you should prioritize unit and integration tests, which sit further down the pyramid. End-to-end tests are really good for ensuring the application is working well, but they can come with a high cost in run time, maintenance, etc.
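
To make the explicit-wait idea concrete, here is a minimal, library-agnostic sketch in Python. It is not Selenium’s API (Selenium’s Python bindings provide WebDriverWait for this purpose). The wait_until helper and the FakePage class are hypothetical names used only to show the polling-with-timeout pattern, with the late-loading zoom controls standing in for any dynamic element.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.2):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose zoom controls load late."""

    def __init__(self, ready_after):
        self._ready_at = time.monotonic() + ready_after

    def find_zoom_controls(self):
        # Returns None until the controls have "loaded".
        return "zoom-controls" if time.monotonic() >= self._ready_at else None


page = FakePage(ready_after=0.3)
# Returns as soon as the controls appear instead of sleeping a fixed time.
controls = wait_until(page.find_zoom_controls, timeout=2.0, poll_interval=0.05)
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails loudly with a timeout when it never does, so a slow environment shows up as a clear error rather than a random failure further down the test.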

Lessons Learned

  • We do not recommend relying on Selenium to set up or clean up your test data or to execute a test prerequisite. Consider the following:
    • Example: If login is a prerequisite for all your tests, don’t use Selenium to do it. Prefer setting up a cookie, a token or basic authorization in the URL, so that if your login is broken you can still run the other tests for your app, and they will be much faster. You will still have a test that specifically tests login.
    • Another example: If you need a user account for your test, don’t use Selenium to create it via the UI. Instead, insert it in the database directly or use a REST service that is responsible for that.
    • The Selenium part of your test should only be the validation of what you actually want to check for this specific test.
    • With any end-to-end test, try to never rely on the app you want to test as a prerequisite for your test.
  • If possible, we prefer cleaning up any data that can impact tests before running them. We used to clean up after a test, but if the framework ends unexpectedly, the data won’t be cleaned up. Test frameworks typically provide “before” and “after” methods that are useful in this situation; prefer the “before” method wherever you can. It is also good practice to identify every piece of data in your test environment that can have an impact on what you are testing. If the data is not identified, it can’t be cleaned up.
  • Implement the “Shift Left” approach to testing. The message here is: get the testing team involved as early as possible, as opposed to just a few days before the release. Implementing this will vary depending on organizational structures and cultures. Specifically, with regard to end-to-end testing, we have found:
    • Run your end-to-end tests as close to development as possible. They are often run as a cron job once a day or once a week. Try to move them to the pull request (PR) process so they are as close as possible to the code change.
    • If running all your end-to-end tests is a long process, you can divide them into groups and run a specific group of tests in a PR depending on which part or component of your app has been updated.
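
As a rough sketch of that last point, the Python snippet below picks which end-to-end test groups to run from the list of files changed in a PR. The directory prefixes, group names and the select_test_groups function are all hypothetical; a real setup would take the changed files from the CI system and pass the selected groups to the test runner.

```python
# Hypothetical mapping from source directories to end-to-end test groups.
COMPONENT_TEST_GROUPS = {
    "auth/": {"login", "permissions"},
    "billing/": {"checkout"},
    "ui/": {"login", "checkout", "dashboard"},
}

# Hypothetical small group that runs on every PR regardless of what changed.
ALWAYS_RUN = {"smoke"}


def select_test_groups(changed_files):
    """Return the sorted test groups to run for the files changed in a PR."""
    groups = set(ALWAYS_RUN)
    for path in changed_files:
        for prefix, prefix_groups in COMPONENT_TEST_GROUPS.items():
            if path.startswith(prefix):
                groups |= prefix_groups
    return sorted(groups)


# A PR touching the auth component and some docs selects the auth-related
# groups plus the always-on smoke group.
selected = select_test_groups(["auth/session.py", "docs/README.md"])
```

Keeping a small always-run group means that a change in an unmapped directory still gets some end-to-end coverage, while the full suite can continue to run on a schedule.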

In Closing

Like a lot of companies, CloudBees is navigating the journey towards the optimal testing environment and culture. We evolve as the industry evolves, as any competitive business should. This includes newer tools that are free of Selenium WebDriver technology. Stay tuned.

Two other methodologies deserve consideration: test-driven development (TDD) and behavior-driven development (BDD). TDD refers to a style of programming in which three activities are tightly interwoven: coding, testing and design. BDD is a synthesis and refinement of practices stemming from TDD and seeks to achieve user story outcomes. For further reading on TDD and BDD, consult the links in the Agile Glossary in the resources section below.

One last point — and we can’t stress this enough — having a strong testing-centric culture in the development teams is of utmost importance and is critical to success.

Additional resources