A few of us here at Codeship recently completed work on delivering isolated networks for steps in Codeship Pro. During the course of the project the codebase underwent significant remodeling. We wanted to share with you some of that journey and the engineering concepts that shaped it.
A Bit of History
During early planning of the networks feature, we realized that we had arrived at a fork in the road with the current implementation of Codeship Pro. To understand why, it helps to recap the history of the product.
As with many products, Codeship Pro started as a forward-thinking idea that needed iteration to discover its true identity. Engineers added features at a blistering pace and solved problems one at a time. As time went on, members of the team responsible for the initial product moved on to other projects. The result was a codebase that executed on the core features but lacked a consistent vision.
Codeship Pro was first based on a pre-1.0 version of Docker. Of course, Docker itself has also changed a lot since the early days.
Growing with Docker
We implemented workarounds for bugs in older versions of Docker, and Docker in turn released new features and deprecated existing mechanisms that we relied on. As we became more familiar with Docker, we would have ideas for new techniques or features that needed testing, so we implemented small proof-of-concepts that were never released. Over time, this led to a confusing codebase that accrued significant technical debt.
Learning on the Go
We wrote Codeship Pro in Go. Due to the relative maturity of the Docker API Golang client at the time, this was an easy decision. However, it was the first Go project at Codeship and many of the engineers were learning Go along the way. Design decisions carried over from experiences with other languages resulted in awkward implementations that didn’t fit well into the Go ecosystem.
Due to these pressures and conditions, we had arrived at a codebase that resisted change. It was time to rebuild.
Why It's Time for a Change Now
As an engineering team, we decided that we needed to pay these debts now to increase our velocity later. Thankfully, the product team agreed to give us time to make these changes while we implemented the networks feature.
A Need for Cohesiveness
A high-level theme for this project was “cohesiveness.” In the past, we were able to add many features but had little time to ensure that the features themselves rolled up into a cohesive whole. We decided on the following goals for the refactor:
Implement networks at the step level. Each step executed within a Codeship Pro build would take place in its own isolated Docker network. This would remove the need for many of the workarounds for conflicting ports between steps.
Emphasis on test coverage. Lack of tests was a primary contributor to the codebase’s resistance to change. In the past, it was difficult to determine the intention of much of the code that needed updating. It was also very difficult to know if any changes we made to the code broke backwards compatibility. For the refactor, we established a goal of greater than 80 percent coverage.
Greater orthogonality between the packages. Changes to an implementation in one package shouldn’t affect other packages around it. In our case, this meant more emphasis on interface design and stepping back to assess the design as a whole.
Pruning of features. Removal of features that the product team agreed were no longer necessary or useful. In the event we couldn’t remove something we’d like, we would provide deprecation notices and create a plan to remove it down the road.
A Need to Highlight the Product's First Principles
Early attempts at reorganizing the existing implementation to accommodate the networking changes proved to be difficult.
We had previously conflated areas of responsibility and any attempt to refactor a package tugged on threads that unraveled other packages. It was clear that we’d have to touch more code than we first anticipated, which put us in sight of a full rewrite. We decided to take a step back and address the product as a whole from first principles.
How We Focused the Work
To assess where to anchor our efforts, we decided to do a series of pairing sessions and work through it together. In these sessions, we read through the existing code and transcribed into English what we saw, annotating the code with comments. This exercise helped each of us develop mental models of the key interconnected concepts.
From these descriptions, we then began the process of naming and diagramming the existing areas of responsibility. This revealed a list of things that were working well for us and others that needed improvement. Unfortunately, the list of things to improve was extensive.
Understanding High-level Problems
The key insight that we gained from our discovery process was that there were a few small architectural decisions that contributed to the high cognitive load of understanding how the system worked. The biggest issues were:
Recursively building the service graph while simultaneously walking the "step tree."
Tight coupling of event/state reporting, such as "Container Created," to the implementation of the consumers of those events.
We solved the service graph resolution by building a better abstraction of the services upfront. From this, we could detect cycles and provide a sorted order for creating service containers for any step in the tree. We will be discussing the details of this solution in a future blog post.
The tight coupling of event/state reporting to implementation was again solved by creating a better abstraction that represented the basic ideas of events and publishing them without knowledge of what would consume them. We modeled this abstraction after Docker’s own event system.
This also allowed us better insight into what the system was doing by enabling us to create event consumers that logged all events to a file for later consumption, which helped us debug several issues during development.
Building From the Bottom Up
With these high-level problems solved, it was easier to see the algorithm for performing any individual step type. We resolved to break the other areas of responsibility from our discovery phase into smaller pieces and approach things first from the bottom up.
We then implemented tasks like image loading, container lifecycle, network creation, and event handling in separate packages. Each package exposed a strong interface for the behavior it contained, which also made the code easier to test.
Once we reached a critical mass of foundational components that performed the basic tasks we needed, we set out to build the step orchestration layer. This layer would contain all of the business logic and be responsible for directing the creation of networks, loading of service images, and starting of containers via the package-level abstractions we had established.
The last piece of the puzzle to having a working system was the top-most layer. This required the creation and wiring in of the lower-level dependencies and allowing for customization via configuration options, and finally the walking and running of the steps themselves. It was very exciting to finally see the system function as a whole once we connected everything together.
What We Learned
This is not to say that everything always went smoothly. We had plenty of surprises and challenges along the way and even refactored some of our new code as we further understood the problem.
We also uncovered several "undocumented features" from the previous version through heavy testing and debugging of many different steps and services configurations. In the end, this was a very rewarding experience that also taught us many things along the way.
Breaking Codeship Pro down into its core components allowed us to understand what it was supposed to do as a product. After that, we were able to come up with abstractions that allowed us to execute on that product vision.
During this rewrite process, we learned a few things that will hopefully help us succeed on future projects.
Try to question existing assumptions about the product. For example, we cut out many code paths and "features" that were deemed to be unimportant. We had not actually released several of these features to customers and yet they were littering the codebase with unnecessary branches, making the code harder to understand.
Removing the cruft and breaking the product down to its core features and areas of responsibility helps when going through a rewrite.
Start With a Beginner's Mindset
When attempting a rewrite, it's very easy to get overwhelmed and bogged down in the minutiae of how you are going to tackle such a large task. Instead, it is often helpful to start with a piece of the codebase that is small and that you understand well, and allow the process to snowball from there.
Coming at the problem with a beginner's mindset can also help you first focus on the what/why before attempting to figure out the how. This usually allows you to get a better understanding of the problem before jumping into the code.
Also, taking a step back may allow you to see the forest for the trees, which you may have been unable to do before.
Follow the Boy Scout Rule
The Boy Scouts have a rule: "Always leave the campground cleaner than you found it."
We can apply this principle to software projects as well, since not all code can/needs to undergo a full rewrite. Any small incremental changes you can make to improve a codebase or process will benefit not only you but your coworkers as well.
In Codeship Pro’s case, we were able to simplify how we build and deploy the project, not by starting from scratch, but by iterating on the process that was already there. This allowed for a tighter feedback loop for us to ensure that our changes did not break existing functionality.
Let Us Know What You Think
We at Codeship are lucky to have an excellent product team that sees the value in allowing us to tackle our technical debt because they understand it will pay off in spades down the road. We could not have accomplished the large task of rewriting Codeship Pro and adding a new feature at the same time without their excellent planning and product knowledge.
We rolled out the new version of Codeship Pro in production on January 3. It's available for download as a CLI for local development.
Please give it a try and let us know what you think!