In the first part of this series, we talked about team productivity and why focus and letting go of unnecessary work are essential to winning your market.
One distraction keeping teams from staying focused is having to manage too much infrastructure. Maintaining a server infrastructure is a lot of work and can shift your team’s focus away from building your product.
Managing servers is a huge distraction
While the cloud has certainly decreased the amount of work needed to run a complex infrastructure, it still takes several hours of work to maintain your cloud servers. Often this involves working around or fixing issues we shouldn’t have to deal with any more.
By sticking to the model of long-running server instances, we’re taking the issues those long-running instances have to the cloud. Over time, systems begin to deteriorate, and we have to fix them to keep them running. They develop their own quirks, which is a huge distraction for our daily development.
Handling misbehaving servers
One sign of an infrastructure in need of better automation and a different operational model is the typical “Log into server X to fix issue Z” assignment. This task in particular highlights three issues that need to be addressed in your infrastructure.
Lack of automation
A server impacting your customers’ experience negatively needs to be moved out of rotation immediately. Otherwise, it’ll hurt your customers and your team. When you don’t have automation in place to remove bad servers automatically, your team has to jump on the issue. Focus on your product is lost. You want your team to be able to get all the data and logs necessary to debug the issue but limit the damage an error does to your customers.
Often you can remove servers even before they pose a serious problem. You can measure different metrics in your servers, and if one doesn’t conform to your expectations, you can simply shut the server down.
For example, our servers are checked regularly for network latency and load. If they aren’t in the range we want, we shut the server down, even if it isn’t impacting customers at the moment -- it could in the future. That's reason enough to not only deal with the problem immediately but also give our team the tools to be able to debug the issue later.
This auto-healing effect makes sure your team is focused and well-rested and doesn’t have to react to a pager in the middle of the night.
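The check described above could be sketched roughly as follows. This is a minimal illustration, not our actual tooling: the thresholds, the `ServerMetrics` fields, and the function names are all hypothetical, and a real setup would hand the resulting list to whatever removes instances from rotation.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical thresholds; the right values depend on your application.
MAX_LATENCY_MS = 200.0
MAX_LOAD_PER_CORE = 1.5


@dataclass
class ServerMetrics:
    """One snapshot of the metrics we check for a single server."""
    server_id: str
    latency_ms: float
    load_per_core: float


def is_healthy(metrics: ServerMetrics) -> bool:
    """A server is healthy only if every metric is within its expected range."""
    return (
        metrics.latency_ms <= MAX_LATENCY_MS
        and metrics.load_per_core <= MAX_LOAD_PER_CORE
    )


def servers_to_replace(fleet: Iterable[ServerMetrics]) -> List[str]:
    """Return the IDs of servers to shut down, even if customers
    are not affected yet."""
    return [m.server_id for m in fleet if not is_healthy(m)]
```

Adding a new metric to the auto-healing workflow then means adding one field and one comparison, which keeps the barrier to extending it low.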
Over time, different metrics will become important and need to be added to the auto-healing workflow. Those metrics can only be found and used if your metric collection is easy to extend and is accessible to everyone in the team.
This brings us to the second issue we regularly see in infrastructure: lack of good data for debugging an issue in your production system.
Lack of data
When debugging an issue, you need to have all the logs and metrics your system emits centrally available. Otherwise you’re going to have a hard time connecting the dots to find the solution to your issue.
The metrics system also needs to be easily extendable and accessible to make sure your team can use it productively. When you're shutting down instances automatically as soon as they might become a problem, you can’t debug them directly any more. Adding new metrics has to be easy. Otherwise you might not collect all of the metrics you need in order to fix a specific issue when it happens for the first time.
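One way to keep adding metrics cheap is a small registry where each new metric is a single decorated function. The sketch below is only an assumption about how such a registry could look; the `metric` decorator, the `collect_all` function, and the metric names are hypothetical, and shipping the snapshot to a central store is left out.

```python
import time
from typing import Callable, Dict

# Registry of metric name -> function that returns the current value.
_COLLECTORS: Dict[str, Callable[[], float]] = {}


def metric(name: str):
    """Decorator that registers a function as a metric collector."""
    def register(fn: Callable[[], float]) -> Callable[[], float]:
        _COLLECTORS[name] = fn
        return fn
    return register


@metric("process_cpu_seconds")
def process_cpu_seconds() -> float:
    # CPU time consumed by this process so far.
    return time.process_time()


def collect_all(host: str) -> dict:
    """Run every registered collector and return one snapshot that
    could be sent to a central metrics store."""
    return {
        "host": host,
        "timestamp": time.time(),
        "metrics": {name: fn() for name, fn in _COLLECTORS.items()},
    }
```

With this shape, anyone on the team can add a metric by writing one short function, without touching the collection or shipping code.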
If you have to log in to a server to get to all the metrics you need, that’s an indicator of one of two scenarios: 1) you don’t have a good central system to collect those metrics, or 2) adding them is too painful for your team, so logging into the server is the easier choice.
Of course, once you collect those metrics, it’s just as important to have them equally accessible. Even the least experienced developer in your team needs to be able to access and understand the system you’re using for collecting and storing metrics and logs. Otherwise the burden of fixing an issue can’t be shared across the whole team, and you’ll be building single points of failure that become frustrating for everyone involved.
You need to set the right incentive for everyone on your team, especially your most experienced developers, to fix the collection of metrics and logs first instead of going into your servers to debug something. They will push back that this limits their power to fix problems, but you need to remove some of their individual power to make sure your team is more powerful together.
Potential security issues
If you can get into your servers, so can others. That’s your infrastructure’s third potential weakness. Turn off access to close this security hole and to force everyone on the team to improve metrics and log collection. Make that the only way they can get to the data they need to debug issues.
You can always add a bastion host to your infrastructure that allows you to connect to your servers, but it should be used rarely.
Thinking in terms of servers is outdated
After we’ve set incentives to automate the infrastructure, collect all of our metrics, and lock down our servers, we’re at a point where thinking about individual servers in our infrastructure becomes outdated.
We’re no longer building infrastructure that consists of separate servers. We’re building an application that serves a customer’s need. How many servers that application runs on doesn’t matter as long as the customer is happy and we’re not exceeding our budget.
With auto-scaling and auto-healing infrastructure, each individual server gets replaced often anyway.
At this point, scaling your app becomes a discussion between your application, your budget, and the metrics of your running application. No human interaction necessary or desired any more. This again lets your team focus on building the next great feature.
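As a rough sketch of that discussion, the instance count can be derived from a load metric and capped by what the budget allows. Everything here is an assumption for illustration: the function name, the parameters, and the choice of requests per second as the driving metric. Prices are in whole cents to avoid floating-point rounding.

```python
import math


def desired_instance_count(
    requests_per_second: float,
    capacity_per_instance: float,
    hourly_price_cents: int,
    hourly_budget_cents: int,
    min_instances: int = 1,
) -> int:
    """Pick an instance count from current load, capped by the budget."""
    # How many instances the measured load calls for.
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # How many instances the hourly budget can pay for.
    affordable = hourly_budget_cents // hourly_price_cents
    # Never drop below the minimum, even if that exceeds the budget.
    return max(min_instances, min(needed, affordable))
```

For example, 950 requests per second at 100 requests per instance calls for 10 instances; with a budget that only covers 5, the function returns 5 and the budget wins.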
In the future, systems should be able to go beyond horizontal scaling and tackle automated vertical scaling. Why can’t we have the ability to monitor our application and run experiments on different underlying server types to find the perfect fit for our application load? Why can’t the system detect over time whether we’re CPU-, memory-, or I/O-bound and propose changes for better performance?
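To make the idea concrete, here is a deliberately naive sketch of how such detection might start: average the utilization of each resource over time and report the one that is saturated. The sample format, the 0.8 threshold, and the function name are all hypothetical; a real system would need far more careful measurement.

```python
from statistics import mean
from typing import Dict, List, Optional

RESOURCES = ("cpu", "memory", "io")


def find_bottleneck(
    samples: List[Dict[str, float]], threshold: float = 0.8
) -> Optional[str]:
    """Return the resource with the highest average utilization (0.0 to 1.0)
    if it reaches the threshold, otherwise None."""
    if not samples:
        return None
    averages = {r: mean(s[r] for s in samples) for r in RESOURCES}
    resource, utilization = max(averages.items(), key=lambda item: item[1])
    return resource if utilization >= threshold else None
```

A result of `"memory"`, for instance, could be the trigger for an experiment on a server type with more memory.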
To make that happen, we need an abstraction that takes care of most of these hard parts of running our infrastructure, so we can focus on building the application for our customers.
Cloud services:
With all these problems that servers still have, the next evolutionary step for our infrastructure is taking place.
Cloud services are starting to replace cloud servers as the main building blocks for our new applications. We’re running on Heroku or Elastic Beanstalk, which hide many of the details necessary for running a successful infrastructure.
And while those are still relatively low-level, the next set of higher level services is coming up. With AWS Lambda, we have a service that completely hides the details of running our application and lets us fully focus on our code and its dependencies.
These new services, like AWS Lambda, differ from other cloud services in one major way: they manage the communication layer between the different parts of your application. Because this communication is implicitly built into the whole system from the beginning, we can radically simplify our application. We only build the business logic without having to think about the details of the infrastructure itself.
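To illustrate what "only the business logic" can look like, here is a minimal sketch of a Python function for AWS Lambda. The `handler(event, context)` signature is the one Lambda uses for Python functions; the shape of the event (an `order` with `items`) and the response format are assumptions made up for this example.

```python
import json


def handler(event, context):
    """Compute an order total. No server, routing, or queue code is needed;
    the service delivers the event and forwards the return value."""
    # Hypothetical event shape: {"order": {"items": [{"price_cents", "quantity"}]}}
    items = event.get("order", {}).get("items", [])
    total_cents = sum(item["price_cents"] * item["quantity"] for item in items)
    return {
        "statusCode": 200,
        "body": json.dumps({"total_cents": total_cents}),
    }
```

Because the function is plain Python, it can be called locally with a dictionary as the event and `None` as the context, which also makes it easy to test.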
However, this communication layer between services is still a big hurdle that needs to be resolved for those cloud services. It needs to be easy enough for you to get started but extensible, so that over time you can move parts of your application out of the high-level cloud service to a location that gives you more control if necessary (e.g., your own AWS instances). The communication layer can’t change, though, as this would require a major refactoring of your architecture. So the communication layer needs to be able to support a wide variety of use cases, not just specific cloud services.
But as you can see in the following diagram, not every part of your infrastructure will have to be moved out of the cloud services over its lifetime. We can take advantage of that.
A second issue we’re facing with these new cloud services, apart from communication, is automated metric collection. We need great insights into the running application; otherwise, we can’t trust the service. Often there’s not enough data available from the app running inside the cloud service for us to get the necessary insights when something goes wrong.
When providers solve the communication and metrics issues, we’ll see a lot more people moving into cloud services.
Conclusions
With development time limited and our speed of development so important, we’ll have to decide which parts of the infrastructure we want to control and which parts we’re handing over to somebody else to manage. Modern cloud services are an interesting option for many teams because they take over large parts of running a complex infrastructure. By embracing this early, we can move fast and shape the future of how we build infrastructure to fit our long-term needs.
Of course there are issues with this new approach to building infrastructure, especially when it comes to lock-in. In the next and final part of this series, we’re going to look into lock-in and how we can avoid or limit issues with lock-in in this new cloud-services world.