This article was originally published on Heptio's blog by Joe Beda. With their kind permission, we’re sharing it here for Codeship readers.
This is the third part in a multi-part series that examines multiple angles of how to think about and apply “cloud native” thinking.
It is probably most useful to think of DevOps as a cultural shift whereby developers must care about how their applications are run in a production environment. In addition, the operations folks understand how the application works and are empowered to play an active part in making it more reliable. Building understanding and empathy between these teams is key.
But this can go further. If we reexamine the way that applications are built and how the operations team is structured, we can improve and deepen this relationship. Google does not employ traditional operations teams. Instead, Google defines a new type of engineer called the “Site Reliability Engineer.” These are highly trained engineers (compensated at the same level as other engineers) who not only carry a pager but are also expected and empowered to play a critical role in pushing applications to be ever more reliable through automation.
When the pager goes off at 2 a.m., anyone answering that page does the exact same thing: try to figure out what is going on so that they can go back to bed.
What defines an SRE is what happens at 10 a.m. the next morning.
Do the operations people just complain or do they work with the development team to ensure that a page like that will never happen again?
The SRE and development teams have incentives aligned around making the product as reliable as possible. That, combined with blameless postmortems, can lead to healthy projects that don’t collect technical debt.
SREs are some of the most highly valued people at Google. In fact, oftentimes products launch without SREs with the expectation that the development team will run their product in production.
The process of bringing on SREs often involves the development team proving to the SRE team that the product is ready. It is expected that the development team will have done all of the legwork, including setting up monitoring and alerting, alert playbooks, and release processes. The dev team should be able to show that pages are at a minimum and that most problems have been automated away.
As the role of operations becomes much more involved and application-specific, it doesn’t make as much sense for a single team to own the entire operations stack. This leads to the idea of Operations Specialization. In some ways, this is a type of “anti-devops.” Let’s take it from the bottom up:
Hardware Ops. This is already clearly separable. In fact, it is easy to see cloud IaaS as “Hardware Ops as a Service.”
OS Ops. Someone has to make sure the machines boot and that there is a good kernel. Breaking this out from application dependency management mirrors the trend of minimal OS distributions focused on hosting containers (CoreOS, Red Hat Project Atomic, Ubuntu Snappy, RancherOS, VMware Photon, Google Container Optimized OS).
Cluster Ops. In a containerized world, a compute cluster becomes a logical infrastructure platform. The cluster system (Kubernetes) provides a set of primitives that enables many of the traditional operations tasks to be self-service.
App Ops. Each application can now have a dedicated app ops team, as appropriate. As above, the dev team can and should play this role as necessary. This ops team is expected to go deeper on the application because they don’t have to be experts in the other layers. For example, at Google, the AdWords Frontend SRE team will talk to the AdWords Frontend development team a lot more than they’ll talk to the cluster SRE (borg-sre) team. This alignment of incentives can lead to better outcomes.
There is probably room for other specialized SRE teams depending on the needs of the organization. For instance, storage services may be broken out as a separate service with dedicated SREs. Or there may be a team responsible for building and validating the base container image that all teams should use as a matter of policy.
In the next part of this series, we will look at how cloud native relates to containers and container clusters.