Key Takeaways from Continuous Discussions (#c9d9) Episode 61: Mapping DevOps into an ITIL Framework

Written by: Electric Bee

In our recent Continuous Discussions (#c9d9) video podcast, DevOps and ITIL experts discussed how ITIL and DevOps can work together. Our expert panel included: Kaimar Karu, head of product strategy and development at AXELOS; Robert Stroud, principal analyst at Forrester; Mike Kavis, VP and principal architect at CloudTP; and our very own Chris Fulton, Anders Wallgren and Sam Fell. During the episode, the panelists discussed ITIL’s relevance in a DevOps world, and how DevOps practices can be incorporated within an ITIL framework for key use cases, such as change management, incident management and more. Continue reading for their expert insights.

Is ITIL Still Relevant in a DevOps World?

Stroud continues to see more and more organizations throw their hands up over ITIL: “To quote a discussion I had yesterday with a Fortune 100 company, the CIO said to me, ‘We've actually terminated our total investment in ITIL. There will be no more ITIL courses approved in this company. If somebody mentions the word ITIL, they're looking for a job.’ So fundamentally, there's a rapid change going on, and we can go through our data which shows that people who are doing DevOps have this misnomer or this proposition that traditional ITIL doesn't work with DevOps. Let's think about this traditional change that's happening - we've got a change that's going to production, and as it goes into production, we've got 32 approvals, and development developed the code in a week, and it takes me 12 weeks to go through the change management approval cycle. This is a real use case. And this is what's annoying the living daylights out of people right now, it's that ITIL's just not relevant.”

ITIL can be frustrating when the level of output increases from dev, but that doesn’t mean it’s bad, says Wallgren: “I use this video in my presentations, it's from the ‘I Love Lucy Show’ when she's in the chocolate factory, and her job is to package the chocolates that are coming down the conveyor belt. And the conveyor belt is moving too fast for her to be able to do it, so she's got to start eating them and stuffing them in her clothes, and all that sort of stuff. I think that's the way a lot of ops teams feel. Now that dev has cranked up the conveyor belt a little bit faster, ops has to figure out how to adopt or how to accept that. There is an impedance mismatch there between the output and the input, and there's pain there. That doesn't mean ITIL is completely wrong, it just means you have to figure how to make these things work more smoothly together.”

ITIL does not have to be as long and laborious a process as some assume, states Karu: “Making people go through 30-plus hoops getting a change approved, that is not actually ITIL. That is nothing like ITIL. One of the answers I sometimes give to this question is, show me where it tells you in the ITIL publications that this is the way you should decide the process. When we talk about development and operations, development has been able to push out code quickly for a very long time. So that capability has been there but it has been a slightly different capability.”

When development and operations teams harmonize, good things are bound to happen, according to Fulton: “I think a lot of the pain with DevOps for me is truly putting dev and ops on the same team. Because in the past I've seen the finger-pointing, ‘Oh, it's development's fault because they gave us crap,’ or, ‘Oh, it's operations' fault because they can't get us stuff installed fast enough.’ It's truly about putting those two on the same team and having them work together. It's good to have an operations person realize the issues of a developer, and it's good for a developer to realize that the operations person may put these processes in place, because if the website goes down, they lose a million dollars.”

It’s not ITIL that’s gone wrong, it’s the implementation of ITIL, says Kavis: “A lot of times people poorly implement ITIL, and they had poorly implemented waterfall, so they go to Agile and they poorly implement that, and then they expect DevOps to fix it. Well you're going to poorly implement DevOps, right? It comes to culture again. You have to build a culture of full-stack teams, where you talk about DevOps with your security guys. There's all kinds of actors, so whether they report in silos or not, how do you form a full-stack team around this product or service you're trying to build, and everyone's accountable for the customer satisfaction, the reliability? Everyone's accountable.”

Change Management

Change management can get very tricky in software delivery, says Karu: “Very roughly we can separate two types of changes. Of course we have the infrastructure changes, and then we have smaller ones, we have larger ones. In the cloud environment, a lot of that can be done automatically, we can set up the rules around what should or should not be done, we can set up the rules around the cost involved, and all these things. When we talk about software and change management, this is where it gets quite interesting, because what is usually discussed in the context of the CAB -- which again, is the change advisory board, that's not an approval board -- what in most organizations the CAB is doing, is it's actually assessing releases of changes. It tells you whether the thing that has been developed should go into the live environment or not, whereas the decision as to whether that should be developed or not has already been taken by someone else, somewhere else.”

Change must happen with end-user value in mind, advises Stroud: “The role of operations is fundamentally transitioning. In the past, we'd walk around with magnetic tapes, punch cards and sneakers, and physically implement things. Today, technology's being implemented with bits and bytes. If automation works and automation is out-testing people, why not promote direct to the production environment? We actually don't need a CAB anymore. The interesting thing is, by automating and instrumenting all those processes by moving compliance left, change management is automatically done. What happens now is product teams are being formed, and no change happens anymore without value. End-user value. The end-user wants it, the end-user gets feedback. In fact, at Forrester, we have research that shows if you don't focus on your end-user and customer experience, you won't be in business much longer.”

When automation goes wrong, don’t add in more processes, warns Fulton: “The systems will tell you more information than a human could ever enter into a change management system, and do it accurately. I think the biggest issue that I've seen with ITIL in general is, when the automation fails, instead of fixing the automation, people put in more processes. That's where you get these 12-day approval boards because they tried to automate it, something went wrong -- humans wrote the automation so stuff can go wrong -- but instead of fixing the automation, they added another layer for people to go through.”

Kavis explains traditional change management and how the cloud is affected by it: “Traditionally change management's been a silo-owned process. Development does their stuff, they throw it over to test, test does their stuff, and they throw it over to ops, who knows nothing about your application. So they ask a million questions, rubberstamp it, and it goes into production and it breaks. That's the pattern we've been doing for 30 years. The cloud is driving a lot of this rapid development, and the cloud breaks down because we're delivering so fast. But there is a danger if you just spin up this team who's siloed away from the people who run the business and they just go do their thing. You wind up repeating the sins of the last 30 years. You're going to wind up with the same spaghetti architecture, but you're going to get there faster.”

Wallgren brings it back to ‘Schoolhouse Rock!’ and how a bill becomes a law to explain change management: “The end-to-end perspective is always important. That's something we talk to our customers about all the time. One of the most common questions is, ‘Hey, where do I start?’ for whatever it is that they want to change or improve. There’s a lot of benefit to getting the right people in the room at the right time, and mapping out what you could call a value chain, a value stream, a state machine -- whatever you call it -- make sure that you have some common understanding of what your end-to-end process is. An example of which is the ‘Schoolhouse Rock!’ episode of how a bill becomes a law. You want to have the ‘Schoolhouse Rock!’ version of how code goes live. It has to get signed by this person and that person, and then there's a pocket veto and blah-blah-blah. Unless you understand at some point, unless somebody understands at a high level what the process is, how are you going to make global optimizations? Every optimization is local.”

Audit Compliance, Incident Management and more

Kavis on audit compliance: “It comes down to getting everyone on the same page, and having a picture of the entire value stream. Then what do you address first? Well, what's the biggest bottleneck, right? If you work on the smallest bottleneck, you're just moving bottlenecks around, but when it comes to audit and compliance, automation is the key. We might all be on the same page, but I've seen too often where we have all these different logging tools, monitoring tools, and there's no cohesion, there's no coordination.”
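
Kavis's advice to start with the biggest bottleneck can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the stage names are hypothetical, the one-week development and 12-week change approval figures come from Stroud's example earlier in the episode, and the testing and deployment figures are invented for illustration.

```python
from typing import Dict, Tuple


def biggest_bottleneck(stage_weeks: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its share of total elapsed time."""
    total = sum(stage_weeks.values())
    slowest = max(stage_weeks, key=stage_weeks.get)
    return slowest, stage_weeks[slowest] / total


# Elapsed time per stage, in weeks. Stage names are hypothetical.
value_stream = {
    "development": 1.0,       # from Stroud's example
    "testing": 1.0,           # assumed for illustration
    "change_approval": 12.0,  # from Stroud's example
    "deployment": 0.5,        # assumed for illustration
}

stage, share = biggest_bottleneck(value_stream)
print(f"Address first: {stage} ({share:.0%} of total elapsed time)")
```

With these numbers, speeding up development or deployment would barely move the total, which is the point of mapping the whole value stream before picking a place to start.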

Communication is key in incident management, notes Karu: “With the technology available today, there is an element of incident management that can be automated, and that is restoring the service when something goes wrong. So you kill the service or the data server, you rebuild, reimage, and then you put it back into circulation. That can happen anywhere from microseconds to seconds. But, something in the discussions about ITIL operations and incident management that is very often missed is the whole communication side of things. It's not just enough to solve the technical challenge, it's also the communication, creating the bridge between people who need to get involved in resolving the incident. Let's say you have problem management capability in the organization - it means that you have the capability to learn from past incidents, understand why they happened, and put in place various automated ways on how to overcome them when they happen again, or how to design a system so that they don't happen again.”

Stroud drives home the importance of automation: “The reality of it is that tried and true practices and processes evolve over time. One of the things we have to look at now is, how do we go and re-architect the way we're delivering business technology? Which is what we're delivering now. We're delivering business technology to drive business value and drive the value proposition. At the end of the day, whether you said the word ITIL or DevOps or something else, I think fundamentally we're on the same team trying to drive our businesses forward. We don't need to get into religious wars over frameworks. At the end of the day it’s about how we deliver it and how we do it. One thing is clear to me, the recent analysis of the industry shows me that the least conforming part of the CAMS model is automation. It is not even at a level two on the Forrester maturity scale, and everything else is far more mature than that. If I am to mention one key point, look at your automation. Are they in silos or holistic? Because that's where you're missing the boat, that's clearly the one thing we have to do, and if we don't get better at automating and driving value from automation, we are never going to be successful.”

Fulton furthers the automation discussion: “It’s not just about automating and whether or not I can make my build go fast. Automate the process of how am I taking this from my laptop to whatever -- a website, or even if you have a CD that you would ship -- how does that happen? How many steps can you automate for it? Don't try to automate the crap that doesn't matter, automate stuff that you actually need to automate, because it takes forever.”

Some final thoughts from Wallgren: “People focus a lot on mean time to recovery as a metric. But as a single metric it's actually not that interesting. What you really care about is to dig into that a little bit. How long did it take to devise a remediation for that problem? How long did it take to implement that remediation? And then, how long did it take you to do a root cause analysis to figure out how to make sure that doesn't happen again? You might slap a Band-Aid on it quickly, so your mean time to recovery was 38 seconds because you just rebooted the server. But did you do root cause analysis? It's wonderful that you have monitoring that if something misbehaves, it pulls out a gun and shoots it and re-creates it, right? But what about making sure that doesn't happen in the first place? That's what I think a lot of people miss.”
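
The breakdown Wallgren describes can be sketched in Python. This is an illustrative sketch, not any tool's data model: the `Incident` record, its field names and the timestamps are all hypothetical, chosen to mirror his 38-second reboot example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class Incident:
    # Hypothetical timestamps for a single incident.
    detected: datetime
    remediation_devised: datetime
    remediation_applied: datetime
    root_cause_found: Optional[datetime] = None  # None if no analysis was done


def breakdown(incident: Incident) -> Dict[str, Optional[timedelta]]:
    """Split one incident into the phases behind a single recovery number."""
    root_cause = None
    if incident.root_cause_found is not None:
        root_cause = incident.root_cause_found - incident.remediation_applied
    return {
        "time_to_devise": incident.remediation_devised - incident.detected,
        "time_to_implement": incident.remediation_applied - incident.remediation_devised,
        "time_to_recover": incident.remediation_applied - incident.detected,
        "time_to_root_cause": root_cause,
    }


# A "just reboot the server" incident: fast recovery, no root cause analysis.
start = datetime(2000, 1, 1, 12, 0, 0)  # arbitrary illustrative timestamp
reboot_only = Incident(
    detected=start,
    remediation_devised=start + timedelta(seconds=8),
    remediation_applied=start + timedelta(seconds=38),
)
report = breakdown(reboot_only)
for phase, duration in report.items():
    print(phase, duration if duration is not None else "not done")
```

Here the recovery time looks excellent, but the missing root cause entry shows what the single metric hides.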


Want more Continuous Discussions (#c9d9)?

We hold our #c9d9 podcast every other Tuesday at 10 a.m. PST. Each episode features expert panelists talking about DevOps, Continuous Delivery, Agile and more.
