In a recent Continuous Discussions (#c9d9) video podcast, expert panelists discussed monitoring and DevOps. Our expert panel included: Andi Mann, Chief Technology Advocate at Splunk; Andreas Grabner, Technology Strategist at Dynatrace; Mark Burgess, Founder of CFEngine; Torsten Volk, Managing Research Director at Enterprise Management Associates; J. Paul Reed, Build/Release Engineering, DevOps, and Human Factors Consultant; and our very own Anders Wallgren and Sam Fell. During the episode, the panelists discussed the architecture and prerequisites for effective monitoring, what you should be monitoring, and best practices for troubleshooting incidents and recovering from failure. Continue reading for their insights!
What are the Architectures and Prerequisites for Effective Monitoring?
Mann recommends a systems approach: “Think about DevOps from a systems thinking point of view. You start to use those systems to understand what you're monitoring. When you’re thinking about DevOps and monitoring - you build it, you run it. So, you have to think about all of those different systems.”
Grabner says you need to think about monitoring from the start: “When you are talking about architecting new applications, you need to figure out what your monitoring strategy is. How can you monitor all the different moving pieces and components in a good way so that in the end it can make sense? That's the challenge.”
It is important to understand scale for effective monitoring, according to Burgess: “Your monitoring system has to scale in a similar way to the way your software scales, if you want to get useful data out of it. You also need to be somewhat selective about what you're looking at.”
Monitoring includes a number of different aspects, according to Reed: “I tend to look at this space through the incidents and remediation lens. Perhaps monitoring permeates all of that and you can see where your monitoring deficiencies might be by looking at how you find out about incidents. Then, when you're doing a retrospective you ask what parts are missing from those conversations.”
Volk adds that the single pane of glass concept is just as critical as scale: “Have a single pane of glass, monitor across the board, and make it part of the DevOps deployment process. Then have this abstraction of the application from the end-of-line infrastructure.”
Wallgren says monitoring should be part of the architecture: “I think the thing that's interesting about the whole monitoring problem is that it's metrics. It is one of the hugest sources of metrics that we have, and so we need to think about what is it that we're monitoring for.”
What to Monitor, How and Why?
Wallgren talks semantics: “Where are we in terms of API monitoring or are API calls functioning the way they're supposed to and higher up the stack in terms of what we're getting? Getting all of these things into one pane of glass will be very interesting as we move forward in the next couple of years to where we can maybe start to get a little bit more machine learning or other types of help on finding that tall blade of grass that needs to be chopped down.”
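Wallgren's "tall blade of grass" can be illustrated with a very simple statistical stand-in for the machine learning he anticipates. The sketch below is not from the panel: it assumes you already collect a latency figure per API endpoint, and it flags endpoints whose modified z-score (based on the median absolute deviation) is unusually high. The endpoint names, latency values, and the 3.5 threshold are all hypothetical.

```python
from statistics import median

def find_outliers(latencies_ms, threshold=3.5):
    """Return endpoint names whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single extreme value does not distort the baseline.
    """
    values = list(latencies_ms.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values are (nearly) identical; nothing stands out.
        return []
    return sorted(
        name
        for name, value in latencies_ms.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )

# Hypothetical per-endpoint latencies in milliseconds.
latencies = {
    "/login": 110,
    "/search": 120,
    "/cart": 115,
    "/profile": 105,
    "/checkout": 900,
}

outliers = find_outliers(latencies)
print(outliers)
```

A robust statistic like this is only a starting point; it says which endpoint stands out, not why.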
From an operational perspective, Reed looks at people: “If you're looking at your entire system, and you keep having problems, and it's like, ‘We do use a monitor, and we do app monitoring, do infrastructure monitoring, etc.,’ well, maybe your IT staff has been on call constantly for two years straight without a break.”
Monitoring needs to go along with teams that are reorganized for DevOps, says Mann: “If I'm on part of the accounts receivable team or UI team, I want to be monitoring what I'm concerned with. You want to be able to monitor all of the components within that service, but you need to tie them all together. I think there's low-level component monitoring, but then you need to start monitoring, as well, the impact that's having on the end customer and look at it all holistically.”
Volk focuses on connecting how the application serves the customer with monitoring of the underlying infrastructure, where something may be broken: “What is lacking today is that connection between the business priorities on the top and, at the bottom, the individual engineers that are fixing the problems. It all goes down to this tying together, the business level with the infrastructure level.”
We need to understand how to measure different scales, according to Burgess: “I think the key thing is that what we're building today versus 20 years ago, are systems of many scales. We have things happening at the microscopic-scale, cluster-scale, cloud-scale, service-scale, and customer-scale. And, in a well-designed architecture, these scales tend to decouple from each other. We need to find ways to separate those metrics so we're understanding at what scale we're measuring.”
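One way to picture Burgess's point about separating metrics by scale is to label every sample with the scale it was measured at and only aggregate within that label. The following is a minimal sketch under that assumption, not a description of any panelist's tooling; the scale labels echo his list, while the metric names and values are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (scale, metric name, value).
SAMPLES = [
    ("cluster", "cpu_utilization", 0.62),
    ("cluster", "cpu_utilization", 0.70),
    ("service", "request_latency_ms", 120.0),
    ("service", "request_latency_ms", 180.0),
    ("customer", "checkout_success_rate", 0.98),
]

def summarize_by_scale(samples):
    """Group samples by (scale, metric) and return the mean of each group."""
    groups = defaultdict(list)
    for scale, metric, value in samples:
        groups[(scale, metric)].append(value)
    return {key: mean(values) for key, values in groups.items()}

summary = summarize_by_scale(SAMPLES)
for (scale, metric), average in sorted(summary.items()):
    print(f"{scale:>8}  {metric:<24} {average:.2f}")
```

Keeping the scale in the grouping key means a cluster-level number is never averaged together with a customer-level one.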
Grabner says that monitoring tools have grown up: “I think we, as a monitoring market (vendors) learned over the last couple of years that the new applications we built, that also need to interact with the traditional applications, have changed, and so monitoring had to adapt. I think another thing that we learned is that we're all moving towards more platform as a service. I believe what we are all trying to do now is getting into these platforms as services and baking monitoring in.”
What to Do When Things Go Wrong:
When we come across a problem, it often puts people into panic mode, explains Wallgren: “It's very much at that point a foot race. Then, it's a little bit different, because now people are involved trying to fix things under pressure. But, if you monitor starting before production, then a lot more people will be exposed to that information. That's one of the ways that we can help make that work a little bit better.”
Learned something from a failure? Share it with others, advises Grabner: “It's key that we share what we learn from failure, and then try to figure out how we can automate the detection of that particular pattern by shifting it left, adding a new test or looking at the same metric wheel in the pipeline, preventing it. And then, if we have any AI, any machine learning in production, make the AI aware of this problem.”
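Grabner's idea of shifting a learned failure pattern left can be sketched as a small pipeline check. The example assumes a hypothetical past incident in which one endpoint started issuing far more database queries per request than before; the function names, endpoints, query counts, and the 2x ratio are illustrative assumptions, not anything the panel specified.

```python
def violates_known_pattern(baseline_queries, current_queries, max_ratio=2.0):
    """Return True when the current build exceeds the baseline by max_ratio."""
    if baseline_queries <= 0:
        raise ValueError("baseline_queries must be positive")
    return current_queries / baseline_queries > max_ratio

def check_build(metrics_by_endpoint, baselines, max_ratio=2.0):
    """Return the endpoints whose query count regressed past the threshold."""
    return sorted(
        endpoint
        for endpoint, current in metrics_by_endpoint.items()
        if endpoint in baselines
        and violates_known_pattern(baselines[endpoint], current, max_ratio)
    )

# Hypothetical database queries per request: last known-good build vs. this build.
baselines = {"/checkout": 12, "/search": 5}
current = {"/checkout": 40, "/search": 6}

failing = check_build(current, baselines)
print(failing)
```

In a pipeline, a non-empty result would fail the build, so the pattern is caught before it reaches production rather than rediscovered during an incident.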
Volk adds to the shared learning advice: “If you cross-correlate that across different enterprises, and learn from all of those incidents, you will discover a lot of things that you can use to tell your customers, ‘Look, this is not a time machine but it will prevent a lot of issues that you cannot anticipate based on collective learning.’ That, to me, is a big next frontier.”
While automation is a key to success, humans still play a crucial role in monitoring efforts, says Burgess: “It’s by projecting ideas, hypotheses and imagery that allows us to see something totally unexpected as a flash of insight, which is how we can often diagnose those pathological errors that are going wrong.”
Mann, in defense of big data: “When it comes to troubleshooting, I think that big data really starts to come into its own, because you do find the needle in the haystack. You need to bounce around cost effectiveness of storage, but, again, in defense of big data, you only miss it when you forgot to collect it in the first place.”
You may want to consider your definition of monitoring, explains Reed: “If you think you have monitoring and you keep running into problems, the problem may actually be how you, as an engineer, or as an organization think about the monitoring space. If you want to really understand all of that, just go get a copy of ‘In Search of Certainty’ by Mark Burgess.”
Want more Continuous Discussions (#c9d9)? We hold our #c9d9 podcast every other Tuesday at 10 a.m. PT. Each episode features expert panelists talking about DevOps, Continuous Delivery, Agile and more.