Charity Majors of Honeycomb.io: Rethinking Observability

Thursday, 11 April 2019

Engineers should embrace the fact that a lot of testing is done in production, says the one and only Charity Majors. The co-founder of Honeycomb.io joins hosts Michael Neale and Karen Taggart to discuss metrics and observability.

Michael Neale: Hi, everyone. This is another DevOps Radio podcast, episode number $episode_number. I hope bash substitution works in podcasts. I’m sure it does. We have Charity, or Charity Majors, if she goes by her full name. I’m sure she can just go by one name, Charity. It’s like Madonna.

Charity: Mm-hmm.

Michael: From Honeycomb.io, and Karen and myself, Michael, from CloudBees. And I thought to kick this off, Karen will be driving things, and Karen’s going to actually do the bio and intro to Charity, because she’s a crazy, crazy fan of Charity and has been stalking the Internet.

Karen Taggart: I’m a little nervous.

Charity: That’s sweet.

Michael: She’s gonna save your voice by giving you the intro, so, and then you can correct her at the end if she got anything wrong, so.

Charity: Awesome.

Michael: Over to you, Karen.

Karen: Tiny nervous. So, you know, it’s so, first of all, like Michael said, thank you so much. I know you do hundreds of all these all the time.

Charity: Oh, no, this is super fun. I’m really happy to be here.

Karen: I mean, I know because I listen to them, and I watch them and I read them, so thank you, and I will just give you a quick introduction of what I know about you, but definitely, you know, again, I only know about you the public stuff, so if there’s anything else you want to add.

Charity: Okay.

Karen: First, I just have to tell you that one of the things I admire about you, I’m just gonna go right in for the, like, tear-jerkers here, is that you’re so true to yourself and that you’re not, you don’t back down when you have an informed opinion. And just, as a fellow woman in tech and also a woman in DevOps, a non-dominant personality, whatever we’re calling ourselves these days, it’s just, it means a lot to see that. So, just thanks for doing that.

Charity: Oh, thank you. That’s really nice to hear.

Karen: You’ve been a role model. So, it’s just—whew, I got little chills.

Charity: I feel like you’re encouraging my worst impulses, but I’m just gonna, like, put that aside.

Karen: You know what? You will be too much for some people, and those are not your people.

Charity: Cool.

Karen: That’s my little byline, too. Anyway, the other thing is, too, obviously, people who don’t know, know that you’re freaking connected to everyone in the tech world. And the other thing I liked about watching that happen, at least on the Internet, is that it’s not in a fake way. It’s in this genuine way that’s based on curiosity and progress and relationships. And, you know, things can get weird and heated and nasty in the tech world, and I just, I’ve watched you navigate the Internets at least and that public sphere in a very graceful yet opinionated way.

Charity: That’s really sweet to hear. I feel like there is this divide between the anonymity of us being online and trying to remember that there’s a real human on the other side.

Karen: Yeah.

Charity: Yeah, I just started to meet people in the flesh a couple years ago when I was traveling and doing all these talks, because I was at Facebook and they would subsidize that shit, right? And I felt like the tenor of my relationships definitely changed with these online people once I started meeting a bunch of them and realizing, “Oh, they’re just as awkward, like, as I am.” So, it’s been fun bridging that gap. But it’s interesting that you say that. I actually feel, one of my resolutions for this year is that I’m realizing that my entire, the way, so I grew up fundamentalist and very, very repressed. And learning to talk about myself at all has been a real journey over my entire life. But I’m realizing that it was forged in a world where I had to be very big, you know? Like, “That’s my idea!” You know? “I’m gonna get credit for this!” Or I had to really make an impression on people just to be noticed. And I’m realizing that now that I’m a CEO of a company, those habits don’t serve me well. And I’m really trying to find a way to have more range, I guess, emotional range to not have to be big all the time. Like, my therapist, I have a new therapist. I’ve always hated therapy, but I’m doing it this year. He made the analogy to playing a musical instrument, you know? Like, when you’re a beginner, you’re just, “bang, bang, bang, bang!” And the more fluent you get in playing music, the lighter of a touch you can have, because you have more range. You can be loud when you choose to, but you don’t need to. And that’s something that I’m really gonna be trying to work on this year: when do I need to be big? When is it, some people are afraid of me in my real life, you know? They’re afraid, at my job, to tell me some things. And that just broke my heart, because that’s not a thing I’ve ever wanted, and I don’t want to build, you know, I don’t want people to be intimidated. So, yeah, emotional range, it’s my goal for 2019. Communication is hard.

Michael: So, some other questions. So, the name Honeycomb, where did that come from?

Charity: Oh, that’s a great question.

Michael: Our company also has a bee metaphor, and …

Charity: Yeah, yeah, yeah. We’re totally siblings.

Michael: To be honest, I think a lot of it was, well, the domain name was free, and the Twitter name was free, and it sounds alright. So, what’s the story there?

Charity: Yeah. It sounds alright, why not? Well, so first of all, we were called Bloodhound. That was our name in the beginning. Because we really liked the idea of, like, well, the domain was available, Bloodhound.sh, and we really liked the idea of a little dog, you know, puttering around with a magnifying glass, looking for nuggets of data that were exceptional. And, after saying Bloodhound out loud for a week or two, it just sounded pretty bloodthirsty. And we kinda wanted to rein it in, so we changed it to Hound.sh, and then we got a takedown from the legal team of Hound CI. So, we were like, “Oh, shit. Back to the drawing board.” And we just started, we did this process, which was like beer and coffee. Like, the first step is beer. You’re just ideating, there’s, no idea is a bad idea, you’re throwing anything out. And the second idea, or the second stage, you come back the next day and the second stage is the coffee stage. You drink your coffee and you start going, “Okay, this is terrible. Okay, this is alright. Let’s group these different,” you know, Data Fairy was another thing we were playing with. And we came down to a couple, and the reason that we landed on Honeycomb was because our friends at Slack actually owned the domain, Honeycomb.io, and they were willing to give it to us for just the price of the transfer.

Michael: Aww!

Charity: I know! It was really sweet! It was apparently one of Stewart’s, it was a personal project of Stewart’s some years ago, they didn’t have any big plans for it, so, you know, we paid the lawyer’s fees and they transferred it to us.

Michael: Right.

Charity: As well as the shortener, Hny.co. So, that really trumped it. That was the right decision, though. I’m looking back, and I’m like, some of the other options that I really liked were Truffle Pig. Truffle Pig, which is not as good. Honeycomb was perfect. And it goes on, like, the deeper you look at it. Because, you know, honeycombs are the structured data, and it’s run by queen bees, and we have two women co-founders, and very industrious and sweet. We just love all the connotations. Your ___ , Karen, was mentioning the connections…

Karen: I’m back, by the way. Apologies.

Charity: Yea! Hi, Karen!

Karen: The Internet went down, of course. Perfect timing. Anyway.

Charity: It happens. But I don’t have, like, I don’t know Stewart or anything, but I know a couple of the engineers there for a long time, and so, I was able to reach out to them. You know, it’s like, these connections that we have in the tech world, it can be, it can seem like a conspiracy or like a closed group, I think, sometimes, which is why I try really hard to go out of my way to have a circle that encompasses a lot of women and underrepresented minorities. But they’re just people, and everybody is so helpful, you know, if you know them. So, Honeycomb.io, that’s where we got it.

Karen: I have read a lot about your company, and the concept, obviously, of observability, which is a large word with a lot of details -

Charity: Yeah.

Karen: - is hot right now. So, I’d love to have you talk about why you think you need to start a new company. You know, what do you guys bring to the table that’s different and new and why are you needed?

Charity: Yeah. Totally. And that’s a great question, because when we started, three years ago, our third birthday is on January 1st, we kept being told, you know, “This is a solved problem. There are all these big companies out there that are already doing everything that can be done in this space.” You know, people were very dismissive of us. And we had had experiences that kind of rooted us in the idea that, “No, there’s something that’s missing.” You know, we had had these experiences of being at Parse and at Facebook and having all the traditional monitoring tools, and our lives just sucked, you know? Like, we were spending all of our time tracking down these one-offs on a platform, writing all these custom tools. And like, it was clear to us that the problem hadn’t been solved. And so, we started building something that was similar to what we had seen solve this problem. But, like, looking back two, three years, we didn’t know how to talk about this at all. Like, every word that was used was so overloaded. Everybody does monitoring, metrics, you know, every data term is so played out, which is why I was looking around in the dictionary, the Wikipedia. I’m just, like, looking around, and that’s when I stumbled across observability. And I realized that, you know, it has an origin there that means exactly what we were trying to get at that the other tools weren’t doing. Which was the ability to understand the inner workings of a system just by asking questions from the outside. You know, not having to SSH in, not having to predict what the question is going to be and write code to anticipate it. Because the entire nut of it that we were trying to get at was that the world is too complex to predict all of the problems you’re gonna have and write checks for it or write, you know? And you need to be able to dynamically ask ad hoc questions of your data. So, you have to capture it at this level of granularity where nothing has been pre-aggregated.
Nothing has been pre-processed. There aren’t even any indexes, you know? It has to be a level playing field. And I loved this term from control theory because that’s what it means. So, we started using it, and it kinda started taking off, and now there’s kinda the battle between people who want the old definition, which is just a synonym for telemetry, and we’re over here saying, “No, no, no, this should mean something technical and specific because it addresses a need in distributed systems.” Which is more and more of what we all have to deal with is these far-flung systems with third party services that we’re talking to, microservices, you know, a request enters your system, it might have two or three dozen hops before it actually returns the answer to you, and this is just a fundamentally different world than the 20, 30-year-old system where the whole study of monitoring and metrics were where it comes from. So, our origin story was just that you know, I was at Parse running infrastructure as the mobile backend as a service. We had a million mobile apps running on this service, and every day, people would be coming to me, like, “Parse is down!” And I’d be like, “No, it’s not! Behold my wall of dashboards, they’re all great. Like, you’re wrong.” Which is a great thing to do, just tell your customers they’re wrong and it’ll get you everything. And they’d be like, “No.” So, maybe it’s Disney, right? Disney’s like, “Parse is down!” I’m like, “No, it’s not, okay, but you’re Disney, so I have to respect this. I’m gonna go and investigate.” And I would go, or I’d dispatch an engineer, one of us would go and investigate, and it would take us hours if not days if not just open-ended, we might just have to call off the search. Because the range of possible answers was just too big, you know? We were a platform, you could write your own queries and upload them, write your Javascript, upload it. We just had to make it work. But that leads to cotenancy problems, right? 
Maybe it’s your queries that are timing out, maybe it’s because it’s a bad query, and maybe it’s because your neighbor wrote a bad query or had a spike in traffic and the performance of that node is just slower. You know, multiply this by a million, and you can start to see the difficulty. And the thing that finally helped us get, I tried every tool out there. Every monitoring, every log, everything, nothing. Nothing worked. And we were, we had stopped shipping features because we had so many things to try and explain. And the thing that finally helped us get a handle on it was this tool at Facebook after we were acquired called Scuba, which is aggressively hostile to users. Like, it is not fun to use. All that it lets you do is slice and dice on dimensions of arbitrarily high cardinality in real time. And, suddenly, we were able to go from, “Disney is slow, there’s a universe of possibilities, so start looking,” to just slicing and dicing from the top. And we went from not being able to resolve these in hours or days to just seconds or minutes. Like, every time. And our sales team could do it, it was so repeatable and so easy. And that kind of blew my mind. And I’m from Ops, so, we fixed it and then I moved on, and I didn’t really stop to think about why that was, moved on to the next problem. But then, when I was leaving Facebook, I suddenly went, “Oh, shit. I don’t know how to engineer any more without these tools that we’ve been using and that we’ve built.” And that was a sobering time for me, because I realized it wasn’t just about, you know, monitoring for apps and everything, but it’s so ingrained into me as how I engineer, how I look at the world, how I decide what to build, how I know that what I’m shipping is what I think that I wrote, you know, how I validate it at every step of the process. But it was like, I’m blind. I wear glasses, basically. I’m not blind, but I can’t see.
It’s like, you know, having glasses and then going to drive the car and being told I can’t have them. Like, I really, I don’t know how to exist without it. And that’s what made it so compelling to me. It was just this visceral sense of, “Oh, even if we fail, it’s fine, we’ll open source it. I need to have this tooling, because I am a tenth as effective as an engineer without it.”
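Editor's sketch: the "slice and dice" workflow Charity describes can be illustrated with a few lines of Python over raw, un-aggregated events. This is not Scuba's or Honeycomb's API. The event fields (`app_id`, `endpoint`, `host`, `duration_ms`), the sample values, and the `slice_by` helper are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw request events; every field name and value is invented.
EVENTS = [
    {"app_id": "disney", "endpoint": "/query", "host": "db-1", "duration_ms": 950},
    {"app_id": "disney", "endpoint": "/query", "host": "db-1", "duration_ms": 1020},
    {"app_id": "disney", "endpoint": "/login", "host": "db-2", "duration_ms": 40},
    {"app_id": "acme", "endpoint": "/query", "host": "db-1", "duration_ms": 880},
    {"app_id": "acme", "endpoint": "/query", "host": "db-2", "duration_ms": 35},
    {"app_id": "zed", "endpoint": "/query", "host": "db-2", "duration_ms": 30},
]


def slice_by(events, *dimensions):
    """Group raw events by any combination of fields and average their duration."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        groups[key].append(event["duration_ms"])
    return {key: mean(values) for key, values in groups.items()}


# Start from "Disney is slow", then ask a different question of the same data:
by_app = slice_by(EVENTS, "app_id")
by_host = slice_by(EVENTS, "host")
by_app_and_host = slice_by(EVENTS, "app_id", "host")
```

Because nothing was aggregated ahead of time, the second and third questions need no new instrumentation. In this invented data, breaking down by `host` suggests the slowness follows `db-1` rather than any one app.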

Karen: Can you talk a little bit more about that? Like, give an example of something that you could see with that new tool that you couldn’t see before and how that impacted your delivery cycle?

Charity: Yeah, for sure. Okay, so here’s a thing we did just last week, or a couple weeks ago. So, we wrote our own storage engine, because, so, let me start. Cardinality, basically, when I use the term, I mean the number of possible values in a set. So, if you have a table of 10,000,000 citizens of the United States, the highest possible cardinality would be, like, Social Security number, any unique ID. Very low cardinality would be, like, gender is pretty low. Species equals human is very low, right? So, when you’re thinking about debugging, all of the information that you really care about is going to have a pretty high cardinality, right? Your ___ ID, your first name and last name, shopping cart IDs, et cetera. And yet, the whole heritage of metrics is focused on low cardinality dimensions. You can’t have, and you certainly can’t combine a bunch of high cardinality dimensions, which makes it very hard to zoom in on any particular user or event or shopping cart, et cetera. So, in order to support this, we had to write our own storage engine, because nothing out there actually does that. So, we wrote our own storage engine, and a while ago, one of our engineers was like, “Oh, I think we can get some real cost savings for our users if we introduce some compression.” And, you know, this is a non-trivial thing to do. And so, we wanted to be sure that it was worth doing. So, we added some details into our storage engine to report when certain scans or queries were run to tell us what was scanned and how long it took. So, we were able to see that it would save us, on average, about 75 percent of the query time if we rolled out this fix with compression. And we were also able to see at a very fine grained level which users would see a lot of improvement. Some users are gonna see no improvement. In fact, some users are going to take a small hit to their query performance and their storage footprint.
So, it was important for us to see not just the average, the 90th percentile, the 99th percentile, but to be very aware of, who is this actually gonna hurt? You know, show us all of the outliers, because averages cover over a multitude of sins.
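Editor's sketch: two ideas from this answer fit in a short Python example. One is cardinality as the number of distinct values a field takes. The other is looking past the average to find who a change actually hurts. The rows, per-user timings, and helper names below are hypothetical and are not Honeycomb's data or code.

```python
from statistics import mean

# Hypothetical table rows; field names and values are invented.
ROWS = [
    {"user_id": "u1", "species": "human"},
    {"user_id": "u2", "species": "human"},
    {"user_id": "u3", "species": "human"},
    {"user_id": "u4", "species": "human"},
]

# Hypothetical per-user query times in milliseconds, before and after a change.
BEFORE = {"u1": 400, "u2": 800, "u3": 1200, "u4": 100}
AFTER = {"u1": 100, "u2": 200, "u3": 300, "u4": 130}


def cardinality(rows, field):
    """Number of distinct values observed for a field."""
    return len({row[field] for row in rows})


def regressions(before, after):
    """Users whose query time got worse, which an average alone would hide."""
    return sorted(user for user in before if after[user] > before[user])


average_improved = mean(AFTER.values()) < mean(BEFORE.values())
hurt_users = regressions(BEFORE, AFTER)
```

Here `user_id` has high cardinality (one value per row) while `species` has a cardinality of one. The average query time drops sharply, yet `u4` gets slower, which is the kind of outlier Charity says you need to be able to see.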

Michael: And, at this point, I have flashbacks to ignoring statistics lessons and averages and means and things like that, and I’m sure you’ve got lots of experience there that, you know, the typical developer doesn’t in this.

Charity: Yeah, they should be able to trust their metrics. And yet, the way that people model this in their head often doesn’t bear any resemblance to the way their vendors are doing it under the hood. Because your typical vendor out there is doing, because it’s expensive to store every data, they do these rollups and these averages and these percentile calculations on the client side. So, if you’re looking at the quote-unquote 99th percentile, you’re not actually looking at the 99th percentile, you’re looking at the averages of the 99th percentiles that were computed on each host over a period of time.

Michael: Right.

Charity: Which is very non-intuitive and leads people to having confidence in things that they shouldn’t, and then they’re confused why they can’t drill down into that is because that original event was discarded before it ever got written to disk.
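Editor's sketch: a small Python example shows why an average of per-host 99th percentiles is not the 99th percentile of all requests. The host names and latencies are invented, and the nearest-rank `percentile` helper is one common definition chosen for illustration, not any particular vendor's method.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds. host-b serves few requests, all of them slow.
LATENCIES = {
    "host-a": [10] * 99 + [20],
    "host-b": [500] * 10,
}

# What a client-side rollup keeps: one p99 per host, later averaged together.
per_host_p99 = {host: percentile(values, 99) for host, values in LATENCIES.items()}
averaged_p99 = mean(per_host_p99.values())

# What you get if the raw events were kept: the p99 over every request.
all_requests = [ms for values in LATENCIES.values() for ms in values]
true_p99 = percentile(all_requests, 99)
```

With these numbers the averaged figure and the true 99th percentile disagree by roughly a factor of two, and once only the per-host summaries are stored there is no way to recover the individual slow requests.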

Michael: Right.

Karen: Right. Isn’t it sometimes those crazy, or perceived crazy edge cases or outliers that really are the actual identifier of the real problem?

Charity: Yep. Absolutely.

Karen: And that’s one thing I was gonna, you nailed it on the head. I wanted to ask you about, you know, you have a very strong opinion about the difference between a metric and an event—

Charity: Yeah.

Karen: - and how that impacts observability.

Charity: Yeah, I do. I don’t think you can have observability unless you have events instead of metrics. And the reason is, every metric is like its own little island. It’s not connected to the rest of what happened in that request at all, right? You’ve got counters, you’ve got gauges, you’ve got—et cetera. But they’re all, you know, incremented counters and request passes here, but it’s not, it’s lost the entire connective tissue. Like, you can’t say, “Oh, these 200 metrics all describe the same event.” You can’t do it. You can get those, you can derive the metrics from the event if you stored it that way, but it doesn’t work the other way around. Once you have decomposed that event, you can never get it back again. But that’s all you really need for debugging is that original event. Because this is how, this is like, that is the most important set of relationships is, all these things describe this request as it made its way through your system.
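Editor's sketch: the one-way relationship between events and metrics can be shown in a few lines of Python. The event fields and the `derive_metrics` helper are hypothetical and exist only to illustrate the point.

```python
from collections import Counter

# Hypothetical wide events, one per request; field names and values are invented.
EVENTS = [
    {"request_id": "r1", "user_id": "u1", "endpoint": "/cart", "status": 200, "duration_ms": 40},
    {"request_id": "r2", "user_id": "u2", "endpoint": "/cart", "status": 500, "duration_ms": 900},
    {"request_id": "r3", "user_id": "u1", "endpoint": "/pay", "status": 200, "duration_ms": 60},
]


def derive_metrics(events):
    """Collapse events into counters. Each counter is easy to compute from the events."""
    return {
        "requests_total": len(events),
        "errors_total": sum(1 for e in events if e["status"] >= 500),
        "requests_by_endpoint": dict(Counter(e["endpoint"] for e in events)),
    }


metrics = derive_metrics(EVENTS)

# A debugging question that only the original events can answer:
users_with_errors = [e["user_id"] for e in EVENTS if e["status"] >= 500]
```

The `metrics` dictionary says one request failed, but nothing in it links that failure to a user, an endpoint, or a duration. Only the stored events keep those fields together for the same request.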

Michael: I think—

Charity: And I’m genuinely asking, did that make sense? Because I try to explain this a bunch of ways, and sometimes I’m not sure if it lands or not.

Karen: It makes sense to me. Michael, did you have any?

Michael: Yeah. You know, I definitely, it makes sense to me.

Charity: Okay, cool. Just checking. Thanks.

Michael: No, there’s a lot of subtle stuff in here and then it starts getting into statistics.

Charity: There is, yeah.

Michael: My main takeaway is, I should trust the experts, here.

Charity: Yeah, really. Nobody should have to understand all this shit, seriously.

Michael: So, I had a question on, sort of circling back to the observability terminology. So, I was at KubeCon recently, and they were doing their keynote thing. It was all pretty interesting. And then they sort of rolled onto Prometheus, and then they basically defined the category as observability, and now, we’re talking about—

Karen: Yeah, they did.

Michael: - observability, and it seemed like there’s a bit of find and replace going on with monitoring to observability. Is that—

Charity: Yes. Oh, you don’t say?

Michael: You’ve ___ that as well? Yeah, right.

Charity: Yes. Literally, they’re just going through their marketing pages and going, “Well, monitoring isn’t hot anymore. Let’s say observability.”

Michael: Right, okay. Is that a bad thing, or is that—

Charity: You know, I think it’s inevitable. I do want to make sure that the, so, I would be championing this whether I had a company or not, but I also accept that I am considered compromised because I do have a thing that I’m trying to sell. There is a technical definition to observability that predates all of this, right? It comes from control theory. Look it up in the Wikipedia. It says, you know, “The amount that you can infer what’s happening on the inside of the system just by observing the outside.” And I’m not saying that metrics are useless. What metrics are good for is describing the health of the system or the component as a whole, right? If you’re looking for capacity planning, metrics are perfect. If you’re looking for, what does the, but when you’re looking for debugging when you’re a software engineer trying to understand the intersection of your code with that system. You don’t actually care about the health of the overall system. You care about, can each request, can it execute from start to finish successfully? That’s all you care about. And metrics, Prometheus is fucking useless for that. It is not going to help you debug because they don’t store any information about the health in the context of the event. They can’t, because you can’t really do that with metrics. So, it does piss me off, so I think that distributed systems are everyone’s destiny, increasingly. Like, everybody’s a distributed systems engineer now, and I mean, we need tools that help us do that well. So, it annoys me that people aren’t, that they’re, in their own self-interests, racing to claim the mantle, “We’re doing observability, too!” when they don’t actually address the actual problem. Yeah, it does. It does annoy me, but nothing I can do about it, so I try not to get too worked up.

Michael: Mm-hmm. Karen, you were gonna move on to some other topic areas now, as we’re getting, I’m just keeping an eye on the time. Yeah, we’ve got plenty of time. Yeah, go for it.

Charity: I will rant about this stuff for as long as you keep prompting me, I guarantee. Karen?

Michael: Have we lost Karen again? Oh, no.

Charity: Uh oh.

Karen: Can you guys hear me now? I’m back.

Charity: Yes!

Michael: Yeah, we can hear you now.

Karen: Okay. Sorry, sorry. You know, technical stuff, it’s hard! On a related question, you talk about how things have changed, you know, that 30 years ago, 25 years ago, these tools we had would work, but a lot of things have changed. And obviously, one of those big things that’s changed in the past 5, 10 years or whatever is continuous deliverability.

Charity: Yes.

Karen: - and continuous delivery. How do you think observability, fitting in lots of terms here, observability is really new, but is a part of CD?

Charity: Oh, God. That’s such a great question, and it’s so rich. And I feel like some of the vocabulary that we need, we’re still kind of feeling out. But I feel like, okay so, you know how it used to be you’d run your unit tests on your laptop and then you’d run it in staging, you know, and then you’d ship it to production. I feel like the lines are blurring, and I feel like the only way that you can know, that you can gain confidence in what you’re doing is through observability. I feel like people have this mental model of shipping to production, like it’s a switch, right? It’s on or off. “Oh, it’s in testing. Oh, it’s in production.” And that’s the wrong metaphor. It’s more like it’s baking in an oven, right? You put it in the oven to bake and you slowly gain confidence in it over time. And you wanna put it in as, alright the metaphor might fall apart or whatever, but you really wanna get it out there as soon as possible so that you can start gaining confidence in it, bit by bit. No engineer worth their salt ships to production and immediately goes, “Woot, it passed tests! Done! I trust it!” Like, no. You know not to green code production. You watch it for a while. You try and make sure that it does what you wanted it to do. So, you simultaneously want to pull that moment of beginning to bake it in production earlier and earlier, right? Maybe you ship it with feature flags and use it selectively for just your team or just some beta groups or whatever. But you wanna start baking it in, and you want to bake it as gradually as possible, with as few sharp edges, right? And this is key to continuous delivery because if you don’t have this concept of partial baking and gradually gaining confidence, you’re in for a lot of bad surprises and rollbacks that just kind of grind the whole pipeline to a halt, right? 
I feel like every engineer should have the muscle memory if they have, they have to have something like Honeycomb to do it, but if they do, like, the engineers at Facebook would do this, right? Push something to prod, nobody can use it. But they turn the feature, the flag on so that 10 percent of users can do it. And then they watch. They break down by old build ID versus the new build ID, and they just see, are there any mismatches, you know? Or, if they’re shipping a change, do they expect to lower the RAM, say, of the footprint of the node? Is it doing what they thought it would do? And as they gain confidence, they slowly turn that throttle up, and that means that there’s no surprises, right? No shocks. No if you’re shipping a component that you expect to hammer your back end, you turn it on slowly and you turn up the dial, you know? Seventy, 80, 90 percent still look good? Okay, right? It’s just, it’s the mature, adult way to ship software.
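Editor's sketch: the gradual rollout described here, turning a flag on for 10 percent of users and then comparing old and new build IDs, might look like the following in Python. This is an assumption-laden illustration, not Facebook's or Honeycomb's tooling. The flag name, the hashing scheme, and the event fields (`build_id`, `status`) are all hypothetical.

```python
import hashlib


def in_rollout(user_id, flag, percent):
    """Deterministically map a user to a bucket from 0 to 99; enabled if bucket < percent."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def error_rate_by_build(events):
    """Break request events down by build ID and return the error fraction for each."""
    totals, errors = {}, {}
    for event in events:
        build = event["build_id"]
        totals[build] = totals.get(build, 0) + 1
        if event["status"] >= 500:
            errors[build] = errors.get(build, 0) + 1
    return {build: errors.get(build, 0) / totals[build] for build in totals}


# Hypothetical events emitted while the flag is only partly turned up.
SAMPLE_EVENTS = [
    {"build_id": "old", "status": 200},
    {"build_id": "old", "status": 200},
    {"build_id": "new", "status": 200},
    {"build_id": "new", "status": 500},
]
rates = error_rate_by_build(SAMPLE_EVENTS)
```

Hashing the flag name together with the user ID keeps each user's bucket stable, so anyone enabled at 10 percent stays enabled as the dial moves to 50 or 90. If the error rate for the new build diverges from the old one, the dial goes back down before most users ever see the change.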

Karen: So, related, there’s sort of that blending of the line. It’s not just the blending of the line of the industry, it’s the people, you know? The developers are now learning Ops, obviously, and the operations people have to be developers, and this is gonna be probably a difficult question, but if you had to say there was one thing that a new or seasoned developer or engineer had to sort of learn right now that would help them the most in this area in the next few years or whatever, what do you think that they should focus on?

Charity: Put them on call. They’ll be forced to learn all of the right things for their system. This is what I tell all managers, because—

Karen: I love that. I love it!

Charity: Put them on call. Yeah, it immediately, I can predict—

Karen: So wait, engineers who are building new features should be on call for support of the current product, or for Ops, or for both?

Charity: I’m making a blanket statement that all engineers who are shipping code to production should be on call. Now, there are infinite permutations and combinations and different ways to do this, right? It all depends on your teams, you know, their strengths, the type of code, the type of software. It’s hard for me to get really prescriptive beyond that, you know, daytime, night time. The implementation details are implementation details. But they should be on call for their software. They should be in the line of support, pretty close to the front. And I’m happy to describe some situations that I’ve seen work, but that’s the only thing that I wanna be hardcore about—you should be on call for your software.

Michael: So, “you build it, you run it” sort of—

Charity: You build it, you run it. Yep. Absolutely.

Michael: So, in the—

Charity: And the flip side of that, though, the flip side of that, though, is that it is the responsibility of management and the teams to make on call not suck. It is completely reasonable to ask someone to carry a pager one out of every six weeks. If it only goes off once or twice, then you don’t usually get woken up, right? Absolute responsibility of the business to hold their end of the bargain and make it not too life impacting.

Michael: Right. That was exactly what I was about to ask next.

Charity: Okay, good.

Michael: But before the podcast, we were sort of, you know—

Charity: Yeah, nobody should be killing themselves on this stuff, no.

Michael: And yeah, it was like, what’s your thoughts on the impact, the human impact of being on call, and I think you kind of hit the nail on the head. If the load is reasonable, then it’s not gonna be anxiety inducing.

Charity: Yeah. Exactly, it shouldn’t be. Because the flip side is this, right? Nobody should have to carry that burden, you know? If developers aren’t doing it, who’s going to? Are you just gonna punch it off on some other team who maybe gets paid less and has less respect and who is less equipped to fix it, right? This is a virtuous feedback loop where it’s because it makes for better systems when the people who have the power and the context and the knowledge of how to fix it are the ones who are alerted, right? Otherwise, some other poor schmoe who doesn’t know how to fix it or who has to spend all day finding the context to fix it you know, it’s less effective, it’s less efficient, and it leads to worse systems over time.

Michael: Moving on to another topic. Karen put my name down to ask this. So, testing in production, it’s like a great, it’s a great tagline.

Charity: Oh, thank you.

Michael: I know exactly what you mean, but can you talk about that? And what I would like to hear is, does an organization need to be at a certain scale to really benefit from that, or can you start at any size with this kind of thinking in practice?

Charity: Well, sadly, the truth is that everyone does test in production. And not even intentionally. Any time that you ship a build to an infrastructure at that point in time, it’s a unique test. It just is. You can only test so much. You know, you can test it in staging, it could be a perfect mirror of prod, and yet a node can go down between then and the time you rolled out to prod and everything would get fucked, right? So, you just kind of have to, part of distributed systems is just embracing the fact that is inescapable that your systems exist in a constant state of partial degradation, right? Unpredictably so.

Karen: I’m sorry, that’s the quote of the day. We exist in a what is it, a state of degradation. A constant state of degradation. I love it.

Charity: Yeah, exactly. Thank you.

Michael: So, that’s a little, you know, I come from more the developer side, but I have enormous respect for people that come from the ops side. In fact, if they say something, I generally believe them, because they have more experience. But that’s kind of upsetting. For a lot of people, it’s like, they think things should be, the normal state is perfection, so you’re saying, basically, the normal state is—

Charity: It’s not.

Michael: - there’s always something broken.

Charity: There are a million things broken right now. So, we intentionally make things escalate to a certain threshold before they let a human know, just because of the sheer number of things going wrong all the time, right? It’s just a fact. And we’ve been able to kind of avoid dealing with this fact because we have had systems that were small enough that we could fit them in our puny human brains and reason about them. And increasingly, we cannot, and we just have to let go of this false, you know, this idea of perfection and embrace the fact that, it’s very liberating I find. I find it very freeing and liberating. There are things going on all the time. It’s my job as a ________ to manage that and make it so that our users are not impacted. You know how many things can break before users notice? Like, it’s our job to make that a very, very long list. A third of Amazon can go down, and you should never notice if you’ve done your job correctly and balanced across availability zones, right? Your job is to make it so that lots of things can break, humans can notice at their leisure, and remediate them before it ever gets to the point that users notice. So, the important part there is, are users noticing problems?
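
The idea of letting failures escalate to a threshold before a human is told, and of spreading capacity across availability zones, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Signal` type, the `should_page` and `surviving_capacity` helpers, the threshold of three background failures, and the zone sizes are all hypothetical and are not taken from the conversation or from any Honeycomb product.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """One thing that may be broken right now (hypothetical model)."""
    name: str
    failing: bool
    user_facing: bool  # True if this failure is visible to users


def should_page(signals, max_background_failures=3):
    """Wake a human only if users are affected or failures pile up.

    The default threshold of 3 is an arbitrary, illustrative value.
    """
    failing = [s for s in signals if s.failing]
    if any(s.user_facing for s in failing):
        return True
    return len(failing) > max_background_failures


def surviving_capacity(zone_capacity, lost_zone):
    """Total capacity left after one availability zone goes down."""
    return sum(c for zone, c in zone_capacity.items() if zone != lost_zone)


signals = [
    Signal("disk-retry", failing=True, user_facing=False),
    Signal("node-7-down", failing=True, user_facing=False),
    Signal("checkout-errors", failing=False, user_facing=True),
]
zones = {"a": 50, "b": 50, "c": 50}  # assumed peak demand: 100 units

print(should_page(signals))
print(surviving_capacity(zones, "a") >= 100)
```

With two background failures and no user-facing failure, nobody is paged, and losing any one of the three equally sized zones still leaves enough capacity for the assumed peak demand.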

Karen: So, some of this seems like, you know, you talked about thresholds and these millions of lines of code and these large dashboards. And I’m, of course, thinking I’m the little person, you know? If I just have a smaller project or a smaller market or a smaller—

Charity: Oh, if you don’t have to deal with distributed systems, by all means, don’t. Please don’t. Easy is easy.

Karen: Oh, good.

Charity: If you can solve your problem with a LAMP stack, you know, do it like it’s 1999. That’s great. That technology is well understood. It fails in predictable ways. You can Google for the errors and find them immediately. Like, if you can solve your problems with that, do it. Don’t invite trouble. The problem is that, for most of us, we can’t anymore.

Michael: So, in your opinion, there is sort of a, like, if you can do things simply and—

Charity: Yeah!

Michael: - and if you don’t need to scale, don’t scale, but yeah.

Charity: No, no! Don’t borrow trouble. It’s hard! Distributed systems are incredibly hard. Don’t do them if you don’t have to.

Michael: Right, that’s the, who is it, is it Martin Fowler, or someone, “The first rule of distributed systems is don’t do it.”

Charity: Don’t do it! Don’t do it unless you’re absolutely forced to do it.

Michael: Right. So, yeah, I guess we’re getting towards time, so we probably should wrap up. Is there anything else, Karen, you wanted to cover?

Karen: The last thing I wanted to ask, is there anything, Charity, that you think either Michael or I or obviously, our listeners should seek out? Any talks, any specific conferences or blogs or particular books that you think would, I don’t know, help continue our learning in this area?

Charity: Ooh, boy. I really like Accelerate, the book that Jez and Nicole just wrote, because it shows that the speed of shipping and the accuracy or the lack of failures aren’t actually in tension with each other, they are self-reinforcing. If you get better at one, you tend to get better at the other. And I really like, oh what is that thing that I read? Oh, there’s a talk by Bo Linton on understanding your systems and he just starts out with, it’s hilarious, and it’s on point. It is funny. He is not a Honeycomb customer, but it ends up being basically the best possible advertisement for our Honeycomb style observability. But along the way, it touches on politics and teams and how to structure your launch and your events. It is just brilliant. Everyone should watch that talk.

Karen: Well, thank you, again, so much.

Charity: Thank you for having me!

Karen: I mean, I have, I could talk with you for hours, you know? I’ll always follow up with you directly with any questions. 

Charity: Invite me back some time. I’d love to do this.

Karen: Yes! I would love to, actually, at some point have you back and maybe talk with someone who sees things a little differently because I’d love to have a little debate. It’d be enjoyable.

Charity: Totally! That’d be really fun. Any time. Alright.

Michael: Has Karen told you about her idea of drunk DevOps?

Charity: No! So, we were thinking about doing that, too, like, a drunk, yes, a drunk DevOps type of thing.

Karen: Yeah, like, you’ve seen Drunk History, right?

Charity: Yeah.

Karen: The little, yeah.

Charity: DevOps is begging for it.

Karen: Yeah, just get drunk and just talk about DevOps. I mean, yeah.

Charity: I almost set this up two months ago, and then Barron didn’t make it, and I didn’t follow up.

Karen: When it happens, please consider me to be there because I would have a great time.

Charity: The next time 

Karen: And Michael, although Michael’s Australian, so I think he could probably drink us under the table.

Charity: That’s probably true. We’ll see.

Michael: That’s just, you know, we just call that breakfast.

Karen: What do you call it? 

Michael: Yeah, we just call it breakfast time. That’s it.

Charity: There you go.

Michael: Well, thanks very much for taking the time, Charity, for this therapy session/podcast.

Charity: Yeah.

Michael: Most helpful, and yeah, we’ll have to talk to you again when there’s more stuff to talk about at some point, and I’m sure we’ll cross paths at some point.

Charity: Definitely. More internet.

Karen: More internet, yes.

Michael: Alright.

Charity: Awesome. Alright, thanks.

Karen: Thank you.

Like what you’ve heard today? Don’t miss out on our next episode. Subscribe to DevOps Radio on iTunes or visit our website at CloudBees.com. For more updates on DevOps Radio and industry buzz, follow CloudBees on Twitter, Facebook, and LinkedIn.

Michael Neale

Michael is a developer and manager with an interest in continuous delivery and artificial intelligence. He is a cofounder of CloudBees and has worn many hats over the years. He lives in Australia and often can be found on a plane when he isn’t on a video call.
