Daniel Roston - Don't Play Games When It Comes to DevOps
In this episode of DevOps Radio, we're at Jenkins World 2017 with Dan Roston of Electronic Arts. He'll tell us about his role in providing CI automation across EA, the correlation between containers and continuous delivery, and the challenges of mobile as a platform.
Sacha Labourey: All right. This is DevOps Radio. I'm Sacha Labourey, CEO of CloudBees. My guest today is Dan Roston and you work at Electronic Arts. Welcome, Dan.
Dan Roston: Thank you very much for having me.
Sacha: If people hear a bit of noise around, it's just because we are in the middle of DevOps. Essentially we are at Jenkins World 2017. We're in San Francisco, and it's a lot of fun.
Dan: It's a blast, yeah. I've been here for a couple years now, previous Jenkins Worlds, and the growth of it is just fantastic.
Sacha: Have you had time to attend some sessions and so on?
Dan: Yeah. I really enjoyed the keynote that occurred earlier today. I thought that was great.
Dan: That one guy was perfect.
Sacha: All right. Talking about the keynote, we talked this morning around how a lot of companies have a centralized DevOps team that they're essentially using to standardize around a common set of practices, tools, and so on. Right?
Sacha: And that's exactly what you're doing. Right?
Dan: Exactly, right. I am part of a central team that provides CI automation across the entire EA organization, with games being developed all around the world. Yeah, we're predominantly just supporting the developers, providing automated testing and builds, and producing results for them as quickly as we can.
Sacha: All right. People are going to be upset if I don't ask what games. Give me some names.
Dan: That's what everybody wants to hear. Electronic Arts is always the cool industry. I've worked in FIFA, the big FIFA, Plants vs. Zombies: Garden Warfare, a very popular one as well. I've supported some mobile titles as well. Currently I'm supporting the Frostbite engine itself that various teams then use to build up their game.
Sacha: All right. You were telling me just before we started that essentially you joined just before the company was really ramping up that team, and you saw this explosion of projects that have been developed. Can you talk more about that?
Dan: Yeah. When I joined, I was part of the test engineering group. So what we were tasked with was essentially ensuring a given build is good enough for QA to test. So we would write scripts to go down critical path, make sure it doesn't break, and QA doesn't spend their time testing on a build that we already know is broken. Now this evolved to work very closely with the build engineering group as well. Everybody we merged together, such that we could produce build automation and test automation across all of the various titles.
Sacha: So _____ was that adoption phase, right, because a lot of people had not done that before. So any gossip you can share with us?
Dan: I don't know about some gossip, but I'll tell you some interesting stuff, is just the burden that QA had before is vastly reduced. Prior to our involvement, they would have to build a build on their machine, which is not necessarily a sterile environment as a virtual environment. It was prone to error. QA would need to have technical knowledge on how to build the game. They would then have to test it all throughout themselves. So we were able to solve kind of two problems. One was remove the burden from QA to require the knowledge to build a game, and just reduce the manpower required to find bugs. Our sole goal is to find bugs, and QA should just be testing to see how fun the game is. They shouldn't be doing the edge case tests.
Sacha: I see. That's interesting because you're saying that now, after the transition, there is less, quote/unquote, work for the QA team. I think when we talk about people that are new to that space, they feel like, "It's all great to release more frequently and increase velocity, but there is only so much testing we can do." That might seem contrarian to the idea that doing more actually leads to less. Can you talk about that?
Dan: Yeah. We are able to kind of increase the breadth of our tests. We're able to essentially take a lot of the routine maintenance or I would say the routine tests that QA would do. So they would have a series of checklists, "When you get a build, you must do this, do this. Make sure that's good," and essentially remove all of the things that can be obviously automated, work with the developers for the things that _____ does easy and try to make that automated, so QA, when they grab the build, they're just working with the developer, testing just that new feature and testing the balance of the game. Make sure it is playable and make sure it's fun. Make sure that you can compete well with the other QA partners and stuff like that.
Sacha: Do you have a metric for fun?
Dan: QA might, but that's something I find very hard to measure myself. That's funny.
Sacha: Can you describe a bit your environment? What does it look like?
Dan: We've got a virtual environment on-premise. It's set up using Chef. It's predominantly Windows because we build Xbox games, so it's a dependency there. We've got, at last count, over 1,000 virtual machines that are built at any given time. We've got datacenters in Vancouver, where I'm from, also in Florida, where a big studio is as well, and even all around the world. There's a big studio in Romania as well, in Stockholm as well. We are expanding very, very quickly.
Sacha: Wow. What's the _____ in terms of service? I assume you have some metrics you can share with us or ____ you can quantify.
Dan: I don't have the exact numbers offhand, but we've got north of, I would say, 3,000 virtual machines just building. We've got, I would say, last count it was 72 Jenkins masters that are orchestrating all the work. We've got a whole bunch of virtual machines there, just hosting internal services for routine tasks internally and stuff like that.
Sacha: Okay. So what we're seeing is a high correlation between containers and continuous delivery. Obviously you're working a lot on Windows. Microsoft is doing a lot of work in that area with Docker. Have you looked at that? What's your status there?
Dan: We are super-eager to use Docker and all their container technologies. For a little context, from a clean machine, to build our binaries can take up to three hours. We've got a huge code base. We do a lot of data manipulation to build all the assets and things like that. But we want to make it such that we can build from a clean environment very quickly, so that we can just generate – mount a container and build off that. So right now we're investing very heavily on caching systems that exist off of the virtual machine itself, basically positioning ourselves that we can really make use of Docker and other containers.
Sacha: Have you already tested some of the Windows containers?
Dan: We have tested, not in a production level. We have some internal environments and internal micro-services that do use Docker, as a matter of fact. So far, we're very optimistic about it.
Sacha: Very nice. I hear a lot of people talk about Linux containers, obviously, but we don't hear too much about Windows containers. On Linux, it's almost like a given, right, but on Windows it can open doors to a lot of very good stuff.
Dan: Yeah. We've been watching the chatter around Docker on Windows for a long time now and we're very optimistic, certainly recently, and we've started testing things out.
Sacha: Now a question that's pretty specific to your space and I'm not sure if it makes sense. I know a lot of companies in the gaming space are using obfuscation techniques to avoid that their game be hacked or get free credits or whatever. Does that have an impact on how you do your continuous delivery? Is that stuff that typically happens at the very end of the process, so it doesn't matter? How does that change things?
Dan: That's a good question. You're absolutely right. We do have a lot of that tooling in place. Generally we'll have two flows. One will be we'll have the quick builds. They'll just be in our internal environments that QA can grab in the local studio and do the quick tests, that may not have as verbose of extra security tooling in there. But we also produce release builds that are sent to QA in other studios that do go through the entire process. So we just ensure that the entire process is always working. They're always able to grab the latest build. Essentially, we're always at the point where we can release the game and it's always secure. So we kind of have these two different environments going.
Sacha: I see. And you never had any compatibility, meaning you were safe because none of the _____. Then once you released, you realize some issues popped up.
Dan: We have certainly hit issues that only occur in the obfuscated builds. Luckily, because we do this testing relatively early, before we release the product, we're able to find it and work with whatever partner produces the tool that we're using, and resolve it before it's time to actually release the actual product.
Sacha: Okay. Do you get lots of requests from specific teams to get specific feature sets, new tooling or its _____ standardized?
Dan: It's fairly standardized. We are fortunate that EA is very, very large and we have a lot of core technologies that the games are built on top of. So we have the same kind of build environment across the board. Knowledge that I would have in building FIFA is transferable to other sports titles or other shooting games and stuff like that. So we do benefit a lot from that. But then on top of that, there's always going to be specific requirements for each individual team, whether it's an online-only game or an MMO. They're going to have other security implications versus an offline game. Those are generally, in my experience, dealt with on a case-by-case basis, depending on what our security has deemed necessary for security.
Sacha: I'm not a big player, so you'll have to help me here. How frequently do you actually send upgrades to games _____ ____? What's the ______ stream here?
Dan: It's interesting because the model used to be we would release a build. It would exist in retail shelves and you would install, and we would release patches, maybe three or four, just to ensure the game is stable. This has changed in recent years. We have more patches, so we try and make it so that the game stays relevant longer. Additionally, we have more server side patches, so we can add new assets, updated balanced rosters just on the server side that doesn't require a patch to the client game itself. In recent years we've gone through and done both of these much more frequently. The server side patches we do – I don't know the number off the top of my head, but some teams do it once a month to kind of refresh the rosters. Then the patches, we sometimes – sorry, for the client side patches we do special promotions around Christmas and things like that, generally maybe every two or three months, depending on the title itself.
Sacha: Is it hard from a _____ management standpoint to have those server side _____ updates, maybe mobile games? They're all interconnected.
Dan: Yeah. You have to have this kind of matrix of testing. You need to ensure that if you were to patch this server over here, that your client that has yet to be patched is still compatible. Then if you were to patch the client, it is still compatible with the updated server. It just increases the matrix of configurations that we have to test against. So it's certainly a problem. Luckily, we don't have a large bulk of patches to apply at one given time, so we can kind of make sure that the client side patch goes through it and then work on the next one.
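The testing matrix Dan describes can be illustrated with a minimal sketch. It assumes one old and one new version on each side; the version labels and the function name `compatibility_matrix` are hypothetical and do not come from the interview or from EA's tooling.

```python
from itertools import product

# Hypothetical version labels; the interview names no real versions.
CLIENT_VERSIONS = ["client-1.0", "client-1.1"]
SERVER_VERSIONS = ["server-1.0", "server-1.1"]


def compatibility_matrix(clients, servers):
    """Return every (client, server) pairing that would need a test pass."""
    return list(product(clients, servers))


matrix = compatibility_matrix(CLIENT_VERSIONS, SERVER_VERSIONS)
for client, server in matrix:
    # An unpatched client must still work against a patched server,
    # and a patched client against the patched server, and so on.
    print(f"test {client} against {server}")
```

With two versions on each side there are four pairings. Each extra patch in flight adds a row or column, which is why applying patches one at a time keeps the matrix small.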
Sacha: What about the mobile application? How do you do mobile testing?
Dan: One of the restrictions we've had in recent years is to build an iPhone game. We need to build it on a Mac device, and it's really hard to virtualize this. So we would have a room stacked full of a whole bunch of Macs. Recently, there's a new technology that allows us to virtualize this. So we're moving away from these physical Macs that were plugged into an iPhone, and these physical PCs that are plugged into a physical Android, for example, and trying to virtualize that as much as we can.
Sacha: How do you do that?
Dan: What we need to do is we can virtualize the Macs such that we can build an actual binary, but at this point to actually test it on-device, we need a device connected to a Mac. So we can have this separation between a device that's dedicated to build the binary and a device that is dedicated to testing it, so we can scale out, so we can build really quick. Then we can have a different series of Macs that are performing the tests, so we don't have the bottleneck of this one Mac must build and test, and then you don't get your result until all the entire loop is complete.
Sacha: So essentially you're using virtualization for the build phase.
Dan: At this point.
Sacha: And testing still requires hardware.
Dan: At this point, yes. There are some companies that do provide testing as a service as well that we're exploring, but at this point we only are testing physical devices.
Sacha: I see. And the virtualization technique you are talking about, is that something that is embedded in MacOS or is that something –?
Dan: This one, I don't have explicit context.
Sacha: All right. In terms of the process for your project, do all projects share the same process?
Dan: Pretty much, yeah. We go through a cycle of alpha, where you're building up the feature set and ensuring that you're cutting and you're scoping correctly. Then you go through various tests and ensure that at this particular milestone, let's call it beta, you've got all your features in place. There are minimal bugs. Then there's the stretch to final, where you need to ensure that the product is complete on time. We have very, very strict deadlines because there's a lot of marketing around this. I'm sure you see a lot of FIFA commercials and things like that. So we need to ensure that the product is for sure delivered on time. We work with first-party on this, so we can deliver a build to them a little bit early, so we can get feedback a little bit earlier, as we're working through bugs and stuff, and make sure that when we deliver what we feel is the final product we don't get surprised that we're missing some feature that they require. So it's close collaboration with our first-party partners.
Sacha: All right. I know building a _____ software factory or centralized DevOps services is an ongoing work. What would you say are your biggest challenges today?
Dan: It's actually rather interesting. I'll give you a specific example here. We've been measuring our reliability for quite some time, ensuring that any build or test doesn't fail due to some infrastructure-related reason. It always produces a valid result, whether it's successful or the error is related to some command. We've been measuring this for a long time, ensuring our reliability is at least 99.9 percent-plus. But now when we engage with our customers, which are the game developers themselves, we find there's sometimes a disconnect between their perceived reliability and how we measure ourselves. Through a lot of discourse there, we found that it's not necessarily the total number of infrastructure-related failures that's important, but rather how quickly we can resolve it. So we've changed how we're modeling our reliability to not just our overall reliability as a percentage, but any time there is an infrastructure-related error, how quickly can we then fix that error and then produce a valid result. That was very interesting, because it wasn't something that we as a provider of this service necessarily notice, but with close collaboration with our customer groups it's certainly something that ______.
Sacha: I see. So essentially, you could have lots of small interruptions that would give you a bad score, but actually they were small and maybe you had one very big that lasted a day and you were going to say it's just one, but it was down for a full day, so that was a big issue, right.
Dan: Exactly right. There could be an infrastructure error at the end of the day. Most people have gone home and that error persists until the morning, and that's over 12 hours. That has a big impact, but in our metrics that's just one data point. So we started to measure the duration of time that we're in each individual state.
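The difference between the two ways of measuring reliability can be shown with a short sketch. It is an illustration under stated assumptions, not EA's actual metric: the `Incident` type, the function names, and the sample figures (1,000 runs, one overnight failure) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One infrastructure-related failure (hypothetical record shape)."""
    started: datetime
    resolved: datetime

    @property
    def duration(self) -> timedelta:
        return self.resolved - self.started


def success_rate(total_runs: int, infra_failures: int) -> float:
    """Count-based reliability: percentage of runs with a valid result."""
    return 100.0 * (total_runs - infra_failures) / total_runs


def time_in_failed_state(incidents) -> timedelta:
    """Duration-based view: total time spent in a failed state."""
    return sum((i.duration for i in incidents), timedelta())


def mean_time_to_resolve(incidents) -> timedelta:
    """Average time from failure to fix; zero if there were no incidents."""
    if not incidents:
        return timedelta()
    return time_in_failed_state(incidents) / len(incidents)


# Assumed sample data: one failure at 18:00 that persists until 07:00.
overnight = Incident(datetime(2017, 8, 30, 18, 0), datetime(2017, 8, 31, 7, 0))
rate = success_rate(total_runs=1000, infra_failures=1)
downtime = time_in_failed_state([overnight])
print(f"count-based reliability: {rate:.1f}%")
print(f"time in failed state: {downtime}")
```

In this made-up case the count-based figure still meets a 99.9 percent target, while the duration-based figure shows 13 hours in a failed state, which is the gap the developers were perceiving.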
Sacha: Have you tried to apply some type of cost instead of as a percentage, but more as dollars, US or Canadian dollars? But essentially say, "Downtime of that type of server, that may be $100.00 an hour. Those may be $1,000.00," or maybe when it's close to a release that's a million an hour. I don't know. Have you looked at those approaches to –
Dan: Conceptually, yes. I'm not privy to knowing the actual financial situation, but we do at times measure, you know, if this service goes down, maybe it's just due to one error, but because it goes down a whole suite of developers can't work. All of those developers, who you are paying a regular salary to, can't work. So that has a huge impact on the company. So I don't have metrics to say how many dollars we're actually losing, but we are trying to gauge not just the number of errors, but the impact it has on them.
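One rough way to express the impact discussed here is to weight an outage by the number of developers it blocks. The sketch below is only an assumption about how such a figure could be computed; the function names, the headcount, and the hourly rate are placeholders, since Dan states he has no real dollar metrics.

```python
from datetime import timedelta


def blocked_developer_hours(outage: timedelta, blocked_developers: int) -> float:
    """Developer-hours lost while a service is down."""
    return outage.total_seconds() / 3600 * blocked_developers


def estimated_cost(outage: timedelta, blocked_developers: int,
                   hourly_rate: float) -> float:
    """Dollar estimate; hourly_rate is a placeholder, not a real figure."""
    return blocked_developer_hours(outage, blocked_developers) * hourly_rate


# Assumed example: a 2-hour outage that blocks 50 developers.
lost_hours = blocked_developer_hours(timedelta(hours=2), blocked_developers=50)
cost = estimated_cost(timedelta(hours=2), blocked_developers=50, hourly_rate=100.0)
print(f"{lost_hours:.0f} developer-hours lost, roughly ${cost:,.0f}")
```

Under this model a single error still counts once, but an error that blocks a whole team scores far higher than one that blocks a single build.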
Sacha: Can you share with us a stressful situation you had, maybe just before a very important release or something that went wrong?
Dan: Oh boy. Sure.
Sacha: I'm asking this question because I know for people it's great to hear stories of when things go wrong at other people's companies, because they always feel like it's going wrong at their company.
Dan: Yeah, totally. You'll have to give me a moment to think of a good example and one that I'm at liberty to say on the radio. Yeah, actually I'll provide an example here. It was very close to when we wanted to deliver our final product to first-party. This was FIFA in this example. We were having a lot of network-related issues here. So there's connections from our Jenkins master to our slaves that were kind of flaky, and then it would get disconnected and fail to build. Then that reduces the amount of time that QA can then grab it and test, do the last minute tests before we can deliver it. That's very stressful. We're running around, trying to engage with our IT partners and developers to make sure both are aware of this situation and how we can resolve it as quickly as possible. Luckily, in that particular situation, our IT partners were able to resolve the issue quite quickly, but it's kind of when we're near finaling, the phrase that's quite often used is, "Hurry up and wait," which is ideally we have nothing to do, but if there is any infrastructure issue everybody must scramble. We need to fix it immediately, as soon as possible, because it has a huge impact on the final delivery.
Sacha: Wow. Yeah, I can imagine. I like that, "Hurry up and wait." Maybe you have some closing thoughts around the future of DevOps or where you see this going. What are the next things you think are going to be impacting DevOps?
Dan: Yeah. I really love the DevOps space. As a little context, when I first got into the industry I wanted to be a developer, as most people I'm sure did, but as I got more and more invested into DevOps I saw it's really cool just finding optimizations here and there and just making things run faster, increasing breadth. Some of the panels that I've seen just today in Jenkins World are really cool. It kind of visualizes a problem that I've seen myself, and it's really nice to know that other people are seeing the same issues. Other people are working together on a resolution to that. The best part about all of this is just the community here. It's very easy to walk up to somebody and just talk about issues that him and I are having, and kind of work through it to a resolution.
Sacha: Right. That's great. Thanks, Dan.
Dan: Thank you very much.
Sacha: Thanks a lot for accepting this interview. That was Dan Roston from Electronic Arts.
Dan: Thank you for having me.
Sacha: All right.