Carmen Saenz

Released Thursday, 26th August 2021

In this episode, we cover:

  • Intro and an Anecdote: 00:00:27
  • Early Days of Chaos Engineering: 00:04:13
  • Moving to the Cloud and Important Lessons: 00:07:22
  • Always Learning and Teaching: 00:11:15
  • Figuring Out Chaos: 00:16:30
  • Advice: 00:20:24


Transcript

Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode, Ana Medina is joined by Carmen Saenz, a senior DevOps engineer at Apex Clearing Corporation. Carmen shares her thoughts on what cloud-native engineers can learn from our on-prem past, how she learned to do DevOps work, and what reliable IT systems look like in higher education.

Ana: Hey, everyone. We have a new podcast today, we have an amazing guest; we have Carmen Saenz joining us. Carmen, do you want to tell us a little bit about yourself, a quick intro?

Carmen: Sure. I am Carmen Saenz. I live in Chicago, Illinois, born and raised on the south side. I am currently a senior DevOps engineer at Apex and I have been in high-frequency trading for 11 out of 12 years.

Ana: DevOps engineers, that’s definitely the type of work that we love diving in on, making sure that we’re keeping those systems up-to-date. But that really brings me to one of the questions we love asking. We know that in technology, we’re sometimes fighting fires, making sure our engineers can deploy quickly and keep collaborating. What is one incident that you’ve encountered that has marked your career? What exactly happened that led up to it, and how did your team go about discovering the issue?

Carmen: One of the incidents that happened to us was around—close to the beginning of the teens [over 00:01:23] 2008, 2009—and I was working at a high-frequency trading firm in which we had an XML configuration that needed to be deployed to all the machines that were on-prem at the time—this was before cloud—that needed to connect to the exchanges where we trade. And one of the things that we had to do was add specific configurations in order for us to keep track of our trade position. One of the things that happened was, certain machines get a certain configuration, other machines get another configuration. That configuration wasn’t added for some machines, and so when it was deployed, we realized that they were able to connect to the exchange and they were starting to trade right away. Luckily, someone noticed from an external system that we weren’t getting the position updates.



So, then we had to bring down all these on-prem machines by sending out a bash script to hit all these specific machines and kill the connection to the exchange. Luckily, it was just the beginning of the day and it wasn’t so crazy, so we were able to kill them within that minute timeframe before it went crazy. We realized that one of the big issues that we had was, one, we didn’t have a configuration management system in order to check to make sure that the configurations we needed were there. The second thing that we were missing was a second pair of eyes. We needed someone to actually look at the configuration, PR it, and then push it.

And once it’s pushed, then we should have had a third person as we were going through the deployment system to make sure that this was the new change that needed to be in place. So, we didn’t have the measures in place for us to actually make sure that these configurations were correct. And it was chaos because you can lose money when you’re down right as trading is starting for the day. And it was just a simple mistake of not knowing these machines needed a specific configuration. So, it was kind of intense, those five minutes. [laugh].
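As a rough illustration of the safeguard Carmen says was missing, here is a minimal sketch of a pre-deployment check that refuses to ship a per-machine XML config unless it contains the required sections, such as position tracking. The element names, directory layout, and pipeline behavior are assumptions for illustration only, not details from the episode.

```python
# Hypothetical pre-deployment check (not from the episode): verify that every
# per-machine XML config contains the elements the exchange connectors need,
# such as the position-tracking section, before anything is pushed.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed element names, purely for illustration.
REQUIRED_ELEMENTS = ("position_tracking", "exchange_session")

def missing_elements(config_path: Path) -> list:
    """Return the required elements absent from one machine's config."""
    root = ET.parse(config_path).getroot()
    return [name for name in REQUIRED_ELEMENTS if root.find(name) is None]

def main(config_dir: str) -> int:
    bad = {}
    for path in sorted(Path(config_dir).glob("*.xml")):
        missing = missing_elements(path)
        if missing:
            bad[path.name] = missing
    for name, missing in sorted(bad.items()):
        print(f"{name}: missing {', '.join(missing)}")
    # A non-zero exit code would let a deploy pipeline block the rollout.
    return 1 if bad else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "configs"))
```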

Ana: [laugh]. So, amazing that y’all were able to catch it so quickly, because the first thing that comes to mind, as you said, before the cloud—on-prem—is, do we need to start marking ‘BC’, like, ‘Before Cloud’ times when we talk about incidents? Because I think we do. When we look at the world that we live in now, in a more cloud-native space, you tell someone about this incident and they’re going to look at us and say, “What do you mean? I have containers that manage all my config management. Everything’s going to roll out.”

Or, “I have observability that’s going to make us resilient to this so that we detect it earlier.” So, with something like chaos engineering, if something like this were to happen in an on-prem type of data center, is there something that chaos engineering could have done to help prepare y’all or to avoid a situation like this?

Carmen: Yeah. One of the things that I believe—the chaos engineering, for what it’s worth, I didn’t actually know what chaos engineering was till 2012, and the specific thing that you mentioned is actually what they were testing. We had a test system, so we had all these on-prem machines in different co-locations around the country. And we would take some of our test systems—not production, because that was money-based, but our test systems that were on simulated exchanges—and what we would do to test to make sure our code was up-to-date is we actually had a Chaos Monkey to break the configuration.



We actually had a Chaos Monkey and it would just pick a random function to run that day. It would either send a bad config to a machine or bring down a machine by disconnecting its connection, making a networking change in the middle to see how we would react. [unintelligible 00:05:01] with any machine in our simulation. And then we had to see how it was going to react to the changes that were happening; we had to deduce, we had to figure out how to roll it back. And those are the things that we didn’t have at the time. In 2012—this was another company I was working for in high-frequency trading—they implemented chaos engineering in that simulation, and specifically for them, we would catch these problems before we hit production. So yeah, that definitely was needed.
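To make the idea concrete, here is a minimal sketch of that kind of Chaos Monkey: on each scheduled run it picks a random test host and a random failure to inject, either a bad config push or a dropped exchange connection. The host names and actions below are placeholders, not the firm’s actual tooling.

```python
# Minimal chaos-monkey sketch in the spirit Carmen describes: each run, pick a
# random host in the *simulated* exchange environment and a random failure to
# inject. Host names and actions are placeholders, not the firm's tooling.
import random

TEST_HOSTS = ["sim-trade-01", "sim-trade-02", "sim-trade-03"]  # assumed names

def push_bad_config(host: str) -> None:
    # Stand-in for shipping a deliberately broken XML config to the host.
    print(f"[chaos] pushing a malformed config to {host}")

def cut_exchange_connection(host: str) -> None:
    # Stand-in for a network-level break, e.g. blocking the exchange port
    # on that host so we can watch how the trading system reacts.
    print(f"[chaos] dropping exchange connectivity on {host}")

ACTIONS = [push_bad_config, cut_exchange_connection]

def run_once(seed=None) -> None:
    rng = random.Random(seed)
    host = rng.choice(TEST_HOSTS)
    action = rng.choice(ACTIONS)
    action(host)

if __name__ == "__main__":
    run_once()  # e.g., scheduled once a day, against the simulation only
```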



Ana: That’s super awesome that, from a failure encountered four years prior, at your next company you ended up realizing, wait, this company actually practices “let’s roll out a bad deploy; how does our system actually engage with it?” That’s such an amazing learning experience. Is there anything more recent that you’ve done in chaos engineering you’d want to share about?

Carmen: Actually, since I’ve just started at this company a couple of months ago, I haven’t—thankfully—run into anything, so a lot of my stories are more like war stories from the PC days. So.

Ana: Do you usually work now mostly on on-prem systems, or do you find yourself in hybrid environments or cloud-type environments?

Carmen: Recently, the last three to four years I’ve spent cloud-only. I rarely have to encounter on-prem nowadays. But coming from an on-prem world to a cloud world, it was completely different. And I feel with the tools that we have now, we have a lot of built-in checks and balances in which, even with us trying to manually delete a node in our cluster, we can see our systems auto-heal because cloud engineering attempts to take care of that for us, or with, you know, infrastructur...
