Building a Resilient Engineering Culture with Ryn Daniels

Released Thursday, 25th April 2019

About Ryn Daniels

Ryn Daniels is a staff infrastructure software engineer who got their start in programming with TI-80 calculators back when GeoCities was still cool. Their work has focused on infrastructure operability, sustainable on-call practices, and the design of effective and empathetic engineering cultures. They are the co-author of O’Reilly’s Effective DevOps and have spoken at numerous industry conferences on devops engineering and culture topics. Ryn lives in Berlin, Germany with a perfectly reasonable number of cats and in their spare time can often be found powerlifting, playing cello, or handcrafting knitted server koozies for the data center.

Transcript

Mike Julian: This is the Real World DevOps Podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, or the authors of great books to fantastic public speakers, I want to introduce you to the most interesting people I can find. 


This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version. You can check all of this out at InfluxData.com.


Mike Julian: Hi folks. I'm Mike Julian, your host with Real World DevOps. My guest this week is Ryn Daniels, co-author of O'Reilly's Effective DevOps and a public speaker who previously worked in engineering for both Etsy and Travis CI. Ryn, I hear you're working at everyone's favorite infrastructure automation company now. HashiCorp, is it?


Ryn Daniels: Yes, it is. I'm working on the Terraform ecosystem team. I'm going to be working on the AWS provider.


Mike Julian: You've been writing and talking a lot about this idea of resilient culture, and you wrote an article for InfoQ, which we'll link in the show notes, about crafting resilient culture, which talked about the Apache snafu. You and I were just talking before the show about an earlier story about Postfix and Puppet and, well, things exploding in your face.


Ryn Daniels: Yes, so it's a fun story with a little less of a happy ending than the Apache snafu. At my first ops job, I inherited two data centers that didn't even have a lonely bash script for company. I was doing everything by hand. There were a lot of dragons and nobody was really sure where the dragons were lurking. One of the things that I was kind of put in charge of was the idea of, "What if we didn't do literally everything manually? What if we had some sort of automation?" So I got to do fun stuff like set up automated Linux installs instead of me going around carrying a USB DVD player and yeah.


Mike Julian: Definitely been there.


Ryn Daniels: Yeah, that was ... Those were sad times. So I was starting to put together Puppet and it was mostly going pretty well. I was starting out with what seemed like the safe stuff. And I asked the engineering team, I'm like, "So it seems like Postfix is configured a bit on these servers, but it's not running. Should it be running?" And people talked amongst themselves a little bit and they were like, "Yeah, it should definitely be running because the servers are set up to email us when something goes wrong." Okay.


Mike Julian: So clearly everything was fine because no emails were going out.


Ryn Daniels: Exactly. Exactly. So I clear this with everyone. I tell them, I'm like, "Okay, I'm going to roll out this change." And I turn on Postfix everywhere. And this was my very first ops job, so we didn't have anything like a testing or a staging environment. I was really kind of playing everything by ear at that point and learning as I went. So I turn on Postfix and then a few minutes later somebody says the site's down. Like, how did turning on Postfix take the site down?


Mike Julian: That's weird.


Ryn Daniels: And we kind of poke a little bit on one of the servers that I was logged into and, like, the web server was still running. Everything looked like it should have been fine. What happened was there were eight years of emails queued up on every single server, and when Puppet turned on Postfix, those eight years of queued emails started sending all at once. And the way that networking was or wasn't configured back then, I think I just, like, saturated every single network link in our two data centers with all of these emails, and everyone's like, "Ryn, help, make it stop, get everything back online." I'm like, "I don't know how to un-send eight years' worth of email, folks. Like, we're just going to have to wait this out." Which is kind of what happened. And eventually, eventually all of the emails sent and shockingly, there were a lot of error emails as it turns out in this sort of environment.


Mike Julian: Surprise, surprise.


Ryn Daniels: Yeah. And after that everyone was a little twitchy anytime I mentioned making a Puppet change. So yeah, it was definitely an exciting afternoon slash couple of days trying to figure out what went wrong with automation and trying to keep it from going that sideways in the future.


Mike Julian: How did your teammates react to all this? Like, aside from, "Ryn, what have you done?"


Ryn Daniels: It was mostly just that kind of panic and then everyone trying to figure out what to do. People had differing amounts of visibility into what was going on. There was kind of a homegrown monitoring system that was set up that also lived in the data center, which may or may not have been very accessible during this time. Oh, I remember, I was stuck in the data center physically because nothing was configured to have remote, out-of-band access. So most of my days were spe...
