JJ: How do you now get this massive organization to change the way that they work? Even if they were following, like, a checklist and Google Docs, that still marks as a fairly significant cultural change, and so we need to be very mindful of it.

Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build modern applications, or help us fix them when they break. In this episode, JJ Tang, co-founder of Rootly, joins us to chat about incident response, the tool he’s built, and the lessons he’s learned from incidents.

So, in this episode, we’ve got with us JJ Tang, who’s the co-founder of a company and a tool called Rootly, welcome to the show.

JJ: Thank you, Jason, super excited to be here. Big fan of what you guys are doing over at Gremlin and all things Chaos Engineering. Quick intro on my side. I’m JJ, as you mentioned. We are building Rootly, which is an incident management platform built on top of Slack.

So, we help a bunch of different companies automate what we believe to be some of the most manual and tedious work when it comes to incidents, like creating virtual war rooms, Zoom Bridges, tracking your action items on Jira, generating your postmortem timeline, adding the right responders, and generally just helping build that consistency. So, we work with a bunch of different fast-growing tech companies like Canva, Grammarly, Bolt, Faire, Productboard, and also some of the more traditional ones like Ford and Shell. So, super excited to be here. Hopefully, I have some somewhat engaging insight, I hope. [laugh].

Jason: Yeah, I think you will because in our discussions previously, we’ve always had fantastic conversations. So, you’ve kind of covered a lot of the first question that I normally ask, and that’s what did you build? And so as you explained, Rootly is an incident management tool; works with Slack. But that naturally leads into the other question that I asked our Build Things guests, and that’s why did you build this? Was it something from your experience as an engineer that you’re just like, “I need a tool to solve this?” What’s the story behind Rootly?

JJ: Yeah, definitely. Sorry to jump the gun on the first question. I was a little bit too excited, I think. But yeah, so my co-founder, and I—his name is Quinton—we both used to work at Instacart, the grocery delivery startup. He was there super, super early days; he was actually one of the first SREs there and kind of built out that team.

And I was more on the product side of things, so I helped us build out our enterprise and last-mile delivery products. If you’re curious what does [laugh] grocery have to do with reliability, actually, not that much, but the challenges we were dealing with were at very great scale. So, it all started back when the pandemic first started getting kicked off. Instacart was growing rapidly at the time, we were scaling really well, we were heading the numbers where we want it to be, but with suddenly the lockdowns occurring, everyone overnight who didn’t care about grocery delivery and thought, “Well, why don’t I just drive to Walmart,” [laugh] suddenly wanted to order things on Instacart. So, the company grew 5, 600%, nearly overnight.

And with that, our systems just could not handle the load. And it’d be the most obscure incidents you wouldn’t think would break, but under such immense stress and demand, we just couldn’t keep the site up all the time. And what that really exposed on our end was, we don’t have a really good incident management process. What we were doing was, we kind of just had every engineer in a single incident channel on Slack. And if you got paged, you just kind of ping in there. “I just got woken up. Did anyone else? Does this look legit?”

And there was no formal way, so there was no consistency in terms of how the incidents were created. And then, of course, from that top-of-funnel into the postmortem, there wasn’t too much discipline there. So, we really thought about, you know, after the dust kind of settled, there must be a better way to do this. And like most organizations that we work with, you start thinking about how can I build this myself?

I think there’s probably a little bit of a gap right now in this space. People generally understand monitoring tools really well, like New Relic, Datadog, alerting tools super well, PagerDuty, Opsgenie, they do a really good job at it. But everything afterwards, the actual orchestration and learning from the incidents tends to be a little bit sparse. So, we started embarking on our own. And for my co-founder’s side of things, he was more at the heart of the incident than I was. I think I was the one complaining about and breathing down his neck a little bit about why things [laugh] sometimes weren’t working.

And—yeah, and, you know, as we started thinking about internal solutions, we took a step back and thought, “Well, you know, if Instacart is facing this problem then I think a lot of companies must be as well.” And luckily, our hypothesis has proven to be true, and yeah, the rest is just history now.

Jason: That’s really fascinating, particularly because, I mean, it is such a widespread issue, right? And I think I’ve experienced that as well, where you’ve got a general on-call or incidents channel, and literally everybody in the organization’s in there, not just engineers, but—like yourself—product people and customer success or support folks are all in there. And the idea is this, sort of—it’s a giant, giant crowd of folks who are just, like, waiting and wondering. And so having a tool to help manage that is extremely useful. As you started building out this tool, I’m starting to think there are starting to become a lot more incident management tools or incident response management tools, so talk to me about what are the unique points about Rootly?

Because I suspect that a lot of it is influenced from, “These are the pain points that I had during my incidents,” and so you pulled them over? And so I’m curious, what are those that you brought to the tool that really help it shine during an incident?

JJ: Yeah, definitely. I think the space that we’re in right now is certainly heating up as you go to the different conferences and the content that’s put out there. Which is great because that means everyone is educating the broader audience of what’s going on and just makes my job just a little bit easier. There’s a couple, you know, original hypothesis that we had for the product that just ended up not being as important. And that has really defined how we think about Rootly and how we differentiate a lot of what we do.

How we did incidents at Instacart wasn’t all that unique, you know? We used the same tools everyone else did. We had Opsgenie, we used Slack, Datadog, Jira, we wrote our postmortems on Confluence, stuff like that, and our initial reaction was, “Well, people are using the same tools, they must be following a very similar process.” And we...

Rate