Dan Isla: Astronomical Reliability

Released Tuesday, 17th May 2022


It’s time to shoot for the stars with Dan Isla, VP of Product at itopia, to talk about everything from the astronomical importance of reliability to time zones on Mars. Dan’s trajectory has been propelled by jobs bordering on science fiction, with a history at NASA modernizing cloud computing, and loads more. Dan discusses the finite room for risk and failure in space travel with an anecdote from his work on Curiosity. He also talks about his major takeaways from working at Google, his “baby” Selkies, his work at itopia, and the crazy math involved in accounting for time on Mars!

In this episode, we cover:

Introduction (00:00)
Dan’s work at JPL (01:58)
Razor-thin margins for risk (05:40)
Transition to Google (09:08)
Selkies and itopia (13:20)
Building a reliability community (16:20)
What itopia is doing (20:20)
Learning, building a “toolbox,” and teams (22:30)
Clock drift (27:36)

Links Referenced:

itopia: https://itopia.com/
Selkies: https://github.com/danisla/selkies
selkies.io: https://selkies.io
Twitter: https://twitter.com/danisla
LinkedIn: https://www.linkedin.com/in/danisla/

Transcript

Dan: I mean, at JPL we had an issue adding a leap second to our system planning software, and that was a fully coordinated, many months of planning, for one second. [laugh]. Because when you’re traveling at 15,000 miles per hour, one second off in your guidance algorithms means you missed the planet, right? [laugh]. So, we were very careful. Yeah, our navigation parameters had, like, 15 decimal places, it was crazy.
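To put a rough number on that one second, here is a minimal back-of-the-envelope sketch. It assumes the spacecraft moves in a straight line at a constant 15,000 miles per hour (the figure Dan quotes) and only computes how far it travels during the timing error; real guidance software propagates a full trajectory, so this is an illustration of scale, not of how JPL navigates. The function name `position_error_miles` is hypothetical.

```python
SECONDS_PER_HOUR = 3_600
KM_PER_MILE = 1.609344  # exact definition of the international mile


def position_error_miles(speed_mph: float, timing_error_s: float) -> float:
    """Distance covered during a timing error, assuming constant straight-line speed."""
    return speed_mph * timing_error_s / SECONDS_PER_HOUR


# One leap second at the speed quoted in the episode.
error_miles = position_error_miles(15_000, 1.0)
error_km = error_miles * KM_PER_MILE

print(f"{error_miles:.2f} miles (about {error_km:.2f} km) off after one second")
```

Under these assumptions a single unaccounted second puts the position estimate a little over four miles out, which is why a one-second change took months of planning.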

Julie: Welcome to Break Things on Purpose, a podcast about reliability, building things with purpose, and embracing learning. In this episode, we talked to Dan Isla, VP of Product at itopia, about the importance of reliability, astronomical units, and time zones on Mars.

Jason: Welcome to the show, Dan.

Dan: Thanks for having me, Jason and Julie.

Jason: Awesome. Also, yeah, Julie is here. [laugh].

Julie: Yeah. Hi, Dan.

Jason: Julie’s having internet latency issues. I swear we are not running a Gremlin latency attack on her. Although she might be running one on herself. Have you checked in the Gremlin control panel?

Julie: You know, let me go ahead and do that while you two talk. [laugh]. But no, hi and I hope it’s not too problematic here. But I’m really excited to have Dan with us here today because Dan is a Boise native, which is where I’m from as well. So Dan, thanks for being here and chatting with us today about all the things.

Dan: You’re very welcome. It’s great to be here to chat on the podcast.

Jason: So, Dan has mentioned working at a few places and I think they’re all fascinating and interesting. But probably the most fascinating—being a science and technology nerd—Dan, you worked at JPL.

Dan: I did. I was at the NASA Jet Propulsion Lab in Pasadena, California, right after graduating from Boise State, from 2009 to around 2017. So, it was quite the adventure, got to work on some, literally, out-of-this-world projects. And it was like drinking from a firehose, being kind of fresh out to some degree. I was an intern before that so I had some experience, but working on a Mars rover mission was kind of my primary task. And the Mars rover Curiosity was what I worked on as a systems engineer and flight software test engineer, doing launch operations, and surface operations, pretty much the whole, like, lifecycle of the spacecraft I got to experience. And had some long days and some problems we had to solve, and it was a lot of fun. I learned a lot at JPL, a lot about how government, like, agencies are run, a lot about how spacecraft are built, and then towards the end a lot about how you can modernize systems with cloud computing. That led to my exit [laugh] from there.

Jason: I’m curious if you could dive into that, the modernization, right? Because I think that’s fascinating. When I went to college, I initially thought I was going to be an aerospace engineer. And so, because of that, they were like, “By the way, you should learn Fortran because everything’s written in Fortran and nothing gets updated.” Which I was a little bit dubious about, so correct folks that are potentially looking into jobs in engineering with NASA. Is it all Fortran, or… what [laugh] what do things look like?

Dan: That’s an interesting observation. Believe it or not, Fortran is still used. Fortran 77 and Fortran—what is it, 95. But it’s mostly in the science community. So, a lot of data processing algorithms and things for actually computing science, written by PhDs and postdocs is still in use today, mostly because those were algorithms that, like, people built their entire dissertation around, and to change them added so much risk to the integrity of the science, even just changing the language where you go to a language with different levels of precision or computing repeatability, introduced risk to the integrity of the science. So, we just, like, reused the [laugh] same algorithms for decades. It was pretty amazing, yeah.

Jason: So, you mentioned modernizing; then how do you modernize with systems like that? You just take that codebase, stuff it in a VM or a container and pretend it’s okay?

Dan: Yeah, so a lot of it is done very carefully. It goes kind of beyond the language down to even some of the hardware that you run on, you know? Hardware computing has different endianness, which means the order of bits in your data structures, as well as different levels of precision, whether it’s a RISC system or an AMD64 system. And so, just putting the software in a container and making it run wasn’t enough. You had to actually compute it, compare it against the study that was done and the papers that were written on it to make sure you got the same result. So, it was pretty—we had to be very careful when we were containerizing some of these applications in the software.
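The two hazards Dan names can be seen in a few lines of Python. This is a minimal sketch, not JPL's process: the sample values, the tolerance, and the helper name `matches_reference` are all assumptions made for illustration. Note that endianness is, strictly speaking, the order of bytes within a multi-byte value.

```python
import math
import struct

# Endianness: the same 32-bit integer laid out in two byte orders.
value = 0x0A0B0C0D
big = struct.pack(">I", value)     # most significant byte first
little = struct.pack("<I", value)  # least significant byte first

# Reading big-endian bytes as if they were little-endian gives a different number.
misread = struct.unpack("<I", big)[0]

# Precision: round-trip a double through single precision and measure the loss.
x = 0.1
as_single = struct.unpack("<f", struct.pack("<f", x))[0]
precision_gap = abs(as_single - x)


def matches_reference(computed: float, reference: float, rel_tol: float = 1e-12) -> bool:
    """Hypothetical check: does a ported result agree with the published value?"""
    return math.isclose(computed, reference, rel_tol=rel_tol)


print(big.hex(), little.hex(), hex(misread))
print(f"single-precision gap: {precision_gap:.3e}")
print("port matches reference:", matches_reference(as_single, x))
```

A check in the spirit of `matches_reference` is what "compare it against the study" amounts to in practice: the ported code has to reproduce the published numbers to an agreed tolerance, not merely run without errors.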

Julie: You know, Dan, one thing that I remember from one of the very first talks I heard of yours back in, I think, 2015 was you actually talked about how we say within DevOps, embrace failure and embrace risk, but when you’re talking about space travel, that becomes something that has a completely different connotation. And I’m kind of curious, like, how do you work around that?

Dan: Yeah, so failing fast is not really an option when you only have one thing [laugh] that you have built or can build. And so yeah, there’s definitely a lot of adverseness to failing. And what happens is it becomes a focus on testing, stress testing—we call it robustness testing—and being able to observe failures and automate repairs. So, one of the test programs I was involved with at JPL was, during the descent part of the rover’s approach to Mars, there was a powered descent phase where the rover actually had a rocket-propelled jetpack and it would descend to the surface autonomously and deliver the rover to the surface. And during that phase it’s moving so fast that we couldn’t actually remote control it, so it had to do everything by itself.

And there were two flight computers...
