Ben Sigelman is the CEO of Lightstep and the author of the Dapper paper that spawned distributed tracing discussions in the software industry. On the podcast today, Ben discusses with Wes observability, and his thoughts on logging, metrics, and tracing. The two discuss detection and refinement as the real problem when it comes to diagnosing and troubleshooting incidents with data. The podcast is full of useful tips on building and implementing an effective observability strategy.
Why listen to this podcast:
- If you’re getting woke up for an alert, it should actually be an emergency. When that happens, things to think about include: when did this happen, how quickly is it changing, how did it change, and what things in my entire system are correlated with that change.
- A reality that seems to be happening in our industry is that we’re coupling the move to microservices with a move to allowing teams to fully self-determine technology stacks. This is dangerous because we’re not at the stage where all languages/tools/frameworks are equivalent.
- While a service mesh offers a great potential for integrations at layer 7 many people have unrealistic expectations on how much observability will be enabled by a service mesh. The service mesh does a great job of showing you the communication between the services, but often the details get lost in the work that’s being done inside the service. Service owners need to still do much more work to instrument applications.
- Too many people focus on the 3 Pillars of Observability. While logs, metrics, and tracing are important, observability strategy ought to be more focused on the core workflows and needs around detection and refinement.
- Logging about individual transactions is better done with tracing. It’s unaffordable at scale to do otherwise.
- Just like logging, metrics about individual transactions are less valuable. Application level metrics such as how long a queue is are metrics that are truly useful.
- The problem with metrics are the only tools you have in a metrics system to explain the variations that you’re seeing is grouping by tags. The tags you want to group by have high cardinality, so you can’t group them. You end up in a catch 22.
- Tracing is about taking traces and doing something useful with them. If you look at hundreds or thousands of tracing, you can answer really important questions about what’s changing in terms of workloads and dependencies about a system with evidence.
- When it comes to serverless, tracing is more important than ever because everything is so ephemeral. Node is one of the most popular serverless languages/frameworks and, unfortunately, also one of the hardest of all to trace.
- The most important thing is to make sure that you choose something portable for the actual instrumentation piece of a distributed tracing system. You don’t want to go back and rip out the instrumentation because you want to switch vendors. This is becoming conventional wisdom.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2PPIdeE
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
Like InfoQ on Facebook: bit.ly/2jmlyG8
Follow on Twitter: twitter.com/InfoQ
Follow on LinkedIn: www.linkedin.com/company/infoq
Check the landing page on InfoQ: https://bit.ly/2PPIdeE