Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
Use Ctrl + F to search
0:11
Hello, and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Data
0:16
lakes are notoriously complex. For
0:19
data engineers who battle to build and
0:21
scale high quality data workflows on the
0:23
data lake, Starburst powers petabyte-scale SQL analytics
0:26
fast, at a fraction of the cost
0:28
of traditional methods, so that you can
0:30
meet all of your data needs, ranging
0:32
from AI to data applications to complete
0:35
analytics. Trusted by teams of all sizes,
0:37
including Comcast and DoorDash, Starburst is a
0:39
data lake analytics platform that delivers the
0:42
adaptability and flexibility a lakehouse ecosystem
0:44
promises. And Starburst does
0:46
all of this on an open architecture,
0:49
with first-class support for Apache Iceberg, Delta
0:51
Lake, and Hudi, so you
0:53
always maintain ownership of your data. Want
0:56
to see Starburst in action?
0:58
Go to dataengineeringpodcast.com slash
1:00
starburst and get $500 in credits
1:02
to try Starburst Galaxy today, the easiest
1:04
and fastest way to get started using
1:06
Trino. Dagster offers a
1:09
new approach to building and running
1:11
data platforms and data pipelines. It
1:13
is an open-source, cloud-native orchestrator for
1:15
the whole development lifecycle, with integrated
1:18
lineage and observability, a declarative programming
1:20
model, and best-in-class testability. Your
1:23
team can get up and running
1:25
in minutes thanks to Dagster Cloud,
1:27
an enterprise-class hosted solution that offers
1:29
serverless and hybrid deployments, enhanced security,
1:31
and on-demand ephemeral test deployments. Go
1:34
to dataengineeringpodcast.com/dagster today to get started,
1:36
and your first 30 days are
1:38
free. Your host is Tobias
1:41
Macy, and today I'm interviewing Maayan Salom
1:43
about how to incorporate observability into a
1:45
DBT-oriented workflow and some of the ways
1:47
that elementary can help. So, Maayan, can
1:49
you start by introducing yourself? Yeah,
1:52
sure. So, happy to be here.
1:54
I'm Maayan. My Starbucks
1:56
name is Maya. It's much easier to
1:58
pronounce. I'm the CEO... and I'm
2:00
a co-founder of elementary. Some
2:02
people know us as elementary data. I've
2:05
been in data roles for 12 years before
2:09
starting elementary, mainly in
2:11
a cybersecurity company. I
2:13
actually got into data much earlier because
2:16
I was a kid that was
2:18
obsessed with sports. [inaudible] My dad
2:18
wanted a boy, and when he
2:20
didn't get a boy, he got
2:22
me involved instead. And
2:25
[inaudible]
2:27
I kept doing it
2:29
all the way through, until
2:31
I reached kind of my
2:33
own professional life. So I
2:35
started with it, obviously. I
2:38
became more and more
2:40
focused on data pipelines and
2:43
all of that in various ways. Very
2:46
slowly, [inaudible] I realized
2:48
it was very important.
2:50
And then later on, it was much
2:52
bigger, more complicated tasks as well. So
2:57
mentioned already how you first got interested working
2:59
in data. I'm wondering if you can just
3:01
give a bit of the sense of what
3:03
it is about the space that has kept
3:05
you interested and why you want to focus
3:08
your time and energy on that problem space.
3:11
So I think in general, I
3:13
have a big passion for data.
3:16
It's like the kind of the right way
3:18
to make decisions. And I
3:20
think everyone who's a data professional
3:22
probably feels that, in many aspects of
3:22
their life, not just in
3:24
their professional life. And
3:29
it's something you trust, right?
3:31
When you can trust it, you're gonna
3:33
make great decisions. And when you can't
3:35
use it, when you see
3:38
the way stats
3:40
are used at some times in
3:42
media, maybe to kind of
3:45
create wrong messages, then it may break
3:47
your heart. So it can be
3:50
a very frustrating thing, working intensely
3:52
with data. In
3:54
my last role before elementary,
3:56
I was doing cybersecurity incident response. There's
4:01
like a big crisis that you're there
4:03
to solve. And it's time sensitive.
4:06
There's a lot of pressure and you need to be very,
4:08
very accurate with everything. There's a lot of consequences. And
4:10
just the amount of time we spend there on
4:13
validating and revalidating and
4:15
trying to understand
4:18
if everything is okay was just so frustrating.
4:20
It sounds like something that I want to
4:22
focus on and solve. And
4:28
now digging into the question
4:30
of observability and in particular
4:32
for DBT projects, data observability
4:35
started coming to the fore
4:37
in the data space maybe
4:39
two or three years ago.
4:42
And I'm just wondering if you can
4:44
talk to some of the elements of
4:46
observability that are most applicable to people
4:48
who are using DBT for managing their
4:50
transformations in a SQL context. Yeah,
4:53
yeah. So we
4:55
started elementary a bit over
4:58
two years ago. And we
5:00
saw the revolution
5:02
at the moment that DBT
5:04
is bringing to how people
5:06
build, how it makes things so
5:09
much easier and abstracts so much of the complexity.
5:11
And we felt that when
5:14
it comes to observability, the same kind of
5:16
simplicity needs to apply, the same
5:18
kind of change. And
5:21
we felt that there isn't a
5:23
tool out there that we would use
5:25
if we were building DBT projects, one that
5:28
would make observability really easy. And in terms
5:30
of what your needs are when it
5:32
comes to observability when you have DBT projects,
5:34
I think it has three aspects. The
5:38
first is not unique to DBT. The data
5:40
itself, you need to validate it. You need
5:42
to monitor it. You need to understand if
5:44
there are unexpected changes, if it's really not
5:47
meeting your expectations. There
5:49
is the operational part, which,
5:51
I think, part of what
5:54
makes working with DBT is that it
5:56
lets you, like, take all these small steps
5:58
in your pipeline, and each
6:17
of them is something you
6:21
need visibility into. [inaudible]
6:25
[inaudible]
6:27
[inaudible]
6:30
[inaudible]
6:35
[inaudible]
6:37
[inaudible]
6:39
[inaudible] And the third looks
6:43
more at coverage: a comprehensive plan
6:45
that tries to cover everything,
6:47
[inaudible]
6:50
[inaudible]
6:54
Those are the three
6:56
aspects of the space we really try to help with. And
6:58
for people who are using DBT
7:00
and trying to gain some
7:02
visibility into the overall metrics
7:04
of their project, trying to
7:06
understand what are the things that
7:08
are going well, how can I
7:10
improve, what are the reasons for
7:12
these different failures, what are the
7:14
anomalies that I should deal with: what
7:16
are some of the ad hoc
7:18
or DIY approaches
7:20
that teams are likely to attempt
7:22
in the process of trying to
7:24
obtain those insights? So,
7:29
many teams start from some
7:31
pain point, and they're just
7:33
starting out. They're gonna do things
7:35
like taking the output of
7:38
the DBT runs, taking
7:40
the artifacts, that output you
7:42
can get, [inaudible]
7:49
like sending the logs into
7:51
something like Datadog, or
7:55
taking the artifacts and uploading them
7:57
to the warehouse, because that's where they
7:59
feel comfortable, with SQL,
8:02
and then maybe even work
8:04
with your BI to create some dashboard
8:06
on top of it. We also saw users
8:09
doing stuff like breaking
8:11
down their DBT project to
8:13
run each model as something
8:15
like a different step of the orchestrator,
8:17
just to get
8:20
better observability. So all
8:22
kinds of hacks and some
8:24
teams have a really good setup
8:27
that is working for them. The question
8:29
is really how does it
8:32
hold over time, right? Like how much maintenance
8:34
does it require? How does it
8:36
hold up with version upgrades, with
8:38
changes, with more and more needs, and how does it
8:40
scale? And for
8:42
teams who are scaling
8:45
their usage of DBT, a lot
8:47
of the work
8:49
that the DBT product team is
8:52
focused on is trying to move
8:55
them into the cloud environment as
8:57
a means of getting some of
8:59
that visibility, some of the ease
9:01
of use, developer experience enhancements. And
9:04
I'm curious what you see as some
9:06
of the tension for teams who are
9:08
evaluating that approach of do I just
9:10
go to DBT Cloud and it's going
9:12
to solve all my problems? Or do I
9:14
really like the fact that I have
9:16
full control over all of my project
9:18
because DBT from the CLI is self-hosted,
9:20
I can do whatever I want, I
9:22
don't have to necessarily worry about the
9:24
cost scaling with my usage. I'm just wondering if
9:26
you could talk to some of the tensions
9:29
that teams address in that question
9:31
and maybe some of the ways
9:33
that some of these self-service approaches
9:35
to observability can mitigate
9:37
that potential pain point. Yeah,
9:40
so I think DBT cloud
9:42
has its value and
9:44
I think, as you said,
9:47
a lot of it has to do with user
9:49
experience and the development experience and I
9:51
think they did a
9:53
great job with helping the users
9:55
that are maybe less
9:57
technical and less comfortable with a development
10:00
environment and with working with code,
10:02
helping them to work with it
10:04
very easily. So in terms of scaling
10:06
I think it does work for organizations.
10:09
It really helps for people to onboard
10:11
onto the project, and
10:13
it's very easy to start creating new
10:15
things, and in terms of
10:17
orchestration it makes it easy. And
10:20
in terms of when it comes to observability,
10:23
we still see like a lot of
10:25
the users of elementary use
10:27
DBT Cloud, so it doesn't
10:29
answer their needs, I think. The
10:32
main reason for that is
10:34
that there's a lot it
10:37
can't address. Your
10:39
DBT project won't know all of
10:41
your activity, and there's
10:44
a lot of [inaudible] context outside the
10:46
scope of the project. Eventually, what
10:48
really impacts the health
10:50
of your data and the performance is a
10:52
lot of moving parts. So there's the
10:54
underlying data warehouse and there's the orchestrator
10:56
and there are the sources and there
10:58
are the tools that pull
11:00
data from the warehouse. And there are a
11:03
lot of other elements and
11:05
as long as DBT Cloud
11:07
might look only at a sampling
11:10
of elements of the pipeline, then
11:12
you're still going to miss stuff. And
11:16
on the other side of the
11:19
scale is these generalized data observability
11:21
systems or in some cases people
11:23
will lean on their application observability
11:25
stacks to try and get visibility
11:28
into their overall data platform execution.
11:30
And I'm curious what are
11:32
some of the shortcomings in
11:34
the experience particularly for dbt
11:37
projects that teams are battling
11:39
with and trying to adopt
11:41
these either larger scale or
11:43
more generalized systems for data observability.
11:46
Yeah. So in my past
11:48
I tried to utilize systems
11:51
like these, application monitoring like
11:53
Datadog and Splunk, to
11:55
monitor data. It was
11:57
hard. I think it's similar to the
12:00
DIY solutions we talked about,
12:03
in making those platforms kind of
12:05
work for you when it comes to data
12:07
observability. And then when it
12:09
comes to data observability tools that
12:11
are not built for
12:13
this workflow, what
12:15
drove us to build the
12:18
way we do it is that I
12:20
think that observability has a lot to do with
12:22
usage and with, like,
12:24
investing in implementing
12:27
the practices. It's like,
12:29
it's not a pure tech problem, right?
12:32
It's a tech and people and processes
12:34
problem, and the tool will only take you
12:36
so far. And
12:39
it's kind of like sports: you
12:41
know it's good for you, you
12:43
know you need to work out, but you need a setting
12:45
that is
12:48
comfortable and works for you. Like if
12:50
the gym is not close enough to
12:52
home or anything like that, then you're not
12:54
actually going to do it. So
12:56
we really try to build
12:59
into the
13:01
way you already worked, into your
13:03
workflow, into your development workflow. So
13:08
I think that for other tools in
13:10
the market, the barrier of entry for
13:12
someone who's an analytics engineer is very high.
13:14
If you need to
13:17
set it up, you need permissions, you
13:19
probably need your DevOps
13:21
team or your data platform
13:23
administrators or something to actually
13:25
use it. And then
13:27
you would need to replicate a lot of
13:29
the configuration you already invested in building to
13:31
that tool. And then you
13:34
need to provide context, like,
13:36
this is more of a production
13:38
environment, these you
13:40
should ignore, and this is like
13:42
how frequently you should monitor this pipeline,
13:44
and this is a table that loads
13:47
incrementally. Like, there's a lot of context
13:49
that you need to go and provide, and
13:51
everything is so external to how you work,
13:53
to your code, to your environment, to your
13:55
logic. When you develop, you need to
13:57
like go to a different system and remember to do
13:59
it, and everything is kind
14:01
of scattered all over the place.
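What the consolidated alternative looks like in practice: with a dbt-native package, the monitoring configuration can sit in the same schema.yml as the dbt tests you already write. This is a minimal sketch, assuming elementary's documented anomaly tests; the model and column names are made up:

```yaml
# models/schema.yml (hypothetical model; package tests next to built-in dbt tests)
version: 2
models:
  - name: orders                          # made-up model name
    tests:
      - elementary.volume_anomalies       # flag unexpected row-count changes
      - elementary.freshness_anomalies    # flag late-arriving updates
    columns:
      - name: order_id
        tests:
          - unique                        # plain dbt tests, same file, same workflow
          - not_null
```

Everything runs with the usual dbt test or dbt build, so there is no second system to remember.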
14:05
Or you say, okay, I know DBT
14:08
tests, this is what DBT gives me, and I'm
14:10
gonna stick to it because it's in my workflow. I
14:12
think the big draw of DBT tests is,
14:14
[inaudible], it speaks
14:16
to how easy it is to use
14:19
them, and how people incorporate them to
14:21
improve. So if you end
14:23
up using both DBT tests and an
14:25
external tool, then you get this mess where
14:28
nothing is consolidated and everything is even
14:30
harder to kind of monitor in terms
14:33
of the process. Yeah, so,
14:35
let's see. Another
14:38
big difference is that being
14:40
part of the pipeline kind of gives you power.
14:43
So you can stop the pipeline, then
14:46
you can prevent that data from propagating
14:49
further. You can
14:51
monitor right when your data is loaded.
14:55
So it's like the most timely monitoring
14:57
and also the most efficient one. So
15:00
that was another big incentive of like trying
15:02
to really build into the workflow and build
15:04
into the pipeline. In terms of
15:06
that aspect of embedding
15:08
into the workflow, a
15:11
lot of these more generalized
15:13
observability systems will use the data
15:16
warehouse as their focal point for
15:19
identifying activity, figuring out what are the
15:21
different signals that are going to be
15:23
useful for determining whether everything is healthy,
15:25
particularly if they're trying to do any sort
15:27
of anomaly detection across the data. But
15:31
as you pointed out, that leaves out a whole
15:33
chunk of the
15:35
work that's being done where you only know
15:37
if there's a problem after you've already pushed
15:39
it into production. I'm curious for
15:41
people who are building with DBT, and for the
15:43
case where you are able to embed into
15:46
that development workflow and the CI
15:48
CD workflow, what are some
15:51
of the useful signals for being able
15:53
to raise that early warning to teams
15:55
to say this change that you're
15:57
making is likely to cause these downstream problems.
16:00
And just some of the types of insights that you're
16:02
able to generate for people so that they can reduce
16:06
that cycle time for being able to
16:08
identify and address problems. Yeah,
16:10
so what we see a
16:13
lot of our users do is that
16:15
they work with elementary in
16:17
different environments, just like they work
16:19
with DBT. So they
16:21
have their DBT project, which they run
16:24
in dev, which they run in staging,
16:26
which they run in production, and
16:28
the fact that elementary and your
16:30
monitors and your tests and everything
16:32
are incorporated into your
16:35
DBT project means that you also
16:37
have three elementary environments, equivalent
16:39
to your DBT environments. And
16:43
we see all kinds of deployment, right? That's
16:45
also part of being part
16:48
of your code. You can really have
16:51
the same flexibility. So
16:53
some of our users only use our monitors
16:55
in staging because they only load data
16:57
to production after they validate it in
16:59
staging and see that everything is okay.
17:01
And only then load to production. Some
17:03
others monitor in production:
17:06
they use dbt build, and they
17:08
use all of the elementary tests
17:11
as tests that actually stop the build.
17:14
So when there's a problem, the data
17:16
loads to the table, but everything downstream
17:18
is protected and it doesn't propagate further.
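The "stop the build" behavior described here can be sketched with dbt's standard test severity config: under dbt build, a failing error-severity test causes downstream models to be skipped, while warn lets the run continue. The model name is hypothetical:

```yaml
# models/schema.yml (hypothetical model)
version: 2
models:
  - name: stg_payments
    tests:
      - elementary.volume_anomalies:
          config:
            severity: error   # fail the build so downstream models are skipped
      - elementary.freshness_anomalies:
          config:
            severity: warn    # report the problem, but let the run continue
```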
17:21
Though, yes, sometimes the problem is already in the
17:23
sources, right? So the problem doesn't even start with you,
17:25
because the source has issues. So
17:29
this is kind of how it is today. We
17:32
have some plans around it, like we want to
17:35
provide more options
17:37
around how you can use elementary
17:39
to prevent issues. Right now
17:43
I think we're still in the
17:45
phase where working with the different
17:47
environments is already very valuable. And
17:50
I think a lot of teams that have incorporated
17:52
that successfully into
17:55
their DBT project already got
17:57
a huge benefit in
17:59
reducing the number of incidents they have in
18:01
production. And then for
18:04
that earlier in the development cycle
18:06
problem there are also another set
18:08
of tools that have been developed
18:10
in particular for dbt of
18:13
these various linters pre-commit checks some
18:16
of the best practices and sanity
18:18
checks for the code
18:20
style and the structural elements
18:22
of the dbt project and
18:24
I'm curious how that overlaps
18:26
with these more generalized
18:29
observability and data quality and
18:31
developer quality issues that teams
18:33
are addressing. I think something
18:37
very powerful that happens
18:39
to users when they start using
18:41
elementary heavily is that they
18:43
actually start getting more benefits
18:46
from implementing best
18:48
practices. So when I say
18:51
best practices, I'm thinking of
18:53
assigning owners to the different
18:55
models and the different tests, using
18:58
tags, using descriptions, kind of
19:00
even reducing the amount of tests that
19:04
nobody actually addresses and then adding
19:06
other tests that people actually
19:08
care about. So we see a
19:10
lot of teams, and
19:12
I think the teams that use
19:15
elementary the most heavily,
19:17
also start enforcing this in their development
19:19
process. So they enforce that
19:22
you can't add a new model
19:24
without defining an owner, defining,
19:26
like, which channel its alerts should
19:28
go to, without defining what
19:32
they define as, like, baseline observability,
19:35
so it can be volume anomalies and
19:37
freshness anomalies and schema monitoring and
19:39
things that are like the absolute
19:41
baseline for them. So we actually
19:43
see teams leverage the
19:46
fact that they can enforce those policies
19:48
in their CI to kind
19:50
of maintain a high standard over time. This
19:55
episode is brought to you by Datafold, a
19:58
testing automation platform for data engineers
20:00
that prevents data quality issues from entering
20:02
every part of your data workflow, from
20:04
migration to DBT deployment. Datafold
20:07
has recently launched data replication
20:10
testing, providing ongoing validation for
20:12
source-to-target replication. Leverage
20:14
Datafold's fast, cross-database data diffing
20:16
and monitoring to test your
20:18
replication pipelines automatically and continuously.
20:21
Validate consistency between source and target at
20:23
any scale, and receive alerts about any
20:26
discrepancies. Learn more
20:28
about Datafold by visiting
20:30
dataengineeringpodcast.com/datafold today. And
20:34
digging into the elementary tool chain
20:36
and the technology stack, I'm curious
20:38
if you can talk to some
20:40
of the design aspects that you
20:43
were focused on for the initial
20:45
development process and some of the core
20:48
goals that you're focused on as you
20:50
build out the product, build out the
20:52
open source side of the system and
20:54
some of the ways that you're thinking
20:57
about the specific challenges and problems that
20:59
you're addressing first and foremost, and some
21:01
of the ways that that has evolved
21:03
as you build out more capability. Yeah.
21:08
So our kind
21:10
of main design principle was
21:13
that we want to
21:16
give our users the ability to use
21:18
the product without learning
21:20
anything new, right? But like
21:22
they don't need a learning curve to start
21:24
using elementary. So we needed to really stick
21:26
to the tech they already know and the
21:29
tools they already know, and we needed to
21:31
make it as easy as possible for them
21:33
without any barriers, without relying on
21:36
anyone else. And that was really challenging. So
21:40
we started with a DBT package
21:42
because we're like, that's where they
21:44
live, so we must be part
21:50
of the project. And I don't know,
21:53
did you ever try
21:55
to develop a DBT package or
21:57
something? I haven't done
22:00
my own development of DBT packages. I've looked
22:02
a little bit into them structurally and
22:05
started to consider using them for purposes
22:07
of being able to separate out
22:10
some of the core product
22:13
logic, some of the core business rules around
22:15
a particular product so that that can live
22:18
in the code base of the application where
22:20
that data originates, but I haven't actually gone
22:22
down that path yet. So I'm curious to
22:24
hear your experience of building and
22:26
maintaining DBT packages and some of the
22:28
sharp edges that you've run up against.
22:31
Yeah, so I think at
22:33
first, when you hear about the ecosystem
22:35
of DBT packages, you're like, oh, we
22:37
can just build a plugin, right? But
22:40
DBT package is actually just a DBT
22:42
project. So it's more like another
22:45
project that is kind of attached
22:47
to your own project. It
22:50
means that you're limited to what
22:52
DBT enables in this world.
22:55
DBT wasn't designed
22:57
to facilitate things like this.
23:00
It was designed to facilitate DBT
23:02
projects and data modeling and things
23:05
like that. So it
23:08
was really challenging to do like
23:10
complex engineering there. And I think
23:12
we did some of the... Probably
23:16
some of our team knows the
23:18
DBT code base better than some of the
23:21
developers in DBT because they have to understand
23:24
so well what are the different possibilities that
23:24
are actually exposed to us. We also made
23:28
some contributions to the DBT code, so
23:31
we enabled what we needed.
23:34
But I think it was a
23:36
really good decision. I think we
23:38
paid the DBT engineering price in order to
23:40
build something that is so easy for our
23:42
users to start from. Like a two-minute
23:45
setup. With the code
23:47
they already know, the permissions they already have,
23:50
[inaudible] they have
23:52
everything in there. And
23:54
they can get all the outputs very
23:56
easily: query it in SQL, work with
23:58
their BI tool to analyze it,
24:00
like everything is super simple for them to
24:03
start. And then when we
24:05
move down from there to other needs, like
24:08
visualization and alerting
24:11
and all that, we also try to
24:13
maintain the same principle. So for example,
24:16
we have a UI in the open source
24:18
offering, but you don't need
24:20
a server or anything to run it. You don't
24:22
need to host the UI, basically, so
24:25
some of our users even
24:27
send it over as a
24:30
link. It will not even be hosted anywhere. So
24:32
that was a decision to
24:34
keep things very, very simple and keep
24:37
our users very independent. And
24:39
then as your usage scales and
24:42
your needs scale, and if you get
24:44
to the limits of what you can do with it, we
24:46
give you a cloud solution, and then
24:48
also in the cloud offering
24:50
we still try to keep the same principles
24:54
and keep as much as possible
24:56
on your side, with your server. And
24:59
one of the big benefits is
25:02
building a system, a cloud service, that
25:05
doesn't require access to your
25:07
data. So
25:09
[inaudible]
25:11
[inaudible]
25:14
[inaudible]
25:16
[inaudible] Like
25:18
I mentioned, [inaudible]
25:20
[inaudible] With DBT, we
25:23
kind of came to the same principle
25:26
of removing as much friction as
25:28
possible when you're adopting the tool.
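As a sketch of how small that adoption step is in a dbt project: installing a dbt package is a packages.yml entry followed by dbt deps. The version number here is illustrative; check the package documentation for a current release:

```yaml
# packages.yml in your dbt project (version is illustrative)
packages:
  - package: elementary-data/elementary
    version: 0.16.0
```

After running dbt deps, the package's models and macros run as part of your own project, with the permissions and warehouse connection you already have.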
25:31
To actually make it easy to
25:33
start, to make it easy to adopt it. Another
25:36
interesting aspect of this space right
25:38
now is that DBT was one
25:41
of the earliest entrants that helped
25:43
to define the overall space of
25:45
analytics engineering, and as
25:48
it has grown, it has helped
25:50
to elevate that workflow and
25:53
those capabilities, but now that that success has
25:55
been gained, there are a number of other
25:57
projects that are coming along to try and
25:59
help capitalize on that growth and
26:01
offer additional enhancements or better user
26:04
experience in different aspects. And I'm
26:06
curious as somebody who is so
26:09
deeply integrated into the DBT ecosystem,
26:11
how you're thinking about being able
26:13
to keep your options open
26:16
of also being able to integrate with
26:18
some of those other systems as they
26:20
grow and gain adoption. So thinking things
26:22
like SQL Mesh, Maloy, SDF,
26:25
etc. Yeah.
26:27
So I do believe in
26:30
the power of standards,
26:32
and I think DBT became the de facto
26:35
standard. It's not only the tool itself
26:37
or the framework itself, but also the
26:40
ecosystem around it.
26:42
And I do think
26:45
that today you're going
26:47
to get so much value out of
26:49
other tools in the ecosystem if you
26:51
use DBT. And it
26:53
may seem very hard to
26:55
switch to any other solution, but
26:58
obviously if those solutions get more
27:00
traction and get
27:02
adopted more widely, then an ecosystem
27:04
will be created around them as well.
27:07
And I think, at the end of
27:09
the day, the same
27:11
principle we've applied to DBT will apply
27:14
to other tools as
27:16
well, so, kind of a similar
27:18
workflow. At the end
27:20
of the day, elementary
27:23
runs queries against your datasets,
27:26
SQL queries. So even though today we
27:28
construct them with very
27:31
complicated DBT macros, they can
27:33
still be translated to, like, any
27:35
other, hopefully simpler,
27:38
templating language than Jinja. So I
27:40
think in that case, we do
27:43
try to build generically and
27:45
we are open to adopting
27:47
other solutions, but it's not
27:50
something I see in the near future. We
27:53
like the fact that we're focused and we
27:55
still have a large user base to serve
27:58
being focused on DBT. And
28:00
so for teams who are
28:02
interested in adopting elementary
28:04
for their workflow, I'm curious if
28:06
you can just talk to the
28:08
overall process of setting it up,
28:10
getting it integrated and starting to
28:13
adopt the various capabilities as part
28:15
of the development cycle. Yeah.
28:18
So to the question of,
28:21
I started building a DBT project,
28:24
or I have a DBT project,
28:26
like, when should I start
28:28
using elementary? Yesterday. So when you
28:30
start, at least with a DBT
28:32
package, you
28:34
can really think of it as a
28:36
gradual approach. So you can start
28:38
with a DBT package. It's going to
28:40
take you two minutes. It has like a
28:43
zero friction, zero cost, zero setup, and you're
28:45
going to start getting value. You're going to
28:48
start seeing the outputs and what
28:50
the metrics produce; it's going to give
28:53
you visibility that you didn't have before. And it's
28:55
going to give you the ability
28:57
to do anomaly detection and
28:59
like a vast set of tests that are not
29:02
offered elsewhere in
29:04
the, like, wider DBT
29:06
test ecosystem. And then from
29:09
there, your needs are going
29:11
to start growing. So you're going to start saying, oh,
29:13
I wish I could get alerts
29:16
around this stuff. I wish I could,
29:18
like, route these alerts to different people and
29:20
tag them and leverage all this metadata
29:22
there. I wish I could see
29:25
all these results on a lineage
29:27
graph and go down to the column
29:29
level and see the impact on my
29:32
dashboards. There's a lot
29:35
of room
29:38
to use the capabilities that like help you reduce
29:42
the time to resolution when
29:44
you have an issue or avoid
29:46
doing breaking changes or really
29:49
taking a more proactive approach to
29:52
data issues. And that's where you should consider
29:56
one of our other offerings, like the cloud
29:58
offering or using the CLI. The
30:01
way we run POCs, like
30:04
with users who started off
30:06
in the cloud product, we
30:09
structure it in, like, three phases.
30:11
So first
30:13
we're trying to get them to this baseline
30:16
of observability, that says,
30:18
let's make sure that we prevent all
30:20
the super embarrassing stuff, right? Like
30:23
those things that went undetected before. So
30:25
let's get you to this basic monitoring,
30:27
like freshness and volume and
30:30
schema and uniqueness and nulls. Like, let's
30:32
get you to that level and
30:34
let's talk about the most
30:36
embarrassing incidents you had and see
30:38
that they're covered. The next phase
30:41
is to go deeper. Focus
30:45
on your critical models: let's get more coverage
30:47
on each and try to build a plan
30:49
for that. And lastly, I
30:51
think that
30:53
the last part is getting to the process
30:56
and the enforcement, and how you maintain that
30:58
over time and how you incorporate that
31:00
into your data process and how you
31:02
enforce a governance plan.
31:05
I think it's
31:08
not enough to have like the onboarding
31:11
with elementary, which is really cool. You
31:13
get a lot of value in the first weeks,
31:15
but then a year later your project
31:17
is different than when you started, right?
31:19
So you need a way to maintain that
31:21
over time. So that's like the
31:23
three phases of the project. I think all
31:26
of our open source users are trying to
31:28
incorporate kind of the same phases on their own.
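The enforcement idea described in these phases (no new model without an owner, an alert channel, and baseline tests) can be sketched as a small CI check. The policy below, an owner under meta and a fixed required-test list, is an illustrative choice, not elementary's actual API:

```python
# Hypothetical CI gate: reject models that lack an owner or baseline tests.
# The policy (owner under `meta`, the required test names) is illustrative.
REQUIRED_TESTS = {"elementary.volume_anomalies", "elementary.freshness_anomalies"}

def check_models(models):
    """Return policy violations for a parsed schema.yml `models:` section."""
    errors = []
    for model in models:
        name = model.get("name", "<unnamed>")
        if not model.get("meta", {}).get("owner"):
            errors.append(f"{name}: missing meta.owner")
        # dbt test entries are either bare strings or one-key dicts with config.
        declared = {t if isinstance(t, str) else next(iter(t))
                    for t in model.get("tests", [])}
        missing = REQUIRED_TESTS - declared
        if missing:
            errors.append(f"{name}: missing baseline tests {sorted(missing)}")
    return errors

# Example: one compliant model, one that would fail the gate.
models = [
    {"name": "orders",
     "meta": {"owner": "@data-team"},
     "tests": ["elementary.volume_anomalies",
               {"elementary.freshness_anomalies": {"config": {"severity": "warn"}}}]},
    {"name": "payments", "tests": ["unique"]},
]
for violation in check_models(models):
    print(violation)   # reports only the "payments" model
```

A check like this runs in CI against the repository's schema files (parsed with any YAML loader) and exits nonzero on violations, which is what keeps the standard from eroding a year later.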
31:31
And once somebody is using elementary,
31:33
they're leaning on the insights that
31:36
it's able to provide and incorporate
31:38
that into their development workflow and
31:40
their team review process. I'm curious
31:42
how you've seen that impact the
31:45
overall approach to development, some of the
31:47
ways that it shifts the thinking, some
31:49
of the planning, and just the overall
31:52
experience of working on a DBT project
31:54
in ways that it causes teams to
31:57
either accelerate their delivery pace or to
32:00
change the way that they design their
32:02
systems, etc.
32:12
[inaudible] when they're building major changes,
32:14
when they're building a new model. [inaudible]
32:33
[inaudible]
32:36
[inaudible]
32:43
[inaudible] When we
32:45
have an incident today,
32:47
[inaudible]
32:50
you need to think about,
32:52
like, how can you be prepared and
32:54
how can you proactively
32:56
lower the impact,
32:59
versus just reviewing what happened. [inaudible] Stuff
33:01
like that is just gonna
33:03
keep happening otherwise. [inaudible]
33:10
[inaudible]
33:14
[inaudible]
33:17
[inaudible] the sheer complexity,
33:19
[inaudible]
33:21
[inaudible]
33:23
[inaudible]
33:26
[inaudible]
33:34
[inaudible]
33:43
[inaudible] If people see there has
33:45
been a change, and
33:48
someone sees an issue, they're gonna
33:50
[inaudible] You're gonna be coming
33:53
back to it, and now you
33:57
understand it was wrong and
33:59
you didn't [inaudible]. Another
34:09
way that some of these types
34:11
of tools, in particular the pre-commit
34:13
style checks, but also just the
34:16
tools that bring additional rigor to the
34:18
process, it can, if you're
34:21
not careful in terms of how you implement
34:23
it and roll it out to the team,
34:25
it can actually cause you to either
34:28
stall out in terms of the velocity that you're
34:30
able to build up or it can cause the
34:32
team to discard the tool
34:34
wholesale because they don't want to
34:36
deal with the pain of adapting
34:38
to the practices that it's
34:41
trying to encourage. And I'm curious
34:43
how you are approaching that
34:45
side of the problem as well of making
34:47
sure that the overall
34:50
burden of extra work doesn't
34:52
cause teams who try
34:55
out elementary to say, this is going to add too
34:57
much work to my plate, so I'm just going
34:59
to get rid of it and not bother and
35:01
just ignore all of these issues that
35:03
it's trying to highlight. Yeah.
35:05
So I think one
35:08
thing is that I feel now
35:10
we kind of had the privilege
35:12
that users who come to
35:15
elementary are already
35:18
paying the price for
35:20
not investing in this capability,
35:23
and that's been a very
35:27
good motivator.
35:30
So they kind of already know
35:32
the cost of not investing
35:34
in it, and that over time they need
35:37
to reduce it. When there's a
35:39
significant amount of money you
35:41
pay for it, you realize you must invest in the
35:45
positive steps and not in the negative
35:47
steps. Like, it's better to not
35:50
have fires, and to invest in buildings
35:54
that don't burn, than to keep dealing with
35:56
fires and trying to catch
36:00
them as early as possible all the time.
36:02
So I do think
36:04
that users have more awareness today of
36:08
the return on investment of
36:10
investing in observability. We
36:13
also see this evolve with
36:16
individual users and teams, who
36:18
learn what they should enforce and what's
36:20
working for them and what's not. Actually
36:23
something we're working on now is
36:25
to give them the visibility to
36:29
see which tests
36:31
fail often, and what the
36:33
failure rates and success rates are across
36:35
their tests and anomaly
36:37
monitors, and then
36:39
whether anyone actually
36:41
addresses the failures
36:43
when they happen. And our
36:46
recommendation in general is like if no
36:48
one would address a test
36:50
when it fails, then you shouldn't have
36:52
it. Right, because nobody cares. So
36:55
we help them remove those. We also, if they don't
36:58
have alerts in their workflow yet,
37:00
try to help them build a healthy way of thinking about testing.
37:02
We try to help them make pruning
37:04
decisions, and to see which tests
37:07
actually work for them and which don't,
37:09
so the coverage really serves its role.
37:12
As you have been investing
37:14
in this space of observability
37:18
and developer experience improvement
37:20
and data quality for people who
37:23
are investing in this dbt ecosystem
37:25
and using that as their de
37:28
facto approach for managing transformations. What
37:30
are some of the most interesting
37:32
or innovative or unexpected ways that
37:34
you've seen the elementary tool chain
37:37
used? That's an interesting
37:41
question.
37:45
I think something very cool about
37:47
elementary is that it saves all of its
37:49
output to your warehouse,
37:51
to that elementary schema. And
37:53
then it's accessible to our users, and we saw
37:56
use cases that our users solved
37:59
with it. We
38:01
have seen users use it to do automated
38:04
data warehouse cleanup,
38:06
to kind of
38:09
maintain everything clean and reduce
38:11
costs. And we saw it
38:14
being used for cost
38:16
analysis, to understand
38:18
exactly how much
38:21
each pipeline, each
38:23
business domain, has come to
38:25
cost, to do better spend management.
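Because elementary persists its results as plain tables in the warehouse, ad-hoc jobs like the cleanup and cost analysis mentioned here reduce to ordinary SQL. This sketch uses an in-memory SQLite table with invented names (`run_results`, `execute_seconds`) as a stand-in for a real warehouse schema; it is not elementary's actual table layout.

```python
import sqlite3

# Stand-in for querying persisted run results. Table and column
# names are hypothetical, chosen only for this demo.

def stale_models(conn: sqlite3.Connection, cutoff: str) -> list:
    """Models with no run recorded since `cutoff` are cleanup candidates."""
    rows = conn.execute(
        """
        SELECT model_name
        FROM run_results
        GROUP BY model_name
        HAVING MAX(run_at) < ?
        ORDER BY model_name
        """,
        (cutoff,),
    ).fetchall()
    return [r[0] for r in rows]

def runtime_by_model(conn: sqlite3.Connection) -> dict:
    """Total execution seconds per model, a rough proxy for compute cost."""
    rows = conn.execute(
        "SELECT model_name, SUM(execute_seconds) FROM run_results GROUP BY model_name"
    ).fetchall()
    return {name: total for name, total in rows}
```

The same queries, pointed at the real result tables, are what power the cleanup and cost-analysis use cases described above.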
38:27
So we saw
38:29
a lot of ad-hoc use
38:32
cases that users used elementary
38:34
to solve. An interesting
38:36
use case was migration, where
38:39
we saw users when they were
38:41
migrating between data warehouses with the
38:43
same DBT project, they used elementary to
38:45
run really the exact same tests and
38:48
also monitor the pipeline itself and then compare
38:51
the results they got in
38:54
elementary from two different data warehouses to
38:56
kind of validate the
38:58
migration. And we also saw
39:00
users do things that we didn't expect
39:03
even in data quality, like
39:05
tiered alerting. And
39:07
we have people
39:09
who, instead of alerting
39:12
on every failure, even though they
39:14
observe a lot, will let a test fail in the
39:16
middle of the day, but then create an
39:18
alert only if a test is failing over three
39:20
times in the same week, or
39:24
twice the same day, like kind
39:26
of creating this
39:30
tiered approach to how they test the
39:32
data. And in
39:34
your experience of investing in this
39:36
ecosystem, putting in the engineering time
39:38
and effort to build this suite
39:40
of capabilities and working with end
39:42
users, I'm curious, what are some
39:44
of the most interesting or unexpected
39:46
or challenging lessons that you've learned
39:48
in the process? So
39:52
being a startup founder in general is a
39:54
very humbling experience and
39:56
building a product is a very
39:59
humbling experience. I think
40:01
the second lesson is that you need to be
40:04
very, very, very attentive to
40:06
the users. You need to keep
40:08
experimenting and you need to always listen because
40:11
it's shocking to realize how
40:15
little you can predict what will actually
40:17
make an impact, and
40:20
what users will actually
40:22
react to. So you think
40:24
you know. You
40:26
think you already explored the space
40:28
and you think you understand it. But
40:31
you keep having surprises, whether
40:33
positive ones or negative ones. So
40:35
I think every time we sat down
40:37
and built things without getting
40:40
enough feedback, without experimenting,
40:42
putting things out, and
40:45
getting feedback, it's really, it's
40:48
always been a mistake. So that's something we
40:50
keep in mind. I can
40:52
say even when we started elementary, we
40:54
were very, very focused on the anomaly
40:56
detection part and the data
40:58
reliability part. And then we
41:01
actually created a lot of the metadata
41:04
tables and all that for ourselves. So
41:07
it moved kind of to
41:09
the side, we added that information on the
41:11
side. And then we found that most
41:14
of the users actually adopted elementary for
41:16
that and then discover that we have
41:18
anomaly detection and adopt that. So that's
41:20
like just an example of
41:22
a super positive surprise that we
41:28
had no way of predicting.
41:30
You know, like that became a super
41:32
big part of the product. And
41:36
for teams who are
41:38
building their DBT projects
41:41
and they're trying to improve
41:43
their overall productivity and uptime
41:46
and capabilities, what are the
41:48
cases where elementary is the
41:50
wrong choice? So obviously
41:52
if you don't work heavily with
41:54
DBT and you don't have like
41:56
critical workloads on DBT, then
41:58
it's really not a fit. Also I think
42:01
we did see some teams out
42:03
there that hadn't incorporated
42:06
DBT yet, and adopting elementary
42:08
was a trigger for change for them. Like
42:11
the recording program that we did, and
42:13
I think it's very interesting. So,
42:19
you know, I think we're
42:22
looking to be part of that. And
42:24
as you continue to build and iterate on
42:49
the technology and the product, what are some
42:51
of the things you have planned for the
42:53
near to medium term or any projects
42:55
or problem areas you're excited to explore? Yeah,
42:59
that's always
43:01
a big question in a startup,
43:03
right, because things change so rapidly. So
43:06
we always, we're very open with our
43:08
users that we only have a
43:10
roadmap that goes a quarter out, max,
43:13
but that's also an opportunity, because they
43:15
have a lot of impact. And once
43:17
we build it, the feedback from them is super
43:20
valuable. I think the main dilemma we
43:22
faced, and I think we will probably keep
43:24
facing it as we grow,
43:26
is should we go wide or should
43:28
we go deep? Like the question
43:30
maybe you asked me before about platforms
43:33
other than DBT, other
43:35
frameworks out there; you asked
43:37
about teams where elementary would be
43:40
the wrong choice for them, some teams that are not using
43:42
DBT. So in terms of the
43:44
problems, with all the users, with the
43:46
teams, with the tools, should
43:48
we go wider or should we go deeper?
43:50
And our lesson so far was that we
43:52
should really, we're at our best when we're very
43:54
close to our users and when we go deep. So that's
43:57
the path we're on. We're
44:01
using the same technology as the
44:03
previous one. At
44:10
the moment we're focusing on three
44:12
areas. So
44:15
we're trying to learn how our users
44:17
decide what to monitor. And
44:20
we look at the testing they have and we ask
44:22
them and we try to understand the decision making process.
44:25
So we can make it easier for
44:27
them moving forward and really automate it, and
44:29
we do have a lot of
44:32
inspiration there. We also see
44:34
that there may be trouble around communication of
44:37
data health and data issues. So
44:40
kind of the people processes part of the
44:42
problem. We can still make a
44:44
lot of progress there and help them with
44:46
that. And then we keep
44:50
kind of trying to measure what's
44:52
the time to resolution when they
44:55
do have incidents. And we're trying to make
44:57
a positive impact there. We also have
44:59
a lot of ideas and areas that
45:02
we're exploring on that area. But
45:04
if you're a user of elementary,
45:06
we're going to keep making
45:09
observability easier for you. And
45:12
we're going to keep refusing your requests for
45:14
us to solve other
45:16
problems in the space. Although we want to solve
45:18
them, we're just not there yet. And
45:22
are there any other aspects of
45:24
the overall space of data observability
45:26
for DBT projects, the work that
45:28
you're doing at elementary, or
45:30
some of the ways that you see
45:32
this overall challenge of data quality, data
45:34
observability evolving as the ecosystem grows and
45:36
matures that we didn't discuss yet that
45:39
you'd like to cover before we close
45:41
out the show? Yeah,
45:44
I think this
45:46
whole ecosystem is still growing. And
45:49
I think there was a phase
45:51
of doing more and more.
45:53
And now people are
45:56
trying to consolidate, do less, and
45:58
be more focused on the value
46:00
of things. I think that with
46:03
observability we need to be able
46:05
to support that process
46:07
and do the same. So help them
46:09
with priorities and understanding
46:11
what's actually critical and reducing the noise and
46:14
helping them know what is actually important. And
46:17
I think that there's
46:19
a big gap in analytics,
46:21
maybe, which is the depth of business context
46:29
that people have. And that's
46:33
just not something we can
46:35
ever automate probably. Sometimes
46:37
we see users' tests and we
46:39
have no idea why they decided to
46:42
add them or why they decided to model their data
46:44
in a certain way and then we ask them and
46:46
it becomes super clear. But we still
46:48
need that context, we still need to ask
46:51
them. So we won't
46:53
have a magical
46:55
AI bot that could
46:58
replace that context,
47:01
and the big question is how
47:04
we can create the
47:06
interface and how we feed that
47:09
context in to give the best advice
47:12
possible and the coverage that they need.
47:14
And the coverage that works for them and the
47:16
coverage that really supports their role. So that's
47:19
an area to make a big progress. And I
47:21
think other
47:26
domains in data if they'll be
47:28
able to create better interfaces for
47:30
users to input context and
47:32
make use of it in their
47:34
workflow, then that's definitely going to
47:36
create progress. And
47:39
maybe someday someone will
47:41
figure out time zones and their differences,
47:44
which create so many data
47:46
quality problems, but I think that's
47:48
just too far ahead. We're not
47:51
there yet in terms of technology.
47:54
Everybody just needs to use UTC all the
47:56
time. Yeah, yeah, that's not
47:58
going to happen, I think, I'm afraid.
48:01
Unfortunately not. All right.
48:04
Well, for anybody who wants to get in touch with you
48:06
and follow along with the work that you and your team
48:08
are doing, I'll have you add your preferred contact information to
48:10
the show notes. And as the final question,
48:12
I'd like to get your perspective on what you see
48:14
as being the biggest gap in the tooling or technology
48:16
that's available for data management today. Yeah.
48:19
So I think going back to
48:21
that context question, how
48:24
can we make it easy
48:26
for people to share
48:29
why they made the decision they made? And
48:32
why they made the decision
48:34
they made in data observability, why they made
48:37
the decision they made in documenting
48:39
or not documenting something. If
48:42
things would make more sense to the
48:44
new members on your team and to
48:46
your stakeholders and to everyone you
48:49
collaborate with and even to the
48:51
vendors you work with, right? Like if we'll
48:53
have more context from our
48:55
users about what drove their decisions,
48:57
then we could give them better
48:59
advice and better outcomes.
49:02
And that's still something that I
49:05
don't think anyone figured out. Like
49:07
how can we communicate better
49:10
around kind of
49:12
the decisions and the design patterns that we chose
49:14
and why we really chose them. All
49:17
right. Well, thank you very much for
49:20
taking the time today to join me
49:22
and share your work on elementary and
49:24
share your experience and perspective on the
49:27
overall space of data observability for DBT
49:29
projects. It's definitely a very interesting
49:31
and complex problem area. So I appreciate the time
49:34
and energy that you and your team are putting
49:36
into helping to solve for that. And I hope
49:38
you enjoy the rest of your day. Yeah.
49:41
Thank you for having me. And also
49:43
I hope listeners enjoy and I do
49:45
want to point out that English is
49:47
my third language. So
49:49
I hope people
49:52
would forgive my
49:54
mistakes and enjoy listening. Thank
50:03
you for listening. Don't forget to check
50:05
out our other shows, Podcast.__init__, which covers
50:07
the Python language, its community, and the
50:09
innovative ways it is being used, and
50:11
the Machine Learning Podcast, which helps you
50:14
go from idea to production with machine
50:16
learning. Visit the site at dataengineeringpodcast.com to
50:18
subscribe to the show, sign up for
50:20
the mailing list, and read the show
50:22
notes. And if you've learned something or
50:24
tried out a project from the show, then tell us about it.
50:27
Email hosts at dataengineeringpodcast.com
50:29
with your story. And
50:31
to help other people find the show, please leave a
50:33
review on Apple Podcasts.