Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

Released Sunday, 31st March 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

0:11

Hello, and welcome to the Data Engineering

0:13

Podcast, the show about modern data management. Data

0:16

lakes are notoriously complex. For

0:19

data engineers who battle to build and

0:21

scale high quality data workflows on the

0:23

data lake, Starburst powers petabyte-scale SQL analytics

0:26

fast, at a fraction of the cost

0:28

of traditional methods, so that you can

0:30

meet all of your data needs, ranging

0:32

from AI to data applications to complete

0:35

analytics. Trusted by teams of all sizes,

0:37

including Comcast and DoorDash, Starburst is a

0:39

data lake analytics platform that delivers the

0:42

adaptability and flexibility a lakehouse ecosystem

0:44

promises. And Starburst does

0:46

all of this on an open architecture,

0:49

with first-class support for Apache Iceberg, Delta

0:51

Lake, and Hudi, so you

0:53

always maintain ownership of your data. Want

0:56

to see Starburst in action?

0:58

Go to dataengineeringpodcast.com slash

1:00

starburst and get $500 in credits

1:02

to try Starburst Galaxy today, the easiest

1:04

and fastest way to get started using

1:06

Trino. Dagster offers a

1:09

new approach to building and running

1:11

data platforms and data pipelines. It

1:13

is an open-source, cloud-native orchestrator for

1:15

the whole development lifecycle, with integrated

1:18

lineage and observability, a declarative programming

1:20

model, and best-in-class testability. Your

1:23

team can get up and running

1:25

in minutes thanks to Dagster Cloud,

1:27

an enterprise-class hosted solution that offers

1:29

serverless and hybrid deployments, enhanced security,

1:31

and on-demand ephemeral test deployments. Go

1:34

to dataengineeringpodcast.com/dagster today to get started,

1:36

and your first 30 days are

1:38

free. Your host is Tobias

1:41

Macy, and today I'm interviewing Maayan Salom

1:43

about how to incorporate observability into a

1:45

DBT-oriented workflow and some of the ways

1:47

that elementary can help. So, Maayan, can

1:49

you start by introducing yourself? Yeah,

1:52

sure. So, happy to be here.

1:54

I'm Maayan. My Starbucks

1:56

name is Maya. It's much easier to

1:58

pronounce. I'm the CEO... and I'm

2:00

a co-founder of elementary. Some

2:02

people know us as elementary data. I've

2:05

been in data roles for 12 years before

2:09

starting elementary, mainly in

2:11

a cybersecurity company. I

2:13

actually got into data much earlier because

2:16

I was a kid that was obsessed with sports. [partly unintelligible] my dad wanted a boy, and when he didn't get a boy, he got me involved anyway, and that kept up all the way until I reached an age where I could decide on my own. So that's how I started.

2:38

I was working on critical data pipelines, and all of them broke in various ways. Very slowly, people started to treat it as something very important.

2:50

And then later on, it was much

2:52

bigger, more complicated tasks as well. So

2:54

that's what got me started with elementary. You

2:57

mentioned already how you first got interested in working

2:59

in data. I'm wondering if you can just

3:01

give a bit of the sense of what

3:03

it is about the space that has kept

3:05

you interested and why you want to focus

3:08

your time and energy on that problem space.

3:11

So I think in general, I

3:13

have a big passion for data.

3:16

It's like the kind of the right way

3:18

to make decisions. And I

3:20

think everyone who's a data professional probably feels that in many aspects of their life, not just in their professional life. And

3:29

it's something you trust, right?

3:31

When the data is right, you're gonna make great decisions. And when you can't trust it, when you see the way stats are sometimes used in the media, maybe to kind of create wrong messages, then it may break your heart. So it's

3:50

a very frustrating thing working intensely

3:52

with data. When I was

3:54

in my last role before elementary,

3:56

I was doing cybersecurity incident response. There's

4:01

like a big crisis that you're there

4:03

to solve. And it's time sensitive.

4:06

There's a lot of pressure and you need to be very,

4:08

very accurate with everything. There's a lot of consequences. And

4:10

just the amount of time we spend there on

4:13

validating and revalidating and

4:15

trying to understand

4:18

if everything is okay was just so frustrating.

4:20

It sounds like something that I want to

4:22

focus on and solve. And

4:28

now digging into the question

4:30

of observability and in particular

4:32

for DBT projects, data observability

4:35

started coming to the fore

4:37

in the data space maybe

4:39

two or three years ago.

4:42

And I'm just wondering if you can

4:44

talk to some of the elements of

4:46

observability that are most applicable to people

4:48

who are using DBT for managing their

4:50

transformations in a SQL context. Yeah,

4:53

yeah. So we

4:55

started elementary a bit over

4:58

two years ago. And we

5:00

saw the revolution that DBT was bringing to how people build, how it makes things so much easier and abstracts so much of the complexity.

5:11

And we felt that when it comes to observability, the same kind of simplicity needs to apply, the same kind of change. And

5:21

we felt that there wasn't a tool out there that we would use if we were building DBT projects, one that would make observability really easy. And in terms of what your needs are when it comes to observability, when you have DBT projects, I think it has three aspects. The

5:38

first is not unique to DBT. The data

5:40

itself, you need to validate it. You need

5:42

to monitor it. You need to understand if

5:44

there are unexpected changes, if it really doesn't match your expectation. There

5:49

is the operational part, which

5:51

I think part of what makes working with DBT work so well is that it lets you, like, take all these small steps in your pipeline, and each... [a long stretch here is unintelligible in the source transcript] ...a more comprehensive plan that tries to cover everything. Those are the three aspects that we really try to help with. And

6:58

for people who are using DBT, trying to gain some visibility into the overall metrics of their project, trying to understand what are the things that are going well, how can I improve, what are the reasons for these different failures, what are the anomalies that I have to deal with: what are some of the ad hoc or DIY approaches that teams are likely to attempt in the process of trying to obtain those insights? So

7:29

many teams start slow, and they're just going to do things like taking the output of DBT, the artifacts it produces, and managing that output themselves. Like sending it as logs into something like Datadog, or taking it and uploading it to the warehouse, because that's where they feel comfortable, with SQL, and then maybe even working with their BI tool to create some dashboards on top of it. We also saw users

8:09

doing stuff like breaking

8:11

down even their DBT project to

8:13

run each model as a separate task in an orchestrator, to get better observability.
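As a concrete illustration of that DIY pattern, a team might hack together an artifact upload with a dbt on-run-end hook like the following minimal sketch. The macro, target table, and columns are hypothetical illustrations of the approach, not Elementary's actual implementation:

```sql
-- macros/upload_run_results.sql (hypothetical DIY sketch)
-- Wired up in dbt_project.yml with:
--   on-run-end: "{{ upload_run_results(results) }}"
{% macro upload_run_results(results) %}
  {% if execute %}
    {% for res in results %}
      {# One insert per node: fine as a hack, slow at scale #}
      {% do run_query(
        "insert into analytics.dbt_run_results
           (invocation_id, node_id, status, execution_time, generated_at)
         values ('" ~ invocation_id ~ "', '"
                ~ res.node.unique_id ~ "', '"
                ~ res.status ~ "', "
                ~ res.execution_time ~ ", current_timestamp)"
      ) %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

Uploading run results from inside the pipeline itself is broadly the mechanism that a package-based tool can automate, which comes up again later in the conversation.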

8:22

So, all kinds of hacks. And some

8:24

teams have a really good setup

8:27

that is working for them. The question

8:29

is really how does it

8:32

hold over time, right? Like how much maintenance

8:34

does it require? How does it

8:36

hold with version upgrades, when something changes, when there are more and more needs, and how does it scale? And for

8:42

teams who are scaling

8:45

their usage of DBT, a lot

8:47

of the work

8:49

that the DBT product team is

8:52

focused on is trying to move

8:55

them into the cloud environment as

8:57

a means of getting some of

8:59

that visibility, some of the ease

9:01

of use, developer experience enhancements. And

9:04

I'm curious what you see as some

9:06

of the tension for teams who are

9:08

evaluating that approach of do I just

9:10

go with DBT cloud and they're going to solve all my problems? Or do I

9:14

really like the fact that I have

9:16

full control over all of my project

9:18

because DBT from the CLI is self-hosted,

9:20

I can do whatever I want, I

9:22

don't have to necessarily worry about the

9:24

cost scaling with my usage. I'm just wondering if

9:26

you could talk to some of the tensions

9:29

that teams address in that question

9:31

and maybe some of the ways

9:33

that some of these self-service approaches

9:35

to observability can mitigate

9:37

that potential pain point. Yeah,

9:40

so I think DBT cloud

9:42

has its value and

9:44

I think, as you said, a lot of it has to do with user

9:49

experience and the development experience and I

9:51

think they did a

9:53

great job with helping the users

9:55

that are maybe less technical and less comfortable with a development environment, helping people who didn't work with code in the past to work with it very easily. So in terms of scaling

10:06

I think it does work for organizations.

10:09

It really helps people to onboard onto the project, and it's very easy to start creating new things and to get orchestration easily. And

10:20

when it comes to observability, we still see that a lot of the users of elementary use dbt cloud, so it doesn't answer those needs, I think. The

10:32

main reason for that is because you have an entire pipeline that you need to address. Your DBT project is not all of your activity there, and there's a lot that sits outside of it which eventually really impacts the health of your data and the performance. There are a lot of moving parts. So there's the

10:54

underlying data warehouse and there's the orchestrator

10:56

and there are the sources and there

10:58

are the tools that pull

11:00

data from the warehouse. And there are a

11:03

lot of other elements and

11:05

as long as dbt cloud looks only at a sampling of the elements of the pipeline, then you're still going to miss stuff. And

11:16

on the other side of the

11:19

scale is these generalized data observability

11:21

systems or in some cases people

11:23

will lean on their application observability

11:25

stacks to try and get visibility

11:28

into their overall data platform execution.

11:30

And I'm curious what are

11:32

some of the shortcomings in

11:34

the experience particularly for dbt

11:37

projects that teams are battling

11:39

with and trying to adopt

11:41

these either larger scale or

11:43

more generalized systems for data observability.

11:46

Yeah. So in my past

11:48

I tried to utilize systems

11:51

like this, application monitoring like Datadog and Splunk, to monitor data. It was hard. I think it's either the DIY solutions we talked about, or making those platforms kind of work for you when it comes to data observability. And then when it

12:09

comes to data observability tools that

12:11

are not built for

12:13

this workflow, what

12:15

draws us to build the

12:18

way we do it is that I think observability has a lot to do with usage, and with investing in implementing the practices. It's not a pure tech problem, right? It's a tech and people and processes problem, and the tooling can only take you so far. And

12:41

know it's good for you. You

12:43

know you need to work out, but that is your chance

12:45

to set in that is

12:48

comfortable and work for you. Like if

12:50

the gym is not close enough to

12:52

home or anything like that, then you're not

12:54

actually going to do it. So

12:56

we really try to build

12:59

into the

13:01

way you already work, into your

13:03

workflow, into your development workflow. So

13:08

I think that for other tools in

13:10

the market, the barrier of entry for

13:12

someone who's an analytics engineer is very high. If you need to set it up, you need permissions, you probably need your DevOps

13:21

team or your data platform

13:23

administrators or something to actually

13:25

use it. And then

13:27

you would need to replicate a lot of

13:29

the configuration you already invested in building to

13:31

that tool. And then you need to configure it: this is my production environment, these you should ignore, this is how frequently you should monitor this pipeline, and this is a table that loads incrementally. Like, there's a lot of context that you need to kind of provide, and everything is so external to how you work,

13:53

to your code, to your environment, to your

13:55

logic. When you develop, you need to

13:57

like go to a different system and remember to do it, and everything is kind of scattered all over the place.

14:05

Or you say, okay, I know DBT tests, this is what I know, and I'm gonna stick to it because it's familiar. I think the adoption of DBT tests speaks to how easy they are to use, and to how they're incorporated right where you work. So if you end

14:23

up with using both DBT test and an

14:25

external tool, then you get this mess of

14:28

nothing is consolidated and everything is even

14:30

harder to kind of monitor in terms

14:33

of the process. Yeah, so,

14:35

let's see. Another

14:38

big difference is that being

14:40

part of the pipeline kind of gives you power.

14:43

So you can stop the pipeline, then

14:46

you can prevent that data from propagating

14:49

further. You can monitor right when your data is loaded.

14:55

So it's like the most timely monitoring

14:57

and also the most efficient one. So

15:00

that was another big incentive of like trying

15:02

to really build into the workflow and build

15:04

into the pipeline. In terms of

15:06

that aspect of embedding

15:08

into the workflow, a

15:11

lot of these more generalized

15:13

observability systems will use the data

15:16

warehouse as their focal point for

15:19

identifying activity, figuring out what are the

15:21

different signals that are going to be

15:23

useful for determining whether everything is healthy,

15:25

particularly if they're trying to do any sort

15:27

of anomaly detection across the data. But

15:31

as you pointed out, that leaves out a whole

15:33

chunk of the

15:35

work that's being done where you only know

15:37

if there's a problem after you've already pushed

15:39

it into production. I'm curious for

15:41

people who are building with DBT, and for the

15:43

case where you are able to embed into

15:46

that development workflow and the CI

15:48

CD workflow, what are some

15:51

of the useful signals for being able

15:53

to raise that early warning to teams

15:55

to say this change that you're

15:57

making is likely to cause these downstream problems.

16:00

And just some of the types of insights that you're

16:02

able to generate for people so that they can reduce

16:06

that cycle time for being able to

16:08

identify and address problems. Yeah,

16:10

so what we see a

16:13

lot of our users do is that

16:15

they work with elementary in

16:17

different environments, just like they work

16:19

with DBT. So they

16:21

have their DBT project, which they run

16:24

in dev, which they run in staging,

16:26

which they run in production, and

16:28

the fact that elementary, your monitoring, your testing, and everything is incorporated into your DBT project means that you also have three elementary environments, equivalent to your DBT environments. And

16:43

we see all kinds of deployment, right? That's

16:45

also part of being part

16:48

of your code. You can really have

16:51

the same flexibility.
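To sketch why the environments line up: Elementary's models are part of the dbt project and get routed to their own schema, which dbt resolves under whichever target you run. The config below follows the package's documented setup, but verify the details against your version:

```yaml
# dbt_project.yml -- hedged sketch of per-environment Elementary schemas
models:
  elementary:
    +schema: elementary
    # With dbt's default schema resolution, each target gets its own copy:
    #   --target dev     -> <dev_schema>_elementary
    #   --target staging -> <staging_schema>_elementary
    #   --target prod    -> <prod_schema>_elementary
```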

16:53

So some of our users only use our monitors

16:55

in staging because they only load data

16:57

to production after they validate it in

16:59

staging and see that everything is okay.

17:01

And only then load to production. Others monitor production: they use DBT build, and they use all of the elementary tests as tests that actually stop the pipeline. So if there's a problem, it only loads to the table where it's detected and doesn't propagate further.
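A minimal sketch of what such a pipeline-stopping test can look like; the test name follows Elementary's documented convention, but treat the exact arguments as something to verify against your installed version:

```yaml
# models/schema.yml -- hedged sketch
models:
  - name: stg_orders
    tests:
      - elementary.volume_anomalies:
          timestamp_column: created_at
          config:
            # With severity error, a failure fails the node during `dbt build`,
            # so downstream models are skipped and bad data can't propagate.
            severity: error
```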

17:21

Though, yes, sometimes the problem is already in the sources, right? So the problem doesn't even start with you, because the source has issues. So

17:29

this is kind of how it is today. We

17:32

have some plans around it, like we want to

17:35

provide more options

17:37

around how you can use elementary

17:39

to prevent issues. Right now

17:43

I think we're still in the

17:45

phase where working with the different

17:47

environments is already very valuable. And

17:50

I think a lot of teams that have incorporated that successfully into their DBT project already got a huge benefit in reducing the number of incidents they have in

18:01

production. And then for

18:04

that earlier in the development cycle

18:06

problem there are also another set

18:08

of tools that have been developed

18:10

in particular for dbt of

18:13

these various linters pre-commit checks some

18:16

of the best practices and sanity

18:18

checks for the code

18:20

style and the structural elements

18:22

of the dbt project and

18:24

I'm curious how that overlaps

18:26

with these more generalized

18:29

observability and data quality and

18:31

developer quality issues that teams

18:33

are addressing. I think something

18:37

very powerful that happens

18:39

to users when they start using

18:41

elementary heavily is that they

18:43

actually start getting more benefits

18:46

from implementing best

18:48

practices. So when I say best practices, I'm thinking of assigning owners to the different models and to the different tests, using tags, using descriptions, even reducing the amount of tests that nobody actually addresses and then adding other tests that people actually care about. So we see a lot of that, and I think the teams that use elementary at the highest level also started enforcing it in their development process. So they started enforcing that

19:22

you can't add a new model without defining an owner, without defining which channel alerts should go to, without defining what they consider, like, baseline observability,

19:35

so it can be volume anomalies and

19:37

freshness anomalies and schema monitoring and

19:39

things that are like the absolute

19:41

baseline for them. So we actually

19:43

see teams leverage the

19:46

fact that they can enforce those policies

19:48

in their CI to kind of maintain a high standard over time.
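The kind of per-model policy being described could look roughly like this; the meta keys follow Elementary's documented alert-routing conventions, but the exact names it reads can vary by version, so treat this as a sketch:

```yaml
# models/schema.yml -- hedged sketch of "no model without an owner"
models:
  - name: fct_orders
    description: "One row per completed order."
    meta:
      owner: "@data-platform"   # who is responsible for failures
      channel: "orders-alerts"  # where alerts for this model get routed
    tags: ["critical"]
    tests:
      - elementary.volume_anomalies
      - elementary.freshness_anomalies
      - elementary.schema_changes
```

A CI check can then fail any pull request that adds a model missing the owner, the channel, or that baseline set of tests.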

19:55

This episode is brought to you by Datafold, a testing automation platform for data engineers

20:00

that prevents data quality issues from entering

20:02

every part of your data workflow, from

20:04

migration to DBT deployment. Datafold

20:07

has recently launched data replication

20:10

testing, providing ongoing validation for

20:12

source-to-target replication. Leverage

20:14

Datafold's fast, cross-database data diffing

20:16

and monitoring to test your

20:18

replication pipelines automatically and continuously.

20:21

Validate consistency between source and target at

20:23

any scale, and receive alerts about any

20:26

discrepancies. Learn more

20:28

about Datafold by visiting

20:30

dataengineeringpodcast.com/datafold today. And

20:34

digging into the elementary tool chain

20:36

and the technology stack, I'm curious

20:38

if you can talk to some

20:40

of the design aspects that you

20:43

were focused on for the initial

20:45

development process and some of the core

20:48

goals that you're focused on as you

20:50

build out the product, build out the

20:52

open source side of the system and

20:54

some of the ways that you're thinking

20:57

about the specific challenges and problems that

20:59

you're addressing first and foremost, and some

21:01

of the ways that that has evolved

21:03

as you build out more capability. Yeah.

21:08

So our kind

21:10

of main design principle was

21:13

that we want to

21:16

give our users the ability to use

21:18

the product without learning

21:20

anything new, right? But like

21:22

they don't need a learning curve to start

21:24

using elementary. So you need to really stick

21:26

to the tech they already know and the

21:29

tools they already know, and you need to

21:31

make it as easy as possible for them

21:33

without any barriers, without relying on

21:36

anyone else. And that was really challenging. So

21:40

we started with a DBT package

21:42

because we're like, that's where they

21:44

live, so we must be part

21:50

of the project. And I don't know,

21:53

did you ever try

21:55

to develop a DBT package or

21:57

something? I haven't done my own development of DBT packages. I've looked

22:02

a little bit into them structurally and

22:05

started to consider using them for purposes

22:07

of being able to separate

22:10

some of the core product definitions, some of the core business rules around

22:15

a particular product so that that can live

22:18

in the code base of the application where

22:20

that data originates, but I haven't actually gone

22:22

down that path yet. So I'm curious to

22:24

hear your experience of building and

22:26

maintaining DBT packages and some of the

22:28

sharp edges that you've run up against.

22:31

Yeah, so at first, when we came to this from building other kinds of systems, you think of a DBT package as, oh, we can just build a plugin, right? But a DBT package is actually just a DBT project. So it's more like another project that is kind of attached to your own project. It

22:50

means that you're limited to what

22:52

DBT is building in this world.

22:55

DBT wasn't designed to facilitate things like this.

23:00

It was designed to facilitate DBT

23:02

projects and data modeling and things

23:05

like that. So it

23:08

was really challenging to do like

23:10

complex engineering there. And I think

23:12

we did some of the... Probably

23:16

some of our team knows the

23:18

DBT code base better than some of the

23:21

developers in DBT because they have to understand

23:24

so well what are the different possibilities that

23:26

I'm actually exposed to. We also made

23:28

some contributions to DBT Core so we could enable what we needed.

23:34

But I think it was a

23:36

really good decision. I think we

23:38

paid the DBT engineering price in order to

23:40

build something that is so easy for users to start from. Like a two-minute

23:45

set up. With the code

23:47

they already know, the permissions they already have,

23:50

the platforms they already have, to collect everything in there. And they can get all the outputs very easily, query it in SQL, work with their BI tool to analyze it,

24:00

like everything is super simple for them to

24:03

start. And then when we

24:05

move down from there to other needs, like

24:08

visualization and alerting

24:11

and all that, we also try to

24:13

maintain the same principle. So for example,

24:16

we have a UI in the open source

24:18

offering, but you don't need

24:20

a server or anything to run it. You don't

24:22

need to host the UI, basically. Some of our users even send it around as a link to a file; it's not hosted anywhere. So

24:32

that was a decision to

24:34

keep things very, very simple and keep

24:37

our users very independent. And

24:39

then as your usage scales and your needs scale, if you get to the limits of what you can do with it, we give you a cloud solution. And in the cloud offering we still try to keep the same principles and keep as much as possible on your side, with your server. And

24:59

one of the big benefits is building a system, a cloud service, that doesn't require access to your data. [partly unintelligible] Like I mentioned, it's really only the metadata schema that you are dealing with. So we

25:23

kind of came to the same principle

25:26

of removing as much friction as

25:28

possible when you're adopting the tool.

25:31

To actually make it easy to

25:33

start, to make it easy to adopt it. Another

25:36

interesting aspect of this space right

25:38

now is that DBT was one

25:41

of the earliest entrants that helped

25:43

to define the overall space of

25:45

analytics engineering, and as

25:48

it has grown, it has helped

25:50

to elevate that workflow and

25:53

those capabilities, but now that that success has

25:55

been gained, there are a number of other

25:57

projects that are coming along to try and

25:59

help capitalize on that growth and

26:01

offer additional enhancements or better user

26:04

experience in different aspects. And I'm

26:06

curious as somebody who is so

26:09

deeply integrated into the DBT ecosystem,

26:11

how you're thinking about being able

26:13

to keep your options open

26:16

of also being able to integrate with

26:18

some of those other systems as they

26:20

grow and gain adoption. So thinking things

26:22

like SQL Mesh, Malloy, SDF,

26:25

etc. Yeah.

26:27

So I do believe in the power of standards, and I think DBT became the de facto standard. It's not only the tool itself or the framework itself, but also the ecosystem around it. And I do think that today you're going to get so much value out of other tools in the ecosystem when you use DBT. And it may seem very hard to switch to any other solution, but obviously if other solutions get more traction and get adopted more widely, then an ecosystem will be created around them too.

27:07

And I think, at the end of the day, the same principle that applied to DBT will apply to other tools as well, with kind of a similar workflow. At the end

27:20

of the day, elementary runs queries against your datasets, SQL queries. So while today we construct them with very complicated DBT macros, they can still be translated to any other, hopefully simpler, templating language than Jinja. So I

27:40

think in that case, we do

27:43

try to build generically and

27:45

we are open to adopting

27:47

other solutions, but not

27:50

something I see in the near future. We

27:53

like the fact that we're focused and we

27:55

still have a large user base to serve

27:58

being focused on DBT. And

28:00

so for teams who are

28:02

interested in adopting elementary

28:04

for their workflow, I'm curious if

28:06

you can just talk to the

28:08

overall process of setting it up,

28:10

getting it integrated and starting to

28:13

adopt the various capabilities as part

28:15

of the development cycle. Yeah.

28:18

So the question of

28:21

I started building a DBT project

28:24

or I have a DBT project,

28:26

like, when should I start using elementary? Yesterday. So when you

28:30

start, at least with a DBT

28:32

package, you

28:34

can really think of it as a

28:36

gradual approach. So you can start

28:38

with a DBT package. It's going to

28:40

take you two minutes. It has like a

28:43

zero friction, zero cost, zero setup, and you're

28:45

going to start getting value. You're going to start seeing the output.
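For reference, the two-minute setup being described amounts to installing the package; the version pin below is illustrative, so check Elementary's docs for the current release:

```yaml
# packages.yml -- hedged sketch of the install
packages:
  - package: elementary-data/elementary
    version: 0.16.1  # illustrative pin; use the current version
```

After that, `dbt deps` pulls the package and a normal `dbt run` (or `dbt run --select elementary` for just its models) starts writing results to the elementary schema, assuming a schema config like the one sketched earlier is in place.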

28:50

What the metrics produce is going to give you visibility that you didn't have before. And it's

28:55

going to give you the ability

28:57

to do anomaly detection and

28:59

like, a vast set of tests that are not offered in the wider DBT test ecosystem. And then from

29:09

there, your needs are going to start growing. So you're going to start saying, oh, I wish I could get alerts around this stuff. I wish I could route these alerts to different people, and tag them, and leverage all this metadata. I wish I could see

29:25

all these results on a lineage

29:27

graph and go down to the column

29:29

level and see the impact on my

29:32

dashboard. There's a lot

29:35

of room

29:38

to use the capabilities that like help you reduce

29:42

the time to resolution when

29:44

you have an issue or avoid

29:46

doing breaking changes or really

29:49

taking a more proactive approach to

29:52

data issues. And that's where you should consider

29:56

one of our other offerings, like the cloud

29:58

offering or using the CLI. The

30:01

way we run POCs, like with users who start off in the cloud product, is in like three phases. So first

30:13

we're trying to get them to this baseline of observability. Let's make sure that we prevent all the super embarrassing stuff, right? Like, those things don't go undetected anymore. So let's get you to this basic coverage, like freshness and volume and schema and uniqueness and not-null. Let's

30:32

get you to that level and

30:34

let's talk about the most

30:36

embarrassing incidents you had and see

30:38

that they're covered.
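That baseline coverage might look something like this on a raw source; the names are hypothetical and the elementary test names follow the package's documented conventions:

```yaml
# models/sources.yml -- hedged sketch of "baseline" coverage
sources:
  - name: app
    tables:
      - name: raw_events
        loaded_at_field: _loaded_at
        freshness:
          error_after: {count: 12, period: hour}  # freshness baseline
        tests:
          - elementary.volume_anomalies   # volume baseline
          - elementary.schema_changes     # schema baseline
        columns:
          - name: event_id
            tests:
              - unique      # uniqueness baseline
              - not_null    # not-null baseline
```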

30:45

Then, in each phase, focus on your critical models, get more coverage, and try to build a plan for that. And lastly, I

30:51

think the last part is getting to the process

30:56

and the enforcement and how you maintain that

30:58

over time and how you incorporate that

31:00

into your data process and how you

31:02

enforce a governance plan.

31:05

I think it's

31:08

not enough to have like the onboarding

31:11

with elementary, which is really cool. You

31:13

get a lot of make-a-let-it-service into weeks,

31:15

but then a year later your project

31:17

is different than your last everything, right?

31:19

So it's a way to maintain that

31:21

origin. So that's like the

31:23

three phases of the project. I think all

31:26

of our open source users are trying to

31:28

incorporate kind of the same phases on their own.

31:31

And once somebody is using elementary,

31:33

they're leaning on the insights that

31:36

it's able to provide and incorporate

31:38

that into their development workflow and

31:40

their team review process. I'm curious

31:42

how you've seen that impact the

31:45

overall approach to development, some of the

31:47

ways that it shifts the thinking, some

31:49

of the planning, and just the overall

31:52

experience of working on a DBT project

31:54

in ways that it causes teams to

31:57

either accelerate their delivery pace or to change the way that they design their systems, etc.? [Much of the answer that follows is unintelligible in the source transcript; the recoverable fragments: teams think about this when they're building major changes, when they're building something new... we have an incident today and we need to handle it... how can you be prepared, how can you proactively reduce the impact instead of investigating what happened after the fact... stuff like that is just going to keep happening... added to the sheer complexity... people see an issue, they're gonna fix the issue, but it keeps coming back... now I understand it was wrong...] Another

34:09

way that some of these types

34:11

of tools, in particular the pre-commit

34:13

style checks, but also just the

34:16

tools that bring additional rigor to the

34:18

process, it can, if you're

34:21

not careful in terms of how you implement

34:23

it and roll it out to the team,

34:25

it can actually cause you to either

34:28

stall out in terms of the velocity that you're

34:30

able to build up or it can cause the

34:32

team to discard the tool

34:34

wholesale because they don't want to

34:36

deal with the pain of adapting

34:38

to the practices that it's

34:41

trying to encourage. And I'm curious

34:43

how you are approaching that

34:45

side of the problem as well of making

34:47

sure that the overall

34:50

burden of extra work doesn't

34:52

cause teams to try

34:55

out elementary, say, this is going to add too

34:57

much work to my plate, so I'm just going

34:59

to get rid of it and not bother and

35:01

just ignore the fact of all these issues that

35:03

it's trying to highlight. Yeah.

35:05

So I think one thing is that I feel we kind of have the privilege that users who come to elementary have already paid the price of not investing in this capability, and that's a very good reason. They understand that if they don't invest in it differently, then over time they're going to keep paying a significant amount for it, and you want to be on the positive side of that and not the negative side. Like, it's better to not have fires, to invest in buildings that don't burn, than to keep dealing with fires and trying to catch them as early as possible all the time.

36:02

So I do think

36:04

that users have more awareness today to

36:08

the return on investment of investing in observability. We

36:13

do see this journey with our users, where users and teams learn what they should enforce and what's working for them and what's not. Actually

36:23

something we're working on now is to give them the visibility to see which tests fail often, what the failure rates and success rates are for those tests and monitors, and whether people actually address the failures. And our

36:46

recommendation in general is like if no

36:48

one would address the test

36:50

if it fails, then you shouldn't add

36:52

it. Right, because nobody cares. So

36:55

we help them remove those. [partly unintelligible] We try to help them make better decisions, to see what actually works for them and what doesn't, and whether it serves their goal.

37:12

As you have been investing

37:14

in this space of observability

37:18

and developer experience improvement

37:20

and data quality for people who

37:23

are investing in this dbt ecosystem

37:25

and using that as their de

37:28

facto approach for managing transformations. What

37:30

are some of the most interesting

37:32

or innovative or unexpected ways that

37:34

you've seen the elementary tool chain

37:37

used? That's an interesting

37:41

question.

37:45

I think something very cool about

37:47

elementary is that it saves all the

37:49

output to the warehouse, to that elementary schema. And

37:53

then it's accessible to our users and we saw

37:56

use cases that our users solved with it. We saw it used to do automated data warehouse cleanup, to kind of keep everything clean and reduce costs. And we saw it

38:14

being used for cost analysis, to understand exactly how much each pipeline and each business domain comes to, to do chargeback management.

38:27

So we saw

38:29

a lot of ad-hoc use

38:32

cases that users used elementary to solve.
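Because everything lands in ordinary warehouse tables, these ad-hoc analyses are plain SQL. A hedged sketch of a runtime/cost-style query follows; the table and column names track Elementary's artifact models (and the date function is Snowflake-flavored), so verify them against your version:

```sql
-- Slowest models over the last week, from Elementary's run-result table
select
    name,
    count(*)            as runs,
    avg(execution_time) as avg_seconds,
    max(execution_time) as worst_seconds
from analytics.elementary.model_run_results
where generated_at > dateadd('day', -7, current_timestamp)
group by name
order by avg_seconds desc
limit 20;
```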

38:36

An interesting use case was migration, where

38:39

we saw users when they were

38:41

migrating between data warehouses with the

38:43

same DBT project, they could run the exact same tests and also monitor the pipeline itself, and then compare the results they got in elementary from the two different data warehouses, to kind of validate the

38:58

migration. And we also saw users do things we didn't expect even with data quality. [partly unintelligible] For example, instead of alerting on every single failure, they create an alert only if a test fails over three times in the same week, or twice on the same day, kind of creating this meta-level approach to how they test the data.
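That repeated-failure pattern is again just SQL over Elementary's results table; a hedged sketch, with names that follow the package's artifact models but should be checked against your version:

```sql
-- Tests that failed three or more times in the last week
select
    test_unique_id,
    count(*) as failures_last_7d
from analytics.elementary.elementary_test_results
where status in ('fail', 'error')
  and detected_at > dateadd('day', -7, current_timestamp)
group by test_unique_id
having count(*) >= 3;
```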

39:34

And in your experience of investing in this

39:36

ecosystem, putting in the engineering time

39:38

and effort to build this suite

39:40

of capabilities and working with end

39:42

users, I'm curious, what are some

39:44

of the most interesting or unexpected

39:46

or challenging lessons that you've learned

39:48

in the process? So

39:52

being a startup founder in general is a

39:54

very humbling experience and

39:56

building a product is a very

39:59

humbling experience. I think

40:01

the second lesson is that you need to be

40:04

very, very, very attentive to

40:06

the users. You need to keep

40:08

experimenting and you need to always listen because

40:11

it's shocking to realize how little you can predict what will actually make an impact, and how much of what you do is actually reacting. So you think you know, you think you already know the space, and you don't. But

40:31

you keep having surprises, whether positive or negative ones. So

40:35

I think every time we went and did things without getting enough feedback, without experimenting in small steps and getting feedback along the way, it was always a mistake. So that's something we

40:50

keep doing. I can

40:52

say even when we started elementary, we

40:54

were very, very focused on the anomaly detection part and the data observability part. And we actually created a lot of the metadata tables and all that for ourselves; it was kind of on the side, additional information on the side. And then we found that most

41:14

of the users actually incorporate elementary for

41:16

that and then discover that we have

41:18

anomaly detection and adopt that. So that's

41:20

like just an example of

41:22

a super positive surprise we had no way of predicting. You know, like, that became a super big part of the product. And

41:36

for teams who are

41:38

building their DBT projects

41:41

and they're trying to improve

41:43

their overall productivity and uptime

41:46

and capabilities, what are the

41:48

cases where elementary is the

41:50

wrong choice? So obviously

41:52

if you don't work heavily with DBT, and your critical pipelines don't run on DBT, then it's really not a fit. Also, I think we did meet some teams out there that hadn't incorporated DBT yet but are changing toward it, [partly unintelligible] and I think that's very interesting; you know, I think we're looking to be part of that. And

42:24


42:47

as you continue to build and iterate on

42:49

the technology and the product, what are some

42:51

of the things you have planned for the

42:53

near to medium term or any projects

42:55

or problem areas you're excited to explore? Yeah,

42:59

that's always

43:01

a big deal in a startup, right, because things change so rapidly. So we're very open with our users that we only have a roadmap that goes a quarter out, max, but that's also an opportunity for them, because they have a lot of impact. And once

43:17

we build it, the feedback from them is super

43:20

valuable. I think the main question, and I think we will probably keep facing it as we grow, is should we go wide or should we go deep? Like the question

43:30

you asked me before about platforms other than DBT, other frameworks out there; you asked about teams where elementary would be the wrong choice, teams that are not using DBT. So in terms of the

43:44

problems we solve for the users we have, should we go wider or should we go deeper? And our lesson so far has been that we're at our best when we're very close to our users and when we go deep. So that's the path.

44:01

At the moment we're focusing on three areas. So

44:15

we're trying to learn how our users

44:17

decide what to monitor. And

44:20

we look at the testing they have and we ask

44:22

them and we try to understand the decision making process.

44:25

So we can make it easier for

44:27

them moving forward, and really automate it, but we still have a lot of exploration there. We also see

44:34

that there may be trouble around communication of

44:37

data health and data issues. So

44:40

kind of the people processes part of the

44:42

problem. We can still make a

44:44

lot of progress there and help them with

44:46

that. And then we keep

44:50

kind of trying to measure what's

44:52

the time to resolution when they

44:55

do have incidents. And we're trying to make

44:57

a positive impact there. We also have

44:59

a lot of ideas and areas that

45:02

we're exploring on that area. But

45:04

if you're a user of elementary, we're going to keep making observability easier for you. And

45:12

we're going to keep refusing your request for

45:14

us to solve other

45:16

issues in place. Although we want to solve

45:18

them, but we're not there yet. And

45:22

are there any other aspects of

45:24

the overall space of data observability

45:26

for DBT projects, the work that

45:28

you're doing at elementary, or

45:30

some of the ways that you see

45:32

this overall challenge of data quality, data

45:34

observability evolving as the ecosystem grows and

45:36

matures that we didn't discuss yet that

45:39

you'd like to cover before we close

45:41

out the show? Yeah,

45:44

I think this

45:46

whole ecosystem is still growing. And

45:49

I think there was a phase

45:51

of doing more and more.

45:53

And now people are

45:56

trying to consolidate, to do less, and to be more focused on the value of things. I think that with

46:03

observability we need to be able

46:05

to support that process

46:07

and do the same. So help them

46:09

with priorities and understanding

46:11

what's actually critical, reducing the noise, and helping them know what's actually important. And

46:17

I think that a big gap in analytics, maybe, is the depth of the business context that people have. And that's

46:33

just not something we can

46:35

ever automate probably. Sometimes

46:37

we see users' tests and we

46:39

have no idea why they decided to

46:42

add them or why they decided to model their data

46:44

in a certain way and then we ask them and

46:46

it becomes super clear. But we still

46:48

need that context, we still need to ask

46:51

them. So we want

46:53

to have the numerical

46:55

AI bot that could

46:58

be replaced by content

47:01

and in big pre-pitch we can make it. How

47:04

we can create the

47:06

interface and how we see that

47:09

content into and get any advance

47:12

possible and the coverage that they need.

47:14

And the coverage that works for them and the

47:16

coverage that really supports their role. So that's

47:19

an area to make a big progress. And I

47:21

think other

47:26

domains in data if they'll be

47:28

able to create better interfaces for

47:30

users to input context and get it back out in their workflow, then that's definitely going to

47:36

create progress. And

47:39

maybe someday someone will figure out time zones and time differences. They create so many data quality problems, but I think that's just too far ahead. We're not there yet in terms of technology.

47:54

Everybody just needs to use UTC all the

47:56

time. Yeah, yeah, that's not going to happen, I think, I'm afraid.

48:01

Unfortunately not. All right.

48:04

Well, for anybody who wants to get in touch with you

48:06

and follow along with the work that you and your team

48:08

are doing, I'll have you add your preferred contact information to

48:10

the show notes. And as the final question,

48:12

I'd like to get your perspective on what you see

48:14

as being the biggest gap in the tooling or technology

48:16

that's available for data management today. Yeah.

48:19

So I think going back to

48:21

that context question, how

48:24

can we make it easy

48:26

for people to share

48:29

why they made the decisions they made? Why they made the decision they made in data observability, why they made the decision they made in documenting or not documenting something. If

48:42

things would make more sense to the

48:44

new members on your team and to

48:46

your stakeholders and to everyone you

48:49

collaborate with and even to the

48:51

vendors you work with, right? Like if we'll

48:53

have more context from our

48:55

users about what drove their decisions,

48:57

then we could give them better

48:59

advice and better outcomes.

49:02

And that's still something that I

49:05

don't think anyone figured out. Like

49:07

how can we communicate better

49:10

around kind of

49:12

the decisions and the design patterns that we chose, and why we did them. All

49:17

right. Well, thank you very much for

49:20

taking the time today to join me

49:22

and share your work on elementary and

49:24

share your experience and perspective on the

49:27

overall space of data observability for DBT

49:29

projects. It's definitely a very interesting

49:31

and complex problem area. So I appreciate the time

49:34

and energy that you and your team are putting

49:36

into helping to solve for that. And I hope

49:38

you enjoy the rest of your day. Yeah.

49:41

Thank you for having me. And also

49:43

I hope listeners enjoy and I do

49:45

want to point out that English is

49:47

my third language. So

49:49

I hope people

49:52

would forgive my

49:54

mistakes and enjoy listening. Thank

50:03

you for listening. Don't forget to check

50:05

out our other shows, Podcast.__init__, which covers

50:07

the Python language, its community, and the

50:09

innovative ways it is being used, and

50:11

the Machine Learning Podcast, which helps you

50:14

go from idea to production with machine

50:16

learning. Visit the site at dataengineeringpodcast.com to

50:18

subscribe to the show, sign up for

50:20

the mailing list, and read the show

50:22

notes. And if you've learned something or

50:24

tried out a project from the show, then tell us about it.

50:27

Email hosts at dataengineeringpodcast.com

50:29

with your story. And

50:31

to help other people find the show, please leave a

50:33

review on Apple Podcasts.
