Addressing The Challenges Of Component Integration In Data Platform Architectures

Released Monday, 27th November 2023

Episode Transcript

0:11

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get two weeks free.

0:54

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.

1:23

Your host is Tobias Macey, and today I'm going to be sharing an update on my own experience of the journey of building a data platform from scratch. Today I'm in particular going to be focusing on the challenges of integrating the disparate tools that are required to build a comprehensive platform, and some of the complexities around being able to maintain a single source of truth, or a single interface, for being able...

2:00

So, for the better part of six years now, maybe close to seven... I have been working in technology for over a decade. And I have been spending the past year and a half, coming up on maybe two years, building a data platform from scratch, using, to get all the buzzwords out of the way, a cloud-first data lakehouse architecture, focusing on dbt for transformation, Airbyte for extract and load, Dagster for the full integration, and using Trino as the query engine for data on top of S3.
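
For context on what querying that stack looks like day to day, here is a minimal sketch using the trino Python client against an Iceberg catalog exposed through Trino; the host, catalog, schema, and table names are placeholders rather than the actual platform described here.

```python
# Minimal sketch: querying Iceberg tables on S3 through Trino with the
# `trino` Python client. Host, catalog, schema, and table names are
# placeholders for illustration, not the platform described in the episode.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical coordinator address
    port=8080,
    user="analyst",
    catalog="iceberg",   # Iceberg connector catalog configured in Trino
    schema="analytics",
)

cur = conn.cursor()
cur.execute("SELECT order_date, count(*) AS orders FROM orders GROUP BY 1 ORDER BY 1")
for row in cur.fetchall():
    print(row)
```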

2:38

So, recognizing that that puts me in the minority of most people who are building a data platform, particularly if they have small teams, that has put me in the position of needing to figure out some of the interfaces for integration. For a lot of people who are just getting started with building up the data platform, they're probably going to be going with the managed platform route, or picking from a selection of different vendors. The canonical one for the so-called modern data stack is probably Fivetran, Snowflake, and dbt, with maybe Looker as your business intelligence layer. I'm not going to spend a lot of time in this episode digging into the motivations for why I selected those different components of the stack, because I've covered it in other episodes, which I will link to in the show notes.

3:32

But it has definitely led to a certain amount of friction as I try to manage some of the different integrations out of the box, although the story for that particular set of technologies has been steadily getting better as time goes by.

3:47

Today, I really want to talk about where I am in my journey currently. I have a core set of functioning capabilities. I can ingest data. I can transform it. I can query it. But now I'm getting to the point of needing to be able to onboard more people, provide a more seamless user experience, and be able to manage some of the different means of data sharing or data delivery, where maybe not everybody who's going to be accessing the data lives within the bounds of my team or my department.

4:23

And there are definitely data sharing capabilities that are part of some of the different platforms, the most notable being probably BigQuery and Snowflake with the ways that they manage data sharing. But there are a number of different ways of approaching that. Given that I am in the world of building a lakehouse architecture, I've got my data in S3 and I'm using the Iceberg table format, so all of the data is already representable as a table.

4:55

And so then the question is, okay, for somebody who just wants to be able to access the data, what's the best way to deliver it? Do I need to provide one-off jobs to generate a CSV and send it to them via email? Do I need to give them an S3 bucket to be able to load things from? It all depends on what the level of sophistication is of the people who are going to be consuming the data. Maybe it's just a dashboard format, and people just need to be able to look at the data.
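
As a minimal sketch of the middle option, handing someone time-limited access to an extract that already lives in S3 rather than emailing a CSV, assuming boto3 and placeholder bucket and key names:

```python
# Minimal sketch: sharing a one-off extract from the lake via a time-limited
# presigned S3 URL. Bucket and key names are placeholders for illustration.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-data-exports", "Key": "exports/orders_2023-11-27.csv"},
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)  # send this link instead of emailing the CSV itself
```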

5:23

One of the challenges of that handoff is also when you need to be considerate of how that data is going to be used after you present it, because maybe you want to be able to give a visual representation, or give a way for somebody to access the data, but you don't want them to then exfiltrate it into another system by means of a CSV export, for instance. And that's where you start getting into questions of governance, who has access to what, and being able to audit data access.

6:00

But harking back to one of the episodes I did a while ago on the idea of shadow IT in the data industry, the best way to prevent people from taking data out of the context in which you want it to be presented, and bringing it into other tools, is to reduce any friction or pain that they're experiencing accessing the data in the way that you have presented it to them. Because if your option is the best option and the most accessible option for the people who are viewing that data, then they're not going to want to bring it out of that system, because you are giving them the best experience.

6:43

And so that's where I am right now: figuring out what is that best experience for everybody? What are the requirements? How do I then manage that?

6:52

And a lot of the complexity comes in with the elements of interoperability and integration as you start to add more layers and components and capabilities into the overall platform. And I'm using the term platform deliberately, because I am aiming for a holistic experience for end users, versus just a number of point solutions where somebody can maybe plug something in and they do their thing, and then somebody else can plug something in and do their own thing.

7:24

I want to figure out what is the minimal set of interfaces that I need to build and support to be able to address the widest variety of needs, while still being extensible for the case where somebody has a bespoke requirement that I need to be able to fulfill. How do I make sure that that doesn't add an undue amount of maintenance burden on myself and my team, while still being able to deliver on that request?

7:52

In general, the way that that degree of interoperability is managed is through the adoption of open standards that everybody has agreed upon. SQL is probably the longest-lived one of those open standards, although, to be fair, there are multiple different varieties of it, but at its core, SQL is understandable. So if you have a means of using SQL to query data, that is going to make it easy for a lot of different tools and people to be able to do their exploration and self-serve.

8:24

dbt has capitalized on that as a means of being able to build their products, to be able to say, okay, the majority of structured data sources are going to be addressable by SQL, so we're going to build a tool that allows people to build a full engineering development and delivery flow for that SQL data, manage the transformations through our tool, and then add a lot of nicety around it in terms of the lineage and data documentation, et cetera. Because of the fact that they have invested so much in the ease of use of the tool, as well as doing a lot of advocacy in the community to drive adoption, that has led to a number of different other tools integrating with them. So they have become a de facto interface for managing transformations in a warehouse or warehouse-like context. There are tools such as Preset that are building integrations into dbt as far as the data types and the metadata that they generate. There are entire products that are built on top of the metadata that dbt generates; Lightdash is one of those.
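
To make that concrete, tools that build on dbt generally read the artifacts it writes under its target directory, such as manifest.json. Here is a minimal sketch of pulling model-level lineage out of that artifact, assuming a compiled dbt project; the path is a placeholder, and the artifact layout varies across dbt versions.

```python
# Minimal sketch: reading model-level lineage from dbt's manifest.json
# artifact, which is what downstream tools commonly build on. The path is a
# placeholder, and the artifact layout varies across dbt versions.
import json
from pathlib import Path

manifest = json.loads(Path("target/manifest.json").read_text())

for unique_id, node in manifest.get("nodes", {}).items():
    if node.get("resource_type") == "model":
        parents = node.get("depends_on", {}).get("nodes", [])
        print(f"{unique_id} <- {parents}")
```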

9:46

So that has given dbt, as a tool and as a company, a lot of momentum, a lot of inertia. That makes it more difficult for other people who maybe are targeting a similar type of use case, but want to add a step change to some of the capabilities around that tool, to be able to actually break in and overtake that market. The most notable one that I'm aware of so far is SQLMesh, who are actually adding a compatibility layer with dbt for being able to run SQLMesh on top of a dbt project, where you can actually execute the transformations without having to do any code changes. That solves for the adoption step of "I just want to be able to try out this tool," but then there is still the challenge that they're going to have to overcome of building up the set of integrations with the broader ecosystem that dbt has benefited from, so that makes the road to adoption, and the road to being a viable competitor, a lot longer.

10:52

Another tool that has benefited from this status of being a de facto standard and a reference implementation is Airflow, where they have been around for a long enough time, and have been adopted by a large enough user base, that if you as a new tool vendor build your initial integration with Airflow, then you have a large enough addressable market that it is worth the time of building that integration. But it's not necessarily worth it to build that same integration for Dagster or Prefect or one of the other orchestration systems that are out there. So as you are building your own platforms and doing this tool selection, it's worth considering: am I choosing something that is going to be able to benefit from the existing weight of the community integrations that are available for it? If not, then is the value that it is providing worth the additional complexity of me having to go out and build those integrations, or work with other vendors to build those integrations?

12:04

Developing event-driven pipelines is going to be a lot easier. Meet Memphis Functions. Memphis Functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events on the fly, in a serverless manner, using AWS Lambda syntax, without boilerplate. It includes orchestration, error handling, and infrastructure in almost any language, including Go, Python, JavaScript, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis today to get started.

12:37

And then another element of that integration question is the challenge that a lot of the different tools in the stack will want to be the single source of truth for certain aspects of your platform. One of the ones that I've been dealing with recently is the question of access control and role definition and security and auditability of the data. Because I am building with a disaggregated database engine, where the storage and the query access are separated, that gives two different locations where those definitions can be stored.

13:18

One camp will say, well, because the storage layer is the lowest level and will have potentially multiple different points of access, where maybe I'm using Trino to query it by SQL, but I may also just be accessing the data in S3 directly using some Python tool, or maybe I'm using something like Spark or Flink to do other processing approaches to it beyond just SQL, all of that access information needs to live in the storage layer. That's the approach that companies like Tabular are taking with the Iceberg table format, where, as long as you have the storage layer secured, it doesn't matter what all the other layers on top of it might have to say about access control, because it's going to be enforced at the table level.

14:02

And then you go the next level up, and engines like Trino or the Galaxy platform from Starburst say, we want to own the access control, because we are going to have more visibility into the specifics of the queries, and we can add things like row-level filtering in the role definitions, so we should be the ones to own that role-based access control information, or attribute-based access control. That's definitely another viable option.
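
As one concrete example of what engine-owned enforcement can look like, Trino's file-based access control lets table rules carry a row filter. The sketch below writes such a rules file from Python purely for illustration; the group, schema, table, and filter expression are placeholders, and the exact rules schema should be checked against the Trino documentation for your version.

```python
# Minimal sketch: a file-based access control rules file for Trino with a
# row filter, written from Python for illustration. The structure roughly
# mirrors Trino's documented rules.json format, but check the docs for the
# exact schema supported by your version; all names here are placeholders.
import json

rules = {
    "tables": [
        {
            "group": "eu_analysts",       # who the rule applies to
            "schema": "analytics",
            "table": "orders",
            "privileges": ["SELECT"],
            "filter": "region = 'EU'",    # row-level filter applied to every query
        },
        # catch-all rule: anything not matched above gets no privileges
        {"privileges": []},
    ]
}

with open("rules.json", "w") as f:
    json.dump(rules, f, indent=2)
```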

14:36

But then you go another layer up the stack into business intelligence, and maybe there's even more granularity available, because you can say, oh well, this person has access to this particular data set with this row-level filtering, and they can view these visualizations on that data, but maybe they can't write their own queries against that data. So there's that challenge of, okay, well, do I have to define the roles in three different places? Do I have to have slightly different roles across all those three places? How do I align them to be able to say, well, this user in the business intelligence is the same as this user in the query engine is the same as this user in the storage layer, and make sure that they are getting a cohesive experience across those boundaries?

15:22

That is particularly the case when you have tools or systems that maybe don't want to, or maybe are disincentivized to, do that propagation of role information. So, for instance, if I use Tabular and I say I'm going to define a role that will grant read access to this subset of tables, but no other visibility into the rest of the data set, how then do I reflect that information to the query layer, of who that user is, to be able to enforce those permissions, and then all the way up to the business intelligence layer, to say, from the storage layer, these are the permissions that you have in the UI? So that brings in the need to have some manner of single sign-on and a single source of identity for people across all of those boundaries.

16:13

Beyond the question of permissions, there's another set of information that is disjoint and wants to be owned by different components within the stack, and that is the question of data flow, data processing, and data lineage, where each tool maybe has a certain view of what the lineage graph looks like or what the processing looks like. But you don't necessarily have the complete end-to-end view of a piece of data, from maybe where it lands in an application database, all the way through to where it's being presented in a business intelligence dashboard or incorporated as a feature for a machine learning model training workflow.

16:56

So, for instance, dbt has the table-level lineage of the transformations that it's providing. Tools like Airflow and Dagster have the view of the lineage of all of the tasks that they are responsible for executing, but they don't necessarily have information about out-of-band transformations that are being done by some analysts who are running ad hoc queries in the data warehouse. Or maybe they don't have visibility into the data as it's landing in an application database, and they only see the data once it lands in the warehouse. Or maybe they can see the data integration step with an Airbyte or Fivetran, to say, okay, this table in the application database feeds into this table in the warehouse, and then these dbt flows happen, but maybe it loses sight of the data dashboards that are being generated.
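
As one concrete example of stitching two of those partial views together, the dagster-dbt integration can load a dbt project's models as software-defined assets, so the orchestrator's asset graph reflects dbt's model-level lineage. Below is a minimal sketch, assuming the dagster-dbt package and an already compiled dbt project; the project path is a placeholder.

```python
# Minimal sketch: exposing dbt models as Dagster assets via dagster-dbt, so
# the orchestrator's asset graph reflects dbt's model-level lineage. The
# project path is a placeholder; consult the dagster-dbt docs for details.
from pathlib import Path

from dagster import Definitions
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("analytics")  # hypothetical dbt project location

@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def analytics_dbt_assets(context, dbt: DbtCliResource):
    # Run `dbt build` and stream per-model events back to Dagster.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[analytics_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
)
```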

17:49

And so again, this is the question of open protocols and interoperability. Tools like OpenLineage are designed to help address that, where you fire events that can then be constituted into a more cohesive lineage graph.
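
For a sense of what firing those events looks like, here is a minimal sketch using the openlineage-python client to emit a run event for a hypothetical job; the backend URL, namespace, job, and dataset names are placeholders, and the client API has shifted across releases, so treat it as the shape of the call rather than an exact recipe.

```python
# Minimal sketch: emitting an OpenLineage run event from Python. The backend
# URL, namespace, job, and dataset names are placeholders; the client API
# has changed across openlineage-python releases, so treat this as a shape,
# not an exact recipe.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a Marquez backend

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="my-platform", name="orders_daily_load"),
    producer="https://example.com/my-custom-loader",
    inputs=[Dataset(namespace="postgres://app-db", name="public.orders")],
    outputs=[Dataset(namespace="s3://lake", name="analytics.orders")],
)
client.emit(event)
```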

18:05

And then you also have systems such as metadata platforms that are designed to be more holistic views of the entire ecosystem and incorporate things like data discovery and data governance, which gives you a single place to be able to view all of that information. But then you're back to that question of integration of, okay, well, this can store and convey all of this information, but how do I get all of this information into it? Each of those tools is going to have different means of being able to push or pull data, but you have to make sure, as the platform designer and operator, that those data flows are also happening. So it's an additional set of tasks that you need to make sure are running, and you need to make sure they're reliable. So it's useful, but it's also an additional burden. These are all things that I've been dealing with recently.

19:02

And then, in the metadata catalog situation, even if you do manage to feed all of your data into it, it is useful as a means of discovery, or a means of being able to keep tabs on what's happening. But then it also feeds back into, okay, well, if I want to use this as my single source of truth, how then do I propagate that truth back into other systems? And that's where you start to get into questions of things like active metadata. And then you have another set of integrations, and another direction of integrations, where if I say, okay, I have my metadata catalog, this is my source of truth for role information and who can access what, now I need to be able to push that back down into the storage layer and into the query engine and into the business intelligence dashboard. And I need to make sure that all of those integrations are reliable, and that there are appropriate mappings between the different concepts throughout the different systems.

20:04

Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all of your data needs, ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture, with first-class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/Starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

20:58

So this is definitely one of the benefits that fully vertically integrated platforms have: you don't have to fight with all of those different layers of integration. But the problem there is that you have to rely on the integrations that have been built, and maybe you are constrained by what that vertically integrated platform offers. You don't necessarily have the means of being able to extend it into areas that the platform developers haven't had the time to implement, or haven't been exposed to as a necessity. So it's the continual story of software design and software development: how do we build these systems that are extensible and can be integrated, but also make sure that the product that you're building doesn't just get crushed under the weight of having to maintain all of these different point solutions?

21:58

I think that we're definitely iterating towards these community standards. I think that tools like OpenLineage, and the work that the OpenMetadata folks are doing with their schema-first approach to metadata, are interesting. And I think that the work that dbt has done to become that de facto standard for how your transformations are represented, so that other tools can build on top of that, is valuable. So it's great to see the direction that the community has taken on all of these fronts. But I do think that we are definitely still not to the point where we have a lot of the answers fully baked. I think that everybody who is investing in this ecosystem, everybody who is building these tools and using these tools and giving feedback to the tool vendors, is helping to bring us to a better place. But as somebody who is trying to integrate so many different pieces and figure out what that holistic platform looks like, how to build it in a way that I can maintain it with a small team, and how to have the flexibility required to address a wide set of audiences, it's definitely still a challenge, but one that I've been enjoying having the opportunity to explore and invest in.

23:15

So now that I have been iterating on these challenges and thinking through how best to build that holistic platform, something that is going to be enjoyable and usable by a number of different people, the next main architectural component that I've been starting to work towards is that metadata platform, so that I can have that more cross-cutting view, and so that I can improve the data discovery story for people who aren't part of the engineering team, who just want to see what data do you have, where is it, and how do I access it. So that's where I'm going to be doing some of my next work: picking the metadata platform, getting it integrated, getting all of the data flows propagated to that tool so that we can see how everything is flowing, and then being able to start integrating single sign-on as a means of identity management across the different layers.

24:07

And then being able to say, okay, you came through the metadata platform to do data discovery. Now you say, okay, here's the data set that I want to explore, or here is the chart that I want to view, and then being able to have a simple means of clicking a button and jumping them into the experience that they're requesting, for instance. Or being able to say, okay, I need to query this database, and then giving them the pre-filled set of credentials or the pre-filled client connection to be able to run those queries with their tool of choice, whether that be something like a Tableau or a SQL command line, etc.

24:45

And as I have been doing this work, the most interesting or innovative or unexpected aspects have definitely been these integration boundaries of saying, okay, this is the tool that I have, this is the other tool that I have, I would like to be able to use them together in this manner, now how do I go about doing that? So I am definitely happy that the Python language has become one of the most widely adopted ecosystems for data engineering, because it does simplify some of that work, where you can pretty easily assume that there's going to be a Python library that does at least 80% of what you're trying to do. So, for instance, the OpenMetadata platform has a Python library for doing that metadata ingestion. Even if they don't have an out-of-the-box solution for ingesting metadata from the tool that you're using, they do have a Python client that you can provide metadata to, in order to be able to propagate information from a system that they haven't already addressed.
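
The general shape of those custom ingestion scripts is: pull descriptions out of the system that has no connector, map them onto the catalog's schema, and push them through its client or API. The sketch below is deliberately generic; the endpoint path, payload shape, and auth handling are hypothetical placeholders rather than OpenMetadata's actual API, which is better accessed through its Python SDK.

```python
# Minimal sketch of a custom metadata ingestion script. The endpoint path,
# payload shape, and auth handling are hypothetical placeholders standing in
# for whatever client or API your metadata platform (e.g. OpenMetadata)
# actually exposes; consult its SDK docs for the real calls.
import os
import requests

CATALOG_URL = "https://metadata.example.internal/api/v1/tables"  # hypothetical
TOKEN = os.environ["METADATA_API_TOKEN"]

# Table descriptions pulled from a system the platform has no connector for.
tables = [
    {
        "name": "orders",
        "schema": "analytics",
        "description": "Curated orders fact table maintained by dbt",
        "columns": [{"name": "order_id", "dataType": "BIGINT"}],
    }
]

for table in tables:
    resp = requests.put(  # upsert-style call; verb and shape are assumptions
        CATALOG_URL,
        json=table,
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
```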

25:50

So I do think that that is another good trend in the ecosystem: providing a good set of software clients to be able to integrate with their tool, even if it's from a system that they themselves didn't already plan to integrate with. So while it does add a bit of extra burden to the people who are trying to use those systems, there is at least a path to success for it. I'd also say that that's probably the most challenging lesson that I've learned as well: figuring out which points of integration are worth investing in, and which are the ones that I don't have to invest in right now, where maybe I can wait to see if the ecosystem around that tool grows up.

26:33

So SQLMesh is a tool that I've been keeping an eye on. When I first came across it, it didn't have support for Trino, so that was an obvious no for being able to use it. Now they have it, but they don't yet have an integration with Dagster. And then also, as I was pointing to earlier, the existing weight of integration with dbt gives me a bit of pause of, okay, even if I do get it working with Trino and Dagster, what about all the other pre-built integrations that I am not going to be able to take advantage of, because I have decided to use a newer tool that hasn't been as widely adopted and integrated? So that is definitely a challenging aspect of that as well.

27:15

If you have found my musings useful or informative, or if they have inspired something that you would like to discuss further, I'm definitely always happy to take suggestions, bring people on the show, or take some feedback. So for anybody who wants to get in touch with me, I'll add my contact information to the show notes, but the best way is on dataengineeringpodcast.com. That has links to how you can find me.

27:42

And for the final question of what I see as being the biggest gap in the tooling or technology for data management today, I definitely think it's that single source of truth for identity and access across the data stack: being able to figure out how I manage permissions and roles across a variety of tools without having to build different integrations every single time. So I definitely look forward to seeing more investment in that, maybe even using something like Open Policy Agent from the cloud native ecosystem.
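
For a sense of what that could look like, Open Policy Agent exposes a REST data API that any layer of the stack can ask for an allow or deny decision against a shared policy. Here is a minimal sketch, assuming a locally running OPA server and a hypothetical policy package named data_platform.authz; the input fields are placeholders.

```python
# Minimal sketch: asking a locally running Open Policy Agent server for an
# authorization decision via its data API. The policy package name
# (data_platform/authz) and the input fields are hypothetical placeholders;
# each layer of the stack (query engine, BI tool, catalog) could ask the
# same question against the same shared policy.
import requests

decision = requests.post(
    "http://localhost:8181/v1/data/data_platform/authz/allow",
    json={
        "input": {
            "user": "tobias",
            "action": "select",
            "resource": "iceberg.analytics.orders",
        }
    },
    timeout=5,
).json()

print(decision.get("result", False))  # True if the policy allows the access
```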

28:20

So I'm definitely happy to continue investigating that as well. Thank you for taking the time to listen to me. I hope it has been, if not informative, at least entertaining. I'm very thankful for being able to run this show and bring all of these ideas to everybody who listens. So in honor of this being the Thanksgiving week, I just wanted to share that gratitude with everybody who takes the time out of their life to pay attention to myself and the people I bring on the show. So thank you again, and I hope you all have a good rest of your day.

29:23

I hope you enjoyed this show. If you enjoyed this show, please like, share, and subscribe. If you want to know more about the show, please leave a review on Apple Podcasts.
