Episode Transcript
0:11
Hello! And welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Dagster
0:17
offers a new approach to building and
0:19
running data platforms and data pipelines.
0:21
It is an open source, cloud native
0:23
orchestrator for the whole development life cycle
0:25
with integrated lineage and observability, a
0:27
declarative programming model, and best in
0:29
class testability. Your team
0:31
can get up and running in minutes
0:34
thanks to Dagster Cloud, an enterprise class
0:36
hosted solution that offers serverless and
0:38
hybrid deployments, enhanced security, and on demand
0:40
ephemeral test deployments. Go to
0:42
dataengineeringpodcast.com/dagster today to get
0:44
started, and your first thirty days
0:47
are free. Data lakes are
0:49
notoriously complex for data engineers who
0:51
battle to build and scale high
0:53
quality data workflows on the
0:55
data lake. Starburst powers petabyte
0:57
scale SQL analytics fast at a
0:59
fraction of the cost of traditional
1:01
methods so that you can meet
1:03
all of your data needs ranging
1:05
from AI to data applications
1:07
to complete analytics, trusted by teams
1:09
of all sizes including Comcast and
1:11
DoorDash. Starburst, as a data lake
1:13
analytics platform, delivers the adaptability
1:15
and flexibility a lake house ecosystem promises.
1:17
And Starburst does all of this on
1:20
an open architecture with first class support
1:22
for Apache Iceberg, Delta Lake, and Hudi.
1:24
So you always maintain ownership of your data.
1:27
Want to see Starburst in action?
1:29
Go to dataengineeringpodcast.com/starburst and
1:32
get five hundred dollars in credit to
1:34
try Starburst Galaxy today, the easiest, fastest
1:38
Your host is Tobias Macey and today I'm
1:40
interviewing Dain Sundstrom about building a data
1:43
lake house with Trino and Iceberg.
1:45
So Dain, can you start by introducing yourself?
1:48
Well, I'm Dain Sundstrom. I
1:50
am one of the founders
1:52
of Trino and Presto,
1:54
and I am CTO
1:56
at Starburst. I've been working
1:59
in the data lake space
2:01
for about 10 years now. Before
2:03
that I worked some other
2:05
startups and before that I was
2:08
one of the original people at JBoss
2:10
and spent a lot of time in Java
2:12
EE and that sort of space. And
2:15
do you remember how you first got started working in data? My
2:18
background mostly was distributed computing. So
2:20
out of college, I started working
2:22
at United Healthcare on distributed computing
2:24
using Intra DCE in the nineties.
2:27
And then switched to like Java EE
2:29
back when it was called something else.
2:32
And, uh, I, as part of
2:35
that, I wrote the object relational
2:37
mapping tools for JBoss. Then
2:40
eventually we long,
2:42
long time forward started working
2:45
at Facebook. And one
2:47
of the original projects from the head
2:49
of infrastructure was to come up with
2:51
a faster, better way of interacting with
2:53
their large data warehouse at the time.
2:56
So this is like 10 years ago
2:58
and it was, I don't
3:00
know, three, 400 petabytes or something, it's dramatically
3:03
bigger now. And they didn't have
3:05
a team to do it. And
3:07
myself and David Phillips and Martin
3:10
have backgrounds in Java, extensive background
3:12
in databases and stuff like that.
3:14
So we were available and we
3:16
started working on it, but I'm
3:19
mostly a distributed computing
3:21
person. So I wrote most of
3:23
the distributed computing parts of Trino,
3:25
whereas like Martin's a deep language
3:28
person. So he did a lot
3:30
of the language, uh, optimizations
3:32
and David is
3:35
deeply into databases has been forever. And
3:37
so built a lot of the database
3:39
parts and the tooling and things like
3:42
that. As an outgrowth of
3:44
that effort, along with a number of
3:46
other contributions to the ecosystem, we have
3:48
landed in this space where we have
3:50
a new architectural paradigm for analytical systems
3:53
that is largely phrased as the data
3:55
lake house as a midway point between
3:57
data lakes and data warehouses. And for
4:00
the purposes of this conversation, I'm wondering
4:02
if you can give your definition of
4:04
what constitutes a data lake house. It's
4:07
a really good question because I think
4:09
people play fast and loose with it.
4:11
So historically, I would say a data
4:13
lake is you have traditional
4:15
storage, external storage, so you're talking
4:18
HDFS is generally what people are
4:20
talking about. But nowadays, like
4:23
HDFS is so rarely used, it's
4:26
almost always some cloud object storage,
4:28
S3, GCS, Azure stuff. So definitely
4:30
all the data stored in that.
4:32
And then I think the important
4:35
part comes with a lake
4:37
house of talking about standard
4:39
data representations. So like you
4:42
can be a vendor and store all your data
4:44
in S3 if it's proprietary stuff. And
4:49
proprietary, I'm just going to define as
4:51
you're the only one who really implements
4:53
it. I don't care if you have
4:55
an open spec or whatever. Like it
4:57
doesn't matter. Like if you're the only
4:59
serious player in it, it's effectively proprietary.
5:01
So where I think about
5:03
it now, it's object storage. It's doing
5:05
it in the lake. So it isn't
5:07
like, Oh, I take the files
5:09
and then I import them into my special
5:11
proprietary format and then I process them. And then
5:14
I dump the data back out. That's the lake
5:16
as a sidecar to you. So it's
5:18
when you're doing transformations, when you're doing
5:20
data maintenance, the data is
5:23
operated on directly, with the lake being
5:25
your native form. Everything else is,
5:27
you know, a bolt on, which
5:29
not to say is terrible. It's just a different
5:31
thing. Absolutely. And another interesting
5:34
aspect of the idea of the data
5:36
lake house is that the reason for
5:38
framing it as such is that it
5:40
intends to add a lot of the
5:43
user experience benefits that you get from
5:45
a fully vertically integrated
5:47
database system, such as data warehouses,
5:49
whether that is an actual vertically
5:51
integrated system, as in the
5:54
days of yore, or a cloud native system
5:56
where compute and storage are disaggregated,
5:58
but still presented as a single unified
6:01
experience. And I'm wondering
6:03
if you can talk to some
6:05
of the ways that we have
6:07
actually as a community hit that
6:09
mark? And what are some of
6:11
the areas where we're actually still
6:13
falling short of the user experience
6:15
presentation of this cohesive platform versus
6:18
the parts where the gaps still
6:20
show through and you can see that it's actually
6:22
five different pieces that are trying to work together.
6:25
Yeah, I think we've done an okay job.
6:27
I think we got a long ways to
6:29
go though. If you had asked me this
6:31
question three years ago, I would
6:34
have just gone on and on and
6:36
on about the litany of like broken
6:38
weird tools that exist in the lake
6:40
house. I think things are starting to
6:43
get better as people realize that it
6:45
isn't like so much as like the
6:48
community of users, the community of like
6:50
the people implementing and maintaining this system
6:53
where like, I think we've now started
6:55
to figure out that like this paradox
6:58
of choice is not a good thing.
7:00
So before we had
7:03
like Hive and there
7:05
were five competing data formats and
7:07
then that narrowed down to two
7:09
and then everyone realized that what
7:11
Hive was doing was really bad
7:14
and not sustainable and
7:16
having like two different tables
7:18
next to each other and they're maintained in
7:20
completely different ways and have different type
7:22
systems and different schema evolution and so
7:24
on. Like I can go on and
7:26
on and on about the edges of
7:29
it. So I think Iceberg came
7:31
along and said, hey, we're just
7:33
going to come up with a format
7:35
for tables. It includes how tables move,
7:37
how they're evolved, how they're managed and
7:40
covers a whole plethora of
7:42
things including like data types and
7:45
how partitioning works and stats
7:47
now and views and
7:49
so on as a
7:51
written down standard. Before it was just
7:53
the Wild West, like literally like someone
7:56
would check something into Hive and like
7:58
invent an entire new system. Spark
8:00
does this all the time. Like, okay,
8:03
let's implement Spark bucketing V2, which is different than
8:05
everything else. And if you want to know how
8:07
it works, like go read the Spark code because
8:09
some person just showed up and everyone's like, yeah,
8:12
that's cool. So I think we've gotten a really,
8:14
a lot better on
8:16
data in tables, the type
8:18
system, that sort of
8:21
thing is now fairly
8:23
standardized and well understood. That said, iceberg
8:26
did it. And then immediately
8:28
data bricks came along and dropped a
8:30
competing product, which is kind
8:32
of half finished. And so now I
8:34
get to implement two and
8:36
now there's more of these coming
8:39
along and hoping that this time
8:41
around we consolidate onto one very
8:44
quickly. Cause it's really kind of
8:46
a mess. And basically what
8:48
happens is people like us in
8:51
the Trino community, we have to implement all of
8:53
these and we only have so many people. So
8:56
it's like we implement one really well and the
8:58
rest suffer or we implement all of them kind
9:00
of okay. So it's,
9:03
it's difficult. Like right now there are enough
9:05
people. I think we're maintaining three
9:07
of them. Hive ACID died. And
9:09
that's like one of N
9:12
tools. So like we can have the
9:14
same conversation about security. We can have
9:16
the same conversation about, I don't
9:19
know. There's, there's like lots of these areas.
9:22
Absolutely. So I personally am actually using
9:24
the lake house architecture for my platform.
9:26
For sake of transparency, I am using
9:28
Trino. I'm using the Starburst-managed Galaxy.
9:30
So get that out of the way.
9:32
I'm using the iceberg table format, which
9:35
is largely transparent. I don't have to
9:37
do a lot on the actual table
9:39
format piece because Trino handles that piece
9:41
of it for the most part. And
9:43
so as somebody who's using the lake
9:45
house paradigm, there are definitely a lot
9:47
of niceties. I agree. It's gotten a
9:49
lot easier over the past couple of
9:51
years than it was prior to that.
9:53
A lot of the conversation seems to have
9:56
cohered around a roughly
9:59
standardized conception of
10:01
what constitutes the lake house. I do
10:03
think that one of the areas
10:05
that is still unfinished, or at least
10:08
not as cohesive across the board, is
10:10
that question of security and access
10:12
control. That seems to be one of
10:14
the areas where the overall data ecosystem
10:17
has not yet figured out. Everybody
10:19
has their own thoughts on how it
10:21
can and should be done. Everybody wants
10:23
to own that experience. There aren't
10:25
a lot of methods for being
10:28
able to communicate roles and access
10:30
across the layer boundaries.
10:33
I'm wondering if you can talk to
10:35
some of the ways that that manifests
10:37
in terms of that overall experience as
10:39
a juxtaposition to the warehouse where everything
10:42
is presented as one system. Yeah.
10:45
As one of the people who's written a
10:48
huge portion of the security systems in
10:51
Trino and in
10:53
Galaxy, it's actually a really
10:55
hard space to be in.
10:57
If you look into the
10:59
open ecosystem, throughout this
11:01
whole thing, we're talking about the
11:03
open ecosystem. The open ecosystem for
11:05
security, historically, you had the Hive
11:08
meta store with its security. Well, the
11:10
most popular meta store out there is
11:12
Glue and it doesn't have the
11:14
Hive security model. The Hive security model
11:17
was always weird and only applies to
11:19
Hive. Trino is a federated system,
11:21
so that doesn't make
11:23
much sense. Ranger pretty much died. I
11:25
haven't seen it around in a while.
11:28
There are people still looking at it,
11:31
but I get a sense for how
11:33
popular things are by when people ask
11:35
about things. It's like two, three years,
11:37
two years ago, it just fell off
11:39
a cliff. The only
11:41
other thing I've seen out recently is OPA, which
11:45
the Bloomberg folks have been working on.
11:47
They really like. OPA is really
11:49
complicated. You write
11:52
security rule policies in a
11:54
security rule server in a
11:56
custom language. I literally
11:59
looked at it. And I was like, if I
12:01
did this, I would write a tool to write
12:03
the language policy files for me. It's very complicated.
12:06
So I think that's got a long ways to
12:08
go. Hopefully someone builds like
12:10
a UI and tooling and stuff for
12:12
it. So that's really all you have
12:14
in the open space. In proprietary, you
12:16
have AWS's Lake Formation, which like
12:18
I seriously have yet to meet someone
12:20
who's rolled it out. It just looks
12:22
weird. We'll see what happens. Again,
12:25
I'm hoping, I'm hoping it dies. Like every
12:27
one of these things that's successful, we
12:29
have to build and maintain. So like, I'd like one and
12:31
I'd like it to be open.
12:35
Databricks has their own proprietary thing.
12:37
At Starburst, we have our own
12:39
proprietary thing. I think Tabular
12:41
has their own proprietary thing. You
12:44
end up with proprietary things
12:46
because of the complexity of
12:49
the security system. So like
12:51
in Galaxy, we built
12:53
the security system into the core of
12:55
Galaxy itself. So Galaxy's the Starburst hosted
12:57
version of Trino. So like every screen
12:59
you're looking at in Galaxy is viewer
13:01
aware and we're applying your policy on
13:03
like what you're allowed to see and
13:05
it's really core to the whole application.
13:08
It kind of touches like every single
13:10
bit. So how do you put that
13:12
in? And then you're like, Oh, I'm
13:14
going to make this call out to
13:16
a third party system and like, I
13:18
need to know what changes, but like,
13:20
this is something I need to be
13:23
able to do on like a millisecond
13:25
level. And so security is a super
13:27
hard problem. Also, everyone has different viewpoints
13:29
about how security should work. In Galaxy,
13:31
we follow a very traditional database
13:34
security system with roles and
13:38
access controls, et cetera. In other
13:40
systems, like there's different viewpoints. Like
13:42
it's, it's very interesting. Like OPA
13:45
is like this different universe of
13:47
like policy rule systems. So I
13:50
don't think we have a good answer for this right
13:52
now in terms of like a community. And
13:54
I think this is a, one of the things that
13:56
actually is the reason why you would choose a vendor
13:58
is their security implementation aligns with
14:01
like what you want to do. Yeah,
14:03
the security and policy space is definitely
14:06
still very much in flux, in particular
14:08
in the Lakehouse ecosystem, but even beyond
14:10
that. So OPA is a tool that
14:12
came out of largely the Kubernetes ecosystem,
14:15
and is being applied to a number
14:17
of different areas because it is a
14:20
generalized policy language. There's another project called
14:22
OSO, which is an open source policy
14:24
engine that has its own policy language
14:26
again, so that you can have the
14:28
policy agent embedded in process in various
14:31
language runtimes, and then you can define
14:33
those policies out of band and apply them
14:35
to the runtime dynamically. So I think
14:38
that that is an interesting approach and maybe something,
14:40
you know, where there's OSO or OPA or one
14:42
of the other tools in that
14:45
ecosystem might start to make inroads into
14:47
the data platform ecosystem as well. And
14:49
then you also have things like identity
14:51
systems like Keycloak or Okta or Auth0,
14:54
etc. that also
14:56
factor into all of that. So it's
14:58
a big, complicated space. I
15:00
think part of the problem here is what
15:02
are we optimizing for? So like OPA
15:04
and Ranger, which is just another
15:06
policy system, was great if you're
15:08
an admin and you want to
15:11
like lay down the
15:13
rules like broadly for
15:15
like lots of tables by using
15:17
table matching. But like SQL security
15:19
was really built around like I
15:21
create a table, I type commands
15:23
to grant access to other folks
15:26
in the platform, I may
15:28
create views or like, you
15:30
know, filter rules or something like
15:32
that. And I'm just typing commands to do
15:34
that in the SQL language.
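To make that end-user style concrete, here is a minimal sketch in Trino-flavored SQL (the catalog, table, role, and view names are hypothetical):

    -- grant another role read access to a table I created
    GRANT SELECT ON lake.sales.orders TO ROLE analyst;

    -- expose a filtered slice through a view that runs with my rights
    CREATE VIEW lake.sales.orders_eu SECURITY DEFINER AS
    SELECT * FROM lake.sales.orders WHERE region = 'EU';
    GRANT SELECT ON lake.sales.orders_eu TO ROLE eu_analyst;

And that SQL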
15:36
language is the language
15:39
of the system I'm doing, I'm
15:41
in. So it's like that's a
15:43
system that's optimized for end user
15:45
experience, not admin experience. And
15:47
the admin experience, it's great if
15:49
you're a bank. OPA in Trino came
15:51
from Bloomberg and it's like they have
15:54
a lot of data and they have
15:56
data policies they need to apply broadly.
15:58
But if you're like a small
16:00
group and you want to have a
16:02
security system, like, do you even have
16:04
people that can write these complicated things?
16:06
Can you write a, can you run
16:09
an OPA system that's going to return
16:11
responses in milliseconds because it's part of
16:13
like every query? No. And
16:15
like really you want the system to be kind
16:18
of in a simple, understandable way for a user.
16:20
So it's like, there's these, a lot of the
16:22
stuff in data lakes are provided by big companies
16:25
with big company solutions to big
16:27
company problems, and it does not
16:29
align with like, Hey, I want to
16:31
like grant access to a table to some other person.
16:34
Absolutely. And in the
16:36
data lake and lake house ecosystem
16:39
as well, there's the added complexity that
16:41
by virtue of the storage and the
16:43
compute being disaggregated, you maybe want to
16:45
bring a different compute to that same
16:47
storage. And so then there's the question
16:49
of, okay, well, do I need to
16:52
route all of my requests through the
16:54
other compute engine that has my policy
16:56
information? Do I have to have different
16:58
policy sets and different rule sets across
17:00
those different compute systems? So it's actually
17:02
worse than that too. Cause outside
17:05
of like Trino, the most
17:07
popular compute engines are
17:09
MapReduce-y, like things like
17:12
Spark and Hive. And the
17:14
problem is that those engines
17:16
almost always allow users
17:18
to upload their own third party
17:21
code, untrusted third party code into
17:23
the same process. And that means
17:25
that you can't rely on the
17:28
process to be secure, to protect
17:30
against data access and
17:32
stuff like that. So the
17:35
spark in hive communities are
17:37
pushing for things like column
17:39
level encryption and physical security
17:41
based on file permissions, which
17:43
is like anathema to
17:45
like the way SQL works. This would be
17:47
the equivalent of like, Oh, I'm going to
17:49
manage my, my SQL permissions by setting file permissions.
17:55
It's insane, right? And like, this is
17:57
like state of the art and
17:59
it's because like we, the entire,
18:01
the entire industry went down this
18:04
MapReduce path for 15 years
18:08
and it's not a good
18:10
idea. Like you see like every single
18:12
vendor who's working in the data space
18:14
has moved away from MapReduce. Like, yeah,
18:16
Spark still uses it, but like when
18:18
you get into like high performance stuff,
18:20
like everyone has moved away
18:22
from MapReduce. It's just not a thing
18:24
you do anymore. And we're
18:27
still building our security systems to
18:29
like the lowest common denominator. And
18:32
so taking a step back now
18:34
from ragging about the complexities of
18:36
security, bringing it
18:38
back around to Trino and iceberg,
18:41
I guess maybe keeping it in
18:43
the context of security, what are
18:45
the benefits that that particular pairing
18:47
provides and maybe in juxtaposition to
18:49
other technology stacks or vendors that
18:52
purport to provide a data lake
18:54
house experience? Today, I
18:56
think the data warehouse, like
18:58
the folks talking about the data lake
19:00
experience, and I'm using that in quotes,
19:03
I think it kind of
19:05
breaks down into two camps, you have
19:08
folks who have a traditional
19:10
data warehouse that can pretend
19:13
like it's in the data lake. That's
19:15
almost always done by you run
19:18
a query, it loads the data into Snowflake
19:20
format, they run their query and then they
19:22
throw the data away or they cache it
19:24
or something like that. But they don't actually
19:27
execute directly on the lake house data. So
19:29
that's like one camp. And then the other
19:31
camp would be, obviously you have iceberg
19:35
camp and then you have like the
19:37
Delta Lake camp, which is similar.
19:40
I have my bias. My bias
19:42
is absolutely towards iceberg. I
19:45
was pretty unhappy when Delta Lake
19:47
actually came out. It's
19:50
unfortunate that like, I thought
19:52
we had this brief moment
19:54
where it looked like the
19:56
entire ecosystem was going to
19:58
move onto iceberg. And
20:00
we would only have one thing to implement, not like
20:02
five. And then Databricks dropped
20:05
their format. And in my
20:07
experience, the only people using it are Databricks
20:09
customers, but they have a lot of customers.
20:12
And so like everyone is having to
20:14
implement it because Databricks made it the
20:16
default format for their customers. When
20:19
honestly, like their customers would
20:21
be just as happy with Iceberg.
20:23
So now we all get to
20:26
build twice and yeah, it's got
20:28
a community, but like it's
20:30
not the same thing as it being an
20:32
Apache community. But even then having, if
20:35
there were two Apache projects, I'd be annoyed
20:37
also. And that doesn't, and then there's other
20:39
groups that are trying to build stuff. So,
20:44
so Trino and Iceberg,
20:47
I think we're combining
20:49
together, like in my
20:51
opinion, the best analytics
20:54
query engine we have available along
20:56
with the current best storage
20:58
format. Uh, so
21:02
without like Iceberg
21:04
without Trino is like, great, I have
21:06
storage format, but like, how do I
21:09
query it? How do I, how
21:11
do I interact and change and
21:13
produce these files? Like, you know,
21:16
it's nice, but like, it's not,
21:18
um, you're still suffering the
21:20
problems of some of the other engines
21:23
and Trino on the
21:26
other hand provides this great query engine
21:28
that's adaptable. Like Trino has
21:30
the ability to add in custom
21:32
data types. Uh, we have, uh,
21:35
direct readers for everything. So it can
21:37
actually, we can actually build an engine
21:40
that's really, really tightly,
21:42
uh, set up for what, uh, Iceberg
21:45
can do, and we can
21:47
do that in a, like in a way where
21:49
you get really, really great performance. So
21:52
what Trino was, was suffering
21:54
from until Iceberg came
21:57
along was the data formats weren't
21:59
particularly good. And so like
22:01
they, you would have performance problems,
22:03
you would be missing stats. You
22:05
know, there's this really, most of
22:07
the data formats and the way
22:09
Hive worked was actually designed
22:11
for HDFS, which has a very
22:13
specific performance profile that S3 does
22:15
not have. Like listing files is
22:17
great in HDFS and is insanely
22:19
slow in S3, and Iceberg doesn't
22:23
require listing files. Like there's a whole
22:25
bunch of things like that where Iceberg
22:28
was designed to deal with the performance
22:30
characteristics of object storage as
22:32
opposed to like HDFS's
22:34
design. I mean hardly
22:36
anyone uses HDFS anymore. So
22:39
like Iceberg gave us the,
22:41
a really stable format with
22:43
a well-run community that likes
22:45
specs that understands like the
22:47
performance of modern things. And
22:50
we were able to work really closely with
22:52
them and build a query
22:54
engine that's really tuned.
22:56
The integration we're doing to
22:58
Iceberg is fundamentally designed for Iceberg.
23:00
It isn't like a bolt on. It's like
23:03
we took Hive and like swapped out a
23:05
little bit. So like we wrote a custom
23:07
plugin just for Iceberg that does exactly what
23:09
Iceberg wants. Are
23:13
you sick and tired of salesy data conferences? You
23:16
know, the ones run by large tech companies and
23:18
cloud vendors? Well, so am I. And
23:21
that's why I started Data Council, the
23:23
best vendor neutral, no BS data conference
23:26
around. I'm Pete Soderling
23:28
and I'd like to personally invite you to
23:30
Austin this March 26 to 28th where I'll
23:32
play host to hundreds of attendees, 100 plus
23:35
top speakers, and dozens of hot startups
23:37
on the cutting edge of data science,
23:39
engineering, and AI. The
23:41
community that attends Data Council are some
23:44
of the smartest founders, data scientists, lead
23:46
engineers, CTOs, heads of data, investors, and
23:48
community organizers who are all working together
23:51
to build the future of data and
23:53
AI. And as a listener
23:55
to the Data Engineering Podcast, you can join us.
23:58
Get a special discount off tickets by
24:00
using the promo code DEPOD20. That's
24:03
D-E-P-O-D-2-0. I
24:07
guarantee that you'll be inspired by the folks at the
24:09
event, and I can't wait to see you there. And
24:11
when somebody is building a data platform or
24:19
building their warehouse implementation, they decide,
24:21
okay, this combination of Trino and
24:23
Iceberg does what I want. I
24:25
have the benefits of a performant
24:27
query engine. I have the flexibility
24:29
and scalability of object storage. I
24:31
can scale those two things independently.
24:33
How does that influence the other
24:35
upstream and downstream choices that they
24:38
might make for the other components
24:40
of their data platform? So once
24:43
you decide you're gonna go with Iceberg
24:45
and Trino, you have the complexities of
24:48
like, how do I actually get my
24:50
data into these platforms? The bootstrap problem
24:52
is a really big problem in data
24:54
warehousing in general. It's like, how do
24:56
I get my data in? In general,
24:59
since Iceberg has become so popular that
25:01
a lot of tools are adopting it,
25:03
so actually getting your data in is
25:05
less of a problem, but you definitely
25:07
wanna go and look at the vendors
25:10
you're gonna use for landing
25:12
the data into your S3
25:14
bucket and make sure they
25:16
support Parquet at the very least and
25:19
Iceberg, hopefully. And if they're not
25:21
supporting it, when are they gonna
25:23
support it? Because most of them
25:25
have it on their roadmap unless
25:28
they're actually, unless they're Databricks. Like,
25:30
actually even Databricks is starting to
25:32
add Iceberg support. So making
25:34
sure your vendors actually are supporting
25:36
landing data in Iceberg format. Then
25:38
in terms of like other choices,
25:41
you obviously have things like, how
25:43
am I going? Like, how
25:45
is security gonna work? How is
25:47
data maintenance gonna work? So Iceberg
25:49
tables require maintenance on them. And
25:51
depending on how you're importing data,
25:53
they may require compaction. And you
25:55
wanna keep only so much snapshot
25:57
data because they have the ability
25:59
to query historic data, but that
26:01
means you're holding historic data, which could
26:03
be expensive. So there's a bunch of
26:06
like maintenance things and you're going to
26:08
have to choose a tool that supports
26:10
the maintenance. So many of the platforms
26:12
like Starburst, we're integrating all of this
26:14
stuff into our platform because we want
26:16
to create the simplest experience for people.
26:18
Like we don't want them to have
26:20
to go and like integrate with a
26:22
third party tool to like run some
26:24
compaction jobs.
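As a sketch of what that maintenance looks like if you do run it yourself, the Trino Iceberg connector exposes table procedures along these lines (the table name is hypothetical):

    -- compact the small files produced by frequent ingest
    ALTER TABLE lake.sales.events EXECUTE optimize(file_size_threshold => '128MB');

    -- drop old snapshots so retained history stops accumulating storage cost
    ALTER TABLE lake.sales.events EXECUTE expire_snapshots(retention_threshold => '7d');

    -- querying historic data is what those retained snapshots buy you
    SELECT * FROM lake.sales.events FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';

Then I think there's additional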
26:26
things around like you're going to use
26:28
probably some sort of data transformation
26:31
pipeline kind of tool, almost always
26:33
dbt. I don't even
26:35
know if they have competitors honestly.
26:37
Yeah. And then obviously you're going to
26:40
want some sort of BI tools.
26:42
Most of them are supporting Trino or
26:44
Starburst or both today. So there's a
26:46
good bit of choice reduction there,
26:48
but I think the big thing is
26:51
like data ingest, getting it into iceberg
26:53
and maintaining those files are currently
26:55
a big part of the platforms. Absolutely.
26:57
And I started my
27:00
data lake house journey, I think maybe
27:02
going on two years ago now. And
27:04
in that two years, it has gotten
27:06
better. Initially there wasn't really any out
27:08
of the box support for being able
27:10
to write into a lake house, you
27:12
could write data into S3, but then
27:15
you would have to perform
27:17
a different step to actually tell whatever meta
27:19
store you were using. Hey, these files exist.
27:21
This is the schema. These are the tables,
27:23
et cetera. So my team is actually using
27:25
Airbyte. And so we actually had to
27:27
write a custom output plugin that sat on
27:29
top of their S3 plugin to be able
27:31
to automate generation of those AWS Glue tables
27:33
for the data that was just written out
27:35
rather than having it be an out of
27:37
band process of, Oh, hey, I wrote all
27:39
this data to S3. And I'm going
27:41
to wait for the crawler to run, to
27:43
tell me what those tables are. And it's
27:45
probably going to be wrong anyway, et cetera.
27:50
Absolutely. Airbyte, actually all
27:52
of them, they either have it, and if
27:55
they do, it's not always the best, but
27:57
like every single one of those vendors I think
27:59
has realized that Iceberg is an important
28:01
part of the Data Lake future and they
28:04
just need to be able to ingest directly
28:06
into Iceberg. And Airbyte does have that out
28:08
of the box now. There are a couple
28:11
of implementations. The level of support is not
28:13
quite where I would like it to be.
28:15
And then going back to one of your
28:17
earlier comments as well, as far as the
28:20
data type specifications being a bit all over
28:22
the place, one of the things that is
28:24
my personal pet peeve, at least in the
28:27
Airbyte toolchain. I don't know if it exists
28:29
elsewhere, but anything that has a decimal
28:31
value is automatically a float, which
28:34
if anybody knows anything about data types, that
28:36
is an awful choice. Yes,
28:38
that is an absolutely awful choice.
28:42
Funny enough, the first versions of Trino, we
28:44
didn't have decimal, we only had doubles. And
28:47
the actual migration away from them was
28:49
quite an undertaking. We had like backwards
28:51
compatible flags for a long while where
28:54
you'd be like, oh, if you see
28:56
a literal, it's actually a double, not
28:58
a decimal, like it should have been in
29:01
the spec. So the version of the plugin
29:03
that my team uses, we actually implemented the
29:05
logic that says if it is a numeric
29:07
type that has a decimal place, treat it as
29:09
a decimal value, not as a float.
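A quick illustration of why floating point is the wrong type for decimal values, in Trino SQL:

    -- binary floating point cannot represent 0.1 or 0.2 exactly
    SELECT CAST(0.1 AS DOUBLE) + CAST(0.2 AS DOUBLE);
    -- 0.30000000000000004

    -- exact decimal arithmetic does what you expect
    SELECT CAST(0.1 AS DECIMAL(10, 2)) + CAST(0.2 AS DECIMAL(10, 2));
    -- 0.30

And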
29:15
so for people who are looking at the
29:17
Data Lakehouse ecosystem, going from where we are
29:19
today and looking into the near to medium
29:21
term forward, what are some of the areas
29:23
of progress that you see as
29:25
far as overall improvement in the
29:27
capabilities and user experience for the
29:29
tooling that's available? So I think
29:31
we are finally at the point
29:34
as of this year,
29:36
that the rest of
29:38
the vendor space has
29:41
realized that iceberg is a critical component.
29:43
And they're starting to, they aren't even
29:45
just starting, they figured this out like
29:47
six months ago, their products are starting
29:49
to land. And that's
29:52
a big change. Whereas like before, as you said,
29:54
like, you know, the history of like the Data
29:56
Lake is you end up having to build a
29:58
bunch of the stuff yourselves while
30:01
the vendors figure out what's important.
30:03
So, uh, there's, there's a,
30:05
there's, there's a bunch of interesting
30:07
parts to this. So there's like, obviously
30:09
things like landing data and data maintenance.
30:11
It's going to be interesting to
30:13
see how this shakes out in the
30:16
next like year or two as
30:18
what happened before is happening again, everyone
30:20
realizes it's important. So everyone's going
30:22
to build products around this. So now
30:24
we're going to have competing products that
30:27
all have slightly different features, which
30:29
is a good thing, but it's also
30:31
like a bad thing because it's the
30:33
paradox of choice for the end users.
30:35
You're going to have a lot of
30:37
stuff to look at and you have
30:39
to consider like the data Lake is
30:41
about how things integrate together. So it's
30:43
like, if I choose this product from
30:45
this vendor, how does that work with
30:47
my other products that I might be
30:49
interested in from other vendors? Can I
30:51
use Airbyte to land my data
30:53
and then use a separate data maintenance
30:55
tool that plays well with that landed
30:57
data? And it's
30:59
going to be a interesting next set
31:02
of things around like now that
31:04
we're moving on to iceberg and
31:06
we have Trino.
31:09
So it's like, how do we get
31:12
these different products to play well with
31:14
it? And everyone's got kind of a
31:16
different viewpoint on that. And
31:18
as a vendor supporting Trino,
31:20
building a product powered by Trino,
31:23
what are some of the
31:25
areas of investment that you see
31:27
as being most critical to easing
31:29
that adoption curve, improving the effectiveness
31:32
and user experience for people who
31:34
are using Starburst specifically and Trino
31:36
indirectly to just make their lives
31:38
easier and help them get their
31:40
jobs done. Well, I
31:43
should have mentioned this earlier. The most
31:45
challenging thing that people have is actually
31:47
like how they query their data. So
31:50
we set up Trino and the
31:52
first thing you see in Starburst
31:54
is a way of
31:56
actually entering queries right in our
31:59
UI. You run the queries
32:01
and then you're like, great, I
32:03
want to put this in my BI tool.
32:05
Like how do I get
32:07
this to my BI tool?
32:09
That is a big area
32:11
we actually think about is like how
32:13
do we empower users to get this
32:15
into the tools they want to use
32:17
Then the other part is kind of
32:19
like generally like the admin part: how
32:21
do I manage my security? We spend
32:23
a lot of time around that and
32:25
I think the big areas that we
32:27
look for are how do we make
32:29
it easier and easier for people to
32:32
set up their data lake. So
32:34
one of the first things we
32:36
focused on in the Galaxy development
32:38
was what I call time to first
32:40
query. So you go sign up.
32:42
You can be running queries on your
32:44
data warehouse in a minute,
32:46
couple minutes. That's great, how do
32:48
you get your data? And so
32:50
we spent much time around data
32:52
discovery, integrations, etc. and we're continuing
32:55
to do more and more work
32:57
around how you actually build up
32:59
your initial lake and get your
33:01
data into your lake. So I still
33:03
think that's one of the big
33:06
problems. So it's this: how do
33:08
you get data in? And just
33:10
kind of, from seeing a lot
33:12
of the day-to-day stuff, it's
33:14
nitty-gritty stuff. It's like stuff
33:16
I love, but it's like really detailed.
33:19
There's a lot of choice in the space
33:21
and really what I want as a
33:23
non-data-head end user or even
33:25
honestly my other friends that aren't in
33:28
the technical space, they're like, that's great, but
33:30
like I don't want to learn how
33:32
the low level file system stuff works,
33:34
I just want to run some queries. So I
33:36
spent, we spent a lot of
33:39
time on just like, let's get it
33:41
all working and then if you want
33:43
to like integrate with some additional stuff.
33:45
Cause like that's important too. Like, we
33:48
didn't talk about how we do that,
33:50
but really, it's like get up, get
33:52
queries going, get excited about
33:55
Trino and what we're doing and then
33:57
we can talk about like some people
34:00
are very opinionated about like they want a
34:02
certain specific integration the way they want to
34:05
do it. But it's pretty rare. We
34:07
hear it because we're in the community. But
34:09
like outside of like data heads, people don't even
34:11
like people don't know what Ranger is or Parquet
34:14
or like they don't know what any of this
34:16
is. They're like, I just want to run some
34:18
queries. Yeah, as somebody who's
34:20
been running this podcast for, I guess,
34:23
seven years now, whenever I talk to somebody
34:26
who isn't deeply embedded in this space, I'm
34:28
always struck by the fact that the
34:30
things that I'm talking about, they have no clue and
34:32
they don't care. I'm like, wait a minute. All right,
34:34
reset. I'm going to remember that I'm talking to somebody
34:36
who doesn't do this every day. Yeah, I
34:39
often find myself saying outside of data space. So
34:41
you know, in Excel, when you do x, we
34:44
kind of do that, but the table's infinite,
34:47
like, yeah. Right. And
34:51
going back to that question of landing
34:54
data and the transformation, as you mentioned, most
34:56
people these days are using dbt. There are
34:58
some competitors, but not a lot of them
35:00
and not on the same scale. But one
35:03
of the benefits that Trino provides is, as
35:05
you mentioned, it's a federated query engine. So
35:08
rather than being constrained to, Oh, I can only
35:10
work on the data that's in my iceberg tables,
35:12
you can say, Oh, I actually just want to
35:15
directly query against my Postgres or my MySQL database
35:17
or some of the other numerous data connectors that
35:19
are out there. And I'm wondering what you
35:21
see as the general pattern of people
35:23
who are adopting Trino, whether they are
35:26
still using Airbyte or
35:28
Fivetran as the only means of landing
35:30
data into their lake house, or if
35:32
they're largely using that federated query capability
35:35
to be able to do more kind
35:37
of real time data updates from
35:40
source systems into their lake house via
35:42
those transformation routes. Very,
35:44
very interesting question. So you're going to
35:46
get the database answer, which is it
35:49
depends. Uh, so
35:51
it's interesting. So like
35:54
federation is awesome. You
35:56
generally, typically you're
35:58
not keeping your main data. Actually,
36:00
let me back up. So normally when we're
36:02
talking about federation, so like Trino
36:04
in its heart is a federated
36:06
query engine. That is like, we
36:09
don't own the data. We're interacting
36:11
with data and the
36:13
descriptions of the tables that are
36:15
all external that said the connectors
36:17
that read data from like
36:19
object store and glue and that
36:21
sort of thing, those are effectively
36:23
native formats to Trino. Like we
36:25
implement all the raw file reading
36:27
logic. We talk directly to glue.
36:30
We're not talking to like another query
36:32
engine. Whereas when we talk to
36:34
MySQL, we send a query
36:36
in MySQL's language to MySQL.
36:39
So normally when we're talking about federation,
36:41
we're talking about the stuff that's not
36:43
in normal data lake queries. Folks
36:46
that are a lot
36:48
of companies and users, et
36:50
cetera, we'll have what I'll call
36:53
dimensional data sitting in a production
36:56
store that's like a MySQL or Postgres,
36:58
this could be as simple
37:00
as like demographics for users, et
37:02
cetera. So like they'll have their
37:04
main feed of data say it's
37:06
an ad feed and it's like,
37:08
okay, user so-and-so saw this ad
37:10
you join in with their demographics
37:12
and then you can do analytics
37:14
of like, you know, uh, the
37:16
amount of ad clicks by age
37:18
range or something like that, and
37:21
you don't have age range in your,
37:23
uh, in your normal ad feed.
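As a hedged sketch of that kind of federated query in Trino, joining an Iceberg ad feed against dimensional data living in a Postgres catalog (all catalog, table, and column names are hypothetical):

    SELECT d.age_range, count(*) AS impressions
    FROM lake.ads.impressions i
    JOIN postgres.public.user_demographics d
      ON i.user_id = d.user_id
    GROUP BY d.age_range;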
37:25
So that's really powerful and
37:27
it's easy to do because like you
37:29
just connect them together, you don't have
37:31
to set anything up. The downside is
37:34
you're now accessing a production
37:38
data store that like keeps your
37:40
website running from your query engine
37:42
that can be fine if you're
37:44
using like my SQL and you
37:46
have a bunch of reader applicants
37:48
for your, uh, your database that
37:50
can also be expensive because in
37:52
a transaction processing database
37:55
Is more expensive to run than an analytics database
37:57
for the amount of data. Though
38:00
sometimes you'll want to, you
38:02
instead copy that data into
38:04
your data warehouse. The other
38:06
reason that you want to
38:08
copy data in is that
38:10
sometimes you want historic data. So
38:12
you need the demographics for
38:14
that user when they saw
38:16
the ad. Especially when you're
38:18
doing stuff where there's money
38:20
involved and people are paying
38:22
for certain ad impressions or
38:24
you know you've got
38:26
products, you're selling
38:28
things, and like. You want to record
38:31
the state of the system at that point, so
38:33
a lot of times then you'll still be either
38:35
dumping the data daily. Or you
38:37
can, with a lot of work,
38:39
try something like CDC
38:42
and get a feed into a data
38:44
warehouse. It's very complicated today, so a
38:46
lot of times you'll want to mirror
38:48
the data in because you actually want
38:50
a point-in-time snapshot, because you
38:52
wanna know who these users are. You
38:54
want to reduce the pressure. So a
38:57
lot of people start with the live one and
38:59
then move to the other one when
39:01
they realize the cost or the pressure
39:03
on their database. Moving can be really
39:05
really complicated though. Like the tools there
39:07
are not good. The state of the
39:10
art, the best tools, are very challenging.
39:12
Absolutely. Digging
39:14
a little bit deeper in there,
39:16
I'm wondering if there are any other
39:18
differences that you see in terms
39:21
of the overall pipeline design, access,
39:23
and usage patterns that folks are
39:25
building around the usage of Trino
39:27
and Iceberg as compared to maybe
39:29
a warehouse or some of the
39:32
other lake house compositions that you've
39:34
seen. So the data
39:38
warehousing space I think in
39:41
general is kind of developing in
39:43
two different directions, especially in
39:45
the open data lake.
39:47
So there's a large swath
39:50
of people that are using
39:52
something like dbt to do
39:54
step-by-step transformations, and there
39:56
is a movement towards materialized views
39:58
where you just say, I want
40:01
to materialize a view of this query,
40:01
and here's the policy for keeping that
40:03
up to date. A lot of
40:05
people think they're equivalent, but they are not. So
40:08
materialized views are
40:10
about when you're querying that it's supposed
40:12
to be the equivalent as if you
40:14
just ran the underlying query and so
40:16
the data changes. Whereas like pipeline data
40:19
has the advantage and disadvantage that like
40:21
typically like you're processing on like, I
40:23
don't know, let's say a daily or
40:25
an hourly basis. If like the query
40:27
changes or something like that changes in
40:29
the pipeline, only future data will affect
40:32
it, which is good and bad depending
40:34
on what you're trying to accomplish.
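A minimal sketch of the materialized view side in Trino SQL (names are hypothetical, and the refresh here is manual, though a refresh policy can be layered on top):

    CREATE MATERIALIZED VIEW lake.analytics.daily_clicks AS
    SELECT date(event_time) AS day, count(*) AS clicks
    FROM lake.ads.impressions
    GROUP BY date(event_time);

    -- re-run the underlying query and store the fresh result
    REFRESH MATERIALIZED VIEW lake.analytics.daily_clicks;

So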
40:36
like, I think that's an important split that's
40:38
happening in the open community and I'm curious to
40:40
see which one's going to win. In
40:42
terms of like open data lakes
40:45
versus like proprietary ones, the biggest
40:47
difference is that people don't keep
40:49
all their data in their proprietary
40:51
data lakes. Just too expensive or
40:53
it's too complicated to move
40:55
it all in. Whereas like normally
40:58
people are storing all their data in S3,
41:00
whether it's a data lake or not, because
41:02
it's cheap and they can have a backup,
41:04
but you don't keep all your data in
41:06
Snowflake because it's either too expensive or it's
41:08
too much of a burden to keep all
41:10
the feeds to load it into their format.
41:12
You see the same thing with Redshift
41:14
and basically everything else out there. It's
41:17
like even if it were free, it's
41:19
still just annoying. And then
41:22
another consideration that folks have when they're
41:24
deciding whether or not they want to
41:26
use a lake house approach is sometimes
41:28
they have queries that need to be
41:30
able to operate very quickly. And so
41:32
that's where they'll typically bring in something
41:34
like a ClickHouse or a Druid
41:36
when they're dealing with fast
41:38
moving data that needs to be updated quickly.
41:40
And I'm wondering what you see as some
41:42
of the decision points
41:44
around going wholesale into one of
41:47
those systems or using those as
41:49
a supplement to a Trino and
41:51
Iceberg setup. Yeah, so
41:53
my experience with those systems
41:55
is that they're limited in
41:57
their capabilities. So they're
41:59
almost always used with a
42:01
custom application, especially in the case
42:03
of like Druid where it's not
42:05
standard SQL at all. Very
42:08
powerful, but you basically,
42:10
your application is custom written to
42:13
it. So you're not typically using
42:15
it for general analytics. And
42:17
if you're in that space, like you end up
42:19
having a lot of choices of different things you
42:21
can do. So in terms of
42:23
like fast moving data, I think
42:25
the open data lake is getting better at
42:28
this very fast. I
42:31
think that's a thing everyone's
42:33
focusing on. So with iceberg,
42:36
you now have the iceberg
42:38
appending stuff that came in
42:41
what, two years ago, three years
42:43
ago, like you see more and
42:45
more people using tools to take
42:47
data off of event
42:49
streams like Kafka and landing it
42:51
into tables at high resolution, and
42:54
then having background compaction jobs to
42:56
deal with the insane number of
42:58
files you create.
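For illustration, one hedged version of that landing-plus-compaction pattern, using Trino's Kafka connector to feed an Iceberg table (the topic, columns, and catalog names are hypothetical, and dedicated ingestion tools are often used instead):

    -- land events from a Kafka topic into an Iceberg table
    INSERT INTO lake.events.clickstream
    SELECT _timestamp, user_id, url
    FROM kafka.default.clickstream;

    -- background job: compact the many small files this produces
    ALTER TABLE lake.events.clickstream EXECUTE optimize;

And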
43:00
then downstream of that, there are a
43:02
bunch of vendors and open source projects
43:04
working on taking like, okay, so now
43:07
we have this new data, how do
43:09
we integrate that into the computations? I
43:11
would guess within a couple of years,
43:13
you're going to see everyone building something
43:15
around this, you know, it'll be like
43:18
everything else. A lot of them
43:20
will be bad, but I think
43:22
the overall community is going to
43:24
be more and more of bringing
43:26
in data at near real time
43:29
and being able to have it manipulated
43:31
in a near real time feed. That
43:34
said, that is near real
43:36
time, getting down to like
43:38
milliseconds, like anything under like
43:40
30 seconds typically
43:43
means you have a custom engine where as
43:45
you're bringing the feeds in, they're going into
43:47
main memory and they're being held in memory.
43:49
You can't even get them distributed to
43:52
disk. It's not fast enough. Those
43:54
I think will continue to be fairly
43:56
proprietary systems. They're kind of complicated to write.
43:58
So that's where you're going to see a
44:01
few vendors in that space. My
44:03
experience is that most people don't
44:05
need anything short of a minute.
44:07
Very rare to see that. The
44:09
reason you see, like, Hudi
44:11
came out of Uber is because
44:13
they were using their real time
44:15
system to adjust pricing on the
44:17
fly. Well, how many
44:19
how many organizations have that
44:21
problem? None, like, outside
44:23
of like delivery services; those are
44:26
like the only people I know
44:28
use those systems. Absolutely. And as somebody
44:30
who has been working in the
44:33
space for a number of years,
44:35
as somebody who is building and
44:37
investing in the lake house architecture
44:39
paradigm, and being very deeply entrenched in
44:42
that ecosystem, what are some of
44:44
the most interesting or innovative or
44:46
unexpected ways you have seen Trino
44:49
lake houses applied? So the
44:51
most interesting cases almost
44:53
always are custom applications.
44:56
I've seen so many
44:58
like standard warehouse stuff that like
45:00
they all kind of blend together
45:02
and be less interesting, because what's really
45:04
interesting is when someone builds a
45:07
custom application, especially, you know,
45:09
if they're building a custom data
45:11
store to match. So you have
45:13
things like, ah, companies that run
45:15
something big like a CDN
45:18
and stuff like that, building a
45:20
custom data store that hooks directly
45:22
into their CDN and
45:24
like shows the live
45:26
data feeds; and like security systems
45:29
where you're hooked into the live
45:31
security feeds; or ad systems like
45:33
what we had at
45:35
Facebook, for like hooking into the
45:38
live ad system; or A/B
45:40
testing, where you have a custom
45:42
data store built specifically for a
45:44
problem, with indexes that
45:47
are for petabyte scale data. You
45:49
can do really, really powerful
45:51
things with Trino because of
45:53
the way that the query engine
45:55
is extensible to add new types
45:57
and functions and all sorts of
45:59
stuff into it, and
46:01
end up with extremely
46:04
responsive systems that do
46:06
really custom things at big scale. That
46:08
said, you need a team of highly
46:11
skilled engineers to build something like
46:13
that, which is worthwhile. It's like,
46:15
this is your entire business. I
46:17
think the more common, interesting thing
46:19
is ingesting data and setting it up
46:21
and getting a bunch of people running
46:24
their queries, which is pretty mundane, but
46:26
it's like the power of when you
46:28
give your people access to data and
46:30
their ability to make better decisions is
46:32
just like, it's night and day. And
46:35
in your experience of building
46:37
these systems, working with customers, what are
46:39
some of the most interesting or unexpected
46:42
or challenging lessons that you've learned in
46:44
the process of working in this data
46:46
lake house ecosystem? I think
46:48
the most frustrating thing
46:51
is you run into different
46:53
requirement viewpoints on things. So
46:55
it's like, you think you
46:58
understand what people are
47:00
interested in and you start building that.
47:02
And then someone comes along and they're
47:04
like, no, I actually am very interested
47:07
in the opposite direction. So we had
47:09
a bunch of people that were interested
47:11
in, I don't care what the file
47:13
formats are, I just want this stuff
47:15
to go really fast. You have this
47:17
advantage of your ability
47:20
to move faster and build really custom
47:22
things. If you can change anything you
47:24
want at any time, it's actually a
47:26
huge advantage that the big proprietary vendors
47:28
have. Well, once you get to scale,
47:30
you can't really do that. But in
47:32
the early days, it's very fun. You
47:34
can move very fast. But
47:36
at the same time, like in
47:39
our space, the reality is
47:41
like, we are in this
47:43
open data space. So it's like if
47:46
I extend stuff and no one uses
47:48
it, I'm no longer in that space.
47:50
So it's often challenging to figure out
47:52
like, how do we thread the needle
47:54
of like, actually making things
47:57
a lot better without stepping
47:59
outside that bound. So
48:01
like we're doing a lot of work around iceberg
48:04
and iceberg maintenance. And we
48:06
spent a lot of time
48:09
thinking about like, Hey, should we just
48:11
be, like in Starburst, should we just
48:13
pull this into our separate
48:15
space? And then like, maybe we're
48:17
not even using iceberg manifest files.
48:19
Maybe we're using something else in
48:21
like a transactional database. And then
48:23
I can do indexing in ways
48:26
that are impossible right now. And
48:28
we decided that no, we're the open data lake space.
48:31
So it's like, we got to figure out how to
48:33
do it in the, in the open format.
48:35
Sometimes it's like we have augmented data in
48:37
special fields or sidecar files or that sort
48:40
of thing to be able to like, give
48:42
us the additional information that we need to
48:44
make our stuff go faster. Sometimes like you
48:46
get on Slack and you hit up Ryan
48:49
Blue and you're like, yeah, how about we
48:51
just add some, some stuff into the spec
48:53
to be able to handle this? Like I'm
48:55
sure everyone has this problem. So that's
49:00
the, the like, I want to move faster,
49:03
but I can't move faster thing. Like,
49:05
it drives me
49:05
nuts when it's like, I know there's a
49:07
better solution and it's like, I can't do
49:09
it without breaking and making
49:11
the thing proprietary and then, you know,
49:13
even then I have to like wait
49:16
for others to catch up. Absolutely.
49:18
For people who are in the
49:20
process of designing their data systems
49:22
or they're looking to build a
49:24
new set of capabilities in their
49:26
data platform, what are the cases
49:28
where a lake house architecture is
49:31
the wrong choice? So I,
49:33
I also would say a few years
49:36
ago, this answer was a lot easier.
49:38
I think nowadays the open
49:40
data lakes are very good,
49:42
or, I think what's helpful with
49:44
some of the vertically integrated players
49:47
is you don't have to
49:49
understand a whole lot, you're just, again,
49:51
you shop and you just use the tool. And I think
49:53
that's where data lakes suffered
49:55
like back to the original Cloudera stuff.
49:58
And if you were trying to install it,
50:00
they had like 10,000 choices of different
50:02
tools to install. It's like, I
50:04
just want to work with my data. So
50:06
it's like their entire idea was choice.
50:08
And that was the worst part about
50:10
their product. It was like too much
50:12
choice. I think we've done a great
50:14
job at Starburst around like simplifying, getting
50:16
started on your Lake and getting going
50:18
in your Lake. You also had this
50:21
problem in the past where I would
50:23
say that there's a lot of people
50:25
who feel like they need to
50:27
use a Lake cause they heard of it, or a data
50:29
warehouse in general, and they
50:31
don't actually have a data warehouse problem. Like
50:33
they could just use Postgres and don't
50:35
have a lot of data. Also, we
50:37
see a lot of people that want
50:40
to do Federation and they don't understand
50:42
like Federation is like, we just send
50:44
queries to the other system and they're
50:46
like, well, it'll make my stuff faster.
50:48
And so I don't think we've done
50:50
a great job of describing when you
50:53
would choose to even move to a
50:55
data warehouse. And then in terms of
50:57
like proprietary versus non, it's a, it's
50:59
a tough choice. They can get yourself
51:01
going, but they can be very expensive,
51:04
complex to manage. And you're bolted into that thing.
51:06
Like I don't know if you've ever seen someone
51:08
try to move from a traditional
51:10
warehouse to an open one. It's not super easy.
51:13
I don't want to say it's hard. Like we
51:15
do a lot of business with moving people onto
51:17
the lake, but it would be, would have been
51:19
a lot easier if they had started on the
51:22
Lake. Absolutely. And as
51:24
you continue to build and iterate on
51:26
the Trino platform and the Starburst product,
51:28
what are some of the things you
51:30
have planned for the near to medium
51:32
term or any particular projects or problem
51:34
areas you're excited to explore? So
51:37
on the open source side, there's
51:39
a bunch of stuff I'm very
51:42
interested in around how
51:44
we can spin people up on
51:46
Trino in a faster
51:49
and easier way. So
51:51
we're doing more around
51:53
simplifying the setup, simplifying
51:56
the installation process, making
51:58
it work in
52:00
smaller environments, things like that,
52:02
better integrations with the different
52:04
ecosystems. Like, I want to
52:06
see much more work done
52:08
with better integrations with the
52:11
Python ecosystem in particular.
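As one concrete example of an integration that exists today, the trino Python package ships a SQLAlchemy dialect, so tools like pandas can treat Trino as just another database. A minimal sketch with placeholder connection details (install with: pip install 'trino[sqlalchemy]' pandas):

    import pandas as pd
    from sqlalchemy import create_engine

    # SQLAlchemy URL format: trino://user@host:port/catalog/schema
    engine = create_engine("trino://analyst@trino.example.com:8080/iceberg/sales")

    # Pull a query result straight into a DataFrame.
    df = pd.read_sql("SELECT * FROM orders LIMIT 100", engine)
    print(df.head())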
52:14
One of the big areas that
52:16
I have been focusing on recently
52:18
has been around how you actually
52:21
set up Trino. So historically,
52:24
Trino was designed and operated as:
52:26
you had a data lake with
52:28
Hive in it, and then maybe
52:30
Spark in it. And you're adding
52:33
Trino because both of those query
52:35
engines are really slow and not particularly
52:38
good to use. Now we're at the
52:40
point where a lot of people
52:42
just run it and they don't have Hive and Spark.
52:45
So there were things we would assume would
52:47
already exist because you have those other tools.
52:50
Like now we're going back and adding a
52:52
bunch of things where normally
52:55
you would have just fired up the
52:57
Hive console and run some commands and
52:59
you just don't have that anymore. So
53:02
another big area is you set
53:04
up Trino and it's like, oh, you want to set
53:06
up a new catalog. And
53:09
in the old days, you knew what you wanted
53:11
to connect to because you already had a data lake
53:13
and so you just created this little catalog file,
53:15
modified it, and restarted your server until
53:17
things worked. Well, that's just not how people do
53:19
it anymore. Now they fire up Trino and
53:21
it's like, okay, I want to connect to my
53:23
S3, and instead of going to edit
53:25
a file, I can run a SQL
53:27
command. So we recently added a bunch of
53:29
stuff around CREATE CATALOG and DROP CATALOG. There's
53:31
still more to be done, like ALTER
53:33
CATALOG. Right now it's still just under
53:35
the covers modifying like local
53:37
files, but we have some work on like
53:40
putting it into a real database.
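For a sense of the difference: the old way was hand-editing a properties file such as etc/catalog/lake.properties on every node and restarting the server; the newer way is plain SQL. A minimal sketch through the Python client, assuming the coordinator has dynamic catalog management enabled; the catalog name and connector properties here are placeholders:

    import trino

    conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="admin")
    cur = conn.cursor()

    # CREATE CATALOG replaces writing a properties file and restarting.
    cur.execute("""
        CREATE CATALOG lake USING iceberg
        WITH (
            "iceberg.catalog.type" = 'rest',
            "iceberg.rest-catalog.uri" = 'http://rest.example.com:8181'
        )
    """)
    cur.fetchall()  # drain the result so the statement completes

    # Removing a catalog later is just as direct.
    cur.execute("DROP CATALOG lake")
    cur.fetchall()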
53:42
It's funny, you think about this
53:44
in the Trino ecosystem and it's
53:46
like, what do you mean you're not storing
53:49
your catalogs in a normal catalog system?
53:51
It's like we never needed to. And
53:53
it's like at Starburst, like with Galaxy,
53:55
we've had this from the beginning: you go into
53:57
the UI and you just modify your catalogs.
54:00
Everything's kind of live-ish,
54:02
getting even more live with these
54:04
changes we're putting into Trino.
54:06
So you will be able
54:08
to just add catalogs and remove
54:10
them a lot more easily and maintain
54:12
your system, and put in a bunch
54:14
more stuff around like data evolution
54:16
and things like that. So I'm
54:18
excited about this. Like how do
54:20
we bring more people into this
54:23
community? Because I think we're
54:25
very much at the point where
54:27
the difference between what I can
54:29
do in a traditional data warehouse and what I
54:31
can do in Trino is
54:33
a much, much smaller gap. Like when we
54:36
started Trino, we're like, we're going to be
54:38
able to take out traditional data warehouses with
54:40
this. Like we're going to build something that's
54:42
as good as that. We're 10 years in,
54:44
and I think for the vast majority
54:46
of cases we've been able to take
54:48
them out for years and years and years,
54:50
but it's like this new user
54:53
case, I think is like the one
54:55
remaining spot. And like when we started
54:57
this project, we said, it's going to
54:59
take 10 years. I think we're
55:01
there; we just need a
55:03
little bit more. And I think we will
55:05
have covered pretty much everything all the way
55:07
down to like a new user with like
55:10
a couple of files they want to process.
55:13
It's funny how persistent
55:15
that 10 year time horizon is. Pretty
55:17
much every time I talk to somebody
55:19
who has built or is building a
55:21
database engine, they always say it takes
55:23
10 years before you get it right.
55:26
Yeah. The other thing they don't say is
55:29
like, it kind of takes five years before,
55:31
you know, it kind of doesn't
55:33
suck. You know, it was pretty good,
55:35
but, you know,
55:37
we didn't have the ability to write
55:39
tables for the first year. It was like, we've
55:41
got data, we've got Hive, it's writing data
55:44
for us; we'll just run queries that
55:46
select the data out. So the
55:48
amount of stuff from like, "oh, this is
55:50
actually interesting, it kind of works," to
55:53
"I can use it everywhere" is
55:55
like, people have no idea. Absolutely.
55:58
It's amazing how many products have been built
56:00
because the person building it didn't realize how hard
56:03
it was going to be. Yeah.
56:05
Yeah. I honestly, I think that's
56:07
almost every project I work on is like,
56:10
if I knew, I probably wouldn't start. And
56:14
are there any other aspects of the work that
56:16
you're doing on Trino and this overall space of
56:18
the data lakehouse ecosystem, the combination of Trino
56:20
and Iceberg that we didn't discuss yet that you'd
56:22
like to cover before we close out the show?
56:25
I think we actually covered all of it. All
56:27
right. Well, for anybody who wants to get in
56:29
touch with you and follow along with the work
56:31
that you're doing, I'll have you add your preferred
56:33
contact information to the show notes. And as the
56:36
final question, I'd like to get your perspective on
56:38
what you see as being the biggest gap in
56:40
the tooling or technology that's available for data management
56:42
today. I really, really think
56:44
we need a big improvement in
56:46
the security space. And I don't
56:49
really care what it is other
56:51
than like, it needs to work
56:53
well with things like Trino and
56:55
the maintenance, like the amount of
56:57
complexity you have to go through
56:59
to set those policies. You have
57:01
to learn a new language. That's
57:04
way too complicated. And frankly, even
57:06
if you do learn the language, you're
57:08
going to get the policies wrong because
57:10
you're not an expert in it.
57:13
The models are too complex. The other
57:15
space is, I still think it's too hard
57:17
to get data into the lakes. It
57:19
just needs to work and land and
57:21
be maintained and like, you shouldn't have
57:23
to think about it. It should
57:25
always work and be low
57:27
cost and data just shows up. Like
57:29
why do I have to worry about,
57:31
you know, all the feeds? All
57:35
right. Well, thank you very much for taking
57:37
the time today to join me and share
57:39
the work that you and your team have
57:41
been doing on bringing the data lake house
57:43
ecosystem into a better place and all the
57:45
work that you're doing to build the Starburst
57:47
product definitely makes the onboarding a lot easier
57:49
for folks. So I definitely like the work that
57:51
you and your team are doing there. So
57:53
thanks again for taking the time and I
57:55
hope you enjoy the rest of your day.
57:57
Thank you. This is great. Thank
58:03
you for listening.
58:05
Don't forget to check out our other shows: Podcast.__init__, which covers the
58:10
Python language, its community, and the innovative
58:12
ways it is being used, and the
58:14
Machine Learning Podcast, which helps you go
58:16
from idea to production with machine learning.
58:18
Visit the site at dataengineeringpodcast.com to subscribe
58:21
to the show, sign up for the
58:23
mailing list, and read the show notes.
58:25
And if you've learned something or tried out a product from the
58:28
show, then tell us about it. Email
58:30
hosts at dataengineeringpodcast.com with
58:32
your story. And to help other people find
58:35
the show, please leave a review on Apple
58:37
Podcasts or tell your friends.