Version Your Data Lakehouse Like Your Software With Nessie

Released Sunday, 10th March 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:11

Hello and welcome to the Data Engineering

0:13

Podcast, the show about modern data management. Data

0:16

lakes are notoriously complex. For

0:19

data engineers who battle to build

0:21

and scale high-quality data workflows on

0:23

the data lake, Starburst powers petabyte-scale

0:25

SQL analytics fast at a fraction

0:27

of the cost of traditional methods

0:29

so that you can meet all

0:31

of your data needs, ranging from

0:33

AI to data applications to complete

0:35

analytics. Trusted by teams of all

0:37

sizes, including Comcast and DoorDash, Starburst

0:39

is a data lake analytics platform

0:41

that delivers the adaptability and flexibility

0:43

a lakehouse ecosystem promises. And

0:46

Starburst does all of this on an

0:48

open architecture, with first-class support for Apache

0:50

Iceberg, Delta Lake, and Hudi, so you

0:53

always maintain ownership of your data. Want

0:56

to see Starburst in action? Go

0:58

to dataengineeringpodcast.com/Starburst and get $500

1:00

in credits to try Starburst

1:02

Galaxy today, the easiest and

1:04

fastest way to get started

1:06

using Trino. Dagster

1:08

offers a new approach to building

1:10

and running data platforms and data

1:13

pipelines. It is an open-source, cloud-native

1:15

orchestrator for the whole development lifecycle,

1:17

with integrated lineage and observability, a

1:19

declarative programming model, and best-in-class testability.

1:22

Your team can get up and running

1:25

in minutes thanks to Dagster Cloud, an

1:27

enterprise-class hosted solution that offers serverless and

1:29

hybrid deployments, enhanced security, and on-demand ephemeral

1:32

test deployments. Go to

1:34

dataengineeringpodcast.com/dagster today to get started, and

1:36

your first 30 days are free.

1:39

Your host is Tobias Macey, and today

1:42

I'm interviewing Alex Merced about Nessie, a

1:44

Git-like versioned catalog for data lakes using

1:46

Apache Iceberg. So Alex, can you

1:48

start by introducing yourself? Hey, everybody. My name

1:50

is Alex Merced. I'm a developer advocate at Dremio, where I work with Nessie,

2:00

which will be something I'm going to love

2:02

to talk about today, but all about the

2:05

Lakehouse, even so much as being one of the

2:07

co-authors of Apache Iceberg, the Definitive Guide, an

2:09

upcoming book. And do you

2:11

remember how you first got started working in data? It's

2:13

a fun story. I have a very long,

2:15

not traditional way I kind of got here.

2:17

So the long and short of

2:20

it is basically, and then, you know, definitely I

2:22

have a longer version of the story in places,

2:24

but basically I did start off as a computer

2:27

science major, but then I got really into music

2:29

and kind of went into this completely different category

2:31

of studying, like, culture marketing, which somehow led

2:33

me into a career training people in finance. And

2:35

I ended up training people in finance for 10

2:38

years. So I spent a lot of time breaking

2:40

down really complex ideas and helping people kind of

2:42

understand them in a more accessible way. But

2:45

I then eventually ended up back in software and

2:47

came back as a software developer and did that

2:49

for a few years and also trained software developers.

2:52

But I was always a big fan of working

2:55

with databases. So like some of my

2:57

favorite projects were finding ways to optimize

3:00

the database, finding ways to offload workloads,

3:03

to move business logic out of the wrong places

3:05

when people put, like, too much of that stuff in

3:07

their client side of their websites. So

3:09

I started kind of gravitating more and more to the

3:11

database. And then I also started gravitating more towards like

3:13

the developer advocacy world, because I was always

3:15

naturally someone who would like to teach. I like

3:17

to create content. I like to break

3:20

down ideas. So I decided to

3:22

make the shift from software development into the developer

3:24

advocacy world. And I ended up

3:26

finding a home in Dremio where I got to spend a

3:28

lot of time learning about this really cool exciting thing called

3:30

the data lake house. And it's definitely

3:32

something that makes me wake up really excited every day.

3:35

And now I get to help people understand that

3:37

and bring that understanding of not only what

3:39

it is, but how to implement it, the technologies around

3:41

it and so forth. And

3:43

for the conversation today, we're focused on the

3:45

Nessie project. And I'm wondering if you can

3:47

describe a bit about what it is some

3:49

of the story behind it and where it

3:51

fits in that context of the data lake

3:53

house. Got it. Okay,

3:56

so bottom line is the Nessie project

3:59

at its core is a catalog. So when it

4:01

comes to the Apache iceberg table format, there

4:04

is a need for a mechanism

4:06

to act as a catalog. So it tracks all the

4:08

different tables, and primarily what it does is track

4:11

a reference to what is the most current

4:13

metadata.json file for that particular table. So

4:16

at the core, that's what Nessie does.

4:18

What Nessie provides is the additional ability to

4:20

actually create commits, not at the individual

4:22

table level, but at the catalog level.

4:24

So it actually, every time that those

4:27

catalog references change, Nessie treats it

4:29

like a commit, which means it allows you to have the

4:31

same sort of Git-like semantics as

4:33

Git, as far as being able to do

4:35

branching, tagging. And this kind of changes

4:37

the dynamics of sort of how you interact with

4:40

your catalog and how you plan sort of like

4:42

data ops type practices where you want to kind

4:44

of isolate developer environments or roll

4:46

back when it comes to disaster recovery. It changes a lot

4:48

of things and actually makes it oftentimes easier

4:50

and creates sort of new patterns when it comes to the

4:53

data lake house. You

4:55

mentioned the ability to do branching

4:57

and committing and merging and tagging.

4:59

And I'm wondering, in terms of

5:02

the context of data lake houses,

5:04

the overall data pipelining

5:07

and workflows, what are some

5:09

of the core problems and complexities that Nessie

5:11

is designed to solve for? I mean,

5:13

bottom line, like a couple of different situations

5:16

where Nessie becomes really useful. Probably the

5:18

lowest hanging fruit is like data rollback. So

5:20

basically you have maybe a pipeline that fails

5:22

and now you have bad data

5:24

or partial or inconsistent data and let's say

5:26

a handful or even dozens of tables. Now

5:28

you technically can roll back those tables directly from the

5:30

table format in Apache Iceberg, but you have to do each

5:32

table one by one. By having a

5:34

catalog level abstraction, I can just roll back the

5:37

catalog to the commit that was like the last

5:39

clean commit. And I can do that in all

5:41

in one fell swoop and move the whole catalog

5:43

back to before that ingestion job. But

5:46

also what happens a lot of times is that people would

5:48

create duplicates of their data, like for a developer environment. And

5:50

then they would do all their work there and then have

5:52

to merge those environments and it was

5:55

harder to create these environments, and more costly because of

5:57

the storage. But with versioning

5:59

like Nessie, I can basically

6:01

create that isolated branch environment without

6:03

creating a single duplicate of my

6:05

existing data. It would just

6:08

basically isolate the new snapshots going forward, so

6:10

the only new data is

6:12

really the data of those new transactions.
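
[Editor's note: a minimal sketch of that zero-copy branch pattern, for readers who want to try it. It assumes a Nessie server on its default port, Nessie's Spark SQL extensions on the classpath, and illustrative catalog, warehouse, and table names; none of the specifics below are from the episode. Catalog-level rollback is also possible by reassigning a branch to an earlier commit, but the exact syntax varies by Nessie version, so check the docs.]

    from pyspark.sql import SparkSession

    # Sketch: a Spark session wired to a Nessie-backed Iceberg catalog.
    # Endpoint, warehouse path, and names are assumptions for illustration.
    spark = (
        SparkSession.builder.appName("nessie-branching-sketch")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
                "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
        .config("spark.sql.catalog.nessie.ref", "main")
        .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
        .getOrCreate()
    )

    # An isolated developer environment: no files are copied; the branch is
    # just a new named pointer at the same commit as main.
    spark.sql("CREATE BRANCH IF NOT EXISTS dev_env IN nessie FROM main")
    spark.sql("USE REFERENCE dev_env IN nessie")

    # New snapshots land only on dev_env; main is untouched.
    spark.sql("INSERT INTO nessie.sales.orders VALUES (1, 'pending')")

    # Throw the environment away (or MERGE it into main) when done.
    spark.sql("DROP BRANCH dev_env IN nessie")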

6:16

In terms of my experience of surveying the

6:18

overall data ecosystem, in particular, the data

6:20

lake and data lake house environments, the

6:22

closest thing that I've seen to Nessie

6:24

as far as this branching and merging

6:26

semantics, the ability to do that kind

6:28

of zero copy cloning, I guess, there

6:31

are two pieces to that. One is

6:33

the zero copy cloning and being able

6:36

to do very low cost developer environments,

6:38

copy on write semantics is with

6:40

Snowflake. I know that they have the

6:42

ability to do that, kind of snapshot

6:45

tables, create a copy of a table

6:47

using the same existing underlying data. But

6:49

from the lake perspective, the closest project

6:52

I've seen is LakeFS, which

6:54

has that same idea of Git

6:57

semantics, but at the S3 abstraction

6:59

layer. And I'm wondering if

7:01

you can talk to some of the overlap

7:03

and some of the divergence between Nessie and

7:06

LakeFS and when you might decide to use

7:08

one versus the other. Oh,

7:10

yes, actually, I find the difference is quite interesting. And the

7:13

funny thing is I think they were both sort of kind

7:15

of coming into existence around the same time. I

7:17

recently saw a talk where they talked about sort of the evolution

7:20

of LakeFS. And I remember seeing a talk about the evolution of

7:22

Nessie. And those initial questions were the

7:24

same, and both of them started basically asking questions,

7:26

can we just use Git and realizing, okay, like

7:28

the type of throughput, the type of, the

7:31

amount of changing that happens in data is

7:34

not really built for that. So basically you

7:36

have to kind of find some other abstraction. So

7:38

LakeFS went the approach of where you

7:40

basically capture sort of deltas in the

7:42

actual files. So you say, okay, add

7:44

this file, subtract this file, while

7:47

Nessie goes with the approach of

7:49

just capturing sort of that metadata change. So

7:51

a couple ways to kind of think about

7:53

it is imagine I updated an iceberg table

7:55

with the insert. That might create a thousand

7:57

new files. So in the case of the...

8:00

LakeFS commit, it's not aware of the table. So

8:02

it's not aware of where the table exists, it's just

8:04

the way there's a thousand new files in

8:06

my file system and then captures a command that says, okay,

8:08

hey, these thousand files have been added. While with Nessie,

8:11

the only thing that changes, the thing Nessie

8:13

sees, is one thing: the metadata. There's

8:15

a new metadata.json file. So instead of tracking

8:17

a thousand new things and then a thousand

8:19

new things were added, it's just this table

8:21

snapshot has changed from pointing to here to

8:23

there. So it's a much more

8:25

sort of lightweight change that can handle sort

8:27

of very high velocity throughput as

8:29

far as, like, if you're making a lot of changes over time, because

8:32

you're not tracking as many different items. A couple

8:34

of other differences are that it's sort of more

8:36

table aware because it is at the catalog level,

8:39

which allows you to sort of move

8:41

all of those Git-like semantics into

8:43

SQL. So I can create a branch

8:46

using SQL, I can merge a branch using SQL,

8:48

I can create a tag, while with

8:50

LakeFS, it's

8:52

mostly done through the file path. So basically what

8:54

it does, it takes advantage of object storage and

8:57

says, okay, hey, there's gonna be this dynamic part

8:59

of the file path that represents what branch you're

9:01

on. And then oftentimes you create these

9:03

branches, oftentimes all of the work has to be done with

9:05

a CLI. So while

9:07

probably like for a lot less technical

9:09

users, SQL can be a much more

9:11

accessible approach to doing a

9:13

lot of these things. And the CLI

9:16

tool might be maybe a little less accessible.

9:18

So there's also some ergonomic differences, I would

9:20

say. Zeroing in

9:22

on that catalog element, we've mentioned a

9:24

few times that Nessie is a catalog

9:27

and it corresponds to various pointers into

9:29

the iceberg table format. And I'm wondering

9:31

if we can dig a bit more

9:33

into the context of what purpose does

9:36

the catalog serve in that data lake,

9:38

data lake house environment and what are

9:40

some of the alternatives or

9:42

what are some of the pieces that

9:44

Nessie might replace if somebody already has

9:46

an existing lake house environment? So

9:50

a couple of things first, like so

9:52

right now Nessie primarily works with iceberg. The cool

9:54

thing about Nessie architecture is that it just tracks

9:56

sort of these like little metadata objects. So basically

9:58

it's really just an object. It has

10:00

like a data type and right now the main data

10:02

types you see are iceberg tables, iceberg views. Theoretically,

10:05

other table formats could come into the picture

10:07

pretty easily, but basically tracks

10:09

that metadata. Now the thing

10:11

is that the way the iceberg spec works

10:13

is that generally the catalog, that catalog reference

10:16

is sort of like your source

10:18

of truth when it comes to the current state of the table. So

10:21

the problem is you generally don't want your iceberg references

10:23

in more than one catalog. So

10:25

this is where basically, hey, if I choose

10:28

Nessie as my catalog, then that precludes me

10:30

from using another catalog like an AWS glue

10:32

or a tabular or something like that. So

10:34

oftentimes when you are adopting an Apache iceberg

10:36

lakehouse, you do have to take a look

10:38

at sort of like, what are the

10:40

tools you're using? And what are

10:42

the different features of the different catalogs? Most of them

10:44

are going to generally provide you the main service of,

10:47

of basically, hey, I can identify my tables and I can

10:49

take this catalog to Spark, and Spark sees

10:51

all my tables. I take this catalog to Flink, it

10:53

sees all my tables. I take it to Dremio, it sees all my tables. But

10:57

not every catalog works with every tool currently.

10:59

I think that story has gotten a lot

11:01

better. So most catalogs are workable in most

11:03

places nowadays, but that is essentially sort

11:05

of one of the big sort of cost

11:08

benefit calculations you have to make when selecting a

11:10

catalog. And when it comes

11:12

to particularly with like Nessie, it works with most

11:14

of pretty much all the big-name, typical open

11:16

source tools. So it works with Trino, it works

11:18

with Presto, it works with Dremio, it works with

11:20

Apache Spark, it works with Apache Flink. So you

11:23

get that branching and merging across all these tools. So

11:25

if your workflows incorporate these tools, you can then

11:27

add that catalog-level branching, merging, and tagging to it. And

11:31

now digging into the

11:33

versioning capabilities specifically, you mentioned

11:36

that at a

11:38

high level what Nessie does is it keeps

11:40

a reference to all

11:43

of the table metadata

11:45

pointers so that within each set of

11:47

transactions or each commit, you can say

11:50

I am pointing at this set of

11:52

metadata for all of these tables. And

11:55

so you can have commit and

11:57

rollback functionality across tables and transactions.

12:00

And in terms

12:02

of the actual versioning of the data,

12:04

I know that Iceberg has built-in support

12:06

for being able to do optimistic

12:08

concurrency control and being able to

12:10

keep snapshots to different points in time

12:12

of data based on the underlying

12:15

files and the changes there. I

12:17

also know that it requires a certain amount

12:19

of maintenance to keep the tables kind of

12:21

happy and performant as far as doing things

12:24

like vacuuming and pruning old references

12:26

and old versions there. I'm curious if you can

12:28

talk to some of the ways that Nessie handles

12:30

the interoperability with the

12:32

versioning in Iceberg as well

12:35

as any of the maintenance

12:37

pieces that it can help with as

12:40

far as pruning old versions, running table

12:42

compactions, etc. Yes.

12:44

Okay. So basically the architecture of Nessie is

12:46

that mainly it's basically going to be a

12:48

running service that you would run. You could

12:51

also get it as part of... It's

12:54

actually integrated into Dremio. It's an integrated catalog.

12:56

But essentially it interacts through a REST

12:59

API. And when it comes

13:01

to the versioning aspects, right now if I

13:03

were to capture a commit, basically

13:05

it creates a sort of like JSON-like entry in

13:07

the backing store. So it could be like a

13:09

RocksDB, a Postgres, whatever you choose as

13:11

your backing store. That'll say basically

13:13

I have a timestamp for that commit, sort of

13:15

like the parent commit to that. So that way

13:17

it knows what the tree looks

13:20

like. And then just a couple of other metadata

13:22

pieces. So right now it's like a very small

13:24

metadata imprint. So right now,

13:26

generally the best practice is oftentimes like one

13:28

branch at a time. And there's actually, I'll give you

13:30

a couple examples of people who are actually doing that

13:32

in production in that way. But when

13:34

it comes to the maintenance side, this is where it gets a little

13:36

bit tricky. Because typically when it comes to Iceberg, when

13:38

you're doing like expired snapshots or something like that,

13:41

the assumption is that there's

13:43

essentially that that particular

13:45

table's metadata is aware of all of its

13:47

own snapshots. But then, with

13:49

Nessie, you might have different branches where there's

13:51

different versions of metadata JSON that has references

13:53

to different snapshots for it to be

13:56

aware of, like, okay, hey, when I expire

13:58

a snapshot, how do I know which files I

14:01

can safely delete. So what

14:03

Nessie did is they created their own tool called the GC Cleaner, which

14:06

does that kind of garbage cleanup. So it'll actually take a

14:08

look at the metadata JSON at the head

14:10

of each sort of branch and be able to

14:12

kind of safely identify, hey, which files are able

14:14

to be deleted. So when you run

14:17

the cleanup, either

14:19

when you run the GC Cleaner independently, or if

14:21

you're using Dremio, you use the vacuum command, it'll

14:24

use that tool to then safely make sure

14:26

it deletes the right data files

14:28

without affecting other branches.
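
[Editor's note: for readers trying to picture those commit entries, here is a rough, purely hypothetical sketch of the shape described above. The field names are illustrative, not Nessie's actual storage schema.]

    # Hypothetical shape of one Nessie commit entry, per the description above.
    # Nessie stores a pointer to the table's metadata.json, not the metadata itself.
    commit_entry = {
        "hash": "8fbc6f104f3b",                   # this commit's id (illustrative)
        "parent": "2d91c0aa57e4",                 # previous commit on the branch
        "timestamp": "2024-03-10T06:00:00Z",
        "operations": [
            {
                "type": "ICEBERG_TABLE",          # the content type being tracked
                "key": "sales.orders",            # the table's identifier
                "metadata_location": "s3://my-bucket/warehouse/sales/orders/"
                                     "metadata/00042.metadata.json",
            }
        ],
    }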

14:31

Now, as far as the versioning pieces,

14:34

anybody who's used Git for any length of

14:36

time has dealt with the dreaded merge conflict.

14:38

And when you're dealing with numerous tables, potentially

14:40

dozens or hundreds, the last thing that you

14:43

want to think about is how do I

14:45

deal with a merge conflict? If I'm creating

14:47

a branch and then I need to merge

14:49

it back after somebody else has created their

14:51

own branch and merged it ahead of mine.

14:53

And I'm curious if you can talk to

14:55

some of the ways that those versioning changes,

14:57

branching and merging are

14:59

kind of sanitized so that we don't have to

15:01

deal with these big complex messy merges

15:04

in the event that underlying data has

15:06

changed in a manner that is incompatible

15:08

across branches. Yeah, I mean, right

15:10

now it's pretty shallow. So it's just tracking basically

15:13

that metadata reference and essentially a timestamp and a

15:15

parent. So right now you can get a merge

15:17

conflict pretty easily if you're starting like several branches

15:19

at the same time. Typically the pattern

15:21

we've been seeing is that what people will do is they'll

15:23

start a branch at the beginning of the day. So what

15:25

they'll do is they'll create a branch for that day. And

15:28

then they'll do all their ingestion for that day on that

15:30

branch, run some validating logic at the end of the day

15:32

and then basically merge that branch at the end. So instead of

15:34

creating like lots of branches, at

15:38

least for ingestion purposes, usually you wanna stick to sort of

15:40

like one branch per catalog. And

15:42

then again, you could have a new branch for each use case.

15:45

So basically I'll create a branch for today. We

15:47

validate at the end of the day. And then basically at the

15:49

end of the day, you're always merging that validated data back into

15:51

production. And then other uses, if

15:53

you're gonna do more branches, usually other use cases

15:55

would be like, okay, I'm just creating a branch

15:57

just for experimentation purposes. Or, I mean, creating

16:00

a branch to isolate some particular changes that I don't plan

16:02

to merge back in, but I want to kind of keep this

16:05

separated. But generally, as far as merging in,

16:07

right now you probably would prefer to keep it

16:09

sort of simple: make a branch, merge it in. Part

16:12

of what's evolving in the project is kind

16:14

of adding more metadata to what

16:16

the catalog tracks so that way later on

16:18

you can have more sophisticated sort of merge

16:20

resolution. So right now, best practice would be sort

16:22

of, like, for ingestion:

16:24

have like a branch that is your ingestion branch and

16:27

keep it that way and then merge it and then

16:29

create another branch for the next ingestion job after that

16:31

ingestion job is complete.
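
[Editor's note: a sketch of that daily ingestion-branch pattern, assuming the Nessie-configured Spark session from the earlier sketch. The table names and the validation check are illustrative.]

    from datetime import date

    # One branch per day: all ingestion lands here, and main stays clean
    # until the end-of-day validation passes.
    branch = "ingest_" + date.today().strftime("%Y_%m_%d")
    spark.sql(f"CREATE BRANCH IF NOT EXISTS {branch} IN nessie FROM main")
    spark.sql(f"USE REFERENCE {branch} IN nessie")

    # ... all of the day's ingestion jobs write against this branch ...
    spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM nessie.staging.orders")

    # End-of-day gate: merge only if the validating logic is happy.
    bad_rows = spark.sql(
        "SELECT COUNT(*) AS c FROM nessie.sales.orders WHERE order_id IS NULL"
    ).first()["c"]
    if bad_rows == 0:
        # Publishes every table touched today to main in one atomic step.
        spark.sql(f"MERGE BRANCH {branch} INTO main IN nessie")
    else:
        print(f"Validation failed; {branch} left unmerged for inspection")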

16:33

Digging more into Nessie specifically, you mentioned a little bit

16:35

about some of the specifics of running it, and

16:37

I'm wondering if you can talk to the

16:40

overall architecture and design of the Nessie project

16:42

and some of the ways that it has

16:44

changed and evolved in scope and purpose from

16:46

when it was first started. Yeah, I mean

16:49

I think when it first started it was

16:51

and I think it still wants to be; it's

16:53

still in this regard of being sort of

16:55

a lakehouse catalog. So while it mainly works

16:58

with Apache Iceberg, it has the

17:00

architecture so that it can expand in a

17:02

sense because basically what happens is that there's

17:04

all these different types of things

17:06

that it can track, and then there's just

17:08

essentially just deciding on an agreed schema

17:10

that's built in for those types. So

17:13

right now like the types are like namespace so if

17:15

you're creating a subfolder or like a database however you

17:17

want to think of these namespaces, there's

17:19

iceberg views, iceberg tables. There's also Delta

17:21

Lake tables that are actually part of

17:23

the spec right now and

17:25

they did try to make, there was a

17:28

pull request made to the Delta Lake repository

17:30

to kind of have that functionality but that

17:32

pull request never got merged in. So that

17:34

is a to-be-seen in the future to see

17:36

if we can eventually get that change made.

17:38

But I mean you know from like a

17:41

format like a Delta Lake or a Hudi, most of

17:43

the time the table is just a particular directory. So it

17:45

could just be as easy as just having a schema

17:48

that's just basically Hudi table, Delta Lake table, that

17:50

just points to a directory and then it could

17:52

catalog those as well. It doesn't

17:54

now but it wouldn't

17:57

be hard to do because it has a very... Again,

18:00

it's very flexible, it's just capturing. This

18:02

is the type of metadata that this little

18:04

object tracks, and then making sure that you have

18:06

a metadata object

18:09

attached to that that matches the schema for

18:11

that type. So Iceberg has a particular

18:13

set of information that you would keep with it. But

18:15

the way you interact with the catalog is through a REST

18:18

API. Now, so you could

18:20

always custom make these

18:22

API calls, but there is a client in

18:24

Java and then Python to

18:26

directly interact with Nessie on top of the integrations that

18:28

are already used with a bunch of tools. But

18:31

basically, there is a standard specification;

18:33

there is

18:35

the open API spec on

18:37

the Nessie documentation to

18:39

cover the endpoints. I

18:41

definitely spent a few days exploring that quite in

18:43

depth because I made like an unofficial client just

18:46

to kind of get more acquainted with it. And

18:48

that was a fun adventure, but

18:50

it's a pretty straightforward API.
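
[Editor's note: a minimal sketch of talking to that REST API directly. The default port and the v2 trees endpoint follow the Nessie docs at the time, but treat the exact paths and response shapes as assumptions; the OpenAPI spec Alex mentions is the authoritative reference.]

    import requests

    # Assumed local Nessie server on its default port; the API version may
    # differ by release (older servers expose /api/v1).
    BASE = "http://localhost:19120/api/v2"

    # List every reference (branches and tags) with the commit it points to.
    refs = requests.get(f"{BASE}/trees").json()
    for ref in refs.get("references", []):
        print(ref["type"], ref["name"], ref.get("hash"))

    # Creating, merging, and reassigning references are similar POST/PUT
    # calls; see the OpenAPI spec in the Nessie documentation for the shapes.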

18:55

Are you sick and tired of sales data conferences? You

18:58

know, the ones run by large tech companies and

19:00

cloud vendors? Well, so am I. And

19:02

that's why I started Data Council, the

19:05

best vendor neutral, no BS data conference

19:07

around. I'm Pete Soderling,

19:09

and I'd like to personally invite you to Austin

19:11

this March 26th to 28th, where

19:14

I'll play host to hundreds of attendees, 100

19:16

plus top speakers, and dozens of hot startups

19:19

on the cutting edge of data science, engineering

19:21

and AI. The community

19:23

that attends Data Council are some of

19:25

the smartest founders, data scientists, lead engineers,

19:28

CTOs, heads of data, investors and community

19:30

organizers who are all working together to

19:32

build the future of data and AI.

19:36

And as a listener to the Data Engineering Podcast,

19:38

you can join us. Get

19:40

a special discount off tickets by using

19:42

the promo code DEPOD20. That's

19:45

D-E-P-O-D-2-0. I

19:48

guarantee that you'll be inspired by the folks at the

19:50

event, and I can't wait to see you there. Another

19:55

interesting aspect of this project, going back

19:57

to its nature as a catalog, is

19:59

that... the overall space of

20:01

data catalogs for data lake environments

20:04

has largely been a pretty static

20:06

target of you have the Hive

20:08

catalog, or you have the Hive

20:10

catalog, maybe in the form of AWS Glue, which is

20:13

just actually still just the Hive catalog. And

20:15

I'm curious in the work of

20:17

building and evolving Nessie, using it

20:20

as an alternative catalog to that

20:22

Hive ecosystem. Some of

20:24

the ways that you have been constrained

20:27

from innovating a lot in terms

20:29

of what the catalog can offer and how to

20:31

operate with it. And some of the ways that

20:33

you're able to try to move

20:35

the entire ecosystem along a bit to

20:38

understanding some of the new ways that

20:40

the catalog can and should be thought

20:42

of in this data lake

20:44

house ecosystem and maybe some of

20:47

the arbitrary limitations that the Hive

20:49

catalog API has imposed upon us

20:51

until now. Yeah, I mean,

20:53

I think a lot of like a lot of the

20:56

solutions to that particular problem were more

20:59

solved on, like, the table format side. So

21:01

essentially, like iceberg really kind of broke away from

21:03

sort of like the constraints of having to have

21:05

Hive where you have to kind of have folders

21:07

and sub folders that define your table. And

21:10

then Nessie is able to leverage that by being able

21:12

to just refer to that table metadata and just focusing

21:14

on capturing the versions of that. So basically, it

21:17

almost takes a whole different paradigm of

21:19

what the catalog does: instead of it

21:21

being the bearer of the metadata, it's

21:23

instead sort of the gatekeeper of

21:26

where the metadata is. So basically, where Hive, you

21:29

have the Hive metastore that kind of acts as

21:31

both your catalog and metastore, Nessie

21:33

basically acts as the catalog, and Iceberg

21:35

will then be sort of really where

21:37

the metadata is stored on your S3

21:39

in those manifest and manifest lists. And

21:42

in that case, you can much easier

21:44

incorporate future formats and new paradigms into the

21:46

catalog. So I don't think

21:48

it's really been constrained. It's just a matter of

21:51

like people choosing to adopt Nessie. That's

21:53

become a lot easier in recent times, particularly just because

21:55

it's like, now that it's integrated into Dremio, a lot of

21:57

people are just using it because it's there. Once

22:00

you have a Dremio Lakehouse, it just kind of

22:02

is there. So it's just, it's there. So

22:04

why not use it? And then you don't have

22:06

to stand up the servers, you don't have to maintain it. So

22:09

it makes the whole process a lot easier. But there's still also

22:11

a lot of people who just deploy Nessie on their own and

22:13

are just using it that way. Because they prefer to have

22:16

that service that they manage on their own. They want to

22:18

use a different backing store. They just want to have control

22:20

over it. So we have seen a

22:22

lot of adoption on that site too. Especially

22:25

over like, again, this last year has definitely been a

22:28

big year for growing adoption of Nessie. For

22:31

that integration process or running it yourself,

22:33

what are some of the steps involved

22:35

in actually getting it deployed, getting

22:37

it integrated into a data stack

22:40

and maybe some of the

22:42

complexities that people should be planning

22:44

for, especially if they have an

22:46

existing catalog that they want to migrate away from? I

22:49

guess the first step as far as deployment goes, I

22:51

mean, if you just want to try it out, there's a Docker

22:53

container and that's pretty straightforward to use. If you want to deploy

22:55

it for production, there is a Helm chart. So you can deploy

22:57

that pretty easily using the Kubernetes Helm chart. And

23:00

then very soon, there'll

23:02

be an iteration of the Dremio Helm chart that also

23:04

should incorporate a lot of those details. So that way

23:06

you can simultaneously deploy them

23:09

easily. But once you actually have

23:11

it deployed, as far as migration goes, it just depends

23:13

on sort of what your use case is. So basically

23:16

the assumption would be, hey, you're probably

23:18

using Apache Iceberg or going to Apache Iceberg.

23:21

So if you're already using Apache Iceberg before

23:23

you adopt Nessie, the question

23:25

then becomes: what is your prior existing

23:27

catalog? So regardless of which catalog

23:29

it is, what happened, actually, as part of the Nessie project, is

23:31

they came out with a CLI tool for catalog migration,

23:34

which is not just for Nissi, but for any Iceberg

23:36

catalog. So you could literally, you would just

23:38

put in the credentials for the source catalog, and

23:40

then you put in the credentials for the destination catalog

23:42

and what it does, it'll move all the references over.

23:45

So then that catalog will have all

23:47

the metadata references basically

23:50

in one fell swoop. The only challenge there

23:52

always becomes, well,

23:54

not really a challenge there. That should work

23:56

fine. It's always issues around access, because the

23:58

query engine has to

24:01

have access to the catalog and then separately have access

24:03

to the storage where the actual metadata is stored. So

24:06

where an accident can happen is that, you

24:08

know, you decide you are

24:11

using an engine that doesn't read, you know, your

24:13

files are in Hadoop, you just do a blanket

24:15

migration of catalogs, but now you're using a tool

24:17

that can't read Hadoop file storage, so now you

24:19

still can't read those tables; you can only read the

24:21

catalog. So you definitely have to kind

24:23

of keep in mind that you always have to think about,

24:25

hey, does the tool have access to the catalog and

24:27

the storage? As long as you keep those two in check,

24:30

usually you shouldn't really run into any problems because essentially

24:32

the query engine's path is: first check with the catalog, then

24:34

check the storage. As long as it can do

24:36

both, you're gonna be able to read those tables just

24:39

fine, assuming it adopts the iceberg

24:41

spec. Now, if you're not using iceberg, then you're probably

24:43

not using this, so it's less of a consideration there.
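
[Editor's note: for a feel of what that migration tool is automating, here is a one-table sketch using Iceberg's register_table Spark procedure against a Nessie-configured session; catalog, table, and path names are illustrative.]

    # Registers an existing Iceberg table in the target catalog by pointing it
    # at the table's current metadata.json; no data files move or get rewritten.
    spark.sql("""
        CALL nessie.system.register_table(
            table => 'sales.orders',
            metadata_file => 's3://my-bucket/warehouse/sales/orders/metadata/00042.metadata.json'
        )
    """)
    # The catalog migration CLI does essentially this for every table it finds
    # in the source catalog, in one pass.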

24:46

In terms of iceberg itself, that also

24:48

provides a moving target because it's a

24:50

very active project, a lot of different

24:53

engines are adopting it, it has been

24:55

growing in terms of its overall capabilities

24:57

and usage, and I'm curious how that

24:59

has influenced the direction and

25:02

development of Nessie and some of

25:04

the ways that Nessie has been

25:06

able to capitalize on the newer

25:08

features in iceberg. Basically, Nessie

25:10

just operates as a way to discover the tables, in

25:13

that case, it's independent of what's in the metadata. All

25:15

it cares about is the location, right now it only

25:17

cares about the location of that metadata.json, so

25:20

what's inside the metadata.json, what's inside the other

25:22

metadata files, so as we start adding things like the

25:24

delete files, the Puffin files, whatnot, to

25:26

the iceberg specification, and in

25:28

the future, other files, I think, there's also some other

25:30

things that are sort of in discussion right now, all

25:33

of that would not affect the way Nessie operates,

25:36

since basically it's only versioning

25:38

the references and not versioning the actual metadata

25:41

itself right now. Again, in the future, it'll

25:43

probably start holding more of the metadata so

25:45

that way it can do those more sophisticated

25:47

merges and be more context aware of the

25:49

tables, but the kind of data that

25:51

it's probably going to need to track to do that is

25:53

probably not the kind of stuff that's changing right now, because

25:55

I mean we're talking about, like, okay, what are the files

25:57

that got added, what are the files that were subtracted? It

26:00

doesn't necessarily have to track every single thing that

26:02

the same iceberg metadata does, just what it needs

26:04

to be aware of to avoid

26:06

tripping up when merging. And

26:09

then another responsibility that can

26:11

often get pushed into the catalog

26:13

layer is the question of

26:16

access control or permissioning. And I'm curious

26:18

how Nessie handles that aspect of the

26:20

problem space. Yeah, there's two ways you

26:22

can handle that right now. Essentially,

26:27

you can have different users that are

26:29

accessing the Nessie catalog. And essentially, the

26:31

access controls are applied to the user.

26:33

So basically, if I access the catalog

26:35

with a particular token, well,

26:37

basically, it'll be aware of, hey,

26:40

this person using this particular access token

26:42

can only access these branches, these

26:45

objects, these kind of things. So you can do that manually

26:47

with NSE. And there's ways of

26:49

configuring a lot of that. That still probably

26:52

requires a lot of manual configuration. When

26:55

you're using Nessie as it's

26:57

integrated into Dremio, then it falls

26:59

into Dremio's more point and click

27:02

type of authorization, where you can

27:04

basically have role-based access controls, row-based

27:06

access controls, column-based access controls at

27:09

the query engine layer. So

27:11

basically, it'll leverage some of Nessie's

27:14

branch-level controls and then also leverage Dremio's

27:17

query engine-level controls when

27:19

you give different users tokens from

27:21

different tools. Since

27:23

Nessie is part of a given

27:26

data stack, the versioning

27:28

and branching and merging capabilities are

27:30

part of the core primitives of

27:32

the system. How have you seen

27:35

that influence the overall

27:37

workflow and design approach that teams

27:40

take as far as the development,

27:42

deployment, evolution of their data processing

27:44

and data delivery flows? Actually,

27:47

it was pretty simple, because oftentimes, again,

27:50

the pattern I've seen most was the pattern I mentioned earlier where

27:52

people just do a daily branch. So basically,

27:54

all you do is you just tweak all your jobs to

27:56

just hit that particular branch. Say,

28:00

daily branch, or whatever you want to call the branch,

28:02

and then it may be a timestamp or a date.

28:05

Basically, it's pretty easy to programmatically set up your

28:07

pipeline to always make sure that they're targeting the

28:09

right branch name. So then you can just

28:11

kind of run the branch. It'll always hit that day's branch. And

28:14

then basically, everything becomes very turnkey. But

28:17

midday, you're not going to see your

28:19

production data getting tainted because it goes

28:21

through that daily process. And then systems

28:23

that need to access that data real

28:25

time, they have access to

28:27

that branch. So they'll query that branch directly without

28:30

the same sort of guarantees you would get

28:32

with the production branch, and that'd be sort

28:34

of clearly communicated. But that's usually it because

28:36

once you have it, basically, the actual creation

28:38

and merging of branches is pretty straightforward enough

28:41

to do in SQL, and then

28:43

automating that SQL with whatever, whether

28:46

it's Spark, Flink, or Dremio, is

28:48

pretty easy. So what they have

28:50

is basically just kind of deciding

28:52

what is the frequency of their branching

28:54

patterns. Do they want to do hourly branches, daily

28:56

branches, weekly branches, and what their merge

28:59

cadence is going to be. But once you kind of figure

29:01

that out, once it's implemented, you don't

29:03

really think about it anymore. It just kind of works.

29:06

Another common target for operating with

29:08

this data is something like a

29:10

DBT. And you mentioned

29:12

the zero copy clones, effectively, of

29:14

being able to create per-user branches.

29:17

I'm curious how you've seen folks

29:19

incorporate Nessie's versioning and

29:21

branching capabilities into the

29:23

development workflow of DBT

29:25

users and data analysts.

29:28

I've seen it with DBT users because Dremio does

29:30

work with... Well, again, any tool works with DBT.

29:33

But basically, in the SQL, you

29:35

can specify the branch in your query. So you

29:37

can just sit there. So I've seen it personally.

29:39

I've seen it with Dremio. And then basically, you

29:41

can just sit there and just add AT BRANCH

29:43

at the end of each of your queries, and

29:46

then you get all the benefits of DBT and all the

29:48

orchestration and using Git version control

29:50

on your DBT models. But then you

29:52

also get this other layer of versioning at the catalog

29:54

level, so you get to leverage both and get the

29:57

benefits of both.
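
[Editor's note: a sketch of that per-developer workflow. In Dremio the branch is named inline in the model's SQL, for example "SELECT * FROM orders AT BRANCH alex_dev"; in Spark the equivalent is switching the session's reference, shown here with illustrative names against the Nessie-configured session from the earlier sketch.]

    # Each analyst works on their own zero-copy branch of the catalog.
    spark.sql("CREATE BRANCH IF NOT EXISTS alex_dev IN nessie FROM main")
    spark.sql("USE REFERENCE alex_dev IN nessie")

    # Model runs now read and write alex_dev's view of every table,
    # while production queries against main are unaffected.
    spark.sql("SELECT COUNT(*) FROM nessie.sales.orders").show()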

29:59

In your experience of working with Nessie, exploring

30:01

its ecosystem, diving deep into the iceberg

30:03

table format and the ways that the

30:06

two interoperate. What are some of the

30:08

most interesting or innovative or unexpected ways

30:10

that you've seen the Nessie project applied?

30:13

There was one, I'm trying to remember what

30:15

the exact details were, but I've seen some

30:18

interesting applications of just creating a branch

30:20

for like, just

30:22

to kind of create wildly different versions of the data.

30:24

Like actually, you

30:27

know, one example with you

30:30

still using that data pattern I mentioned before, but

30:33

also what they'll do is they'll create experimental

30:35

branches, because these are like, generally like large

30:37

financial institutions who we've seen this pattern with.

30:40

And what they'll do is they'll create a branch that

30:42

they use for doing, like, stress testing and

30:45

whatnot. Because what they can do is that

30:48

they can create a safe copy of their production

30:50

data for that day to then make

30:52

the changes to that data that they don't want

30:54

to permanently make to then run all their stress

30:56

testing calculations on. And then they can

30:58

just throw away the branch at the end

31:00

of the day without having to really worry about rolling it

31:03

back or undoing the data. So they'll make

31:05

branches at the beginning of the day, that's for like stress testing, add

31:08

in the, hey, bad scenario here,

31:10

worst case scenario there, and

31:12

then run their tests, and then they can dispose of it

31:15

every day. In your experience of

31:17

exploring this space, keeping up to date

31:19

with the use cases, the

31:21

technologies behind it, what are some of the

31:23

most interesting or unexpected or challenging lessons that

31:25

you've learned? I mean, oftentimes

31:27

I think where it's

31:30

always gonna be sort of like the great thing about the

31:32

lake house is that everything's very modular. So you can kind of

31:34

swap out the different pieces you want, but

31:37

there's still like little gotchas,

31:39

particularly in sort of like,

31:41

as I mentioned earlier, when you're working with any

31:43

catalog in the iceberg space, there's

31:45

sort of two layers. So you have to make sure that

31:47

you have the authentication to access the catalog and you have

31:49

the authentication to access the storage and different tools have different

31:52

stories when it comes to both of those layers. And that's

31:54

oftentimes where a lot of gotchas kind of come in. So

31:57

I always just say, hey, doing the legwork to make

31:59

sure that... When you're working with a catalog,

32:02

making sure that the tools you use can read the

32:04

catalog and then also access the storage. Because I can

32:07

definitely find people in the boat where they're working with

32:09

something that they like, but then they move

32:11

to X tool and then now,

32:14

they were using let's say X

32:17

object storage. Now that particular

32:19

storage layer isn't readable by that tool,

32:21

so it interferes with their plan, even

32:23

though it could interact with Nessie or some other catalog. And

32:26

for people who are interested in these

32:28

versioning capabilities, what are the cases where

32:31

Nessie is the wrong choice and

32:33

maybe you're better served by just using

32:35

an AWS glue or maybe just not

32:37

even using iceberg at all? Yeah, I

32:40

mean, well, basically, if they're using Iceberg, I think

32:42

Nessie is a good option. Now, the reasons

32:44

you would not use, you might choose a

32:46

glue, but oftentimes, because you're really inside the

32:48

AWS ecosystem. So if you're connecting to Athena,

32:51

you're connecting to Redshift, you're connecting to all these

32:53

tools, then, you know, AWS glue is

32:55

going to be a very easy sell because

32:57

it's going to have interactivity. But if

32:59

you're operating multi cloud or in a completely

33:01

different cloud, that's, you know, that's

33:03

not necessarily going to be the

33:06

same, same saliency. But if

33:08

you're not even using iceberg at all, if you're

33:10

using Delta Lake or Hudi, then oftentimes, like, different

33:12

solutions might work better. Like generally, there, the

33:14

only option would be, like, LakeFS, since you

33:16

only have file level versioning available

33:18

at the moment. Which again, another, I always

33:20

like think of it like another feather in the cap for iceberg.

33:22

Not only does it have the rich ecosystem

33:25

of things that can write to it, read to

33:27

it, manage tables, but

33:29

you also have rich options for how

33:31

you can version control your tables, be it

33:33

file versioning, table level versioning, or catalog level versioning.

33:35

Iceberg really gives you a lot of options

33:37

to kind of really architect the lakehouse you

33:39

need. Have you ever seen where

33:42

people are using both LakeFS and Nessie

33:44

in tandem? I don't think I've seen

33:46

it yet. I've seen one or the other. Theoretically,

33:48

they can work together. I mean, it could

33:51

be like, basically, one of the issues that

33:53

LakeFS has with Iceberg in particular is that

33:55

iceberg really depends on absolute path. And

33:58

LakeFS depends on relative paths. So LakeFS

34:00

had to create their own custom catalog. Or

34:02

the problem with the custom catalog though, is engine support. So

34:05

it works with, say, Spark or Flink, but then you get

34:07

to many other engines, you have

34:10

trouble connecting that catalog. So I could see a world where

34:12

basically someone is working with multiple formats, they may be working

34:14

with a Delta Lake and an iceberg, and they

34:16

might want to use Nessie for iceberg, but they want to use

34:18

LakeFS for Delta Lake. And I can

34:20

see that. And I mean, I can

34:22

see different situations where you're working with data that's outside of

34:24

a table, that you're going to want to roll back, whether

34:27

it's like, you know, a group of CSV files, like

34:29

LakeFS would be helpful. But again, when

34:32

it comes to your main Lake house

34:34

catalog, you might prefer Nessie to

34:36

provide those kind of semantics. So I can see a

34:38

world where all three levels have

34:40

benefits, because even at the table level with

34:43

iceberg, a nice thing about being able to

34:45

tag tables in iceberg, is

34:48

that it prevents them from being cleaned up when you

34:50

do cleanup operations. So if I tag like an end

34:52

of month snapshot, then when I expire

34:54

snapshots, it won't clean up those tags

34:56

snapshots. I mean, the same story when

34:58

you tag commits in the catalog level. But again,

35:00

there's going to be different situations where you might

35:02

want each of these levers to be available to

35:04

you. So I haven't seen it too much

35:07

yet, because I just feel like I'm still seeing, I'm

35:09

just starting to see people start adopting these kind of patterns, at

35:12

least on the Lakehouse level. And then also like the

35:14

sort of get style delivery of them, but I'm starting

35:16

to see it more and more adopted, but it's still

35:18

sort of very early days.
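
[Editor's note: a sketch of the two tagging levels Alex contrasts, with illustrative names. The table-level DDL assumes Iceberg's branch/tag support in recent Spark integrations; the catalog-level statement assumes the Nessie SQL extensions.]

    # Table level: an Iceberg tag pins one snapshot of one table, so
    # expire-snapshots-style cleanup will not remove it.
    spark.sql("ALTER TABLE nessie.sales.orders CREATE TAG eom_2024_02")

    # Catalog level: a Nessie tag pins one commit across every table at once,
    # e.g. a consistent end-of-month view of the whole lakehouse.
    spark.sql("CREATE TAG eom_2024_02 IN nessie FROM main")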

35:21

For people who are interested in

35:23

Nessie and want to keep

35:25

abreast of its development, its

35:28

future direction, what are some of the things that are planned for

35:30

the near to medium term, or anything that

35:32

you're keeping an eye on, or you're excited

35:34

to see come to fruition? Oh,

35:36

I guess my wish list is

35:38

going to probably be, like,

35:41

PyIceberg support for Nessie. That's

35:43

definitely going to be, that's definitely high on my wish list.

35:46

I tried to make that contribution. I started

35:48

like writing some of

35:50

the pull requests, but I ended up

35:53

just not having the time that

35:55

I would have liked. So if anybody wants to

35:57

help contribute that, please, please go join over there.

36:00

There's a lot of great work to do there. And

36:02

there's a lot of really great devs working over there

36:04

on Nessie that you can communicate

36:07

directly with them on the Nessie Zulip. So Nessie

36:09

uses, instead of Slack, Zulip, which is

36:11

like the open source Slack. So

36:13

you can communicate there. So there you can like learn,

36:15

participate in the conversation about the

36:17

evolution of the format, I mean, not the format,

36:19

the catalog, and its future features.

36:21

But I would say like my

36:24

short-term wish list would be PyIceberg support. One

36:26

of the cool things that I keep hearing about long

36:29

run is, again, that sort of more

36:31

context awareness. So, and then

36:33

also, what would be really cool is that

36:35

eventually, you know, the pull requests can

36:38

get accepted over there at Databricks. So that way,

36:40

I mean, for Nessie to

36:44

be able to support Delta Lake, or something like

36:46

that to just offer more options. So that way it can

36:48

become, ideally, you know, one catalog that can hold

36:50

all the things. But it's a pretty cool

36:52

tool. And the patterns I'm seeing with it are pretty fun.

36:54

And then I think what's most unique about

36:56

it is just sort of, when you start doing the SQL for

36:58

anything, and how easy it is to do it. Yeah,

37:01

to me, that's when I was like, okay, this is nice.

37:03

This is just easy and simple

37:05

to use, and it

37:08

really does make a lot of new patterns, a lot

37:10

easier to execute. Are there

37:12

any other aspects of the Nessie

37:14

project, the overall kind of

37:16

use cases or capabilities of data versioning in the

37:19

Lakehouse that we didn't discuss yet that you'd like

37:21

to cover before we close out the show? I

37:23

guess a couple other use cases that

37:26

I think are implied, but just to make them

37:28

explicit, are like multi-table

37:30

transactions. And one

37:33

thing I think, like right now, they

37:35

have introduced like multi-table transactions at the

37:37

table level versioning or in the table

37:39

level in iceberg. But the way

37:41

it's done is you have to use a catalog

37:43

that's for a catalog and they have to kind

37:45

of implement this multi-table transactions. And

37:47

it's more like a traditional sort of begin

37:49

and transaction type style, where basically you

37:52

have to kind of do everything at one time. The nice

37:54

thing about the Git style that you get with, like, a

37:56

Nessie or a LakeFS, since they're both taking that sort

37:58

of Git approach, is that

38:00

I can create a branch, and

38:02

I can do multiple transactions. And none of

38:04

those transactions are published until I do a

38:06

merge. So I could be doing one transaction

38:08

on one table in Spark, another transaction on

38:11

another table in Flink, another transaction on another

38:13

table from Trino or Dremio. And

38:15

then, when all those transactions are done, all

38:17

those transactions can be published simultaneously to all

38:19

those tables through one merge. And that's sort

38:21

of a unique capability that just doesn't actually

38:23

currently exist in a data warehouse at all.

38:26

And so that's a really

38:28

neat thought process, because

38:31

I do think it opens up some new

38:34

ways that you think about how you do

38:36

those transactions across multiple tables and work with

38:38

multiple-table semantics.
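
[Editor's note: a sketch of that multi-table pattern, assuming the same Nessie-configured session as the earlier sketches; in practice the two writes below could come from different engines entirely.]

    # Stage several independent writes on a branch; none are visible on main yet.
    spark.sql("CREATE BRANCH IF NOT EXISTS month_end IN nessie FROM main")
    spark.sql("USE REFERENCE month_end IN nessie")
    spark.sql("INSERT INTO nessie.finance.ledger SELECT * FROM nessie.staging.ledger")
    spark.sql("UPDATE nessie.finance.balances SET closed = true WHERE period = '2024-02'")

    # One merge publishes both tables' changes to readers of main simultaneously.
    spark.sql("MERGE BRANCH month_end INTO main IN nessie")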

38:40

Well, for anybody who wants to get in touch with you and

38:42

follow along with the work that you're doing, I'll have

38:44

you add your preferred contact information to the show notes.

38:47

And as the final question, I'd like to get your

38:49

perspective on what you see as being the biggest gap

38:51

in the tooling or technology that's available for data management

38:53

today. Basically, my

38:55

opinion is going to be something that ties it

38:57

all together. And that's kind of what I

38:59

find working at Dremio really exciting, because we

39:02

do have a tool that's really trying to

39:04

tie things like Iceberg, Nessie, all these different

39:06

data sources, and tie it together in sort

39:08

of one cohesive platform, where it feels like

39:10

you're getting that modular system. But it

39:12

comes with the ease of use and nice

39:14

sort of flavor that you get with a

39:16

more integrated system like Snowflake,

39:18

where you get that ease of use

39:20

in a more deconstructed system on the

39:22

lakehouse. And I think that

39:24

has been the thing that people are really, really

39:26

looking for. And I do

39:29

feel like we are,

39:31

or on the verge of, really kind of providing

39:33

the solution to that. So if that's

39:35

a pain you're feeling, definitely come talk to me. All

39:38

right. Well, thank you very much for taking the

39:40

time today to join me and share

39:43

your perspective and your experiences working with

39:45

NSE and helping us understand the problems

39:47

that it solves and how to incorporate

39:49

it into a data lake environment. It's

39:51

definitely a very cool project. It's great

39:53

to see more investment

39:55

and evolution of this data

39:58

versioning capability in the

40:00

data processing ecosystem. So appreciate the time and energy

40:02

you're putting into that, and I hope you enjoy

40:04

the rest of your day. Thank you

40:06

very much. It was a pleasure. Thank

40:14

you for listening. Don't forget to check

40:16

out our other shows, Podcasts.init, which covers

40:18

the Python language, its community, and the

40:20

innovative ways it is being used, and

40:22

the Machine Learning Podcast, which helps you

40:25

go from idea to production with machine

40:27

learning. Visit the site at dataengineeringpodcast.com to

40:29

subscribe to the show, sign up for

40:31

the mailing list, and read the show

40:33

notes. And if you've learned something or tried

40:35

out a project from the show, then tell us about it. Email

40:38

host at dataengineeringpodcast.com with

40:41

your story. And to help other people find

40:43

the show, please leave a review on Apple
