Episode Transcript
0:11
Hello and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Data
0:16
lakes are notoriously complex. For
0:19
data engineers who battle to build
0:21
and scale high-quality data workflows on
0:23
the data lake, Starburst powers petabyte-scale
0:25
SQL analytics fast at a fraction
0:27
of the cost of traditional methods
0:29
so that you can meet all
0:31
of your data needs, ranging from
0:33
AI to data applications to complete
0:35
analytics. Trusted by teams of all
0:37
sizes, including Comcast and DoorDash, Starburst
0:39
is a data lake analytics platform
0:41
that delivers the adaptability and flexibility
0:43
a lakehouse ecosystem promises. And
0:46
Starburst does all of this on an
0:48
open architecture, with first-class support for Apache
0:50
Iceberg, Delta Lake, and Hudi, so you
0:53
always maintain ownership of your data. Want
0:56
to see Starburst in action? Go
0:58
to dataengineeringpodcast.com/Starburst and get $500
1:00
in credits to try Starburst
1:02
Galaxy today, the easiest and
1:04
fastest way to get started
1:06
using Trino. Dagster
1:08
offers a new approach to building
1:10
and running data platforms and data
1:13
pipelines. It is an open-source, cloud-native
1:15
orchestrator for the whole development lifecycle,
1:17
with integrated lineage and observability, a
1:19
declarative programming model, and best-in-class testability.
1:22
Your team can get up and running
1:25
in minutes thanks to Dagster Cloud, an
1:27
enterprise-class hosted solution that offers serverless and
1:29
hybrid deployments, enhanced security, and on-demand ephemeral
1:32
test deployments. Go to
1:34
dataengineeringpodcast.com/dagster today to get started, and
1:36
your first 30 days are free.
1:39
Your host is Tobias Macey, and today
1:42
I'm interviewing Alex Merced about Nessie, a
1:44
Git-like versioned catalog for data lakes using
1:46
Apache Iceberg. So Alex, can you
1:48
start by introducing yourself? Hey, everybody. My name is Alex Merced. I'm a developer advocate at Dremio, and Nessie is something I'm going to love to talk about today, but I'm all about the lakehouse, even so much as being one of the co-authors of Apache Iceberg: The Definitive Guide, an upcoming book from O'Reilly. And do you
2:11
remember how you first got started working in data? It's
2:13
a fun story. I have a very long,
2:15
not traditional way I kind of got here.
2:17
So the long and short of
2:20
it is basically, and then, you know, definitely I
2:22
have a longer version of the story in places,
2:24
but basically I did start off as a computer
2:27
science major, but then I got really into music
2:29
and kind of went into this completely different category
2:31
of studying things like culture and marketing, which somehow led
2:33
me into a career training people in finance. And
2:35
I ended up training people in finance for 10
2:38
years. So I spent a lot of time breaking
2:40
down really complex ideas and helping people kind of
2:42
understand them in a more accessible way. But
2:45
I then eventually ended up back in software and
2:47
came back as a software developer and did that
2:49
for a few years and also trained software developers.
2:52
But I was always a big fan of working
2:55
with databases. So like some of my
2:57
favorite projects were finding ways to optimize
3:00
the database, finding ways to offload workloads, moving business logic out of the wrong places when people put too much of that stuff in the client side of their websites. So
3:09
I started kind of gravitating more and more to the
3:11
database. And then I also started gravitating more towards like
3:13
the world of dev advocacy because I was always
3:15
naturally someone who would like to teach. I like
3:17
to create content. I like to break
3:20
down ideas. So I decided to
3:22
make the shift from software development into the dev advocacy world. And I ended up
3:26
finding a home in Dremio where I got to spend a
3:28
lot of time learning about this really cool exciting thing called
3:30
the data lake house. And it definitely makes me wake up really excited every day.
3:35
And now I get to help people understand that
3:37
and bring that understanding of not only what it is, but how to implement it, the technologies around it, and so forth. And
3:43
for the conversation today, we're focused on the
3:45
Nessie project. And I'm wondering if you can
3:47
describe a bit about what it is some
3:49
of the story behind it and where it
3:51
fits in that context of the data lake
3:53
house. Got it. Okay, so bottom line is: the Nessie project, at its core, is a catalog. So when it comes to the Apache Iceberg table format, there is a need for a mechanism to act as a catalog. So it tracks all the different tables, and primarily what it does is track a reference to what is the most current metadata.json file for each particular table. So at the core, that's what Nessie does.
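To make the pointer-tracking idea concrete, here's a toy Python model of a catalog that maps each table to its current metadata.json and records a catalog-level commit on every change; the class and method names are illustrative only, not Nessie's actual API, and it also sketches the whole-catalog rollback that comes up later in the conversation:

```python
# Toy model of a Nessie-style catalog: per-table pointers to the current
# metadata.json, plus a commit log so the whole catalog can be rolled
# back in one step. Illustrative only; not Nessie's real API or storage.

class ToyCatalog:
    def __init__(self):
        self.tables = {}   # table name -> current metadata.json location
        self.commits = []  # each commit snapshots the full pointer map

    def commit(self, table, metadata_json):
        """Point a table at a new metadata.json; record a catalog-level commit."""
        self.tables[table] = metadata_json
        self.commits.append(dict(self.tables))

    def rollback(self, commit_index):
        """Move the WHOLE catalog back to an earlier commit in one operation."""
        self.tables = dict(self.commits[commit_index])

catalog = ToyCatalog()
catalog.commit("orders", "s3://lake/orders/metadata/v1.metadata.json")
catalog.commit("customers", "s3://lake/customers/metadata/v1.metadata.json")
catalog.commit("orders", "s3://lake/orders/metadata/v2.metadata.json")  # bad ingest

# Roll back past the bad ingestion job: one step, every table restored.
catalog.rollback(1)
```

The point of the sketch is that rollback touches only a small pointer map, never the underlying data files.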
4:18
What Nessie provides is the additional ability to actually create commits, not at the individual table level, but at the catalog level. So every time one of those catalog references changes, Nessie issues a commit, which means it allows you to have the same sort of semantics as Git, as far as being able to do branching and tagging. And this kind of changes
4:37
the dynamics of sort of how you interact with
4:40
your catalog and how you plan sort of like
4:42
data ops type practices where you want to kind
4:44
of isolate developer environments or roll
4:46
back when it comes to disaster recovery. It changes a lot
4:48
of things and actually makes it oftentimes easier
4:50
and creates sort of new patterns when it comes to the
4:53
data lake house. You
4:55
mentioned the ability to do branching
4:57
and committing and merging and tagging.
4:59
And I'm wondering, in terms of
5:02
the context of data lake houses,
5:04
the overall data pipelining
5:07
and workflows, what are some
5:09
of the core problems and complexities that Nessie
5:11
is designed to solve for? I mean,
5:13
bottom line, like a couple of different situations
5:16
where Nessie becomes really useful is probably the
5:18
lowest hanging fruit is like data rollback. So
5:20
basically you have maybe a pipeline that fails
5:22
and now you have bad data
5:24
or partial or inconsistent data in, let's say, a handful or dozens of tables. Now, you technically can roll back those tables directly from the table format in Apache Iceberg, but you have to do each table one by one. By having a catalog-level abstraction, I can just roll back the catalog to the commit that was the last clean commit. And I can do that all in one fell swoop and move the whole catalog back to before that ingestion job. But
5:46
also what happens a lot of times is that people would
5:48
create duplicates of their data, like for a developer environment. And
5:50
then they would do all their work there and then have
5:52
to merge those environments, and it was harder to create these environments, and more costly because of the storage. But with versioning like Nessie, I can basically create that isolated branch environment without creating a single duplicate of my existing data. It just basically isolates the new snapshots going forward, so the only new data is really the data of those new transactions. In
6:16
terms of my experience of surveying the
6:18
overall data ecosystem, in particular, the data
6:20
lake and data lake house environments, the
6:22
closest thing that I've seen to Nessie
6:24
as far as this branching and merging
6:26
semantics, the ability to do that kind
6:28
of zero copy cloning, I guess, there
6:31
are two pieces to that. One is
6:33
the zero copy cloning and being able
6:36
to do very low cost developer environments,
6:38
copy on write semantics is with
6:40
Snowflake. I know that they have the
6:42
ability to do that, kind of snapshot
6:45
tables, create a copy of a table
6:47
using the same existing underlying data. But
6:49
from the lake perspective, the closest project
6:52
I've seen is LakeFS, which
6:54
has that same idea of Git
6:57
semantics, but at the S3 abstraction
6:59
layer. And I'm wondering if
7:01
you can talk to some of the overlap
7:03
and some of the divergence between Nessie and
7:06
LakeFS and when you might decide to use
7:08
one versus the other. Oh,
7:10
yes, actually, I find the difference is quite interesting. And the
7:13
funny thing is I think they were both sort of kind
7:15
of coming into existence around the same time. I
7:17
recently saw a talk where they talked about sort of the evolution
7:20
of LakeFS. And I remember seeing a talk about the evolution of
7:22
Nessie. And those initial questions were the
7:24
same, and both of them started basically asking questions,
7:26
can we just use Git and realizing, okay, like
7:28
the type of throughput, the type of, the
7:31
amount of changing that happens in data is
7:34
not really built for that. So basically you
7:36
have to kind of find some other abstraction. So
7:38
LakeFS went the approach where you basically capture sort of deltas in the actual files. So you say, okay, add this file, subtract this file. While Nessie takes the approach of just capturing sort of that metadata change. So
7:51
a couple ways to kind of think about
7:53
it is imagine I updated an iceberg table
7:55
with the insert. That might create a thousand
7:57
new files. So in the case of the LakeFS commit, it's not aware of the table. It's not aware of where the table exists; it just sees that there are a thousand new files in my file system and then captures a commit that says, okay, hey, these thousand files have been added. While with Nessie, the only thing that changes is one thing: the metadata. There's a new metadata.json file. So instead of tracking a thousand new things being added, it's just that this table's snapshot has changed from pointing here to there. So it's a much more
8:25
sort of lightweight change that can handle sort
8:27
of very high velocity throughput as
8:29
far as if you're making a lot of changes over time, because
8:32
you're not tracking as many different items. But also, a couple of other differences are that it's sort of more table-aware, because it is at the catalog level, which allows you to move all of those Git-like semantics into SQL. So I can create a branch
8:46
using SQL, I can merge a branch using SQL,
8:48
I can create a tag. While with LakeFS, it's mostly done through the file path. So basically what it does is take advantage of object storage and say, okay, hey, there's gonna be this dynamic part of the file path that represents what branch you're on. And then oftentimes, when you create these branches, all of the work has to be done with a CLI. So while
9:07
probably like for a lot less technical
9:09
users, SQL can be a much more
9:11
accessible approach to doing a
9:13
lot of these things. And the CLI
9:16
tool might be maybe a little less accessible.
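The granularity difference Alex describes a moment earlier, a thousand file-level entries versus one metadata-pointer swap, can be sketched in plain Python. This is purely illustrative; neither tool actually stores its commits this way:

```python
# Illustrative contrast: what one INSERT that writes 1,000 new data files
# looks like to a file-level versioning system (LakeFS-style) versus a
# catalog-level one (Nessie-style). Paths and structures are made up.

new_files = [f"s3://lake/orders/data/part-{i:05d}.parquet" for i in range(1000)]

# File-level commit: one entry per added or removed object.
file_level_commit = {"added": new_files, "removed": []}

# Catalog-level commit: one pointer swap for the one table that changed.
pointer_level_commit = {
    "orders": {
        "from": "s3://lake/orders/metadata/v1.metadata.json",
        "to": "s3://lake/orders/metadata/v2.metadata.json",
    }
}
```

Same logical change, but the file-level commit carries a thousand tracked items while the catalog-level commit carries one, which is why the pointer approach stays lightweight under high-velocity change.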
9:18
So there's also some ergonomic differences, I would
9:20
say. Zeroing in
9:22
on that catalog element, we've mentioned a
9:24
few times that Nessie is a catalog
9:27
and it corresponds to various pointers into
9:29
the iceberg table format. And I'm wondering
9:31
if we can dig a bit more
9:33
into the context of what purpose does
9:36
the catalog serve in that data lake,
9:38
data lake house environment and what are
9:40
some of the alternatives or
9:42
what are some of the pieces that
9:44
Nessie might replace if somebody already has
9:46
an existing lake house environment? So
9:50
a couple of things first, like so
9:52
right now Nessie primarily works with iceberg. The cool
9:54
thing about Nessie architecture is that it just tracks
9:56
sort of these like little metadata objects. So basically
9:58
it's really just an object. It has
10:00
like a data type and right now the main data
10:02
types you see are iceberg tables, iceberg views. Theoretically,
10:05
other table formats could come into the picture
10:07
pretty easily, but basically tracks
10:09
that metadata. Now the thing
10:11
is that the way the iceberg spec works
10:13
is that generally the catalog, that catalog reference
10:16
is sort of like your source
10:18
of truth when it comes to the current state of the table. So
10:21
the problem is you generally don't want your iceberg references
10:23
in more than one catalog. So
10:25
this is where basically, hey, if I choose
10:28
Nessie as my catalog, then that precludes me
10:30
from using another catalog like AWS Glue or Tabular or something like that. So
10:34
oftentimes when you are adopting an Apache iceberg
10:36
lakehouse, you do have to take a look
10:38
at sort of like, what are the
10:40
tools you're using? And what are
10:42
the different features of the different catalogs? Most of them are generally going to provide you the main service of, basically: hey, I can identify my tables, and I can take this catalog to Spark, and Spark sees all my tables. I take this catalog to Flink, it sees all my tables. I take it to Dremio, it sees all my tables. But
10:57
not every catalog works with every tool currently.
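To make "works with" concrete: wiring an engine like Spark to a Nessie catalog typically comes down to a handful of Iceberg catalog properties. The property names, port, and SQL syntax below are a best-recollection sketch and should be verified against the current Iceberg and Nessie documentation for your versions:

```python
# Sketch of Spark-to-Nessie wiring via Iceberg catalog properties.
# Treat every key, value, and SQL statement here as an assumption to
# check against the docs; the shape is the point: the engine needs the
# catalog endpoint AND direct access to the underlying storage.

spark_conf = {
    # Register an Iceberg catalog named "nessie" in Spark...
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    # ...backed by the Nessie catalog implementation...
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    # ...talking to the Nessie REST service...
    "spark.sql.catalog.nessie.uri": "http://nessie:19120/api/v2",
    # ...on a branch (Nessie's default branch is "main")...
    "spark.sql.catalog.nessie.ref": "main",
    # ...with a warehouse the engine must also be able to read directly.
    "spark.sql.catalog.nessie.warehouse": "s3://lake/warehouse",
}

# Once configured, the Git-like operations surface as plain SQL
# (per Nessie's Spark SQL extensions; again, double-check the syntax):
sql_examples = [
    "CREATE BRANCH IF NOT EXISTS etl_daily IN nessie",
    "MERGE BRANCH etl_daily INTO main IN nessie",
]
```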
10:59
I think that story has gotten a lot
11:01
better. So most catalogs are workable in most
11:03
places nowadays, but that is essentially sort
11:05
of one of the big sort of cost
11:08
benefit calculations you have to make when selecting a
11:10
catalog. And when it comes particularly to Nessie, it works with pretty much all the typical open source tools. So it works with Trino, it works with Presto, it works with Dremio, it works with Apache Spark, it works with Apache Flink. So you get that branching and merging across all these tools. So if your workflows incorporate these tools, you can then add that branching, merging, and tagging to them. And
11:31
now digging into the
11:33
versioning capabilities specifically, you mentioned
11:36
that at a
11:38
high level what Nessie does is it keeps
11:40
a reference to all
11:43
of the table metadata
11:45
pointers so that within each set of
11:47
transactions or each commit, you can say
11:50
I am pointing at this set of
11:52
metadata for all of these tables. And
11:55
so you can have commit and rollback functionality across tables, across transaction levels. And in terms
12:02
of the actual versioning of the data,
12:04
I know that Iceberg has built-in support
12:06
for being able to do optimistic concurrency control and being able to
12:10
keep snapshots to different points in time
12:12
of data based on the underlying
12:15
files and the changes there. I
12:17
also know that it requires a certain amount
12:19
of maintenance to keep the tables kind of
12:21
happy and performant as far as doing things
12:24
like vacuuming and pruning old references
12:26
and old versions there. I'm curious if you can
12:28
talk to some of the ways that Nessie handles
12:30
the interoperability with the
12:32
versioning in Iceberg as well
12:35
as any of the maintenance
12:37
pieces that it can help with as
12:40
far as pruning old versions, running table
12:42
compactions, etc. Yes.
12:44
Okay. So basically the architecture of Nessie is
12:46
that mainly it's basically going to be a
12:48
running service that you would run. You could
12:51
also get it as part of... It's
12:54
actually integrated into Dremio. It's an integrated catalog.
12:56
But essentially it interacts through a REST
12:59
API. And when it comes
13:01
to the versioning aspects, right now if I
13:03
were to capture a commit, basically
13:05
it creates a sort of like JSON-like entry in
13:07
the backing store. So it could be like a
13:09
RocksDB, a Postgres, whatever you choose as
13:11
your backing store. That'll say basically
13:13
I have a timestamp for that commit, sort of
13:15
like the parent commit to that. So that way
13:17
it knows what the sort of the tree looks
13:20
like. And then just a couple of other metadata
13:22
pieces. So right now it's like a very small metadata footprint. And right now, generally the best practice is oftentimes one branch at a time. And there's actually, I'll give you a couple examples of people who are actually doing that in production that way. But when
13:34
it comes to the maintenance side, this is where it gets a little
13:36
bit tricky. Because typically when it comes to Iceberg, when
13:38
you're doing like expire snapshots or something like that, the assumption is essentially that the table's metadata is aware of all of its own snapshots. But with Nessie, you might have different branches where there are different versions of the metadata.json that have references to different snapshots. So when I expire a snapshot, how do I know which files I can safely delete? So what
14:03
Nessie did is they created their own tool called the GC cleaner, which
14:06
does that kind of garbage cleanup. So it'll actually take a
14:08
look at the metadata JSON at the head
14:10
of each sort of branch and be able to
14:12
kind of safely identify, hey, which files are able
14:14
to be deleted. So when you run
14:17
the vacuum command, either when you run the GC cleaner independently or, if you're using Dremio, when you use the VACUUM command, it'll use that tool to then safely make sure it deletes the right data files without affecting other branches. Now,
14:31
as far as the versioning pieces,
14:34
anybody who's used Git for any length of
14:36
time has dealt with the dreaded merge conflict.
14:38
And when you're dealing with numerous tables, potentially
14:40
dozens or hundreds, the last thing that you
14:43
want to think about is how do I
14:45
deal with a merge conflict? If I'm creating
14:47
a branch and then I need to merge
14:49
it back after somebody else has created their
14:51
own branch and merged it ahead of mine.
14:53
And I'm curious if you can talk to
14:55
some of the ways that those versioning changes,
14:57
branching and merging are
14:59
kind of sanitized so that we don't have to
15:01
deal with these big complex messy merges
15:04
in the event that underlying data has
15:06
changed in a manner that is incompatible
15:08
across branches. Yeah, I mean, right
15:10
now it's pretty shallow. So it's just tracking basically
15:13
that metadata reference and essentially a timestamp and a
15:15
parent. So right now you can get a merge
15:17
conflict pretty easily if you're starting like several branches
15:19
at the same time. Typically the pattern
15:21
we've been seeing is that what people will do is they'll
15:23
start a branch at the beginning of the day. So what
15:25
they'll do is they'll create a branch for that day. And
15:28
then they'll do all their ingestion for that day on that
15:30
branch, run some validating logic at the end of the day
15:32
and then basically merge that branch at the end. So instead of
15:34
creating like lots of branches, at
15:38
least for ingestion purposes, usually you wanna stick to sort of
15:40
like one branch per catalog. And
15:42
then that, or you could have new branch for each use case.
15:45
So basically I'll create a branch for today. We
15:47
validate at the end of the day. And then basically at the
15:49
end of the day, you're always merging that validated data back into
15:51
production. And then other uses, if
15:53
you're gonna do more branches, usually other use cases
15:55
would be like, okay, I'm just creating a branch
15:57
just for experimentation purposes. Or, I mean, creating a branch to isolate some particular changes that I don't plan to merge back in, but I want to kind of keep separated. But generally, as far as merging in, right now you probably would prefer to keep it simple: make a branch, merge it in. Part
16:12
of what's evolving in the project is kind
16:14
of adding more metadata to what
16:16
the catalog tracks so that way later on
16:18
you can have more sophisticated sort of merge
16:20
resolution. So right now, best practice would be sort
16:22
of like make an ingestion,
16:24
have like a branch that is your ingestion branch and
16:27
keep it that way and then merge it and then
16:29
create another branch for the next ingestion job after that
16:31
ingestion job is complete. Digging more
16:33
into Nessie specifically, you mentioned a little bit
16:35
about some of the specifics of running and
16:37
I'm wondering if you can talk to the
16:40
overall architecture and design of the Nessie project
16:42
and some of the ways that it has
16:44
changed and evolved in scope and purpose from
16:46
when it was first started. Yeah, I mean, I think when it first started it was, and I think it still is, in this regard of being sort of a lakehouse catalog. So while it mainly works
16:58
with Apache Iceberg, it has the
17:00
architecture so that it can expand in a
17:02
sense, because basically what happens is that there are all these different types of things that it can track, and then it's essentially just deciding on an agreed schema of built-in types. So
17:13
right now like the types are like namespace so if
17:15
you're creating a subfolder or like a database however you
17:17
want to think of these namespaces, there's
17:19
iceberg views, iceberg tables. There's also Delta
17:21
Lake tables that are actually part of
17:23
the spec right now and
17:25
they did try to make, there was a
17:28
pull request made to the Delta Lake repository
17:30
to kind of have that functionality but that
17:32
pull request never got merged in. So that
17:34
is a to-be-seen in the future to see
17:36
if we can eventually get that change made.
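The typed-object design described here, where the catalog is a map of keys to content objects and each content type carries its own small metadata schema, can be sketched with a few dataclasses. The type names and fields are illustrative, not Nessie's actual spec:

```python
# Toy sketch of a catalog of typed content objects. Each entry declares
# its type, and each type has its own minimal schema. Names and fields
# here are made up for illustration.

from dataclasses import dataclass

@dataclass
class Namespace:
    name: str                 # logical grouping, like a database or schema

@dataclass
class IcebergTable:
    metadata_location: str    # pointer to the current metadata.json

@dataclass
class DeltaLakeTable:
    directory: str            # for Delta or Hudi, the table is a directory

catalog_contents = {
    "sales": Namespace("sales"),
    "sales.orders": IcebergTable("s3://lake/orders/metadata/v7.metadata.json"),
    # Supporting another format is just another typed entry pointing somewhere:
    "sales.events": DeltaLakeTable("s3://lake/events/"),
}
```

The design choice this illustrates is why adding a new table format is cheap: it only requires agreeing on a new type with a small schema, not changing how the catalog versions its references.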
17:38
But I mean you know from like a
17:41
format like a Delta Lake or a Hudi, most of
17:43
the time the table is just a particular directory. So it
17:45
could just be as easy as just having a schema
17:48
that's just basically Hudi table, Delta Lake table that
17:50
just points to a directory and then it could
17:52
catalog those as well. It doesn't
17:54
now but it wouldn't
17:57
be hard to do because it has a very... Again,
18:00
it's very flexible; it's just capturing: this is the type of metadata that this little object tracks, and then making sure that you have
18:06
a metadata object
18:09
attached to that that matches the schema for
18:11
that type. So Iceberg has a particular
18:13
set of information that you would keep with it. But
18:15
the way you interact with the catalog is through a REST API. Now, so you could
18:20
always custom make these
18:22
API calls, but there is a client in
18:24
Java and then Python to
18:26
directly interact with Nessie on top of the integrations that
18:28
are already used with a bunch of tools. But
18:31
basically, there is a standard specification: there is the OpenAPI spec in the Nessie documentation covering the endpoints. I
18:41
definitely spent a few days exploring that quite in
18:43
depth because I made like an unofficial client just
18:46
to kind of get more acquainted with it. And
18:48
that was a fun adventure, but
18:50
it's a pretty straightforward API. Are
18:55
you sick and tired of salesy data conferences? You
18:58
know, the ones run by large tech companies and
19:00
cloud vendors? Well, so am I. And
19:02
that's why I started Data Council, the
19:05
best vendor neutral, no BS data conference
19:07
around. I'm Pete Soderling,
19:09
and I'd like to personally invite you to Austin
19:11
this March 26th to 28th, where
19:14
I'll play host to hundreds of attendees, 100
19:16
plus top speakers, and dozens of hot startups
19:19
on the cutting edge of data science, engineering
19:21
and AI. The community
19:23
that attends Data Council are some of
19:25
the smartest founders, data scientists, lead engineers,
19:28
CTOs, heads of data, investors and community
19:30
organizers who are all working together to
19:32
build the future of data and AI.
19:36
And as a listener to the Data Engineering Podcast,
19:38
you can join us. Get
19:40
a special discount off tickets by using
19:42
the promo code DEPOD20. That's
19:45
D-E-P-O-D-2-0. I
19:48
guarantee that you'll be inspired by the folks at the
19:50
event, and I can't wait to see you there. Another
19:55
interesting aspect of this project, going back
19:57
to its nature as a catalog, is
19:59
that... the overall space of
20:01
data catalogs for data lake environments
20:04
has largely been a pretty static target: you have the Hive catalog, or you have the Hive catalog maybe in the form of AWS Glue, which is actually still just the Hive catalog. And
20:15
I'm curious, in the work of building and evolving Nessie, using it as an alternative catalog to that Hive ecosystem, about some of
20:24
the ways that you have been constrained
20:27
from innovating a lot in terms
20:29
of what the catalog can offer and how to
20:31
operate with it. And some of the ways that
20:33
you're able to try to move
20:35
the entire ecosystem along a bit to
20:38
understanding some of the new ways that
20:40
the catalog can and should be thought
20:42
of in this data lake
20:44
house ecosystem and maybe some of
20:47
the arbitrary limitations that the Hive
20:49
catalog API has imposed upon us
20:51
until now. Yeah, I mean, I think a lot of the solutions to that particular problem were made more on the table format side. So essentially, Iceberg really kind of broke away from the constraints of having to have Hive, where you have to have folders and subfolders that define your table. And
21:10
then Nessie is able to leverage that by being able to just refer to that table metadata and just focus on capturing the versions of that. So basically, it almost takes a whole different paradigm of what the catalog does: instead of it being the bearer of the metadata, it's instead sort of the gatekeeper of where the metadata is. So basically, where with Hive you have the Hive metastore that kind of acts as both your catalog and metastore, Nessie basically acts as the catalog, and Iceberg will then be sort of really where the metadata is stored, on your S3 in those manifests and manifest lists. And
21:42
in that case, you can much more easily incorporate future formats and new paradigms into the catalog. So I don't think
21:48
it's really been constrained. It's just a matter of people choosing to adopt Nessie. That's become a lot easier in recent times, particularly now that it's integrated into Dremio. A lot of people are just using it because once you have a Dremio lakehouse, it just kind of is there. So
22:04
why not use it? And then you don't have
22:06
to stand up the servers, you don't have to maintain it. So
22:09
it makes the whole process a lot easier. But there's still also
22:11
a lot of people who just deploy Nessie on their own and
22:13
are just using it that way. Because they prefer to have
22:16
that service that they manage on their own. They want to
22:18
use a different backing store. They just want to have control
22:20
over it. So we have seen a
22:22
lot of adoption on that side too. Especially, again, this last year has definitely been a big year of progressing, growing adoption for Nessie. For
22:31
that integration process or running it yourself,
22:33
what are some of the steps involved
22:35
in actually getting it deployed, getting
22:37
it integrated into a data stack
22:40
and maybe some of the
22:42
complexities that people should be planning
22:44
for, especially if they have an
22:46
existing catalog that they want to migrate away from? I
22:49
guess the first step as far as deployment goes, I
22:51
mean, if you just want to try it out, there's a Docker
22:53
container and that's pretty straightforward to use. If you want to deploy
22:55
it for production, there is a Helm chart. So you can deploy
22:57
that pretty easily using the Kubernetes Helm chart. And
23:00
then very soon, there'll
23:02
be an iteration of the Dremio Helm chart that also
23:04
should incorporate a lot of those details. So that way
23:06
you can simultaneously deploy them
23:09
easily. But once you actually have
23:11
it deployed, as far as migration goes, it just depends on what your use case is. So basically the assumption would be: hey, you're probably using Apache Iceberg or going to Apache Iceberg. So if you're already using Apache Iceberg before you adopt Nessie, the question then becomes: what is your prior existing catalog? So regardless of which catalog
23:29
it is, actually, as part of the Nessie project, they came out with a CLI tool for catalog migration, which is not just for Nessie, but for any Iceberg
23:36
catalog. So you could literally, you would just
23:38
put in the credentials for the source catalog, and
23:40
then you put in the credentials for the destination catalog
23:42
and what it does, it'll move all the references over.
23:45
So then that catalog will have all
23:47
the metadata references basically
23:50
in one fell swoop. The only challenge there... well, not really a challenge there; that should work fine. The issues are usually that the query engine has to have access to the catalog and then separately have access to the storage where the actual metadata is stored. So
24:06
where an accident can happen is that, you know, your files are in Hadoop, you just do a blanket migration of catalogs, but now you're using a tool that can't read Hadoop file storage. So now you still can't read those tables: you can read the catalog, but not the data. So you definitely have to kind
24:23
of keep in mind that you always have to think about,
24:25
hey, does the tool have access to the catalog and
24:27
the storage? As long as you keep those two in check,
24:30
usually you shouldn't really run into any problems, because essentially the query engine's path is: check with the catalog, then
24:34
check the storage. As long as it can do
24:36
both, you're gonna be able to read those tables just
24:39
fine, assuming it adopts the iceberg
24:41
spec. Now, if you're not using iceberg, then you're probably
24:43
not using this, so it's less of a consideration there.
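The migration Alex describes amounts to copying table references from one catalog to another in a single pass. Here's a toy version, not the real migration CLI, that makes the key point: only pointers move, never the data or metadata files themselves:

```python
# Toy sketch of catalog-to-catalog migration: copy every table's
# metadata *reference* from a source catalog to a destination catalog.
# Illustrative only; the real tool takes credentials for both catalogs.

source_catalog = {
    "orders": "s3://lake/orders/metadata/v7.metadata.json",
    "customers": "s3://lake/customers/metadata/v3.metadata.json",
}
destination_catalog = {}

def migrate(source, destination):
    """Move all table references over; the files themselves stay put."""
    destination.update(source)
    return destination

migrate(source_catalog, destination_catalog)

# The caveat from the conversation still applies afterward: the query
# engine needs access to BOTH the new catalog and the storage that the
# copied references point to.
```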
24:46
In terms of iceberg itself, that also
24:48
provides a moving target because it's a
24:50
very active project, a lot of different
24:53
engines are adopting it, it has been
24:55
growing in terms of its overall capabilities
24:57
and usage, and I'm curious how that
24:59
has influenced the direction and
25:02
development of Nessie and some of
25:04
the ways that Nessie has been
25:06
able to capitalize on the newer
25:08
features in iceberg. Basically, Nessie
25:10
just operates as a way to discover the tables, in
25:13
that case, it's independent of what's in the metadata. All
25:15
it cares about is the location, right now it only
25:17
cares about the location of that metadata.json, so
25:20
what's inside the metadata.json, what's inside the other
25:22
metadata files, so as we start adding things like the
25:24
leaf files, the coffin files, whatnot to
25:26
the iceberg specification, and in
25:28
the future, other files, I think, there's also some other
25:30
things that are sort of in discussion right now, all
25:33
of that would not affect the way Nessie operates,
25:36
since basically it's only versioning
25:38
the references and not versioning the actual metadata
25:41
itself right now. Again, in the future, it'll
25:43
probably start holding more of the metadata so
25:45
that way it can do those more sophisticated
25:47
merges and be more context aware of the
25:49
tables, but the kind of data that
25:51
it's probably going to need to track to do that is
25:53
probably not the kind of stuff that's changing right now, because
25:55
I mean we're talking about like, okay, what are the files
25:57
that got added, what are the files that were removed? It
26:00
doesn't necessarily have to track every single thing that
26:02
the same iceberg metadata does, just what it needs
26:04
to be aware of to avoid
26:06
tripping up when merging. And
26:09
then another responsibility that can
26:11
often get pushed into the catalog
26:13
layer is the question of
26:16
access control or permissioning. And I'm curious
26:18
how Nessie handles that aspect of the
26:20
problem space. Yeah, there's two ways you
26:22
can handle that right now. Essentially,
26:27
you can have different users that are
26:29
accessing the Nessie catalog. And essentially, the
26:31
access controls are applied to the user.
26:33
So basically, if I access the catalog
26:35
with a particular token, well,
26:37
basically, it'll be aware of, hey,
26:40
this person using this particular access token
26:42
can only access these branches, these
26:45
objects, these kind of things. So you can do that manually
26:47
with Nessie. And there's ways of
26:49
configuring a lot of that. That still probably
26:52
requires a lot of manual configuration. When
26:55
you're using Nessie as it's
26:57
integrated into Dremio, then it falls
26:59
into Dremio's more point and click
27:02
type of authorization, where you can
27:04
basically have role-based access controls, row-based
27:06
access controls, column-based access controls at
27:09
the query engine layer. So
27:11
basically, it'll leverage some of Nessie's
27:14
branch-level controls and then also leverage Dremio's
27:17
query engine-level controls when
27:19
you give different users tokens from
27:21
different tools. Since
27:23
Nessie is part of a given
27:26
data stack, the versioning
27:28
and branching and merging capabilities are
27:30
part of the core primitives of
27:32
the system. How have you seen
27:35
that influence the overall
27:37
workflow and design approach that teams
27:40
take as far as the development,
27:42
deployment, evolution of their data processing
27:44
and data delivery flows? Actually,
27:47
it was pretty simple, because oftentimes, again,
27:50
the pattern I've seen most was the pattern I mentioned earlier where
27:52
people just do a daily branch. So basically,
27:54
all you do is you just tweak all your jobs to
27:56
just hit that particular branch. Say,
28:00
name the branch whatever you want to call it,
28:02
and then it may be a timestamp or a date.
28:05
Basically, it's pretty easy to programmatically set up your
28:07
pipeline to always make sure that they're targeting the
28:09
right branch name. So then you can just
28:11
kind of run the branch. It'll always hit that day's branch. And
28:14
then basically, everything becomes very turnkey. But
28:17
midday, you're not going to see your
28:19
production data getting tainted because it goes
28:21
through that daily process. And then systems
28:23
that need to access that data real
28:25
time, they have access to
28:27
that branch. So they'll query that branch directly without
28:30
the same sort of guarantees you would get
28:32
with the production branch, and that'd be sort
28:34
of clearly communicated. But that's usually it because
28:36
once you have it, basically, the actual creation
28:38
and merging of branches is pretty straightforward enough
28:41
to do in SQL and then
28:43
automating that SQL with whatever, whether
28:46
it's Spark, Flink, or Dremio, is
28:48
pretty easy. So all that's left
28:50
is basically just kind of deciding
28:52
what is the frequency of their branching
28:54
patterns. Do they want to do hourly branches, daily
28:56
branches, weekly branches, and what their merge
28:59
cadence is going to be? But once you kind of figure
29:01
that out, once it's implemented, you don't
29:03
really think about it anymore. It just kind of works.
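The daily-branch pattern described above can be sketched roughly like this. The branch naming and the Nessie SQL strings follow Dremio-style syntax but are illustrative only; exact syntax varies by engine:

```python
# Sketch of the daily-branch workflow: compute a date-stamped branch name,
# run all ingestion jobs against that branch, then merge into main so the
# day's changes publish at once.

from datetime import date

def daily_branch_sql(day=None):
    """Build the branch name and the SQL steps for one day's pipeline run."""
    day = day or date.today()
    branch = f"etl_{day.isoformat()}"
    return branch, [
        f"CREATE BRANCH {branch} IN nessie",            # isolate today's writes
        f"-- run all ingestion jobs AT BRANCH {branch}",
        f"MERGE BRANCH {branch} INTO main IN nessie",   # publish in one step
    ]

branch, statements = daily_branch_sql(date(2024, 1, 15))
print(branch)  # etl_2024-01-15
```

Because the branch name is derived programmatically, every job can target "that day's branch" without any manual coordination, which is what makes the pattern turnkey once it's set up.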
29:06
Another common target for operating with
29:08
this data is something like a
29:10
DBT. And you mentioned
29:12
the zero copy clones, effectively, of
29:14
being able to create per-user branches.
29:17
I'm curious how you've seen folks
29:19
incorporate Nessie's versioning and
29:21
branching capabilities into the
29:23
development workflow of DBT
29:25
users and data analysts.
29:28
I've seen it with DBT users because Dremio does
29:30
work with... Well, again, any tool works with DBT.
29:33
But basically, in the SQL, you
29:35
can specify the branch in your query. So you
29:37
can just sit there. So I've seen it personally.
29:39
I've seen it with Dremio. And then basically, you
29:41
can just sit there and just add AT BRANCH
29:43
at the end of each of your queries, and
29:46
then you get all the benefits of DBT and all the
29:48
orchestration and using Git version control
29:50
on your DBT models. But then you
29:52
also get this other layer of versioning at the catalog
29:54
level, so you get to leverage both and get the
29:57
benefits of both. In your
29:59
experience... of working with Nessie, exploring
30:01
its ecosystem, diving deep into the iceberg
30:03
table format and the ways that the
30:06
two interoperate. What are some of the
30:08
most interesting or innovative or unexpected ways
30:10
that you've seen the Nessie project applied?
30:13
There was one, I'm trying to remember what
30:15
the exact details were, but I've seen some
30:18
interesting applications of just creating a branch
30:20
for like, just
30:22
to kind of create wildly different versions of the data.
30:24
Like actually, you
30:27
know, one example: they were
30:30
still using that daily pattern I mentioned before, but
30:33
also what they'll do is they'll create experimental
30:35
branches, because these are like, generally like large
30:37
financial institutions who we've seen this pattern with.
30:40
And what they'll do is they'll create a branch that
30:42
they use for doing like stress testing type stuff,
30:45
whatnot. Because what they can do is that
30:48
they can create a safe copy of their production
30:50
data for that day to then make
30:52
the changes to that data that they don't want
30:54
to permanently make to then run all their stress
30:56
testing calculations on. And then they can
30:58
just throw away the branch at the end
31:00
of the day without having to really worry about rolling it
31:03
back or undoing the data. So they make
31:05
branches at the beginning of the day for stress testing, add
31:08
in the, hey, bad scenario here,
31:10
worst case scenario there, and
31:12
then run their tests, and then they can dispose of it
31:15
every day. In your experience of
31:17
exploring this space, keeping up to date
31:19
with the use cases, the
31:21
technologies behind it, what are some of the
31:23
most interesting or unexpected or challenging lessons that
31:25
you've learned? I mean, oftentimes
31:27
I think it comes back to this:
31:30
the great thing about the
31:32
lakehouse is that everything's very modular. So you can kind of
31:34
swap out the different pieces you want, but
31:37
there's still like little gotchas,
31:39
particularly in sort of like,
31:41
as I mentioned earlier, when you're working with any
31:43
catalog in the iceberg space, there's
31:45
sort of two layers. So you have to make sure that
31:47
you have the authentication to access the catalog and you have
31:49
the authentication to access the storage and different tools have different
31:52
stories when it comes to both of those layers. And that's
31:54
oftentimes where a lot of gotchas kind of come in. So
31:57
I always just say, hey, doing the legwork to make
31:59
sure that... When you're working with a catalog,
32:02
making sure that the tools you use can read the
32:04
catalog and then also access the storage. Because I can
32:07
definitely find people in the boat where they're working with
32:09
something that they like, but then they move
32:11
to X tool, and now
32:14
the X object storage they were using
32:17
is not readable
32:19
by that new tool,
32:21
so it interferes with their plan, even
32:23
though the tool can interact with Nessie or some other catalog. And
32:26
for people who are interested in these
32:28
versioning capabilities, what are the cases where
32:31
Nessie is the wrong choice and
32:33
maybe you're better served by just using
32:35
an AWS glue or maybe just not
32:37
even using iceberg at all? Yeah, I
32:40
mean, well, basically, if they're using iceberg, I think
32:42
Nessie is a good option. Now, the reason
32:44
you might choose
32:46
Glue instead is oftentimes because you're really inside the
32:48
AWS ecosystem. So if you're connecting to Athena,
32:51
to Redshift, you're connecting to all these
32:53
tools, then, you know, AWS glue is
32:55
going to be a very easy sell because
32:57
it's going to have interactivity. But if
32:59
you're operating multi cloud or in a completely
33:01
different cloud, that's, you know, that's
33:03
not necessarily going to be the
33:06
same selling point. But if
33:08
you're not even using iceberg at all, if you're
33:10
using Delta Lake or Hudi, then oftentimes different
33:12
solutions might work better. Generally, there the
33:14
only option would be LakeFS, since you
33:16
only have file-level versioning available
33:18
at the moment. Which again, I always
33:20
like think of it like another feather in the cap for iceberg.
33:22
Not only does it have the rich ecosystem
33:25
of things that can write to it, read from
33:27
it, and manage tables, but
33:29
you also have rich options for how
33:31
you can version control your tables, be it
33:33
file versioning, table-level versioning, or catalog-level versioning.
33:35
Iceberg really gives you a lot of options
33:37
to kind of really architect the lakehouse you
33:39
need. Have you ever seen where
33:42
people are using both lake FS and Nessie
33:44
in tandem? I don't think I've seen
33:46
it yet. I've seen one or the other. Theoretically,
33:48
they can work together. I mean, it could
33:51
be like, basically, one of the issues
33:53
LakeFS has with iceberg in particular is that
33:55
iceberg really depends on absolute path. And
33:58
lake FS depends on relative path. So LakeFS
34:00
had to create their own custom catalog. But
34:02
the problem with the custom catalog though, is engine support. So
34:05
it works with Spark or Flink, but then you get
34:07
to many other engines, and you have
34:10
trouble connecting that catalog. So I could see a world where
34:12
basically someone is working with multiple formats, they may be working
34:14
with a Delta Lake and an iceberg, and they
34:16
might want to use Nessie for iceberg, but they want to use
34:18
LakeFS for Delta Lake. And I can
34:20
see that. And I mean, I can
34:22
see different situations where you're working with data that's outside of
34:24
a table, that you're going to want to roll back, whether
34:27
it's like, you know, a group of CSV files, like
34:29
LakeFS would be helpful. But again, when
34:32
it comes to your main Lake house
34:34
catalog, you might prefer Nessie to
34:36
provide those kind of semantics. So I can see a
34:38
world where all three levels have
34:40
benefits, because even at the table level with
34:43
iceberg, a nice thing about being able to
34:45
tag tables in iceberg, is
34:48
that it prevents them from being cleaned up when you
34:50
do cleanup operations. So if I tag like an end
34:52
of month snapshot, then when I expire
34:54
snapshots, it won't clean up those tagged
34:56
snapshots. I mean, the same story when
34:58
you tag commits in the catalog level. But again,
35:00
there's going to be different situations where you might
35:02
want each of these levers to be available to
35:04
you. So I haven't seen it too much
35:07
yet, because I feel like I'm
35:09
just starting to see people adopt these kinds of patterns, at
35:12
least on the Lakehouse level. And then also like the
35:14
sort of Git-style delivery of them, but I'm starting
35:16
to see it more and more adopted, but it's still
35:18
sort of very early days. For
35:21
people who are interested in
35:23
Nessie and want to keep
35:25
abreast of its development, its
35:28
future direction, what are some of the things that are planned for
35:30
the near to medium term, or anything that
35:32
you're keeping an eye on, or you're excited
35:34
to see come to fruition? Oh,
35:36
I guess my wish list is
35:38
going to probably be
35:41
PyIceberg support for Nessie. That's
35:43
definitely going to be, that's definitely high on my wish list.
35:46
I tried to make that contribution. I started
35:48
like writing some of
35:50
the pull requests, but I ended up
35:53
just not having the time that
35:55
I would have liked. So if anybody wants to
35:57
help contribute that, please, please go join over there.
36:00
There's a lot of great work to do there. And
36:02
there's a lot of really great devs working over there
36:04
on Nessie that you can communicate
36:07
directly with them on the Nessie Zulip. So Nessie
36:09
uses Zulip instead of Slack, which is
36:11
like the open source Slack. So
36:13
you can communicate there. So there you can like learn,
36:15
participate in the conversation about the
36:17
evolution of, well, not the format, but
36:19
the catalog and its future features.
36:21
But I would say like my
36:24
short-term wish list would be PyIceberg support. One
36:26
of the cool things that I keep hearing about long
36:29
run is it's gonna be again, that sort of more
36:31
context awareness. So, and then
36:33
also would be really cool is if,
36:35
eventually, you know, the pull requests
36:38
get accepted over there at Databricks, so that way,
36:40
I mean, Nessie would
36:44
be able to support Delta Lake, or something like
36:46
that, to just offer more options. So that way it can
36:48
become, ideally, you know, you have a catalog, you can hold
36:50
all the things. But it's a pretty cool
36:52
tool. And the patterns I'm seeing with it are pretty fun.
36:54
And then I think what's most unique about
36:56
it is just sort of, when you start doing the SQL for
36:58
anything, and how easy it is to do it. Yeah,
37:01
to me, that's when I was like, okay, this is nice.
37:03
This is just easy and simple
37:05
to use, and it
37:08
really does make a lot of new patterns a lot
37:10
easier to execute. Are there
37:12
any other aspects of the Nessie
37:14
project, the overall kind of
37:16
use cases or capabilities of data versioning in the
37:19
Lakehouse that we didn't discuss yet that you'd like
37:21
to cover before we close out the show? I
37:23
guess a couple other use cases that
37:26
I think are implied, but just to make them
37:28
explicit, are like multi-table
37:30
transactions. And one
37:33
thing I think, like right now, they
37:35
have introduced like multi-table transactions at the
37:37
table level versioning or in the table
37:39
level in iceberg. But the way
37:41
it's done is you have to use a catalog
37:43
that supports it, and the catalog has to kind
37:45
of implement these multi-table transactions. And
37:47
it's more like a traditional sort of begin
37:49
and transaction type style, where basically you
37:52
have to kind of do everything at one time. The nice
37:54
thing about the Git style you get with like a
37:56
Nessie or a LakeFS, since they're both taking that sort
37:58
of Git approach, is that
38:00
I can create a branch, and
38:02
I can do multiple transactions. And none of
38:04
those transactions are published until I do a
38:06
merge. So I could be doing one transaction
38:08
on one table in Spark, another transaction on
38:11
another table in Flink, another transaction on another
38:13
table from Trino or Dremio. And
38:15
then, when all those transactions are done, all
38:17
those transactions can be published simultaneously to all
38:19
those tables through one merge. And that's sort
38:21
of a unique capability that just doesn't actually
38:23
currently exist in a data warehouse at all.
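The catalog-level multi-table transaction described above can be modeled with a toy example. Plain dictionaries stand in for branches here; this illustrates the visibility semantics, not Nessie's actual implementation:

```python
# Several engines commit to the same branch independently, and readers of
# "main" see none of it until a single merge publishes everything at once.

catalog = {"main": {}, "audit_2024": {}}

def commit(branch, table, metadata_location):
    """One engine commits one table's new metadata pointer to a branch."""
    catalog[branch][table] = metadata_location

def merge(src, dst):
    """Publish every change on src to dst in a single atomic step."""
    catalog[dst].update(catalog[src])

# Different engines write to the same branch independently:
commit("audit_2024", "orders", "s3://lake/orders/v2.metadata.json")      # e.g. Spark
commit("audit_2024", "payments", "s3://lake/payments/v5.metadata.json")  # e.g. Flink

print("orders" in catalog["main"])   # False: nothing visible on main yet
merge("audit_2024", "main")
print("orders" in catalog["main"])   # True: both tables publish together
```

The key contrast with a traditional BEGIN/COMMIT transaction is that the commits can come from different engines at different times, and only the final merge makes them all visible.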
38:26
And so that's a really
38:28
neat thought process, because
38:31
I do think it opens up some new
38:34
ways that you think about how you do
38:36
those transactions across multiple tables and work with
38:38
multiple table semantics. Well, for
38:40
anybody who wants to get in touch with you and
38:42
follow along with the work that you're doing, I'll have
38:44
you add your preferred contact information to the show notes.
38:47
And as the final question, I'd like to get your
38:49
perspective on what you see as being the biggest gap
38:51
in the tooling or technology that's available for data management
38:53
today. Basically, my
38:55
opinion is going to be something that ties it
38:57
all together. And that's kind of what I
38:59
find working at Dremio really exciting, because we
39:02
do have a tool that's really trying to
39:04
tie things like Iceberg, Nessie, all these different
39:06
data sources, and tie it together in sort
39:08
of one cohesive platform, where it feels like
39:10
you're getting that modular system. But it
39:12
comes with the ease of use and nice
39:14
sort of flavor that you get with a
39:16
more integrated system like Snowflake,
39:18
where you get that ease of use
39:20
in a more deconstructed system on the
39:22
late-house. And I think that
39:24
has been the thing that people are really, really
39:26
looking for. And I do
39:29
feel like we are providing,
39:31
or on the verge of providing,
39:33
the solution to that. So if that's
39:35
a pain you're feeling, definitely come talk to me. All
39:38
right. Well, thank you very much for taking the
39:40
time today to join me and share
39:43
your perspective and your experiences working with
39:45
Nessie and helping us understand the problems
39:47
that it solves and how to incorporate
39:49
it into a data lake environment. It's
39:51
definitely a very cool project. It's great
39:53
to see more investment
39:55
and evolution of this data
39:58
versioning capability in the
40:00
data processing ecosystem. So appreciate the time and energy
40:02
you're putting into that and I hope we enjoy
40:04
the rest of your day. Thank you
40:06
very much. It was a pleasure. Thank
40:14
you for listening. Don't forget to check
40:16
out our other shows, Podcast.__init__, which covers
40:18
the Python language, its community, and the
40:20
innovative ways it is being used, and
40:22
the Machine Learning Podcast, which helps you
40:25
go from idea to production with machine
40:27
learning. Visit the site at dataengineeringpodcast.com to
40:29
subscribe to the show, sign up for
40:31
the mailing list, and read the show
40:33
notes. And if you've learned something or tried
40:35
out a project from the show, then tell us about it. Email
40:38
host at dataengineeringpodcast.com with
40:41
your story. And to help other people find
40:43
the show, please leave a review on Apple