Episode Transcript
0:11
Hello! And welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Dagster
0:17
offers a new approach to building and
0:19
running data platforms and data pipelines.
0:21
It is an open source, cloud native
0:23
orchestrator for the whole development life cycle
0:25
with integrated lineage and observability, a
0:27
declarative programming model, and best in
0:29
class testability. Your team
0:31
can get up and running in minutes
0:34
thanks to Dagster Cloud, an enterprise class
0:36
hosted solution that offers serverless and
0:38
hybrid deployments, enhanced security, and on demand
0:40
ephemeral test deployments. Go to
0:42
dataengineeringpodcast.com/dagster today to get
0:44
started, and your first thirty days
0:47
are free. Data lakes are
0:49
notoriously complex for data engineers who
0:51
battle to build and scale high
0:53
quality data workflows on the
0:55
data lake. Starburst powers petabyte
0:57
scale SQL analytics fast at a
0:59
fraction of the cost of traditional
1:01
methods so that you can meet
1:03
all of your data needs ranging
1:05
from AI to data applications
1:07
to complete analytics, trusted by teams
1:09
of all sizes including Comcast and
1:11
DoorDash. Starburst, as a data lake
1:13
analytics platform, delivers the adaptability
1:15
and flexibility a lake house ecosystem promises.
1:17
And Starburst does all of this on
1:20
an open architecture with first class support
1:22
for Apache Iceberg, Delta Lake, and Hudi.
1:24
So you always maintain ownership of your data.
1:27
Want to see Starburst in action?
1:29
Go to dataengineeringpodcast.com/starburst and
1:32
get five hundred dollars in credit to
1:34
try Starburst Galaxy today, the easiest, fastest
1:38
Your host is Tobias Macey and today I'm
1:40
interviewing Dain Sundstrom about building a data
1:43
lake house with Trino and Iceberg.
1:45
So Dain, can you start by introducing yourself?
1:48
Well, I'm Dain Sundstrom. I
1:50
am one of the founders
1:52
of Trino and Presto,
1:54
and I am CTO
1:56
at Starburst. I've been working
1:59
in the data lake space
2:01
for about 10 years now. Before
2:03
that I worked some other
2:05
startups and before that I was
2:08
one of the original people at JBoss
2:10
and spent a lot of time in Java
2:12
EE and that sort of space. And
2:15
do you remember how you first got started working in data? My
2:18
background mostly was distributed computing. So
2:20
out of college, I started working
2:22
at United Healthcare on distributed computing
2:24
using Intra DCE in the nineties.
2:27
And then switched to like Java EE
2:29
back when it was called something else.
2:32
And, uh, I, as part of
2:35
that, I wrote the object relational
2:37
mapping tools for JBoss. Then
2:40
eventually we long,
2:42
long time forward started working
2:45
at Facebook. And one
2:47
of the original projects from the head
2:49
of infrastructure was to come up with
2:51
a faster, better way of interacting with
2:53
their large data warehouse at the time.
2:56
So this is like 10 years ago
2:58
and it was, I don't
3:00
know, three, 400 petabytes or something, it's dramatically
3:03
bigger now. And they didn't have
3:05
a team to do it. And
3:07
myself and David Phillips and Martin
3:10
have backgrounds in Java, extensive background
3:12
in databases and stuff like that.
3:14
So we were available and we
3:16
started working on it, but I'm
3:19
mostly a distributed computing
3:21
person. So I wrote most of
3:23
the distributed computing parts of Trino,
3:25
whereas like Martin's a deep language
3:28
person. So he did a lot
3:30
of the language, uh, optimizations
3:32
and David is
3:35
deeply into databases has been forever. And
3:37
so built a lot of the database
3:39
parts and the tooling and things like
3:42
that. As an outgrowth of
3:44
that effort, along with a number of
3:46
other contributions to the ecosystem, we have
3:48
landed in this space where we have
3:50
a new architectural paradigm for analytical systems
3:53
that is largely phrased as the data
3:55
lake house as a midway point between
3:57
data lakes and data warehouses. And for
4:00
the purposes of this conversation, I'm wondering
4:02
if you can give your definition of
4:04
what constitutes a data lake house. It's
4:07
a really good question because I think
4:09
people play fast and loose with it.
4:11
So historically, I would say a data
4:13
lake is you have traditional
4:15
storage, external storage, so you're talking
4:18
HDFS is generally what people are
4:20
talking about. But nowadays, like
4:23
HDFS is so rarely used, it's
4:26
almost always some cloud object storage,
4:28
S3, GCS, Azure stuff. So definitely
4:30
all the data stored in that.
4:32
And then I think the important
4:35
part comes with a lake
4:37
house of talking about standard
4:39
data representations. So like you
4:42
can be a vendor and store all your data
4:44
in S3 if it's proprietary stuff. And
4:49
proprietary, I'm just going to define as
4:51
you're the only one who really implements
4:53
it. I don't care if you have
4:55
an open spec or whatever. Like it
4:57
doesn't matter. Like if you're the only
4:59
serious player in it, it's effectively proprietary.
5:01
So where I think about
5:03
it now, it's object storage. It's doing
5:05
it in the lake. So it isn't
5:07
like, Oh, I take the files
5:09
and then I import them into my special
5:11
proprietary format and then I process them. And then
5:14
I dump the data back out. That's the lake
5:16
as a sidecar to you. So it's
5:18
when you're doing transformations, when you're doing
5:20
data maintenance, the data is
5:23
operated on directly, with the lake being
5:25
your native form. Everything else is,
5:27
you know, a bolt on, which
5:29
not to say is terrible. It's just a different
5:31
thing. Absolutely. And another interesting
5:34
aspect of the idea of the data
5:36
lake house is that the reason for
5:38
framing it as such is that it
5:40
intends to add a lot of the
5:43
user experience benefits that you get from
5:45
a fully vertically integrated
5:47
database system, such as data warehouses,
5:49
whether that is an actual vertically
5:51
integrated system, as in the
5:54
days of yore, or a cloud native system
5:56
where compute and storage are disaggregated,
5:58
but still presented as a single unified
6:01
experience. And I'm wondering
6:03
if you can talk to some
6:05
of the ways that we have
6:07
actually as a community hit that
6:09
mark? And what are some of
6:11
the areas where we're actually still
6:13
falling short of the user experience
6:15
presentation of this cohesive platform versus
6:18
the parts where the gaps still
6:20
show through and you can see that it's actually
6:22
five different pieces that are trying to work together.
6:25
Yeah, I think we've done an okay job.
6:27
I think we got a long ways to
6:29
go though. If you had asked me this
6:31
question three years ago, I would
6:34
have just gone on and on and
6:36
on about the litany of like broken
6:38
weird tools that exist in the lake
6:40
house. I think things are starting to
6:43
get better as people realize that it
6:45
isn't like so much as like the
6:48
community of users, the community of like
6:50
the people implementing and maintaining this system
6:53
where like, I think we've now started
6:55
to figure out that like this paradox
6:58
of choice is not a good thing.
7:00
So before we had
7:03
like Hive and there
7:05
were five competing data formats and
7:07
then that narrowed down to two
7:09
and then everyone realized that what
7:11
Hive was doing was really bad
7:14
and not sustainable and
7:16
having like two different tables
7:18
next to each other and they're maintained in
7:20
completely different ways and have different type
7:22
systems and different schema evolution and so
7:24
on. Like I can go on and
7:26
on and on about the edges of
7:29
it. So I think Iceberg came
7:31
along and said, hey, we're just
7:33
going to come up with a format
7:35
for tables. It includes how tables move,
7:37
how they're evolved, how they're managed and
7:40
covers a whole plethora of
7:42
things including like data types and
7:45
how partitioning works and stats
7:47
now and views and
7:49
so on as a
7:51
written down standard. Before it was just
7:53
the Wild West, like literally like someone
7:56
would check something into Hive and like
7:58
invent an entire new system. Spark
8:00
does this all the time. Like, okay,
8:03
let's implement Spark bucketing V2, which is different than
8:05
everything else. And if you want to know how
8:07
it works, like go read the Spark code because
8:09
some person just showed up and everyone's like, yeah,
8:12
that's cool. So I think we've gotten a really,
8:14
a lot better on
8:16
data in tables, the type
8:18
system, that sort of
8:21
thing is now fairly
8:23
standardized and well understood. That said, iceberg
8:26
did it. And then immediately
8:28
data bricks came along and dropped a
8:30
competing product, which is kind
8:32
of half finished. And so now I
8:34
get to implement two and
8:36
now there's more of these coming
8:39
along and hoping that this time
8:41
around we consolidate onto one very
8:44
quickly. Cause it's really kind of
8:46
a mess. And basically what
8:48
happens is people like us in
8:51
the Trino community, we have to implement all of
8:53
these and we only have so many people. So
8:56
it's like we implement one really well and the
8:58
rest suffer or we implement all of them kind
9:00
of okay. So it's,
9:03
it's difficult. Like right now there are enough
9:05
people. I think we're maintaining three
9:07
of them. Hive ACID died. And
9:09
that's like one of N
9:12
tools. So like we can have the
9:14
same conversation about security. We can have
9:16
the same conversation about, I don't
9:19
know. There's, there's like lots of these areas.
9:22
Absolutely. So I personally am actually using
9:24
the lake house architecture for my platform.
9:26
For sake of transparency, I am using
9:28
Trino. I'm using the Starburst-managed Galaxy.
9:30
So get that out of the way.
9:32
I'm using the iceberg table format, which
9:35
is largely transparent. I don't have to
9:37
do a lot on the actual table
9:39
format piece because Trino handles that piece
9:41
of it for the most part. And
9:43
so as somebody who's using the lake
9:45
house paradigm, there are definitely a lot
9:47
of niceties. I agree. It's gotten a
9:49
lot easier over the past couple of
9:51
years than it was prior to that.
9:53
A lot of the conversation seems to have
9:56
cohered around a roughly
9:59
standardized conception of
10:01
what constitutes the lake house. I do
10:03
think that one of the areas
10:05
that is still unfinished, or at least
10:08
not as cohesive across the board, is
10:10
that question of security and access
10:12
control. That seems to be one of
10:14
the areas where the overall data ecosystem
10:17
has not yet figured out. Everybody
10:19
has their own thoughts on how it
10:21
can and should be done. Everybody wants
10:23
to own that experience. There aren't
10:25
a lot of methods for being
10:28
able to communicate roles and access
10:30
across the layer boundaries.
10:33
I'm wondering if you can talk to
10:35
some of the ways that that manifests
10:37
in terms of that overall experience as
10:39
a juxtaposition to the warehouse where everything
10:42
is presented as one system. Yeah.
10:45
As one of the people who's written a
10:48
huge portion of the security systems in
10:51
Trino and in
10:53
Galaxy, it's actually a really
10:55
hard space to be in.
10:57
If you look into the
10:59
open ecosystem, throughout this
11:01
whole thing, we're talking about the
11:03
open ecosystem. The open ecosystem for
11:05
security, historically, you had the Hive
11:08
meta store with its security. Well, the
11:10
most popular meta store out there is
11:12
Glue and it doesn't have the
11:14
Hive security model. The Hive security model
11:17
was always weird and only applies to
11:19
Hive. Trino is a federated system,
11:21
so that doesn't make
11:23
much sense. Ranger pretty much died. I
11:25
haven't seen it around in a while.
11:28
There are people still looking at it,
11:31
but I get a sense for how
11:33
popular things are by when people ask
11:35
about things. It's like two, three years,
11:37
two years ago, it just fell off
11:39
a cliff. The only
11:41
other thing I've seen out recently is OPA, which
11:45
the Bloomberg folks have been working on.
11:47
They really like. OPA is really
11:49
complicated. You write
11:52
security rule policies in a
11:54
security rule server in a
11:56
custom language. I literally
11:59
looked at it. And I was like, if I
12:01
did this, I would write a tool to write
12:03
the language policy files for me. It's very complicated.
12:06
So I think that's got a long ways to
12:08
go. Hopefully someone builds like
12:10
a UI and tooling and stuff for
12:12
it. So that's really all you have
12:14
in the open space. In proprietary, you
12:16
have AWS's Lake Formation, which like
12:18
I seriously have yet to meet someone
12:20
who's rolled it out. It just looks
12:22
weird. We'll see what happens. Again,
12:25
I'm hoping, I'm hoping it dies. Like every
12:27
one of these things that's successful, we
12:29
have to build and maintain. So like, I'd like one and
12:31
I'd like it to be open.
12:35
Databricks has their own proprietary thing.
12:37
At Starburst, we have our own
12:39
proprietary thing. I think Tabular
12:41
has their own proprietary thing. You
12:44
end up with proprietary things
12:46
because of the complexity of
12:49
the security system. So like
12:51
in Galaxy, we built
12:53
the security system into the core of
12:55
Galaxy itself. So Galaxy's the Starburst hosted
12:57
version of Trino. So like every screen
12:59
you're looking at in Galaxy is viewer
13:01
aware and we're applying your policy on
13:03
like what you're allowed to see and
13:05
it's really core to the whole application.
13:08
It kind of touches like every single
13:10
bit. So how do you put that
13:12
in? And then you're like, Oh, I'm
13:14
going to make this call out to
13:16
a third party system and like, I
13:18
need to know what changes, but like,
13:20
this is something I need to be
13:23
able to do on like a millisecond
13:25
level. And so security is a super
13:27
hard problem. Also, everyone has different viewpoints
13:29
about how security should work. In Galaxy,
13:31
we follow a very traditional database
13:34
security system with roles and
13:38
access controls, et cetera. In other
13:40
systems, like there's different viewpoints. Like
13:42
it's, it's very interesting. Like OPA
13:45
is like this different universe of
13:47
like policy rule systems. So I
13:50
don't think we have a good answer for this right
13:52
now in terms of like a community. And
13:54
I think this is a, one of the things that
13:56
actually is the reason why you would choose a vendor
13:58
is their security implementation aligns with
14:01
like what you want to do. Yeah,
14:03
the security and policy space is definitely
14:06
still very much in flux, in particular
14:08
in the Lakehouse ecosystem, but even beyond
14:10
that. So OPA is a tool that
14:12
came out of largely the Kubernetes ecosystem,
14:15
and is being applied to a number
14:17
of different areas because it is a
14:20
generalized policy language. There's another project called
14:22
OSO, which is an open source policy
14:24
engine that has its own policy language
14:26
again, so that you can have the
14:28
policy agent embedded in process in various
14:31
language runtimes, and then you can define
14:33
those policies out of band and apply them
14:35
to the runtime dynamically. So I think
14:38
that that is an interesting approach and maybe something,
14:40
you know, where there's OSO or OPA or one
14:42
of the other tools in that
14:45
ecosystem might start to make inroads into
14:47
the data platform ecosystem as well. And
14:49
then you also have things like identity
14:51
systems like Keycloak or Okta or Auth0,
14:54
etc. that also
14:56
factor into all of that. So it's
14:58
a big, complicated space. I
15:00
think part of the problem here is what
15:02
are we optimizing for? So like OPA
15:04
and Ranger, which is just another
15:06
policy system, was great if you're
15:08
an admin and you want to
15:11
like lay down the
15:13
rules like broadly for
15:15
like lots of tables by using
15:17
table matching. But like SQL security
15:19
was really built around like I
15:21
create a table, I type commands
15:23
to grant access to other folks
15:26
in the platform, I may
15:28
create views or like, you
15:30
know, filter rules or something like
15:32
that. And I'm just typing commands to do
15:34
that in the SQL language.
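To make that end-user style concrete, here is a minimal sketch in Trino-flavored SQL (the catalog, table, role, and view names are hypothetical):

    -- grant another role read access to a table I created
    GRANT SELECT ON lake.sales.orders TO ROLE analyst;

    -- expose a filtered slice through a view that runs with my rights
    CREATE VIEW lake.sales.orders_eu SECURITY DEFINER AS
    SELECT * FROM lake.sales.orders WHERE region = 'EU';
    GRANT SELECT ON lake.sales.orders_eu TO ROLE eu_analyst;

And that SQL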
15:36
language is the language
15:39
of the system I'm doing, I'm
15:41
in. So it's like that's a
15:43
system that's optimized for end user
15:45
experience, not admin experience. And
15:47
the admin experience, it's great if
15:49
you're a bank. OPA in Trino came
15:51
from Bloomberg and it's like they have
15:54
a lot of data and they have
15:56
data policies they need to apply broadly.
15:58
But if you're like a small
16:00
group and you want to have a
16:02
security system, like, do you even have
16:04
people that can write these complicated things?
16:06
Can you write a, can you run
16:09
an OPA system that's going to return
16:11
responses in milliseconds because it's part of
16:13
like every query? No. And
16:15
like really you want the system to be kind
16:18
of in a simple, understandable way for a user.
16:20
So it's like, there's these, a lot of the
16:22
stuff in data lakes are provided by big companies
16:25
with big company solutions to big
16:27
company problems, and it does not
16:29
align with like, Hey, I want to
16:31
like grant access to a table to some other person.
16:34
Absolutely. And in the
16:36
data lake and lake house ecosystem
16:39
as well, there's the added complexity that
16:41
by virtue of the storage and the
16:43
compute being disaggregated, you maybe want to
16:45
bring a different compute to that same
16:47
storage. And so then there's the question
16:49
of, okay, well, do I need to
16:52
route all of my requests through the
16:54
other compute engine that has my policy
16:56
information? Do I have to have different
16:58
policy sets and different rule sets across
17:00
those different compute systems? So it's actually
17:02
worse than that too. Cause outside
17:05
of like Trino, the most
17:07
popular compute engines are
17:09
MapReduce-y, like things like
17:12
Spark and Hive. And the
17:14
problem is that those engines
17:16
almost always allow users
17:18
to upload their own third party
17:21
code, untrusted third party code into
17:23
the same process. And that means
17:25
that you can't rely on the
17:28
process to be secure, to protect
17:30
against data access and
17:32
stuff like that. So the
17:35
spark in hive communities are
17:37
pushing for things like column
17:39
level encryption and physical security
17:41
based on file permissions, which
17:43
is like anathema to
17:45
like the way SQL works. This would be
17:47
the equivalent of like, Oh, I'm going to
17:49
manage my, my SQL permissions by setting file permissions.
17:55
It's insane, right? And like, this is
17:57
like state of the art and
17:59
it's because like we, the entire,
18:01
the entire industry went down this
18:04
MapReduce path for 15 years
18:08
and it's not a good
18:10
idea. Like you see like every single
18:12
vendor who's working in the data space
18:14
has moved away from MapReduce. Like, yeah,
18:16
Spark still uses it, but like when
18:18
you get into like high performance stuff,
18:20
like everyone has moved away
18:22
from MapReduce. It's just not a thing
18:24
you do anymore. And we're
18:27
still building our security systems to
18:29
like the lowest common denominator. And
18:32
so taking a step back now
18:34
from ragging about the complexities of
18:36
security, bringing it
18:38
back around to Trino and iceberg,
18:41
I guess maybe keeping it in
18:43
the context of security, what are
18:45
the benefits that that particular pairing
18:47
provides and maybe in juxtaposition to
18:49
other technology stacks or vendors that
18:52
purport to provide a data lake
18:54
house experience? Today, I
18:56
think the data warehouse, like
18:58
the folks talking about the data lake
19:00
experience, and I'm using that in quotes,
19:03
I think it kind of
19:05
breaks down into two camps, you have
19:08
folks who have a traditional
19:10
data warehouse that can pretend
19:13
like it's in the data lake. That's
19:15
almost always done by you run
19:18
a query, it loads the data into Snowflake
19:20
format, they run their query and then they
19:22
throw the data away or they cache it
19:24
or something like that. But they don't actually
19:27
execute directly on the lake house data. So
19:29
that's like one camp. And then the other
19:31
camp would be, obviously you have iceberg
19:35
camp and then you have like the
19:37
Delta Lake camp, which is similar.
19:40
I have my bias. My bias
19:42
is absolutely towards iceberg. I
19:45
was pretty unhappy when Delta Lake
19:47
actually came out. It's
19:50
unfortunate that like, I thought
19:52
we had this brief moment
19:54
where it looked like the
19:56
entire ecosystem was going to
19:58
move onto iceberg. And
20:00
we would only have one thing to implement, not like
20:02
five. And then Databricks dropped
20:05
their format. And in my
20:07
experience, the only people using it are Databricks
20:09
customers, but they have a lot of customers.
20:12
And so like everyone is having to
20:14
implement it because Databricks made it the
20:16
default format for their customers. When
20:19
honestly, like their customers would
20:21
be just as happy with Iceberg.
20:23
So now we all get to
20:26
build twice and yeah, it's got
20:28
a community, but like it's
20:30
not the same thing as it being an
20:32
Apache community. But even then having, if
20:35
there were two Apache projects, I'd be annoyed
20:37
also. And that doesn't, and then there's other
20:39
groups that are trying to build stuff. So,
20:44
so Trino and Iceberg,
20:47
I think we're combining
20:49
together, like in my
20:51
opinion, the best analytics
20:54
query engine we have available along
20:56
with the current best storage
20:58
format. Uh, so
21:02
without like Iceberg
21:04
without Trino is like, great, I have
21:06
storage format, but like, how do I
21:09
query it? How do I, how
21:11
do I interact and change and
21:13
produce these files? Like, you know,
21:16
it's nice, but like, it's not,
21:18
um, you're still suffering the
21:20
problems of some of the other engines
21:23
and Trino on the
21:26
other hand provides this great query engine
21:28
that's adaptable. Like Trino has
21:30
the ability to add in custom
21:32
data types. Uh, we have, uh,
21:35
direct readers for everything. So it can
21:37
actually, we can actually build an engine
21:40
that's really, really tightly,
21:42
uh, set up for what, uh, Iceberg
21:45
can do, and we can
21:47
do that in a, like in a way where
21:49
you get really, really great performance. So
21:52
what Trino was, was suffering
21:54
from until Iceberg came
21:57
along was the data formats weren't
21:59
particularly good. And so like
22:01
they, you would have performance problems,
22:03
you would be missing stats. You
22:05
know, there's this really, most of
22:07
the data formats and the way
22:09
Hive worked was actually designed
22:11
for HDFS, which has a very
22:13
specific performance profile that S3 does
22:15
not have. Like listing files is
22:17
great in HDFS and is insanely
22:19
slow in S3, and Iceberg doesn't
22:23
require listing files. Like there's a whole
22:25
bunch of things like that where Iceberg
22:28
was designed to deal with the performance
22:30
characteristics of object storage as
22:32
opposed to like HDFS's
22:34
design. I mean hardly
22:36
anyone uses HDFS anymore. So
22:39
like Iceberg gave us the,
22:41
a really stable format with
22:43
a well-run community that likes
22:45
specs that understands like the
22:47
performance of modern things. And
22:50
we were able to work really closely with
22:52
them and build a query
22:54
engine that's really tuned.
22:56
The integration we're doing to
22:58
Iceberg is fundamentally designed for Iceberg.
23:00
It isn't like a bolt on. It's like
23:03
we took Hive and like swapped out a
23:05
little bit. So like we wrote a custom
23:07
plugin just for Iceberg that does exactly what
23:09
Iceberg wants. Are
23:13
you sick and tired of salesy data conferences? You
23:16
know, the ones run by large tech companies and
23:18
cloud vendors? Well, so am I. And
23:21
that's why I started Data Council, the
23:23
best vendor neutral, no BS data conference
23:26
around. I'm Pete Soderling
23:28
and I'd like to personally invite you to
23:30
Austin this March 26 to 28th where I'll
23:32
play host to hundreds of attendees, 100 plus
23:35
top speakers, and dozens of hot startups
23:37
on the cutting edge of data science,
23:39
engineering, and AI. The
23:41
community that attends Data Council are some
23:44
of the smartest founders, data scientists, lead
23:46
engineers, CTOs, heads of data, investors, and
23:48
community organizers who are all working together
23:51
to build the future of data and
23:53
AI. And as a listener
23:55
to the Data Engineering Podcast, you can join us.
23:58
Get a special discount off tickets by
24:00
using the promo code DEPOD20. That's
24:03
D-E-P-O-D-2-0. I
24:07
guarantee that you'll be inspired by the folks at the
24:09
event, and I can't wait to see you there. And
24:11
when somebody is building a data platform or
24:19
building their warehouse implementation, they decide,
24:21
okay, this combination of Trino and
24:23
Iceberg does what I want. I
24:25
have the benefits of a performant
24:27
query engine. I have the flexibility
24:29
and scalability of object storage. I
24:31
can scale those two things independently.
24:33
How does that influence the other
24:35
upstream and downstream choices that they
24:38
might make for the other components
24:40
of their data platform? So once
24:43
you decide you're gonna go with Iceberg
24:45
and Trino, you have the complexities of
24:48
like, how do I actually get my
24:50
data into these platforms? The bootstrap problem
24:52
is a really big problem in data
24:54
warehousing in general. It's like, how do
24:56
I get my data in? In general,
24:59
since Iceberg has become so popular that
25:01
a lot of tools are adopting it,
25:03
so actually getting your data in is
25:05
less of a problem, but you definitely
25:07
wanna go and look at the vendors
25:10
you're gonna use for landing
25:12
the data into your S3
25:14
bucket and make sure they
25:16
support Parquet at the very least and
25:19
Iceberg, hopefully. And if they're not
25:21
supporting it, when are they gonna
25:23
support it? Because most of them
25:25
have it on their roadmap unless
25:28
they're actually, unless they're Databricks. Like,
25:30
actually even Databricks is starting to
25:32
add Iceberg support. So making
25:34
sure your vendors actually are supporting
25:36
landing data in Iceberg format. Then
25:38
in terms of like other choices,
25:41
you obviously have things like, how
25:43
am I going? Like, how
25:45
is security gonna work? How is
25:47
data maintenance gonna work? So Iceberg
25:49
tables require maintenance on them. And
25:51
depending on how you're importing data,
25:53
they may require compaction. And you
25:55
wanna keep only so much snapshot
25:57
data because they have the ability
25:59
to query historic data, but that
26:01
means you're holding historic data, which could
26:03
be expensive. So there's a bunch of
26:06
like maintenance things and you're going to
26:08
have to choose a tool that supports
26:10
the maintenance. So many of the platforms
26:12
like Starburst, we're integrating all of this
26:14
stuff into our platform because we want
26:16
to create the simplest experience for people.
26:18
Like we don't want them to have
26:20
to go and like integrate with a
26:22
third party tool to like run some
26:24
compaction jobs.
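As a sketch of what that maintenance looks like if you do run it yourself, the Trino Iceberg connector exposes table procedures along these lines (the table name is hypothetical):

    -- compact the small files produced by frequent ingest
    ALTER TABLE lake.sales.events EXECUTE optimize(file_size_threshold => '128MB');

    -- drop old snapshots so retained history stops accumulating storage cost
    ALTER TABLE lake.sales.events EXECUTE expire_snapshots(retention_threshold => '7d');

    -- querying historic data is what those retained snapshots buy you
    SELECT * FROM lake.sales.events FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';

Then I think there's additional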
26:26
things around like you're going to use
26:28
probably some sort of data transformation
26:31
pipeline kind of tool, almost always
26:33
dbt. I don't even
26:35
know if they have competitors honestly.
26:37
Yeah. And then obviously you're going to
26:40
want some sort of BI tools.
26:42
Most of them are supporting Trino or
26:44
Starburst or both today. So there's a
26:46
good bit of choice reduction there,
26:48
but I think the big thing is
26:51
like data ingest, getting it into iceberg
26:53
and maintaining those files are currently
26:55
a big part of the platforms. Absolutely.
26:57
And I started my
27:00
data lake house journey, I think maybe
27:02
going on two years ago now. And
27:04
in that two years, it has gotten
27:06
better. Initially there wasn't really any out
27:08
of the box support for being able
27:10
to write into a lake house, you
27:12
could write data into S3, but then
27:15
you would have to perform
27:17
a different step to actually tell whatever meta
27:19
store you were using. Hey, these files exist.
27:21
This is the schema. These are the tables,
27:23
et cetera. So my team is actually using
27:25
Airbyte. And so we actually had to
27:27
write a custom output plugin that sat on
27:29
top of their S3 plugin to be able
27:31
to automate generation of those AWS Glue tables
27:33
for the data that was just written out
27:35
rather than having it be an out of
27:37
band process of, Oh, hey, I wrote all
27:39
this data to S3. And I'm going
27:41
to wait for the crawler to run, to
27:43
tell me what those tables are. And it's
27:45
probably going to be wrong anyway, et cetera.
27:50
Absolutely. Airbyte, actually all
27:52
of them, they either have it, and if
27:55
they do, it's not always the best, but
27:57
like every single one of those vendors I think
27:59
has realized that Iceberg is an important
28:01
part of the Data Lake future and they
28:04
just need to be able to ingest directly
28:06
into Iceberg. And Airbyte does have that out
28:08
of the box now. There are a couple
28:11
of implementations. The level of support is not
28:13
quite where I would like it to be.
28:15
And then going back to one of your
28:17
earlier comments as well, as far as the
28:20
data type specifications being a bit all over
28:22
the place, one of the things that is
28:24
my personal pet peeve, at least in the
28:27
Airbyte toolchain. I don't know if it exists
28:29
elsewhere, but anything that has a decimal
28:31
value is automatically a float, which
28:34
if anybody knows anything about data types, that
28:36
is an awful choice. Yes,
28:38
that is an absolutely awful choice.
28:42
Funny enough, the first versions of Trino, we
28:44
didn't have decimal, we only had doubles. And
28:47
the actual migration away from them was
28:49
quite an undertaking. We had like backwards
28:51
compatible flags for a long while where
28:54
you'd be like, oh, if you see
28:56
a literal, it's actually a double, not
28:58
a decimal, like it should have been in
29:01
the spec. So the version of the plugin
29:03
that my team uses, we actually implemented the
29:05
logic that says if it is a numeric
29:07
type that has a decimal place, treat it as
29:09
a decimal value, not as a float.
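A quick illustration of why floating point is the wrong type for decimal values, in Trino SQL:

    -- binary floating point cannot represent 0.1 or 0.2 exactly
    SELECT CAST(0.1 AS DOUBLE) + CAST(0.2 AS DOUBLE);
    -- 0.30000000000000004

    -- exact decimal arithmetic does what you expect
    SELECT CAST(0.1 AS DECIMAL(10, 2)) + CAST(0.2 AS DECIMAL(10, 2));
    -- 0.30

And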
29:15
so for people who are looking at the
29:17
Data Lakehouse ecosystem, going from where we are
29:19
today and looking into the near to medium
29:21
term forward, what are some of the areas
29:23
of progress that you see as
29:25
far as overall improvement in the
29:27
capabilities and user experience for the
29:29
tooling that's available? So I think
29:31
we are finally at the point
29:34
as of this year,
29:36
that the rest of
29:38
the vendor space has
29:41
realized that iceberg is a critical component.
29:43
And they're starting to, they aren't even
29:45
just starting, they figured this out like
29:47
six months ago, their products are starting
29:49
to land. And that's
29:52
a big change. Whereas like before, as you said,
29:54
like, you know, the history of like the Data
29:56
Lake is you end up having to build a
29:58
bunch of the stuff yourselves while
30:01
the vendors figure out what's important.
30:03
So, uh, there's, there's a,
30:05
there's, there's a bunch of interesting
30:07
parts to this. So there's like, obviously
30:09
things like landing data and data maintenance.
30:11
It's going to be interesting to
30:13
see how this shakes out in the
30:16
next like year or two as
30:18
what happened before is happening again, everyone
30:20
realizes it's important. So everyone's going
30:22
to build products around this. So now
30:24
we're going to have competing products that
30:27
all have slightly different features, which
30:29
is a good thing, but it's also
30:31
like a bad thing because it's the
30:33
paradox of choice for the end users.
30:35
You're going to have a lot of
30:37
stuff to look at and you have
30:39
to consider like the data Lake is
30:41
about how things integrate together. So it's
30:43
like, if I choose this product from
30:45
this vendor, how does that work with
30:47
my other products that I might be
30:49
interested in from other vendors? Can I
30:51
use Airbyte to land my data
30:53
and then use a separate data maintenance
30:55
tool that plays well with that landed
30:57
data? And it's
30:59
going to be a interesting next set
31:02
of things around like now that
31:04
we're moving on to iceberg and
31:06
we have Trino.
31:09
So it's like, how do we get
31:12
these different products to play well with
31:14
it? And everyone's got kind of a
31:16
different viewpoint on that. And
31:18
as a vendor supporting Trino,
31:20
building a product powered by Trino,
31:23
what are some of the
31:25
areas of investment that you see
31:27
as being most critical to easing
31:29
that adoption curve, improving the effectiveness
31:32
and user experience for people who
31:34
are using Starburst specifically and Trino
31:36
indirectly to just make their lives
31:38
easier and help them get their
31:40
jobs done. Well, I
31:43
should have mentioned this earlier. The most
31:45
challenging thing that people have is actually
31:47
like how they query their data. So
31:50
we set up Trino and the
31:52
first thing you see in Starburst
31:54
is a way of
31:56
actually entering queries right in our
31:59
UI. You run the queries
32:01
and then you're like, great, I
32:03
want to put this in my BI tool.
32:05
Like how do I get
32:07
this to my BI tool?
32:09
That is a big area
32:11
we actually think about is like how
32:13
do we empower users to get this
32:15
into the tools they want to use
32:17
Then the other part is kind of
32:19
like generally like the admin part: how
32:21
do I manage my security? We spend
32:23
a lot of time around that and
32:25
I think the big areas that we
32:27
look for are how do we make
32:29
it easier and easier for people to
32:32
set up their data lake. So
32:34
one of the first things we
32:36
focused on in the Galaxy development
32:38
was what I call time to first
32:40
query. So you go sign up.
32:42
You can be running queries on your
32:44
data warehouse in a minute,
32:46
couple minutes. That's great, how do
32:48
you get your data? And so
32:50
we spent much time around data
32:52
discovery, integrations, etc. and we're continuing
32:55
to do more and more work
32:57
around how you actually build up
32:59
your initial lake and get your
33:01
data into your lake. So I still
33:03
think that's one of the big
33:06
problems. So it's this: how do
33:08
you get data in? And just
33:10
kind of, from seeing a lot
33:12
of the day-to-day stuff, it's
33:14
nitty-gritty stuff. It's like stuff
33:16
I love, but it's like really detailed.
33:19
There's a lot of choice in the space
33:21
and really what I want as a
33:23
non-data-head end user or even
33:25
honestly my other friends that aren't in
33:28
the technical space, they're like, that's great, but
33:30
like I don't want to learn how
33:32
the low level file system stuff works,
33:34
I just want to run some queries. So I
33:36
spent, we spent a lot of
33:39
time on just like, let's get it
33:41
all working and then if you want
33:43
to like integrate with some additional stuff.
33:45
Cause like that's important too. Like, we
33:48
didn't talk about how we do that,
33:50
but really, it's like get up, get
33:52
queries going, get excited about
33:55
Trino and what we're doing and then
33:57
we can talk about like some people
34:00
are very opinionated about like they want a
34:02
certain specific integration the way they want to
34:05
do it. But it's pretty rare. We
34:07
hear it because we're in the community. But
34:09
like outside of like data heads, people don't even
34:11
like people don't know what Ranger is or Parquet
34:14
or like they don't know what any of this
34:16
is. They're like, I just want to run some
34:18
queries. Yeah, as somebody who's
34:20
been running this podcast for, I guess,
34:23
seven years now, whenever I talk to somebody
34:26
who isn't deeply embedded in this space, I'm
34:28
always struck by the fact that the
34:30
things that I'm talking about, they have no clue and
34:32
they don't care. I'm like, wait a minute. All right,
34:34
reset. I'm going to remember that I'm talking to somebody
34:36
who doesn't do this every day. Yeah, I
34:39
often find myself saying outside of data space. So
34:41
you know, in Excel, when you do x, we
34:44
kind of do that, but the table's infinite,
34:47
like, yeah. Right. And
34:51
going back to that question of landing
34:54
data and the transformation, as you mentioned, most
34:56
people these days are using dbt. There are
34:58
some competitors, but not a lot of them
35:00
and not on the same scale. But one
35:03
of the benefits that Trino provides is, as
35:05
you mentioned, it's a federated query engine. So
35:08
rather than being constrained to, Oh, I can only
35:10
work on the data that's in my iceberg tables,
35:12
you can say, Oh, I actually just want to
35:15
directly query against my Postgres or my MySQL database
35:17
or some of the other numerous data connectors that
35:19
are out there. And I'm wondering what you
35:21
see as the general pattern of people
35:23
who are adopting Trino, whether they are
35:26
still using Airbyte or
35:28
Fivetran as the only means of landing
35:30
data into their lake house, or if
35:32
they're largely using that federated query capability
35:35
to be able to do more kind
35:37
of real time data updates from
35:40
source systems into their lake house via
35:42
those transformation routes. Very,
35:44
very interesting question. So you're going to
35:46
get the database answer, which is it
35:49
depends. Uh, so
35:51
it's interesting. So like
35:54
federation is awesome. You
35:56
generally, typically you're
35:58
not keeping your main data. Actually,
36:00
let me back up. So normally when we're
36:02
talking about federation, so like Trino
36:04
in its heart is a federated
36:06
query engine. That is like, we
36:09
don't own the data. We're interacting
36:11
with data and the
36:13
descriptions of the tables that are
36:15
all external that said the connectors
36:17
that read data from like
36:19
object store and glue and that
36:21
sort of thing, those are effectively
36:23
native formats to Trino. Like we
36:25
implement all the raw file reading
36:27
logic. We talk directly to glue.
36:30
We're not talking to like another query
36:32
engine. Whereas when we talk to
36:34
MySQL, we send a query
36:36
in MySQL's language to MySQL.
36:39
So normally when we're talking about federation,
36:41
we're talking about the stuff that's not
36:43
in normal data lake queries. Folks
36:46
that are a lot
36:48
of companies and users, et
36:50
cetera, we'll have what I'll call
36:53
dimensional data sitting in a production
36:56
store that's like a MySQL or Postgres,
36:58
this could be as simple
37:00
as like demographics for users, et
37:02
cetera. So like they'll have their
37:04
main feed of data say it's
37:06
an ad feed and it's like,
37:08
okay, user so-and-so saw this ad
37:10
you join in with their demographics
37:12
and then you can do analytics
37:14
of like, you know, uh, the
37:16
amount of ad clicks by age
37:18
range or something like that, and
37:21
you don't have age range in your,
37:23
uh, in your normal ad feed.
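As a hedged sketch of that kind of federated query in Trino, joining an Iceberg ad feed against dimensional data living in a Postgres catalog (all catalog, table, and column names are hypothetical):

    SELECT d.age_range, count(*) AS impressions
    FROM lake.ads.impressions i
    JOIN postgres.public.user_demographics d
      ON i.user_id = d.user_id
    GROUP BY d.age_range;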
37:25
So that's really powerful and
37:27
it's easy to do because like you
37:29
just connect them together, you don't have
37:31
to set anything up. The downside is
37:34
you're now accessing a production
37:38
data store that like keeps your
37:40
website running from your query engine
37:42
that can be fine if you're
37:44
using like my SQL and you
37:46
have a bunch of reader applicants
37:48
for your, uh, your database that
37:50
can also be expensive because in
37:52
a transaction processing database
37:55
Is more expensive to run than an analytics database
37:57
for the amount of data. Though
38:00
sometimes you'll want to, you
38:02
instead copy that data into
38:04
your data warehouse. The other
38:06
reason that you want to
38:08
copy data in is that
38:10
sometimes you want historic data. So
38:12
you need the demographics for
38:14
that user when they saw
38:16
the ad. Especially when you're
38:18
doing stuff where there's money
38:20
involved and people are paying
38:22
for certain ad impressions or
38:24
you know you've got
38:26
products, you're selling
38:28
things, and like. You want to record
38:31
the state of the system at that point, so
38:33
a lot of times then you'll still be either
38:35
dumping the data daily. Or you
38:37
can, with a lot of work,
38:39
try something like CDC
38:42
and get a feed into a data
38:44
warehouse. It's very complicated today, so a
38:46
lot of times you'll want to mirror
38:48
the data in because you actually want
38:50
a point-in-time snapshot, because you
38:52
wanna know who these users are. You
38:54
want to reduce the pressure. So a
38:57
lot of people start with the live one and
38:59
then move to the other one when
39:01
they realize the cost or the pressure
39:03
on their database. Moving can be really
39:05
really complicated though. Like the tools there
39:07
are not good. The state of the
39:10
art, the best tools, are very challenging.
39:12
Absolutely. Digging
39:14
a little bit deeper in there,
39:16
I'm wondering if there are any other
39:18
differences that you see in terms
39:21
of the overall pipeline design, access,
39:23
and usage patterns that folks are
39:25
building around the usage of Trino
39:27
and Iceberg as compared to maybe
39:29
a warehouse or some of the
39:32
other lake house compositions that you've
39:34
seen. So the data
39:38
warehousing space I think in
39:41
general is kind of developing in
39:43
two different directions, especially in
39:45
the open data lake.
39:47
So there's a large swath
39:50
of people that are using
39:52
something like dbt to do
39:54
step-by-step transformations, and there
39:56
is a movement towards materialized views
39:58
where you just say, I want
40:01
to materialize a view of this query,
40:01
and here's the policy for keeping that
40:03
up to date. A lot of
40:05
people think they're equivalent, but they are not. So
40:08
materialized views are
40:10
about when you're querying that it's supposed
40:12
to be the equivalent as if you
40:14
just ran the underlying query and so
40:16
the data changes. Whereas like pipeline data
40:19
has the advantage and disadvantage that like
40:21
typically like you're processing on like, I
40:23
don't know, let's say a daily or
40:25
an hourly basis. If like the query
40:27
changes or something like that changes in
40:29
the pipeline, only future data will affect
40:32
it, which is good and bad depending
40:34
on what you're trying to accomplish.
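A minimal sketch of the materialized view side in Trino SQL (names are hypothetical, and the refresh here is manual, though a refresh policy can be layered on top):

    CREATE MATERIALIZED VIEW lake.analytics.daily_clicks AS
    SELECT date(event_time) AS day, count(*) AS clicks
    FROM lake.ads.impressions
    GROUP BY date(event_time);

    -- re-run the underlying query and store the fresh result
    REFRESH MATERIALIZED VIEW lake.analytics.daily_clicks;

So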
40:36
like, I think that's an important split that's
40:38
happening in the open community and I'm curious to
40:40
see which one's going to win. In
40:42
terms of like open data lakes
40:45
versus like proprietary ones, the biggest
40:47
difference is that people don't keep
40:49
all their data in their proprietary
40:51
data lakes. Just too expensive or
40:53
it's too complicated to move
40:55
it all in. Whereas like normally
40:58
people are storing all their data in S3,
41:00
whether it's a data lake or not, because
41:02
it's cheap and they can have a backup,
41:04
but you don't keep all your data in
41:06
Snowflake because it's either too expensive or it's
41:08
too much of a burden to keep all
41:10
the feeds to load it into their format.
41:12
You see the same thing with Redshift
41:14
and basically everything else out there. It's
41:17
like even if it were free, it's
41:19
still just annoying. And then
41:22
another consideration that folks have when they're
41:24
deciding whether or not they want to
41:26
use a lake house approach is sometimes
41:28
they have queries that need to be
41:30
able to operate very quickly. And so
41:32
that's where they'll typically bring in something
41:34
like a ClickHouse or a Druid
41:36
when they're dealing with fast
41:38
moving data that needs to be updated quickly.
41:40
And I'm wondering what you see as some
41:42
of the decision points
41:44
around going wholesale into one of
41:47
those systems or using those as
41:49
a supplement to a Trino and
41:51
Iceberg setup. Yeah, so
41:53
my experience with those systems
41:55
is that they're limited in
41:57
their capabilities. So they're
41:59
almost always used with a
42:01
custom application, especially in the case
42:03
of like Druid where it's not
42:05
standard SQL at all. Very
42:08
powerful, but you basically,
42:10
your application is custom written to
42:13
it. So you're not typically using
42:15
it for general analytics. And
42:17
if you're in that space, like you end up
42:19
having a lot of choices of different things you
42:21
can do. So in terms of
42:23
like fast moving data, I think
42:25
the open data lake is getting better at
42:28
this very fast. I
42:31
think that's a thing everyone's
42:33
focusing on. So with iceberg,
42:36
you now have the iceberg
42:38
appending stuff that came in
42:41
what, two years ago, three years
42:43
ago, like you see more and
42:45
more people using tools to take
42:47
data off of event
42:49
streams like Kafka and landing it
42:51
into tables at high resolution, and
42:54
then having background compaction jobs to
42:56
deal with the insane number of
42:58
files you create.
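For illustration, one hedged version of that landing-plus-compaction pattern, using Trino's Kafka connector to feed an Iceberg table (the topic, columns, and catalog names are hypothetical, and dedicated ingestion tools are often used instead):

    -- land events from a Kafka topic into an Iceberg table
    INSERT INTO lake.events.clickstream
    SELECT _timestamp, user_id, url
    FROM kafka.default.clickstream;

    -- background job: compact the many small files this produces
    ALTER TABLE lake.events.clickstream EXECUTE optimize;

And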
43:00
then downstream of that, there are a
43:02
bunch of vendors and open source projects
43:04
working on taking like, okay, so now
43:07
we have this new data, how do
43:09
we integrate that into the computations? I
43:11
would guess within a couple of years,
43:13
you're going to see everyone building something
43:15
around this, you know, it'll be like
43:18
everything else. A lot of them
43:20
will be bad, but I think
43:22
the overall community is going to
43:24
be more and more of bringing
43:26
in data at near real time
43:29
and being able to have it manipulated
43:31
in a near real time feed. That
43:34
said, that is near real
43:36
time, getting down to like
43:38
milliseconds, like anything under like
43:40
30 seconds typically
43:43
means you have a custom engine where as
43:45
you're bringing the feeds in, they're going into
43:47
main memory and they're being held in memory.
43:49
You can't even get them distributed to
43:52
disk. It's not fast enough. Those
43:54
I think will continue to be fairly
43:56
proprietary systems. They're kind of complicated to write.
43:58
So that's where you're going to see a
44:01
few vendors in that space. My
44:03
experience is that most people don't
44:05
need anything short of a minute.
44:07
Very rare to see that. The
44:09
reason you see, like, Hudi
44:11
came out of Uber is because
44:13
they were using their real time
44:15
system to adjust pricing on the
44:17
fly. Well, how many
44:19
how many organizations have that
44:21
problem? None, like, outside
44:23
of like delivery services; those are
44:26
like the only people I know
44:28
use those systems. Absolutely. And as somebody
44:30
who has been working in the
44:33
space for a number of years,
44:35
as somebody who is building and
44:37
investing in the lake house architecture
44:39
paradigm, and being very deeply entrenched in
44:42
that ecosystem, what are some of
44:44
the most interesting or innovative or
44:46
unexpected ways you have seen Trino
44:49
lake houses applied? So the
44:51
most interesting cases almost
44:53
always are custom applications.
44:56
I've seen so many
44:58
like standard warehouse stuff that like
45:00
they all kind of blend together
45:02
and be less interesting, because what's really
45:04
interesting is when someone builds a
45:07
custom application, especially, you know,
45:09
if they're building a custom data
45:11
store to match. So you have
45:13
things like, ah, companies that run
45:15
something big like a CDN
45:18
and stuff like that, building a
45:20
custom data store that hooks directly
45:22
into their CDN and
45:24
like shows the live
45:26
data feeds; and like security systems
45:29
where you're hooked into the live
45:31
security feeds; or ad systems like
45:33
what we had at
45:35
Facebook, for like hooking into the
45:38
live ad system; or A/B
45:40
testing, where you have a custom
45:42
data store built specifically for a
45:44
problem, with indexes that
45:47
are for petabyte scale data. You
45:49
can do really, really powerful
45:51
things with Trino because of
45:53
the way that the query engine
45:55
is extensible to add new types
45:57
and functions and all sorts of
45:59
stuff into it, and
46:01
end up with extremely
46:04
responsive systems that do
46:06
really custom things at big scale. That
46:08
said, you need a team of highly
46:11
skilled engineers to build something like
46:13
that, which is worthwhile. It's like,
46:15
this is your entire business. I
46:17
think the more common, interesting thing
46:19
is ingesting data and setting it up
46:21
and getting a bunch of people running
46:24
their queries, which is pretty mundane, but
46:26
it's like the power of when you
46:28
give your people access to data and
46:30
their ability to make better decisions is
46:32
just like, it's night and day. And
46:35
in your experience of building
46:37
these systems, working with customers, what are
46:39
some of the most interesting or unexpected
46:42
or challenging lessons that you've learned in
46:44
the process of working in this data
46:46
lake house ecosystem? I think
46:48
the most frustrating thing
46:51
is you run into different
46:53
requirement viewpoints on things. So
46:55
it's like, you think you
46:58
understand what people are
47:00
interested in and you start building that.
47:02
And then someone comes along and they're
47:04
like, no, I actually am very interested
47:07
in the opposite direction. So we had
47:09
a bunch of people that were interested
47:11
in, I don't care what the file
47:13
formats are, I just want this stuff
47:15
to go really fast. You have this
47:17
advantage of your ability
47:20
to move faster and build really custom
47:22
things. If you can change anything you
47:24
want at any time, it's actually a
47:26
huge advantage that the big proprietary vendors
47:28
have. Well, once you get to scale,
47:30
you can't really do that. But in
47:32
the early days, it's very fun. You
47:34
can move very fast. But
47:36
at the same time, like in
47:39
our space, the reality is
47:41
like, we are in this
47:43
open data space. So it's like if
47:46
I extend stuff and no one uses
47:48
it, I'm no longer in that space.
47:50
So it's often challenging to figure out
47:52
like, how do we thread the needle
47:54
of like, actually making things
47:57
a lot better without stepping
47:59
outside that bound. So
48:01
like we're doing a lot of work around iceberg
48:04
and iceberg maintenance. And we
48:06
spent a lot of time
48:09
thinking about like, Hey, should we just
48:11
be, like in Starburst, should we just
48:13
pull this into our separate
48:15
space? And then like, maybe we're
48:17
not even using iceberg manifest files.
48:19
Maybe we're using something else in
48:21
like a transactional database. And then
48:23
I can do indexing in ways
48:26
that are impossible right now. And
48:28
we decided that no, we're the open data lake space.
48:31
So it's like, we got to figure out how to
48:33
do it in the, in the open format.
48:35
Sometimes it's like we have augmented data in
48:37
special fields or sidecar files or that sort
48:40
of thing to be able to like, give
48:42
us the additional information that we need to
48:44
make our stuff go faster. Sometimes like you
48:46
get on Slack and you hit up Ryan
48:49
Blue and you're like, yeah, how about we
48:51
just add some, some stuff into the spec
48:53
to be able to handle this? Like I'm
48:55
sure everyone has this problem. So that's
49:00
the, the like, I want to move faster,
49:03
but I can't move faster thing. Like,
49:05
it drives me
49:05
nuts when it's like, I know there's a
49:07
better solution and it's like, I can't do
49:09
it without breaking and making
49:11
the thing proprietary and then, you know,
49:13
even then I have to like wait
49:16
for others to catch up. Absolutely.
49:18
For people who are in the
49:20
process of designing their data systems
49:22
or they're looking to build a
49:24
new set of capabilities in their
49:26
data platform, what are the cases
49:28
where a lake house architecture is
49:31
the wrong choice? So I,
49:33
I also would say a few years
49:36
ago, this answer was a lot easier.
49:38
I think nowadays the open
49:40
data lakes are very good,
49:42
or, I think what's helpful with
49:44
some of the vertically integrated players
49:47
is you don't have to
49:49
understand a whole lot, you're just, again,
49:51
you shop and you just use the tool. And I think
49:53
that's where data lakes suffered
49:55
like back to the original Cloudera stuff.
49:58
And if you were trying to install it,
50:00
they had like 10,000 choices of different
50:02
tools to install. It's like, I
50:04
just want to work with my data. So
50:06
it's like their entire idea was choice.
50:08
And that was the worst part about
50:10
their product. It was like too much
50:12
choice. I think we've done a great
50:14
job at Starburst around like simplifying, getting
50:16
started on your Lake and getting going
50:18
in your Lake. You also had this
50:21
problem in the past where I would
50:23
say that there's a lot of people
50:25
who feel like they need to
50:27
use a Lake cause they heard of it, or a data
50:29
warehouse in general, and they
50:31
don't actually have a data warehouse problem. Like
50:33
they could just use Postgres and don't
50:35
have a lot of data. Also, we
50:37
see a lot of people that want
50:40
to do Federation and they don't understand
50:42
like Federation is like, we just send
50:44
queries to the other system and they're
50:46
like, well, it'll make my stuff faster.
50:48
And so I don't think we've done
50:50
a great job of describing when you
50:53
would choose to even move to a
50:55
data warehouse. And then in terms of
50:57
like proprietary versus non, it's a, it's
50:59
a tough choice. They can get yourself
51:01
going, but they can be very expensive,
51:04
complex to manage. And you're bolted into that thing.
51:06
Like I don't know if you've ever seen someone
51:08
try to move from a traditional
51:10
warehouse to an open one. It's not super easy.
51:13
I don't want to say it's hard. Like we
51:15
do a lot of business with moving people onto
51:17
the lake, but it would be, would have been
51:19
a lot easier if they had started on the
51:22
Lake. Absolutely. And as
51:24
you continue to build and iterate on
51:26
the Trino platform and the Starburst product,
51:28
what are some of the things you
51:30
have planned for the near to medium
51:32
term or any particular projects or problem
51:34
areas you're excited to explore? So
51:37
on the open source side, there's
51:39
a bunch of stuff I'm very
51:42
interested in around how
51:44
we can spin people up on
51:46
Trino in a faster
51:49
and easier way. So
51:51
we're doing more around
51:53
simplifying the setup, simplifying
51:56
the installation process, making
51:58
it work in
52:00
smaller environments, things like that,
52:02
better integrations with the different
52:04
ecosystems. Like, I want to
52:06
see much more work done
52:08
with better integrations with the
52:11
Python ecosystem in particular.
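As one concrete example of an integration that exists today, the trino Python package ships a SQLAlchemy dialect, so tools like pandas can treat Trino as just another database. A minimal sketch with placeholder connection details (install with: pip install 'trino[sqlalchemy]' pandas):

    import pandas as pd
    from sqlalchemy import create_engine

    # SQLAlchemy URL format: trino://user@host:port/catalog/schema
    engine = create_engine("trino://analyst@trino.example.com:8080/iceberg/sales")

    # Pull a query result straight into a DataFrame.
    df = pd.read_sql("SELECT * FROM orders LIMIT 100", engine)
    print(df.head())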
52:14
One of the big areas that
52:16
I have been focusing on recently
52:18
has been around how you actually
52:21
set up Trino. So historically,
52:24
Trino was designed and operated as:
52:26
you had a data lake with
52:28
Hive in it, and then maybe
52:30
Spark in it. And you're adding
52:33
Trino because both of those query
52:35
engines are really slow and not particularly
52:38
good to use. Now we're at the
52:40
point where a lot of people
52:42
just run it and they don't have Hive and Spark.
52:45
So there were things we would assume would
52:47
already exist because you have those other tools.
52:50
Like now we're going back and adding a
52:52
bunch of things where normally
52:55
you would have just fired up the
52:57
Hive console and run some commands and
52:59
you just don't have that anymore. So
53:02
another big area is you set
53:04
up Trino and it's like, oh, you want to set
53:06
up a new catalog. And
53:09
in the old days, you knew what you wanted
53:11
to connect to because you already had a data lake
53:13
and so you just created this little catalog file,
53:15
modified it, and restarted your server until
53:17
things worked. Well, that's just not how people do
53:19
it anymore. Now they fire up Trino and
53:21
it's like, okay, I want to connect to my
53:23
S3, and instead of going to edit
53:25
a file, I can run a SQL
53:27
command. So we recently added a bunch of
53:29
stuff around CREATE CATALOG and DROP CATALOG. There's
53:31
still more to be done, like ALTER
53:33
CATALOG. Right now it's still just under
53:35
the covers modifying like local
53:37
files, but we have some work on like
53:40
putting it into a real database.
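For a sense of the difference: the old way was hand-editing a properties file such as etc/catalog/lake.properties on every node and restarting the server; the newer way is plain SQL. A minimal sketch through the Python client, assuming the coordinator has dynamic catalog management enabled; the catalog name and connector properties here are placeholders:

    import trino

    conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="admin")
    cur = conn.cursor()

    # CREATE CATALOG replaces writing a properties file and restarting.
    cur.execute("""
        CREATE CATALOG lake USING iceberg
        WITH (
            "iceberg.catalog.type" = 'rest',
            "iceberg.rest-catalog.uri" = 'http://rest.example.com:8181'
        )
    """)
    cur.fetchall()  # drain the result so the statement completes

    # Removing a catalog later is just as direct.
    cur.execute("DROP CATALOG lake")
    cur.fetchall()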
53:42
It's funny, you think about this
53:44
in the Trino ecosystem and it's
53:46
like, what do you mean you're not storing
53:49
your catalogs in a normal catalog system?
53:51
It's like we never needed to. And
53:53
it's like at Starburst, like with Galaxy,
53:55
we've had this from the beginning: you go into
53:57
the UI and you just modify your catalogs.
54:00
Everything's kind of live-ish,
54:02
getting even more live with these
54:04
changes we're putting into Trino.
54:06
So you will be able
54:08
to just add catalogs and remove
54:10
them a lot more easily and maintain
54:12
your system, and put in a bunch
54:14
more stuff around like data evolution
54:16
and things like that. So I'm
54:18
excited about this. Like how do
54:20
we bring more people into this
54:23
community? Because I think we're
54:25
very much at the point where
54:27
the difference between what I can
54:29
do in a traditional data warehouse and what I
54:31
can do in Trino is
54:33
a much, much smaller gap. Like when we
54:36
started Trino, we're like, we're going to be
54:38
able to take out traditional data warehouses with
54:40
this. Like we're going to build something that's
54:42
as good as that. We're 10 years in,
54:44
and I think for the vast majority
54:46
of cases we've been able to take
54:48
them out for years and years and years,
54:50
but it's like this new user
54:53
case, I think is like the one
54:55
remaining spot. And like when we started
54:57
this project, we said, it's going to
54:59
take 10 years. I think we're
55:01
there; we just need a
55:03
little bit more. And I think we will
55:05
have covered pretty much everything all the way
55:07
down to like a new user with like
55:10
a couple of files they want to process.
55:13
It's funny how persistent
55:15
that 10 year time horizon is. Pretty
55:17
much every time I talk to somebody
55:19
who has built or is building a
55:21
database engine, they always say it takes
55:23
10 years before you get it right.
55:26
Yeah. The other thing they don't say is
55:29
like, it kind of takes five years before,
55:31
you know, it kind of doesn't
55:33
suck. You know, it was pretty good,
55:35
but, you know,
55:37
we didn't have the ability to write
55:39
tables for the first year. It was like, we've
55:41
got data, we've got Hive, it's writing data
55:44
for us; we'll just run queries that
55:46
select the data out. So the
55:48
amount of stuff from like, "oh, this is
55:50
actually interesting, it kind of works," to
55:53
"I can use it everywhere" is
55:55
like, people have no idea. Absolutely.
55:58
It's amazing how many products have been built
56:00
because the person building it didn't realize how hard
56:03
it was going to be. Yeah.
56:05
Yeah. I honestly, I think that's
56:07
almost every project I work on is like,
56:10
if I knew, I probably wouldn't start. And
56:14
are there any other aspects of the work that
56:16
you're doing on Trino and this overall space of
56:18
the data lakehouse ecosystem, the combination of Trino
56:20
and Iceberg that we didn't discuss yet that you'd
56:22
like to cover before we close out the show?
56:25
I think we actually covered all of it. All
56:27
right. Well, for anybody who wants to get in
56:29
touch with you and follow along with the work
56:31
that you're doing, I'll have you add your preferred
56:33
contact information to the show notes. And as the
56:36
final question, I'd like to get your perspective on
56:38
what you see as being the biggest gap in
56:40
the tooling or technology that's available for data management
56:42
today. I really, really think
56:44
we need a big improvement in
56:46
the security space. And I don't
56:49
really care what it is other
56:51
than like, it needs to work
56:53
well with things like Trino and
56:55
the maintenance, like the amount of
56:57
complexity you have to go through
56:59
to set those policies. You have
57:01
to learn a new language. That's
57:04
way too complicated. And frankly, even
57:06
if you do learn the language, you're
57:08
going to get the policies wrong because
57:10
you're not an expert in it.
57:13
The models are too complex. The other
57:15
space is, I still think it's too hard
57:17
to get data into the lakes. It
57:19
just needs to work and land and
57:21
be maintained and like, you shouldn't have
57:23
to think about it. It should
57:25
always work and be low
57:27
cost and data just shows up. Like
57:29
why do I have to worry about,
57:31
you know, all the feeds? All
57:35
right. Well, thank you very much for taking
57:37
the time today to join me and share
57:39
the work that you and your team have
57:41
been doing on bringing the data lake house
57:43
ecosystem into a better place and all the
57:45
work that you're doing to build the Starburst
57:47
product definitely makes the onboarding a lot easier
57:49
for folks. So I definitely like the work that
57:51
you and your team are doing there. So
57:53
thanks again for taking the time and I
57:55
hope you enjoy the rest of your day.
57:57
Thank you. This is great. Thank
58:03
you for listening.
58:05
Don't forget to check out our other shows: Podcast.__init__, which covers the
58:10
Python language, its community, and the innovative
58:12
ways it is being used, and the
58:14
Machine Learning Podcast, which helps you go
58:16
from idea to production with machine learning.
58:18
Visit the site at dataengineeringpodcast.com to subscribe
58:21
to the show, sign up for the
58:23
mailing list, and read the show notes.
58:25
And if you've learned something or tried out a product from the
58:28
show, then tell us about it. Email
58:30
hosts at dataengineeringpodcast.com with
58:32
your story. And to help other people find
58:35
the show, please leave a review on Apple
58:37
Podcasts or tell your friends.