Episode Transcript
0:11
Hello and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Data
0:16
lakes are notoriously complex. For
0:18
data engineers who battle to build and
0:21
scale high-quality data workflows on the data
0:23
lake, Starburst powers petabyte-scale SQL analytics fast
0:25
at a fraction of the cost of
0:28
traditional methods so that you can meet
0:30
all of your data needs, ranging from
0:32
AI to data applications to complete analytics.
0:35
Trusted by teams of all sizes, including
0:37
Comcast and DoorDash, Starburst is a data
0:39
lake analytics platform that delivers the adaptability
0:42
and flexibility a lakehouse ecosystem
0:44
promises. And Starburst does
0:46
all of this on an open architecture,
0:48
with first-class support for Apache Iceberg, Delta
0:50
Lake, and Hudi, so you
0:52
always maintain ownership of your data. Want
0:55
to see Starburst in action? Go
0:58
to dataengineeringpodcast.com/Starburst and get
1:00
$500 in credits to
1:02
try Starburst Galaxy today, the easiest and
1:04
fastest way to get started using Trino.
1:07
Your host is Tobias Macey, and today
1:10
I'm interviewing Jignesh Patel about the research
1:12
that he is conducting on technical scalability
1:14
and user experience improvements around data management.
1:16
So Jignesh, can you start by introducing
1:18
yourself? Yes, hi.
1:20
Well, nice to talk to you and
1:23
to your audience. I'm Jignesh Patel. I'm
1:25
a professor in computer science at
1:27
Carnegie Mellon. I've been working
1:30
in the area of data for
1:32
about 25 years now and
1:34
been working on things in data
1:36
across the spectrum through the different
1:39
ages that the data ecosystem has
1:41
gone through from parallel databases to
1:44
streaming databases to mobile databases to
1:46
using databases for genomics and proteomics
1:48
and other biological applications to where
1:50
we are right now, where we
1:52
are trying to use gen
1:54
AI and make data analytics far
1:57
easier for humans to get insights from
1:59
data. And you mentioned that you've been
2:01
in this space for a while. Do you remember how you
2:03
first got started working in data? Yeah,
2:06
I first started working in data when I
2:08
came to the University of Wisconsin as a
2:10
grad student. This was in the early 90s.
2:13
And I actually came here
2:15
to do computer architecture. But
2:17
Wisconsin has an amazing group.
2:20
It had one of the leading groups at
2:22
that time in databases. And
2:24
once I started taking a couple of
2:26
classes in there, that's how I decided
2:28
to switch over to databases. So
2:31
it was not the plan that I
2:33
had, but it was the strength
2:36
of the group that was at Wisconsin at
2:38
that time that really drew me into databases.
2:41
You are, as you said, a professor.
2:43
You work at Carnegie Mellon, which is
2:46
one of the leading schools for database
2:48
research today. And I'm
2:50
wondering if you can just start by
2:52
giving a bit of a summary of
2:54
some of the current areas of research
2:56
that you're focused on and what it
2:58
is about those subjects that motivates you
3:00
to invest the time and energy required
3:02
to gain meaningful results. Perfect.
3:05
Sounds great. And maybe a little bit of a context.
3:08
Carnegie Mellon is where many computer scientists
3:10
will say is where AI was invented.
3:13
And if you go back to the
3:15
birth of the study of data in
3:17
academia, Wisconsin, Berkeley, and Purdue were
3:19
among the earliest schools that really started
3:21
to do that. So I've been really fortunate
3:23
to be at powerhouses of data and AI.
3:26
And of course, at Carnegie Mellon, there is
3:28
both data and AI that is
3:30
present today. Now, of course,
3:33
the data research ecosystem
3:35
and product ecosystem have gone through different
3:37
phases. Where my research
3:39
is today and where I think
3:42
many of the interesting, forward-looking research
3:44
problems are, and today's
3:46
forward-looking research problems are very likely products that
3:48
will make a difference in a few years,
3:51
is along the two edges. I just alluded
3:53
to how I started initially as
3:55
a grad student being attracted to
3:57
architecture, which is making processors
4:00
and storage devices and things like that that
4:02
get used as the computing substrate on
4:04
which you build your algorithms and software. And
4:07
today my research is broken into two parts. One is
4:09
on the architecture end of the spectrum
4:11
and the other is on the human end of the
4:13
spectrum. So think about what we
4:15
do in data platforms today, right? The data
4:18
platforms are largely software. They will run on
4:20
some hardware and we want these
4:22
data platforms to work with large volumes of
4:24
data. We want them to be extremely fast
4:26
and we want them to be versatile. And
4:30
of course we want all of that to happen
4:32
in a cost-effective fashion. At the
4:34
other end, we want these data platforms to
4:36
be very easy to use by humans of
4:39
all types, not just programmers, and there's a
4:41
ton of research in there. So
4:43
the first part of my research, which
4:45
is purely in academia right now, is
4:48
on the data architecture side. So
4:50
what's the interesting aspect over there?
4:54
So here's the backdrop. In
4:56
many enterprises, data has been
4:58
doubling in size, roughly
5:00
every two-ish years or so. And this is a growth
5:03
that has been happening for 30, 40
5:06
years for many organizations. In
5:08
the past, the way you dealt with that is
5:11
to say, okay, I've got a data platform.
5:14
It's doubling in volume every few
5:16
years. I obviously can't pay twice
5:18
for all of my analytics, all of my queries
5:20
every two years, that would be unsustainable. So I
5:22
need to keep the cost the same or at
5:25
least, or perhaps even better start to
5:27
lower that. The one
5:30
big boost we used to get in the past
5:32
for data platforms to meet up with that demand
5:34
while keeping costs constant was to
5:36
say, let's just upgrade to the latest hardware because
5:39
everyone was riding Moore's law
5:41
and the underlying principles of Dennard
5:43
scaling, which means if I upgraded
5:45
my computing substrate to
5:47
the latest generation of storage, compute,
5:50
and memory devices, which
5:52
were all 2x faster, and
5:54
if my data volume doubled, I'm
5:56
getting that constant cost perspective on
5:58
my analytics pipeline.
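To make that constant-cost arithmetic concrete, here is a minimal sketch (my numbers, not the episode's):

```python
# Back-of-the-envelope model of the Moore's-law-era bargain: if scan
# throughput doubles every generation along with data volume, the time
# (and therefore the cost) of a full pass over the data stays flat.

data_tb = 10.0                 # assumed data volume at generation 0
throughput_tb_per_hr = 2.0     # assumed scan throughput at generation 0

for gen in range(4):
    hours = data_tb / throughput_tb_per_hr
    print(f"gen {gen}: {data_tb:.0f} TB / {throughput_tb_per_hr:.0f} TB/hr = {hours:.1f} hr")
    data_tb *= 2               # data doubles every couple of years
    throughput_tb_per_hr *= 2  # hardware used to keep pace; this is what ended

# Prints 5.0 hr at every generation. Freeze the throughput line and the
# hours (and the bill) double with the data instead.
```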
6:01
But all of that has stopped. And a big part
6:03
of my research at Carnegie Mellon now is
6:05
how do we build long-term sustainable platforms
6:08
where we can keep up with this
6:10
growth in data demand. And it's not just
6:12
growth, but we are asking deeper and deeper
6:14
questions of data that pushes additional stress and
6:17
still have this cost balance. The
6:19
gift of Moore's law hasn't fully
6:22
ended yet, but we all know that
6:25
five years out, it probably doesn't keep giving us
6:27
the dividends it had for the last 30, 50
6:29
years. So that's
6:31
one end of the spectrum. And the other
6:33
end of the spectrum is on using gen AI to
6:35
make data platforms more programmable. And I can talk about
6:37
that other part, but before that, let me turn it
6:40
over to you to see if you have questions. You
6:43
mentioned Moore's law as our saving grace
6:45
for the past few years. And
6:48
we are still somewhat benefiting from that by
6:50
increasing the number of transistors, but we're not
6:53
getting better clock speeds. We are adding
6:56
more cores, we're starting to reach the logical limit of
6:58
that as well. And as we go
7:00
down the nanometer scale, we start to hit physical limitations
7:02
of what we can even fit on a chip, which
7:05
brings up the specter of quantum computing. And I'm
7:07
wondering what the viability is
7:09
of that as our saving grace
7:11
for the next few decades. And
7:13
if there's any analogous equivalent in
7:15
quantum processing to the idea of
7:17
Moore's law. Yeah, great
7:20
question. You pointed out that Moore's law
7:22
is not dead. I agree. Not only
7:24
are we still
7:26
getting denser packaging of transistors, but it's
7:28
also the big thing that's happening is,
7:30
now we are going 3D, right? You're
7:32
seeing storage and chips all becoming three-dimensional.
7:34
It used to be all planar and
7:36
two-dimensional. So there's some life in that
7:38
packaging stuff, but still, the energy
7:40
profile is an important component when you
7:42
start doing 3D packing. Yes, you can
7:45
get more transistors pushed in,
7:47
but now the heat dissipation becomes a
7:49
problem. So we'll continue to get the gift
7:51
of Moore's law or
7:53
the behavior that we've been expecting of
7:55
hardware for a little while, but not
7:58
forever. I don't think anyone's saying that beyond
8:00
the decade we are going to keep seeing that. And even
8:02
that for some is a stretch. Great
8:04
question about quantum computing and that
8:07
certainly has the potential to revolutionize
8:09
certain aspects of computer science, especially
8:12
the ones in which you're trying to
8:14
solve an algorithmic problem and trying to
8:16
find some optimization stuff, huge opportunities potentially
8:18
over there and of course in crypto.
8:21
But there's a well-known result now
8:24
from more than two decades ago
8:26
that for some of the core
8:28
data problems like sorting, you
8:30
can't do it any faster even if
8:32
you have an ideal quantum computer. So
8:35
furthermore, you know, we are at
8:37
this point many organizations are working
8:40
with terabytes and many organizations
8:42
are working with petabytes of data. You
8:44
can't even push all of
8:46
the data through a compute unit. So
8:48
quantum computing for this type
8:51
of data analytics, I don't think
8:53
that's a possibility at least as far as I can see.
8:56
Certainly might have implications in certain smaller
8:58
components of what you do in the
9:00
broader data ecosystem but it's a different
9:02
problem space. So we
9:04
need to start finding ways
9:06
to get the data-to-insights
9:08
pipeline through
9:11
more traditional methods and nothing
9:13
else other than the traditional
9:15
semiconductor based hardware substrate ecosystem
9:18
is likely to be the answer for a
9:20
very long time. And
9:22
also with quantum, it will likely
9:25
bring up the same problems that
9:27
we're having now with GPUs where
9:29
it is a co-processor, it's not
9:31
going to supplant classical computing and
9:34
we're likely to hit a point where as
9:36
it gains popularity and adoption, we're not going
9:39
to have enough capacity for it. And so
9:41
I wonder if then we'll end up
9:44
back in the time sharing model of everybody can submit their
9:46
requests in batch and you just have to wait for it
9:48
to come back. Yeah,
9:50
and look, I'm not an expert in
9:53
quantum computing but today you can go
9:55
and rent a quantum computer in many of
9:57
the cloud providers. Yes, it is
9:59
harder to get time on that,
10:02
definitely compared to a GPU. A
10:05
co-processor often in a
10:07
data intensive environment, the co-processors have to
10:09
be sitting very close to each other
10:11
because the IO, the cost of transferring
10:13
data from one side to
10:15
the other is often the bottleneck called
10:17
the von Neumann bottleneck. That's already a
10:19
big problem in CPU GPU databases. We
10:22
don't know how to use GPUs that
10:24
well for large scale data platforms.
10:26
And there's some big companies that are doing
10:28
that. One of the leading companies that does
10:30
that is Voltron Data down in the Bay
10:32
Area. But there are lots of hard
10:34
problems, even with simpler processing substrate. And I
10:36
would say for, as I said,
10:39
I'm not an expert in quantum computing, but that's
10:41
not something I think nearly anyone
10:43
is really looking at as a viable
10:45
computing substrate for the type of data
10:48
processing. For cryptography, code cracking, stuff like
10:50
that, obviously that's where all the excitement
10:52
is. But for the data land, I
10:54
think that's quite far out. There have
10:56
been research papers that have explored using
10:58
it for certain components, but
11:01
nothing I can see becoming mainstream
11:03
anytime soon for very fundamental
11:05
reasons. And unless those fundamental reasons
11:07
get solved, which probably is a totally
11:09
different type of quantum computers and totally
11:12
different ways of getting data in and out at
11:14
high speed, that's not
11:16
a viable path for the data direction.
11:19
Continuing on your point of IO being
11:21
the biggest bottleneck as we scale the
11:23
volume and complexity of data and the
11:26
types of analytics that we're trying to
11:28
build on top of them, what
11:31
are the future directions that we can
11:33
look to to try to realize that
11:35
either constant or declining cost as the
11:38
volumes of data increase and whether that
11:40
is in terms of the physical hardware
11:42
or some of the semantics
11:45
of how we work with data or ways that
11:47
we think about storing and accessing data and wondering
11:49
what are some of the areas of research that
11:51
you're focused on to help address those problems? Yeah,
11:55
that's a great question. So the
11:57
part that we are focused on is something a
11:59
little speculative, which computer scientists and architects and data
12:01
folks have been coming back to for a
12:03
fair amount, which is to say the traditional
12:06
von Neumann architecture says that I've got
12:08
a compute device and I've got a
12:10
storage device. They are connected by some
12:12
communication component. And then you
12:15
have to pull the data through that
12:17
communication channel to the computing device to
12:19
do stuff on it. And when you're done
12:21
computing, you push it back. Right. So
12:23
there's two separate devices. And today that's
12:25
largely how your laptop
12:27
or your individual server or your
12:29
phone works to where entire
12:32
cloud data centers have a compute portion
12:34
of the cloud and a storage portion
12:36
of the cloud. So that version of
12:38
separation of compute and storage exists everywhere.
12:40
But as you can imagine, this is
12:42
very inefficient in many data analysis pipelines.
12:45
You are going to scan a large amount of data
12:47
and really the core of the compute that you're going
12:49
to do is going to be on a very small
12:51
fraction. And many times
12:54
in many data pipelines, you
12:56
have a very small
12:58
number of cycles per byte of data that
13:00
you're going to access. So where there's been
13:03
this idea in different forms for the last
13:05
30 years, which is to say,
13:07
can we push compute to the storage?
13:09
Right. Why are we bringing data
13:11
through effectively a narrow straw that
13:13
is relatively getting narrower and narrower
13:16
because the device capacity for storage
13:18
is increasing faster than the channel
13:20
capacity to pull data out. Why
13:22
can we not think about
13:24
devices as pure storage devices and pure
13:26
compute devices, but have devices that can
13:29
do storage and compute. So you're not
13:31
pulling stuff in and out of the
13:33
device and then pushing it across these
13:35
two separate modes of working with data.
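As an illustration of the difference (a toy sketch of the idea, not any specific device or product), compare pulling everything to the compute side with pushing a cheap filter into the storage layer:

```python
# Toy contrast between the pull-everything model and compute-in-storage.

class Storage:
    """Stands in for a storage device holding some rows."""
    def __init__(self, rows):
        self.rows = rows

    def read_all(self):
        # Classic model: every row crosses the channel to the compute side.
        return list(self.rows)

    def read_where(self, predicate):
        # Compute-in-storage model: a cheap filter runs next to the data,
        # so only matching rows cross the (relatively narrowing) channel.
        return [r for r in self.rows if predicate(r)]

store = Storage(range(1_000_000))

# Pull model: a million values move, then the filter runs on the compute side.
pulled = [r for r in store.read_all() if r % 97 == 0]

# Pushdown model: the same filter runs inside storage; about 1% of rows move.
pushed = store.read_where(lambda r: r % 97 == 0)

assert pulled == pushed
```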
13:38
And so this idea of pushing compute inside
13:40
storage or pushing compute closer to storage has
13:42
been around for 30 years in a variety
13:44
of different forms. Where we are,
13:47
we are spending a fair amount of
13:49
time looking at that. What has
13:51
been missing in all of that work so far? By
13:53
the way, none of that has quite become a reality
13:55
just yet, right? You still have this separation, as
13:57
I just said, even cloud at a high level.
14:00
has the separation principle for a
14:02
variety of reasons, but it's inefficient.
14:04
The reason why a lot of
14:06
these techniques have not had a
14:08
big commercial impact
14:11
is because it's very hard to
14:13
figure out what's the right amount of compute to
14:15
push into the storage without blowing up the cost
14:17
of manufacturing this device. So if I
14:20
said I've got memory or I've got
14:22
flash storage and I want to put
14:24
smart compute inside that, by the way,
14:26
we already do that in many forms
14:29
in practical storage devices that you see
14:31
today. The question is how much compute do
14:33
I put in there? How programmable is
14:35
that compute? And what else can I
14:37
do with that? And that's where all of those
14:39
considerations, because many of these storage devices are very
14:41
low margin devices and if you say I'm gonna
14:44
put five dollars more in a hundred dollar device
14:46
that's way too much. Even a dollar is sometimes
14:48
a little too much. So what we
14:50
are looking at is taking a very fundamental,
14:53
arguably a very theoretical
14:56
and a very academic approach, which is
14:58
to go down and pretend like we were in the 1960s or
15:00
1950s when
15:02
we were just starting to build these
15:04
computing systems. So I'll give you
15:06
an example of a very fundamental technique, a question that
15:08
we are asking. Today if I
15:10
represent a number and store
15:12
that in a digital form, I'm
15:14
going to convert that into a
15:17
two's complement representation and store it
15:19
in that device. For
15:21
the rest of this, I'll make my example be in decimal
15:23
form, right? So imagine I've got four
15:25
digit numbers that I want to store
15:27
and I'm storing the number thousand which
15:29
would be one zero zero zero and
15:31
that's in decimal form. The number two
15:33
thousand three hundred and fourteen would be
15:35
two three one four and
15:37
so on. Now imagine I had numbers
15:40
in there that were like five
15:42
and six and stuff like that and
15:44
if you look at the digital representation of
15:46
that, all the leading digits in that are
15:49
going to be zeros. And
15:51
what we do typically in the computer
15:53
is when you're storing just let's say
15:55
an array of numbers, we store it
15:57
so that we have the first number
15:59
represented in storage first and the
16:01
second number and so on. Now when
16:03
you're searching for these numbers and I say find me
16:05
everything that is less than five,
16:08
I'm actually going to go through all the digits for
16:11
all the numbers before I can find my answer. But
16:14
now imagine we said we're going to represent numbers
16:16
in a totally different way. I'm just going
16:18
to represent the thousands position for
16:21
the number first and keep
16:23
the thousands digit for all the
16:25
numbers together. So if I go
16:27
and fetch some data from memory, I'll
16:29
just get the thousands value
16:31
for each of the numbers first before
16:34
I get the hundreds place, the tens
16:36
place, and the units place.
16:38
And now with that, you can come up with a
16:40
completely different class of algorithms because let's say I've got
16:43
10 numbers and I just look at the
16:45
thousand digit value for that. And
16:47
if all of them are non-zero
16:49
and I'm looking for everything that is
16:51
equal to five or less than five,
16:53
I can simply say, of these 10
16:56
numbers, I don't even need to look at the
16:58
last few digits for any of them.
17:00
I can algorithmically guarantee you that this answer
17:02
is not present in this
17:04
set of numbers. So that's the
17:06
way we are thinking. We are going back to
17:08
early design and say, what's the fundamental encoding
17:10
of numbers? What's the fundamental
17:13
way we want to represent them in
17:15
storage and then can we come up
17:17
with a completely new class of algorithms
17:19
that have algorithmic superiority
17:22
in search compared to existing methods?
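To make the digit-major idea concrete, here is a minimal Python sketch (mine, not from the episode) of storing all the thousands digits together, then the hundreds, and so on, so a search can prune whole numbers from their leading digits alone:

```python
# Minimal sketch of a digit-major layout with early-pruning search.

def digit_major_encode(numbers, width=4):
    """Lay out `width`-digit decimal numbers digit-position-first:
    planes[0] holds every number's thousands digit, planes[1] the hundreds..."""
    digits = [f"{n:0{width}d}" for n in numbers]
    return [[int(d[pos]) for d in digits] for pos in range(width)]

def values_at_most(planes, bound=5):
    """Indices of values <= bound (bound < 10), reading as few planes as possible."""
    candidates = set(range(len(planes[0])))
    # Any nonzero leading digit proves the number is too big to qualify.
    for plane in planes[:-1]:
        candidates = {i for i in candidates if plane[i] == 0}
        if not candidates:
            return set()   # pruned without touching the remaining digit planes
    return {i for i in candidates if planes[-1][i] <= bound}

nums = [1000, 2314, 5, 6, 9040, 3]
planes = digit_major_encode(nums)
print(values_at_most(planes, 5))   # {2, 5}: the values 5 and 3
```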
17:25
So we think that in this space, there are two
17:27
ways we will win and solve
17:30
this long-term data problem. One is by
17:32
rethinking algorithms from ground up to
17:35
be aware that storage and compute can
17:37
go together and I can push
17:39
specific algorithms that require very low
17:41
computational effort and get me this
17:43
benefit. And the second is to
17:45
design what are those computing substrates that
17:48
are low cost, very cheap, and can
17:50
actually be put in an economical
17:52
way in the storage devices. So
17:54
it's a long answer and futuristic, but that's
17:56
kind of the way we are thinking. You're
17:58
imagining, let's imagine it's... research
20:00
problems in that entire space.
20:03
And there are some automated tools that are
20:05
out there to help you with that. This
20:09
also brings to mind some of the lessons
20:11
that we learned from the beginnings of the
20:13
big data era where the common wisdom
20:16
at the time was just throw all the data
20:18
in there, it'll be useful eventually, we don't know
20:20
what we're gonna do with it right now, but
20:22
just keep it all. And now as
20:25
big data has become more widely adopted,
20:27
we have a better understanding of how
20:29
to actually apply useful algorithms and analytics
20:32
on top of that data and the
20:34
regulatory environment has shifted. It's very much
20:36
a matter of only storing the data
20:38
that you actually have utility for
20:40
because otherwise it's going to cost
20:43
you both monetarily and potentially in
20:45
terms of reputation if there's
20:47
a breach or in terms of
20:49
fines if you are violating any
20:51
regulations. And I'm wondering what
20:53
you have seen in terms of
20:55
some of the ways
20:57
that we can design systems to assist in that
20:59
upfront pruning of data rather than just throwing all
21:02
the data in a big black box and hoping
21:04
that we get some value out of it down
21:06
the road. Yeah, no
21:08
great question. I think there's still a lot of that
21:12
which you described which is throw the data in
21:15
and find value later. One of
21:17
the big transformations that has happened is that in the
21:19
past people would say to construct a data analysis
21:21
pipeline, I'm going to
21:24
extract, then I'm going to
21:26
transform it, and then load it into
21:28
a database then start my analysis. Then
21:30
there's a paradigm shift potentially of saying
21:32
I'm going to extract load and transform
21:34
so I don't need to get the
21:37
schema exactly right up front.
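As a rough sketch of that ELT shape (illustrative only; the file and column names here are made up), an engine like DuckDB lets you load the raw files first and impose a schema only when you transform:

```python
# Illustrative ELT flow with DuckDB; 'events.parquet' and its columns are made up.
import duckdb

con = duckdb.connect()

# Extract + Load: pull the raw Parquet in as-is, no upfront schema modeling.
con.sql("CREATE TABLE raw_events AS SELECT * FROM read_parquet('events.parquet')")

# Transform: impose the structure an analysis needs, when it needs it.
daily = con.sql("""
    SELECT CAST(event_time AS DATE) AS day, COUNT(*) AS n
    FROM raw_events
    GROUP BY day
    ORDER BY day
""").fetchall()
```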
21:39
But more realistically now especially when you
21:41
see things like lake houses and stuff
21:44
like that the whole idea is throw
21:46
the data in in some storage subsystem
21:48
which may be structured semi-structured or unstructured
21:50
have some sort of metadata, maybe a
21:52
metadata manager, that could evolve over
21:54
time. And then you're building your
21:56
data analysis pipelines knowing that all of these
21:59
components are not uniform anymore. Maybe,
22:01
for a specific task, I'm trying
22:03
to build a machine learning model to do
22:05
something. I may be looking at some portion
22:07
of the data sitting in a structured database,
22:10
a relational system, maybe Snowflake or
22:12
something like that. But at the same time I
22:14
may have new data that has
22:16
come in and is sitting in Parquet files,
22:18
or maybe even in unstructured files sitting
22:21
in the file system, and I've written some
22:23
sort of custom code in Python to extract
22:25
something from it, blend all of this together
22:27
to get some
22:29
real-time features from that, built into a pipeline. So
22:32
data is everywhere. Having very linear
22:34
ways of seeing data, where it has
22:36
to be processed through one stage before it goes to
22:39
the next, that's often no longer the predominant method
22:41
in many emerging applications. What enterprises want
22:43
is flexibility, so that you can deal
22:45
with data and not have to wait for
22:48
it to be formally loaded into the
22:50
warehouse before you can do things, because
22:52
sometimes the speed with which you're getting
22:54
insights from data that's constantly arriving is
22:57
really the highest value proposition, right? The
22:59
value of an insight sometimes decays with
23:01
time, the longer you have to wait
23:04
for the data to have
23:06
flowed through the process, through the human
23:08
or engineered loop, before we can do
23:10
any analysis with it,
23:12
value is often lost. The
23:14
whole ecosystem is evolving, but it's
23:16
really clear that we want more
23:18
flexible, compositional structures of being able
23:20
to do structured data and unstructured
23:22
analysis. Because analysis today often
23:24
means very traditional types of analysis, stuff
23:26
that people were doing, that business intelligence type
23:28
of stuff, to sort of
23:30
more augmented methods that may use
23:32
machine learning to drive insights, perhaps even
23:34
still in structured form, and then
23:37
the third piece is sort of unstructured,
23:39
where you're dealing with richer sets of data. With
23:42
all of that, one of the big challenges
23:45
is it's becoming harder and
23:47
harder to write analysis pipelines,
23:49
and it's very programmatically
23:52
driven today. So there's been a ton
23:54
of work where people have talked about
23:56
no-code and low-code methods to allow people
23:58
to do analyses of this sort. And
24:01
this is kind of where the other
24:03
spectrum of my research is in using
24:05
GenAI to allow people to generate these
24:07
analysis pipelines, but to do that in
24:10
a way that requires them to write
24:12
no code and use the generative
24:14
AI machinery to actually tell
24:18
the system what to do. And my startup
24:20
DataChat essentially addresses this problem. You point
24:22
it to a data set. We
24:24
work with structured data. You ask the question
24:27
and it produces the analysis for you. And as
24:29
part of that analysis, it may write SQL
24:31
queries. It may write machine learning pipelines. It
24:33
may do a combination of that. It
24:35
may do visualization and present the results to
24:38
you.
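To give a flavor of what such a pipeline does under the hood, here is a hypothetical sketch (not DataChat's actual design; `llm_complete` is a stand-in for whatever model call you use): hand the model the schema and the question, get SQL back, run it, and return the result together with the generated query so the user can verify it.

```python
# Hypothetical question-to-analysis loop; illustrative only.
import duckdb

def llm_complete(prompt: str) -> str:
    """Stand-in for a call to any LLM, assumed to return a single SQL string."""
    raise NotImplementedError

def answer(question: str, con: duckdb.DuckDBPyConnection):
    # Describe the available tables/columns so the model can ground its SQL.
    schema = con.sql(
        "SELECT table_name, column_name FROM information_schema.columns"
    ).fetchall()
    sql = llm_complete(f"Schema: {schema}\nWrite one SQL query answering: {question}")
    # Return the rows and the generated SQL: the recipe is part of the answer,
    # so the user can inspect and verify it rather than trust a black box.
    return con.sql(sql).fetchall(), sql
```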
24:40
So I think, for data in its different forms, there's the time to live for
24:42
data. That's one consideration for sure. People don't want
24:44
to hang on to data forever unless
24:47
they have a reason to. But also there's
24:49
the richness of data and the richness with
24:51
which you need to get insights from the
24:53
data. And there's just so many more tools.
24:56
But there's also the human aspect of it:
24:59
all of that, if it requires
25:01
increasing the human expense to get
25:04
the insights, is unsustainable too. Just
25:06
as it was unsustainable on the hardware end, to
25:08
say I'm going to double my cost every time
25:11
I double the data volume, you can't say I'm
25:13
going to double my human cost for programming if
25:15
I double my analysis needs. That's the
25:17
other end of the spectrum where some of these
25:19
Gen AI tools, and the stuff that we're doing in
25:21
DataChat is one of many examples, is
25:24
the other big challenge for the industry and for
25:26
the field. And in
25:28
that space of user experience, usability
25:31
of these data systems, as we
25:33
get more sophisticated with the types
25:35
of data that we're storing, the
25:38
ways that we're analyzing the data,
25:41
finding the data is always a problem.
25:43
So that's the first step in utility,
25:45
but then understanding what to do with
25:48
it, the semantics of that data, the
25:50
organizational aspects of what does the data
25:52
really mean in the context of my
25:54
business. All of these are barriers to
25:57
a seamless user experience and I'm wondering what are some of
25:59
the opportunities for improving the
26:01
interfaces and the semantic understanding at a
26:04
fundamental level that these data engines can
26:06
contain and some of the ways that
26:08
they can help to give hints to
26:10
the end users without the end user
26:13
having to go and get their PhD
26:15
in data management, just to be able
26:17
to answer a simple question. Yeah.
26:20
I think great questions. I think we
26:22
have three components to it. One is
26:24
today, the whole tooling ecosystem to
26:26
even discover where
26:29
to look in this vast lakehouse
26:31
is non-existent. And I know a lot of
26:33
people are working on it. We have a
26:35
research project at CMU that is just starting
26:37
out to explore some of these aspects. Today,
26:39
it is not uncommon to go to a large enterprise
26:42
and see that they have a warehouse
26:44
or a lake house where
26:46
they might have hundreds of thousands,
26:48
if not a million
26:50
data sets that are sitting around collected
26:53
over time, even though they might have pruned it.
26:56
And in, you know, a data set might be
26:58
a table and that table might have 10 or
27:00
100 columns in it.
27:02
So you're really saying I've got
27:04
millions or tens of millions,
27:06
or maybe sometimes even more schema
27:09
elements, descriptive elements of what's in the data.
27:11
It's not just the data values,
27:13
but just the description of the data is large. How
27:15
do I look? Sometimes it's super complicated even
27:18
saying, what is the profit that I made?
27:20
That's a complicated question. There's
27:22
a financial version of this that
27:24
is the methods that get used
27:27
for reporting purposes for financial statements
27:29
and stuff like that. But then there
27:31
are other descriptions where even something as
27:33
basic as pricing, it's like, do you look
27:35
at as the data is flowing in? Do
27:38
you, if I'm a retailer, do I look
27:40
at all the items that were checked out
27:42
from my cart, but what happens about returns?
27:44
What happens about projected returns? If I'm trying
27:46
to do analysis on orders that were just
27:48
placed, you know, do I expect that 25%
27:50
of that is going to get returned at
27:52
a certain time of the
27:54
year? Like we know that sometimes there's more
27:56
return, the return rate goes up around the
27:59
holiday shopping time. So it's very complicated
28:01
to even define simple things. You don't
28:03
even know where to look. That's the
28:05
first challenge. Second is the semantic complexity
28:08
of saying, how do I manage meaning? Even
28:10
the notion of something as simple as how
28:12
much did I make last week is hard.
28:15
And that's where many of these tools you
28:17
see there's excitement around dbt and a whole
28:19
bunch of semantic tooling mechanisms. That's the
28:21
second component. For the discovery component, there
28:23
really isn't much; for the semantic component, dbt
28:25
and tools like that exist. And then
28:27
you get to that programming layer, all of
28:30
the complexity we talked about. So even before
28:32
you get to that programming level, you're
28:34
exactly right. We don't know where to look
28:36
often. Even when we know where we want
28:38
to look, we need some sort of an
28:40
agreement and be able to communicate across different
28:42
members of the team or different teams in
28:44
an organization as to what's the semantic value
28:47
of the things that we see in the
28:49
database so that we can all be on
28:51
the same page. And then we can start
28:53
to trust the analytics pipeline down the line. So there
28:55
isn't a clean separation between these pieces. Today,
28:57
when you see someone constructing a data science
28:59
pipeline, let's say in a notebook environment, all
29:01
of these are blended in, they are written
29:03
in code. They are not queryable. They are
29:05
not transparent. And it's very hard. If I
29:07
gave you a notebook that is 10,000 lines
29:09
long, that is running a core pipeline, and
29:11
if I'm no longer in your organization, it'll
29:14
be very hard for whoever picks that up
29:16
to understand what's going on in that notebook, because
29:18
all of these things are blended in programmatically and it's
29:20
a mess. And so given
29:23
the fact that there is so much complexity,
29:25
we have gotten to a space where we
29:27
have to work across at least
29:29
two or three different tools and systems just
29:31
to be able to answer a simple question.
29:34
What are some of the forward
29:37
looking design considerations,
29:39
system architectures, and
29:41
platform evolutions that we can look to to
29:43
simplify that aspect where maybe, I think it
29:46
was 10, 15 years ago, we had systems
29:48
like Informatica, where it was an all in
29:50
one vertically integrated solution. Now we've gone to
29:52
the modern data stack where we have a
29:55
dozen different tools, each of which wants to
29:57
own different overlapping pieces of the puzzle. Now
30:00
we're starting to see the pendulum swing
30:02
back the other direction where we are
30:04
recompiling a vertically integrated solution out of
30:06
the individual components of the data stack
30:08
with things like Mozart Data. What are
30:11
some of the ways that we as
30:13
engineers and system integrators should be thinking
30:15
about how to build cohesive platforms, cohesive
30:17
experiences so that our end users aren't
30:20
struggling and spending their entire day just trying to figure
30:22
out what they're supposed to be doing and how? Yeah,
30:25
great question. I think practically
30:27
today from a systems
30:30
architecture, data engineering
30:33
perspective, you want
30:35
to keep the tool ecosystem
30:37
as lean as possible. There's
30:39
this huge tendency to say
30:41
you hit the nail
30:43
right on the head, which is
30:45
a lot of these tools have overlapping components and
30:48
it's so common to see, you
30:50
might have a team of 12
30:53
data engineers. Each one
30:55
will put in their favorite tool. And before
30:57
you know it, you've got a dozen tools
30:59
in the ecosystem and maybe all you needed
31:01
was three or two. And even if you
31:04
boil it down to a few, it's a
31:06
question of how well is that process and
31:09
methodology for using those tools
31:12
set at a systematic process
31:14
level to say what
31:16
will be used when, and how
31:18
do you keep track of all of that, especially as
31:20
all of these tools change over time. So I
31:23
think that's just pretty straightforward engineering 101:
31:25
running a good dev shop,
31:27
good engineering shop, keep it lean, keep
31:29
it clean and only bring in when
31:31
you need to and document everything, have
31:33
processes that go outside that tool integration
31:35
set. The second aspect of it, which
31:37
is a little bit futuristic and goes
31:39
a little bit into where DataChat's
31:41
going. We look at it
31:43
from the other end of the spectrum. We
31:45
say all of this engineering support is a
31:48
means to an end. The end is to
31:50
enable the end user to ask a question
31:52
and get an answer in a way that
31:54
is transparent and reproducible. So more than saying,
31:57
I want to make it easy for someone
32:00
to compose a programmatic pipeline, how about
32:02
we complement that or flip it and
32:04
say, we want it easy for anyone to
32:06
ask any question and get an answer,
32:08
and then get the pipeline that they
32:10
can verify in a way that says,
32:12
this matches the semantics I need.
32:14
That's kind of where we are going:
32:16
to say whether it is data science,
32:18
which includes SQL and machine learning and
32:20
data blending and feature engineering and all
32:22
of that stuff, or visualization, we'll give
32:25
you one UI, one interface,
32:27
which is a chat box. Type your
32:29
question in plain English. Behind all of that, along
32:31
the way, it will give you a recipe: the precise
32:33
steps that document what happened at each step.
32:35
This is the semantic definition that we came
32:38
up with for the definition of profit. You
32:40
can verify it, you can change it. But that's
32:42
the other end of the spectrum, which is that
32:44
the tools will evolve, and if you make
32:47
the management of the tools the
32:49
task of the data engineering team, then you're
32:51
not serving the end user. You could
32:54
also try to come at it from the other end, which is
32:56
kind of what we're doing in DataChat, which is to say,
32:59
blow up this portion of it. Make it easy
33:01
for anyone to ask, but build trust and
33:03
verification into the system so that yes, the
33:05
semantic definition may change, the
33:07
tooling behind it, maybe dbt or maybe just Python code
33:09
right now, may change to something else. But the
33:11
interface that you want to keep constant
33:13
is enabling that end user to
33:15
ask these questions and build that trust
33:17
and verification into the system, because the
33:19
tools will change and they will evolve.
33:21
And given that you are researching both
33:24
sides of this equation of user experience,
33:26
how to improve the utility of these
33:28
data systems as well as the
33:30
scalability aspects, and how do we make
33:32
it so that we can push more data
33:34
through these systems without having to double
33:36
the cost every two years, what
33:38
are the elements of tension that
33:40
exist in answering those two questions?
33:43
And what are the opportunities for
33:45
incorporating those perspectives in the evolution
33:47
of the fundamental platform components that
33:49
we build? Yeah, great questions,
33:51
and there is a big unification across both ends
33:54
of the spectrum. The unification is time. On
33:56
the human side, that is basically the same as
33:58
cost. So I want a fast system to
34:00
deal with the scalability problem on
34:03
the architecture end of the spectrum that we
34:05
talked about. But I want exactly the same
34:07
speed, because if I've got human-in-the-loop
34:09
compute, which is what a lot
34:12
of analytics often is today, then, you know,
34:14
if you fire up a question,
34:16
let's say today in DataChat it's gonna take thirty
34:18
seconds to come back, but if I could have
34:21
a faster hardware-software system that
34:23
could bring that answer back
34:25
in half the time, guess what?
34:27
What do you win on? You win
34:29
on human time. And human time is really
34:31
expensive. So ultimately that cost is the driving
34:34
factor across both. Human time is the same as
34:36
cost. And so that's a unification. Sure,
34:38
you need a fast system to do more, but
34:40
humans are going to be impatient. A lot of
34:42
this analysis has a human in the loop, right? Even
34:44
in ChatGPT, when you punch in the
34:46
question and press enter, if that thing took
34:49
a minute to come back versus five
34:51
seconds to come back, your user experience and
34:53
your ability to use the tool
34:55
to do real work would completely change. So it all comes
34:57
down to this one thing: speed matters. At
34:59
both ends of the spectrum, faster is
35:02
better, for very different reasons, but that's a
35:04
unifying KPI across both of them: faster
35:06
is better. As
35:08
you are conducting your research, you are
35:11
doing it in the context of
35:13
a lab environment with your research group,
35:15
and you're hoping that the outcome of
35:18
this research will have some meaningful impact
35:20
on the industry a number of years
35:22
down the road. I'm wondering what are
35:25
some of the strategies that you use
35:27
to get some sort of real-world
35:29
context around these problems and solutions that
35:32
you're building, to feed that back into
35:34
the research, so that you're doing it
35:36
in a way that is
35:38
directionally beneficial to the outcome that
35:41
you're hoping to achieve. Yeah, that's
35:43
a great question and a tough question.
35:45
My research philosophy has always been work
35:47
on interesting things that are at least
35:50
a few years out. I don't know
35:52
if anyone can see more than
35:54
five years out, but pick something that
35:56
is a two-to-five-year challenge
35:59
you do not the...
38:00
base. All my startups have
38:02
been in conjunction with the university so
38:05
you know I feel like if I'm
38:07
at a university and I do something
38:09
interesting it's because of the university so
38:11
let's play ball with them. Different people
38:13
have different philosophies but there's it's never
38:16
an easy answer there's always discussion there's
38:18
always negotiation there's always contractual stuff and
38:20
lawyers get involved so there's some non-fun
38:22
parts of it. The second part of
38:25
it is that once even in
38:27
academia if you are working on an interesting
38:29
problem, industry is often pretty interested in
38:31
getting engaged with you at an early point
38:33
in time and once you have even a
38:36
crude prototype that you could deploy even in
38:38
a limited setting you always learn things
38:40
that you would have never expected once
38:42
something becomes real and actual users start to
38:45
play with it because people will do crazy
38:47
stuff that even in the wildest imagination you
38:49
can't quite imagine and then all
38:51
of a sudden becomes real and what's super
38:53
interesting is nearly always it'll generate new research
38:55
problems for you to think about that you
38:57
wouldn't have come up with if you had
38:59
just tried to dream about it and think
39:02
about it in your office but you have
39:04
to start by dreaming first right if you
39:06
just go and tell people what do you
39:08
want they may not quite have that so
39:10
it's that combination you have to have a
39:12
dose of practical reality plus
39:14
a dose of aspirational creative
39:16
thinking and you have to have
39:18
both of those parts in any successful research project.
39:21
And as you have been conducting
39:24
your research and working in these different
39:26
startup enterprises what are some of the
39:28
most interesting or innovative or unexpected ways
39:30
that you've seen your research applied? I
39:34
think the most unexpected ways is when
39:36
you start to deal with real workloads
39:38
and real constraints you start to
39:41
realize that things that seem simple
39:43
or trivial actually turn out to
39:46
be really complex so
39:48
just the practical components of making things in
39:50
real life with cost considerations that
39:52
are real, right? Someone's writing a check. If
39:55
you're trying to train an LLM on
39:57
the specific task at hand for example
39:59
stuff like that that we do at
40:01
DataChat, all of a sudden the
40:03
cost component is no longer abstract, you're
40:05
actually writing a check for those hardware
40:07
resources. So you just are
40:09
at a different level where you start
40:12
thinking very, very carefully about things like
40:14
estimating things that are
40:16
going to be actually run and
40:19
developing methods to do that estimation, learning
40:21
how to do low-cost A-B
40:24
testing as you go down
40:26
searching for different architectural configurations
40:28
for the system architecture.
40:31
So very macro level stuff that
40:34
are abstract and potentially
40:36
not interesting in the academic setting,
40:38
but also not realizable in the academic
40:40
setting because you need often large teams of
40:42
engineers to be able to build a big system
40:44
like DataChat is. So those
40:47
are super interesting things that I think are
40:49
very hard, if not impossible to study in
40:51
academia, but are front and center
40:53
and very quintessentially interesting problems
40:55
that show up once things start to
40:57
become real in enterprises and in startups.
41:00
And in your own research that you're
41:03
doing, what are some of the most
41:05
interesting or unexpected or challenging lessons that
41:07
you've learned? Yeah, I
41:09
think the most challenging lesson
41:11
is that don't give up the first time
41:13
you get a negative result, which will happen.
41:16
If you pick a challenging problem, you'll sometimes
41:18
hit your head against the wall maybe
41:21
for years. And if
41:23
the question is still valid, if
41:27
it is tantalizingly important long
41:29
term, you sometimes just have to
41:31
stay at it. It takes patience. And
41:33
sometimes it may take multiple students because students
41:36
come even in the PhD program, they
41:38
may be with you for five or six
41:40
years. And sometimes an interesting problem may take
41:43
longer time than that. And so staying
41:46
with the problem longer than a few
41:49
durations, I know attention spans are getting
41:51
shorter and shorter over time. But sometimes
41:53
the payoff happens when you work on
41:55
something for an extended amount of time.
41:58
And have there been any particular interesting
42:00
or informative dead ends that you've encountered along
42:02
your journey? Yeah, the part that we started
42:05
out with where we are looking at encoding
42:07
techniques and saying let's revisit that. We actually
42:09
started working on it about ten years ago,
42:12
got some good early results, then kept hitting
42:14
a wall and now
42:16
I think we are on to a new line
42:18
of thinking which is along this line of... And
42:21
as you continue to
42:23
work on these hard problems, you
42:25
try to forecast what
42:27
are the solutions that we're going
42:29
to need three to five years
42:32
out as you were saying. I'm
42:34
curious what your heuristic is for
42:36
when a particular research project needs
42:38
to be either killed or put
42:40
into production. Yeah, I think
42:42
putting into production is easy, right? If you have something
42:44
that is interesting and exciting, you pitch it to a
42:47
couple of VCs. You know, first before you pitch, you
42:49
see if you can get your students excited and collaborators
42:51
excited to go spin it out into something like
42:53
a startup. Once you do that, then you go
42:55
and see if you can pitch it to VCs.
42:58
Many of them are extremely sharp. They see
43:00
a lot. They'll be... And if you can't
43:02
get a VC's attention, then there's something probably
43:05
wrong. You missed it, right? Because you
43:07
should be able to convince someone to put money
43:09
into a good idea. And once you have all
43:11
of that, then you can get
43:13
the ball rolling. And academic research
43:15
also requires funding, right? You're trying to convince
43:18
funding agencies to fund you and the VC
43:20
game is different. It has to be more mature by
43:22
the time you get to that. So it's a spectrum. But
43:24
luckily there are well-defined mechanisms to do that.
43:27
But if you can't convince someone, your student
43:29
to work on an interesting, far-reaching, far
43:31
out problem that may seem crazy, or if
43:33
you can't convince a VC to fund you,
43:35
then something's wrong. You have to re-examine it
43:37
and say, how do I refine what I'm
43:39
doing? Am I on the wrong path? Should
43:41
I sunset this or pause this till I
43:44
can get someone else also
43:46
to be more interested in this problem?
43:48
So that's the way I think about
43:50
it. I know there are different ways. You know,
43:52
if you're a pure theory person or a pure math
43:54
person, you could stick to a problem
43:56
by yourself. But for the type of things that
43:58
I do in systems, you need collaborators,
44:01
you need students, you need larger
44:03
teams. So you have to convince someone that it's
44:05
a good idea, and that's, for me, a good
44:07
measure. And as you look
44:10
to the future and you see what
44:12
are some of the problems that you
44:14
are anticipating we're going to have to
44:16
address as we continue to build and
44:18
scale these complex systems and complex data
44:21
challenges, what are some of the areas
44:23
of focus that you have planned for
44:25
the near to medium term or any
44:27
particular projects or problem areas that you
44:29
or someone else should dig into? Yeah.
44:32
And it's looking at the two ends of the spectrum. To
44:34
broaden out: on the architecture side, there's
44:36
just so much diversity of different ways
44:38
to architect storage and computing devices. So
44:40
I'm working with collaborators from other universities
44:43
and at CMU who are hardware folks
44:45
to understand that ecosystem and see what's
44:47
possible. What's the design space? It's vast.
44:49
So there's a ton of work to
44:51
do in that space and lots of
44:53
interesting sub spaces there. On the other
44:55
end of the spectrum where I think
44:58
we are just getting started with all
45:00
of the uses of Gen AI for
45:03
improving human productivity in getting
45:05
insights from systems and things of
45:07
that sort. We're still starting
45:09
to better understand how to use
45:11
these LLMs in
45:14
ways that protect the
45:16
privacy of the data and the
45:18
communication between the platform
45:20
that's using the Gen
45:23
AI technology and
45:25
the application. There's also
45:27
a huge component of what's the
45:29
cost component, are small models the
45:32
future in certain cases or are they still
45:34
quite far out from the large models and
45:36
large models are getting larger and larger.
45:38
There are all kinds of different architectures. So
45:41
lots of interesting stuff in just that space
45:43
of how to economically use, when to use
45:45
what components and just like, you know,
45:47
lots of interesting subspaces, including that data discovery piece
45:49
that I mentioned, we don't know where to look.
45:52
And even when you know where to look, you
45:54
don't know how to use many of these new
45:57
advanced methods, especially in the Gen AI space, because
45:59
that's just moving. So I think
46:01
just anywhere you look, there are lots and lots of pockets
46:03
of interesting components in the two ends of the spectrum.
46:05
I would say the middle is kind of boring. Go
46:08
to the edges. It's wide open. Are
46:11
there any other aspects of the
46:13
research that you're focused on, the
46:15
problem spaces that are still open
46:18
to be explored, or some of the other work that
46:20
you're involved in that we didn't discuss yet that you'd
46:23
like to cover before we close out the show? I
46:26
think there's a huge amount of interest in
46:28
general in terms of saying what's the
46:31
future of LLMs in terms of how
46:33
open should they be? And
46:35
what does openness mean? Is open
46:37
weights open enough? Probably not. I
46:40
think in academia, one of the challenges
46:42
when you're working on some of these large
46:44
LLM models is very few institutions
46:46
have the resources it takes to
46:48
build one of these LLMs from
46:50
scratch in a realistic fashion. Yeah,
46:53
so I think there are lots of research
46:55
problems. And if you especially look at the
46:57
space of Gen AI, there are certain things
46:59
that you can do better in industry today.
47:02
So if you are at OpenAI
47:04
or at Google and have been
47:07
building these large language models now
47:09
for five years, which is an
47:11
eternity, you know all the deep
47:13
system engineering tricks and
47:16
a lot of insights that will never get written in
47:18
papers. It's very hard in academia for someone to go
47:20
and say, I'm going to take on that project. First, you
47:23
don't have those five years of detailed engineering
47:25
tricks that you can use, or the trade
47:29
secrets to go and do things in an
47:31
efficient fashion. Second, it
47:33
takes a lot of resources,
47:35
millions, if not tens or hundreds of millions
47:37
of dollars to build one of these. So
47:40
there's certain components that are
47:42
just very uniquely well positioned right now
47:44
in that exciting space in industry. And
47:46
as academics, it's like, okay, do you go to
47:48
industry and spend some time over there if you're
47:50
deeply interested in stuff like that? Luckily, there are
47:53
lots of interesting far
47:55
outreaching problems that require
47:57
you to start with something
47:59
that might be a large language model and do
48:01
stuff with it. And there's a ton of
48:03
work going on in there, but you know,
48:05
certainly this is kind of unique where in
48:07
the past, it was often the case where
48:09
the deepest core component of some
48:12
new technology was often done in academia.
48:15
You could arguably say building a large
48:17
language model is one of those core
48:19
constructs, and that is better
48:21
done right now, arguably in industry because of
48:24
the resources and all of the large
48:26
engineering teams that you often need to go
48:28
do that stuff, which are available only over
48:30
there right now. So there's a little bit
48:33
of a difference in terms of where things
48:35
go. So you have to, if you're working
48:37
in that space, you have to say, which
48:40
problems can I practically achieve and do in
48:42
academia? And that's sort of a
48:44
new thing for many parts of computer science.
48:47
All right. Well, for anybody who wants to get
48:49
in touch with you and follow along with the
48:51
work that you're doing, I'll have you add your
48:53
preferred contact information to the show notes. And as
48:55
the final question, I'd like to get your perspective
48:58
on what you see as being the biggest gap
49:00
in the tooling or technology that's available for data
49:02
management today. I think data discovery
49:04
is probably the biggest one that comes to
49:06
mind. You know, we do not
49:08
have ways to find out where do I
49:10
even start to look. Absolutely. All
49:12
right. Well, thank you very much for taking
49:14
the time today to join me and share
49:17
the work that you've been doing in your
49:19
research and the ways that you have been
49:21
applying that in the commercial sector. It's definitely
49:24
a very interesting body of topics that you're
49:26
focused on. Definitely glad that you and your
49:28
collaborators are working to improve our capabilities in
49:30
this space. So I appreciate all the time
49:33
and energy that you're putting into that. And
49:35
I hope you enjoy the rest of your day.
49:37
Thank you. Take care. ... at
50:00
dataengineeringpodcast.com to subscribe to the show, sign up
50:02
for the mailing list and read the show
50:04
notes. And if you've learned something or tried
50:06
out a project from the show, then tell us about it. Email
50:09
host at dataengineeringpodcast.com with your stories.
50:12
And to help other people
50:14
find the show, please leave a review on Apple Podcasts and tell