Episode Transcript
0:11
Hello and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Data
0:16
lakes are notoriously complex. For
0:18
data engineers who battle to build and
0:21
scale high-quality data workflows on the data
0:23
lake, Starburst powers petabyte-scale SQL analytics fast
0:25
at a fraction of the cost of
0:28
traditional methods so that you can meet
0:30
all of your data needs, ranging from
0:32
AI to data applications to complete analytics.
0:35
Trusted by teams of all sizes, including
0:37
Comcast and DoorDash, Starburst is a data
0:39
lake analytics platform that delivers the adaptability
0:42
and flexibility a lakehouse ecosystem
0:44
promises. And Starburst does
0:46
all of this on an open architecture,
0:48
with first-class support for Apache Iceberg, Delta
0:50
Lake, and Hudi, so you
0:52
always maintain ownership of your data. Want
0:55
to see Starburst in action? Go
0:58
to dataengineeringpodcast.com/Starburst and get
1:00
$500 in credits to
1:02
try Starburst Galaxy today, the easiest and
1:04
fastest way to get started using Trino.
1:07
Your host is Tobias Macey, and today
1:10
I'm interviewing Jignesh Patel about the research
1:12
that he is conducting on technical scalability
1:14
and user experience improvements around data management.
1:16
So Jignesh, can you start by introducing
1:18
yourself? Yes, hi.
1:20
Well, nice to talk to you and
1:23
to your audience. I'm Jignesh Patel. I'm
1:25
a professor in computer science at
1:27
Carnegie Mellon. I've been working
1:30
in the area of data for
1:32
about 25 years now and
1:34
been working on things in data
1:36
across the spectrum through the different
1:39
ages that the data ecosystem has
1:41
gone through from parallel databases to
1:44
streaming databases to mobile databases to
1:46
using databases for genomics and proteomics
1:48
and other biological applications to where
1:50
we are right now, where we
1:52
are trying to use gen
1:54
AI and make data analytics far
1:57
easier for humans to get insights from
1:59
data. And you mentioned that you've been
2:01
in this space for a while. Do you remember how you
2:03
first got started working in data? Yeah,
2:06
I first started working in data when I
2:08
came to the University of Wisconsin as a
2:10
grad student. This was in the early 90s.
2:13
And I actually came here
2:15
to do computer architecture. But
2:17
Wisconsin has an amazing group.
2:20
It had one of the leading groups at
2:22
that time in databases. And
2:24
once I started taking a couple of
2:26
classes in there, that's how I decided
2:28
to switch over to databases. So
2:31
it was not the plan that I
2:33
had, but it was the strength
2:36
of the group that was at Wisconsin at
2:38
that time that really drew me into databases.
2:41
You are, as you said, a professor.
2:43
You work at Carnegie Mellon, which is
2:46
one of the leading schools for database
2:48
research today. And I'm
2:50
wondering if you can just start by
2:52
giving a bit of a summary of
2:54
some of the current areas of research
2:56
that you're focused on and what it
2:58
is about those subjects that motivates you
3:00
to invest the time and energy required
3:02
to gain meaningful results. Perfect.
3:05
Sounds great. And maybe a little bit of a context.
3:08
Carnegie Mellon is where many computer scientists
3:10
will say is where AI was invented.
3:13
And if you go back to the
3:15
birth of the study of data in
3:17
academia, Wisconsin, Berkeley, and Purdue were
3:19
among the earliest schools that really started
3:21
to do that. So I've been really fortunate
3:23
to be at powerhouses of data and AI.
3:26
And of course, at Carnegie Mellon, there is
3:28
both data and AI that is
3:30
present today. Now, of course,
3:33
the data research ecosystem
3:35
and product ecosystem have gone through different
3:37
phases. Where my research
3:39
is today and where I think
3:42
many of the interesting, forward-looking research
3:44
problems are, and today's
3:46
forward-looking research problems are very likely products that
3:48
will make a difference in a few years,
3:51
is along the two edges. I just alluded
3:53
to how I started initially as
3:55
a grad student being attracted to
3:57
architecture, which is making processors
4:00
and storage devices and things like that that
4:02
get used as the computing substrate on
4:04
which you build your algorithms and software. And
4:07
today my research is broken into two parts. One is
4:09
on the architecture end of the spectrum
4:11
and the other is on the human end of the
4:13
spectrum. So think about what we
4:15
do in data platforms today, right? The data
4:18
platforms are largely software. They will run on
4:20
some hardware and we want these
4:22
data platforms to work with large volumes of
4:24
data. We want them to be extremely fast
4:26
and we want them to be versatile. And
4:30
of course we want all of that to happen
4:32
in a cost-effective fashion. At the
4:34
other end, we want these data platforms to
4:36
be very easy to use by humans of
4:39
all types, not just programmers, and there's a
4:41
ton of research in there. So
4:43
the first part of my research, which
4:45
is purely in academia right now, is
4:48
on the data architecture side. So
4:50
what's the interesting aspect over there?
4:54
So here's the backdrop. In
4:56
many enterprises, data has been
4:58
doubling in size, roughly
5:00
every two-ish years or so. And this is a growth
5:03
that has been happening for 30, 40
5:06
years for many organizations. In
5:08
the past, the way you dealt with that is
5:11
to say, okay, I've got a data platform.
5:14
It's doubling in volume every few
5:16
years. I obviously can't pay twice
5:18
for all of my analytics, all of my queries
5:20
every two years, that would be unsustainable. So I
5:22
need to keep the cost the same or at
5:25
least, or perhaps even better start to
5:27
lower that. The one
5:30
big boost we used to get in the past
5:32
for data platforms to meet up with that demand
5:34
while keeping costs constant was to
5:36
say, let's just upgrade to the latest hardware because
5:39
everyone was riding Moore's law
5:41
and the underlying principles of Dennard
5:43
scaling, which means if I upgraded
5:45
my computing substrate to
5:47
the latest generation of storage, compute,
5:50
and memory devices, which
5:52
were all 2x faster, and
5:54
if my data volume doubled, I'm
5:56
getting that constant cost perspective on
5:58
my analytics pipeline.
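To make that constant-cost arithmetic concrete, here is a minimal sketch (my numbers, not the episode's):

```python
# Back-of-the-envelope model of the Moore's-law-era bargain: if scan
# throughput doubles every generation along with data volume, the time
# (and therefore the cost) of a full pass over the data stays flat.

data_tb = 10.0                 # assumed data volume at generation 0
throughput_tb_per_hr = 2.0     # assumed scan throughput at generation 0

for gen in range(4):
    hours = data_tb / throughput_tb_per_hr
    print(f"gen {gen}: {data_tb:.0f} TB / {throughput_tb_per_hr:.0f} TB/hr = {hours:.1f} hr")
    data_tb *= 2               # data doubles every couple of years
    throughput_tb_per_hr *= 2  # hardware used to keep pace; this is what ended

# Prints 5.0 hr at every generation. Freeze the throughput line and the
# hours (and the bill) double with the data instead.
```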
6:01
But all of that has stopped. And a big part
6:03
of my research at Carnegie Mellon now is
6:05
how do we build long-term sustainable platforms
6:08
where we can keep up with this
6:10
growth in data demand. And it's not just
6:12
growth, but we are asking deeper and deeper
6:14
questions of data that pushes additional stress and
6:17
still have this cost balance. The
6:19
gift of Moore's law hasn't fully
6:22
ended yet, but we all know that
6:25
five years out, it probably doesn't keep giving us
6:27
the dividends it had for the last 30, 50
6:29
years. So that's
6:31
one end of the spectrum. And the other
6:33
end of the spectrum is on using gen AI to
6:35
make data platforms more programmable. And I can talk about
6:37
that other part, but before that, let me turn it
6:40
over to you to see if you have questions. You
6:43
mentioned Moore's law as our saving grace
6:45
for the past few years. And
6:48
we are still somewhat benefiting from that by
6:50
increasing the number of transistors, but we're not
6:53
getting better clock speeds. We are adding
6:56
more cores, we're starting to reach the logical limit of
6:58
that as well. And as we go
7:00
down the nanometer scale, we start to hit physical limitations
7:02
of what we can even fit on a chip, which
7:05
brings up the specter of quantum computing. And I'm
7:07
wondering what the viability is
7:09
of that as our saving grace
7:11
for the next few decades. And
7:13
if there's any analogous equivalent in
7:15
quantum processing to the idea of
7:17
Moore's law. Yeah, great
7:20
question. You pointed out that Moore's law
7:22
is not dead. I agree. Not only
7:24
are we still
7:26
getting denser packaging of transistors, but it's
7:28
also the big thing that's happening is,
7:30
now we are going 3D, right? You're
7:32
seeing storage and chips all becoming three-dimensional.
7:34
It used to be all planar and
7:36
two-dimensional. So there's some life in that
7:38
packaging stuff, but still, the energy
7:40
profile is an important component when you
7:42
start doing 3D packing. Yes, you can
7:45
get more transistors pushed in,
7:47
but now the heat dissipation becomes a
7:49
problem. So we'll continue to get the gift
7:51
of Moore's law or
7:53
the behavior that we've been expecting of
7:55
hardware for a little while, but not
7:58
forever. I don't think anyone's saying that beyond
8:00
the decade we are going to keep seeing that. And even
8:02
that for some is a stretch. Great
8:04
question about quantum computing and that
8:07
certainly has the potential to revolutionize
8:09
certain aspects of computer science, especially
8:12
the ones in which you're trying to
8:14
solve an algorithmic problem and trying to
8:16
find some optimization stuff, huge opportunities potentially
8:18
over there and of course in crypto.
8:21
But there's a well-known result now
8:24
from more than two decades ago
8:26
that for some of the core
8:28
data problems like sorting, you
8:30
can't do it any faster even if
8:32
you have an ideal quantum computer. So
8:35
furthermore, you know, we are at
8:37
this point many organizations are working
8:40
with terabytes and many organizations
8:42
are working with petabytes of data. You
8:44
can't even push all of
8:46
the data through a compute unit. So
8:48
quantum computing for this type
8:51
of data analytics, I don't think
8:53
that's a possibility at least as far as I can see.
8:56
Certainly might have implications in certain smaller
8:58
components of what you do in the
9:00
broader data ecosystem but it's a different
9:02
problem space. So we
9:04
need to start finding ways
9:06
to get the data-to-insights
9:08
pipeline through
9:11
more traditional methods and nothing
9:13
else other than the traditional
9:15
semiconductor based hardware substrate ecosystem
9:18
is likely to be the answer for a
9:20
very long time. And
9:22
also with quantum, it will likely
9:25
bring up the same problems that
9:27
we're having now with GPUs where
9:29
it is a co-processor, it's not
9:31
going to supplant classical computing and
9:34
we're likely to hit a point where as
9:36
it gains popularity and adoption, we're not going
9:39
to have enough capacity for it. And so
9:41
I wonder if then we'll end up
9:44
back in the time sharing model of everybody can submit their
9:46
requests in batch and you just have to wait for it
9:48
to come back. Yeah,
9:50
and look, I'm not an expert in
9:53
quantum computing but today you can go
9:55
and rent a quantum computer in many of
9:57
the cloud providers. Yes, it is
9:59
harder to get time on that,
10:02
definitely compared to a GPU. A
10:05
co-processor often in a
10:07
data intensive environment, the co-processors have to
10:09
be sitting very close to each other
10:11
because the IO, the cost of transferring
10:13
data from one side to
10:15
the other is often the bottleneck called
10:17
the von Neumann bottleneck. That's already a
10:19
big problem in CPU GPU databases. We
10:22
don't know how to use GPUs that
10:24
well for large scale data platforms.
10:26
And there's some big companies that are doing
10:28
that. One of the leading companies that does
10:30
that is Voltron Data down in the Bay
10:32
Area. But there are lots of hard
10:34
problems, even with simpler processing substrate. And I
10:36
would say for, as I said,
10:39
I'm not an expert in quantum computing, but that's
10:41
not something I think nearly anyone
10:43
is really looking at as a viable
10:45
computing substrate for the type of data
10:48
processing. For cryptography, code cracking, stuff like
10:50
that, obviously that's where all the excitement
10:52
is. But for the data land, I
10:54
think that's quite far out. There have
10:56
been research papers that have explored using
10:58
it for certain components, but
11:01
nothing I can see becoming mainstream
11:03
anytime soon for very fundamental
11:05
reasons. And unless those fundamental reasons
11:07
get solved, which probably is a totally
11:09
different type of quantum computers and totally
11:12
different ways of getting data in and out at
11:14
high speed, that's not
11:16
a viable path for the data direction.
11:19
Continuing on your point of IO being
11:21
the biggest bottleneck as we scale the
11:23
volume and complexity of data and the
11:26
types of analytics that we're trying to
11:28
build on top of them, what
11:31
are the future directions that we can
11:33
look to to try to realize that
11:35
either constant or declining cost as the
11:38
volumes of data increase and whether that
11:40
is in terms of the physical hardware
11:42
or some of the semantics
11:45
of how we work with data or ways that
11:47
we think about storing and accessing data and wondering
11:49
what are some of the areas of research that
11:51
you're focused on to help address those problems? Yeah,
11:55
that's a great question. So the
11:57
part that we are focused on is something a
11:59
little speculative, which computer scientists and architects and data
12:01
folks have been coming back to for a
12:03
fair amount, which is to say the traditional
12:06
von Neumann architecture says that I've got
12:08
a compute device and I've got a
12:10
storage device. They are connected by some
12:12
communication component. And then you
12:15
have to pull the data through that
12:17
communication channel to the computing device to
12:19
do stuff on it. And when you're done
12:21
computing, you push it back. Right. So
12:23
there's two separate devices. And today that's
12:25
largely how your laptop
12:27
or your individual server or your
12:29
phone works to where entire
12:32
cloud data centers have a compute portion
12:34
of the cloud and a storage portion
12:36
of the cloud. So that version of
12:38
separation of compute and storage exists everywhere.
12:40
But as you can imagine, this is
12:42
very inefficient in many data analysis pipelines.
12:45
You are going to scan a large amount of data
12:47
and really the core of the compute that you're going
12:49
to do is going to be on a very small
12:51
fraction. And many times
12:54
in many data pipelines, you
12:56
have a very small
12:58
number of cycles per byte of data that
13:00
you're going to access. So where there's been
13:03
this idea in different forms for the last
13:05
30 years, which is to say,
13:07
can we push compute to the storage?
13:09
Right. Why are we bringing data
13:11
through effectively a narrow straw that
13:13
is relatively getting narrower and narrower
13:16
because the device capacity for storage
13:18
is increasing faster than the channel
13:20
capacity to pull data out. Why
13:22
can we not think about
13:24
devices as pure storage devices and pure
13:26
compute devices, but have devices that can
13:29
do storage and compute. So you're not
13:31
pulling stuff in and out of the
13:33
device and then pushing it across these
13:35
two separate modes of working with data.
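As an illustration of the difference (a toy sketch of the idea, not any specific device or product), compare pulling everything to the compute side with pushing a cheap filter into the storage layer:

```python
# Toy contrast between the pull-everything model and compute-in-storage.

class Storage:
    """Stands in for a storage device holding some rows."""
    def __init__(self, rows):
        self.rows = rows

    def read_all(self):
        # Classic model: every row crosses the channel to the compute side.
        return list(self.rows)

    def read_where(self, predicate):
        # Compute-in-storage model: a cheap filter runs next to the data,
        # so only matching rows cross the (relatively narrowing) channel.
        return [r for r in self.rows if predicate(r)]

store = Storage(range(1_000_000))

# Pull model: a million values move, then the filter runs on the compute side.
pulled = [r for r in store.read_all() if r % 97 == 0]

# Pushdown model: the same filter runs inside storage; about 1% of rows move.
pushed = store.read_where(lambda r: r % 97 == 0)

assert pulled == pushed
```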
13:38
And so this idea of pushing compute inside
13:40
storage or pushing compute closer to storage has
13:42
been around for 30 years in a variety
13:44
of different forms. Where we are,
13:47
we are spending a fair amount of
13:49
time looking at that. What has
13:51
been missing in all of that work so far? By
13:53
the way, none of that has quite become a reality
13:55
just yet, right? You still have this separation, as
13:57
I just said, even cloud at a high level.
14:00
has the separation principle for a
14:02
variety of reasons, but it's inefficient.
14:04
The reason why a lot of
14:06
these techniques have not had a
14:08
big commercial impact
14:11
is because it's very hard to
14:13
figure out what's the right amount of compute to
14:15
push into the storage without blowing up the cost
14:17
of manufacturing this device. So if I
14:20
said I've got memory or I've got
14:22
flash storage and I want to put
14:24
smart compute inside that, by the way,
14:26
we already do that in many forms
14:29
in practical storage devices that you see
14:31
today. The question is how much compute do
14:33
I put in there? How programmable is
14:35
that compute? And what else can I
14:37
do with that? And that's where all of those
14:39
considerations, because many of these storage devices are very
14:41
low margin devices and if you say I'm gonna
14:44
put five dollars more in a hundred dollar device
14:46
that's way too much. Even a dollar is sometimes
14:48
a little too much. So what we
14:50
are looking at is taking a very fundamental,
14:53
arguably a very theoretical
14:56
and a very academic approach, which is
14:58
to go down and pretend like we were in the 1960s or
15:00
1950s when
15:02
we were just starting to build these
15:04
computing systems. So I'll give you
15:06
an example of a very fundamental technique, a question that
15:08
we are asking. Today if I
15:10
represent a number and store
15:12
that in a digital form, I'm
15:14
going to convert that into a
15:17
two's complement representation and store it
15:19
in that device. For
15:21
the rest of this, I'll make my example be in decimal
15:23
form, right? So imagine I've got four
15:25
digit numbers that I want to store
15:27
and I'm storing the number thousand which
15:29
would be one zero zero zero and
15:31
that's in decimal form. The number two
15:33
thousand three hundred and fourteen would be
15:35
two three one four and
15:37
so on. Now imagine I had numbers
15:40
in there that were like five
15:42
and six and stuff like that and
15:44
if you look at the digital representation of
15:46
that, all the leading digits in that are
15:49
going to be zeros. And
15:51
what we do typically in the computer
15:53
is when you're storing just let's say
15:55
an array of numbers, we store it
15:57
so that we have the first number
15:59
represented in storage first and the
16:01
second number and so on. Now when
16:03
you're searching for these numbers and I say find me
16:05
everything that is less than five,
16:08
I'm actually going to go through all the digits for
16:11
all the numbers before I can find my answer. But
16:14
now imagine we said we're going to represent numbers
16:16
in a totally different way. I'm just going
16:18
to represent the thousands position for
16:21
the number first and keep
16:23
the thousands digit for all the
16:25
numbers together. So if I go
16:27
and fetch some data from memory, I'll
16:29
just get the thousands value
16:31
for each of the numbers first before
16:34
I get the hundreds place, the tens
16:36
place, and the units place.
16:38
And now with that, you can come up with a
16:40
completely different class of algorithms because let's say I've got
16:43
10 numbers and I just look at the
16:45
thousand digit value for that. And
16:47
if all of them are non-zero
16:49
and I'm looking for everything that is
16:51
equal to five or less than five,
16:53
I can simply say, of these 10
16:56
numbers, I don't even need to look at the
16:58
last few digits for any of them.
17:00
I can algorithmically guarantee you that this answer
17:02
is not present in this
17:04
set of numbers. So that's the
17:06
way we are thinking. We are going back to
17:08
early design and say, what's the fundamental encoding
17:10
of numbers? What's the fundamental
17:13
way we want to represent them in
17:15
storage and then can we come up
17:17
with a completely new class of algorithms
17:19
that have algorithmic superiority
17:22
in search compared to existing methods?
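To make the digit-major idea concrete, here is a minimal Python sketch (mine, not from the episode) of storing all the thousands digits together, then the hundreds, and so on, so a search can prune whole numbers from their leading digits alone:

```python
# Minimal sketch of a digit-major layout with early-pruning search.

def digit_major_encode(numbers, width=4):
    """Lay out `width`-digit decimal numbers digit-position-first:
    planes[0] holds every number's thousands digit, planes[1] the hundreds..."""
    digits = [f"{n:0{width}d}" for n in numbers]
    return [[int(d[pos]) for d in digits] for pos in range(width)]

def values_at_most(planes, bound=5):
    """Indices of values <= bound (bound < 10), reading as few planes as possible."""
    candidates = set(range(len(planes[0])))
    # Any nonzero leading digit proves the number is too big to qualify.
    for plane in planes[:-1]:
        candidates = {i for i in candidates if plane[i] == 0}
        if not candidates:
            return set()   # pruned without touching the remaining digit planes
    return {i for i in candidates if planes[-1][i] <= bound}

nums = [1000, 2314, 5, 6, 9040, 3]
planes = digit_major_encode(nums)
print(values_at_most(planes, 5))   # {2, 5}: the values 5 and 3
```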
17:25
So we think that in this space, there are two
17:27
ways we will win and solve
17:30
this long-term data problem. One is by
17:32
rethinking algorithms from ground up to
17:35
be aware that storage and compute can
17:37
go together and I can push
17:39
specific algorithms that require very low
17:41
computational effort and get me this
17:43
benefit. And the second is to
17:45
design what are those computing substrates that
17:48
are low cost, very cheap, and can
17:50
actually be put in an economical
17:52
way in the storage devices. So
17:54
it's a long answer and futuristic, but that's
17:56
kind of the way we are thinking. You're
17:58
imagining, let's imagine it's... research
20:00
problems in that entire space.
20:03
And there are some automated tools that are
20:05
out there to help you with that. This
20:09
also brings to mind some of the lessons
20:11
that we learned from the beginnings of the
20:13
big data era where the common wisdom
20:16
at the time was just throw all the data
20:18
in there, it'll be useful eventually, we don't know
20:20
what we're gonna do with it right now, but
20:22
just keep it all. And now as
20:25
big data has become more widely adopted,
20:27
we have a better understanding of how
20:29
to actually apply useful algorithms and analytics
20:32
on top of that data and the
20:34
regulatory environment has shifted. It's very much
20:36
a matter of only storing the data
20:38
that you actually have utility for
20:40
because otherwise it's going to cost
20:43
you both monetarily and potentially in
20:45
terms of reputation if there's
20:47
a breach or in terms of
20:49
fines if you are violating any
20:51
regulations. And I'm wondering what
20:53
you have seen in terms of
20:55
some of the ways
20:57
that we can design systems to assist in that
20:59
upfront pruning of data rather than just throwing all
21:02
the data in a big black box and hoping
21:04
that we get some value out of it down
21:06
the road. Yeah, no
21:08
great question. I think there's still a lot of that
21:12
which you described which is throw the data in
21:15
and find value later. One of
21:17
the big transformations that has happened is that in the
21:19
past people would say to construct a data analysis
21:21
pipeline, I'm going to
21:24
extract, then I'm going to
21:26
transform it, and then load it into
21:28
a database then start my analysis. Then
21:30
there's a paradigm shift potentially of saying
21:32
I'm going to extract load and transform
21:34
so I don't need to get the
21:37
schema exactly right up front.
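As a rough sketch of that ELT shape (illustrative only; the file and column names here are made up), an engine like DuckDB lets you load the raw files first and impose a schema only when you transform:

```python
# Illustrative ELT flow with DuckDB; 'events.parquet' and its columns are made up.
import duckdb

con = duckdb.connect()

# Extract + Load: pull the raw Parquet in as-is, no upfront schema modeling.
con.sql("CREATE TABLE raw_events AS SELECT * FROM read_parquet('events.parquet')")

# Transform: impose the structure an analysis needs, when it needs it.
daily = con.sql("""
    SELECT CAST(event_time AS DATE) AS day, COUNT(*) AS n
    FROM raw_events
    GROUP BY day
    ORDER BY day
""").fetchall()
```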
21:39
But more realistically now especially when you
21:41
see things like lake houses and stuff
21:44
like that the whole idea is throw
21:46
the data in in some storage subsystem
21:48
which may be structured semi-structured or unstructured
21:50
have some sort of metadata, maybe a
21:52
metadata manager, that could evolve over
21:54
time. And then you're building your
21:56
data analysis pipelines knowing that all of these
21:59
components are not uniform anymore. Maybe,
22:01
for a specific task, I'm trying
22:03
to build a machine learning model to do
22:05
something. I may be looking at some portion
22:07
of the data sitting in a structured database,
22:10
a relational system, maybe Snowflake or
22:12
something like that. But at the same time I
22:14
may have new data that has
22:16
come in and is sitting in Parquet files,
22:18
or maybe even in unstructured files sitting
22:21
in the file system, and I've written some
22:23
sort of custom code in Python to extract
22:25
something from it, blend all of this together
22:27
to get some
22:29
real-time features from that, built into a pipeline. So
22:32
data is everywhere. Having very linear
22:34
ways of seeing data, where it has
22:36
to be processed through one stage before it goes to
22:39
the next, that's often no longer the predominant method
22:41
in many emerging applications. What enterprises want
22:43
is flexibility, so that you can deal
22:45
with data and not have to wait for
22:48
it to be formally loaded into the
22:50
warehouse before you can do things, because
22:52
sometimes the speed with which you're getting
22:54
insights from data that's constantly arriving is
22:57
really the highest value proposition, right? The
22:59
value of an insight sometimes decays with
23:01
time, the longer you have to wait
23:04
for the data to have
23:06
flowed through the process, through the human
23:08
or engineered loop, before we can do
23:10
any analysis with it,
23:12
value is often lost. The
23:14
whole ecosystem is evolving, but it's
23:16
really clear that we want more
23:18
flexible, compositional structures of being able
23:20
to do structured data and unstructured
23:22
analysis. Because analysis today often
23:24
means very traditional types of analysis, stuff
23:26
that people were doing, that business intelligence type
23:28
of stuff, to sort of
23:30
more augmented methods that may use
23:32
machine learning to drive insights, perhaps even
23:34
still in structured form, and then
23:37
the third piece is sort of unstructured,
23:39
where you're dealing with richer sets of data. With
23:42
all of that, one of the big challenges
23:45
is it's becoming harder and
23:47
harder to write analysis pipelines,
23:49
and it's very programmatically
23:52
driven today. So there's been a ton
23:54
of work where people have talked about
23:56
no-code and low-code methods to allow people
23:58
to do analyses of this sort. And
24:01
this is kind of where the other
24:03
spectrum of my research is in using
24:05
GenAI to allow people to generate these
24:07
analysis pipelines, but to do that in
24:10
a way that requires them to write
24:12
no code and use the generative
24:14
AI machinery to actually tell
24:18
the system what to do. And my startup
24:20
DataChat essentially addresses this problem. You point
24:22
it to a data set. We
24:24
work with structured data. You ask the question
24:27
and it produces the analysis for you. And as
24:29
part of that analysis, it may write SQL
24:31
queries. It may write machine learning pipelines. It
24:33
may do a combination of that. It
24:35
may do visualization and present the results to
24:38
you.
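To give a flavor of what such a pipeline does under the hood, here is a hypothetical sketch (not DataChat's actual design; `llm_complete` is a stand-in for whatever model call you use): hand the model the schema and the question, get SQL back, run it, and return the result together with the generated query so the user can verify it.

```python
# Hypothetical question-to-analysis loop; illustrative only.
import duckdb

def llm_complete(prompt: str) -> str:
    """Stand-in for a call to any LLM, assumed to return a single SQL string."""
    raise NotImplementedError

def answer(question: str, con: duckdb.DuckDBPyConnection):
    # Describe the available tables/columns so the model can ground its SQL.
    schema = con.sql(
        "SELECT table_name, column_name FROM information_schema.columns"
    ).fetchall()
    sql = llm_complete(f"Schema: {schema}\nWrite one SQL query answering: {question}")
    # Return the rows and the generated SQL: the recipe is part of the answer,
    # so the user can inspect and verify it rather than trust a black box.
    return con.sql(sql).fetchall(), sql
```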
24:40
So I think, for data in its different forms, there's the time to live for
24:42
data. That's one consideration for sure. People don't want
24:44
to hang on to data forever unless
24:47
they have a reason to. But also there's
24:49
the richness of data and the richness with
24:51
which you need to get insights from the
24:53
data. And there's just so many more tools.
24:56
But there's also the human aspect of it:
24:59
all of that, if it requires
25:01
increasing the human expense to get
25:04
the insights, is unsustainable too. Just
25:06
as it was unsustainable on the hardware end, to
25:08
say I'm going to double my cost every time
25:11
I double the data volume, you can't say I'm
25:13
going to double my human cost for programming if
25:15
I double my analysis needs. That's the
25:17
other end of the spectrum where some of these
25:19
Gen AI tools, and the stuff that we're doing in
25:21
DataChat is one of many examples, is
25:24
the other big challenge for the industry and for
25:26
the field. And in
25:28
that space of user experience, usability
25:31
of these data systems, as we
25:33
get more sophisticated with the types
25:35
of data that we're storing, the
25:38
ways that we're analyzing the data,
25:41
finding the data is always a problem.
25:43
So that's the first step in utility,
25:45
but then understanding what to do with
25:48
it, the semantics of that data, the
25:50
organizational aspects of what does the data
25:52
really mean in the context of my
25:54
business. All of these are barriers to
25:57
a seamless user experience and I'm wondering what are some of
25:59
the opportunities for improving the
26:01
interfaces and the semantic understanding at a
26:04
fundamental level that these data engines can
26:06
contain and some of the ways that
26:08
they can help to give hints to
26:10
the end users without the end user
26:13
having to go and get their PhD
26:15
in data management, just to be able
26:17
to answer a simple question. Yeah.
26:20
I think great questions. I think we
26:22
have three components to it. One is
26:24
today, the whole tooling ecosystem to
26:26
even discover where
26:29
to look in this vast lakehouse
26:31
is non-existent. And I know a lot of
26:33
people are working on it. We have a
26:35
research project at CMU that is just starting
26:37
out to explore some of these aspects. Today,
26:39
it is not uncommon to go to a large enterprise
26:42
and see that they have a warehouse
26:44
or a lake house where
26:46
they might have hundreds of thousands,
26:48
if not a million
26:50
data sets that are sitting around collected
26:53
over time, even though they might have pruned it.
26:56
And in, you know, a data set might be
26:58
a table and that table might have 10 or
27:00
100 columns in it.
27:02
So you're really saying I've got
27:04
millions or tens of millions,
27:06
or maybe sometimes even more schema
27:09
elements, descriptive elements of what's in the data.
27:11
It's not just the data values,
27:13
but just the description of the data is large. How
27:15
do I look? Sometimes it's super complicated even
27:18
saying, what is the profit that I made?
27:20
That's a complicated question. There's
27:22
a financial version of this that
27:24
is the methods that get used
27:27
for reporting purposes for financial statements
27:29
and stuff like that. But then there
27:31
are other descriptions where even something as
27:33
basic as pricing, it's like, do you look
27:35
at as the data is flowing in? Do
27:38
you, if I'm a retailer, do I look
27:40
at all the items that were checked out
27:42
from my cart, but what happens about returns?
27:44
What happens about projected returns? If I'm trying
27:46
to do analysis on orders that were just
27:48
placed, you know, do I expect that 25%
27:50
of that is going to get returned at
27:52
a certain time of the
27:54
year? Like we know that sometimes there's more
27:56
return, the return rate goes up around the
27:59
holiday shopping time. So it's very complicated
28:01
to even define simple things. You don't
28:03
even know where to look. That's the
28:05
first challenge. Second is the semantic complexity
28:08
of saying, how do I manage meaning? Even
28:10
the notion of something as simple as how
28:12
much did I make last week is hard.
28:15
And that's where many of these tools you
28:17
see there's excitement around dbt and a whole
28:19
bunch of semantic tooling mechanisms. That's the
28:21
second component. For the discovery component, there
28:23
really isn't much; for the semantic component, dbt
28:25
and tools like that exist. And then
28:27
you get to that programming layer, all of
28:30
the complexity we talked about. So even before
28:32
you get to that programming level, you're
28:34
exactly right. We don't know where to look
28:36
often. Even when we know where we want
28:38
to look, we need some sort of an
28:40
agreement and be able to communicate across different
28:42
members of the team or different teams in
28:44
an organization as to what's the semantic value
28:47
of the things that we see in the
28:49
database so that we can all be on
28:51
the same page. And then we can start
28:53
to trust the analytics pipeline down the line. So there
28:55
isn't a clean separation between these pieces. Today,
28:57
when you see someone constructing a data science
28:59
pipeline, let's say in a notebook environment, all
29:01
of these are blended in, they are written
29:03
in code. They are not queryable. They are
29:05
not transparent. And it's very hard. If I
29:07
gave you a notebook that is 10,000 lines
29:09
long, that is running a core pipeline, and
29:11
if I'm no longer in your organization, it'll
29:14
be very hard for whoever picks that up
29:16
to understand what's going on in that notebook, because
29:18
all of these things are blended in programmatically and it's
29:20
a mess. And so given
29:23
the fact that there is so much complexity,
29:25
we have gotten to a space where we
29:27
have to work across at least
29:29
two or three different tools and systems just
29:31
to be able to answer a simple question.
29:34
What are some of the forward
29:37
looking design considerations,
29:39
system architectures, and
29:41
platform evolutions that we can look to to
29:43
simplify that aspect where maybe, I think it
29:46
was 10, 15 years ago, we had systems
29:48
like Informatica, where it was an all in
29:50
one vertically integrated solution. Now we've gone to
29:52
the modern data stack where we have a
29:55
dozen different tools, each of which wants to
29:57
own different overlapping pieces of the puzzle. Now
30:00
we're starting to see the pendulum swing
30:02
back the other direction where we are
30:04
recompiling a vertically integrated solution out of
30:06
the individual components of the data stack
30:08
with things like Mozart Data. What are
30:11
some of the ways that we as
30:13
engineers and system integrators should be thinking
30:15
about how to build cohesive platforms, cohesive
30:17
experiences so that our end users aren't
30:20
struggling and spending their entire day just trying to figure
30:22
out what they're supposed to be doing and how? Yeah,
30:25
great question. I think practically
30:27
today from a systems
30:30
architecture, data engineering
30:33
perspective, you want
30:35
to keep the tool ecosystem
30:37
as lean as possible. There's
30:39
this huge tendency to say
30:41
you hit the nail
30:43
right on the head, which is
30:45
a lot of these tools have overlapping components and
30:48
it's so common to see, you
30:50
might have a team of 12
30:53
data engineers. Each one
30:55
will put in their favorite tool. And before
30:57
you know it, you've got a dozen tools
30:59
in the ecosystem and maybe all you needed
31:01
was three or two. And even if you
31:04
boil it down to a few, it's a
31:06
question of how well is that process and
31:09
methodology for using those tools
31:12
set at a systematic process
31:14
level to say what
31:16
will be used when, and how
31:18
do you keep track of all of that, especially as
31:20
all of these tools change over time. So I
31:23
think that's just pretty straightforward engineering 101:
31:25
running a good dev shop,
31:27
good engineering shop, keep it lean, keep
31:29
it clean and only bring in when
31:31
you need to and document everything, have
31:33
processes that go outside that tool integration
31:35
set. The second aspect of it, which
31:37
is a little bit futuristic and goes
31:39
a little bit into where DataChat's
31:41
going. We look at it
31:43
from the other end of the spectrum. We
31:45
say all of this engineering support is a
31:48
means to an end. The end is to
31:50
enable the end user to ask a question
31:52
and get an answer in a way that
31:54
is transparent and reproducible. So more than saying,
31:57
I want to make it easy for someone
32:00
to compose a programmatic pipeline, how about
32:02
we complement that or flip it and
32:04
say, we want it easy for anyone to
32:06
ask any question and get an answer,
32:08
and then get the pipeline that they
32:10
can verify in a way that says,
32:12
this matches the semantics I need.
32:14
That's kind of where we are going:
32:16
to say whether it is data science,
32:18
which includes SQL and machine learning and
32:20
data blending and feature engineering and all
32:22
of that stuff, or visualization, we'll give
32:25
you one UI, one interface,
32:27
which is a chat box. Type your
32:29
question in plain English. Behind all of that, along
32:31
the way, it will give you a recipe: the precise
32:33
steps that document what happened at each step.
32:35
This is the semantic definition that we came
32:38
up with for the definition of profit. You
32:40
can verify it, you can change it. But that's
32:42
the other end of the spectrum, which is that
32:44
the tools will evolve, and if you make
32:47
the management of the tools the
32:49
task of the data engineering team, then you're
32:51
not serving the end user. You could
32:54
also try to come at it from the other end, which is
32:56
kind of what we're doing in DataChat, which is to say,
32:59
blow up this portion of it. Make it easy
33:01
for anyone to ask, but build trust and
33:03
verification into the system so that yes, the
33:05
semantic definition may change, the
33:07
tooling behind it, maybe dbt or maybe just Python code
33:09
right now, may change to something else. But the
33:11
interface that you want to keep constant
33:13
is enabling that end user to
33:15
ask these questions and build that trust
33:17
and verification into the system, because the
33:19
tools will change and they will evolve.
33:21
And given that you are researching both
33:24
sides of this equation of user experience,
33:26
how to improve the utility of these
33:28
data systems as well as the
33:30
scalability aspects, and how do we make
33:32
it so that we can push more data
33:34
through these systems without having to double
33:36
the cost every two years, what
33:38
are the elements of tension that
33:40
exist in answering those two questions?
33:43
And what are the opportunities for
33:45
incorporating those perspectives in the evolution
33:47
of the fundamental platform components that
33:49
we build? Yeah, great questions,
33:51
and there is a big unification across both ends
33:54
of the spectrum. The unification is time. On
33:56
the human side, that is basically the same as
33:58
cost. So I want a fast system to
34:00
deal with the scalability problem on
34:03
the architecture end of the spectrum that we
34:05
talked about. But I want exactly the same
34:07
speed, because if I've got human-in-the-loop
34:09
compute, which is what a lot
34:12
of analytics often is today, then, you know,
34:14
if you fire up a question,
34:16
let's say today in DataChat it's gonna take thirty
34:18
seconds to come back, but if I could have
34:21
a faster hardware-software system that
34:23
could bring that answer back
34:25
in half the time, guess what?
34:27
What do you win on? You win
34:29
on human time. And human time is really
34:31
expensive. So ultimately that cost is the driving
34:34
factor across both. Human time is the same as
34:36
cost. And so that's a unification. Sure,
34:38
you need a fast system to do more, but
34:40
humans are going to be impatient. A lot of
34:42
this analysis has a human in the loop, right? Even
34:44
in ChatGPT, when you punch in the
34:46
question and press enter, if that thing took
34:49
a minute to come back versus five
34:51
seconds to come back, your user experience and
34:53
your ability to use the tool
34:55
to do real work would completely change. So it all comes
34:57
down to this one thing: speed matters. At
34:59
both ends of the spectrum, faster is
35:02
better, for very different reasons, but that's a
35:04
unifying KPI across both of them: faster
35:06
is better. As
35:08
you are conducting your research, you are
35:11
doing it in the context of
35:13
a lab environment with your research group,
35:15
and you're hoping that the outcome of
35:18
this research will have some meaningful impact
35:20
on the industry a number of years
35:22
down the road. I'm wondering what are
35:25
some of the strategies that you use
35:27
to get some sort of real-world
35:29
context around these problems and solutions that
35:32
you're building, to feed that back into
35:34
the research, so that you're doing it
35:36
in a way that is
35:38
directionally beneficial to the outcome that
35:41
you're hoping to achieve. Yeah, that's
35:43
a great question and a tough question.
35:45
My research philosophy has always been work
35:47
on interesting things that are at least
35:50
a few years out. I don't know
35:52
if anyone can see more than
35:54
five years out, but pick something that
35:56
is a two-to-five-year challenge
35:59
you do not the...
38:00
base. All my startups have
38:02
been in conjunction with the university so
38:05
you know I feel like if I'm
38:07
at a university and I do something
38:09
interesting it's because of the university so
38:11
let's play ball with them. Different people
38:13
have different philosophies but there's it's never
38:16
an easy answer there's always discussion there's
38:18
always negotiation there's always contractual stuff and
38:20
lawyers get involved so there's some non-fun
38:22
parts of it. The second part of
38:25
it is that once even in
38:27
academia if you are working on an interesting
38:29
problem, industry is often pretty interested in
38:31
getting engaged with you at an early point
38:33
in time and once you have even a
38:36
crude prototype that you could deploy even in
38:38
a limited setting you always learn things
38:40
that you would have never expected once
38:42
something becomes real and actual users start to
38:45
play with it because people will do crazy
38:47
stuff that even in the wildest imagination you
38:49
can't quite imagine and then all
38:51
of a sudden becomes real and what's super
38:53
interesting is nearly always it'll generate new research
38:55
problems for you to think about that you
38:57
wouldn't have come up with if you had
38:59
just tried to dream about it and think
39:02
about it in your office but you have
39:04
to start by dreaming first right if you
39:06
just go and tell people what do you
39:08
want they may not quite have that so
39:10
it's that combination you have to have a
39:12
dose of practical reality plus
39:14
a dose of aspirational creative
39:16
thinking and you have to have
39:18
both of those parts in any successful research project.
39:21
And as you have been conducting
39:24
your research and working in these different
39:26
startup enterprises what are some of the
39:28
most interesting or innovative or unexpected ways
39:30
that you've seen your research applied? I
39:34
think the most unexpected ways is when
39:36
you start to deal with real workloads
39:38
and real constraints you start to
39:41
realize that things that seem simple
39:43
or trivial actually turn out to
39:46
be really complex so
39:48
just the practical components of making things in
39:50
real life with cost considerations that
39:52
are real, right? Someone's writing a check. If
39:55
you're trying to train an LLM on
39:57
the specific task at hand for example
39:59
stuff like that that we do at
40:01
DataChat, all of a sudden the
40:03
cost component is no longer abstract, you're
40:05
actually writing a check for those hardware
40:07
resources. So you just are
40:09
at a different level where you start
40:12
thinking very, very carefully about things like
40:14
estimating things that are
40:16
going to be actually run and
40:19
developing methods to do that estimation, learning
40:21
how to do low-cost A-B
40:24
testing as you go down
40:26
searching for different architectural configurations
40:28
for the system architecture.
40:31
So very macro level stuff that
40:34
are abstract and potentially
40:36
not interesting in the academic setting,
40:38
but also not realizable in the academic
40:40
setting because you need often large teams of
40:42
engineers to be able to build a big system
40:44
like DataChat is. So those
40:47
are super interesting things that I think are
40:49
very hard, if not impossible to study in
40:51
academia, but are front and center
40:53
and very quintessentially interesting problems
40:55
that show up once things start to
40:57
become real in enterprises and in startups.
41:00
And in your own research that you're
41:03
doing, what are some of the most
41:05
interesting or unexpected or challenging lessons that
41:07
you've learned? Yeah, I
41:09
think the most challenging lesson
41:11
is that don't give up the first time
41:13
you get a negative result, which will happen.
41:16
If you pick a challenging problem, you'll sometimes
41:18
hit your head against the wall maybe
41:21
for years. And if
41:23
the question is still valid, if
41:27
it is tantalizingly important long
41:29
term, you sometimes just have to
41:31
stay at it. It takes patience. And
41:33
sometimes it may take multiple students because students
41:36
come even in the PhD program, they
41:38
may be with you for five or six
41:40
years. And sometimes an interesting problem may take
41:43
longer time than that. And so staying
41:46
with the problem longer than a few
41:49
durations, I know attention spans are getting
41:51
shorter and shorter over time. But sometimes
41:53
the payoff happens when you work on
41:55
something for an extended amount of time.
41:58
And have there been any particular interesting
42:00
or informative dead ends that you've encountered along
42:02
your journey? Yeah, the part that we started
42:05
out with where we are looking at encoding
42:07
techniques and saying let's revisit that. We actually
42:09
started working on it about ten years ago,
42:12
got some good early results, then kept hitting
42:14
a wall and now
42:16
I think we are on to a new line
42:18
of thinking which is along this line of... And
42:21
as you continue to
42:23
work on these hard problems, you
42:25
try to forecast what
42:27
are the solutions that we're going
42:29
to need three to five years
42:32
out as you were saying. I'm
42:34
curious what your heuristic is for
42:36
when a particular research project needs
42:38
to be either killed or put
42:40
into production. Yeah, I think
42:42
putting into production is easy, right? If you have something
42:44
that is interesting and exciting, you pitch it to a
42:47
couple of VCs. You know, first before you pitch, you
42:49
see if you can get your students excited and collaborators
42:51
excited to go spin it out into something like
42:53
a startup. Once you do that, then you go
42:55
and see if you can pitch it to VCs.
42:58
Many of them are extremely sharp. They see
43:00
a lot. They'll be... And if you can't
43:02
get a VC's attention, then there's something probably
43:05
wrong. You missed it, right? Because you
43:07
should be able to convince someone to put money
43:09
into a good idea. And once you have all
43:11
of that, then you can get
43:13
the ball rolling. And academic research
43:15
also requires funding, right? You're trying to convince
43:18
funding agencies to fund you and the VC
43:20
game is different. It has to be more mature by
43:22
the time you get to that. So it's a spectrum. But
43:24
luckily there are well-defined mechanisms to do that.
43:27
But if you can't convince someone, your student
43:29
to work on an interesting, far-reaching, far
43:31
out problem that may seem crazy, or if
43:33
you can't convince a VC to fund you,
43:35
then something's wrong. You have to re-examine it
43:37
and say, how do I refine what I'm
43:39
doing? Am I on the wrong path? Should
43:41
I sunset this or pause this till I
43:44
can get someone else also
43:46
to be more interested in this problem?
43:48
So that's the way I think about
43:50
it. I know there are different ways. You know,
43:52
if you're a pure theory person or a pure math
43:54
person, you could stick to a problem
43:56
by yourself. But for the type of things that
43:58
I do in systems, you need collaborators,
44:01
you need students, you need larger
44:03
teams. So you have to convince someone that it's
44:05
a good idea, and that's, for me, a good
44:07
measure. And as you look
44:10
to the future and you see what
44:12
are some of the problems that you
44:14
are anticipating we're going to have to
44:16
address as we continue to build and
44:18
scale these complex systems and complex data
44:21
challenges, what are some of the areas
44:23
of focus that you have planned for
44:25
the near to medium term or any
44:27
particular projects or problem areas that you
44:29
or someone else should dig into? Yeah.
44:32
And it's looking at the two ends of the spectrum. To
44:34
broaden out: on the architecture side, there's
44:36
just so much diversity of different ways
44:38
to architect storage and computing devices. So
44:40
I'm working with collaborators from other universities
44:43
and at CMU who are hardware folks
44:45
to understand that ecosystem and see what's
44:47
possible. What's the design space? It's vast.
44:49
So there's a ton of work to
44:51
do in that space and lots of
44:53
interesting sub spaces there. On the other
44:55
end of the spectrum where I think
44:58
we are just getting started with all
45:00
of the uses of Gen AI for
45:03
improving human productivity in getting
45:05
insights from systems and things of
45:07
that sort. We're still starting
45:09
to better understand how to use
45:11
these LLMs in
45:14
ways that protect the
45:16
privacy of the data and the
45:18
communication between the platform
45:20
that's using the Gen
45:23
AI technology and
45:25
the application. There's also
45:27
a huge component of what's the
45:29
cost component, are small models the
45:32
future in certain cases or are they still
45:34
quite far out from the large models and
45:36
large models are getting larger and larger.
45:38
There are all kinds of different architectures. So
45:41
lots of interesting stuff in just that space
45:43
of how to economically use, when to use
45:45
what components and just like, you know,
45:47
lots of interesting subspaces, including that data discovery piece
45:49
that I mentioned, we don't know where to look.
45:52
And even when you know where to look, you
45:54
don't know how to use many of these new
45:57
advanced methods, especially in the Gen AI space, because
45:59
that's just moving. So I think
46:01
just anywhere you look, there are lots and lots of pockets
46:03
of interesting components in the two ends of the spectrum.
46:05
I would say the middle is kind of boring. Go
46:08
to the edges. It's wide open. Are
46:11
there any other aspects of the
46:13
research that you're focused on, the
46:15
problem spaces that are still open
46:18
to be explored, or some of the other work that
46:20
you're involved in that we didn't discuss yet that you'd
46:23
like to cover before we close out the show? I
46:26
think there's a huge amount of interest in
46:28
general in terms of saying what's the
46:31
future of LLMs in terms of how
46:33
open should they be? And
46:35
what does openness mean? Is open
46:37
weights open enough? Probably not. I
46:40
think in academia, one of the challenges
46:42
when you're working on some of these large
46:44
LLM models is very few institutions
46:46
have the resources it takes to
46:48
build one of these LLMs from
46:50
scratch in a realistic fashion. Yeah,
46:53
so I think there are lots of research
46:55
problems. And if you especially look at the
46:57
space of Gen AI, there are certain things
46:59
that you can do better in industry today.
47:02
So if you are at OpenAI
47:04
or at Google and have been
47:07
building these large language models now
47:09
for five years, which is an
47:11
eternity, you know all the deep
47:13
system engineering tricks and
47:16
a lot of insights that will never get written in
47:18
papers. It's very hard in academia for someone to go
47:20
and say, I'm going to take on that project. First, you
47:23
don't have those five years of detailed engineering
47:25
tricks that you can use, or the trade
47:29
secrets to go and do things in an
47:31
efficient fashion. Second, it
47:33
takes a lot of resources,
47:35
millions, if not tens or hundreds of millions
47:37
of dollars to build one of these. So
47:40
there's certain components that are
47:42
just very uniquely well positioned right now
47:44
in that exciting space in industry. And
47:46
as academics, it's like, okay, do you go to
47:48
industry and spend some time over there if you're
47:50
deeply interested in stuff like that? Luckily, there are
47:53
lots of interesting far
47:55
outreaching problems that require
47:57
you to start with something
47:59
that might be a large language model and do
48:01
stuff with it. And there's a ton of
48:03
work going on in there, but you know,
48:05
certainly this is kind of unique where in
48:07
the past, it was often the case where
48:09
the deepest core component of some
48:12
new technology was often done in academia.
48:15
You could arguably say building a large
48:17
language model is one of those core
48:19
constructs, and that is better
48:21
done right now, arguably in industry because of
48:24
the resources and all of the large
48:26
engineering teams that you often need to go
48:28
do that stuff, which are available only over
48:30
there right now. So there's a little bit
48:33
of a difference in terms of where things
48:35
go. So you have to, if you're working
48:37
in that space, you have to say, which
48:40
problems can I practically achieve and do in
48:42
academia? And that's sort of a
48:44
new thing for many parts of computer science.
48:47
All right. Well, for anybody who wants to get
48:49
in touch with you and follow along with the
48:51
work that you're doing, I'll have you add your
48:53
preferred contact information to the show notes. And as
48:55
the final question, I'd like to get your perspective
48:58
on what you see as being the biggest gap
49:00
in the tooling or technology that's available for data
49:02
management today. I think data discovery
49:04
is probably the biggest one that comes to
49:06
mind. You know, we do not
49:08
have ways to find out where do I
49:10
even start to look. Absolutely. All
49:12
right. Well, thank you very much for taking
49:14
the time today to join me and share
49:17
the work that you've been doing in your
49:19
research and the ways that you have been
49:21
applying that in the commercial sector. It's definitely
49:24
a very interesting body of topics that you're
49:26
focused on. Definitely glad that you and your
49:28
collaborators are working to improve our capabilities in
49:30
this space. So I appreciate all the time
49:33
and energy that you're putting into that. And
49:35
I hope you enjoy the rest of your day.
49:37
Thank you. Take care. ... at
50:00
dataengineeringpodcast.com to subscribe to the show, sign up
50:02
for the mailing list and read the show
50:04
notes. And if you've learned something or tried
50:06
out a project from the show, then tell us about it. Email
50:09
host at dataengineeringpodcast.com with your stories.
50:12
And to help other people
50:14
find the show, please leave a review on Apple Podcasts and tell