Episode Transcript
0:00
Hi, I'm Asha Tomlinson. And I'm
0:02
David Common. And we're hosts
0:07
of CBC Marketplace. We're award-winning
0:09
investigative journalists that want to
0:11
help you avoid clever scams,
0:13
unsafe products and sketchy services.
0:16
Our TV show has been Canada's
0:18
top investigative consumer watchdog for more
0:20
than 50 years, but
0:22
this is our first podcast.
0:24
CBC Marketplace Podcast is available now
0:27
on the CBC Listen app or wherever
0:29
you get your podcasts. This
0:32
is a CBC Podcast. Hi,
0:38
I'm Nora Young. This is Spark. Over
0:43
the years, we've talked a lot about the
0:45
data-driven turn in AI and how
0:47
a deep learning approach has given us everything
0:49
from image recognition to ChatGPT. But
0:52
what about the ongoing ethical questions about the
0:54
kinds of data machines are learning on? And
0:57
beyond that, what if we're starting
0:59
to run out of data? This
1:01
time, tracking the data limits of AI. Ever
1:17
since ChatGPT took off, Google, Meta
1:19
and OpenAI have been in a race
1:21
to build ever more powerful generative AI
1:23
systems. Systems that rely on enormous
1:25
amounts of data to train them. Especially
1:28
the kind of human-created, high-quality
1:30
data that large language models
1:32
like ChatGPT need to
1:34
produce impressive results. But now,
1:38
there's concern that these companies are running out
1:40
of data to train their new, large language
1:42
models. That high-quality,
1:44
human-produced information is finite. And
1:48
that the internet isn't the endless source of data they
1:50
once thought it was. I
1:52
think that there's a real reason to think that we've
1:54
maybe reached a period of diminishing returns. So a year
1:56
ago, it looked like we were on,
1:58
or maybe going to be on,
2:00
an exponential; things were rising really fast. This
2:03
is Gary Marcus. He's a cognitive scientist and
2:06
leading voice in artificial intelligence. He's
2:08
the author of Rebooting AI: Building Artificial
2:10
Intelligence We Can Trust, and the
2:12
forthcoming book Taming Silicon Valley: How We
2:15
Can Ensure That AI Works for Us.
2:18
Well, I think of large language models as being
2:20
like bulls in a china shop. They're wild, reckless
2:22
beasts that do amazing things, but we don't really
2:24
know how to control them. Back
2:27
in 2022, Gary warned that we
2:29
were nearing this deep learning data wall.
2:33
And he's also written a lot about the limits
2:35
of large language models. They're
2:38
not very good at reasoning. They're not very
2:40
good at planning. They hallucinate, or confabulate might
2:42
be a better word, frequently. And
2:44
there's also an issue that they're very greedy about
2:47
data. And we're running up,
2:49
I think, against the fact that people have already
2:51
used essentially every bit of data they can get
2:53
their hands on. A
2:55
recent piece in The New York Times
2:57
reported that a team at OpenAI, which
3:00
included President Greg Brockman, had actually collected
3:02
and transcribed over a million hours
3:04
of YouTube videos to train their
3:06
GPT-4. Last
3:09
year, Meta also reportedly discussed acquiring
3:11
Simon & Schuster to gain access
3:13
to the publishing house's long-form works.
3:16
I mean, there's almost a desperation about trying to
3:18
get more data. And there's not that much
3:20
more good data. You can always make up
3:22
bad data. You can have ChatGPT, which
3:24
hallucinates or confabulates, make up data. But some of
3:27
that data is not going to be any
3:29
good. So there's actually a concern about kind
3:31
of polluting the internet with bad information. If
3:34
you plotted things on a graph on
3:36
your favorite benchmark, how well are we doing? None of
3:38
them are perfect. But if you took whatever your favorite
3:41
one is and looked at like the difference between 2020
3:43
and 2022, you'd see a huge difference. And
3:47
a huge difference between 2022 and 2023, and you'd say, hey, we're in this period of exponential
3:52
returns. But that
3:54
growth hasn't really sustained. Gary says that GPT-4,
3:57
which came out in March
3:59
2023, was a
4:01
huge and impressive leap. Since
4:03
then, there have been several competing
4:06
models with huge financial investment, time
4:08
investment, and massive amounts of data,
4:10
but they're not really any better.
4:13
While generative AI may have reached
4:15
a point of diminishing returns, Gary
4:17
says that doesn't mean AI itself is
4:19
going to be indefinitely stuck, but
4:22
it does mean we'll need to come up with
4:24
new approaches to how we train these systems. My
4:28
view is this has been a productive
4:30
path, but also a blind alley in
4:32
a certain way. The whole
4:35
notion of these systems is that you
4:37
statistically predict what people would say in
4:39
certain circumstances based on experience, but these
4:41
systems have always been poor at outliers,
4:44
cases that are different from what they've
4:46
been trained on before. We saw this
4:48
whole movie before with driverless cars, where
4:51
I and a couple other people pointed out in 2016
4:54
that you have outliers with driverless cars, unfamiliar
4:56
circumstances, and that the kinds of techniques we
4:58
know how to build in AI now are just not that
5:00
good at those. We
5:02
said, driverless cars might not be as imminent as
5:04
you thought, and lots of people got excited. Investors
5:07
put in $100 billion, but at the end
5:09
of the day, there are still lots of
5:11
unpredictable circumstances, weird placements of traffic cones or
5:13
people with hand-lettered signs that the driverless cars
5:15
still don't do very well with. I think
5:17
we're seeing the same thing with large language
5:19
models. If you ask a question a lot
5:21
of people have asked before, you're probably all
5:23
set. If it's subtly different from a question
5:26
that's been asked before, they might miss that
5:28
subtlety. It's not clear that
5:30
the generative AI systems are ever
5:32
going to be able to deal with
5:34
the unfamiliar in an effective and systematic
5:36
way. That doesn't mean no approach to
5:39
AI will ever get there. I
5:41
think we're in this blind alley
5:43
where it's all statistical approximation, and we
5:45
need systems that are in fact based
5:47
on facts and reasoning. Neural networks
5:49
that are popular right now are basically good
5:52
at something that's a little bit like intuition,
5:54
but they're bad at the deliberate stuff. They
5:56
really can't reason reliably. They can't plan
5:58
reliably. We need some other
6:01
approach to do that. So
6:03
just to explain
6:05
what synthetic data is. Sure, you make stuff up. So
6:07
a great example of this is, I mean, really, truly, I didn't
6:09
mean to be, to ridicule the idea. I mean, it's actually a good idea
6:12
as far as it can take you, but it doesn't
6:14
take you far enough sometimes. So a classic example, I
6:16
would say, is in driverless cars around 2016 or so,
6:18
people started realizing they didn't have enough data
6:23
from actual cars and they started making up
6:26
data in different ways. So, I think they
6:29
started making up data in video games like Grand Theft Auto and
6:31
sometimes their own version of
6:38
that. So you would have a simulated car in some
6:40
weird circumstance and try to get data
6:42
from that in order to feed the system. There's a whole company
6:45
that's, I think, Canadian-based that's
6:47
trying to do that. And there are probably multiple companies that are
6:49
trying to do this in various ways. And I would say it's
6:51
helped, but I would say it
6:54
hasn't helped enough. And it's partly because you
6:56
don't know which data to store and
6:58
you don't know which data to simulate. In the
7:00
real world, there are many, many instances where nobody
7:02
anticipates the data that you might need. So if
7:04
you can anticipate exactly what people are going to
7:07
need, you could do that. It would be a
7:09
really stupid use of a large language model to
7:11
make it do arithmetic because they're just not very
7:13
good at it. But you could say, well, they're
7:15
not very good at it, but if I give
7:18
them more data, they'll be better. And so you
7:20
could synthesize all the math data that you want
7:22
in principle and you could improve it to some
7:24
extent. But, for example, if you're
7:26
dealing with irrational numbers, there's just never
7:28
going to be enough synthetic data. You're
7:30
not really going to solve that problem
7:32
that way.
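To make that concrete, here is a minimal sketch of synthesized math data: programmatically generated arithmetic with guaranteed-correct answers. It illustrates the general idea, not any lab's actual pipeline.

```python
# Hypothetical illustration: synthetic arithmetic training data.
import random

def make_arithmetic_examples(n=1000, max_operand=10**6):
    examples = []
    for _ in range(n):
        a = random.randint(0, max_operand)
        b = random.randint(0, max_operand)
        op, result = random.choice([("+", a + b), ("-", a - b), ("*", a * b)])
        examples.append(f"{a} {op} {b} = {result}")
    return examples

print(make_arithmetic_examples(3))
# You can generate as much of this as you want, but as Gary notes,
# some problems (e.g. exact arithmetic on irrational numbers) can
# never be covered by a finite list of examples.
```

Synthetic data has been compared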
7:35
to the computer science version of inbreeding. What
7:37
do you make of that analogy? I
7:39
think there's something even more like inbreeding,
7:41
which is what Ernie Davis and I
7:43
once called the echo chamber effect, which
7:45
is having the models train on their
7:47
own output or having Google train on
7:49
open AI's output. So it is a
7:51
kind of inbreeding that's going on where
7:53
these models are making synthetic data and
7:55
then training on that. And so errors
7:57
get in there. Like, a crazy one
8:00
was, somebody asked one of these systems, I
8:02
might get the details wrong, but I think
8:04
asked OpenAI, how many
8:06
African countries begin with the letter K,
8:09
and it said none. And then,
8:11
sorry about that, Kenya. And
8:13
then Google trained on
8:15
OpenAI's output. So that's a kind of inbreeding where
8:17
the one system is training on the other and
8:20
the whole quality of the information ecosphere is going
8:22
down because then other people ask and that error
8:25
percolates.
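A toy way to see how an error like that percolates: simulate generations of models, each trained on a sample of the previous generation's answers. This is a neutral-drift illustration, not a model of any real system.

```python
# Toy "echo chamber" simulation: each generation trains on a sample
# of the previous generation's output. Purely illustrative.
import random

answers = ["Kenya"] * 5 + ["none"] * 95   # the correct answer starts rare
for generation in range(10):
    answers = [random.choice(answers) for _ in range(100)]
    share = answers.count("Kenya") / len(answers)
    print(f"generation {generation}: correct-answer share = {share:.2f}")
# Resampling is unbiased on average, but a rare correct answer tends
# to drift out of the pool entirely, and once gone it never returns.
```

Again, these are kind of like contrived test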
8:27
examples, we call them red teaming. But they're so
8:29
easy to generate that we're sure that they're happening
8:32
in the real world which parenthetically
8:34
points to something else, which is transparency. We don't
8:36
actually know how these systems get used in the
8:38
real world because the companies don't want to share
8:40
it. And governments should actually
8:42
be demanding logs. Like for example, do
8:44
people use these systems to make decisions
8:46
about jobs, loans, prison sentences? There was
8:48
just a study that showed, in carefully
8:50
controlled circumstances, if you speak to them
8:52
in African American English, you get a
8:55
different set of answers than if you
8:57
speak to them in standard English. So
8:59
we know this from the lab, we would like to
9:02
know does this happen in the real world. We don't
9:04
have that transparency right now. So the examples I give
9:06
you are a little contrived, but they show in principle
9:08
this kind of inbreeding thing that we call the echo
9:10
chamber effect and so forth. So we know
9:13
from kind of doing science as best we
9:15
can on the limited data that's available that
9:17
there are all these serious problems. And that
9:19
we don't know how far they go in
9:21
the actual world. Just to
9:23
throw out one case where we do know
9:25
in the actual world, there was a piece
9:27
in the New York Times today showing that
9:29
in the case of child porn, there's so
9:31
much of it being created by generative AI
9:34
that one of the nonprofits, I guess, that
9:36
tracks it, is overwhelmed now because suddenly there's just
9:38
so much out there. So sometimes we have
9:40
some way of measuring in the real world what's
9:42
going on and sometimes we don't. Yeah,
9:44
but this is what I've wondered is even if we're not using
9:47
sort of specifically synthetic
9:50
data to train, if
9:53
we have these systems that are generating content
9:55
and that's filling the internet, doesn't that mean
9:57
a lot of the data that gets used
9:59
to... train next generations of models isn't
10:01
going to be human-created anyway? Well,
10:04
I mean, what's happening is the companies are stealing from
10:06
each other. And so the
10:08
stuff that they're stealing is no longer
10:11
pure. I mean, we always
10:13
have problems with people generating misinformation for
10:15
political reasons and so forth. But
10:17
the situation has gotten worse because there is
10:19
this mad craze for more data. So one
10:21
of the ways in which people get data
10:24
now is they use each other's models. And
10:26
the terms of service tell them not to
10:28
do that, but they've all violated each other's
10:30
terms of service. So YouTube doesn't say that
10:32
OpenAI can use their data, but apparently GPT-4,
10:35
maybe Sora, were trained on it.
10:37
So you have this kind of
10:39
mad mess of recycling each other's
10:41
data rather than what you really
10:43
want is like authentic human
10:45
created data from like the New
10:48
York Times, ideally licensed, where
10:51
some human writer has written an article, some
10:53
fact-checking team has verified it, or you want,
10:55
you know, the Britannica, where there was hard
10:58
work, or Wikipedia. They are taking Wikipedia, but
11:00
they're taking all this other garbage too. And
11:03
I mean, there is this old saying in computer science, like
11:05
somebody should remember this: garbage
11:07
in, garbage out, right? And
11:10
the proportion of garbage is going up.
11:32
You are listening to Spark. Everything
11:35
is a sort of a fun house. Nothing
11:37
is as it ordinarily is. And
11:41
all possibilities are open
11:43
to exploration. This is
11:45
Spark. From CBC. I'm
11:57
Nora Young, and today on Spark we're talking about the
11:59
limitations of our current approach to data intensive
12:01
AI and the ways AI giants
12:04
are trying to get around the data wall.
12:06
Right now my guest is Gary
12:08
Marcus, a cognitive scientist and founder
12:10
of Robust AI and Geometric Intelligence.
12:12
He says there's both an underlying
12:14
technical problem and business problem when
12:16
it comes to all the competition
12:18
and hype around AI right now. So
12:22
the technical problem is the kind of AI that
12:24
we know how to build now, which I think
12:26
will look laughable 30 years from now. Like old
12:29
flip phones look a little bit laughable to us now.
12:32
It's very greedy in terms of how much data
12:34
it uses. And I pointed this out in
12:36
2018, I think people ignored me, but that's now
12:38
coming home to roost. It is
12:41
changing the moral fiber of these companies
12:43
and it's maybe leading to the diminishing
12:45
returns and so may undermine the whole
12:47
project. So on the technical side, these
12:50
systems just aren't as efficient with data
12:52
as human children. I have a 9- and an
12:54
11-year-old; show them something once and they
12:56
understand it, they can put it to use. You
12:59
show them the rules of a new game and they
13:01
get it. These systems need a lot of data for
13:03
most of what they do. And I
13:06
don't think that's anywhere near the limit of what we
13:08
could do with AI. It's just the limit of what
13:10
we know how to do with AI today. Just like
13:14
we didn't know how to build efficient
13:16
gasoline engines or electric motors once
13:18
upon a time and we learned to make
13:21
things more efficiently, sometimes by changing the entire
13:23
structure. In this case, I think the entire
13:25
algorithm is just not the right way to
13:27
do things efficiently. It's just built as a
13:30
way of mimicking things, not as a way
13:32
of deeply comprehending things. And the reason my
13:34
kids are so much more efficient is
13:36
they build models of the world and how
13:38
it works, causal models of what
13:41
supports their weight or why this thing
13:44
works this way in this game. And
13:46
these systems just aren't really doing that.
13:48
So there's the technical limitation that then drives
13:51
a business thing, and the business thing
13:53
is complicated. It starts with the fact
13:55
that people think there's a lot of money to be made,
13:57
which may not actually be true. We might want to talk
13:59
about that. But there is a widespread
14:01
belief that many people are acting on that
14:04
there's a ton of money being made and
14:06
so people are rushing. They want to be
14:08
first or more prominent. They want to be
14:10
Coca-Cola rather than Pepsi. And so that's driving
14:12
things. And then the fact
14:14
that there's no known method
14:16
for doing better besides getting more data
14:19
has led to this mad dash for
14:21
data which has led to a lot
14:24
of copyright infringement to companies doing a
14:26
lot of really shady things. And so
14:28
a bunch of these companies actually started
14:30
out wanting to do AI ethically and responsibly. And
14:32
now they're kind of like screwing artists and writers
14:35
left, right and center. They've kind of lost their
14:37
moral compass and a lot of the loss of
14:39
that moral compass has really been driven around the
14:41
mad dash for data. Like they've kind of forgotten
14:43
where they came from and what they're supposed to
14:46
do. Like I have lost my faith in a
14:48
number of companies over the last year and a
14:50
half and a lot of it is the things
14:52
that they have done to try to get ahead
14:54
in this race. So what
14:57
would it take for generative AI to
14:59
make real progress from where we are
15:01
today if there's a diminishing return? My
15:03
view is generative AI is not, to paraphrase
15:05
Star Wars, the droids we're looking for. That
15:08
generative AI is almost like a mirage. I mean
15:11
you can use it for some things but a
15:13
lot of things that people wanted to use it
15:15
for are not reliable. And I
15:18
think AI is much harder than a lot of people
15:20
think. Like I don't think it's an impossible problem. You
15:22
know our brains are essentially computers. I know a lot
15:24
of people get mad but I think that's correct. But
15:27
our brains do a lot of
15:29
amazing things. They also make mistakes. They
15:31
could be improved upon. But our brains
15:33
are capable of approaching new problems adaptively
15:35
and flexibly. That's what I think the
15:37
center of intelligence is. This particular algorithm
15:39
just isn't. It's popular but I think
15:41
it's on the wrong track. I think
15:43
when we look 20 years
15:45
from now, look back at 2024, we're
15:47
going to say, well, in that era
15:49
people figured out one thing which is how
15:51
amazing AI could be, how it could spectacularly
15:54
transform our lives but they didn't really know
15:56
how to do it. In fact, they spent
15:58
too much time on that one thing, which kind
16:00
of stifled research into anything else. They
16:02
put in billions and billions
16:04
of dollars and this other thing that
16:06
got developed in 2030 or whatever
16:08
it is, I wish they could have
16:11
developed it sooner because if we had this technology in
16:13
2025 instead of waiting until 2035, a lot
16:15
of lives could have been saved
16:19
because it was so good at solving medicine
16:21
and so forth. But people were
16:23
obsessed with the wrong tool. They didn't recognize it
16:25
was the wrong tool. You've
16:28
argued for something more like a hybrid approach. Do
16:30
you think that that's the path forward where we're
16:32
using generative AI for the things that generative AI
16:34
is good at and we're using things that have
16:37
more of a semantic understanding of the world around
16:39
them together in the same system or that we
16:41
triage problems and separate this
16:44
is a generative AI problem and this is not?
16:46
I think we need to do a lot of
16:48
that. I wrote in 2018 about deep learning, which
16:50
generative AI is a form
16:52
of. I said it's one tool among many. We
16:54
shouldn't throw it away, but we
16:57
have to understand it's one of a large complement of tools.
17:00
It's like if somebody was building a house and they
17:02
discovered power screwdrivers and they'd be like, these are amazing,
17:04
but that doesn't mean you want to forget
17:07
that you have hammers and chisels and you might need
17:09
to build a custom tool for this one thing that
17:11
you do a lot. I mean, that's
17:13
kind of what's happening right now. It's like
17:15
the best power screwdriver ever invented. It really
17:17
is amazing. I mean, I'm often criticizing, but
17:19
it's amazing. There's a question about it. It's
17:21
amazing. The question is, is it the right
17:23
tool for the job and which jobs is the right tool for? Ultimately,
17:27
if you want a general intelligence that
17:29
can be like the Star Trek computer, that's
17:31
reliable. You can trust it with whatever kind
17:33
of problem you want to pose, you're going
17:35
to need something that has a broader array
17:38
of tools. I love the word semantic. It's
17:40
not common in these kinds of conversations, but
17:42
it's right. The semantics, the comprehension, the
17:45
meaning in generative AI is
17:47
very limited. Symbolic
17:49
AI, although it's limited in other ways,
17:51
is better at representing semantics, the
17:54
meanings of things, reasoning about those relationships.
17:56
We're certainly going to need elements of
17:58
both. I don't think ... that's enough. I
18:00
wrote an article called The Next Decade
18:02
in AI which came out just before
18:04
the pandemic and the argument I
18:07
made there was that we need this thing,
18:09
hybrids, called neurosymbolic AI but that that's itself
18:11
only part of the solution. So we also
18:13
need a lot of knowledge. We need better
18:15
reasoning techniques. We need our systems to build
18:17
models of the world in the way that
18:19
you do when you go to a movie
18:21
and you learn about each character and their
18:23
motivations and what their setting is; you build
18:25
an internal model of what's going on there.
18:28
Current systems don't really do that in a
18:30
careful and robust way. So you can't kind
18:32
of ask them what's going on. They can't work
18:35
on that. So I said we need to
18:37
tackle four different problems. One of them is this
18:39
hybrid that you're talking about and I've devoted
18:41
a lot of my career to. And even
18:43
on the hybrid I would say we kind
18:46
of sort of know what that might look like
18:48
but not exactly. There's a lot of best practices
18:50
we have to learn and we're
18:52
kind of mostly ignoring that right now. There
18:54
was a very nice paper by DeepMind last
18:56
year that was a neurosymbolic approach to math
18:58
problems that could solve some International Math
18:58
Olympiad problems, called AlphaGeometry. So there's
19:03
a bit of work in that area
19:05
but it's underfunded compared to the rest.
19:07
So we've probably as a field put
19:09
in close to $100 billion, certainly well
19:11
over 50 on the neural
19:13
network side and the rest of it's getting like
19:15
2% of that
19:18
or something like that. You could
19:20
think like an investor who wants to diversify their
19:22
holdings. They want some stocks, some
19:24
bonds, some real estate. Right now
19:27
there's an intellectual monoculture in AI where only
19:29
one idea is being pursued hard and that
19:31
idea is generative AI. We need some other
19:33
ideas to flourish before we get to I
19:36
think AI that we can trust and that really
19:38
is transformative in the way that we're all hoping.
19:40
So do you think that, given that, hitting
19:43
a kind of data wall might
19:45
be a good thing, at least temporarily? Yeah.
19:47
I mean there is
19:49
a sense in which I think that's right. Right
19:51
now people are resisting. They're saying well give it
19:53
another year, another two years. Some people may kind
19:55
of stick to the wrong horse for a really
19:57
long time. We'll see. I
20:00
think hitting a wall might actually
20:02
turn out to be good in just the
20:04
way that you're saying because it might force
20:06
us to a more reliable, more trustworthy substrate
20:08
for AI. There's a saying or a phrase
20:10
in the field that the current stuff that
20:12
we have, they're called foundation models, but they're
20:14
a terrible foundation, right? The point of a foundation
20:16
in a house is you build the rest
20:18
on it, and you know that it's going
20:20
to be stable. And what
20:22
we have now is an unstable foundation.
20:24
If what it takes to get people
20:27
to widely acknowledge the instability of that
20:29
foundation is a period of
20:31
slower progress so that we kind of finally
20:34
say, hey, we're not quite doing this
20:36
right, what else can we do? Then
20:38
yeah, a short-term slowdown might lead to
20:41
a longer-term acceleration and a longer-term more
20:44
stable way of doing AI. A lot of people
20:46
think that I hate AI and it's not true.
20:48
It's not at all true. You hate it. I
20:51
really don't, right? I mean, I built an AI
20:53
company and sold it. I've been working on it since
20:55
I was eight years old. I actually love AI. I
20:57
spend most of my discretionary time
20:59
thinking about AI. I mostly don't even do this
21:01
for pay. I mostly just want the world
21:03
to be in the right place. But I really
21:05
do kind of hate the way that generative AI
21:08
has been positioned. Like as a lab curiosity, it's
21:10
fine. People should look at different
21:12
approaches, but it is so
21:14
much sucking the life from everything else and
21:16
it is so unreliable that it's just not
21:19
a good way to do AI. OpenAI is
21:21
like instead of like saving lives, it's mostly
21:23
in the near term going to be used
21:25
to surveil people. OpenAI wants
21:28
to suck up all your documents and
21:30
your calendar entries. It's going
21:32
to be like the greatest surveillance tool ever made,
21:34
but that's not why I went into AI. OpenAI
21:38
CEO Sam Altman said at a conference last
21:40
year that we were coming to an end
21:43
of the era where we keep relying on
21:45
these giant data models and that we'd make
21:47
them better in other ways. So do you
21:49
think that the kinds of limitations
21:51
in the current approaches to generative AI are
21:54
acknowledged within the AI community? Well, I mean
21:56
it's hilarious that he said that because when
21:58
I first said that... in
22:00
2022. He posted on Twitter a meme that
22:03
looked like my article, Deep Learning is Hitting
22:05
a Wall, saying, God, give me the strength,
22:07
or something like that, of the mediocre deep
22:09
learning skeptic. So he came after me hard
22:11
for saying this stuff, but I think he's
22:13
come around. I think a few people have
22:15
come around. I think people who have really
22:18
looked at the problem of what intelligence is
22:20
almost uniformly recognize how far away we actually
22:22
are. Gary, thanks so much for
22:24
your insights on this. Sure. My pleasure. Gary
22:27
Marcus is a cognitive scientist, entrepreneur
22:29
and professor emeritus at New York
22:31
University. His forthcoming book is called
22:33
Taming Silicon Valley. It's out September
22:35
24th, 2024. You
22:47
are listening to Spark. Democratizing
22:49
culture to me means not
22:51
just letting us shout into
22:54
the void of the internet. This
22:57
is Spark with Nora Young on
22:59
CBC Radio. On
23:08
last week's show about tech and
23:10
music, Enongo Lumumba-Kasongo talked about technological
23:12
transformation in the history of hip
23:14
hop. Enongo is an assistant
23:17
professor of music at Brown University. We
23:19
had such an engaging talk, but we didn't
23:22
have time for it all. So we decided
23:24
to play more from that conversation, especially because
23:26
it speaks directly to how data gathered from
23:28
hip hop artists' work is used by generative
23:30
AI and the ethical problems that
23:33
poses. It also lets
23:35
us reflect not only on how AI challenges
23:37
what music is for, but also
23:39
the importance of lived human
23:41
experiences. The
23:47
thing is, our music prof is also a rapper. And
23:53
I go by the name Sammus when I'm performing. I
24:01
started making beats in high school. In part,
24:03
I wanted to score a video game because
24:05
I love video games. And so
24:08
my older brother showed me how to
24:10
make beats on my laptop. And from there, I
24:12
started making these sort of little songs.
24:14
And then eventually that expanded into
24:16
me rapping over those songs. I
24:19
wasn't formally musically trained. So I felt like, OK, I
24:21
know how to make beats. And I have my voice.
24:23
What can I do? And
24:25
so rap became this really awesome mode for
24:28
me to be able to share things that
24:30
I was thinking were important.
24:32
[a Sammus rap verse plays]
24:36
In 2022, Enongo wrote
24:39
a piece for Public Books where she explored
24:41
the emergence of high tech blackface
24:43
and digital blackface, the
24:46
idea that digital technologies allow non-black
24:48
people to adopt the personas of
24:50
black artists online. One of
24:52
the examples she highlights is the case of
24:54
FN Meka. So
24:57
FN Meka had this almost
24:59
like Icarus tale of rise
25:01
and fall. So a set
25:04
of kind of creative technologists, or
25:06
really only one sort of entrepreneur
25:09
and another creative technologist, I think
25:11
around 2019, 2020 started developing the
25:13
idea to
25:16
create a kind of rap
25:18
avatar who would take on
25:21
rap or hip hop mannerisms, and
25:24
promote music, and be sort
25:26
of the first quote unquote AI
25:29
rapper. And I say AI rapper
25:31
in quotes because it was not
25:33
actually ever made clear how AI
25:35
was being engaged in this context,
25:38
but it was clearly important for
25:40
the developers of this character to
25:43
place AI in dialogue with the
25:45
way that this character was being
25:47
developed. There was a recognition that
25:49
this signals, at the very least,
25:51
that there's a kind of innovation
25:54
happening here that other musicians and
25:56
record labels will want to sort of invest in.
25:58
And so this character of FN Meka
26:01
started putting out music, which we later learned was
26:03
actually recorded by a black rapper
26:11
named Kyle the Hooligan. He
26:15
was sort of voicing the character but
26:17
was not properly compensated. And
26:20
this was the voice of FN Meka.
26:22
And he was sort of developing a
26:24
presence online on Instagram and on TikTok,
26:26
kind of performing this
26:28
prototypical rap persona where, you
26:30
know, he has lots of
26:32
cars and lots of jewelry.
26:35
And questions started to emerge
26:37
around who was the creative
26:39
force behind this avatar, right?
26:42
And I think part of that awareness has
26:44
been this understanding in the digital age
26:47
that stepping into black personhood is
26:50
particularly kind of easy through
26:52
some of the forms of the digital world.
26:55
And so there was an already kind
26:57
of a caution and suspicion on the
26:59
part of listeners and, you
27:01
know, folks who would be in
27:03
that space. Despite those
27:06
suspicions and its ethically dubious
27:08
origins, FN Meka's popularity
27:10
continued to grow with over one
27:12
billion views on TikTok and millions
27:14
of followers. And then in 2022,
27:16
the AI rapper was signed
27:20
to Capitol Records, the first time an
27:22
AI-generated musical artist was signed to a
27:24
major record label. And
27:26
was subsequently dropped within months
27:28
of being signed because so
27:30
many people responded with
27:32
concerns about what sort of image
27:35
of a rapper this avatar was
27:38
conveying. And again, questions about
27:40
transparency. Who is making decisions
27:42
about who this AI
27:45
or avatar rapper is
27:47
sort of how he moves through
27:49
the space and how he's understood. I think
27:51
there's a lot of healthy suspicion that this
27:54
was sort of a cash grab that was
27:56
not invested in the actual communities from
27:58
which the art form
28:01
and even the mannerisms were sort
28:03
of coming from. Yeah, yeah. And
28:05
you've argued that this is part of a long
28:07
history of black sound. Can you dig into that a little
28:09
bit for me? Absolutely. So
28:11
Matthew D. Morrison, who's a
28:14
musicologist, really brilliant thinker, has
28:17
asked for us to think
28:19
about the context of how
28:22
we engage with the work
28:24
and material of black
28:26
musical artists in our contemporary moment
28:29
by thinking back to the formation
28:31
of the music industry, particularly within
28:33
the US context. And so he
28:36
asks us to think about the
28:38
emergence of blackface minstrelsy,
28:40
which is this racist theatrical form
28:42
that emerges in the 1820s
28:46
and involves the
28:48
performance caricaturing of
28:51
enslaved Africans as well as free
28:53
black folks by white performers
28:55
who would don black face paint
28:58
and step into these caricatures of
29:00
these figures. And it was a
29:02
way not just to
29:05
express kind of fear and
29:08
revulsion around relationships
29:11
to black folks in
29:13
the US. It was also a way
29:15
to transgress and play with some of
29:17
the sort of gendered and class hierarchies
29:20
that were emerging at that time as
29:22
well. And so I think that dialectic
29:24
is really important to note because when
29:26
we think about digital black face, it's
29:29
not about sort of just mocking or
29:32
playing with representations of blackness
29:34
that are about demeaning black
29:36
folks, right? In a lot of
29:38
ways, these representations are ways that
29:41
non-black people can play with
29:43
transgression or trying new
29:46
modes of expression without
29:48
having to sort of deal with
29:50
the consequences of what that might
29:52
look like without doing so in
29:54
the body of a figure that
29:56
is commonly understood as transgressive just
29:58
as a matter of fact.
30:00
And so there's a kind
30:03
of play that's happening there that's
30:05
really harmful because folks get to
30:07
step in and out of presentations
30:09
and performances of black modes of
30:11
expression and thought without having to
30:13
deal with how being black shapes
30:16
one's life outside of that
30:18
context. You
30:21
know, it seems to me that in the sort of popular
30:23
conversation around this, there's been a lot of focus
30:25
on extremely high profile artists, people like Drake or
30:27
The Weeknd, you know, whose
30:29
voices and likenesses are being used. But ultimately,
30:31
who do you think really stands to
30:33
lose in all this? I
30:36
mean, it's interesting because like you said,
30:39
the way in which this is
30:41
sort of unfolding, the people who
30:43
are at the moment the most
30:45
vulnerable when I think about these
30:47
kind of AI voice filters where
30:49
folks are able to really sound,
30:51
you know, like audio deepfakes to
30:53
really step into the sound of
30:55
a Drake or The Weeknd, you
30:58
know, because they have this kind of
31:00
cultural cachet built into the timbre of
31:02
their voice, it enables
31:04
people to step in and
31:07
to generate capital and clout
31:09
because their voice means something. So for
31:11
an artist who's just starting out, their
31:14
voice doesn't mean what Drake's voice means,
31:16
just the sound of it, right? Just
31:18
the sound of it is doing something
31:20
important. And so I think in many
31:23
ways, artists who are, you know, at
31:25
that sort of upper echelon, they're really
31:27
vulnerable because their voice, A, is
31:30
everywhere. Yeah, a lot of
31:32
training data there. So much, there's
31:34
so much material. And B,
31:37
their voice has a kind of value
31:39
pop culturally. I mean, I think
31:41
about the ways that when an
31:44
artist features on another artist's track,
31:46
the excitement about hearing these two
31:48
voices be in conversation
31:50
because this voice is meaningful to
31:53
us. So it's not
31:55
as, I think, overtly
31:57
destructive in the more
32:00
deep DIY spaces, or the spaces where
32:02
an artist hasn't yet developed a voice
32:05
or a timbre of a voice that's
32:07
recognizable. But again, I think
32:09
how that impacts artists who are
32:11
sort of on the underground is
32:14
that when we think about the possibilities
32:16
for how working musicians can
32:18
build a life, it's very,
32:21
very difficult at this moment to be
32:23
a working artist. I think every single
32:25
rapper friend that I have or music,
32:28
you know, just more generally folks who
32:30
work in music, they have
32:32
like five hustles. I mean, I myself
32:34
am a professor, and I'm also a
32:37
rapper. And, you know, I value
32:39
and appreciate being in academia and having
32:42
these conversations. And in
32:44
part, this has been a strategy to be
32:46
able to build a sustainable art practice, because
32:48
were I to just be actively pursuing music,
32:50
I would be subject to the whims of
32:52
the market. And that's a really, really difficult
32:55
position to be in as an artist. And
32:57
as an artist who doesn't want to just
32:59
make whatever is profitable on the
33:01
radio, like this is a really,
33:03
really difficult position to be in. And
33:05
so with the advent of AI in
33:08
the music space, again, I think about
33:10
questions of risk and who can afford
33:10
to absorb it when creating new kinds of sounds
33:12
or trying to make it. My worry
33:17
is that artists who are just starting
33:19
out or who are, you
33:21
know, creeping around the DIY basement
33:23
space, is that they don't even
33:25
see a possibility or a way
33:28
forward. Because what the sort of
33:30
large record labels do impacts what
33:32
the middle tier record labels do and
33:34
who they invest in. And
33:36
if the sort of Warner Music Groups
33:38
of the world are reflecting the message
33:41
that it's not really worth investing in
33:43
real human artists, and instead, maybe what
33:45
we should do is invest in tools
33:47
that enable us to take
33:50
on the personhood of artists, artists who we
33:52
don't then have to be accountable to in
33:54
the ways that we have to be accountable
33:56
to human artists. You know, I
33:58
can see that impacting the decision-making on
34:00
the part of everyone else
34:03
in the music industry. So I
34:05
think I'm worried about the culture
34:07
around how we view the work
34:09
of being a musician, that it's
34:12
devalued in this process. And that
34:14
devaluation actually significantly impacts
34:16
who sees themself as being
34:18
able to pursue a life
34:20
as an artist. Yeah.
34:22
Well, no, just from a technical point of
34:24
view, I mean, what do you
34:26
make of their ability to replicate sounds
34:28
from different genres, different forms of music?
34:32
I think that the tools that
34:34
I've engaged with, there's
34:37
a range of levels of sophistication.
34:39
So for example, if I were
34:41
to go into ChatGPT and
34:43
say, write me a rhyme in
34:45
the style of Sammus, myself. And
34:49
it'll generate this pretty
34:51
mundane, childish rhyme that
34:53
has a really not
34:56
particularly innovative rhyme scheme. There's
34:58
not sort of like metrical complexity
35:00
to it. And the material
35:03
itself reflects sort
35:05
of like a shadow of who I am
35:07
as a rapper generally based on what exists
35:09
in the world. So a lot of
35:11
my music deals with metaphors around technology
35:13
and video games. And so there's some
35:16
reflection of that being important to me.
35:18
But it's very unspecific
35:21
and not particularly compelling. However,
35:24
with some of these sort
35:26
of tools that
35:28
allow folks to use
35:30
AI to create a filter for a
35:33
particular person's voice so they can rap
35:35
as themselves and then sort of put
35:37
this filter on so that it becomes,
35:39
as we've heard, Drake or The Weeknd,
35:42
that enables you to step into
35:45
the kind of flow and real
35:47
expressive qualities of what makes a
35:49
rap song, a rap song, or
35:51
what makes a rap interesting. So
35:54
the level of sophistication there, I
35:56
think, is troubling. And,
35:59
sort of, on a technical level, I
36:01
think we're moving into a space
36:03
where it will become really, really
36:05
difficult to kind of figure out
36:08
who's authoring what. And actually, it's really interesting.
36:10
We're seeing that happen right now with Drake,
36:12
who's in a bit of a beef with
36:14
a number of different artists. And
36:17
very, very recently, a track
36:19
was released and a real
36:21
discourse online was, is this
36:23
diss track an AI track?
36:25
Like, did Drake actually write
36:27
this track? And there's so
36:29
many implications around that. You
36:32
know, if Drake says, I didn't write this
36:34
track, like if it is an AI track,
36:37
the next thing that he writes will be
36:39
compared to this other AI track. So as
36:41
an artist, he's kind of having to interface
36:44
with this shadow version of himself. But
36:46
then there's also the misinformation elements
36:49
of this where, you know, with a
36:51
diss track, or in the context of
36:53
a beef, this can have real implications
36:55
for people's relationships with the other people
36:57
in the music industry or with their
37:00
peers. And if it's not clear, whether
37:02
this was generated by some outside force
37:05
or by the artists themselves, it can
37:07
start to get really challenging interpersonally. So
37:09
we already see how
37:11
it's manifesting in the public sphere. Yeah,
37:13
I mean, historically, people have used songwriting
37:16
as ways to sort of, you
37:18
know, document their lives, to
37:20
work through their feelings and their thoughts.
37:23
Does generative AI for music come
37:25
into conflict with that history? Like,
37:27
and the importance of just lived
37:29
human experience in that type of
37:31
storytelling? Absolutely. And I think
37:33
that there's a particular way in which
37:36
the rap context is interesting to
37:39
study because within the world of
37:41
rap, the sort of like subjectivity
37:43
of the rapper is so
37:45
critical to our understanding and love
37:47
of or engagement with that person.
37:49
So like the rapper saying, this
37:51
is me, this is my story,
37:54
even if it's not, right? Even
37:56
if there is
37:58
embellishment, which of course, for
38:01
all artists, we're telling stories. So
38:03
some artists are more committed to kind of
38:06
telling the story of their life in a
38:08
way that really reflects sort of the events
38:10
of it. And other artists have more of
38:12
a sort of playful relationship with their sense
38:14
of truth. But within the rap context, there's
38:17
very much a sort of understanding that what
38:19
you present is who you are. So
38:21
much so that the practice of ghostwriting
38:24
is frowned upon, right? That's just not
38:26
something you do. And in other songwriting
38:28
contexts, you know, we know Beyonce has
38:30
a team of songwriters. We know that
38:32
other artists work with songwriters. And what
38:35
we expect of them or desire of
38:37
them is that they implement
38:39
or use their own
38:41
capacity as a performer
38:44
to give the song life or
38:46
infuse their story with it. But
38:48
with the rap context, there
38:50
really is an expectation that the
38:53
rapper does all of that sort
38:55
of labor of writing and performing
38:57
and being. So when you bring
38:59
in these tools of generative AI
39:01
that really question authorship, it
39:04
kind of throws the
39:06
whole hip-hop project into question. Like what
39:08
do we think is the most important
39:10
value in this space? Is it okay
39:13
to have a
39:15
person who is a really
39:17
incredible performer but their words
39:20
that they're performing have come from a
39:22
context that is not of their lived
39:24
experience? I think in this
39:26
moment, many sort of rap fans would
39:28
say that's unacceptable. But I also think
39:30
a growing number of people who are
39:32
getting familiar with these tools would argue
39:34
that that's actually, that's okay. It's okay
39:36
to sort of play with authorship
39:39
in new ways. And maybe we don't
39:41
have to be so beholden to that
39:43
mode of being. So yeah, it
39:45
definitely pulls apart, I think, as some
39:48
of the central tenets of what we
39:50
think of as being constitutive of like
39:52
rap music. Yeah. Fascinating. Enongo, thanks
39:54
so much for your insights on this. Thank
39:56
you so much for having me. Enongo
39:59
Lumumba-Kasongo is assistant professor of
40:01
music at Brown University, chief rap officer
40:03
at Glow Up Games, and a
40:06
rapper. Hello,
40:08
I'm Jess Milton. For 15 years,
40:10
I produced The Vinyl Cafe with the late,
40:12
great Stuart McLean. Every week, more
40:15
than 2 million people tuned in to hear
40:17
funny, fictional, feel-good stories about Dave and his
40:19
family. We're excited to welcome you back to
40:21
the warm and welcoming world of The
40:23
Vinyl Cafe with our new podcast, Backstage at
40:25
The Vinyl Cafe. Each week,
40:28
we'll share two hilarious stories by Stuart, and for
40:30
the first time ever, I'll tell you what it
40:32
was like behind the scenes. Subscribe
40:34
for free whenever you get your podcasts.
40:53
Hello, I'm Nora Young, and today on Spark, we're talking
40:56
about some of the limits in how we use
40:58
data in training AI, and
41:00
how we might think differently about how we
41:02
create, train, and use these systems. Models
41:05
are what they eat. They ultimately regurgitate the data
41:07
that you show them. So if you show them
41:09
high-quality data, they're going to be high-quality. If
41:12
you show them low-quality data, they're going to be low-quality. This
41:15
is Ari Morcos. He's the CEO
41:17
and co-founder of a data selection
41:19
tool startup called Datology AI, which
41:22
he formed after a career working at
41:24
Meta Platforms and Google's DeepMind unit. We
41:27
help companies train better models faster by optimizing
41:29
the quality of the data that they train
41:31
on. So at a high
41:34
level, we can exploit other models
41:36
to describe the relationships between billions
41:38
of data points, and use those
41:40
models to identify what data are
41:42
good, bad, redundant, etc. But
41:44
ultimately, it's a lot of various algorithms
41:46
that take into account the relationships between
41:49
data points to figure this out.
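As a hedged sketch of what one such algorithm can look like, here is generic embedding-based near-duplicate removal. This is not DatologyAI's actual method; it assumes the open-source sentence-transformers package and a corpus small enough for a quadratic scan.

```python
# Minimal sketch: use one model's embeddings to drop redundant
# training examples, keeping only sufficiently novel ones.
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(texts, threshold=0.9):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True)  # unit vectors
    kept_texts, kept_emb = [], []
    for text, e in zip(texts, emb):
        # On unit vectors, cosine similarity is just a dot product.
        if all(np.dot(e, k) < threshold for k in kept_emb):
            kept_texts.append(text)
            kept_emb.append(e)
    return kept_texts
```

Real pipelines operate on billions of points, so they rely on approximate nearest-neighbor indexes rather than a quadratic loop like this one. In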
41:52
2022, Ari co-authored
41:54
a landmark paper called Beyond Neural Scaling
41:56
Laws, which challenges the widespread notion that more
41:58
data equals
42:01
better models. Not
42:03
all data are created equal. Some data teach the
42:05
model a lot, and some data teach the model
42:07
a little. The amount of information you learn
42:09
from a piece of data also depends on how much data
42:12
you've seen already. So if you've seen a
42:14
little bit of data, then the next data point is
42:16
probably going to teach you something new. But if you've
42:18
seen a ton of data already, then that next data
42:20
point is probably not going to teach you something new,
42:22
because it's likely to be similar to something you've seen
42:25
before. And in many data sets, we observe this distribution
42:27
where most of the data is focused
42:29
on a pretty small set of concepts. And then
42:31
you have this long tail of more esoteric concepts
42:33
that are really the most informative for the model
42:35
and teach the model the most. But naively, if
42:37
you were to just train on all the data
42:40
or just acquire as much data as possible, those
42:43
long tail data points that are really
42:45
informative would be massively underrepresented in the
42:48
data sets. This comes up commonly in
42:50
a lot of different use cases. And ultimately,
42:52
what's important to get models that are really
42:54
high quality is to identify what are the
42:56
most informative data points, what's the data that
42:59
teaches the model the most, and enrich your
43:01
data sets so that those data points are
43:03
most prevalent in training. So
43:05
what are the practical implications of looking at,
43:07
for example, the data that tells you not
43:09
the 1,000 times the chicken crossed the
43:11
road, but the one time the chicken didn't cross the
43:13
road? What is that actually giving you in practical terms?
43:16
Yeah, that's ultimately what teaches the model
43:18
to be robust and to be able
43:21
to generalize to lots of different situations.
43:23
There's another huge practical implication of this,
43:25
which is that it dramatically slows down
43:27
training and makes training far more expensive
43:29
to get much worse models. Because what
43:32
happens as a result of this is
43:34
that most data that a model is
43:36
looking at doesn't teach it anything at
43:38
all. But it costs money. It costs
43:40
compute to look at that data. And
43:43
it takes time. And ultimately,
43:45
we're in a regime now where we have
43:47
so much data that no model is actually
43:49
learning everything about the data that's presented to
43:51
us. We decide to stop training a model
43:53
because we ran out of money. So we have a budget
43:56
for how much we're willing to spend to train a model.
43:58
And we run out of that. When I say optimizing
44:00
the quality of the data that goes into a
44:02
model, what you're effectively doing is making it
44:04
so that the model learns faster. And
44:07
if the model learns faster, that provides what
44:09
we call a compute multiplier, but that
44:11
leads to what also is called a quality multiplier,
44:14
because if the model learns faster, then you can
44:16
get to the same performance much faster, but you
44:18
can also get to much better performance in the
44:20
same budget. So this is ultimately
44:23
critical to getting models that work robustly
44:25
across lots of situations in
44:27
which we can train in a cost-effective way. So
44:30
how does this thinking inform what you're
44:32
doing at Datology AI? Yeah.
44:34
So ultimately, our goal at Datology is
44:36
to make curating high-quality data easy for
44:38
everyone. This is a frontier research problem,
44:40
as you noted, kind of in many
44:43
ways. My company is based off of
44:45
this paper that we had in 2022,
44:47
Beyond Neural Scaling Laws. But
44:49
there's a ton of nuance and challenge into
44:51
how you do this. And this is an
44:53
area where there's been very little published research
44:56
in general. This is ultimately the secret sauce
44:58
that divides the best models from the average
45:00
models. Data quality really is everything.
45:03
Most of the big frontier model companies are
45:05
using the same architecture. Ultimately
45:07
what differentiates the quality of the model is
45:10
which data they show it. But of course,
45:12
they're strongly disincentivized to share with anybody how
45:14
they do that, because that is a secret
45:16
sauce. So what that means is, if you
45:19
wanted to train your own model, you would
45:21
not have access to this really critical part
45:23
of the AI infrastructure stack that's really quite
45:25
challenging and difficult and has a lot of
45:27
nuance in how you identify this data at
45:30
scale automatically. So that's what we
45:32
do at Datology. We make that easy for everybody
45:34
by automatically curating massive data sets up
45:36
to petabytes in order to make
45:38
the data as high-quality and informative as
45:40
possible and make models train
45:43
much faster and to much better performance. But
45:45
doesn't the entire sort of big data
45:48
machine learning project rely on kind
45:50
of probabilistic outcomes of large amounts
45:52
of even sort of messy data?
45:54
I understand the importance of the outliers, the long tail,
45:57
but don't we need to know what mostly
45:59
happens as well? This gets into this
46:01
notion of redundancy and redundancy is actually
46:03
good to a point. And
46:05
different concepts have different amount of complexity, which
46:08
means that they need different amounts of
46:10
redundancy. So I'll give you an example.
46:12
Imagine trying to understand elephants versus dogs.
46:14
Okay, elephants are pretty stereotyped, right? They're
46:17
all gray. They all have wrinkly skin.
46:19
They all have big floppy ears. They're
46:21
bigger and smaller elephants, African and Asian,
46:23
respectively. But ultimately, most elephants are pretty
46:25
similar to one another. Whereas dogs, you
46:27
have tons of variation. So the amount
46:29
of redundancy that I need in order
46:31
to understand what an elephant is is much
46:33
smaller than the amount of redundancy that I
46:35
need in order to understand what a dog
46:38
is. So if I were to use the
46:40
right amount of redundancy for elephants, for dogs,
46:42
then I'd end up doing very well on elephants,
46:44
but I would not fully understand dogs in my
46:47
model. Right. And if I were to do the
46:49
opposite, I would understand dogs perfectly well, but I
46:51
would have wasted a ton of compute, looking
46:53
and learning about elephants far beyond where I
46:55
need to. So the challenge here is that
46:58
you absolutely need redundancy about the common concepts,
47:00
but you need the appropriate amount of redundancy
47:02
for a given complexity. So what we have
47:04
to do given a massive data set that's
47:06
unlabeled, that doesn't have, it doesn't say this
47:08
is an elephant or this is a dog.
47:10
It's just, here's a bunch of data. We
47:12
have to identify automatically what are those concepts,
47:14
figure out how complicated are each of those
47:16
concepts. And then based off of that, determine
47:19
the right amount of data to remove from
47:21
each of those concepts, in addition to removing
47:23
the right data there, because obviously, even within
47:25
a concept of elephants, not all elephant data
47:27
is equally informative, some is going to be
47:30
better than others.
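A hedged sketch of that elephants-versus-dogs logic, assuming embeddings and scikit-learn, with distance-to-centroid standing in, crudely, for concept complexity:

```python
# Illustrative only: cluster unlabeled embeddings into rough concepts,
# then keep more data from spread-out (complex) clusters than from
# tight (simple) ones, preferring the harder, atypical examples.
import numpy as np
from sklearn.cluster import KMeans

def prune_by_concept(emb, n_concepts=10, base_keep=0.2):
    km = KMeans(n_clusters=n_concepts, n_init="auto").fit(emb)
    keep = []
    for c in range(n_concepts):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        spread = dists.mean()  # crude per-concept complexity proxy
        n_keep = max(1, int(len(idx) * min(1.0, base_keep * (1 + spread))))
        keep.extend(idx[np.argsort(dists)[-n_keep:]])  # keep the hard ones
    return np.array(keep)
```

Keeping the examples farthest from each centroid mirrors the Beyond Neural Scaling Laws finding that, when data is plentiful, hard examples teach the model the most. One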
47:32
of the things we've talked about on the
47:34
show in the past is not only the
47:36
cost of training these things, but the environmental
47:38
cost of these very, very data intensive models,
47:40
like deep learning, do you think this approach
47:42
has potential to address the, and just the
47:44
straight-up, energy costs of this approach to computing? Absolutely.
47:46
And I think that's a big part of our
47:48
mission as well as to help with the compute
47:50
costs of these models, both on the training side,
47:52
but also on the inference side. During
47:55
training, by reducing the amount of data you
47:57
need to train models on, we can reduce
47:59
that currently by two to 4X, and
48:01
we're getting better at that every day. So
48:03
that already means that you can now train
48:05
a model with 2 to 4X less environmental
48:07
impact, which is obviously significant.
48:09
But one of the things that we can
48:11
also do with higher quality data is train
48:14
smaller models to the same performance. And in
48:16
the scheme of things, ultimately models are actually
48:18
gonna be run in what's called
48:20
inference, which is when you're actually using a model
48:22
in deployment or something like that, far more often
48:25
than they're gonna be used in training. And if
48:27
you deploy a model to inference that's bigger than
48:29
it needs to be because it didn't
48:31
see high quality data, then that's a
48:33
massively increased environmental and compute costs as
48:35
well. So better quality data both helps
48:37
to cut training costs of models, but
48:39
also helps you to train models that
48:42
are smaller and better optimized so that
48:44
the inference cost at deployment time is
48:46
also much lower, which is very helpful
48:48
from a business standpoint, but also clearly
48:50
has massive environmental impact.
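Some back-of-envelope arithmetic shows why both levers matter. The 6ND training and 2N-per-token inference FLOP counts are standard rules of thumb, and the model sizes here are hypothetical, not Datology's figures.

```python
# Rough illustration of the claimed savings; all numbers hypothetical.
params, tokens = 7e9, 2e12               # a 7B model on 2T tokens
train_flops = 6 * params * tokens        # ~8.4e22 FLOPs baseline
print(f"baseline training: {train_flops:.2e} FLOPs")
print(f"with 3x less data: {train_flops / 3:.2e} FLOPs")

# A smaller model of equal quality also cuts every future inference:
small = 3e9
print(f"per-token inference ratio: {2 * small / (2 * params):.2f}")  # ~0.43
```

You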
49:04
are listening to Spark. The idea that
49:06
we're somehow making proto humans and
49:09
that may approach or exceed us on
49:11
some mythical scale of intelligence or
49:13
decide they don't need us anymore, there's no they
49:15
there. This is Spark from
49:17
CBC. Hi,
49:29
I'm Nora Young. Today on Spark, we're talking
49:31
about the data limitations of some AI and
49:34
whether the way around the data wall is
49:36
to focus on data quality rather than quantity.
49:39
Right now, my guest is AI researcher, Ari
49:41
Morcos. His company, Datology AI, is
49:43
building tools to improve data selection, which
49:45
could help lower the amount of data
49:48
needed to train these systems. One
49:51
reason we wanted to talk to you is
49:54
that we've been hearing about concerns that data-hungry
49:56
AI like large language models will hit a
49:58
cap of good quality training data. So
50:00
if we don't rethink how to train these
50:02
systems, do you think large language models
50:05
are going to hit a plateau? I
50:07
think there's a ton more we can
50:09
do by just coming up with better
50:11
quality metrics for our existing datasets. Obviously
50:14
more data is better given the same quality,
50:16
but if we look at the models that
50:18
we have right now, they're still getting better
50:21
with more data. They're not converging yet, even
50:23
on the data that we've already shown them.
50:25
So there's a lot of gains still to
50:27
be had from showing the model higher quality
50:30
data more times over so that it learns
50:32
it. Think about how you might do flashcards
50:34
if you're trying to study for a test.
50:37
You put all the different questions on your flashcards,
50:39
and then when you get one correct, you take
50:42
it out of the pack. When you get it
50:44
incorrect, you put it at the back, and then
50:46
you see it over and over again. So doing
50:48
things where we actually present the data that's most
50:51
difficult for the model or that teaches the model
50:53
the most multiple times is still an area where
50:55
I think we can get a ton of gains
50:57
and one that we've just really barely exploited.
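In code, the flashcard idea looks something like this toy scheduler; predict_fn is a hypothetical stand-in for a model's forward pass, and real training would upweight or re-sample high-loss examples rather than literally requeue them.

```python
# Toy flashcard-style sampler: examples the model gets wrong go to
# the back of the deck; ones it gets right are retired.
from collections import deque

def flashcard_pass(examples, predict_fn, max_repeats=5):
    deck = deque((ex, 0) for ex in examples)
    while deck:
        ex, seen = deck.popleft()
        if predict_fn(ex["question"]) == ex["answer"]:
            continue                     # learned: drop the card
        if seen + 1 < max_repeats:
            deck.append((ex, seen + 1))  # still hard: see it again
```

For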
50:59
a number of cultural reasons, the field of machine
51:02
learning has largely ignored studying data. Part
51:04
of that is because data has often been viewed as kind of
51:06
boring or the plumbing.
51:09
In many cases, part of it is also that
51:11
in a lot of the competition style machine learning
51:13
research data is viewed as a given. So it's
51:15
like given a dataset, how can you create a
51:18
model that's going to do the best on that
51:20
dataset? As a result of
51:22
that, the field is mostly focused on advances
51:24
in modeling rather than advances in data. A
51:27
metaphor I like for this is
51:29
that there's this tree that's barren
51:31
that's surrounded by a bunch of
51:33
professors prodding their grad students to
51:35
climb this barren thorny tree to
51:38
reach up to find a shriveled apple
51:40
that is some slight improvement in a
51:42
modeling advance. Meanwhile, just out of sight,
51:44
there's a lush orchard of
51:47
trees that are literally dropping fruit
51:49
onto the floor in the realm of
51:51
ways we can better improve data. So
51:53
I think this is an area that
51:56
just has been so massively understudied relative
51:58
to its potential impact, that
52:00
I think that even if we hit the
52:02
limits of what's available with respect to public
52:05
data, there's still far more we
52:07
can do by making better use
52:09
of the data that we already have.
52:11
I'll also note that the data that's
52:13
in public is a heavy
52:15
minority of the total data that's present in
52:17
the world, right? The majority of data is
52:19
private. So there's also a lot of
52:21
opportunities, I think, to get that private data and exploit
52:23
that. And I think that's one of the things that
52:26
a lot of businesses are thinking now, hey, we're sitting
52:28
on these hordes of data that could be really valuable.
52:30
How can we use that to make models better for
52:32
ourselves? And presumably,
52:34
a lot of companies are concerned about their
52:36
proprietary data getting outside
52:39
of their proprietary wall as well,
52:41
right? Absolutely. They wanna make sure
52:43
that that advantage doesn't get ceded
52:45
to everyone. Right.
52:48
How widespread a problem do you think
52:50
this sort of potential data shortage is?
52:53
Like much of the conversation has been about
52:55
ChatGPT and large language models, but is
52:57
this sort of issue with growing
53:00
data potentially kind of an existential issue for
53:02
a deep learning approach to AI
53:04
in general? How broad are we talking about
53:07
here? Yeah, I actually don't think
53:09
the data shortage is as big of an
53:11
issue as people make it out to be
53:13
in general. And in large part,
53:15
that's for the reasons we've been discussing, that there's just
53:17
a lot more we can do by making better use
53:19
of the data we have available. And I think if
53:21
you go to companies, many
53:23
enterprises have too much data. They have
53:26
petabytes or exabytes of data that they've
53:28
been collecting, most of which is mostly
53:31
useless because it's not very high
53:33
quality. And the problem is, right, that they
53:35
don't know how do I make the best use of that data?
53:37
How do I find the data that's actually gonna teach me the
53:39
most? But
53:41
I think for the largest frontier models
53:43
that you see coming out of OpenAI,
53:48
ultimately the path forward is going to be
53:50
to try to acquire more high quality data,
53:52
right? They've started doing a lot of licensing
53:54
deals with various data providers in
53:56
order to acquire new data that has some sort of
53:58
quality guarantee, and then also
54:00
by pushing forward a lot of research to do
54:03
better at identifying the right data, of course, which
54:05
they will not share with anybody else.
54:09
All right. Thanks so much for your insights on this.
54:11
Absolutely. Thank you for having me. Ari
54:14
Morcos is an AI researcher and founder
54:16
of Datology AI. The
54:24
show is made by Michelle Parise, Samraweet
54:26
Yohannes, Megan Carty and me, Nora
54:28
Young, and by Gary
54:30
Marcus, Enongo Lumumba-Kasongo and Ari
54:32
Morcos. Subscribe to Spark
54:34
on the free CBC Listen app or your favourite podcast
54:36
app. I'm Nora Young. Talk to you soon. For
55:04
more CBC podcasts,
55:06
go to cbc.ca/podcasts.