Episode Transcript
0:00
Hi, I'm Asha Tomlinson. And I'm
0:02
David Common. And we're hosts
0:07
of CBC Marketplace. We're award-winning
0:09
investigative journalists that want to
0:11
help you avoid clever scams,
0:13
unsafe products and sketchy services.
0:16
Our TV show has been Canada's
0:18
top investigative consumer watchdog for more
0:20
than 50 years, but
0:22
this is our first podcast.
0:24
CBC Marketplace Podcast is available now
0:27
on the CBC Listen app or wherever
0:29
you get your podcasts. This
0:32
is a CBC Podcast. Hi,
0:38
I'm Nora Young. This is Spark. Over
0:43
the years, we've talked a lot about the
0:45
data-driven turn in AI and how
0:47
a deep learning approach has given us everything
0:49
from image recognition to ChatGPT. But
0:52
what about the ongoing ethical questions about the
0:54
kinds of data machines are learning on? And
0:57
beyond that, what if we're starting
0:59
to run out of data? This
1:01
time, tracking the data limits of AI. Ever
1:17
since ChatGPT took off, Google, Meta
1:19
and OpenAI have been in a race
1:21
to build ever more powerful generative AI
1:23
systems. Systems that rely on enormous
1:25
amounts of data to train them. Especially
1:28
the kind of human-created, high-quality
1:30
data that large language models
1:32
like ChatGPT need to
1:34
produce impressive results. But now,
1:38
there's concern that these companies are running out
1:40
of data to train their new, large language
1:42
models. That high-quality,
1:44
human-produced information is finite. And
1:48
that the internet isn't the endless source of data they
1:50
once thought it was. I
1:52
think that there's a real reason to think that we've
1:54
maybe reached a period of diminishing returns. So a year
1:56
ago, it looked like we were on,
1:58
or maybe going to be on,
2:00
an exponential; things were rising really fast. This
2:03
is Gary Marcus. He's a cognitive scientist and
2:06
leading voice in artificial intelligence. He's
2:08
the author of Rebooting AI: Building Artificial
2:10
Intelligence We Can Trust, and the
2:12
forthcoming book Taming Silicon Valley: How We
2:15
Can Ensure That AI Works for Us.
2:18
Well, I think of large language models as being
2:20
like bulls in a china shop. They're wild, reckless
2:22
beasts that do amazing things, but we don't really
2:24
know how to control them. Back
2:27
in 2022, Gary warned that we
2:29
were nearing this deep learning data wall.
2:33
And he's also written a lot about the limits
2:35
of large language models. They're
2:38
not very good at reasoning. They're not very
2:40
good at planning. They hallucinate, or confabulate might
2:42
be a better word, frequently. And
2:44
there's also an issue that they're very greedy about
2:47
data. And we're running up,
2:49
I think, against the fact that people have already
2:51
used essentially every bit of data they can get
2:53
their hands on. A
2:55
recent piece in The New York Times
2:57
reported that a team at OpenAI, which
3:00
included President Greg Brockman, had actually collected
3:02
and transcribed over a million hours
3:04
of YouTube videos to train their
3:06
GPT-4. Last
3:09
year, Meta also reportedly discussed acquiring
3:11
Simon & Schuster to gain access
3:13
to the publishing house's long-form works.
3:16
I mean, there's almost a desperation about trying to
3:18
get more data. And there's not that much
3:20
more good data. You can always make up
3:22
bad data. You can have ChatGPT, which
3:24
hallucinates or confabulates, make up data. But some of
3:27
that data is not going to be any
3:29
good. So there's actually a concern about kind
3:31
of polluting the internet with bad information. If
3:34
you plotted things on a graph on
3:36
your favorite benchmark, how well are we doing? None of
3:38
them are perfect. But if you took whatever your favorite
3:41
one is and looked at like the difference between 2020
3:43
and 2022, you'd see a huge difference. And
3:47
a huge difference between 2022 and 2023, and you'd say, hey, we're in this period of exponential
3:52
returns. But that
3:54
growth hasn't really sustained. Gary says that GPT-4,
3:57
which came out in March
3:59
2023, was a
4:01
huge and impressive leap. Since
4:03
then, there have been several competing
4:06
models with huge financial investment, time
4:08
investment, and massive amounts of data,
4:10
but they're not really any better.
4:13
While generative AI may have reached
4:15
a point of diminishing returns, Gary
4:17
says that doesn't mean AI itself is
4:19
going to be indefinitely stuck, but
4:22
it does mean we'll need to come up with
4:24
new approaches to how we train these systems. My
4:28
view is this has been a productive
4:30
path, but also a blind alley in
4:32
a certain way. The whole
4:35
notion of these systems is that you
4:37
statistically predict what people would say in
4:39
certain circumstances based on experience, but these
4:41
systems have always been poor at outliers,
4:44
cases that are different from what they've
4:46
been trained on before. We saw this
4:48
whole movie before with driverless cars, where
4:51
I and a couple other people pointed out in 2016
4:54
that you have outliers with driverless cars, unfamiliar
4:56
circumstances, and that the kinds of techniques we
4:58
know how to build in AI now are just not that
5:00
good at those. We
5:02
said, driverless cars might not be as imminent as
5:04
you thought, and lots of people got excited. Investors
5:07
put in $100 billion, but at the end
5:09
of the day, there are still lots of
5:11
unpredictable circumstances, weird placements of traffic cones or
5:13
people with hand-lettered signs that the driverless cars
5:15
still don't do very well with. I think
5:17
we're seeing the same thing with large language
5:19
models. If you ask a question a lot
5:21
of people have asked before, you're probably all
5:23
set. If it's subtly different from a question
5:26
that's been asked before, they might miss that
5:28
subtlety. It's not clear that
5:30
the generative AI systems are ever
5:32
going to be able to deal with
5:34
the unfamiliar in an effective and systematic
5:36
way. That doesn't mean no approach to
5:39
AI will ever get there. I
5:41
think we're in this blind alley
5:43
where it's all statistical approximation, and we
5:45
need systems that are in fact based
5:47
on facts and reasoning. Neural networks
5:49
that are popular right now are basically good
5:52
at something that's a little bit like intuition,
5:54
but they're bad at the deliberate stuff. They
5:56
really can't reason reliably. They can't plan
5:58
reliably. We need some other
6:01
approach to do that. So
6:03
just to explain
6:05
what synthetic data is. Sure, you make stuff up. So
6:07
a great example of this is, I mean, really, truly, I didn't
6:09
mean to be, to ridicule the idea. I mean, it's actually a good idea
6:12
as far as it can take you, but it doesn't
6:14
take you far enough sometimes. So a classic example, I
6:16
would say, is in driverless cars around 2016 or so,
6:18
people started realizing they didn't have enough data
6:23
from actual cars and they started making up
6:26
data in different ways. So, I think they
6:29
started making up data in video games like Grand Theft Auto and
6:31
sometimes their own version of
6:38
that. So you would have a simulated car in some
6:40
weird circumstance and try to get data
6:42
from that in order to feed the system. There's a whole company
6:45
that's, I think, Canadian-based that's
6:47
trying to do that. And there are probably multiple companies that are
6:49
trying to do this in various ways. And I would say it's
6:51
helped, but I would say it
6:54
hasn't helped enough. And it's partly because you
6:56
don't know which data to store and
6:58
you don't know which data to simulate. In the
7:00
real world, there are many, many instances where nobody
7:02
anticipates the data that you might need. So if
7:04
you can anticipate exactly what people are going to
7:07
need, you could do that. It would be a
7:09
really stupid use of a large language model to
7:11
make it do arithmetic because they're just not very
7:13
good at it. But you could say, well, they're
7:15
not very good at it, but if I give
7:18
them more data, they'll be better. And so you
7:20
could synthesize all the math data that you want
7:22
in principle and you could improve it to some
7:24
extent. But, for example, if you're
7:26
dealing with irrational numbers, there's just never
7:28
going to be enough synthetic data. You're
7:30
not really going to solve that problem
7:32
that way.
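To make that concrete, here is a minimal sketch of synthesized math data: programmatically generated arithmetic with guaranteed-correct answers. It illustrates the general idea, not any lab's actual pipeline.

```python
# Hypothetical illustration: synthetic arithmetic training data.
import random

def make_arithmetic_examples(n=1000, max_operand=10**6):
    examples = []
    for _ in range(n):
        a = random.randint(0, max_operand)
        b = random.randint(0, max_operand)
        op, result = random.choice([("+", a + b), ("-", a - b), ("*", a * b)])
        examples.append(f"{a} {op} {b} = {result}")
    return examples

print(make_arithmetic_examples(3))
# You can generate as much of this as you want, but as Gary notes,
# some problems (e.g. exact arithmetic on irrational numbers) can
# never be covered by a finite list of examples.
```

Synthetic data has been compared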
7:35
to the computer science version of inbreeding. What
7:37
do you make of that analogy? I
7:39
think there's something even more like inbreeding,
7:41
which is what Ernie Davis and I
7:43
once called the echo chamber effect, which
7:45
is having the models train on their
7:47
own output or having Google train on
7:49
open AI's output. So it is a
7:51
kind of inbreeding that's going on where
7:53
these models are making synthetic data and
7:55
then training on that. And so errors
7:57
get in there. Like, a crazy one
8:00
was, somebody asked one of these systems, I
8:02
might get the details wrong, but I think
8:04
asked OpenAI, how many
8:06
African countries begin with the letter K,
8:09
and it said none. And then,
8:11
sorry about that, Kenya. And
8:13
then Google trained on
8:15
OpenAI's output. So that's a kind of inbreeding where
8:17
the one system is training on the other and
8:20
the whole quality of the information ecosphere is going
8:22
down because then other people ask and that error
8:25
percolates.
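A toy way to see how an error like that percolates: simulate generations of models, each trained on a sample of the previous generation's answers. This is a neutral-drift illustration, not a model of any real system.

```python
# Toy "echo chamber" simulation: each generation trains on a sample
# of the previous generation's output. Purely illustrative.
import random

answers = ["Kenya"] * 5 + ["none"] * 95   # the correct answer starts rare
for generation in range(10):
    answers = [random.choice(answers) for _ in range(100)]
    share = answers.count("Kenya") / len(answers)
    print(f"generation {generation}: correct-answer share = {share:.2f}")
# Resampling is unbiased on average, but a rare correct answer tends
# to drift out of the pool entirely, and once gone it never returns.
```

Again, these are kind of like contrived test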
8:27
examples, we call them red teaming. But they're so
8:29
easy to generate that we're sure that they're happening
8:32
in the real world which parenthetically
8:34
points to something else, which is transparency. We don't
8:36
actually know how these systems get used in the
8:38
real world because the companies don't want to share
8:40
it. And governments should actually
8:42
be demanding logs. Like for example, do
8:44
people use these systems to make decisions
8:46
about jobs, loans, prison sentences? There was
8:48
just a study that showed, in carefully
8:50
controlled circumstances, if you speak to them
8:52
in African American English, you get a
8:55
different set of answers than if you
8:57
speak to them in standard English. So
8:59
we know this from the lab, we would like to
9:02
know does this happen in the real world. We don't
9:04
have that transparency right now. So the examples I give
9:06
you are a little contrived, but they show in principle
9:08
this kind of inbreeding thing that we call the echo
9:10
chamber effect and so forth. So we know
9:13
from kind of doing science as best we
9:15
can on the limited data that's available that
9:17
there are all these serious problems. And that
9:19
we don't know how far they go in
9:21
the actual world. Just to
9:23
throw out one case where we do know
9:25
in the actual world, there was a piece
9:27
in the New York Times today showing that
9:29
in the case of child porn, there's so
9:31
much of it being created by generative AI
9:34
that one of the nonprofits, I guess, that
9:36
tracks it, is overwhelmed now because suddenly there's just
9:38
so much out there. So sometimes we have
9:40
some way of measuring in the real world what's
9:42
going on and sometimes we don't. Yeah,
9:44
but this is what I've wondered is even if we're not using
9:47
sort of specifically synthetic
9:50
data to train, if
9:53
we have these systems that are generating content
9:55
and that's filling the internet, doesn't that mean
9:57
a lot of the data that gets used
9:59
to... train next generations of models isn't
10:01
going to be human-created anyway? Well,
10:04
I mean, what's happening is the companies are stealing from
10:06
each other. And so the
10:08
stuff that they're stealing is no longer
10:11
pure. I mean, we always
10:13
have problems with people generating misinformation for
10:15
political reasons and so forth. But
10:17
the situation has gotten worse because there is
10:19
this mad craze for more data. So one
10:21
of the ways in which people get data
10:24
now is they use each other's models. And
10:26
the terms of service tell them not to
10:28
do that, but they've all violated each other's
10:30
terms of service. So YouTube doesn't say that
10:32
OpenAI can use their data, but apparently GPT-4,
10:35
maybe Sora, were trained on it.
10:37
So you have this kind of
10:39
mad mess of recycling each other's
10:41
data rather than what you really
10:43
want is like authentic human
10:45
created data from like the New
10:48
York Times, ideally licensed, where
10:51
some human writer has written an article, some
10:53
fact-checking team has verified it, or you want,
10:55
you know, the Britannica, where there was hard
10:58
work, or Wikipedia. They are taking Wikipedia, but
11:00
they're taking all this other garbage too. And
11:03
I mean, there is this old saying in computer science, like
11:05
somebody should remember this: garbage
11:07
in, garbage out, right? And
11:10
the proportion of garbage is going up.
11:32
You are listening to Spark. Everything
11:35
is a sort of a fun house. Nothing
11:37
is as it ordinarily is. And
11:41
all possibilities are open
11:43
to exploration. This is
11:45
Spark. From CBC. I'm
11:57
Nora Young, and today on Spark we're talking about the
11:59
limitations of our current approach to data intensive
12:01
AI and the ways AI giants
12:04
are trying to get around the data wall.
12:06
Right now my guest is Gary
12:08
Marcus, a cognitive scientist and founder
12:10
of Robust AI and Geometric Intelligence.
12:12
He says there's both an underlying
12:14
technical problem and business problem when
12:16
it comes to all the competition
12:18
and hype around AI right now. So
12:22
the technical problem is the kind of AI that
12:24
we know how to build now, which I think
12:26
will look laughable 30 years from now. Like old
12:29
flip phones look a little bit laughable to us now.
12:32
It's very greedy in terms of how much data
12:34
it uses. And I pointed this out in
12:36
2018, I think people ignored me, but that's now
12:38
coming home to roost. It is
12:41
changing the moral fiber of these companies
12:43
and it's maybe leading to the diminishing
12:45
returns and so may undermine the whole
12:47
project. So on the technical side, these
12:50
systems just aren't as efficient with data
12:52
as human children. I have a 9- and an
12:54
11-year-old; show them something once and they
12:56
understand it, they can put it to use. You
12:59
show them the rules of a new game and they
13:01
get it. These systems need a lot of data for
13:03
most of what they do. And I
13:06
don't think that's anywhere near the limit of what we
13:08
could do with AI. It's just the limit of what
13:10
we know how to do with AI today. Just like
13:14
we didn't know how to build efficient
13:16
gasoline engines or electric motors once
13:18
upon a time and we learned to make
13:21
things more efficiently, sometimes by changing the entire
13:23
structure. In this case, I think the entire
13:25
algorithm is just not the right way to
13:27
do things efficiently. It's just built as a
13:30
way of mimicking things, not as a way
13:32
of deeply comprehending things. And the reason my
13:34
kids are so much more efficient is
13:36
they build models of the world and how
13:38
it works, causal models of what
13:41
supports their weight or why this thing
13:44
works this way in this game. And
13:46
these systems just aren't really doing that.
13:48
So there's the technical limitation that then drives
13:51
a business thing, and the business thing
13:53
is complicated. It starts with the fact
13:55
that people think there's a lot of money to be made,
13:57
which may not actually be true. We might want to talk
13:59
about that. But there is a widespread
14:01
belief that many people are acting on that
14:04
there's a ton of money being made and
14:06
so people are rushing. They want to be
14:08
first or more prominent. They want to be
14:10
Coca-Cola rather than Pepsi. And so that's driving
14:12
things. And then the fact
14:14
that there's no known method
14:16
for doing better besides getting more data
14:19
has led to this mad dash for
14:21
data which has led to a lot
14:24
of copyright infringement to companies doing a
14:26
lot of really shady things. And so
14:28
a bunch of these companies actually started
14:30
out wanting to do AI ethically and responsibly. And
14:32
now they're kind of like screwing artists and writers
14:35
left, right and center. They've kind of lost their
14:37
moral compass and a lot of the loss of
14:39
that moral compass has really been driven around the
14:41
mad dash for data. Like they've kind of forgotten
14:43
where they came from and what they're supposed to
14:46
do. Like I have lost my faith in a
14:48
number of companies over the last year and a
14:50
half and a lot of it is the things
14:52
that they have done to try to get ahead
14:54
in this race. So what
14:57
would it take for generative AI to
14:59
make real progress from where we are
15:01
today if there's a diminishing return? My
15:03
view is generative AI is not, to paraphrase
15:05
Star Wars, the droids we're looking for. That
15:08
generative AI is almost like a mirage. I mean
15:11
you can use it for some things but a
15:13
lot of things that people wanted to use it
15:15
for are not reliable. And I
15:18
think AI is much harder than a lot of people
15:20
think. Like I don't think it's an impossible problem. You
15:22
know our brains are essentially computers. I know a lot
15:24
of people get mad but I think that's correct. But
15:27
our brains do a lot of
15:29
amazing things. They also make mistakes. They
15:31
could be improved upon. But our brains
15:33
are capable of approaching new problems adaptively
15:35
and flexibly. That's what I think the
15:37
center of intelligence is. This particular algorithm
15:39
just isn't. It's popular but I think
15:41
it's on the wrong track. I think
15:43
when we look 20 years
15:45
from now, look back at 2024, we're
15:47
going to say, well, in that era
15:49
people figured out one thing which is how
15:51
amazing AI could be, how it could spectacularly
15:54
transform our lives but they didn't really know
15:56
how to do it. In fact, they spent
15:58
too much time on that one thing, which kind
16:00
of stifled research into anything else. They
16:02
put in billions and billions
16:04
of dollars and this other thing that
16:06
got developed in 2030 or whatever
16:08
it is, I wish they could have
16:11
developed it sooner because if we had this technology in
16:13
2025 instead of waiting until 2035, a lot
16:15
of lives could have been saved
16:19
because it was so good at solving medicine
16:21
and so forth. But people were
16:23
obsessed with the wrong tool. They didn't recognize it
16:25
was the wrong tool. You've
16:28
argued for something more like a hybrid approach. Do
16:30
you think that that's the path forward where we're
16:32
using generative AI for the things that generative AI
16:34
is good at and we're using things that have
16:37
more of a semantic understanding of the world around
16:39
them together in the same system or that we
16:41
triage problems and separate this
16:44
is a generative AI problem and this is not?
16:46
I think we need to do a lot of
16:48
that. I wrote in 2018 about deep learning, which
16:50
generative AI is a form
16:52
of. I said it's one tool among many. We
16:54
shouldn't throw it away, but we
16:57
have to understand it's one of a large complement of tools.
17:00
It's like if somebody was building a house and they
17:02
discovered power screwdrivers and they'd be like, these are amazing,
17:04
but that doesn't mean you want to forget
17:07
that you have hammers and chisels and you might need
17:09
to build a custom tool for this one thing that
17:11
you do a lot. I mean, that's
17:13
kind of what's happening right now. It's like
17:15
the best power screwdriver ever invented. It really
17:17
is amazing. I mean, I'm often criticizing, but
17:19
it's amazing. There's a question about it. It's
17:21
amazing. The question is, is it the right
17:23
tool for the job and which jobs is the right tool for? Ultimately,
17:27
if you want a general intelligence that
17:29
can be like the Star Trek computer, that's
17:31
reliable. You can trust it with whatever kind
17:33
of problem you want to pose, you're going
17:35
to need something that has a broader array
17:38
of tools. I love the word semantic. It's
17:40
not common in these kinds of conversations, but
17:42
it's right. The semantics, the comprehension, the
17:45
meaning in generative AI is
17:47
very limited. Symbolic
17:49
AI, although it's limited in other ways,
17:51
is better at representing semantics, the
17:54
meanings of things, reasoning about those relationships.
17:56
We're certainly going to need elements of
17:58
both. I don't think ... that's enough. I
18:00
wrote an article called The Next Decade
18:02
in AI which came out just before
18:04
the pandemic and the argument I
18:07
made there was that we need this thing,
18:09
hybrids, called neurosymbolic AI but that that's itself
18:11
only part of the solution. So we also
18:13
need a lot of knowledge. We need better
18:15
reasoning techniques. We need our systems to build
18:17
models of the world in the way that
18:19
you do when you go to a movie
18:21
and you learn about each character and their
18:23
motivations and what their setting is; you build
18:25
an internal model of what's going on there.
18:28
Current systems don't really do that in a
18:30
careful and robust way. So you can't kind
18:32
of ask them what's going on. They can't work
18:35
on that. So I said we need to
18:37
tackle four different problems. One of them is this
18:39
hybrid that you're talking about and I've devoted
18:41
a lot of my career to. And even
18:43
on the hybrid I would say we kind
18:46
of sort of know what that might look like
18:48
but not exactly. There's a lot of best practices
18:50
we have to learn and we're
18:52
kind of mostly ignoring that right now. There
18:54
was a very nice paper by DeepMind last
18:56
year that was a neurosymbolic approach to math
18:58
problems that could solve some International Math
18:58
Olympiad problems, called AlphaGeometry. So there's
19:03
a bit of work in that area
19:05
but it's underfunded compared to the rest.
19:07
So we've probably as a field put
19:09
in close to $100 billion, certainly well
19:11
over 50 on the neural
19:13
network side and the rest of it's getting like
19:15
2% of that
19:18
or something like that. You could
19:20
think like an investor who wants to diversify their
19:22
holdings. They want some stocks, some
19:24
bonds, some real estate. Right now
19:27
there's an intellectual monoculture in AI where only
19:29
one idea is being pursued hard and that
19:31
idea is generative AI. We need some other
19:33
ideas to flourish before we get to I
19:36
think AI that we can trust and that really
19:38
is transformative in the way that we're all hoping.
19:40
So do you think that, given that, hitting
19:43
a kind of data wall might
19:45
be a good thing, at least temporarily? Yeah.
19:47
I mean there is
19:49
a sense in which I think that's right. Right
19:51
now people are resisting. They're saying well give it
19:53
another year, another two years. Some people may kind
19:55
of stick to the wrong horse for a really
19:57
long time. We'll see. I
20:00
think hitting a wall might actually
20:02
turn out to be good in just the
20:04
way that you're saying because it might force
20:06
us to a more reliable, more trustworthy substrate
20:08
for AI. There's a saying or a phrase
20:10
in the field that the current stuff that
20:12
we have, they're called foundation models, but they're
20:14
a terrible foundation, right? The point of a foundation
20:16
in a house is you build the rest
20:18
on it, and you know that it's going
20:20
to be stable. And what
20:22
we have now is an unstable foundation.
20:24
If what it takes to get people
20:27
to widely acknowledge the instability of that
20:29
foundation is a period of
20:31
slower progress so that we kind of finally
20:34
say, hey, we're not quite doing this
20:36
right, what else can we do? Then
20:38
yeah, a short-term slowdown might lead to
20:41
a longer-term acceleration and a longer-term more
20:44
stable way of doing AI. A lot of people
20:46
think that I hate AI and it's not true.
20:48
It's not at all true. You hate it. I
20:51
really don't, right? I mean, I built an AI
20:53
company and sold it. I've been working on it since
20:55
I was eight years old. I actually love AI. I
20:57
spend most of my discretionary time
20:59
thinking about AI. I mostly don't even do this
21:01
for pay. I mostly just want the world
21:03
to be in the right place. But I really
21:05
do kind of hate the way that generative AI
21:08
has been positioned. Like as a lab curiosity, it's
21:10
fine. People should look at different
21:12
approaches, but it is so
21:14
much sucking the life from everything else and
21:16
it is so unreliable that it's just not
21:19
a good way to do AI. OpenAI is
21:21
like instead of like saving lives, it's mostly
21:23
in the near term going to be used
21:25
to surveil people. OpenAI wants
21:28
to suck up all your documents and
21:30
your calendar entries. It's going
21:32
to be like the greatest surveillance tool ever made,
21:34
but that's not why I went into AI. OpenAI
21:38
CEO Sam Altman said at a conference last
21:40
year that we were coming to an end
21:43
of the era where we keep relying on
21:45
these giant data models and that we'd make
21:47
them better in other ways. So do you
21:49
think that the kinds of limitations
21:51
in the current approaches to generative AI are
21:54
acknowledged within the AI community? Well, I mean
21:56
it's hilarious that he said that because when
21:58
I first said that... in
22:00
2022. He posted on Twitter a meme that
22:03
looked like my article, Deep Learning is Hitting
22:05
a Wall, saying, God, give me the strength,
22:07
or something like that, of the mediocre deep
22:09
learning skeptic. So he came after me hard
22:11
for saying this stuff, but I think he's
22:13
come around. I think a few people have
22:15
come around. I think people who have really
22:18
looked at the problem of what intelligence is
22:20
almost uniformly recognize how far away we actually
22:22
are. Gary, thanks so much for
22:24
your insights on this. Sure. My pleasure. Gary
22:27
Marcus is a cognitive scientist, entrepreneur
22:29
and professor emeritus at New York
22:31
University. His forthcoming book is called
22:33
Taming Silicon Valley. It's out September
22:35
24th, 2024. You
22:47
are listening to Spark. Democratizing
22:49
culture to me means not
22:51
just letting us shout into
22:54
the void of the internet. This
22:57
is Spark with Nora Young on
22:59
CBC Radio. On
23:08
last week's show about tech and
23:10
music, Enongo Lumumba-Kasongo talked about technological
23:12
transformation in the history of hip
23:14
hop. Enongo is an assistant
23:17
professor of music at Brown University. We
23:19
had such an engaging talk, but we didn't
23:22
have time for it all. So we decided
23:24
to play more from that conversation, especially because
23:26
it speaks directly to how data gathered from
23:28
hip hop artists' work is used by generative
23:30
AI and the ethical problems that
23:33
poses. It also lets
23:35
us reflect not only on how AI challenges
23:37
what music is for, but also
23:39
the importance of lived human
23:41
experiences. The
23:47
thing is, our music prof is also a rapper. And
23:53
I go by the name Sammus when I'm performing. I
24:01
started making beats in high school. In part,
24:03
I wanted to score a video game because
24:05
I love video games. And so
24:08
my older brother showed me how to
24:10
make beats on my laptop. And from there, I
24:12
started making these sort of little songs.
24:14
And then eventually that expanded into
24:16
me rapping over those songs. I
24:19
wasn't formally musically trained. So I felt like, OK, I
24:21
know how to make beats. And I have my voice.
24:23
What can I do? And
24:25
so rap became this really awesome mode for
24:28
me to be able to share things that
24:30
I was thinking were important.
24:32
[a Sammus rap verse plays]
24:36
In 2022, Enongo wrote
24:39
a piece for Public Books where she explored
24:41
the emergence of high tech blackface
24:43
and digital blackface, the
24:46
idea that digital technologies allow non-black
24:48
people to adopt the personas of
24:50
black artists online. One of
24:52
the examples she highlights is the case of
24:54
FN Meka. So
24:57
FN Meka had this almost
24:59
like Icarus tale of rise
25:01
and fall. So a set
25:04
of kind of creative technologists, or
25:06
really only one sort of entrepreneur
25:09
and another creative technologist, I think
25:11
around 2019, 2020 started developing the
25:13
idea to
25:16
create a kind of rap
25:18
avatar who would take on
25:21
rap or hip hop mannerisms, and
25:24
promote music, and be sort
25:26
of the first quote unquote AI
25:29
rapper. And I say AI rapper
25:31
in quotes because it was not
25:33
actually ever made clear how AI
25:35
was being engaged in this context,
25:38
but it was clearly important for
25:40
the developers of this character to
25:43
place AI in dialogue with the
25:45
way that this character was being
25:47
developed. There was a recognition that
25:49
this signals, at the very least,
25:51
that there's a kind of innovation
25:54
happening here that other musicians and
25:56
record labels will want to sort of invest in.
25:58
And so this character of FN Meka
26:01
started putting out music, which we later learned was
26:03
actually recorded by a black rapper
26:11
named Kyle the Hooligan. He
26:15
was sort of voicing the character but
26:17
was not properly compensated. And
26:20
this was the voice of FN Meka.
26:22
And he was sort of developing a
26:24
presence online on Instagram and on TikTok,
26:26
kind of performing this
26:28
prototypical rap persona where, you
26:30
know, he has lots of
26:32
cars and lots of jewelry.
26:35
And questions started to emerge
26:37
around who was the creative
26:39
force behind this avatar, right?
26:42
And I think part of that awareness has
26:44
been this understanding in the digital age
26:47
that stepping into black personhood is
26:50
particularly kind of easy through
26:52
some of the forms of the digital world.
26:55
And so there was an already kind
26:57
of a caution and suspicion on the
26:59
part of listeners and, you
27:01
know, folks who would be in
27:03
that space. Despite those
27:06
suspicions and its ethically dubious
27:08
origins, FN Meka's popularity
27:10
continued to grow with over one
27:12
billion views on TikTok and millions
27:14
of followers. And then in 2022,
27:16
the AI rapper was signed
27:20
to Capitol Records, the first time an
27:22
AI-generated musical artist was signed to a
27:24
major record label. And
27:26
was subsequently dropped within months
27:28
of being signed because so
27:30
many people responded with
27:32
concerns about what sort of image
27:35
of a rapper this avatar was
27:38
conveying. And again, questions about
27:40
transparency. Who is making decisions
27:42
about who this AI
27:45
or avatar rapper is
27:47
sort of how he moves through
27:49
the space and how he's understood. I think
27:51
there's a lot of healthy suspicion that this
27:54
was sort of a cash grab that was
27:56
not invested in the actual communities from
27:58
which the art form
28:01
and even the mannerisms were sort
28:03
of coming from. Yeah, yeah. And
28:05
you've argued that this is part of a long
28:07
history of black sound. Can you dig into that a little
28:09
bit for me? Absolutely. So
28:11
Matthew D. Morrison, who's a
28:14
musicologist, really brilliant thinker, has
28:17
asked for us to think
28:19
about the context of how
28:22
we engage with the work
28:24
and material of black
28:26
musical artists in our contemporary moment
28:29
by thinking back to the formation
28:31
of the music industry, particularly within
28:33
the US context. And so he
28:36
asks us to think about the
28:38
emergence of blackface minstrelsy,
28:40
which is this racist theatrical form
28:42
that emerges in the 1820s
28:46
and involves the
28:48
performance caricaturing of
28:51
enslaved Africans as well as free
28:53
black folks by white performers
28:55
who would don black face paint
28:58
and step into these caricatures of
29:00
these figures. And it was a
29:02
way not just to
29:05
express kind of fear and
29:08
revulsion around relationships
29:11
to black folks in
29:13
the US. It was also a way
29:15
to transgress and play with some of
29:17
the sort of gendered and class hierarchies
29:20
that were emerging at that time as
29:22
well. And so I think that dialectic
29:24
is really important to note because when
29:26
we think about digital black face, it's
29:29
not about sort of just mocking or
29:32
playing with representations of blackness
29:34
that are about demeaning black
29:36
folks, right? In a lot of
29:38
ways, these representations are ways that
29:41
non-black people can play with
29:43
transgression or trying new
29:46
modes of expression without
29:48
having to sort of deal with
29:50
the consequences of what that might
29:52
look like without doing so in
29:54
the body of a figure that
29:56
is commonly understood as transgressive just
29:58
as a matter of fact.
30:00
And so there's a kind
30:03
of play that's happening there that's
30:05
really harmful because folks get to
30:07
step in and out of presentations
30:09
and performances of black modes of
30:11
expression and thought without having to
30:13
deal with how being black shapes
30:16
one's life outside of that
30:18
context. You
30:21
know, it seems to me that in the sort of popular
30:23
conversation around this, there's been a lot of focus
30:25
on extremely high profile artists, people like Drake or
30:27
The Weeknd, you know, whose
30:29
voices and likenesses are being used. But ultimately,
30:31
who do you think really stands to
30:33
lose in all this? I
30:36
mean, it's interesting because like you said,
30:39
the way in which this is
30:41
sort of unfolding, the people who
30:43
are at the moment the most
30:45
vulnerable when I think about these
30:47
kind of AI voice filters where
30:49
folks are able to really sound,
30:51
you know, like audio deepfakes to
30:53
really step into the sound of
30:55
a Drake or The Weeknd, you
30:58
know, because they have this kind of
31:00
cultural cachet built into the timbre of
31:02
their voice, it enables
31:04
people to step in and
31:07
to generate capital and clout
31:09
because their voice means something. So for
31:11
an artist who's just starting out, their
31:14
voice doesn't mean what Drake's voice means,
31:16
just the sound of it, right? Just
31:18
the sound of it is doing something
31:20
important. And so I think in many
31:23
ways, artists who are, you know, at
31:25
that sort of upper echelon, they're really
31:27
vulnerable because their voice, A, is
31:30
everywhere. Yeah, a lot of
31:32
training data there. So much, there's
31:34
so much material. And B,
31:37
their voice has a kind of value
31:39
pop culturally. I mean, I think
31:41
about the ways that when an
31:44
artist features on another artist's track,
31:46
the excitement about hearing these two
31:48
voices be in conversation
31:50
because this voice is meaningful to
31:53
us. So it's not
31:55
as, I think, overtly
31:57
destructive in the more
32:00
deep DIY spaces, or the spaces where
32:02
an artist hasn't yet developed a voice
32:05
or a timbre of a voice that's
32:07
recognizable. But again, I think
32:09
how that impacts artists who are
32:11
sort of on the underground is
32:14
that when we think about the possibilities
32:16
for how working musicians can
32:18
build a life, it's very,
32:21
very difficult at this moment to be
32:23
a working artist. I think every single
32:25
rapper friend that I have or music,
32:28
you know, just more generally folks who
32:30
work in music, they have
32:32
like five hustles. I mean, I myself
32:34
am a professor, and I'm also a
32:37
rapper. And, you know, I value
32:39
and appreciate being in academia and having
32:42
these conversations. And in
32:44
part, this has been a strategy to be
32:46
able to build a sustainable art practice, because
32:48
were I to just be actively pursuing music,
32:50
I would be subject to the whims of
32:52
the market. And that's a really, really difficult
32:55
position to be in as an artist. And
32:57
as an artist who doesn't want to just
32:59
make whatever is profitable on the
33:01
radio, like this is a really,
33:03
really difficult position to be in. And
33:05
so with the advent of AI in
33:08
the music space, again, I think about
33:10
questions of risk and who can afford
33:10
to absorb it when creating new kinds of sounds
33:12
or trying to make it. My worry
33:17
is that artists who are just starting
33:19
out or who are, you
33:21
know, creeping around the DIY basement
33:23
space, is that they don't even
33:25
see a possibility or a way
33:28
forward. Because what the sort of
33:30
large record labels do impacts what
33:32
the middle tier record labels do and
33:34
who they invest in. And
33:36
if the sort of Warner Music Groups
33:38
of the world are reflecting the message
33:41
that it's not really worth investing in
33:43
real human artists, and instead, maybe what
33:45
we should do is invest in tools
33:47
that enable us to take
33:50
on the personhood of artists, artists who we
33:52
don't then have to be accountable to in
33:54
the ways that we have to be accountable
33:56
to human artists. You know, I
33:58
can see that impacting the decision-making on
34:00
the part of everyone else
34:03
in the music industry. So I
34:05
think I'm worried about the culture
34:07
around how we view the work
34:09
of being a musician, that it's
34:12
devalued in this process. And that
34:14
devaluation actually significantly impacts
34:16
who sees themself as being
34:18
able to pursue a life
34:20
as an artist. Yeah.
34:22
Well, no, just from a technical point of
34:24
view, I mean, what do you
34:26
make of their ability to replicate sounds
34:28
from different genres, different forms of music?
34:32
I think that the tools that
34:34
I've engaged with, there's
34:37
a range of levels of sophistication.
34:39
So for example, if I were
34:41
to go into ChatGPT and
34:43
say, write me a rhyme in
34:45
the style of Sammus, myself. And
34:49
it'll generate this pretty
34:51
mundane, childish rhyme that
34:53
has a really not
34:56
particularly innovative rhyme scheme. There's
34:58
not sort of like metrical complexity
35:00
to it. And the material
35:03
itself reflects sort
35:05
of like a shadow of who I am
35:07
as a rapper generally based on what exists
35:09
in the world. So a lot of
35:11
my music deals with metaphors around technology
35:13
and video games. And so there's some
35:16
reflection of that being important to me.
35:18
But it's very unspecific
35:21
and not particularly compelling. However,
35:24
with some of these sort
35:26
of tools that
35:28
allow folks to use
35:30
AI to create a filter for a
35:33
particular person's voice so they can rap
35:35
as themselves and then sort of put
35:37
this filter on so that it becomes,
35:39
as we've heard, Drake or The Weeknd,
35:42
that enables you to step into
35:45
the kind of flow and real
35:47
expressive qualities of what makes a
35:49
rap song, a rap song, or
35:51
what makes a rap interesting. So
35:54
the level of sophistication there, I
35:56
think, is troubling. And,
35:59
sort of, on a technical level, I
36:01
think we're moving into a space
36:03
where it will become really, really
36:05
difficult to kind of figure out
36:08
who's authoring what. And actually, it's really interesting.
36:10
We're seeing that happen right now with Drake,
36:12
who's in a bit of a beef with
36:14
a number of different artists. And
36:17
very, very recently, a track
36:19
was released and a real
36:21
discourse online was, is this
36:23
diss track an AI track?
36:25
Like, did Drake actually write
36:27
this track? And there's so
36:29
many implications around that. You
36:32
know, if Drake says, I didn't write this
36:34
track, like if it is an AI track,
36:37
the next thing that he writes will be
36:39
compared to this other AI track. So as
36:41
an artist, he's kind of having to interface
36:44
with this shadow version of himself. But
36:46
then there's also the misinformation elements
36:49
of this where, you know, with a
36:51
diss track, or in the context of
36:53
a beef, this can have real implications
36:55
for people's relationships with the other people
36:57
in the music industry or with their
37:00
peers. And if it's not clear, whether
37:02
this was generated by some outside force
37:05
or by the artists themselves, it can
37:07
start to get really challenging interpersonally. So
37:09
we already see how
37:11
it's manifesting in the public sphere. Yeah,
37:13
I mean, historically, people have used songwriting
37:16
as ways to sort of, you
37:18
know, document their lives, to
37:20
work through their feelings and their thoughts.
37:23
Does generative AI for music come
37:25
into conflict with that history? Like,
37:27
and the importance of just lived
37:29
human experience in that type of
37:31
storytelling? Absolutely. And I think
37:33
that there's a particular way in which
37:36
the rap context is interesting to
37:39
study because within the world of
37:41
rap, the sort of like subjectivity
37:43
of the rapper is so
37:45
critical to our understanding and love
37:47
of or engagement with that person.
37:49
So like the rapper saying, this
37:51
is me, this is my story,
37:54
even if it's not, right? Even
37:56
if there is
37:58
embellishment, which of course, for
38:01
all artists, we're telling stories. So
38:03
some artists are more committed to kind of
38:06
telling the story of their life in a
38:08
way that really reflects sort of the events
38:10
of it. And other artists have more of
38:12
a sort of playful relationship with their sense
38:14
of truth. But within the rap context, there's
38:17
very much a sort of understanding that what
38:19
you present is who you are. So
38:21
much so that the practice of ghostwriting
38:24
is frowned upon, right? That's just not
38:26
something you do. And in other songwriting
38:28
contexts, you know, we know Beyonce has
38:30
a team of songwriters. We know that
38:32
other artists work with songwriters. And what
38:35
we expect of them or desire of
38:37
them is that they implement
38:39
or use their own
38:41
capacity as a performer
38:44
to give the song life or
38:46
infuse their story with it. But
38:48
with the rap context, there
38:50
really is an expectation that the
38:53
rapper does all of that sort
38:55
of labor of writing and performing
38:57
and being. So when you bring
38:59
in these tools of generative AI
39:01
that really question authorship, it
39:04
kind of throws the
39:06
whole hip-hop project into question. Like what
39:08
do we think is the most important
39:10
value in this space? Is it okay
39:13
to have a
39:15
person who is a really
39:17
incredible performer but their words
39:20
that they're performing have come from a
39:22
context that is not of their lived
39:24
experience? I think in this
39:26
moment, many sort of rap fans would
39:28
say that's unacceptable. But I also think
39:30
a growing number of people who are
39:32
getting familiar with these tools would argue
39:34
that that's actually, that's okay. It's okay
39:36
to sort of play with authorship
39:39
in new ways. And maybe we don't
39:41
have to be so beholden to that
39:43
mode of being. So yeah, it
39:45
definitely pulls apart, I think, as some
39:48
of the central tenets of what we
39:50
think of as being constitutive of like
39:52
rap music. Yeah. Fascinating. Enongo, thanks
39:54
so much for your insights on this. Thank
39:56
you so much for having me. Enongo
39:59
Lumumba-Kasongo is assistant professor of
40:01
music at Brown University, chief rap officer
40:03
at Glow Up Games, and a
40:06
rapper. Hello,
40:08
I'm Jess Milton. For 15 years,
40:10
I produced The Vinyl Cafe with the late,
40:12
great Stuart McLean. Every week, more
40:15
than 2 million people tuned in to hear
40:17
funny, fictional, feel-good stories about Dave and his
40:19
family. We're excited to welcome you back to
40:21
the warm and welcoming world of The
40:23
Vinyl Cafe with our new podcast, Backstage at
40:25
The Vinyl Cafe. Each week,
40:28
we'll share two hilarious stories by Stuart, and for
40:30
the first time ever, I'll tell you what it
40:32
was like behind the scenes. Subscribe
40:34
for free whenever you get your podcasts.
40:53
Hello, I'm Nora Young, and today on Spark, we're talking
40:56
about some of the limits in how we use
40:58
data in training AI, and
41:00
how we might think differently about how we
41:02
create, train, and use these systems. Models
41:05
are what they eat. They ultimately regurgitate the data
41:07
that you show them. So if you show them
41:09
high-quality data, they're going to be high-quality. If
41:12
you show them low-quality data, they're going to be low-quality. This
41:15
is Ari Morcos. He's the CEO
41:17
and co-founder of a data selection
41:19
tool startup called Datology AI, which
41:22
he formed after a career working at
41:24
Meta Platforms and Google's DeepMind unit. We
41:27
help companies train better models faster by optimizing
41:29
the quality of the data that they train
41:31
on. So at a high
41:34
level, we can exploit other models
41:36
to describe the relationships between billions
41:38
of data points, and use those
41:40
models to identify what data are
41:42
good, bad, redundant, etc. But
41:44
ultimately, it's a lot of various algorithms
41:46
that take into account the relationships between
41:49
data points to figure this out.
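As a hedged sketch of what one such algorithm can look like, here is generic embedding-based near-duplicate removal. This is not DatologyAI's actual method; it assumes the open-source sentence-transformers package and a corpus small enough for a quadratic scan.

```python
# Minimal sketch: use one model's embeddings to drop redundant
# training examples, keeping only sufficiently novel ones.
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(texts, threshold=0.9):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True)  # unit vectors
    kept_texts, kept_emb = [], []
    for text, e in zip(texts, emb):
        # On unit vectors, cosine similarity is just a dot product.
        if all(np.dot(e, k) < threshold for k in kept_emb):
            kept_texts.append(text)
            kept_emb.append(e)
    return kept_texts
```

Real pipelines operate on billions of points, so they rely on approximate nearest-neighbor indexes rather than a quadratic loop like this one. In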
41:52
2022, Ari co-authored
41:54
a landmark paper called Beyond Neural Scaling
41:56
Laws, which challenges the widespread notion that more
41:58
data equals
42:01
better models. Not
42:03
all data are created equal. Some data teach the
42:05
model a lot, and some data teach the model
42:07
a little. The amount of information you learn
42:09
from a piece of data also depends on how much data
42:12
you've seen already. So if you've seen a
42:14
little bit of data, then the next data point is
42:16
probably going to teach you something new. But if you've
42:18
seen a ton of data already, then that next data
42:20
point is probably not going to teach you something new,
42:22
because it's likely to be similar to something you've seen
42:25
before. And in many data sets, we observe this distribution
42:27
where most of the data is focused
42:29
on a pretty small set of concepts. And then
42:31
you have this long tail of more esoteric concepts
42:33
that are really the most informative for the model
42:35
and teach the model the most. But naively, if
42:37
you were to just train on all the data
42:40
or just acquire as much data as possible, those
42:43
long tail data points that are really
42:45
informative would be massively underrepresented in the
42:48
data sets. This comes up commonly in
42:50
a lot of different use cases. And ultimately,
42:52
what's important to get models that are really
42:54
high quality is to identify what are the
42:56
most informative data points, what's the data that
42:59
teaches the model the most, and enrich your
43:01
data sets so that those data points are
43:03
most prevalent in training. So
43:05
what are the practical implications of looking at,
43:07
for example, the data that tells you not
43:09
the 1,000 times the chicken crossed the
43:11
road, but the one time the chicken didn't cross the
43:13
road? What is that actually giving you in practical terms?
43:16
Yeah, that's ultimately what teaches the model
43:18
to be robust and to be able
43:21
to generalize to lots of different situations.
43:23
There's another huge practical implication of this,
43:25
which is that it dramatically slows down
43:27
training and makes training far more expensive
43:29
to get much worse models. Because what
43:32
happens as a result of this is
43:34
that most data that a model is
43:36
looking at doesn't teach it anything at
43:38
all. But it costs money. It costs
43:40
compute to look at that data. And
43:43
it takes time. And ultimately,
43:45
we're in a regime now where we have
43:47
so much data that no model is actually
43:49
learning everything about the data that's presented to
43:51
us. We decide to stop training a model
43:53
because we ran out of money. So we have a budget
43:56
for how much we're willing to spend to train a model.
43:58
And we run out of that. When I say optimizing
44:00
the quality of the data that goes into a
44:02
model, what you're effectively doing is making it
44:04
so that the model learns faster. And
44:07
if the model learns faster, that provides what
44:09
we call a compute multiplier, but that
44:11
leads to what also is called a quality multiplier,
44:14
because if the model learns faster, then you can
44:16
get to the same performance much faster, but you
44:18
can also get to much better performance in the
44:20
same budget. So this is ultimately
44:23
critical to getting models that work robustly
44:25
across lots of situations in
44:27
which we can train in a cost-effective way. So
44:30
how does this thinking inform what you're
44:32
doing at Datology AI? Yeah.
44:34
So ultimately, our goal at Datology is
44:36
to make curating high-quality data easy for
44:38
everyone. This is a frontier research problem,
44:40
as you noted, kind of in many
44:43
ways. My company is based off of
44:45
this paper that we had in 2022,
44:47
Beyond Neural Scaling Laws. But
44:49
there's a ton of nuance and challenge into
44:51
how you do this. And this is an
44:53
area where there's been very little published research
44:56
in general. This is ultimately the secret sauce
44:58
that divides the best models from the average
45:00
models. Data quality really is everything.
45:03
Most of the big frontier model companies are
45:05
using the same architecture. Ultimately
45:07
what differentiates the quality of the model is
45:10
which data they show it. But of course,
45:12
they're strongly disincentivized to share with anybody how
45:14
they do that, because that is a secret
45:16
sauce. So what that means is, if you
45:19
wanted to train your own model, you would
45:21
not have access to this really critical part
45:23
of the AI infrastructure stack that's really quite
45:25
challenging and difficult and has a lot of
45:27
nuance in how you identify this data at
45:30
scale automatically. So that's what we
45:32
do at Datology. We make that easy for everybody
45:34
by automatically curating massive data sets up
45:36
to petabytes in order to make
45:38
the data as high-quality and informative as
45:40
possible and make models train
45:43
much faster and to much better performance. But
45:45
doesn't the entire sort of big data
45:48
machine learning project rely on kind
45:50
of probabilistic outcomes of large amounts
45:52
of even sort of messy data?
45:54
I understand the importance of the outliers, the long tail,
45:57
but don't we need to know what mostly
45:59
happens as well? This gets into this
46:01
notion of redundancy and redundancy is actually
46:03
good to a point. And
46:05
different concepts have different amount of complexity, which
46:08
means that they need different amounts of
46:10
redundancy. So I'll give you an example.
46:12
Imagine trying to understand elephants versus dogs.
46:14
Okay, elephants are pretty stereotyped, right? They're
46:17
all gray. They all have wrinkly skin.
46:19
They all have big floppy ears. They're
46:21
bigger and smaller elephants, African and Asian,
46:23
respectively. But ultimately, most elephants are pretty
46:25
similar to one another. Whereas dogs, you
46:27
have tons of variation. So the amount
46:29
of redundancy that I need in order
46:31
to understand what an elephant is is much
46:33
smaller than the amount of redundancy that I
46:35
need in order to understand what a dog
46:38
is. So if I were to use the
46:40
right amount of redundancy for elephants, for dogs,
46:42
then I'd end up doing very well on elephants,
46:44
but I would not fully understand dogs in my
46:47
model. Right. And if I were to do the
46:49
opposite, I would understand dogs perfectly well, but I
46:51
would have wasted a ton of compute, looking
46:53
and learning about elephants far beyond where I
46:55
need to. So the challenge here is that
46:58
you absolutely need redundancy about the common concepts,
47:00
but you need the appropriate amount of redundancy
47:02
for a given complexity. So what we have
47:04
to do given a massive data set that's
47:06
unlabeled, that doesn't have, it doesn't say this
47:08
is an elephant or this is a dog.
47:10
It's just, here's a bunch of data. We
47:12
have to identify automatically what are those concepts,
47:14
figure out how complicated are each of those
47:16
concepts. And then based off of that, determine
47:19
the right amount of data to remove from
47:21
each of those concepts, in addition to removing
47:23
the right data there, because obviously, even within
47:25
a concept of elephants, not all elephant data
47:27
is equally informative, some is going to be
47:30
better than others.
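A hedged sketch of that elephants-versus-dogs logic, assuming embeddings and scikit-learn, with distance-to-centroid standing in, crudely, for concept complexity:

```python
# Illustrative only: cluster unlabeled embeddings into rough concepts,
# then keep more data from spread-out (complex) clusters than from
# tight (simple) ones, preferring the harder, atypical examples.
import numpy as np
from sklearn.cluster import KMeans

def prune_by_concept(emb, n_concepts=10, base_keep=0.2):
    km = KMeans(n_clusters=n_concepts, n_init="auto").fit(emb)
    keep = []
    for c in range(n_concepts):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        spread = dists.mean()  # crude per-concept complexity proxy
        n_keep = max(1, int(len(idx) * min(1.0, base_keep * (1 + spread))))
        keep.extend(idx[np.argsort(dists)[-n_keep:]])  # keep the hard ones
    return np.array(keep)
```

Keeping the examples farthest from each centroid mirrors the Beyond Neural Scaling Laws finding that, when data is plentiful, hard examples teach the model the most. One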
47:32
of the things we've talked about on the
47:34
show in the past is not only the
47:36
cost of training these things, but the environmental
47:38
cost of these very, very data intensive models,
47:40
like deep learning, do you think this approach
47:42
has potential to address the, and just the
47:44
straight-up, energy costs of this approach to computing? Absolutely.
47:46
And I think that's a big part of our
47:48
mission as well as to help with the compute
47:50
costs of these models, both on the training side,
47:52
but also on the inference side. During
47:55
training, by reducing the amount of data you
47:57
need to train models on, we can reduce
47:59
that currently by two to 4X, and
48:01
we're getting better at that every day. So
48:03
that already means that you can now train
48:05
a model with 2 to 4X less environmental
48:07
impact, which is obviously significant.
48:09
But one of the things that we can
48:11
also do with higher quality data is train
48:14
smaller models to the same performance. And in
48:16
the scheme of things, ultimately models are actually
48:18
gonna be run in what's called
48:20
inference, which is when you're actually using a model
48:22
in deployment or something like that, far more often
48:25
than they're gonna be used in training. And if
48:27
you deploy a model to inference that's bigger than
48:29
it needs to be because it didn't
48:31
see high quality data, then that's a
48:33
massively increased environmental and compute costs as
48:35
well. So better quality data both helps
48:37
to cut training costs of models, but
48:39
also helps you to train models that
48:42
are smaller and better optimized so that
48:44
the inference cost at deployment time is
48:46
also much lower, which is very helpful
48:48
from a business standpoint, but also clearly
48:50
has massive environmental impact.
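Some back-of-envelope arithmetic shows why both levers matter. The 6ND training and 2N-per-token inference FLOP counts are standard rules of thumb, and the model sizes here are hypothetical, not Datology's figures.

```python
# Rough illustration of the claimed savings; all numbers hypothetical.
params, tokens = 7e9, 2e12               # a 7B model on 2T tokens
train_flops = 6 * params * tokens        # ~8.4e22 FLOPs baseline
print(f"baseline training: {train_flops:.2e} FLOPs")
print(f"with 3x less data: {train_flops / 3:.2e} FLOPs")

# A smaller model of equal quality also cuts every future inference:
small = 3e9
print(f"per-token inference ratio: {2 * small / (2 * params):.2f}")  # ~0.43
```

You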
49:04
are listening to Spark. The idea that
49:06
we're somehow making proto humans and
49:09
that may approach or exceed us on
49:11
some mythical scale of intelligence or
49:13
decide they don't need us anymore, there's no they
49:15
there. This is Spark from
49:17
CBC. Hi,
49:29
I'm Nora Young. Today on Spark, we're talking
49:31
about the data limitations of some AI and
49:34
whether the way around the data wall is
49:36
to focus on data quality rather than quantity.
49:39
Right now, my guest is AI researcher, Ari
49:41
Morcos. His company, Datology AI, is
49:43
building tools to improve data selection, which
49:45
could help lower the amount of data
49:48
needed to train these systems. One
49:51
reason we wanted to talk to you is
49:54
that we've been hearing about concerns that data-hungry
49:56
AI like large language models will hit a
49:58
cap of good quality training data. So
50:00
if we don't rethink how to train these
50:02
systems, do you think large language models
50:05
are going to hit a plateau? I
50:07
think there's a ton more we can
50:09
do by just coming up with better
50:11
quality metrics for our existing datasets. Obviously
50:14
more data is better given the same quality,
50:16
but if we look at the models that
50:18
we have right now, they're still getting better
50:21
with more data. They're not converging yet, even
50:23
on the data that we've already shown them.
50:25
So there's a lot of gains still to
50:27
be had from showing the model higher quality
50:30
data more times over so that it learns
50:32
it. Think about how you might do flashcards
50:34
if you're trying to study for a test.
50:37
You put all the different questions on your flashcards,
50:39
and then when you get one correct, you take
50:42
it out of the pack. When you get it
50:44
incorrect, you put it at the back, and then
50:46
you see it over and over again. So doing
50:48
things where we actually present the data that's most
50:51
difficult for the model or that teaches the model
50:53
the most multiple times is still an area where
50:55
I think we can get a ton of gains
50:57
and one that we've just really barely exploited.
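In code, the flashcard idea looks something like this toy scheduler; predict_fn is a hypothetical stand-in for a model's forward pass, and real training would upweight or re-sample high-loss examples rather than literally requeue them.

```python
# Toy flashcard-style sampler: examples the model gets wrong go to
# the back of the deck; ones it gets right are retired.
from collections import deque

def flashcard_pass(examples, predict_fn, max_repeats=5):
    deck = deque((ex, 0) for ex in examples)
    while deck:
        ex, seen = deck.popleft()
        if predict_fn(ex["question"]) == ex["answer"]:
            continue                     # learned: drop the card
        if seen + 1 < max_repeats:
            deck.append((ex, seen + 1))  # still hard: see it again
```

For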
50:59
a number of cultural reasons, the field of machine
51:02
learning has largely ignored studying data. Part
51:04
of that is because data has often been viewed as kind of
51:06
boring or the plumbing.
51:09
In many cases, part of it is also that
51:11
in a lot of the competition style machine learning
51:13
research data is viewed as a given. So it's
51:15
like given a dataset, how can you create a
51:18
model that's going to do the best on that
51:20
dataset? As a result of
51:22
that, the field is mostly focused on advances
51:24
in modeling rather than advances in data. A
51:27
metaphor I like for this is
51:29
that there's this tree that's barren
51:31
that's surrounded by a bunch of
51:33
professors prodding their grad students to
51:35
climb this barren thorny tree to
51:38
reach up to find a shriveled apple
51:40
that is some slight improvement in a
51:42
modeling advance. Meanwhile, just out of sight,
51:44
there's a lush orchard of
51:47
trees that are literally dropping fruit
51:49
onto the floor in the realm of
51:51
ways we can better improve data. So
51:53
I think this is an area that
51:56
just has been so massively understudied relative
51:58
to its potential impact, that
52:00
I think that even if we hit the
52:02
limits of what's available with respect to public
52:05
data, there's still far more we
52:07
can do by making better use
52:09
of the data that we already have.
52:11
I'll also note that the data that's
52:13
in public is a heavy
52:15
minority of the total data that's present in
52:17
the world, right? The majority of data is
52:19
private. So there's also a lot of
52:21
opportunities, I think, to get that private data and exploit
52:23
that. And I think that's one of the things that
52:26
a lot of businesses are thinking now, hey, we're sitting
52:28
on these hordes of data that could be really valuable.
52:30
How can we use that to make models better for
52:32
ourselves? And presumably,
52:34
a lot of companies are concerned about their
52:36
proprietary data getting outside
52:39
of their proprietary wall as well,
52:41
right? Absolutely. They wanna make sure
52:43
that that advantage doesn't get ceded
52:45
to everyone. Right.
52:48
How widespread a problem do you think
52:50
this sort of potential data shortage is?
52:53
Like much of the conversation has been about
52:55
ChatGPT and large language models, but is
52:57
this sort of issue with growing
53:00
data potentially kind of an existential issue for
53:02
a deep learning approach to AI
53:04
in general? How broad are we talking about
53:07
here? Yeah, I actually don't think
53:09
the data shortage is as big of an
53:11
issue as people make it out to be
53:13
in general. And in large part,
53:15
that's for the reasons we've been discussing, that there's just
53:17
a lot more we can do by making better use
53:19
of the data we have available. And I think if
53:21
you go to companies, many
53:23
enterprises have too much data. They have
53:26
petabytes or exabytes of data that they've
53:28
been collecting, most of which is mostly
53:31
useless because it's not very high
53:33
quality. And the problem is, right, that they
53:35
don't know how do I make the best use of that data?
53:37
How do I find the data that's actually gonna teach me the
53:39
most? But
53:41
I think for the largest frontier models
53:43
that you see coming out of OpenAI,
53:48
ultimately the path forward is going to be
53:50
to try to acquire more high quality data,
53:52
right? They've started doing a lot of licensing
53:54
deals with various data providers in
53:56
order to acquire new data that has some sort of
53:58
quality guarantee, and then also
54:00
by pushing forward a lot of research to do
54:03
better at identifying the right data, of course, which
54:05
they will not share with anybody else.
54:09
All right. Thanks so much for your insights on this.
54:11
Absolutely. Thank you for having me. Ari
54:14
Morcos is an AI researcher and founder
54:16
of Datology AI. The
54:24
show is made by Michelle Parise, Samraweet
54:26
Yohannes, Megan Carty and me, Nora
54:28
Young, and by Gary
54:30
Marcus, Enongo Lumumba-Kasongo and Ari
54:32
Morcos. Subscribe to Spark
54:34
on the free CBC Listen app or your favourite podcast
54:36
app. I'm Nora Young. Talk to you soon. For
55:04
more CBC podcasts,
55:06
go to cbc.ca/podcasts.