Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:00
Hey Siri, play
0:02
IRL
0:02
podcast. Here's the
0:04
podcast IRL. Online
0:07
life is real life.
0:10
Lots of us use virtual assistants. They're
0:12
part of our everyday lives. We
0:15
use them to check the weather or the time. Or
0:17
if you're me, you might be like, Hey Siri,
0:19
play Beyonce. But
0:22
speech recognition systems don't work
0:25
equally well for everyone. They don't even
0:27
exist for many languages. Big
0:29
Tech has stepped up to offer more diversity
0:32
in their language models. For speech
0:34
and more. But it comes with a new set
0:36
of problems. How
0:40
do I feel about Big Tech
0:42
sort of paying attention to our marginalized
0:45
indigenous languages? I guess
0:47
the first thing I wonder is why? Why
0:50
do they care now? Do they genuinely
0:52
care to ensure
0:54
inclusivity online? Or
0:57
did they finally realize that
0:59
being more inclusive is better
1:02
for them and their bottom lines?
1:04
That's Keoni Mahelona in New Zealand.
1:07
We'll hear more from him in a bit. In
1:09
this episode, we meet technology
1:11
builders who are reclaiming speech recognition
1:14
with and for their own language communities.
1:17
This is IRL, an original podcast from
1:19
Mozilla, the
1:23
nonprofit behind Firefox. I'm
1:25
Bridget Todd. This
1:27
season, we meet people who are building artificial
1:29
intelligence that puts people over profit. First,
1:32
let's make a stop in the US. We're
1:35
in Maryland, not far from where I live. I
1:40
spent a year with Alexa and
1:43
I allowed the device to do
1:45
whatever they wanted.
1:48
I allowed the device to
1:50
do whatever the device could be
1:52
seen.
1:54
That's Halcyon Lawrence at
2:00
Towson University. Three
2:03
years ago she conducted an experiment
2:06
with Amazon's home assistant Alexa
2:08
which is pretty popular in the US.
2:11
So for example I would ask can
2:13
you set a 5:50 alarm
2:16
and the device would hear 5:50
2:19
and so I would just wake up at 5:50. I wanted
2:22
to push and see what is the level of inconvenience
2:25
right? But this device would allow
2:27
me to
2:28
do. Halcyon
2:31
grew up in Trinidad and Tobago. While
2:34
Caribbean accents can still throw off voice
2:36
tech by US companies, the tech has
2:38
improved so much that it altered the focus
2:41
of Halcyon's research.
2:45
So why is it important for technology
2:47
to be able to understand us? Well
2:50
I think this is where
2:52
it sort of speaks to the convenience and
2:55
the question that arises is convenience
2:57
for whom? You know
2:59
the kinds of interactions
3:01
that I have with most
3:03
speech devices like personal assistants,
3:05
if they do not understand me it's
3:08
often very comical and
3:10
maybe a minor inconvenience and
3:12
so that's sort of part of the thesis.
3:15
But let's scale up because
3:18
these speech recognition devices are
3:20
being deployed in a number of other
3:23
spaces. So in the US for
3:25
example they're increasingly
3:27
being used to automatically transcribe
3:30
court recordings. They're being
3:33
used as aggression detectors in prisons
3:36
as well as schools and
3:38
so you can well imagine
3:41
these are spaces where being misheard
3:43
or misunderstood can have deadly consequences.
3:53
colonial
4:00
powers dominated people in her region
4:02
and worldwide. She sees parallels
4:05
in how digital technology pushes people
4:07
to speak in certain ways just to be
4:09
understood. One
4:11
of the things that concerns me is
4:13
the expectation that you speak with a standard
4:15
accent, whether it be standard English
4:18
or standard French or any sort of standard
4:20
language suggests
4:23
that anybody who does not
4:25
speak with that
4:25
standard accent is
4:28
misheard or misunderstood and
4:30
these are vulnerable populations who
4:32
turn up in spaces like prisons
4:34
and courts of law where they need
4:37
to be heard and understood accurately.
4:39
So you know it's as important
4:42
as why you know asking that question why do
4:44
we need to be heard or understood in person
4:47
is no less important in
4:49
the digital space. So
4:54
Halcyon, are there ways that you think that technology
4:56
can be designed differently so that folks who
4:58
maybe don't speak North American or British
5:00
English can be understood?
5:03
So
5:04
your question hits
5:06
upon past me and
5:08
current me. Past
5:10
me when I started doing this
5:12
research, the easy answer would have been
5:15
yes we need more representation in
5:17
these devices right if if
5:19
I can hear and be heard
5:22
with a Trinidadian accent surely that
5:24
would solve the problem. But
5:27
recently on a trip home
5:29
she was reminded how language is also
5:31
used
5:31
as resistance.
5:33
For instance, by speaking in ways that
5:35
cannot be understood by oppressors.
5:38
I started visiting with friends
5:40
and I have forgotten how
5:45
we have also used language
5:47
to subdue colonial
5:49
authority. That
5:52
other kinds of dialects have
5:54
emerged, that patois
5:56
has emerged as a way
5:59
of subverting.
5:59
And so the question then
6:02
arises,
6:03
what does it mean to give organizations
6:06
access to that kind of voice detail?
6:10
What kind of power are we
6:13
handing over if I am advocating
6:15
for greater representation of languages
6:18
and dialects and accents? And so
6:20
I am in a bit of a conundrum right now
6:23
thinking about the kind of research that
6:26
I do but more importantly thinking about what
6:28
I advocate for.
6:40
Let's
6:43
head to New Zealand. That's the
6:45
sound of the local radio station for
6:47
the indigenous Maori community in Kaitaia.
6:51
Te Hiku radio is the community
6:53
voice. Every day we speak to people
6:55
within the community to tell us about
6:57
everything, whether it's to talk to us about the
7:00
climate, the weather, or to talk
7:02
to us about what sorts of foods are
7:04
in season in terms of hunting and gathering
7:07
or fishing and what's going
7:09
on in politics or our health system or
7:11
data sovereignty and artificial intelligence.
7:14
That's Keoni Mahelona. He's
7:16
the chief technology officer of Te Hiku
7:19
Media. Te Hiku is a Maori community
7:21
media network with 21 local radio
7:23
stations. It's been around since the 1990s. Since 2014,
7:28
Keoni, who is Hawaiian, and his partner
7:31
Peter-Lucas Jones, who is Maori, have
7:33
used the internet and more recently AI
7:36
in their efforts to reverse the decline of the Maori
7:38
language, Te Reo Maori.
7:41
Under colonial rule, speaking
7:42
the language was forbidden.
7:45
Now it's an official language of New Zealand.
7:48
Speech recognition
7:50
is just
7:51
a tool.
7:52
These AI models are just a tool
7:55
that enable us to do what we need
7:57
to do. The mission of our organization
7:59
is about language revitalization and
8:02
language promotion and cultural restoration
8:05
and promoting te reo Maori and
8:07
the culture of Maori.
8:09
So how we do that at our organization
8:12
is we tell stories. We tell
8:14
stories on the radio. We tell stories through video.
8:16
We tell stories through live broadcasting. But
8:22
we've been telling stories for more than 35 years,
8:25
and a lot of those stories are captured
8:28
on cassette tapes or VHS tapes. So
8:30
we're in this process of digitizing those tapes,
8:33
and now we want to make the content within them
8:35
available. A
8:36
few years ago, Te Hiku Media
8:38
was working on a project to transcribe historic
8:41
broadcasts with elders who could explain
8:43
the nuances in language and context.
8:47
Keoni realized automatic speech recognition,
8:49
or ASR for short, could
8:52
help.
8:53
So as we were working on this project, we were like, wow,
8:55
this is really hard. If an interview is an hour, it
8:57
takes at least three hours to transcribe it, right?
9:00
So we thought, oh, why don't we just train
9:02
a machine to automatically
9:05
transcribe this for us? Because, hey, you
9:07
know, Siri existed at the time. ASR
9:09
was a thing, so surely we could do
9:11
it in te reo Maori. From a developer perspective,
9:14
like we knew the technology existed.
9:16
We knew there were open source projects out there we could use. But
9:19
what we also knew is that this was actually a
9:21
data problem, and that that would be
9:23
the most important part of this project,
9:26
was not just sort of getting the data, but
9:28
we knew we had to
9:30
gather this data in a way
9:33
in which we could safeguard it and protect
9:35
it and ensure that it would only be used
9:37
for the betterment of Maori and Maori
9:39
things.
9:41
The data is actually voice recordings
9:43
of short sentences paired with text. This
9:45
is what a speech recognition engine, in
9:48
this case, Mozilla's DeepSpeech, uses
9:50
to decode what sounds go with which letters.
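[Editor's sketch] The paired data described above can be pictured as a list of records, each linking one short audio clip to the sentence that was read. The following is a minimal illustration only, not Te Hiku Media's actual schema: the `Utterance` fields, the file paths, the example sentences, and the 4-second average clip length are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One training example: a short recording paired with its text (hypothetical schema)."""
    audio_path: str          # hypothetical path to the recorded clip
    sentence: str            # the text the speaker read aloud
    duration_seconds: float  # length of the clip


def total_hours(utterances):
    """Sum the clip durations and convert seconds to hours."""
    return sum(u.duration_seconds for u in utterances) / 3600


def clips_needed(target_hours, average_clip_seconds):
    """Estimate how many clips of a given average length add up to target_hours."""
    return math.ceil(target_hours * 3600 / average_clip_seconds)


# Two made-up records, for illustration only.
corpus = [
    Utterance("clips/0001.wav", "kia ora", 1.8),
    Utterance("clips/0002.wav", "kei te pēhea koe", 3.6),
]
```

Under the assumed 4-second average, `clips_needed(300, 4.0)` gives a rough sense of how many individual readings a corpus of the size mentioned in this episode would involve.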
9:54
For its dataset, Te Hiku Media reached
9:56
out to community groups, like traditional dance
9:58
troupes and canoe racing teams, and soon
10:01
gathered over 300 hours of speech.
10:04
We mobilized the community to read thousands
10:07
of utterances to help us collect
10:09
a corpus that would enable us to train an ASR.
10:12
In doing that, we learned a lot.
10:15
And one of the things we learned about the
10:17
community, who were pretty much giving their
10:19
time to support this project, was
10:21
that they wanted real-time feedback on
10:24
their readings.
10:26
Keoni says they realized they could
10:28
support language learning by giving people
10:31
immediate feedback on how they pronounce words at
10:33
the same time that they're donating voice data.
10:36
We pretty much hacked DeepSpeech and built
10:38
the real-time pronunciation engine. It's
10:40
an app that we have called Rongo. It's in the
10:43
Apple and Google stores. Anyone
10:45
can download it anywhere in the world.
10:47
If you'd like to share your data with us to help
10:49
promote te reo Māori.
10:51
Keoni says their speech project will make decades
10:53
of audio recordings more accessible
10:55
online.
10:58
One of the things we're looking at is whether there's
11:00
any climate data embedded in
11:02
our archives and how that can help us to
11:04
better mitigate some of the effects of climate
11:06
change. And you need ASR
11:09
to actually do that, right? To go through all these archives
11:11
and transcribe it and then sort of find
11:13
the data embedded in that. And unless
11:15
we can document our knowledge, it won't
11:18
be available for our people in the future. I
11:20
think, you know, that's really the value in what
11:22
we do with our community. We
11:24
don't do it for our community. We
11:27
do this with our community.
11:31
Many big tech companies have been including indigenous
11:33
languages in their online services. And
11:35
on the surface, this seems like a good thing.
11:38
But Keoni's not so sure.
11:40
These companies don't really know
11:43
much about our languages or our cultures.
11:46
And by simply trying
11:48
to include us, they could actually do more
11:51
harm than good to our
11:53
communities, to our languages, especially
11:55
languages that are in a state of revitalization.
11:59
What we've seen in the past with tools
12:01
like Translate from companies like
12:03
Google and Microsoft is the translation
12:06
doesn't really work very well but
12:09
people use the tool and they treat the tool
12:11
as sort of 100% accurate but
12:14
the truth is the algorithms they use or
12:16
the models they've trained aren't 100%
12:18
correct.
12:20
About five years ago indigenous language
12:22
speakers started getting offers from a language tech
12:24
company for $45 to $90
12:27
an hour for their voice recordings. It
12:29
was for an unspecified corporate purpose
12:31
but said the goal was to keep languages
12:34
alive. Keoni says this
12:36
approach is extractive and undermines
12:38
the work of communities. Then
12:40
in 2022, OpenAI dropped
12:43
a new multilingual speech recognition
12:46
model called Whisper. It was trained on
12:48
over 600,000 hours of audio from the web including
12:52
over 1,300 hours of
12:55
te reo Maori. How they source
12:57
this data is secret.
12:59
We were very very concerned
13:02
when we heard about Whisper because
13:04
we thought oh well there we go you know
13:07
no point doing this anymore right because
13:09
hey look big tech has solved it for us
13:12
they've they've saved our language thank you but
13:14
we knew that the model was
13:16
crap like we knew it wasn't gonna be good even
13:19
though some of our like data scientists kind of had a quick
13:21
play with it like oh my god it's really good the
13:24
ones who had a play with it actually aren't speakers
13:26
or fluent speakers of te reo Maori so
13:28
when one of our language experts had a quick
13:31
look it was obvious it was absolute trash
13:33
and then we quantified like we
13:35
quantified that trash.
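[Editor's sketch] The episode does not say which measure the team used to quantify quality. One common way to score a speech recognizer is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model's output into a human reference transcript, divided by the length of the reference. The function below is a minimal sketch under that assumption, and the example sentences in the usage note are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothetical output "kia aura hoa" against the reference "kia ora e hoa" counts one substituted word and one dropped word out of four. A score like this only means something when the reference transcripts come from fluent speakers, which is the point made in the passage above.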
13:40
Whisper is open source but that doesn't
13:42
make it feel any less like unfair
13:44
competition to Te Hiku Media.
13:46
We are absolutely now in competition
13:49
with these tech companies. When we fine-tuned
13:52
Whisper with our
13:55
data, our highly curated data
13:57
of quality te reo Maori, we were
14:00
able to create a model that was
14:03
pretty good at recognizing
14:05
te reo Maori. And it did perform
14:08
better than our previous model, but
14:10
our previous model was built on
14:12
very old technology. So
14:15
I think, I think where we're at now
14:17
is that we know we can do better than
14:19
them. Despite only having like,
14:22
you know, a handful of people in our team, not much
14:24
money and not much compute,
14:26
like we've proven we can do better than them for
14:29
te reo Maori, but
14:31
there's still that existential risk
14:34
of when will they be as
14:36
good as us or better than us. And
14:39
understanding that when you also understand how
14:41
will they achieve that? And the only way they
14:43
can achieve that is with more language
14:46
data, more Maori language data.
14:48
So then we need to ask ourselves, how
14:51
will they get more language data or from
14:53
where will they get that data?
14:55
And that's the concern. Te Hiku
14:58
Media says it's the guardian, not
15:01
the owner of the data it collects and the software
15:03
it creates for the community. The organization
15:06
developed a special license called Kaitiakitanga
15:09
that requires permission for reuse. This
15:12
way, the community has control over how they
15:14
get benefits back. Keoni
15:16
says this approach to data sovereignty is
15:18
modeled after how indigenous communities traditionally
15:21
act as guardians of their land to
15:23
protect them from colonization for
15:25
future generations. And they've
15:27
taken all our land, right? So what left do
15:29
we have for them to take? Well, it's
15:31
our data. I mean, that's, that's pretty much it. You
15:34
know, they've taken everything else.
15:39
Let's
15:39
meet someone now who cares deeply
15:41
about speech recognition in African languages.
15:46
My name is Kathleen Siminyu and
15:48
I'm a machine learning fellow at Mozilla Foundation.
15:51
In my career, I've worked to build
15:54
grassroots AI communities. Kathleen
15:56
lives in Kilifi, Kenya and works with
15:59
Mozilla on Common Voice.
15:59
It's a platform
16:01
for crowdsourcing open voice data
16:03
in over 100 languages. Its
16:06
mission is to make voice technology more inclusive.
16:09
Kathleen helps lead efforts to gather data
16:11
for Kiswahili on Common Voice. This
16:14
is a language spoken in several East African
16:16
countries by as many as 200 million people.
16:19
Until recently, it wasn't a language open
16:22
source developers could build speech applications for.
16:25
Common Voice is important because it's
16:27
an open data set. Anybody can
16:29
build on it. Everyone can access the
16:31
data and therefore the communities can
16:34
start to build for the languages that they
16:36
care about or they speak or that those
16:38
around them speak.
16:40
My hope is that we open
16:42
up
16:43
the path for more voice
16:45
technology. And by this,
16:47
I can tell you a little story. At
16:49
my first job, I worked at a company in
16:51
the telco space and we
16:54
basically had products like voice and SMS. And
16:57
I remember in an election year, we needed
16:59
to be screening messages
17:01
to make sure inciteful content is not being
17:03
sent on our platform.
17:05
In a heated political moment in Kenya,
17:07
Kathleen wanted to build a tool that would automatically
17:10
search for messages inciting violence.
17:12
And in my head, I thought this
17:14
is going to be super easy. But then
17:16
I realized that none of the tools that
17:18
existed were going to be of use
17:20
because I needed tools for Kiswahili
17:23
or other local languages spoken in
17:25
the country.
17:27
Kathleen's experience of not being able
17:29
to build a tool in her own language inspired
17:32
her to do more research on her own.
17:34
She soon discovered Masakhane, a network
17:36
of researchers working on computer science and
17:39
linguistics in African languages since 2019.
17:42
I realized that, okay, there's other people
17:44
who are interested in these problems. And
17:47
one of the biggest projects, our first project
17:49
was a machine translation project. Since
17:52
then, we've gone to other
17:54
tasks. There's a lot of work
17:56
coming out of this community.
17:59
companies are gaining a foothold
18:01
on AI across Africa. Networks
18:04
like Masakhane and Deep Learning Indaba
18:06
want to see AI shaped and
18:08
owned by Africans. For Kathleen,
18:11
working within communities is an opportunity
18:13
to create voice technologies that
18:15
respect language diversity. I
18:18
think the benefit
18:20
is the fact that the communities
18:22
are aware of the nuances
18:25
of the language. So
18:28
taking the context of speech recognition,
18:30
I'll give the example that we learned
18:32
from the West that gender bias is likely,
18:35
that accent bias is likely, but
18:37
then we then have to look
18:40
at an East African context and
18:42
ask ourselves, okay, what bias
18:44
is likely here?
18:45
Working with linguists with local knowledge
18:48
helped Kathleen understand how Kiswahili
18:50
was standardized by Christian missionaries
18:52
during colonization.
18:54
This knowledge for me made
18:56
me realize that we should not make the mistake
18:59
of only building for standardized
19:01
Kiswahili. There's already this growing
19:03
gap between the standardized version
19:06
and the other dialects. And if we're not careful,
19:08
we're continuing to push these other dialects
19:11
to extinction.
19:13
Extinction. It's like
19:16
AI takes on the role of the colonizer when
19:18
certain dialects are favored over others.
19:21
But convincing people to donate their voices isn't
19:23
easy.
19:24
So
19:25
incentivizing participation has been
19:28
quite difficult. I think one
19:30
reason is because AI is
19:34
very much in the media right now, right? And
19:36
everybody has this perception that people who
19:38
are working in AI are making loads of money.
19:41
So whenever we go
19:43
into spaces and start talking about the work that
19:45
we're doing and why we want people to contribute
19:48
to the data and tie it to the fact that
19:50
AI tools can be built,
19:52
they then want to know, okay, am I
19:54
going to get paid? But in our program,
19:57
we are not paying people to
19:59
contribute. So we have to be very creative
20:01
about how we think about incentives.
20:05
Like many advocates for open tech in Africa,
20:08
Kathleen is wrestling with how to build sustainable
20:10
projects and businesses when the data sets
20:12
are open,
20:13
because big tech uses these resources too.
20:16
So more projects are considering alternatives
20:18
to completely open licensing. There's
20:21
also been talk of creating something like a federation.
20:24
From the startups, we're learning
20:26
that big tech coming into
20:28
the scene and saying our tools
20:31
or our resources are multilingual and they
20:33
cover this number of African
20:36
languages has meant that for startups,
20:39
it's harder to get, say, VC funding.
20:42
If you pitch to a VC and they say Kiswahili
20:45
is on OpenAI's Whisper already,
20:47
why should we give you money? It's
20:49
a problem that's already solved. So
20:51
these questions are coming up often. How can
20:53
we give startups within our network
20:56
the advantage?
20:56
These startups are building
20:58
with the community. Can we license
21:01
the data
21:01
sets such that the startups
21:03
get access to them or maybe not
21:06
make the data sets open? Have them only
21:08
open within the network such that these
21:10
startups can have access to them, but then not
21:13
big tech.
21:21
With more than 7,000 languages worldwide,
21:24
decisions about voice data today will
21:26
influence how
21:27
people communicate tomorrow. A
21:29
lot more can be done.
21:30
This goes for big tech and the
21:33
open source communities getting squeezed
21:34
by their dominance.
21:36
Speech recognition is about more than just convenience.
21:39
For people who depend on AI to recognize
21:41
their voices at home, on the phone, or even
21:43
in court, these systems and
21:45
the data they're built with reinforce
21:48
inequality. This is what can be challenged
21:50
when communities reclaim a voice in AI
21:53
to build for themselves.
21:56
Before this episode ends, I've got some sad
21:58
news to share.
21:59
Halcyon Lawrence, the first guest in this episode,
22:02
passed away a few weeks after we spoke.
22:05
In honor of her legacy, we're glad we could still
22:07
include her voice in this show. We
22:09
hear you, Halcyon. Thank you for
22:11
everything. To learn more about Halcyon
22:13
and our other guests, please visit our show
22:15
notes.
22:19
I'm Bridget Todd.
22:20
You've been listening to IRL, Online
22:22
Life is Real Life, an original podcast
22:24
from Mozilla, the nonprofit behind Firefox.
22:28
Mozilla, reclaim the internet.
22:36
Hey, it's me again.
22:37
I just signed off with reclaim the internet,
22:39
but what does that mean? To find out,
22:42
we're turning to some of the 25 digital
22:44
visionaries
22:44
who have just received Mozilla's new
22:46
Rise 25 award.
22:48
This is Raphael Mimoun. He's the
22:50
founder of a tech nonprofit called Horizontal.
22:53
They support journalists and activists with
22:55
digital security and technology to document
22:57
human rights abuses.
22:59
To me, reclaiming the internet means
23:03
taking back control
23:04
over the technology we use on a daily basis. I
23:08
think we're realizing with billionaires
23:10
buying social networks that we all depend on
23:12
and that we all cherish and that have been so instrumental
23:15
in social movements that suddenly
23:18
we're not in control. And
23:21
really reclaiming the internet is finding
23:24
the structures, the infrastructures
23:27
where we as a community,
23:30
as a community of users, we
23:32
really can control and shape the present and the
23:34
future of the technology we use.
23:37
That's Raphael Mimoun on how to reclaim the
23:39
internet. To learn more about Raphael, Horizontal,
23:42
and the other winners of Mozilla's Rise 25 awards, go
23:45
to rise25.mozilla.org.
23:47
Now it's your turn.
23:49
Go reclaim the internet.