Lend Me Your Voice by IRL: Online Life is Real Life | Podchaser

Episode from the podcast IRL: Online Life is Real Life

Lend Me Your Voice

Released Tuesday, 21st November 2023

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

0:00

Hey Siri, play

0:02

IRL

0:02

podcast. Here's the

0:04

podcast IRL. Online

0:07

life is real life.

0:10

Lots of us use virtual assistants. They're

0:12

part of our everyday lives. We

0:15

use them to check the weather or the time. Or

0:17

if you're me, you might be like, Hey Siri,

0:19

play Beyonce. But

0:22

speech recognition systems don't work

0:25

equally well for everyone. They don't even

0:27

exist for many languages. Big

0:29

Tech has stepped up to offer more diversity

0:32

in their language models, for speech

0:34

and more. But it comes with a new set

0:36

of problems. How

0:40

do I feel about Big Tech

0:42

sort of paying attention to our marginalized

0:45

indigenous languages? I guess

0:47

the first thing I wonder is why? Why

0:50

do they care now? Do they genuinely

0:52

care to ensure

0:54

inclusivity online? Or

0:57

did they finally realize that

0:59

being more inclusive is better

1:02

for them and their bottom lines?

1:04

That's Keoni Mahelona in New Zealand.

1:07

We'll hear more from him in a bit. In

1:09

this episode, we meet technology

1:11

builders who are reclaiming speech recognition

1:14

with and for their own language communities.

1:17

This is IRL, an original podcast from

1:19

Mozilla, the

1:23

nonprofit behind Firefox. I'm

1:25

Bridget Todd. This

1:27

season, we meet people who are building artificial

1:29

intelligence that puts people over profit. First,

1:32

let's make a stop in the US. We're

1:35

in Maryland, not far from where I live. I

1:40

spent a year with Alexa and

1:43

I allowed the device to do

1:45

whatever they wanted.

1:48

I allowed the device to

1:50

do whatever the device could be

1:52

seen.

1:54

That's Halcyon Lawrence, at

2:00

Towson University. Three

2:03

years ago she conducted an experiment

2:06

with Amazon's home assistant Alexa

2:08

which is pretty popular in the US.

2:11

So for example I would ask can

2:13

you set a 5.50 alarm

2:16

and the device would hear 5.50

2:19

and so I would just wake up at 5.50. I wanted

2:22

to push and see what is the level of inconvenience

2:25

right? But this device would allow

2:27

me to

2:28

do. Halcyon

2:31

grew up in Trinidad and Tobago. While

2:34

Caribbean accents can still throw off voice

2:36

tech by US companies, the tech has

2:38

improved so much that it altered the focus

2:41

of Halcyon's research.

2:45

So why is it important for technology

2:47

to be able to understand us? Well

2:50

I think this is where

2:52

it sort of speaks to the convenience and

2:55

the question that arises is convenience

2:57

for whom? You know

2:59

the kinds of interactions

3:01

that I have with most

3:03

speech devices like personal assistants,

3:05

if they do not understand me it's

3:08

often very comical and

3:10

maybe a minor inconvenience and

3:12

so that's sort of part of the thesis.

3:15

But let's scale up because

3:18

these speech recognition devices are

3:20

being deployed in a number of other

3:23

spaces. So in the US for

3:25

example they're increasingly

3:27

being used to automatically transcribe

3:30

court recordings. They're being

3:33

used as aggression detectors in prisons

3:36

as well as schools and

3:38

so you can well imagine

3:41

these are spaces where being misheard

3:43

or misunderstood can have bad consequences.

3:53

colonial

4:00

powers dominated people in her region

4:02

and worldwide. She sees parallels

4:05

in how digital technology pushes people

4:07

to speak in certain ways just to be

4:09

understood. One

4:11

of the things that concerns me is

4:13

the expectation that you speak with a standard

4:15

accent, whether it be standard English

4:18

or standard French or any sort of standard

4:20

language suggests

4:23

that anybody who does not

4:25

speak with that

4:25

standard accent is

4:28

misheard or misunderstood and

4:30

these are vulnerable populations who

4:32

turn up in spaces like prisons

4:34

and courts of law where they need

4:37

to be heard and understood accurately.

4:39

So you know it's as important

4:42

as why you know asking that question why do

4:44

we need to be heard or understood in person

4:47

is no less important in

4:49

the digital space. So

4:54

Halcyon, are there ways that you think that technology

4:56

can be designed differently so that folks who

4:58

maybe don't speak North American or British

5:00

English can be understood?

5:03

So

5:04

your question hits

5:06

upon past me and

5:08

current me. Past

5:10

me when I started doing this

5:12

research, the easy answer would have been

5:15

yes we need more representation in

5:17

these devices right if if

5:19

I can hear and be heard

5:22

with a Trinidadian accent surely that

5:24

would solve the problem. But

5:27

recently on a trip home

5:29

she was reminded how language is also

5:31

used

5:31

as resistance.

5:33

For instance, by speaking in ways that

5:35

cannot be understood by oppressors.

5:38

I started visiting with friends

5:40

and I have forgotten how

5:45

we have also used language

5:47

to subdue colonial

5:49

authority. That

5:52

other kinds of dialects have

5:54

emerged, that patois

5:56

has emerged as a way

5:59

of subverting.

5:59

And so the question then

6:02

arises,

6:03

what does it mean to give organizations

6:06

access to that kind of voice detail?

6:10

What kind of power are we

6:13

handing over if I am advocating

6:15

for greater representation of languages

6:18

and dialects and accents? And so

6:20

I am in a bit of a conundrum right now

6:23

thinking about the kind of research that

6:26

I do but more importantly thinking about what

6:28

I advocate for.

6:40

Let's

6:43

head to New Zealand. That's the

6:45

sound of the local radio station for

6:47

the indigenous Maori community in Kaitaia.

6:51

Te Hiku Radio is the community

6:53

voice. Every day we speak to people

6:55

within the community to tell us about

6:57

everything, whether it's to talk to us about the

7:00

climate, the weather, or to talk

7:02

to us about what sorts of foods are

7:04

in season in terms of hunting and gathering

7:07

or fishing and what's going

7:09

on in politics or our health system or

7:11

data sovereignty and artificial intelligence.

7:14

That's Keoni Mahelona. He's

7:16

the chief technology officer of Te Hiku

7:19

Media. He has a Maori community

7:21

media network with 21 local radio

7:23

stations. It's been around since the 1990s. Since 2014,

7:28

Keoni, who is Hawaiian, and his partner

7:31

Peter Lucas Jones, who is Maori, have

7:33

used the internet and more recently AI

7:36

in their efforts to reverse the decline of the Maori

7:38

language, Te Reo Maori.

7:41

Under colonial rule, speaking

7:42

the language was forbidden.

7:45

Now it's an official language of New Zealand.

7:48

Speech recognition

7:50

is just

7:51

a tool.

7:52

These AI models are just a tool

7:55

that enable us to do what we need

7:57

to do. The mission of our organization

7:59

is about... language revitalization and

8:02

language promotion and cultural restoration

8:05

and promoting te reo Maori and

8:07

the culture of Maori.

8:09

So how we do that at our organization

8:12

is we tell stories. We tell

8:14

stories on the radio. We tell stories through video.

8:16

We tell stories through live broadcasting. But

8:22

we've been telling stories for more than 35 years,

8:25

and a lot of those stories are captured

8:28

on cassette tapes or VHS tapes. So

8:30

we're in this process of digitizing those tapes,

8:33

and now we want to make the content within them

8:35

available. A

8:36

few years ago, Te Hiku Media

8:38

was working on a project to transcribe historic

8:41

broadcasts with elders who could explain

8:43

the nuances in language and context.

8:47

Keoni realized automatic speech recognition,

8:49

or ASR for short, could

8:52

help.

8:53

So as we were working on this project, we were like, wow,

8:55

this is really hard. If an interview is an hour, it

8:57

takes at least three hours to transcribe it, right?

9:00

So we thought, oh, why don't we just train

9:02

a machine to automatically

9:05

transcribe this for us? Because, hey, you

9:07

know, Siri existed at the time. ASR

9:09

was a thing, so surely we could do

9:11

it in te reo Maori. From a developer perspective,

9:14

like we knew the technology existed.

9:16

We knew there were open source projects out there we could use. But

9:19

what we also knew is that this was actually a

9:21

data problem, and that that would be

9:23

the most important part of this project,

9:26

was not just sort of getting the data, but

9:28

we knew we had to

9:30

gather this data in a way

9:33

in which we could safeguard it and protect

9:35

it and ensure that it would only be used

9:37

for the betterment of Maori and Maori

9:39

things.

9:41

The data is actually voice recordings

9:43

of short sentences paired with text. This

9:45

is what a speech recognition engine, in

9:48

this case, Mozilla's DeepSpeech, uses

9:50

to decode what sounds go with which letters.
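
As a rough illustration of what "recordings paired with text" can look like, here is a minimal Python sketch. The file names, the two sample sentences, and the field names are hypothetical; they are not taken from Te Hiku Media's corpus, and this is not DeepSpeech's actual input format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One training example: a short recording and the sentence read aloud."""
    audio_path: str          # hypothetical file name
    transcript: str          # the text the speaker was asked to read
    duration_seconds: float  # length of the clip


def total_hours(utterances):
    """Sum clip durations and convert seconds to hours."""
    return sum(u.duration_seconds for u in utterances) / 3600.0


def alphabet(utterances):
    """Collect the characters (letters, spaces) that appear in the transcripts."""
    chars = set()
    for u in utterances:
        chars.update(u.transcript.lower())
    return sorted(chars)


# Hypothetical sample rows; a real corpus holds many thousands of clips.
corpus = [
    Utterance("clip_0001.wav", "kia ora", 1.5),
    Utterance("clip_0002.wav", "tēnā koe", 2.1),
]
```

Summing `duration_seconds` across the rows is how a figure like "over 300 hours of speech" would be tallied, and the character set from `alphabet` is the set of letters the engine has to learn to match to sounds.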

9:54

For its dataset, Te Hiku Media reached

9:56

out to community groups, like traditional dance

9:58

troupes and canoe racing teams, and soon

10:01

gathered over 300 hours of speech.

10:04

We mobilized the community to read thousands

10:07

of utterances to help us collect

10:09

a corpus that would enable us to train an ASR.

10:12

In doing that, we learned a lot.

10:15

And one of the things we learned about the

10:17

community, who were pretty much giving their

10:19

time to support this project, was

10:21

that they wanted real-time feedback on

10:24

their readings.

10:26

Keoni says they realized they could

10:28

support language learning by giving people

10:31

immediate feedback on how they pronounce words at

10:33

the same time that they're donating voice data.

10:36

We pretty much hacked DeepSpeech and built

10:38

the real-time pronunciation engine. It's

10:40

an app that we have called Rongo. It's in the

10:43

Apple and Google stores. Anyone

10:45

can download it anywhere in the world.

10:47

You'd like to share your data with us to help

10:49

promote te reo Māori.

10:51

Keoni says their speech project will make decades

10:53

of audio recordings more accessible

10:55

online.

10:58

One of the things we're looking at is whether there's

11:00

any climate data embedded in

11:02

our archives and how that can help us to

11:04

better mitigate some of the effects of climate

11:06

change. And you need ASR

11:09

to actually do that, right? To go through all these archives

11:11

and transcribe it and then sort of find

11:13

the data embedded in that. And unless

11:15

we can document our knowledge, it won't

11:18

be available for our people in the future. I

11:20

think, you know, that's really the value in what

11:22

we do with our community. We

11:24

don't do it for our community. We

11:27

do this with our community.

11:31

Many big tech companies have been including indigenous

11:33

languages in their online services. And

11:35

on the surface, this seems like a good thing.

11:38

But Keoni's not so sure.

11:40

These companies don't really know

11:43

much about our languages or our cultures.

11:46

And by simply trying

11:48

to include us, they could actually do more

11:51

harm than good to our

11:53

communities, to our languages, especially

11:55

languages that are in a state of revitalization.

11:59

What we've seen in the past with tools

12:01

like Translate from companies like

12:03

Google and Microsoft is the translation

12:06

doesn't really work very well but

12:09

people use the tool and they treat the tool

12:11

as sort of 100% accurate but

12:14

the truth is the algorithms they use or

12:16

the models they've trained aren't 100%

12:18

correct.

12:20

About five years ago indigenous language

12:22

speakers started getting offers from a language tech

12:24

company for $45 to $90

12:27

an hour for their voice recordings. It

12:29

was for an unspecified corporate purpose

12:31

but said the goal was to keep languages

12:34

alive. Keoni says this

12:36

approach is extractive and undermines

12:38

the work of communities. Then

12:40

in 2022, OpenAI dropped

12:43

a new multilingual speech recognition

12:46

model called Whisper. It was trained on

12:48

over 600,000 hours of audio from the web including

12:52

over 1,300 hours of

12:55

te reo Maori. How they source

12:57

this data is secret.

12:59

We were very very concerned

13:02

when we heard about Whisper because

13:04

we thought oh well there we go you know

13:07

no point doing this anymore right because

13:09

hey look big tech has solved it for us

13:12

they've they've saved our language thank you but

13:14

we knew that the model was

13:16

crap like we knew it wasn't gonna be good even

13:19

though some of our like data scientists kind of had a quick

13:21

play with it like oh my god it's really good the

13:24

ones who had a play with it actually aren't speakers

13:26

or fluent speakers of te reo Maori so

13:28

when one of our language experts had a quick

13:31

look it was obvious it was absolute trash

13:33

and then we quantified like we

13:35

quantified that trash.
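
The episode does not say which metric the team used. One common way to put a number on transcription quality is word error rate (WER): the fewest word substitutions, deletions, and insertions needed to turn the model's output into a trusted human transcript, divided by the number of words in that transcript. The sketch below assumes that metric; the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)
```

A score of 0.0 means the output matches the reference word for word, and higher is worse. Because the reference has to come from someone who knows the language well, a measure like this only works with fluent speakers involved, which fits the point made here about data scientists who were not speakers judging the model too kindly.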

13:40

Whisper is open source but that doesn't

13:42

make it feel any less like unfair

13:44

competition to Te Hiku Media.

13:46

We are absolutely now in competition

13:49

with these tech companies when we fine-tuned

13:52

Whisper with our

13:55

data our highly curated data

13:57

of quality te reo Maori, we were

14:00

able to create a model that was

14:03

pretty good at recognizing

14:05

te reo Maori. And it did perform

14:08

better than our previous model, but

14:10

our previous model was built on

14:12

very old technology. So

14:15

I think, I think where we're at now

14:17

is that we know we can do better than

14:19

them. Despite only having like,

14:22

you know, a handful of people in our team, not much

14:24

money and not much compute,

14:26

like we've proven we can do better than them for

14:29

te reo Maori, but

14:31

there's still that existential risk

14:34

of when will they be as

14:36

good as us or better than us. And

14:39

understanding that when you also understand how

14:41

will they achieve that? And the only way they

14:43

can achieve that is with more language

14:46

data, more Maori language data.

14:48

So then we need to ask ourselves, how

14:51

will they get more language data or from

14:53

where will they get that data?

14:55

And that's the concern. Te Hiku

14:58

Media says it's the guardian, not

15:01

the owner of the data it collects and the software

15:03

it creates for the community. The organization

15:06

developed a special license called Kaitiakitanga

15:09

that requires permission for reuse. This

15:12

way, the community has control over how they

15:14

get benefits back. Keoni

15:16

says this approach to data sovereignty is

15:18

modeled after how indigenous communities traditionally

15:21

act as guardians of their land to

15:23

protect them from colonization for

15:25

future generations. And they've

15:27

taken all our land, right? So what left do

15:29

we have for them to take? Well, it's

15:31

our data. I mean, that's, that's pretty much it. You

15:34

know, they've taken everything else.

15:39

Let's

15:39

meet someone now who cares deeply

15:41

about speech recognition in African languages.

15:46

My name is Kathleen Siminyu and

15:48

I'm a machine learning fellow at Mozilla Foundation.

15:51

In my career, I've worked to build

15:54

grassroots AI communities. Kathleen

15:56

lives in Kilifi, Kenya and works with

15:59

Mozilla on Common Voice.

15:59

It's a platform

16:01

for crowdsourcing open voice data

16:03

in over 100 languages. Its

16:06

mission is to make voice technology more inclusive.

16:09

Kathleen helps lead efforts to gather data

16:11

for Kiswahili on Common Voice. This

16:14

is a language spoken in several East African

16:16

countries by as many as 200 million people.

16:19

Until recently, it wasn't a language open

16:22

source developers could build speech applications for.

16:25

Common Voice is important because it's

16:27

an open data set. Anybody can

16:29

build on it. Everyone can access the

16:31

data and therefore the communities can

16:34

start to build for the languages that they

16:36

care about or they speak or that those

16:38

around them speak.

16:40

My hope is that we open

16:42

up

16:43

the path for more voice

16:45

technology. And by this,

16:47

I can tell you a little story. At

16:49

my first job, I worked at a company in

16:51

the telco space and we

16:54

basically had products like voice and SMS. And

16:57

I remember in an election year, we needed

16:59

to be screening messages

17:01

to make sure inciteful content is not being

17:03

sent on our platform.

17:05

In a heated political moment in Kenya,

17:07

Kathleen wanted to build a tool that would automatically

17:10

search for messages inciting violence.

17:12

And in my head, I thought this

17:14

is going to be super easy. But then

17:16

I realized that none of the tools that

17:18

existed were going to be of use

17:20

because I needed tools for Kiswahili

17:23

or other local languages spoken in

17:25

the country.

17:27

Kathleen's experience of not being able

17:29

to build a tool in her own language inspired

17:32

her to do more research on her own.

17:34

She soon discovered Masakhane, a network

17:36

of researchers working on computer science and

17:39

linguistics in African languages since 2019.

17:42

I realized that, okay, there's other people

17:44

who are interested in these problems. And

17:47

one of the biggest projects, our first project

17:49

was a machine translation project. Since

17:52

then, we've gone to other

17:54

tasks. There's a lot of work

17:56

coming out of this community.

17:59

companies are gaining a foothold

18:01

on AI across Africa. Networks

18:04

like Masakhane and Deep Learning Indaba

18:06

want to see AI shaped and

18:08

owned by Africans. For Kathleen,

18:11

working within communities is an opportunity

18:13

to create voice technologies that

18:15

respect language diversity. I

18:18

think the benefit

18:20

is the fact that the communities

18:22

are aware of the nuances

18:25

of the language. So

18:28

taking the context of speech recognition,

18:30

I'll give the example that we learned

18:32

from the West that gender bias is likely,

18:35

that accent bias is likely, but

18:37

then we then have to look

18:40

at an East African context and

18:42

ask ourselves, okay, what bias

18:44

is likely here?

18:45

Working with linguists with local knowledge

18:48

helped Kathleen understand how Kiswahili

18:50

was standardized by Christian missionaries

18:52

during colonization.

18:54

This knowledge for me made

18:56

me realize that we should not make the mistake

18:59

of only building for standardized

19:01

Kiswahili. There's already this growing

19:03

gap between the standardized version

19:06

and the other dialects. And if we're not careful,

19:08

we're continuing to push these other dialects

19:11

to extinction.

19:13

Extinction. It's like

19:16

AI takes on the role of the colonizer when

19:18

certain dialects are favored over others.

19:21

But convincing people to donate their voices isn't

19:23

easy.

19:24

So

19:25

incentivizing participation has been

19:28

quite difficult. I think one

19:30

reason is because AI is

19:34

very much in the media right now, right? And

19:36

everybody has this perception that people who

19:38

are working in AI are making loads of money.

19:41

So whenever we go

19:43

into spaces and start talking about the work that

19:45

we're doing and why we want people to contribute

19:48

to the data and tie it to the fact that

19:50

AI tools can be built,

19:52

they then want to know, okay, am I

19:54

going to get paid? But in our program,

19:57

we are not paying people to

19:59

contribute. So we have to be very creative

20:01

about how we think about incentives.

20:05

Like many advocates for open tech in Africa,

20:08

Kathleen is wrestling with how to build sustainable

20:10

projects and businesses when the data sets

20:12

are open,

20:13

because big tech uses these resources too.

20:16

So more projects are considering alternatives

20:18

to completely open licensing. There's

20:21

also been talk of creating something like a federation.

20:24

From the startups, we're learning

20:26

that big tech coming into

20:28

the scene and saying our tools

20:31

or our resources are multilingual and they

20:33

cover this number of African

20:36

languages has meant that for startups,

20:39

it's harder to get, say VC funding.

20:42

If you pitch to a VC and they say Kiswahili

20:45

is on OpenAI's Whisper already,

20:47

why should we give you money? It's

20:49

a problem that's already solved. So

20:51

these questions are coming up often. How can

20:53

we give startups within our network

20:56

the advantage?

20:56

These startups are building

20:58

with the community. Can we license

21:01

the data

21:01

sets such that the startups

21:03

get access to them or maybe not

21:06

make the data sets open? Have them only

21:08

open within the network such that these

21:10

startups can have access to them, but then not

21:13

big tech.

21:21

With more than 7,000 languages worldwide,

21:24

decisions about voice data today will

21:26

influence how

21:27

people communicate tomorrow. A

21:29

lot more can be done.

21:30

This goes for big tech and the

21:33

open source communities getting squeezed

21:34

by their dominance.

21:36

Speech recognition is about more than just convenience.

21:39

For people who depend on AI to recognize

21:41

their voices at home, on the phone, or even

21:43

in court, these systems and

21:45

the data they're built with reinforce

21:48

inequality. This is what can be challenged

21:50

when communities reclaim a voice in AI

21:53

to build for themselves.

21:56

Before this episode ends, I've got some sad

21:58

news to share.

21:59

Halcyon Lawrence, the first guest in this episode,

22:02

passed away a few weeks after we spoke.

22:05

In honor of her legacy, we're glad we could still

22:07

include her voice in this show. We

22:09

hear you, Halcyon. Thank you for

22:11

everything. To learn more about Halcyon

22:13

and our other guests, please visit our show

22:15

notes.

22:19

I'm Bridget Todd.

22:20

You've been listening to IRL, Online

22:22

Life is Real Life, an original podcast

22:24

from Mozilla, the nonprofit behind Firefox.

22:28

Mozilla, reclaim the internet.

22:36

Hey, it's me again.

22:37

I just signed off with reclaim the internet,

22:39

but what does that mean? To find out,

22:42

we're turning to some of the 25 digital

22:44

visionaries

22:44

who have just received Mozilla's new

22:46

Rise 25 award.

22:48

This is Rafael Mimoun. He's the

22:50

founder of a tech nonprofit called Horizontal.

22:53

They support journalists and activists with

22:55

digital security and technology to document

22:57

human rights abuses.

22:59

To me, reclaiming the internet means

23:03

taking back control

23:04

over the technology we use on a daily basis. I

23:08

think we're realizing with billionaires

23:10

buying social networks that we all depend on

23:12

and that we all cherish and that have been so instrumental

23:15

in social movements that suddenly

23:18

we're not in control. And

23:21

really reclaiming the internet is finding

23:24

the structures, the infrastructures

23:27

where we as a community,

23:30

as a community of users, we

23:32

really can control and shape the present and the

23:34

future of the technology we use.

23:37

That's Rafael Mimoun on how to reclaim the

23:39

internet. To learn more about Rafael, Horizontal,

23:42

and the other winners of Mozilla's Rise 25 awards, go

23:45

to rise25.mozilla.org.

23:47

Now it's your turn.

23:49

Go reclaim the internet.

From The Podcast

IRL: Online Life is Real Life

How does artificial intelligence change when people — not profit — truly come first? Join IRL’s host Bridget Todd, as she meets people around the world building responsible alternatives to the tech that’s changing how we work, communicate, and even listen to music.
