100 billion Points Every Day

Released Wednesday, 16th August 2023

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

0:01

Welcome to another episode of the Mapscaping Podcast. My name is Daniel and this is a podcast for the geospatial community. Today on the podcast we're starting with a very large number. We're starting with 100 billion. Let's say I gave you a spreadsheet with 100 billion rows in it. Each row consisted of five columns: latitude, longitude, device ID, a timestamp, and a column telling you the name of the data provider. What would you do with that? How would you clean it? How would you make sense of it, extract value from it? What do you think people would use it for? How would you do all of this stuff in a way that could be systematized, in a way that you could repeat again tomorrow? Foursquare does this every day with the help of something they call a movement engine. To help us understand more about how they do this, I've invited Gabriel Durkin, the director of data science, on the podcast. This is the last in a series of episodes that I have been working on together with Foursquare, and I have to say they have been absolutely brilliant to work with. If you're interested in hearing some of the previous episodes, I'll put links to them in the show notes of today's episode. But for right now, we're back to the 100 billion points.
1:06

Hey Gabriel, welcome to the podcast. You are the director of data science at Foursquare. You have something called a movement engine over there, and you process 100 billion records, GPS records I should say, each and every day. At least that's what I got out of our pre-interview conversation, and I'm hoping you can put a few more words around that in just a minute. My guess is you haven't always been the director of data science at Foursquare. How did you get there? Where did you come from? How did you get involved in processing movement data?
1:34

Well, it's nice to be here, first of all, Daniel. When it comes to how I got here, it's been a circuitous journey. For the first 20 years of my adult working life, I was a quantum physicist. I did my PhD at Oxford in quantum physics and then moved to the States to work at the Jet Propulsion Lab and then at the NASA Ames Research Center. We had a quantum computing team there and I was part of it. So data science was kind of a career change for me, probably seven years ago now. I worked at Uber and some other startups, first as an individual contributor and then eventually moving into management. And that's some of the story about how I got here today, working at Foursquare on geospatial data, leading the movement engine, which is a name for the team, the people working on movement data, but also a name for the platform that we built.
2:31

That is a cool name for a team, the movement engine. Hey, if we just stay with your past for a second here, what was it like going from quantum physics, I think you said, over to geospatial data? Was it a big jump? Was there anything that was difficult to learn? Was there a huge vocabulary shift, or is it all just, you know, more data?
2:51

Yeah, I mean, it was a choice I made just because I wanted to work more broadly in industry. I enjoy research, and I still consider myself a quantum physicist, but I wanted to work in a faster paced environment. I'd been at NASA a long time, so I thought a change of pace might be interesting. And I knew that data science was a career that had a lot of transferable skills for people with PhDs in the so-called hard sciences: numeracy, analytical skills, and also, I think, the best data scientists are the ones that have that kind of scientific curiosity, who are willing to turn over every rock. That's something that's hard to just learn in college; I think either you have that instinct or you don't. So I cut my teeth on geospatial data at Uber and learned a lot there. It's a completely different type of work. There's certainly the science aspect of it, but it's also working collaboratively with people with different backgrounds, designers and product managers. So it's actually quite an enriching experience, I've definitely enjoyed it, and for me it was the right career move. And it's only something that's become possible recently: data science as a career has only really existed for just over 10 years, I guess. The career path to data science these days is quite varied, but there was a program called Insight Data Science, a kind of fellowship where, in a very short space of time, they prep you for the world of work as a data scientist. For me, that program was invaluable. I think there's no way I would have passed any of the data science interviews, which are really quite rigorous at tech companies, without that experience. So I owe a lot to Insight Data Science.
4:41

That is really interesting. I naively just assumed that someone with your background would think: great, I'm really good at maths, I understand all these complicated processes, I've worked with big chunks of data before, I can just change my title, and voila, now I'm a data scientist. That's interesting to hear you say that there was a prep course involved, and that you got a lot out of it as well, which is possibly even more interesting.
5:09

I would say yes. I mean, part of the narrative is that people do especially well if they have a background, let's say in astronomy, where they're good at dealing with large data sets. But it's really quite different. There's always the fear that someone with a PhD, a nerd with a PhD, is going to be good at burrowing into problems but isn't actually very focused on execution, or doesn't have a sense of urgency, or that their technical expertise isn't aligned with the business objectives of the company. Those are all things that you have to demonstrate to allay those fears that you're just some very technical nerd who has minimal impact for the business. And that's always something that we struggle with, I think, as data scientists. So yeah, it's very important to exercise those muscles, like the business acumen part of things. Also, just being able to talk to non-technical stakeholders about your work and why it has impact: communication is key.
6:03

Thank you very much for sharing that with us, I really appreciate it. The promise of this podcast is the focus on this movement engine and these 100 billion records that you process each day, so I think we should maybe shift the conversation towards that. Let's start with these records. What is all that data, and where is it coming from?
6:23

100 billion records. Think of a GPS record as a row in a data table; you might call it a ping, right? It's a latitude, a longitude and a timestamp, with a device ID associated with it. We at Foursquare have a differentiating component compared with other big data companies, in that we have our own owned and operated apps. One of the famous ones is Swarm, which is our life logging app, or Foursquare City Guide. Those apps generate data for us as well. The user of Swarm likes to be able to remember how many times this week they went to the gym, or what their sequence of movements was yesterday. We can also leverage that data to improve our own data collection, our own algorithms that we build on top of the data. That's one component of the data: we have those pings, those latitudes, longitudes and timestamps, from our own apps. We also collect the majority of the data from third party sources. Those sources could be apps themselves, or they might be other data companies. That contributes to the 100 billion records that we ingest every day.
7:46

That's a lot of data. I guess one of the big questions now is, what do you do with that? Is it all just ready to use, analysis ready data, or do you have to do something to it first?
7:58

No, definitely not. There's gold in them thar hills, but it's not all golden, I would say. The data is very raw; it is literally just those raw records. One of the things that my team is responsible for is refining that raw data. It's like an oil refinery might be responsible for turning oil into different petroleum based products, like gasoline or butane, through fractional distillation: we're trying to distill value out of the raw data itself. This raw data comes from multiple providers, multiple sources. Some of it is internal to Foursquare, some of it is these third party sources. And really, what we're doing with it is trying to imbue it with geospatial intelligence, right? Trying to extract value out of it. So you take the raw pings, and the first part of the process, what we're doing with it, is really building up more complex structures out of the flat data that we're collecting. You start with the completely unstructured raw data, and from that you build those pings up. First of all, you might try to classify whether the pings are associated with a mobile phone, or a device, that is in motion or at rest. So you do classification on those pings at the device level, at the ping level. Then you might start structuring those pings into what we call segments, this process of segmentation. Collections of pings might be seen as participating in a moving segment for that device, like if the person who owned the device is walking down the street, or if they're traveling in a vehicle along the road. If the person has stopped, there may be a collection of those pings that is associated with the stop, right? Maybe there's a clustering around a particular commercial venue. That's definitely of interest. So you go from pings, to this segmentation, to produce these segments, which may be stops or moving segments. And there is maybe a majority vote, right? If you can put a lasso around a set of these pings, maybe the majority of them are stopped pings, but there's a few moving pings in there. You do a harmonization to say, well, within this cluster, most of these pings are stopped pings, so we identify the whole cluster as a stop cluster.
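
To make that harmonization step concrete, here is a minimal Python sketch of a majority vote over per-ping classifications. The Ping layout and the example values are illustrative assumptions, not Foursquare's actual implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Ping:
    lat: float    # latitude in degrees
    lon: float    # longitude in degrees
    ts: int       # Unix timestamp in seconds
    state: str    # per-ping classification: "stopped" or "moving"

def harmonize_cluster(pings):
    """Label a whole cluster by majority vote over the per-ping motion
    classifications, so a few noisy "moving" pings do not prevent the
    cluster from being identified as a stop cluster."""
    votes = Counter(p.state for p in pings)
    return votes.most_common(1)[0][0]

# A cluster that is mostly stopped pings with one jittery outlier
# harmonizes to a stop cluster.
cluster = [
    Ping(37.7749, -122.4194, 0, "stopped"),
    Ping(37.7750, -122.4195, 30, "stopped"),
    Ping(37.7751, -122.4196, 60, "moving"),   # noisy outlier
    Ping(37.7749, -122.4194, 90, "stopped"),
]
print(harmonize_cluster(cluster))  # -> stopped
```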

10:19

Then, now that you have these segments, you can build up a timeline, a more holistic understanding of a user's journey. But we might just be looking at one particular provider, one source: the data might be coming from our own app, or it might be coming from one of the external providers. So we build up a timeline from those segments for the device, and we do that per source. We've gone through this process of building up structure, right? We've gone from pings to segments, and then from segments to timelines. And when you have a timeline per source, the next process is an additional one of harmonization, a kind of data fusion. We want to build a master timeline for that device, where we reconcile the different storylines that are being told for the user of that device on a particular day. One provider might be saying, well, the person was in motion, and then they stopped somewhere for 30 minutes before picking up again and going somewhere else. That may not be completely aligned between the different providers from which we get the data. So we can do, again, a sort of weighted majority vote: for each moment in time, we can decide how many of the providers are telling a movement story versus others that say, no, actually that device was at rest. And we can be even more sophisticated than that: the vote can be weighted by the value that we attach to each provider. For some providers, the data is more likely to be higher quality than others. Sometimes the data they provide can be synthetic. Sometimes it's very noisy. Sometimes it's been manipulated in some way. For instance, the data can be snapped to a grid: when you have a particular ping at a location, sometimes the latitude and longitude get rounded, basically causing the location of the ping to be snapped to somewhere on a grid. So there are all sorts of components to the quality of the input data, and that gives us the ability to define a kind of quality score for the providers, and that can then go into the weighting of how much we value their perspective on what the device was doing when we build these storylines for that user journey.
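
A minimal sketch of that weighted vote, fusing per-provider opinions for one moment in time. The provider names and weights are illustrative assumptions standing in for the quality scores described above.

```python
from collections import defaultdict

# Illustrative quality weights: higher means we trust that provider's
# storyline more. These values are assumptions, not real scores.
PROVIDER_WEIGHT = {"own_app": 1.0, "provider_a": 0.5, "provider_b": 0.3}

def fuse_moment(opinions):
    """Weighted majority vote for one moment in time. `opinions` maps
    provider name -> "moving" or "at_rest", i.e. what that provider's
    timeline says the device was doing at that moment."""
    tally = defaultdict(float)
    for provider, state in opinions.items():
        tally[state] += PROVIDER_WEIGHT.get(provider, 0.1)  # default: low trust
    return max(tally, key=tally.get)

# Two low-weight providers say "moving", but the higher-quality source
# says "at_rest" and wins the weighted vote (1.0 beats 0.5 + 0.3).
print(fuse_moment({"own_app": "at_rest",
                   "provider_a": "moving",
                   "provider_b": "moving"}))  # -> at_rest
```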

12:35

Wow, I've got a bunch of questions, and I hope that you'll bear with me for a minute here. The first one being: if I only see a device once across all the datasets, does it get a higher weighting, or do you treat that differently? I guess it's always nice to see a device multiple times, like, ah yeah, it definitely is a device, multiple providers see that device in their dataset.
13:00

Yes, that's right. That can contribute to our ability to determine the veracity of the device: is this a real device? That's one component of it. Another component of it is, of course, whether the ping itself is real; synthetic pings can end up in our dataset. We do try to aggressively filter on quality and veracity, and we try to filter out some of that synthetic data, for sure. But if a device is only seen once in a blue moon, it makes it much harder to reconstruct this holistic understanding of its user journey throughout the day. For some applications that doesn't matter as much, but in general, we want to start by having the fullest understanding of what a device was doing throughout the day. So if we only have very patchy appearances of the device in the data, it becomes very hard to impute what's happening in the gaps where we don't see the device. And we feel more confident about building high-quality data products when we can actually have the most holistic understanding of the device's movements. So yes, that data will not be excluded, but maybe it'll be considered to be low fidelity, or will only be used for certain products and not others.
14:17

That makes a lot of sense. Do you ever interpolate the gaps that you see in the data? Let's say you have this, I think you talked about a journey, so you could say that you're segmenting these things into at-rest and movement. For a single device, if the gap isn't too big, do you ever interpolate that gap, or interpolate the no data points?
14:37

Yeah, it's a really good question. On very small time scales, yes, we do. One of the ways we form, let's say, a stop segment is if we establish that a cluster of pings is contributing to what we call a stop. When we create the timeline for that stop, we create a dwell time for the stop, and it's really just the maximum timestamp that's in that cluster, subtracting off the minimum timestamp. So we're establishing that, even though we only have a few pings contributing to it, we kind of fill in that segment in the timeline and say: during this block of time, that person was stopped. Maybe they were at a venue.
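
The dwell time computation just described is simple arithmetic; a minimal sketch, with illustrative timestamps:

```python
def dwell_time_seconds(timestamps):
    """Dwell time for a stop: the maximum timestamp in the cluster minus
    the minimum. Even a handful of sparse pings fills in the whole block
    of time as 'stopped'."""
    return max(timestamps) - min(timestamps)

# Four sparse pings spanning 25 minutes yield a 25-minute dwell time.
print(dwell_time_seconds([0, 300, 900, 1500]))  # -> 1500
```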

15:17

The more difficult thing is between segments, when there are gaps between segments, because obviously, in an ideal world, you would want to have a stop segment followed by a movement segment followed by a stop segment. So when these things are being created, there is a kind of process of coalescing. If you have two moving segments that are close together, we will coalesce them into one larger moving segment, because it just makes sense: it should be this kind of flip-flopping between moving and stopped.
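
A minimal sketch of that coalescing step. Segments are modeled as (start, end) times in seconds, and the 120-second gap threshold is an illustrative assumption.

```python
def coalesce(segments, max_gap=120.0):
    """Merge moving segments whose gap is below `max_gap` seconds into
    one larger moving segment, restoring the expected alternation of
    moving and stopped segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))  # extend
        else:
            merged.append((start, end))
    return merged

# Two walks separated by a 60-second dropout become one moving segment.
print(coalesce([(0, 300), (360, 700)]))  # -> [(0, 700)]
```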

15:46

But there are times where, for an extended period of hours, for whatever reason, perhaps the person with the device was indoors, or their battery died, or they got on a plane, there are lots of reasons why they disappear from our radar. And currently, with the way we process the data, we don't try to get too inventive with how we interpret what happened in those large gaps. The one exception to that is in the evening and at night. If you live in, let's say, a concrete apartment building, the chances of the signal being able to reach a satellite, in order to produce and record these pings, are very attenuated. So if we see that you stopped, or that you entered a building that has been designated as your home in our modeling, and we don't see any pings for many hours during the nighttime, and you then reappear the next morning within a proximity of a few hundred meters or so of where you disappeared off our radar, and it was overnight, then we will interpolate between those two points where we did have data and say: you had an overnight stop at this place. And it's even more likely to be the case if it's a place that our modeling has designated as where you live; we'll call it an overnight stop, even though we didn't have any data in that gap. Other than that, there are companies out there that are in the business of generating synthetic data to mimic human patterns of movement, but we don't currently do that at Foursquare. So we try to minimally interpolate when there are gaps. We let those gaps exist for the most part.
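
A minimal sketch of that overnight rule. The 300 m radius, the evening/morning hour cutoffs, and the ping format are illustrative assumptions; only the overall shape of the rule comes from the description above.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6_371_000 * math.asin(math.sqrt(a))

def is_overnight_stop(last_evening_ping, first_morning_ping, max_dist_m=300.0):
    """Bridge an overnight gap with an inferred stop only if the device
    reappears in the morning close to where it disappeared at night.
    Each ping is (lat, lon, local hour of day)."""
    (lat1, lon1, hour1) = last_evening_ping
    (lat2, lon2, hour2) = first_morning_ping
    close = haversine_m(lat1, lon1, lat2, lon2) <= max_dist_m
    overnight = hour1 >= 20 and hour2 <= 9   # vanished at night, back by morning
    return close and overnight

# Disappears at 10 pm, reappears at 7 am about 30 m away -> overnight stop.
print(is_overnight_stop((37.7749, -122.4194, 22), (37.7751, -122.4196, 7)))  # True
```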

17:30

And all these data streams, are they being delivered to you in real time? What I'm wondering here is, let's say you get a delivery update, or your own systems, your own apps, pick up this device ID, so you can see it today. And then a week later you get some more data from a third party provider. Do you need to wait a certain amount of time to gather that data in and make sure that you can identify those devices before you start processing data? Does that make sense?
17:57

I mean, this is a really good question. You've kind of hit the nail on the head with one of the issues. In consuming third party data there are always issues: there are issues around quality, but there are also issues around latency. Obviously, the data that we get from our own apps we're able to process fairly quickly. There's very low latency there, and it's high quality because we own the data; it never leaves the boundaries of our company. The external data is interesting. If you consider a particular date of observation, a date during which stuff was happening in the real world, and we want to collect as much of that third party data as possible, it can take many days. Now, we get deliveries daily from third parties, but it can take many days to fill in all the blanks about that day of observations, the date on which things were generated. Even five, six days later, it's worth waiting those extra few days to fill in the blanks and get substantially more data about that particular day of observation. The flip side of that is that this data may be used for things that require a quick turnaround. For instance, some of our data products lead to attribution. If someone sees an ad for a quick serve restaurant on their mobile device, that ad impression may be registered with that device. Then, much like digital conversions, though a little bit more challenging, I would say: with digital conversions, someone might see an ad for socks, click on a website and go and buy some socks within 10 minutes of seeing the ad; it's more of a challenge to connect the real world conversion of someone walking into a quick serve restaurant because they saw an ad for a hamburger online and it made them hungry. But either way, there's still this issue of the conversion window, and we want to have feedback from the advertising campaign that produced the impression as soon as possible. Depending on the needs of the client, that may be ideally within a few days. So we wait more time to collect more data, so we can make more high-fidelity observations about what the person did in the real world, but the clients also want a fast turnaround. So typically there's some sort of sweet spot. It could be between two or three and seven days, depending on the client and their tolerance for a delay in waiting for that signal.
20:33

This is a perfect segue onto the next obvious question here, which is: what is this data used for? In a previous conversation you mentioned this idea of attribution and targeting, and I want to get into that in just a second. But first I want to understand movement and at rest, because I think this will help people understand where the conversation is going to go from here. Which one of those two things is more important for you as a company: to know that the device is moving, or that the device is at rest?
21:05

Yes. So there's a good story behind this. When I came to the company, one of the things I was tasked with was building a team to upgrade these movement pipelines: use more cutting-edge technology and make these pipelines more robust. We were looking at how things had been done previously, and there's a certain amount of ML and algorithmic work that went into it. We wanted to move quickly and build something that was simple to understand and also easy to maintain. So we started by building a baseline model for this movement segmentation piece that's not relying on off-the-shelf algorithms or any sophisticated ML that would then require upkeep and MLOps practices. I think, just in general, as data scientists we should always start by building a simple, heuristic, rule-based model. That can be our baseline, but it also demonstrates that the people who are tackling the problem understand the problem, because they built rules that work. And it's also super easy to debug, whereas ML can be a bit of a black box phenomenon. One of the epiphanies we had was that, as you look at a device trail as someone is walking down the street with their mobile phone in their pocket, it's easier to actually measure movement. You sort of see this even when you're using Google Maps: quite often it doesn't know which way you're facing when you start driving. It thinks you're going the wrong way down the street, and then it quickly updates and flips you around on the map. So my point is that movement is actually easier to detect than stops, and in some sense, stops are like the absence of movement. So indexing on movement was one of the key things that we were able to do to actually get a much more accurate understanding of this phenomenon. It sounds trivial: I know very clearly if I'm moving or at rest. But mobile phone signals can suffer from all sorts of jitter and issues: urban canyons, signals reflecting off buildings or walls (it's called multipath), indoor or underground use, satellites being blocked by trees, and so on. It's actually non-trivial to try and solve that. You could look at speed, for instance, but because of the jitter that's in the signal, the speed measurements are quite often not reliable when you're trying to do this segmentation. So the takeaway was that we wanted to focus on movement. Instead of worrying about stops, let's focus on movement, because when you're moving down the street, your trajectory takes a very definite shape. The idea was to focus on the shape of the trajectory rather than things like speed. An old lady shuffling down the street with her shopping bags is not moving very quickly, and her signal may have a lot of jitter in it, but if you look at her average trajectory, it's a very uncoiled shape. And so that's the metric that we were using, a kind of shape metric. I call it the spaghetti shape index. When you're stopped, maybe because of jitter it looks like you're moving fast, but your ping trail tends to be kind of coiled up, like spaghetti on a fork. So then stops became like the absence of movement, once we had that epiphany.
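
The exact definition of the spaghetti shape index isn't given in the episode, but one classic trajectory-shape metric in the same spirit is straightness: net displacement divided by total path length. This minimal sketch, on local x/y coordinates in meters, is an illustrative analogue, not Foursquare's metric.

```python
import math

def straightness(points):
    """Net displacement divided by total path length. Close to 1.0 for an
    uncoiled, street-shaped trail; close to 0.0 for a jittery trail coiled
    up 'like spaghetti on a fork' around a stop."""
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path == 0:
        return 0.0
    return math.dist(points[0], points[-1]) / path

walking = [(0, 0), (10, 1), (20, -1), (30, 0)]        # heading down a street
stopped = [(0, 0), (3, 2), (-1, 3), (2, -2), (0, 1)]  # jitter around one spot
print(round(straightness(walking), 2))  # -> 0.99 (movement)
print(round(straightness(stopped), 2))  # -> 0.06 (absence of movement)
```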

24:30

And what's interesting is: yes, we focused on movement, but actually, in terms of our business, stops provide more value. It's kind of like yin and yang. Stops provide value because if you can determine that a device is at rest, and it's in the vicinity of some commercial venue, then you can say, maybe that person who was stopped there went into that coffee shop nearby. So you've elevated the stop from being a stop to becoming a visit by doing this venue attachment. Once you have a visit, then there's all sorts of commercial applications that open up, and you mentioned already, there's attribution and targeting. If someone goes into coffee shops regularly, we can assign them to an audience, a bucket of devices that can be sold as an audience of coffee lovers. Attribution and targeting are like opposite sides of the coin with digital advertising. First of all, you need to understand what type of person might be interested in a digital advertisement: if you show someone a coffee ad and they're susceptible to drinking coffee, then that's a good approach. The other side of it is attribution: when you show them the ad, do we know if that person responded and went to a coffee shop? We have a team at the company that looks at making that connection, and doing it in a sophisticated enough way to understand: would that person have gone to the coffee shop anyway, even if we had not shown them the ad? So there's ML in this, in terms of causal inference models, to compare the actual behavior with the counterfactual, the baseline, which is that people tend to go to coffee shops anyway. Is there a lift in their visitation if they see an ad, right? So that's a very valuable revenue generating activity for the company: being able to connect stops to venues to be able to assign visits, and then from the visit you can match that back to an ad impression. That's what attribution is. Our partners, the clients that are interested in our attribution product, can understand whether their advertising campaigns are successful or not.
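
The headline arithmetic behind "lift" is simple to sketch. Real attribution work uses the causal inference models mentioned above to build a fair counterfactual; this minimal example, with made-up numbers, only shows the final comparison.

```python
def lift(exposed_visits, exposed_total, control_visits, control_total):
    """Relative lift in visitation: the visit rate of devices that saw the
    ad versus the baseline rate of a comparable unexposed control group."""
    exposed_rate = exposed_visits / exposed_total
    baseline_rate = control_visits / control_total
    return exposed_rate / baseline_rate - 1.0

# 5% of exposed devices visited versus a 4% baseline: a +25% lift.
print(f"{lift(500, 10_000, 400, 10_000):+.0%}")  # -> +25%
```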

26:46

When you talk about it like that, it sounds like you're looking at stops and movement as discrete objects: okay, this one stop was important to us. But again, harking back to our previous conversation, you had this great phrase, let me see if I can pronounce it correctly: a semantically meaningful journey. And when you said that, it made me think, maybe this is more than just one discrete stop. Maybe this is trying to build up a picture of the journey itself. What was this device doing during the day? What does the daily, weekly, monthly pattern of this device look like? Am I on the right track, or am I completely out of the way?
27:24

No, no, I think you're right. Yes, there is definitely a contrast with the kind of one-and-done scenarios that I'm describing. It's much more generic to say someone tends to visit coffee shops than to say, oh, they saw this ad and then they went to the coffee shop, or the quick serve restaurant. I think it is foundational: as a data scientist, I want to be able to recreate a picture of reality. I want to be faithful to what's happening in the real world for the users of these devices, and that's kind of foundational to what we try to do. And I want to make a distinction between that and what is actually presented as a product, either internally or to external clients; there are lots of privacy concerns that we keep front and center about the products that we deliver. But as a data scientist, it's my goal to have a full understanding, because I don't want to make mistakes about how I infer what was happening with that person. If we have that full picture, we can serve those obvious use cases of targeting and attribution, but there are other, more sophisticated scenarios that you're alluding to: wouldn't it be great if we could understand the full longitudinal movements of a device throughout the day? It also helps us understand the quality of our data: if, for a particular provider, we can't do that reconstruction in a very convincing way, it might suggest that that provider is not giving us very high quality data. In terms of how we arrive at that full longitudinal understanding, that holistic understanding of the user journey throughout the day, one of the applications for this is that we have a client who is interested in building synthetic models, synthetic twins of real users in the real world, these digital synthetic twins. They populate cities with these synthetic models, models that are trained on the real data that we supply them. So this is a scenario where we do have to have high quality longitudinal stories, these holistic stories about the user journey, because then the models they build will be of much higher quality. And these are the types of synthetic models that are really, really useful to city transit authorities and urban planners as they model the flow of human beings through the urban landscape. It can really help with things like urban planning. The good thing about that is it's very privacy safe, because none of the real user data gets exposed to the outside world; it's merely used to train these synthetic models.
30:16

That sounds fascinating. You were talking about getting data in different chunks earlier on in the conversation, and you were relating this to attribution: how is a client going to know if the device, say, walked past the billboard and then went and had a cup of coffee? But that made me think of this idea: wow, you could monitor a disaster, for example. Or you could look at a disaster in retrospect and see how people responded to it, leading up to it, maybe even during it, and after it. Do you not have anyone doing work like that?
30:51

I mean, I can say the answer is yes. We have a government client that's very interested in modeling what happens in, not even modeling, but just actually observing what happens in the aftermath of, let's say, a hurricane. You can imagine that satellite data can be rather patchy; you may not have satellite imagery of what's happening. So in terms of disaster relief, and planning for future disasters and the response to those, this sort of data is, and will be, immensely valuable. We do have interest from a client in that, and I probably can't say who it is, but you've hit the nail on the head there too with that.
31:35

So interesting that you talked about satellite data. For the last little while, people were talking about how we can use satellite data to look at the car park at Walmart, figure out how many cars are in there, and, long story short, figure out what the share price is going to be, essentially, whether it's going to go up or down, because lots of people are visiting Walmart. But my guess is you have pretty great data on that. Do you work with satellite companies to help them augment the analysis that they're producing, or could you?
32:04

There are companies that, like you mentioned, will take that satellite data and try to infer, as you say, that if for a particular big box store there are 20% fewer cars in the parking lots this quarter compared to the last quarter, maybe earnings will be down, right? That has all sorts of potential issues, in that the data is really quite sparse. There's, I think, Planet Labs, which has these amazing Doves that encircle the globe, with a line scan image of the earth that gets updated daily. But apart from them, you're relying on a very low coverage of the parking lots of big box stores from other satellites. You're also at the mercy of the weather, and of the fact that parking lots can be underground, right? So definitely, the people that are interested in our data products might be using those as well for the same purposes. But I would say that we're immune to some of those concerns, like weather, for instance: we can actually determine what foot traffic was like to a particular high street store or a big box store in a way that is much less sparse. So that's definitely one of the applications of us being able to generate visits from stops; it's like a direct application of visits. These are the kind of business insights that you would derive from the visits.
33:34

It's interesting. This really makes me think that that whole argument, and I realize it was just an example, a tangible example that they could tell people about, oh, we could do this. My guess is this was an example that was supposed to help people, to open people's eyes to the possibilities. But it really does make me think that there's probably a better way of doing this, and maybe your data is a better way of doing this.
33:59

Coming back to this conversation about how we use the data to provide insights about foot traffic to various chains and business categories: there is that data, of course, coming in from third parties, but one of the things we really focus on, coming back to this idea that we have our own owned and operated apps, comes down to quality, right? This is one of the ways in which we filter aggressively for quality, and I just wanted to bring it up. Looking at the intersection of those devices that are in both the data from our own apps and the third party apps, we build a machine learning model to determine which providers are more trustworthy than others. In other words, I think of it this way: if our first party apps are saying that, at a particular point in time, a device that is in both third party and first party data was, let's say, in San Francisco where I am, and a third party app says, oh, actually that device is 20 miles away, or it's in San Jose, right? That's an example of a training label we can then apply to the third party data. We can look at the composition of the stops and visits that we generate; they're composite, you know, some of the pings come from one provider, some of them come from another source. And so we can see, based on the composition, what's the likelihood that that stop is real, or that visit is real. And then we can build a model on top of that. So we can train on the devices that are in the intersection of our data and the third party data, and then we can apply the model to predict on top of the third party data. That way, we can do some very aggressive filtering for this idea of veracity. Because, as you say, there's a hundred billion pings coming in, we need to be very careful about how much of that we just directly ingest in a naive fashion. So at the very top of the funnel, we can actually take that data, apply these models, and start to really restrict it, turn down the flow based on these veracity predictions. And that way we get to something we can say more confidently about how many people visited the mall that day, or the big box store.
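
A minimal sketch of generating one such training label from the first-party/third-party intersection. The 1 km tolerance and the tuple format are illustrative assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def label_third_party_ping(first_party, third_party, max_km=1.0):
    """Label a third-party ping for a device that also appears in trusted
    first-party data at (roughly) the same time: positions that disagree
    by more than `max_km` produce a negative training label."""
    lat1, lon1 = first_party
    lat2, lon2 = third_party
    return "consistent" if haversine_km(lat1, lon1, lat2, lon2) <= max_km else "inconsistent"

# First-party app puts the device in San Francisco; a third-party feed
# claims San Jose at the same moment -> a negative label for that feed.
sf, san_jose = (37.7749, -122.4194), (37.3382, -121.8863)
print(label_third_party_ping(sf, san_jose))  # -> inconsistent
```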

36:14

Would it be fair to say that you can use your own first-party data as a form of ground truth?

36:19

Yeah, of course, that's right. I see it as sort of a quality assurance chain that starts with this app that has such a great, loyal user base, this Swarm app, where these people are creating their own visits, basically. They're doing it for themselves, but in a way they're doing it for us too, right? We know then that, when the phone says this person stopped somewhere, the algorithm inside the phone that is part of our app is doing a good job, because the person verifying, the human in the loop, is creating that training label and saying: yes, I was at this venue. So we can calibrate our own models on our first-party apps, and then the chain goes on to the third-party data: we use the first-party data to validate the third-party data and do this veracity modeling. That's the way I look at it, as a chain of quality assurance.
37:09

I'm really, really pleased you shared that with me. It makes a lot of sense, and it's interesting. So, we've been talking about these different use cases and applications for the data. You've done a great job of telling us about the data: where it comes from, how you process it, the way you check it, these checks and balances that are in place, the idea of segmenting it into stops and movement and why that's important. We talked a little bit about attribution and targeting, the flow through a city, this idea of a semantically meaningful journey. I want you to describe one last example for us, if you would please, and this is the idea of crowdsourced routing.
37:45

Right, yes. So this is a work in progress that part of our research team is working on, and it's a good example. Earlier we were talking about how most of the value we bring through understanding the raw GPS data is in determining stops, and then visits, and then obviously that leads to attribution. But now that we are doing a better job at segmenting the movement, in terms of understanding movement itself, not just the stops, you can imagine a very straightforward application. When you look at a map, you might look at hotspots on the map: you can aggregate where people stop on some grid. At Foursquare we use the H3 grid system, the hexagonal grid system. So you can simply do a binning: how many stops have occurred in this particular location? And that'll tell you maybe some information about where people enter a building, right? Because people stop near the entrance, or the density of stops is higher near the entrance of a building. So it gives you some meaningful understanding of places beyond, say, polygons. So that's about stops. What about movements? Instead of thinking about hotspots, you could think about hot trails, right? We can also aggregate people's movement segments, again on an H3 grid; it's a way to kind of coarse-grain those trails, those moving segments. It also guarantees a certain amount of privacy, because we're talking about public roads, and we want to sort of snap people's moving segments to those roads. And the ones that are more heavily trafficked are the ones that you might then designate as these hot trails.
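
A minimal sketch of that binning idea, assuming the h3-py v4 package (pip install h3). Resolution 9, with cells of roughly 0.1 square kilometers, and the stop coordinates are illustrative choices.

```python
from collections import Counter
import h3

# Illustrative stop locations (lat, lon); in practice these come from
# the stop segments described earlier.
stops = [
    (40.7412, -73.9896),
    (40.7413, -73.9894),
    (40.7412, -73.9895),
    (40.7306, -73.9866),
]

# Count stops per hexagonal H3 cell; heavily used cells are hotspots.
cells = Counter(h3.latlng_to_cell(lat, lon, 9) for lat, lon in stops)
for cell, count in cells.most_common():
    print(cell, count)

# Binning moving segments instead of stops, in the same way, yields
# the "hot trails" described above.
```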

39:36

In some sense, that's crowdsourced routing, right? You learn in computer science about pathfinding algorithms, like Dijkstra's algorithm, which tries to find the shortest, lowest cost path between two spots. But there may be other reasons why, on a map, people use a different sequence of waypoints, a different route through the map, than maybe the shortest one. Much as, I don't know if it was like this where you grew up as a kid, but I grew up in Ireland, in the country, and there are always these kind of well-worn paths, and you wonder, were they made by people or by animals, through the forests and the hills? This is an example like that, right, where you're finding that kind of hot trail on the map, and that could definitely have an application, for sure. So we're just exploring that and seeing if we can actually produce that as a product, and then working with sales folks to see if there is a market for it.
40:32

That would be really interesting to see how that plays out. You know what it reminds me of? You have the restaurant and it says: Popular. Most people eat this thing here. And it gives you a sense of certainty. And I just imagine looking at my navigation app and seeing: fastest, shortest, most eco-friendly, and popular. And I wonder which one people would choose, because with popular there's a certain amount of certainty that comes with that. Most people choose this one.
41:01

That's right. And quite often, routing apps will send you on a route that maybe doesn't penalize left turns; maybe it's the shortest end-to-end, but maybe it's a more dangerous, less effortless way to go. So I think sometimes, as you're implying, maybe the kind of lowest common denominator route is the one that is the most effortless. Maybe that's kind of what we should be optimizing for.
41:32

Yeah. Or maybe it's the most peaceful. Maybe it's the most beautiful. Maybe it's whatever else.

41:38

That's right. Most scenic.
41:40

And my guess is there'd be an interesting overlap between what the computer thinks is the best and what the humans think is the best. In the city where I live, for example, there's lots of cycleways, cycleways everywhere, and they've made a huge effort to say: please, go on the cycleway. But people always cut corners if they can, because, well, that's great, the machine said I should go straight here, but you know what? The human in me just wants to turn around the corner there. And you can see these well-worn bike paths just on the side, these little sneaky routes that people take, because that is clearly a great place for humans to go. Humans would like to move in that direction, or in that way. I think that's really interesting.
42:17

Yeah, it'd be great if we could have a data-informed approach to that too, right? I think that would be amazing.
42:25

Yeah, yeah. And this ties back into what you were saying earlier about city planning. The more we know about the movement, about how the people living in the city are actually moving through the city, the better we can model that and create a city around how they would like to move through it, not how we're going to force them to move. It's probably a bit of give and take there, but I think that would be interesting.
42:48

Yes, for sure. And also, I think there is a component of this that goes back to semantic segmentation. Much as when we do, let's say, video calls, and the algorithm on your video call knows the difference between you in the foreground and the background, so it can blur out the background. It has that distinction of foreground from background, or earth versus sky, that type of segmentation that we can do in computer vision. I think there's another research direction in this, which is sort of semantic segmentation on maps. We have different ways of mapping. Some of it is crowdsourced: there are people out there annotating maps for OSM. You can also draw maps from satellite imagery using the same sort of semantic segmentation; I think Microsoft has done that. But we could also be using the mobile phone signals and this understanding of stops and motion, vehicular motion versus pedestrian motion, to be able to draw maps: maps without maps, using the ping trails of humans, just aggregating over time to remove the noise. Even the speed at which the people are moving would provide a segmentation of roads into fast moving roads versus slow moving roads, and also uncover anomalies between the usage of roads versus how the roads have been drawn by these other sources. So there's a possibility of enrichment there too, I think.
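
A minimal sketch of that speed-based road segmentation, with illustrative speeds and an assumed 8 m/s cutoff; the real signal would be aggregated from many ping trails per road cell.

```python
import statistics

# Observed speeds (m/s) per road cell, aggregated over many ping trails.
cell_speeds = {
    "cell_a": [13.2, 14.0, 12.7, 13.8],  # vehicular-speed traffic
    "cell_b": [1.2, 1.5, 1.3, 1.4],      # pedestrian-speed traffic
}

def classify(speeds, fast_cutoff=8.0):
    """Split road cells into fast-moving vs slow-moving by median speed;
    8 m/s is roughly 29 km/h."""
    return "fast road" if statistics.median(speeds) >= fast_cutoff else "slow road"

for cell, speeds in cell_speeds.items():
    print(cell, classify(speeds))
# -> cell_a fast road
# -> cell_b slow road
```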

44:16

This is kind of fascinating. So we've been talking for a while now, coming up with all these ideas of stuff you can do. And right at the start of the conversation, you said our data looks like this: imagine a spreadsheet with a latitude, a longitude, a time, and an ID. It's kind of amazing that from that we can see so much potential.
44:39

Yeah. And so that's the power of data science and data engineering. And I think geospatial data, you asked me why I chose this career, but I think it's some of the most challenging out there in the domain of data science, because so little of what we do makes sense unless you can really just look at it on a map. Foursquare has Foursquare Studio, which is our visualization studio, and anytime we have any incoming data scientists, I always insist that they draw their maps. They don't just look at the data in tables and in statistics and metrics; they actually plot their maps in Studio, because you just don't have the correct contextual awareness until you plot things on a map. That's part of the nuance of geospatial, and I think that's always been part of what fascinates me about it.
45:27

Yeah. So with that, I had a question I was going to ask you about what you're going to do next, or whether you've run out of things to do, but I think you've alluded to it that you haven't run out of things to do, that there's a lot to do there, and it's very challenging. So I think I'll take this in a different direction for the last question. The last question is: do you think that spatial is special? Or is it just more data?
45:49

Well, I think it's related to what I've just been saying. Trying to extract value from spatial data is definitely very challenging. I think, in terms of the industry, it's still somewhat untapped. Things like targeting and attribution are very much low-hanging fruit, and I think we owe it to ourselves to unlock a lot of the other potential out there. Luckily, that other potential comes down to doing these more sophisticated things, like understanding these user journeys throughout the urban landscape, and building maps without maps. And figuring out, because we do work for businesses, this is not just a research project, how we can generate revenue from that as well. Is this something people are interested in, or is it just a paper that I'm going to present at a conference?
46:40

Yeah, that makes a lot of sense. And I think you mentioned that right at the start, when you were talking about being a good data scientist: can you see the bigger picture? Can you see how this is going to create value for our customers? So that ties in nicely with what you just said there, at least in my mind. Gabriel, this has been awesome. I've really enjoyed talking with you. We covered a lot of ground, and I really, really enjoyed the conversation.

47:04

Yeah, me too.
47:05

People know that you work for Foursquare. Is there anywhere in particular you'd like to point them towards if they want to go and learn more? I think you mentioned Foursquare Studio; that would be a great place for people to check out. Anywhere else we can share a link to or point people towards?

47:21

Yeah, we can link the website. The website is kind of a good springboard into all of the different activities.