Episode Transcript
0:01
Welcome to another episode of the Mapscaping
0:03
Podcast. My name is Daniel and this
0:06
is a podcast for the geospatial community. Today
0:08
on the podcast we're starting with a very large
0:11
number. We're starting with 100 billion. Let's
0:14
say I gave you a spreadsheet with 100 billion
0:17
rows in it. Each row consisted of five
0:19
columns, latitude, longitude, device
0:22
ID, a timestamp, and a column telling
0:24
you the name of the data provider. What
0:26
would you do with that? How would you clean it? How
0:28
would you make sense of it, extract value from it?
0:31
What do you think people would use it for? How
0:33
would you do all of this stuff in a way that could
0:35
be systematized, in a way that you could repeat
0:38
again tomorrow? Foursquare does
0:40
this every day with the help of something they
0:42
call a movement engine. To help us
0:44
understand more about how they do this, I've invited
0:46
Gabriel Durkin, the director of
0:48
data science on the podcast. This is
0:50
the last in a series of episodes that I have been
0:52
working on together with Foursquare. I have to say they
0:55
have been absolutely brilliant to work with. If you're interested
0:57
in hearing some of the previous episodes, I'll put links
0:59
to them in the show notes of today's episode. But for
1:01
right
1:01
now, we're back to the 100 billion points.
1:06
Hey Gabriel, welcome to the podcast. You
1:09
are the director of data science at Foursquare.
1:11
You have something called a movement engine over there and
1:13
you process 100 billion records, GPS
1:16
records I should say, each and every day. At
1:18
least that's what I got out of our pre-interview
1:21
conversation. I'm hoping you can put a few more
1:23
words around that in just a minute. My guess
1:25
is you haven't always been the director of data science
1:27
at Foursquare. How did you get there? Where
1:29
did you come from? How did you get involved in processing
1:32
movement data? Well, it's nice to be here, first
1:34
of all, Daniel. When it comes
1:36
to how I got here, it's
1:38
been a circuitous journey. The
1:41
first 20 years of my adult working
1:44
life, I was a quantum physicist. I
1:46
did my PhD at Oxford in quantum physics
1:48
and then moved to
1:50
the States to work at the Jet
1:52
Propulsion Lab and then at the NASA
1:55
Ames Research Center. We had a
1:57
quantum computing team there and I was
1:59
part of it. So data science
2:01
is a, was kind of a career change for
2:04
me, you know, probably seven years ago now.
2:06
And, you know, I worked at Uber and some other
2:08
startups doing, you know, first as
2:10
an independent contributor and then eventually moving
2:13
into management. And that's some
2:15
of the story about how I got here today, working
2:17
at Foursquare on geospatial
2:19
data, leading the movement
2:22
engine,
2:22
which is a name for the team,
2:25
the people working on movement
2:27
data, but also a name for the platform that
2:29
we built. That is a cool name for a team,
2:31
the movement engine. Hey, if we
2:33
just stay with you all past, just for a second
2:36
here, what was it like going from quantum
2:39
physics, I think you said, over to, to geospatial
2:41
data? Was it a big jump? Like, was
2:43
there anything that was difficult to learn? Was there a
2:46
huge vocabulary shift or is it all just,
2:48
you know, more data? Yeah, I mean, it's a,
2:51
it was a choice I made just because
2:52
I, I wanted to work, you know, more broadly
2:55
in industry. You know, I enjoy
2:57
research. I still consider
2:59
myself a quantum physicist, but
3:01
I wanted to, you know, work in
3:03
a faster paced environment. And,
3:06
uh, you know, I'd been at NASA a long time, so I thought a
3:08
change of pace might be interesting.
3:10
And I knew that data science
3:12
was a career that had a lot of transferable skills
3:14
for people with, you know, PhDs
3:17
in the so-called hard sciences, you know,
3:19
numeracy, analytical skills,
3:22
also kind of, you know, I think the best
3:25
data scientists are the ones that have that kind of
3:27
scientific curiosity. I'm willing
3:29
to kind of turn over every rock. That's something
3:31
that it's hard to just, you know, learn in college.
3:34
I think it's kind of either you have that instinct or you
3:36
don't. So yeah, so I, you know, I cut
3:38
my teeth on geospatial data at, at
3:41
Uber and learned a lot there. It's
3:43
a completely different type of work. I mean, you, there's
3:46
certainly the science aspect of it, but it's also working
3:49
collaboratively with people with different backgrounds,
3:51
you know, designers and product managers.
3:54
So it's, it's actually quite an enriching experience
3:57
and I've, I've definitely enjoyed it. And
3:59
it was, for me, the right career move. And
4:01
it's only something that's become possible.
4:03
I mean, data science as a career has
4:06
only really existed for just
4:08
over 10 years, I guess. And so
4:10
the career path to
4:12
data science these days is
4:14
quite varied, but there was a program
4:17
and it's called Insight Data
4:20
Science. It's kind of a fellowship where they, in
4:22
a very short space of time, kind of prep you for the world
4:24
of work as a data scientist. And for me,
4:27
that program was invaluable. I think there's
4:29
no way I
4:30
would have passed any of the data
4:32
science interviews, which are really quite
4:34
rigorous for tech companies without
4:36
that experience. So I owe a lot of it
4:38
to Insight Data Science. That is
4:41
really interesting. I naively
4:43
just assume that you're someone with your background,
4:45
oh great, I'm really good at maths. I
4:48
understand all these complicated processes.
4:51
I've worked with big chunks of data before.
4:53
I can just change my name
4:56
or change the title, sorry. And voila,
4:58
now I'm a data scientist. That's interesting
5:00
to hear you say that there was a prep course
5:02
involved and that you got
5:04
a lot out of it as well, which is possibly even more
5:06
interesting. I would say, yes,
5:09
I mean, part of the narrative is like, especially
5:11
people do well if they have a background, like let's say in
5:13
astronomy, where they're good at dealing with large
5:15
data sets. But
5:18
it's really quite different. There's always
5:20
the fear that someone with
5:22
a nerd with a PhD is going
5:24
to be good at burrowing into problems, but isn't actually
5:27
very focused on execution or whether
5:30
you have a sense of urgency or whether
5:33
your technical expertise is aligned
5:35
with the business objectives of the company. So
5:38
those are all things that you have to demonstrate to
5:41
allay those fears that you're just some very technical
5:44
nerd who has minimal impact,
5:46
for the business. And that's always something that we
5:49
struggle with, I think, as data scientists. So yeah,
5:51
it's very important to kind of exercise those muscles,
5:54
like the business acumen part of things. Also,
5:57
just being able to talk to non-technical stakeholders
5:59
about your
5:59
work and why it has impact, communication
6:02
is key.
6:03
Thank you very much for sharing that with us. I really
6:05
appreciate it. The promise of this
6:07
podcast is the focus on
6:09
this movement engine and these 100 billion
6:12
records that you'll process each day. I
6:15
think we should maybe shift the conversation towards
6:17
that. Let's
6:19
start with these records. What is all that data
6:22
and where is it coming from?
6:23
100 billion records. Think
6:25
of a GPS record as a row
6:31
in a data table that is, you
6:33
might call it a ping, right? It's a
6:36
latitude, a longitude and a timestamp and a device
6:38
ID associated with it. We
6:41
at Foursquare have a differentiating
6:45
component compared with other big data companies
6:47
in that we have our
6:49
own owned and operated apps. Those
6:51
owned and operated apps, one of the famous ones
6:54
is Swarm, which is our life logging app
6:56
or Foursquare City Guides. Those
6:59
apps generate data for us as well. The
7:03
user of Swarm likes to be able to remember
7:05
how many times this week they went
7:07
to the gym or
7:09
what their sequence of
7:11
movements was yesterday. We also
7:13
can leverage that data to improve our
7:15
own data collection,
7:17
our own algorithms that we
7:19
build on top of the data. That's one component
7:21
of the data. We have those pings,
7:24
those latitude, longitudes and timestamps
7:26
from our own apps. We
7:29
also collect the majority of
7:31
the data from third party sources.
7:34
Those sources could be apps themselves
7:37
or they might be from other data companies.
7:41
That contributes to the 100 billion records
7:43
that we ingest every day. That's
7:46
a lot of data. I guess one
7:48
of the big questions now is,
7:49
what do you do with that? Is it all just
7:51
ready to use, analysis ready
7:54
data or do you have to do something to
7:56
it first? No, definitely not.
7:58
There's gold in them thar hills,
7:59
but it's not all
8:02
golden, I would say. And the
8:04
data is very raw. Like it is just literally
8:06
those raw records. And
8:09
one of the things that my team is responsible for is
8:12
refining that raw data. It's
8:14
like an oil refinery might, you know,
8:17
might be responsible for turning oil into different
8:19
things like petroleum based products, like
8:21
car gasoline or butane or whatever
8:23
through fractional distillation. We're
8:26
trying to distill value out of the raw data
8:28
itself. And this
8:31
raw data, it comes from multiple
8:33
providers, multiple sources, some of it's internal
8:36
to Foursquare, some of it is these third party
8:38
sources. And really what
8:40
we're doing with it is trying to imbue it with
8:42
geospatial intelligence, right? Trying to extract
8:44
value out of it. So, you know, you take the raw
8:47
pings and you know, the first
8:49
kind of part of the process, well, what
8:51
we're doing with it is really like building up more complex
8:54
structures out of the, you know, just
8:56
flat data that we're collecting. So, you
8:59
know, you start with the completely unstructured raw data
9:01
and from that, you build
9:04
those pings up into, well,
9:06
first of all, you might try to classify if the pings
9:08
are associated with a mobile phone or
9:10
a device that is in motion or at rest. So
9:12
you do
9:13
classification on those pings at the
9:15
device level, at the ping level. Then
9:18
you might start structuring those pings into
9:21
what we call segments. So it's this process
9:23
of segmentation. So collections
9:25
of pings might be seen as participating
9:28
in a moving segment for that
9:31
device. Like if the person who owned the device is walking
9:33
down the street, or if they're traveling
9:35
in a vehicle along the road, if
9:37
the person has stopped, there may be a collection
9:39
of those pings that is associated
9:41
with the stop, right? Maybe there's a clustering
9:44
around a particular, you know, commercial
9:46
venue. That's definitely of interest,
9:48
right? So, you know,
9:50
you go from pings to this segmentation
9:53
to produce these segments, which may be stops or
9:55
moving segments. And there is
9:58
maybe like a majority
9:59
vote, right? Like if you can put
10:02
a lasso around a set of these pings, you
10:04
know, maybe the majority of them are
10:06
stopped pings, but there's
10:08
a few moving pings in there. You do a harmonization
10:11
to say, well, you know, within this
10:13
cluster, most of these pings are stopped pings.
10:15
So we identify the whole cluster as a
10:17
stop cluster.
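To make that concrete, here is a minimal Python sketch of the idea just described: classify each ping as moving or stopped, then let the majority state inside a cluster decide the label for the whole cluster. Field names, the speed threshold and the crude planar-distance shortcut are illustrative assumptions, not Foursquare's actual pipeline.

```python
# A hypothetical sketch only: per-ping moving/stopped classification and
# majority-vote harmonisation of a cluster. Field names, the speed threshold
# and the planar-distance shortcut are illustrative assumptions, not
# Foursquare's actual pipeline.
from dataclasses import dataclass
from collections import Counter

@dataclass
class Ping:
    lat: float
    lon: float
    timestamp: float        # seconds since epoch
    device_id: str
    provider: str
    state: str = "unknown"  # "moving" or "stopped", filled in by classification

def classify_ping(prev: Ping, curr: Ping, speed_threshold_mps: float = 1.0) -> str:
    """Crude per-ping classifier: compare the implied speed against a threshold."""
    dt = max(curr.timestamp - prev.timestamp, 1e-6)
    dy = (curr.lat - prev.lat) * 111_000   # rough metres per degree of latitude
    dx = (curr.lon - prev.lon) * 111_000   # rough shortcut, ignores latitude scaling
    speed_mps = (dx * dx + dy * dy) ** 0.5 / dt
    return "moving" if speed_mps > speed_threshold_mps else "stopped"

def label_cluster(pings: list[Ping]) -> str:
    """Harmonise a cluster: the majority per-ping state wins for the whole cluster."""
    votes = Counter(p.state for p in pings if p.state != "unknown")
    return votes.most_common(1)[0][0] if votes else "unknown"
```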
10:19
Then you might build up, you know,
10:21
now you have these segments, you can build up this
10:23
timeline, uh, sort of more of a holistic
10:26
understanding of user's journey, but
10:28
we might just be looking at one particular provider,
10:30
right? We might be looking at one source of the data
10:32
might be coming from our own app, or it might be coming
10:34
from one of the external providers. So
10:37
we build up a timeline from those segments
10:39
for the device. And we
10:41
do that per source. You know, we've gone through
10:44
this process of building up structure, right? We've gone
10:46
from pings to segments and then from segments
10:48
to timelines.
10:49
And when you have a timeline per
10:51
source, then the next process is, you
10:53
know, an additional one of harmonization
10:56
kind of data fusion, right? We want
10:58
to build a master timeline for that device,
11:00
but we reconcile the different storylines
11:03
that are being told for the user
11:05
of that device, uh, you know, for a particular
11:08
day. Um, so, you know, one provider
11:10
might be saying, well, the person was in motion
11:13
and then they, you know, they stopped somewhere
11:15
for 30 minutes before picking
11:18
up again and going somewhere else. You know, that
11:20
may not be completely aligned between the different
11:22
providers from which we get the data. So
11:25
we can do again, like a
11:27
sort of a weighted majority vote, you know, for
11:29
each, each moment in time, we can decide how many
11:31
of the providers are telling a movement story versus
11:34
others that say, no, actually that device was at
11:36
rest. And we can even be more sophisticated
11:39
than that. And it can be weighted by the
11:41
value that we attach to each provider.
11:43
Like some providers, the data
11:46
is more likely to be higher quality,
11:48
let's say, than, than others. Uh, sometimes
11:51
the data can be synthetic that
11:53
they provide. Sometimes it's, it's
11:55
very noisy. Sometimes it's been
11:57
manipulated in some way, you know, like for instance.
12:00
the data can be snapped to a grid.
12:02
When you have a particular ping at a location,
12:05
sometimes the latitude and longitude
12:07
get rounded up and basically
12:09
causing the location of the ping to
12:11
be snapped to somewhere on the grid.
12:14
So there's all sorts of components
12:17
to the quality of the input data, and
12:19
then that gives us the ability to define
12:22
kind of a quality score for
12:24
the providers, and that can then go into
12:26
the weighting of how much we value
12:28
their perspective on what the device
12:31
was doing when we build these storylines
12:33
for that user journey.
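A minimal sketch of that provider-weighted majority vote, assuming per-provider timelines bucketed into minute slots and a simple quality weight per provider; the names and numbers are illustrative, not Foursquare's implementation.

```python
# A minimal sketch of a provider-weighted majority vote for fusing per-source
# timelines into one master timeline. Minute-level slots and the weights are
# illustrative assumptions, not Foursquare's implementation.
def fuse_timelines(
    provider_states: dict[str, dict[int, str]],  # provider -> {minute_slot: "moving"/"stopped"}
    provider_weights: dict[str, float],          # e.g. a learned quality score per provider
) -> dict[int, str]:
    """For each time slot, pick the state backed by the most total provider weight."""
    fused: dict[int, str] = {}
    all_slots = {slot for states in provider_states.values() for slot in states}
    for slot in sorted(all_slots):
        scores = {"moving": 0.0, "stopped": 0.0}
        for provider, states in provider_states.items():
            state = states.get(slot)
            if state in scores:
                scores[state] += provider_weights.get(provider, 1.0)
        fused[slot] = max(scores, key=scores.get)
    return fused

# Example: two providers disagree at minute 12; the higher-weight source wins.
master = fuse_timelines(
    {"own_app": {12: "stopped"}, "vendor_x": {12: "moving"}},
    {"own_app": 0.9, "vendor_x": 0.4},
)  # -> {12: "stopped"}
```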
12:35
Wow, I've got a bunch of questions,
12:37
and I hope that you'll bear with me for a minute here.
12:40
The first one being,
12:42
if I only see a device once across
12:44
all the datasets, does it get a higher
12:47
weighting, or do you treat that differently?
12:49
I guess it's always nice to see a
12:51
device multiple times, like, ah, yeah,
12:53
it definitely is a device, multiple
12:56
providers see that data, that device
12:58
in their dataset.
13:00
Yes, that's right.
13:02
That can contribute to the kind of our ability
13:05
to determine the veracity of the device,
13:07
like, is this a real device?
13:10
That's one component of it. Another
13:12
component of it is, of course, the
13:14
ping could be real, and it
13:16
might end up in our dataset. We
13:18
do try to aggressively filter on quality
13:21
and veracity. We try to
13:23
filter out some of that synthetic data, for sure,
13:26
but if a device is only seen once
13:28
in a blue moon, it makes it much harder
13:30
to reconstruct this
13:33
holistic understanding of their user journey
13:35
throughout the day. And for some applications,
13:38
that doesn't matter as much, but
13:40
in general,
13:41
we want to start by having the fullest understanding
13:44
of what a device was doing throughout
13:45
the day. So if we only have very
13:48
patchy appearances of the
13:50
device and the data, it becomes very hard to kind
13:52
of impute what's happening in
13:54
the gaps where we don't see the device. And
13:57
we feel more confident about building high-quality
13:59
data products when we can actually
14:01
have the most holistic understanding
14:04
of the device's movements. So yes,
14:06
that data will not be excluded, but maybe
14:09
it'll be considered to be low fidelity
14:12
or
14:12
will only be used for certain products and not
14:15
others. That makes a lot of sense. Do
14:17
you ever interpolate the gaps that
14:19
you see in the data? Let's say you have this, I think
14:21
you talked about journey. So you could say
14:23
that you're segmenting these
14:26
things into at-rest and
14:28
movement. And for a single device,
14:30
if the gap isn't too big, do you ever interpolate
14:33
that gap or interpolate the no
14:35
data points?
14:37
Yeah,
14:38
no, it's a really good question. And
14:41
on very small time scales, yes,
14:43
we do. One of the ways we
14:45
form, let's say a stop segment
14:48
is if we establish that a cluster of pings
14:50
is contributing to what we call a stop. When
14:53
we create the timeline for that stop, we create
14:56
a dwell time for the stop and it's really just the kind
14:58
of the maximum timestamp that's in
15:00
that cluster, subtracting off
15:03
the minimum timestamp. So we're establishing
15:05
that even though we only have a few pings contributing
15:08
to
15:08
it, we kind of fill in
15:10
that segment in the timeline and say like
15:12
during this block of time, that person was
15:15
stopped. Maybe they were at a venue. The
15:17
more difficult thing is kind of like between segments
15:19
when there are gaps between segments, because
15:21
obviously in an ideal world, you
15:24
would want to have a stop segment
15:26
followed by a movement segment followed by a stop
15:28
segment. And so when these things
15:30
are being created, there is a kind of
15:33
a process of coalescing. If
15:35
you have two moving segments that are close together,
15:37
we will coalesce them into one larger moving
15:39
segment because it just makes sense. It should
15:41
be this kind of flip-flopping between moving
15:44
and stop. But there are times
15:46
where for an extended period of hours,
15:48
for whatever reason, perhaps the
15:50
person with the device was indoors
15:54
or their battery died or they
15:56
got on a plane like there's lots of reasons why
15:58
they disappear from our radar.
16:00
And we don't try to do
16:02
currently, you know, with
16:05
the way we process the data, we don't try to
16:08
get too inventive with how we interpret
16:11
it
16:11
in those large gaps. And
16:13
the one exception to that is, you know,
16:15
in the evening and at night, like if you
16:18
live in, let's say you live in a concrete
16:20
apartment building, the chances of signal
16:22
being able to reach a satellite in order
16:24
to produce and record these things
16:27
is very attenuated. So
16:30
if we see that you stopped or that you
16:32
entered a building that has been designated
16:35
as your home in our modeling and
16:38
we don't see any pings for many hours,
16:40
you know, during the nighttime.
16:42
If you then reappear the next morning within
16:45
a proximity of a few hundred
16:47
meters or so of where you disappeared
16:49
off our radar and it was overnight, then
16:52
we will interpolate between those two
16:55
points, which we did have data and say like you had
16:57
an overnight stop at this place. And it's
16:59
even more likely to be the case if it's
17:02
a place that we've, our modeling has designated
17:04
that you live, we'll call it an overnight stop,
17:07
even though we didn't have any data in that
17:09
gap. But other than that,
17:11
you know, there are companies out there that
17:13
are in the business of generating synthetic
17:16
data to kind of
17:18
mimic, you know, human patterns of movement.
17:21
But we don't currently do that
17:23
at Foursquare. So we try to
17:24
minimally interpolate when there are gaps.
17:27
We let those gaps exist for the most part.
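Two of the heuristics described above, sketched in Python under stated assumptions: dwell time as the latest minus the earliest timestamp in a stop cluster, and the overnight fill-in when a device reappears close to where it vanished. The distance thresholds, field names and the home-designation flag are illustrative only.

```python
# A hedged sketch of two heuristics mentioned above: dwell time as the latest
# minus the earliest timestamp in a stop cluster, and the overnight fill-in when
# a device reappears near where it vanished. Thresholds, field names and the
# home-designation flag are illustrative assumptions only.
from math import radians, sin, cos, asin, sqrt

def dwell_seconds(timestamps: list[float]) -> float:
    """Dwell time of a stop cluster: maximum timestamp minus minimum timestamp."""
    return max(timestamps) - min(timestamps)

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def infer_overnight_stop(last_evening_ping: dict, first_morning_ping: dict,
                         is_home: bool, max_gap_m: float = 300.0) -> bool:
    """Only bridge the overnight gap if the device reappears within a few hundred
    metres of where it disappeared; a modelled home location relaxes the cutoff."""
    dist = haversine_m(last_evening_ping["lat"], last_evening_ping["lon"],
                       first_morning_ping["lat"], first_morning_ping["lon"])
    return dist <= max_gap_m or (is_home and dist <= 2 * max_gap_m)
```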
17:30
And all these data streams, are they being delivered
17:32
to you in real time? And what
17:34
I'm wondering here is that let's say you
17:36
get a delivery update or your own systems,
17:39
your own apps pick up this device
17:41
ID. You can see it today. And
17:43
then a week later, you get some more data from
17:45
a third party provider. Do you need to wait
17:48
a certain amount of time to, you know, gather
17:50
that data in and make sure that you can,
17:53
like, identify those devices
17:54
before you start processing data? Does
17:57
that make sense? I mean, this is a really
17:59
good question.
17:59
You've kind of hit the
18:02
nail on the head with one of the issues.
18:04
In consuming third party data, there are always issues.
18:06
There are issues around quality, but there are also issues around
18:09
latency. Obviously, the data that we
18:11
get from our own apps, we're
18:13
able to process fairly quickly.
18:15
There's very low latency there, and it's high
18:17
quality because we own the
18:19
data. It
18:21
never leaves the boundaries of our
18:23
company's data. The
18:26
external data, it's interesting.
18:28
If you consider a particular date of observation,
18:30
a
18:31
date during which stuff was happening
18:33
in the real world, and we want to collect
18:35
as much of that third party data as possible,
18:38
it can take many days. Now, we
18:40
get deliveries daily from third parties,
18:42
but it can take many days to fill
18:45
in all the blanks about that day of observations,
18:48
the date in which things were generated.
18:51
Even five, six days later, it's worth
18:54
waiting those extra few days to get more,
18:56
to fill in the blanks
18:58
and to get substantially more
19:00
data about that
19:01
particular day of observation. The
19:03
flip side of that is then,
19:05
if this data is going to be used for
19:08
anything that requires a quick turnaround, like
19:11
for instance, some
19:14
of our data products lead to attribution.
19:17
If someone sees an ad for
19:19
a quick serve restaurant on their
19:21
mobile device, that ad impression
19:23
may be registered with that device. Then,
19:27
much like digital, a little bit more
19:29
challenging, I would say, than digital conversions
19:32
where someone
19:33
might see an ad for socks and then click on
19:35
a website and go and buy some socks,
19:37
like within 10 minutes of seeing the ad,
19:40
it's more of a challenge to connect
19:42
the real world conversion of someone walking
19:44
into a quick serve restaurant
19:46
because they saw an ad for
19:49
a hamburger online
19:51
and it made them hungry. But either way,
19:53
there's still this issue of the
19:55
conversion window and we want to have feedback
19:58
from the campaign, from the advertising campaign that
20:00
produced the impression as soon as possible.
20:03
And depending on the needs of the client, that may
20:05
be ideally within a few
20:07
days. So we wait more time to collect
20:09
more data so we can make
20:12
more high-fidelity observations
20:15
about what the person did in the real world, but then
20:17
the clients also want a fast turnaround.
20:19
So typically, there's some sort of sweet
20:21
spot. It could be between
20:25
two or three and seven days, depending
20:27
on the client. And their tolerance
20:30
for a delay and waiting for that signal.
20:33
This is a perfect segue onto the
20:35
next obvious question here, which is what is this
20:37
data used for? And in a previous conversation,
20:40
you've mentioned this idea of attribution
20:42
and targeting.
20:43
And I want to get into that in just a second.
20:46
But first, I want to understand
20:48
about movement and at
20:50
rest.
20:51
Because I think this will help people understand
20:53
where the conversation is going
20:55
to go from here. Which one of those two things
20:57
is more important for you as a company,
21:00
to know that the device is moving or that
21:02
the device is at rest? Yes.
21:05
So I see there's a good story behind this. So
21:08
when I came to the company, one of the things I was
21:10
tasked with was building a team
21:12
to
21:13
upgrade these movement pipelines,
21:16
use more cutting-edge technology
21:18
and make these pipelines
21:20
more robust. And we
21:23
were looking at how things have been done previously. And
21:25
there's a certain amount of ML and algorithmic
21:28
work that went into it. And we
21:30
wanted to move quickly and build something that
21:33
was simple to understand and also
21:36
easy to maintain. So we started by building
21:39
a baseline model for
21:41
this movement segmentation piece
21:43
that's not relying on
21:46
kind of off-the-shelf algorithms or
21:48
any sophisticated ML that would
21:50
then require upkeep
21:53
and ML ops practices.
21:56
I think just in general as a data
21:58
scientist, we should always... start by building
22:00
a simple, heuristic rule-based model.
22:03
And that can be our baseline. But it also demonstrates
22:06
that the people who are tackling the problem
22:08
understand the problem, because they built rules
22:10
that work. And it's also super easy
22:12
to debug, whereas ML
22:15
can be a bit of a black box phenomenon. So
22:17
one of the epiphanies we had was, as
22:20
you look at a device trail as someone is
22:22
walking down the street with their mobile phone in their pocket,
22:24
it's easier to actually measure
22:27
movement. And you sort of
22:29
see this with your, even when you're using
22:32
Google Maps, quite often it doesn't
22:34
know which way you're facing when you
22:36
start driving. Like, it thinks you're going the wrong
22:38
way down the street. And then it quickly updates
22:41
and flips you around on the map. And
22:43
so my point is that actually movement is easier
22:45
to detect than stops.
22:48
And in some sense, stops are like the absence
22:50
of movement. So indexing
22:53
on movement was one of the key things that we were
22:55
able to do to actually get a much more accurate
22:57
understanding of this phenomenon. It sounds
23:00
trivial. I know very
23:02
clearly if I'm moving or at rest, but
23:05
the mobile phone signals can suffer
23:08
from all sorts of jitter
23:10
and issues with urban
23:13
canyons, signals
23:15
reflecting off buildings or walls. It's called
23:17
multi-path, I guess, indoor
23:19
underground use, satellites being
23:22
blocked, and so on, trees. And
23:25
it's actually non-trivial to try
23:27
and solve that. And so you could look at speed,
23:29
for instance. But because of the jitter that's in the
23:32
signal, the kind of speed measurements
23:34
are quite often not reliable
23:36
when you're trying to do this segmentation.
23:38
So the takeaway was that
23:41
we wanted to focus on movement. Instead of worrying
23:43
about stops, let's focus on movement. Because
23:45
when you're moving down the street, your trajectory
23:48
takes a very definite shape. So
23:50
the idea was to focus on the shape of the trajectory,
23:53
rather than things like speed.
23:55
Because an old lady shuffling down the street with
23:58
her shopping bags is not moving very quickly
24:00
and her signal may have a lot of jitter in it, but
24:03
if you look at her kind of average trajectory,
24:05
it's a very uncoiled shape. And
24:08
so that's kind of the metric that we were using.
24:10
It's kind of a shape metric. I call it the spaghetti
24:13
shape index versus
24:15
like when you're stopped, your ping trail
24:17
tends to be kind of coiled up. Maybe because
24:19
of jitter, it looks like you're moving fast, but your trail
24:22
tends to be kind of coiled up like spaghetti on a fork.
24:24
So then stops became like the absence
24:27
of movement once we had that epiphany.
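One way to capture that "spaghetti" intuition is a straightness ratio: net displacement divided by total path length. This is only a hedged sketch of a shape metric in that spirit, not the exact index Foursquare computes.

```python
# A hedged sketch of a trajectory "shape" metric in the spirit of the spaghetti
# index: net displacement divided by total path length. An uncoiled walking
# trail scores near 1, a coiled-up stationary trail near 0. This is not the
# exact metric Foursquare uses.
def straightness(points: list[tuple[float, float]]) -> float:
    """points: (x, y) positions in metres, ordered by time."""
    if len(points) < 2:
        return 0.0
    def dist(a: tuple[float, float], b: tuple[float, float]) -> float:
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    path_length = sum(dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    net_displacement = dist(points[0], points[-1])
    return net_displacement / path_length if path_length > 0 else 0.0
```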
24:30
And what's interesting
24:32
is, yes, we focused on movement,
24:34
but actually in
24:35
terms of our business, stops
24:38
provide more value, but it's kind of
24:40
like yin and yang. So stops provide
24:42
value because if you can define
24:44
a device is at rest, if it's in
24:46
the vicinity of some commercial venue,
24:49
then you can say maybe
24:50
that person who was stopped there went into that coffee
24:53
shop nearby. So you've elevated the
24:55
stop from being a stop, becoming a visit
24:57
by doing this venue attachment.
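A hypothetical sketch of venue attachment: take the stop's centroid and attach the nearest known venue within some radius, which is what turns a stop into a visit. The venue structure, radius and crude planar distance here are assumptions for illustration only.

```python
# A hypothetical sketch of venue attachment: take a stop centroid and attach the
# nearest known venue within some radius, turning the stop into a visit. The
# venue structure, radius and crude planar distance are assumptions only.
def attach_venue(stop_lat: float, stop_lon: float,
                 venues: list[dict], max_radius_m: float = 75.0):
    """Return the closest venue within max_radius_m of the stop, or None."""
    best, best_dist = None, float("inf")
    for venue in venues:  # each venue: {"name": ..., "lat": ..., "lon": ...}
        dy = (venue["lat"] - stop_lat) * 111_000
        dx = (venue["lon"] - stop_lon) * 111_000
        d = (dx * dx + dy * dy) ** 0.5
        if d <= max_radius_m and d < best_dist:
            best, best_dist = venue, d
    return best  # the stop plus this attached venue is what becomes a "visit"
```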
24:59
Once you have a visit, then there's all sorts of
25:01
like commercial applications that open
25:03
up. And you mentioned already, there's attribution
25:06
and targeting.
25:07
If someone goes into coffee shops regularly, we
25:09
can assign them to an
25:12
audience, a bucket of devices
25:15
that
25:17
can be sold as an audience of coffee
25:19
lovers. And so the
25:21
attribution and targeting are like the opposite
25:23
sides of the coin with digital advertising.
25:26
First of all, you need to understand what
25:29
type of person might be interested in an advertisement,
25:32
digital advertisement. And so you
25:34
show someone a coffee ad and they're susceptible
25:36
to drinking coffee, then that's a good approach.
25:39
The other side of it is attribution. When you show
25:42
them the ad, do we know if that
25:44
person responded and went to a coffee shop?
25:46
And we have a team at the company that
25:50
looks at solving, making that
25:52
connection and doing it in a sophisticated
25:55
enough way to understand like, would
25:58
that person have gone to the coffee shop anyway? even
26:00
if we had not shown them the ad. So
26:02
there's like, there's ML in this in terms
26:04
of like causal inference models to,
26:07
you know, compare the actual behavior
26:09
with the counterfactual, like the baseline,
26:12
which is people tend to go to coffee
26:14
shops anyway, is there a lift in,
26:17
you know, their visitation if they see
26:19
an ad, right? So that's
26:21
a very valuable revenue
26:23
generating activity for the company. Like
26:26
being able to connect stops
26:28
to venues to be able to assign visits. And
26:31
then from the visit, you can match
26:33
that back to an ad impression. And
26:35
then that's what attribution is.
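The arithmetic behind lift is simple even though the real causal-inference models are not; a toy sketch with made-up numbers:

```python
# A toy sketch of the lift arithmetic behind attribution: compare the visit rate
# of devices that saw the ad with a comparable unexposed group. Real causal
# models are far more careful about confounding; the numbers below are made up.
def visitation_lift(exposed_visited: int, exposed_total: int,
                    control_visited: int, control_total: int) -> float:
    """Lift > 1 suggests exposed devices visited more often than the baseline group."""
    exposed_rate = exposed_visited / exposed_total
    control_rate = control_visited / control_total
    return exposed_rate / control_rate

# e.g. 300 of 10,000 exposed devices visited vs 200 of 10,000 unexposed -> lift of 1.5
print(visitation_lift(300, 10_000, 200, 10_000))
```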
26:37
So like our, our partners, the clients
26:39
that are interested in our attribution product can understand
26:42
if their campaigns, their advertising
26:44
campaigns are successful or not. When you
26:46
talk about it like that, it sounds like you're looking at
26:48
stops and movement as discrete objects.
26:51
Okay. That this one stop was important
26:53
to us. But
26:54
again, like harking back to our previous
26:56
conversation, you had this great phrase, let me see if I can pronounce it
26:58
correctly. It was
27:01
a semantically meaningful journey. And
27:03
when you've said that, it made me think like,
27:05
maybe this is more than just one discrete stop. Maybe
27:08
this is, you know, trying to build up a picture
27:11
of the journey itself. Like what was this device doing during
27:13
the day? What,
27:15
what does the pattern of the weekly
27:17
daily monthly pattern of this device look like? Am
27:20
I on the right track or am I completely
27:22
out of the way?
27:24
No, no, it's, I think you're right. Uh,
27:26
yes, there is definitely a contrast between
27:28
the kind of, uh, one
27:31
and done, uh, scenarios that I'm
27:33
describing. Like, you know, it's much
27:35
more generic to say someone tends to visit
27:37
coffee shops or to say, Oh, they saw this ad
27:39
and then they went to the coffee shop, right. Or
27:41
the quick
27:42
serve restaurant. It is definitely
27:44
like, I think foundational to,
27:47
as a data scientist, I want to be able
27:49
to recreate a
27:51
picture of reality. I want to be faithful
27:53
to what's happening in the real world for
27:56
the users of these devices and
27:58
you know, that's kind of.
27:59
foundational to what we try to do. And it's
28:02
not anything necessarily that we then,
28:04
I want to make a distinction between
28:06
that and then what is actually presented
28:08
as a product, either internally or to external
28:11
clients. Um, you know, there's lots of
28:13
privacy concerns that we have
28:15
front and center about, you know, the products that
28:17
we deliver, but you know, as a
28:19
data scientist, it's my goal to have a full
28:21
understanding because I don't want to make mistakes
28:24
about how I infer
28:27
what was happening with that person.
28:29
If we have that full picture, you know, we
28:31
can serve those obvious use cases
28:33
of targeting and attribution,
28:35
but there are other, you know, more sophisticated
28:37
scenarios that you're alluding to with
28:39
like, wouldn't it be great if we could understand
28:42
the full, uh, like
28:44
longitudinal movements of a device throughout
28:46
the day. And it also helps us
28:48
understand the quality of our data. If we're, you
28:50
know, if for a particular provider, we can't
28:53
do that reconstruction in a very convincing
28:56
way, it might suggest that that provider
28:58
is not giving you very high quality data. But
29:00
in terms of like how we derive
29:03
uh, you know, that full longitudinal understanding
29:06
that, uh, holistic understanding of the
29:08
user journey throughout the day, one of the
29:10
applications for this is that,
29:12
uh, you know, we have a client who
29:15
is interested in building
29:17
synthetic models, synthetic twins of,
29:19
you know, real, uh, users in the real
29:21
world, these digital synthetic twins.
29:24
They populate cities with these
29:27
synthetic models and,
29:29
uh, you know, from these models
29:31
that are trained on the real data that
29:33
we supply them. So this is a scenario where
29:36
we do have to have high quality
29:38
longitudinal stories about these, you
29:40
know, these holistic stories about the user journey,
29:42
because then the models they build will be a much
29:44
higher quality. And these are
29:46
the types of, uh, you know, synthetic
29:49
models that are really, really useful
29:51
to, um, like
29:53
city, uh, transit authorities,
29:56
uh, urban planners as they kind
29:58
of model the flow.
29:59
of human beings through the urban landscape. It
30:02
can really help with things like urban planning.
30:04
The good thing about that is it's very
30:07
privacy safe because none of
30:09
the real user data gets exposed
30:11
to the outside world. It's merely used
30:14
to train these synthetic models.
30:16
That sounds fascinating.
30:17
You're talking about getting
30:19
data in different chunks earlier on in
30:21
the conversation, and you were relating this to
30:24
attribution. How is our
30:26
client gonna know if the
30:28
device saw that, walk past the billboard, and then
30:30
went and had a cup of coffee?
30:32
But that made me think of
30:34
this idea, wow, you could monitor
30:37
a disaster, for example. Or
30:39
you could look at a disaster in retrospect
30:42
and see how people responded to it, like leading
30:44
up to it and after it, maybe even
30:46
during it and after it. Do you not have anyone
30:49
doing work like that? I mean,
30:51
I can say the answer is yes. We
30:53
have a
30:55
government client that's very interested in
30:57
modeling what happens in,
30:59
not even modeling, but just actually observing
31:02
what happens in the aftermath of,
31:04
let's say, a hurricane. You can
31:06
imagine that satellite data
31:08
can be rather patchy. You may
31:10
not have satellite imagery of what's happening. So
31:13
in terms of even disaster
31:15
relief and planning for future disasters
31:18
in response to those, this
31:22
sort of data is and will
31:24
be immensely valuable. We
31:26
do have an interest from a client
31:28
in that. And I probably can't say who it
31:31
is, but that is definitely, you've
31:33
hit the nail on the head there too with that. So
31:35
interesting, you talked about
31:37
satellite data. For the
31:39
last little while, people were talking about, we can
31:41
use satellite data and we can look at the car park
31:44
at Walmart and figure out how many
31:46
cars in there and sort of, long story short,
31:48
figure out what the share price is gonna be, essentially,
31:50
whether it's gonna go up and down. Lots of people are visiting
31:52
Walmart. But my guess is you have pretty
31:55
great data on that. Do you work with
31:57
satellite companies to help them sort of augment the
31:59
analysis that...
31:59
they're producing or
32:02
could you? There are companies that,
32:04
like you mentioned, will take that satellite data
32:07
and try to infer, you know,
32:09
as you say, if, if
32:11
for a particular big box store,
32:14
there are 20% fewer
32:16
cars in the parking lots this
32:18
quarter compared to the last quarter, you know,
32:20
maybe earnings will be down, right? That
32:23
has all sorts of, you know, potential
32:25
issues in that the data is really
32:28
quite sparse. You know, there, until recently,
32:30
I think, you know, Planet Labs has
32:32
these amazing doves that, that encircle
32:34
the globe and maybe have a, you
32:37
know, daily line scan
32:39
image of the earth that gets updated daily. But
32:42
apart from them, I mean, you're relying
32:44
on, you know, a very low
32:46
coverage of parking lots of
32:49
big box stores, you know, from other satellites.
32:52
And you're also, you know, at the mercy of
32:54
the weather and you're also at the mercy
32:56
of the fact that parking lots can be underground,
32:58
right? So definitely the
33:01
people that are interested in our data products
33:04
might be using those as well for the same purposes.
33:06
But I would say that we're
33:08
immune to some of those concerns like weather, for instance,
33:11
we can actually determine what foot traffic
33:13
was like to a particular high street store or
33:15
a big box store in a way that
33:17
is much less sparse. So
33:20
that's definitely one of the applications
33:22
of the, you know, us being able to
33:24
generate visits from stops. That's
33:27
like a direct application of visits. These are kind
33:29
of like business insights that you would, you
33:31
know, derive from the visits. It's interesting.
33:34
Like this really makes me think that
33:36
that whole argument, and I realize it was just
33:38
an example that they could, you know, a tangible
33:41
example that they could tell people about, oh, we could
33:43
do this, you know.
33:44
And my guess is this was
33:46
an example that was sort of supposed
33:48
to help people, to open people's eyes to
33:51
the possibilities. But it really does make
33:53
me think like there's probably a better
33:55
way of doing this. And maybe your data
33:57
is a better way of doing this.
33:59
Coming back to this conversation about
34:02
how we use
34:04
the data to provide insights
34:07
about foot traffic to various chains and
34:10
business categories. There is that
34:12
data, of course, coming in from third parties,
34:14
but one of the things we really focus on is
34:18
coming back to this idea that we have our own owned and
34:20
operated apps, we can look at,
34:22
and this comes down to quality, right? And
34:25
this is one of the ways in which we filter aggressively
34:27
for quality. I just wanted to bring
34:29
this up, like looking at the intersection
34:32
of those devices that are in both the
34:34
data from our own apps
34:36
and that appear in the third party apps,
34:38
we were able to, well,
34:40
first of all, we build a machine learning model to
34:42
determine like which providers are
34:45
more trustworthy than others. In other words, I
34:47
think of it this way. If our first party
34:50
apps are saying that at a particular point
34:52
in time, a device that is in
34:54
third party and in first party data,
34:57
if that device at that time was, let's say in
34:59
San Francisco where I am and a third
35:02
party app says, Oh, actually that device
35:04
is 20 miles away or
35:06
it's in San Jose, right? That's an example
35:09
of a training label. We can then
35:11
apply to the third party data. We can look
35:13
at the composition of the stops
35:16
and visits that we generate, and they're composites.
35:18
You know, some of the pings come from one provider,
35:20
some of them come from another source.
35:22
And so we can see like based on the composition,
35:24
what's the likelihood that that
35:26
stop is real or that visit is real.
35:29
And then we can build a model on top
35:31
of that. So we can train on the devices that
35:33
are in the intersection of our data and the
35:36
third party data.
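One way to picture that training setup, as a rough sketch rather than the production model: devices seen in both first-party and third-party data yield agree/disagree labels, and a simple classifier trained on them can then score third-party-only pings. The features, thresholds and the use of scikit-learn here are all assumptions.

```python
# A rough sketch of the veracity-modelling idea, not the production model:
# devices seen in both first-party and third-party data yield agree/disagree
# labels, and a simple classifier trained on them can then score third-party-
# only pings. Features, thresholds and the use of scikit-learn are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_labels(first_party_pos, third_party_pos, agree_radius_m=500.0):
    """Label 1 if the third-party ping lands near our first-party position, else 0."""
    labels = []
    for (lat1, lon1), (lat2, lon2) in zip(first_party_pos, third_party_pos):
        dy = (lat2 - lat1) * 111_000
        dx = (lon2 - lon1) * 111_000
        labels.append(1 if (dx * dx + dy * dy) ** 0.5 <= agree_radius_m else 0)
    return np.array(labels)

# Hypothetical per-ping features, e.g. reported horizontal accuracy (m) and a
# snapped-to-grid flag; San Francisco vs San Jose mirrors the example above.
X_train = np.array([[5.0, 0], [400.0, 1], [8.0, 0], [900.0, 1]])
y_train = build_labels([(37.77, -122.42)] * 4,
                       [(37.77, -122.42), (37.33, -121.89),
                        (37.77, -122.42), (37.30, -121.90)])
veracity_model = LogisticRegression().fit(X_train, y_train)
```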
35:37
And then we can apply the model to predict
35:39
on top of the third party data. And that
35:42
way, you know, we can do some very
35:44
aggressive filtering for this, this idea
35:46
of veracity. And that way, because
35:49
as you say, there's a hundred billion pings coming
35:51
in, we need to be very careful about, you know,
35:53
how much of that we just directly ingest
35:55
in a very naive fashion. So at the very
35:57
top of the funnel, we can actually take that data
36:00
and apply these models and start to really restrict
36:02
it, like turn down the flow based on
36:04
these, like, veracity predictions, and that way,
36:07
then we get to something we can say more confidently about
36:10
the, you know, how many people visited the mall that
36:12
day or the big box store. Would
36:14
it be fair to say that you can use your own first-party
36:17
data as a form of ground truth? Yeah,
36:19
of course, that's right, yeah. I see it
36:21
as sort of a quality assurance
36:23
chain that starts with, you know, we have
36:25
this app that has such a great loyal user
36:28
base, you know, the Swarm app, and
36:30
these people are creating their own
36:32
visits, basically. They're doing it for themselves,
36:34
but in a way, they're doing it for us, too, right?
36:36
We know
36:37
then that, you know, when the phone says
36:39
this person stopped somewhere, that
36:42
the algorithm inside the phone that is part
36:44
of our app is doing a good job because
36:46
the person is verifying it; the human in the loop is
36:48
creating that training label and saying, yes, I was
36:50
at this venue, so we can calibrate
36:52
our own models on our first-party apps, and
36:54
then the chain then goes to the third-party
36:57
data, so we use the first-party data to
36:59
validate the third-party data
37:01
and do this veracity modeling. That's
37:04
the way I look at it, as a chain of quality
37:06
assurance. I'm really, really pleased
37:09
you shared that with me. It makes a lot of sense, and
37:11
it's interesting. So we've been talking about these
37:13
different sort of use case
37:15
applications for the
37:17
data. You've done a great job of telling us about
37:19
the data, where it comes from, how you process
37:21
it, the way you check it, these checks and balances
37:23
that are in place, the idea of segmenting
37:26
it into stops and movement
37:29
and why that's important. We talk a little bit about
37:31
attribution and targeting, the flow
37:33
through a city, this
37:36
idea of a semantically meaningful journey.
37:39
I want you to describe one last example for us, if
37:41
you would please, and this is the idea of crowdsourced
37:43
routing. Right,
37:45
yes, so this is a
37:48
work in progress. Part of our
37:50
research team is working on, if
37:52
you think about it, and this is a good
37:54
example. Earlier we were talking about how
37:57
most of the value we bring through understanding.
38:00
the raw GPS data is in determining
38:03
stops and then visits. And then obviously that
38:05
leads to
38:06
attribution. But
38:07
now that we are doing a better job at
38:09
segmenting, you know, the movement
38:12
in terms of understanding movement
38:14
itself, not just the stops, you
38:16
can imagine a very straightforward application is, you
38:19
know, when you look at a map, you might look at a hotspots,
38:22
you know, hotspots on the map, you can aggregate
38:25
where people stop, you know, on some grid,
38:28
you know, at Foursquare, we use the H3
38:31
grid system, hexagonal grid system.
38:33
So you can just simply do a, you
38:36
know, a binning, like how many stops have
38:38
occurred in this particular location. And that'll
38:40
tell you maybe some information
38:42
about like where people enter a building,
38:45
right? Because people stop near the entrance or
38:48
they, you know, the density of stops is higher
38:50
near the entrance of a building. So it gives you some meaningful
38:54
understanding of places beyond just say
38:56
polygons. So that's about stops,
38:58
right? So what to think about movements. So instead of thinking
39:00
about
39:01
hotspots, you could think about hot trails,
39:04
right? Like we also like
39:06
aggregate
39:07
people's movement segments, like
39:10
again, on a H3 grid, it's a way to kind
39:12
of coarse-grain those trails,
39:14
those moving segments. It also
39:17
guarantees a certain amount of privacy
39:19
because, you know, we're talking about public roads
39:22
and we want to sort of
39:24
snap people's moving segments
39:26
to those roads and the ones that are
39:28
more heavily trafficked are
39:31
the ones that then, you know, you might designate
39:33
as these hot trails. In some sense,
39:36
that's crowdsourced routing, right? Like, you
39:38
know, you learn in computer
39:40
science about, you know, pathfinding algorithms
39:43
like Dijkstra's pathfinding, right? Which
39:45
tries to find the shortest, lowest cost
39:48
path between two spots. But there may
39:50
be other reasons why, you know, on a map
39:52
that people use a different
39:55
sequence of waypoints, a different
39:57
route through the map than maybe the shortest.
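A minimal sketch of the hotspot and hot-trail binning described a moment ago: count stop points and moving-segment points per H3 cell. It uses the h3-py package (v4 API assumed; in v3 the call is h3.geo_to_h3), and the resolution and coordinates are illustrative only.

```python
# A minimal sketch of the hotspot / hot-trail binning idea: count stop points and
# moving-segment points per H3 cell. Uses the h3-py package (v4 API assumed; in
# v3 the call is h3.geo_to_h3). Resolution 9 and the coordinates are illustrative.
from collections import Counter
import h3

def bin_points(points, resolution: int = 9) -> Counter:
    """points: iterable of (lat, lon). Returns a Counter of H3 cell -> count."""
    return Counter(h3.latlng_to_cell(lat, lon, resolution) for lat, lon in points)

stop_hotspots = bin_points([(37.7749, -122.4194), (37.7750, -122.4195)])
hot_trails = bin_points([(37.7749, -122.4194), (37.7760, -122.4200),
                         (37.7771, -122.4206)])  # points along moving segments
```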
40:01
And so, you know, much as when,
40:03
I don't know if you were a kid growing up, like I grew
40:05
up in Ireland, then in the country, there
40:07
are always these kind of well-worn paths, and you
40:09
wonder, like, were they made by people or by animals,
40:12
like through the forests and the hills. This
40:15
is an example like that, right, where you're finding
40:18
that kind of hot trail on the map, and
40:20
that could definitely have an application for sure. So we're
40:22
just exploring that and seeing
40:24
if we can actually produce that as
40:27
a product, then, you know, working with sales
40:29
folks to
40:29
see if there is a market for it. That
40:32
would be really interesting to see how that plays out.
40:35
You know what that reminds me of. You
40:37
have the restaurant and it says, Popular. Most
40:39
people eat this thing here. And it gives
40:42
you a sense of certainty. And I just
40:44
imagine looking at my, you know,
40:45
navigation app thing there, fastest,
40:48
shortest, most eco-friendly and popular.
40:52
And I wonder which one people would choose because, Popular,
40:56
there's a certain amount of certainty that comes
40:58
with that.
40:59
Most people choose this one. That's right.
41:01
That's right. And, you know, quite
41:04
often, routing apps will send you on
41:06
a route that maybe doesn't
41:09
penalize left turns, like maybe it's
41:11
the shortest, like end-to-end. But, you
41:13
know, there are more, maybe it's a
41:15
more
41:16
dangerous, less effortless way to
41:18
go. So I
41:20
think sometimes as you're implying,
41:23
like the maybe the
41:25
kind of the lowest common denominator route is
41:27
the one that is the most effortless. Maybe
41:29
that's kind of what we should be optimizing for. Yeah. Or
41:32
maybe it's the most peaceful. Maybe it's the most beautiful.
41:34
Maybe it's whatever else. That's right.
41:38
Most scenic. And my guess is there'd
41:40
be an interesting overlap, you know, between
41:42
what the computer thinks is the best and what the
41:45
humans think is the best. I mean, so in the
41:47
city where I live, for example, there's
41:48
lots of cycleways, cycleways everywhere.
41:50
And you can, they've made a huge effort to try
41:53
and go, please go on the cycleway. But
41:55
people always cut corners if they can because,
41:57
well,
41:57
that's great. The machine said I should go straight
41:59
here, but you know what? The human in me just wants
42:02
to turn around the corner there. And you can see these
42:04
well-worn bike paths, you know,
42:06
just on the side and these little sneaky
42:08
routes that people take because that
42:10
is clearly a great place for humans
42:12
to go. Humans would like to move in that direction
42:15
or in that way. I think that's really
42:17
interesting. Yeah, it'd be great if, you know, that
42:20
we could have like a data-informed approach
42:22
to that too, right? I think that would be
42:24
amazing.
42:25
Yeah, yeah. And that, like, this ties back
42:27
into what you were saying earlier about,
42:29
you know, city planning. If we know more, the
42:31
more we know about the movement, how are the people
42:34
living in the city actually
42:35
moving through the city? And if we can model that and
42:37
create a city, and how
42:40
would they like to move through here? Not how are
42:42
we going to force them to move. It's probably a bit of,
42:44
you know, give and take there. But I think that would be
42:46
interesting. Yes, for sure. And also,
42:48
like, I think there is a component of this that
42:51
is going back to
42:53
semantic segmentation. I mean, I think,
42:55
you know, much as like, you know,
42:57
when we do, let's
42:59
say, video calls and
43:02
the algorithm on your video
43:04
call knows the difference between you and the foreground,
43:06
and then it can blur out the background. So
43:09
it has that distinction of, like, foreground
43:11
from background or like earth versus sky,
43:13
that type of segmentation that we can do in computer
43:16
vision. I think there's another research
43:18
direction in this, which is sort
43:21
of semantic segmentation on maps.
43:23
You know, we have different ways of mapping. Some
43:26
of it is crowdsourced, you
43:29
know, there are people out there annotating maps for
43:31
OSM. You can also draw maps
43:33
using satellite imagery using the same
43:36
sort of segment, semantic segmentation.
43:38
I think Microsoft
43:38
has done that. But we could also
43:41
be using the mobile phone signals and this understanding
43:43
of like, stops and motion,
43:46
you
43:46
know, vehicular motion versus pedestrian
43:48
motion, to be able to draw maps, maps
43:51
without maps, using the ping trails
43:53
of humans, just, you know, aggregating over
43:55
time to remove the noise, even
43:57
the speed at which the people are moving would
44:00
provide segmentation of roads
44:03
into fast moving roads versus slow
44:05
moving roads, and also uncover
44:07
anomalies between the usage
44:09
of roads versus how the roads have been drawn
44:12
by these other sources. So there's
44:14
a possibility of enrichment there too,
44:16
I think. This is kind of fascinating. So we've
44:19
been talking for a while now and coming up with all
44:21
these ideas of stuff you can do. And
44:23
right at the start of the conversation,
44:24
you said our data looks like this. It's
44:27
like imagine a spreadsheet with a latitude,
44:30
a longitude, a time, and an ID. It's
44:32
kind of amazing that from that
44:35
we can see so much potential. Yeah. Yeah.
44:39
And so that's the power of data science and data
44:41
engineering. And I think geospatial data,
44:43
you asked me why I chose this career, but I think
44:46
it's some of the most challenging out there in
44:48
the domain of data,
44:50
data science, because so little
44:52
of what we do
44:54
makes sense unless you can
44:56
really just look at it on a map. Foursquare
44:58
has Foursquare Studio, which is our visualization
45:01
studio. And I think anytime
45:03
we have any incoming data scientists, I always insist
45:06
that they draw their maps. They
45:09
don't just look at the data in tables and in statistics
45:12
and metrics, but they actually plot their maps in studio
45:15
because you just don't really appreciate, you
45:18
don't have the correct contextual awareness until
45:20
you plot things on a map. And
45:22
that's part of the nuance of geospatial. And I think that's
45:24
always been part of what fascinates me about
45:26
it.
45:27
Yeah. So with that, I had
45:29
a question I was going to ask you about, like, what are you going to do
45:31
next? Or have you run out of things to
45:33
do, but you kind of, I think you've alluded
45:36
to it that you haven't run out of things to do, that there's a lot
45:38
to do there. It's very challenging. So
45:40
I think I'll take this in a different direction for the last
45:42
question. So the last question is, do
45:44
you think that spatial is special? Or
45:47
is it just more data? Well,
45:49
I think it's related to what I've just been saying. Trying
45:51
to extract value from spatial data
45:54
is definitely very challenging.
45:57
I think in terms of the industry, it's
45:59
still...
45:59
still somewhat untapped.
46:02
Things like targeting and attribution are very
46:05
much low-hanging fruit. And
46:08
I think we owe it to ourselves to unlock
46:10
a lot of the other potential out there. And
46:14
luckily, the other potential comes
46:17
down to doing these more sophisticated things like understanding
46:22
these user journeys throughout the urban landscape,
46:24
building maps without maps. And
46:27
figuring out, we do work for businesses. This
46:30
is not just a research project. It's all about,
46:32
how can we generate revenue from that as well?
46:34
Is this something people are interested in, or is it just a
46:37
paper that I'm going to present at a conference?
46:40
Yeah, that makes a lot of sense.
46:42
And I think you mentioned that right at the start,
46:44
when you're talking about being a good data
46:46
scientist, can you see the bigger picture?
46:49
Can you see how this is going to create
46:51
value for our customers? So
46:53
that ties in nicely with what you just said there,
46:55
at least in my mind. Gabriel, this
46:57
has been awesome. I've really enjoyed talking
47:00
with you. Covered a lot of ground and
47:02
really, really enjoyed the conversation. Yeah, me
47:04
too.
47:05
People know that you work for Foursquare. Is
47:07
there anywhere in particular you'd like to point
47:09
them towards if they want to go and learn more?
47:12
I think you mentioned Foursquare Studio.
47:15
That would be a great place for people to check
47:17
out. Anywhere else we can share
47:19
a link to or point people towards? Yeah,
47:21
we can link the website. The website
47:24
is kind of a good springboard
47:25
into all of the different
47:28
activities.
Podchaser is the ultimate destination for podcast data, search, and discovery. Learn More