Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
Use Ctrl + F to search
0:11
Hello and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Data
0:16
lakes are notoriously complex. For
0:19
data engineers who battle to build
0:21
and scale high-quality data workflows on
0:23
the data lake, Starburst powers petabyte-scale
0:25
SQL analytics fast at a fraction
0:27
of the cost of traditional methods
0:29
so that you can meet all
0:31
of your data needs, ranging from
0:33
AI to data applications to complete
0:35
analytics. Trusted by teams of all
0:37
sizes, including Comcast and DoorDash, Starburst
0:39
is a data lake analytics platform
0:41
that delivers the adaptability and flexibility
0:43
a lakehouse ecosystem promises. And
0:46
Starburst does all of this on an
0:48
open architecture, with first-class support for Apache
0:50
Iceberg, Delta Lake, and Hudi, so you
0:53
always maintain ownership of your data. Want
0:56
to see Starburst in action? Go
0:58
to dataengineeringpodcast.com/Starburst and get $500
1:00
in credits to try Starburst
1:02
Galaxy today, the easiest and
1:04
fastest way to get started
1:06
using Trino. Dagster
1:08
offers a new approach to building
1:10
and running data platforms and data
1:13
pipelines. It is an open-source, cloud-native
1:15
orchestrator for the whole development lifecycle,
1:17
with integrated lineage and observability, a
1:19
declarative programming model, and best-in-class testability.
1:22
Your team can get up and running
1:25
in minutes thanks to Dagster Cloud, an
1:27
enterprise-class hosted solution that offers serverless and
1:29
hybrid deployments, enhanced security, and on-demand ephemeral
1:32
test deployments. Go to
1:34
dataengineeringpodcast.com/dagster today to get started, and
1:36
your first 30 days are free.
1:39
Your host is Tobias Macey, and today I'm interviewing
1:41
Andy Jefferson about how to solve the problem of
1:44
data sharing. Can you start by introducing yourself? Yeah,
1:46
hi Tobias. I'm Andy. I'm the CTO
1:48
at Bobsled. We're
1:50
a Series A startup solving
1:53
the problem of data sharing for enterprises in
1:55
the cloud. And do you remember how
1:57
you first got started working in data? For me, like, software is a
1:59
very, very important thing. Software engineering has always
2:01
kind of been about moving and processing data,
2:04
whether it's getting a
2:06
tweet or an iMessage from
2:08
my phone to your phone, or
2:10
whether it's control software in
2:13
a power plant or a chemical plant
2:15
as it's taking input data from sensors
2:17
and things and then processing
2:20
that data and then creating output data for the
2:22
signals to the control software and pumps and
2:24
things like that. I started
2:26
relatively late: during my
2:28
PhD — I was doing a PhD in chemical
2:30
engineering — I started with
2:33
work on control software. What I enjoyed
2:35
about that was that you actually wrote software
2:37
that did something very tangible and
2:40
real, and it worked in the real world, and
2:42
from that I got into computer
2:44
modelling during my PhD, and
2:46
then a little later I actually quit
2:48
my PhD to work in software, and I
2:50
did that because I found I was enjoying
2:53
the software engineering more than I was
2:55
enjoying the welding together of tons of
2:57
stainless steel. And then from
2:59
there, my first job was as the database administrator
3:02
on SQL Server — Microsoft SQL Server
3:04
as it was then. So my very
3:06
first job in software engineering was in
3:09
the data realm. Doing a lot
3:11
of administration, I got
3:13
first-hand experience of the move from on-prem
3:16
to cloud. I remember
3:18
when Microsoft SQL Server on Azure was
3:20
launched, and testing it out
3:22
and being really excited about it. We
3:24
had SQL Server on-premises initially
3:26
and then moved to the cloud,
3:29
and I got the chance to see
3:31
how that transition really benefited us: no longer
3:33
having to care for the boxes and think about what
3:36
could go wrong with them, and doing
3:38
the upgrades and all that kind of stuff. And
3:40
yeah, from there I think it's been a pretty exciting
3:42
and fun career so far, moving through different
3:44
kinds of things. So from there
3:47
I moved
3:49
into
3:51
a company that was doing an OLAP
3:54
database built on top of
3:56
Cassandra, at a time when NoSQL was a big thing.
4:00
NoSQL was the new hot thing, and OLAP
4:03
was very fashionable. It was a company called
4:05
Acunu, who were building an OLAP solution on
4:07
top of Cassandra. And we worked
4:10
with some large ride-sharing firms and
4:12
things there. So I
4:15
got to see some of the power of, like, big data
4:17
and the things you could do, at
4:20
real scale. And
4:22
from Acunu, I went to Apple,
4:24
where I got my
4:27
first experience doing data sharing —
4:29
consumer data sharing. There
4:31
I was working on the kinds
4:34
of protocols and databases that were
4:37
related to how you share, not just between
4:39
devices — syncing
4:41
some of the things like photos and
4:43
updates between different devices — but we
4:45
also worked on the first protocols for
4:47
doing sharing where you could do stuff like share
4:49
photos with another person. So
4:52
we also did some big data processing work
4:54
there, and we were doing reference
4:56
counting at scale. So
4:58
when you're sharing things, you need to keep
5:00
track of how many people have shared it,
5:03
and if all of the people who've shared it
5:05
have deleted it, then you can go on to
5:07
garbage collect it and things. We were doing that with
5:09
very large Hadoop jobs — that
5:12
was the thing that was of its time; sure,
5:15
it'd probably be Spark today. After
5:17
working at Apple, I
5:19
went on to work, before Neo4j,
5:23
at a company that was building an
5:25
AI solution, and
5:27
there I worked on building data infrastructure
5:29
again, for the training of neural networks to
5:32
do computer vision. And after
5:34
that, I worked at Neo4j, which is a graph
5:36
database. So
5:38
my career has spanned quite
5:40
a range of databases and data
5:42
technologies. At Neo4j, I worked on
5:46
both the database and the service products — so,
5:48
you have managing Neo4j clusters as
5:51
a service for you — and on
5:53
the clustering algorithms, doing things like the
5:55
Raft implementation and working on
5:58
scaling and the distributed
6:00
algorithms for Neo4j, so you
6:02
could scale out your clusters to, like, thousands of nodes
6:05
if you wanted to do kind of big data
6:07
graph processing. So yeah, quite a range.
6:10
It was at Neo4j where I met Jason,
6:12
my co-founder at Bobsled. And now
6:14
for the context of this conversation, I'm
6:17
wondering if you can start by giving
6:19
some scope and framing around what we
6:21
mean when we say data sharing, because
6:23
that can mean any number of a
6:25
broad variety of things. And I'm wondering
6:27
if we can just kind of give
6:30
the proper framing for what we want
6:32
to discuss during the rest of this
6:34
conversation. I think we've been sharing data for
6:36
years. Consumers have
6:38
been sharing data for a long time — even if
6:40
you think about something like a tweet as a form
6:42
of data sharing: you write some data and then you
6:45
share it with the world on Twitter.
6:47
And businesses have been doing this
6:50
for years too. You can go well
6:52
back to businesses that used to post
6:54
the data around on CDs. Going back to, I
6:56
think, my very first role: I
6:59
was in the UK, in London, and
7:02
we used to get a CD
7:04
from a company like the Post Office.
7:07
It would be, like, the postcodes for the month, and
7:09
we had all the postcodes and the
7:11
mapping of all the postcodes into kind
7:13
of addresses and regions. And that used to
7:15
be something that people provided on a CD.
7:18
You signed up and you paid money, and they
7:20
sent you the CD in the
7:22
mail. And so data sharing between organizations
7:24
has been going on for years
7:26
in lots of ways, from
7:28
CDs in the mail through
7:30
to APIs and different
7:32
kinds of cloud-based sharing techniques. Right,
7:35
a lot of people
7:37
use APIs. A lot of people
7:39
use APIs for sharing data: I have data, and
7:42
if you call my API and I tell you, you know, tell
7:44
you about some of the data I have, we're
7:47
sharing that data and you're doing something with it. And
7:50
for this conversation, we're concerned with
7:52
data sharing between businesses, and
7:55
data that's being shared really for the purpose
7:57
of analytics — OLAP rather than OLTP. We're
8:01
thinking about a fairly large
8:03
amount of data that you're sharing with someone else so
8:06
that they can use that data in their analytics. And
8:08
the typical usage involves things like joining
8:10
that data with other data that
8:13
the recipient has. It's
8:15
pretty rare that someone just
8:17
says, hey, give me some data, I'm gonna
8:19
analyze it in isolation. You can think of
8:21
some things like in the financial world where
8:24
maybe you say, you give me all of
8:26
the stock ticker data and I'm just gonna
8:28
analyze it in isolation and then try
8:30
and use that to make predictions about what stock prices will be.
8:33
But in reality, even that usage is quite
8:35
rare, and in most other use cases you're saying,
8:38
let's share data between our two enterprises, and then
8:40
each of us — it might be one
8:42
way or two way — but the recipient is saying, join
8:45
that data with some stuff they have or use it in
8:47
their own applications. But it is a broad
8:49
scope, and that's
8:53
the scope of what we're discussing here. And
8:56
so given that context of I
8:59
at organization A want to be able
9:02
to send data to organization B or
9:04
I need to be able to request
9:06
data from organization B to use for
9:08
purposes of some sort of partnership agreement
9:10
or whatever the case might be, what
9:12
is the current state of the art
9:15
and state of the ecosystem for being
9:17
able to enable data sharing across organizational
9:19
boundaries, whether that is separate businesses or
9:21
just different business units within an enterprise
9:24
and some of the complexities that arise
9:26
because of that current state of the ecosystem. Yeah,
9:29
and within the cloud, we do see a lot of
9:31
internal organization data sharing as
9:33
well. We speak to a
9:36
number of people who have problems, particularly larger
9:38
organizations, with geography, or
9:40
where they've done things like acquire different business units
9:42
who have different platforms and things. Yeah,
9:44
it's a great question: what's the current state of
9:46
the art? I started off
9:48
talking about kind of sending CDs in
9:50
the mail. The kind of follow-on
9:52
technology from that really is SFTP —
9:54
sharing CSVs over SFTP. And
9:58
we see this is actually still the dominant
10:01
mechanism today, where
10:03
data is shared between two organizations:
10:06
someone maintains an SFTP server and
10:09
they put CSV files on it.
10:12
There's usually some dance involving sharing
10:15
RSA keys so
10:17
that you can connect over SSH to someone's
10:19
SFTP server and then you can retrieve
10:21
the files. SFTP can
10:23
operate in a push or a pull orientation,
10:26
so you could push data onto
10:28
my SFTP server, or I could pull data from
10:30
your SFTP server. That's been a dominant thing. I
10:32
think some of the shortcomings of
10:34
that are kind of obvious, because it's a very
10:37
manual way to operate this kind
10:40
of environment, so let's not dwell on it
10:42
too much. Then the follow-on from
10:44
that is really data
10:46
APIs: data that is
10:48
kind of shared with you via HTTP. It's
10:50
not fundamentally that different from data shared with
10:53
you on FTP, but it's very prevalent —
10:55
we see it in a lot of, particularly,
10:57
things like SaaS businesses. You make an API
11:00
call, often with some query
11:02
parameters, which is just necessary because you
11:04
can't pack that much data into a
11:06
single HTTP response, and
11:08
so you say, hey, give me some data, here
11:11
are some parameters that scope it down to a
11:13
reasonable size, and I make the API
11:15
call and you return that data. It's often, again,
11:17
ultimately kind of CSV- or JSON-formatted. In some
11:19
scenarios you make an API call and you kind
11:22
of get back a Parquet file or something, but
11:24
that's less common. I talked a bit
11:27
about limitations: bulk transfer is just really not something
11:29
that HTTP is built for, and there
11:31
is a cottage industry of
11:33
home-built tools that people have for
11:35
kind of scraping these APIs and
11:37
then reconstructing your complete tables via
11:39
lots and lots of queries. That
11:42
was really, I think, driven by a lot of people who were
11:44
kind of saying, like, we have a hammer. So,
11:46
you know, the hammer that people had
11:48
is the REST API
11:51
that serves JSON data, and they kind
11:53
of just took that hammer and applied it
11:55
to sharing the data. Then you
11:57
have the major state of the art: at
12:00
the moment, it's connectors. Companies
12:04
like Fivetran, Stitch, and
12:06
others who provide connectors,
12:09
either as a service or as
12:11
software — an open-source product
12:14
you can use and run yourself. That
12:18
really helps you, as a
12:20
consumer of data, to pull in data that
12:22
is shared with you from a range of
12:25
different sources. The connectors can
12:27
connect to these data APIs, they
12:30
can connect to things like FTP. And
12:32
they're based on a kind of
12:34
pull principle: the consumer
12:36
of the data takes responsibility, and they
12:38
use the connectors for getting hold of the
12:41
data that's been shared with them, and
12:43
they are moving the bytes using the connector
12:45
and then putting the data somewhere, whether it's like a
12:47
file store or a data warehouse. And
12:51
connectors, in an ecosystem
12:54
where there's lots of data sharing through
12:56
them, are inherently quite inefficient, because
12:58
every consumer has their own connector
13:01
running, has their own copy of
13:03
the data, and there's kind
13:05
of inefficiency, and there's latency, and there's
13:07
a lot of duplicated compute with lots of
13:09
different people making the same API requests to
13:11
pull the same data into different places. It
13:13
puts the responsibility on the consumer of the
13:16
data to kind of operate and maintain and
13:18
run the system. And then there is in-place
13:20
kind of cloud-native sharing. Pretty
13:22
much every major data platform or cloud
13:24
platform offers that today. So whether it's
13:26
something like S3, which
13:28
has a feature, Access Points, which is
13:31
particularly designed for sharing data between S3
13:34
buckets; whether it's Snowflake sharing;
13:36
BigQuery sharing through Analytics Hub;
13:40
Databricks has Delta Sharing; Azure has
13:42
data sharing. All the platforms offer
13:44
these things, and they do what we call
13:46
in-place sharing. The
13:48
key thing of in-place sharing is the data is
13:51
not duplicated, so you immediately get a
13:53
huge efficiency bonus, particularly if you're sharing
13:55
the same data with lots of
13:57
different users. And on the data warehouses,
14:00
they allow you to share the
14:02
data you have kind of as
14:04
you can see it: you can
14:06
share the tables — not just
14:08
the data, but things like the key constraints
14:10
and the indexes and the views and all that
14:12
stuff. So it's really a richer
14:14
as well as a more efficient approach. When
14:17
we talk about all these things, what it
14:19
shows is that data sharing isn't just
14:21
a purely technical concern. A huge
14:23
part of it is tied up in the kind
14:25
of business, socio-technical arrangement:
14:28
like, when we share data, who takes
14:30
responsibility for what? Who
14:33
takes responsibility for paying the compute cost?
14:35
Who takes responsibility for maintaining the structure
14:37
of the data and the indexes and
14:40
the foreign key constraints and things like
14:42
that? And that's all tied up
14:44
in an approach. So when I talk about connectors,
14:47
there's an implicit expectation in that
14:49
pull approach that the
14:51
consumer of the data will be paying for a lot
14:53
of the compute and stuff that happens. When we
14:56
talk about push FTP, that's the
14:58
reverse expectation of who has responsibility for
15:01
things: the consumer maintains an
15:03
FTP server, but the provider pushes the
15:05
data to it. And that's also true
15:08
of in-place sharing. And yeah, we think
15:10
that in-place sharing provides some of
15:12
the best splits of
15:14
these responsibilities. It allows things like:
15:16
the person who's analyzing
15:19
and running compute on the data
15:21
pays for the compute, but the
15:23
person who's providing the data is
15:25
generally paying for things like the storage. And
15:27
it works incredibly well for people
15:30
in terms of ease of use, because you basically eliminate ETL.
15:33
Right: here's my table in Snowflake, I'd like
15:36
to share it with you. That's it. There is no
15:38
ETL compute process required; you can start analyzing
15:40
straight away. So yeah, that's really the current state
15:42
of the art. And in
15:44
terms of those socio-technical elements
15:47
of data sharing and the
15:49
methods and motivations behind it,
15:52
one of the other complexities
15:54
also comes into the compliance
15:56
question where as the providing
15:58
organization I need to make
16:00
sure that I am eliding or masking certain
16:02
pieces of information because I can't share it
16:05
externally or I need to ensure that there
16:07
are appropriate controls on that data as it
16:09
is being shared so that it is not
16:11
accessible by some man in the middle or
16:14
a third party that is not supposed to
16:16
be involved in this sharing. And then there
16:18
are also questions of public data sharing where
16:20
I as an organization want to be able
16:23
to create and publish a public data set
16:25
that anybody can use, but I
16:27
don't want to have to pay millions of dollars
16:29
because somebody else is using all of
16:31
my compute to do analysis. And
16:33
I'm wondering if you can talk
16:35
to some of the ways that
16:38
those considerations factor into when and
16:40
how businesses decide that they actually
16:42
want to engage in these data
16:44
sharing agreements and some of the
16:46
ways that those considerations will maybe
16:48
prevent what would otherwise be an
16:50
amenable relationship. Yes, there are
16:52
two things that I think
16:54
are tied up there. The compliance
16:57
and privacy and sensitivity management
16:59
have some strong technical aspects.
17:01
It's also good to think about the
17:04
range that we see in data sharing
17:06
kind of arrangements. So we see
17:08
a real range, from
17:10
things like supplier–consumer relationships
17:13
in manufacturing, where the
17:15
kinds of data being shared are stuff like
17:17
how much stock does
17:19
a manufacturer who is providing parts
17:22
to an assembler, like, have. So
17:24
we see this in the automotive
17:26
industry, where the automotive buying organization
17:29
has huge power and they actually
17:31
have arrangements with their providers. They
17:33
say, hey, you've got to tell us how much stock you
17:35
have and how many parts you have on the shelf,
17:37
so they can manage the supply chain. And think
17:40
about that data: that's commercially sensitive information
17:42
that the parties want to keep between themselves. It's
17:45
also not subject to the kind of compliance that you
17:48
might see at the other end of the spectrum,
17:50
when you go to, like, healthcare
17:52
data and compliance like
17:54
HIPAA, and a health insurance company
17:56
in the United States that wants to share
17:58
data with a pharma
18:01
company or a hospital
18:03
organization. And that's a
18:05
very different concern. With
18:07
healthcare data you start to have to think
18:10
not just about what data is accessible, but
18:13
how is it accessed, and what is audited
18:15
and tracked. And in
18:17
Europe you also have things like the right to
18:19
be forgotten, where maybe
18:22
you don't need to change the data, but
18:24
you need to have a way to retract
18:26
data. And doing in-place sharing, on
18:28
the technical side, helps
18:31
with quite a lot
18:33
of that, because the cloud platforms provide things
18:36
that allow you to do things like audit who ran what
18:39
on the data. And if you remove
18:41
data from an in-place share, you know it's gone, whereas
18:43
if someone has copied your CSV files, there's
18:47
no real way to know; you can audit
18:49
this with in-place sharing. But a
18:51
lot of this comes down to
18:53
the kind of relationships
18:58
between organizations as well. So
19:00
organizations, where they've got the right legal agreements
19:03
and the right compliance in place, will
19:05
do sharing based on their industry
19:07
and the kind of data they wanna share. Yeah,
19:09
does that answer the question? Yeah,
19:12
and digging more into the
19:14
mechanical aspects of data sharing, as you mentioned, there
19:16
are a few different ways to think about it
19:19
where one is I have this data, I am
19:21
going to extract it from the system that I
19:23
used to maintain and I'm gonna push it into
19:26
some other system, whether that's S3 or FTP, you
19:29
can take it, do whatever it is you
19:31
want with it, I have no more visibility
19:33
or control over that data versus on the
19:35
other end of the spectrum, you have the
19:37
snowflake and BigQuery approach of I have this
19:39
table, I'm going to make it available to
19:41
you as long as you have
19:43
an account with that same provider, you can
19:45
query it, do whatever you want and I
19:47
have some level of visibility of how it's
19:49
being utilized. But I also still don't maintain
19:51
control over it once you use it because
19:54
maybe you're extracting it elsewhere and I'm wondering
19:56
if you can talk to what
19:58
are maybe some of the shortcomings of
20:00
even that more sophisticated approach of
20:02
the sharing the entire table and
20:04
its context and history and some
20:06
of the technical capabilities that need
20:08
to be present for the data
20:10
sharing solution to be
20:13
effective, whatever effective might mean
20:15
given the context. Yeah,
20:17
particularly around sensitive data and
20:19
things like that, it is as
20:22
much a business area as a technical one, so some
20:24
of it comes down to just the
20:26
contractual agreement. When you look at
20:28
data sharing, there is a
20:30
level of trust and legal enforcement in
20:32
place: you have to
20:34
agree not to do certain things. Outside
20:36
of getting into data clean rooms and differential
20:38
privacy, there's, as I said, not
20:40
a lot you can do to technically
20:42
prevent people from extracting the data you
20:44
share with them. So, some of the things that
20:48
you can do to mitigate that: you can make
20:50
use of views — I keep
20:52
coming back to views — and you can have
20:55
a lot of views, for different
20:57
consumers. And using in-place sharing
20:59
means you can do a lot more of
21:01
that than you can with kind of older
21:04
techniques, because the data doesn't need to be
21:06
duplicated for every view. When you do sharing
21:08
with something like extracted CSV files,
21:10
if you want to share different views of the
21:12
data, you have to extract all the different
21:15
possible combinations into different sets of CSV files
21:17
for different consumers. And that
21:19
obviously means it uses a lot of resources
21:22
and compute and storage and stuff. Whereas with Snowflake
21:24
or Databricks or BigQuery, you can create
21:26
a view that is exactly the data that
21:28
should be seen. And you can use that to
21:30
apply things like obfuscation,
21:32
or some of the kind of differential privacy
21:35
techniques you might use
21:37
when you're exchanging tokens.
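(Editor's note: a rough sketch of what exchanging obfuscated tokens can look like. The key, identifiers, and column names here are invented for illustration; a real implementation would typically do this inside the warehouse via views rather than in application code.)

```python
# Hypothetical sketch of exchanging obfuscated tokens instead of raw identifiers.
# Both parties derive tokens with a keyed hash (HMAC) over an agreed shared key,
# so raw identifiers never leave either organization.
import hashlib
import hmac

SHARED_KEY = b"agreed-out-of-band"  # exchanged under the sharing agreement

def tokenize(identifier: str) -> str:
    """Deterministic keyed hash: same input + key -> same token on both sides."""
    return hmac.new(SHARED_KEY, identifier.lower().encode(), hashlib.sha256).hexdigest()

# Provider publishes tokens plus only the columns the agreement allows.
provider_rows = {
    tokenize("alice@example.com"): {"segment": "premium"},
    tokenize("bob@example.com"): {"segment": "trial"},
}

# Recipient tokenizes its own identifiers the same way and joins on the token,
# learning the shared attributes only for individuals both parties already know.
recipient_ids = ["bob@example.com", "carol@example.com"]
matches = {t: provider_rows[t] for t in map(tokenize, recipient_ids) if t in provider_rows}
print(matches)  # only Bob's row matches; Carol stays unknown to the provider
```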
21:39
So, one of the things that
21:41
some people do is two-way sharing, where
21:44
I share with you an
21:46
obfuscated token, and
21:48
that allows you to identify the data that you
21:50
have that you would then share back to
21:52
me. You join on the obfuscated token and
21:55
then you share the data back with me
21:57
for only the rows that match the obfuscated
21:59
token. It's a little technical, but
22:01
it means that where
22:03
we both have data related to the same individuals, we
22:05
can ensure that we share the join
22:07
of those data without necessarily sharing the
22:09
details of what we know about those
22:12
individuals. So there is something that
22:14
can be done there. But
22:18
at the limit, you do get into clean
22:20
rooms. Once you get beyond your
22:22
confidence to operate with an
22:25
openness based on kind of contractual agreements
22:27
and the legal framework that's in place, you
22:30
get into the world of clean rooms, which
22:32
are fully controlled environments, and
22:35
they often move and maintain copies
22:37
of the data. And the clean
22:39
room solution is a little bit different from what
22:41
we do: you're actually saying, we set
22:43
up this environment, you log into it, and you have
22:45
very controlled access over what you can do in
22:47
that environment, and, like, whether you can export
22:50
data out of it. And in
22:52
terms of the work that
22:54
you're doing at Bobsled and
22:56
some of the specific problem
22:58
areas that you're trying to
23:00
solve for, what is
23:02
the kind of unique set of capabilities
23:04
that you're enabling that aren't present in
23:07
these other platforms or some of the
23:09
ways that you are approaching the problem
23:12
that is maybe vendor agnostic
23:14
or removes the constraint of
23:16
everybody having to use the
23:18
same technology platform? Yeah,
23:21
the largest problem that
23:24
people face — and I've touched on it a few times —
23:26
is that for in-place sharing to work,
23:28
the party providing the data
23:30
and the party receiving the data have
23:33
to actually be
23:36
on the same kind of cloud platform and
23:38
region. So we talked about
23:40
things like Snowflake — they're a real leader
23:42
here — but Snowflake sharing is only truly
23:44
straightforward if we're both
23:46
in the same Snowflake region on
23:48
the same platform, so we're both in US East 1
23:50
on AWS. If you're using Snowflake on GCP, in
23:52
EU Central, it isn't impossible, but
23:55
you then have to do database replication with Snowflake,
23:58
and it's no longer "I want to share this
24:00
table with you": it's actually a whole process — we
24:02
have to replicate the database and then do a
24:05
share in the different region. So even within the same
24:07
platform, there are challenges. But the major challenge and
24:09
shortcoming of doing in-place
24:11
sharing is that we have
24:13
to agree on the platform that we're
24:16
gonna use, and that's extremely difficult
24:18
in practice. It's impractical, if
24:20
you're a provider of data,
24:22
to relocate your data operation into
24:25
another cloud. Your data is
24:27
often the result of a whole process of
24:30
what's done to collect that data, so your
24:33
data ties up with other things
24:35
that you have in place, so you
24:37
don't relocate your data onto Azure because you've
24:39
got someone who wants it on Azure. And
24:41
it's almost never
24:43
practical for a recipient of data to
24:45
relocate their usage either. As we said, it's very rare for
24:47
data to be used in isolation: you want to join
24:49
it with other data you have and
24:51
feed it into your existing
24:54
processes, whether they're analytical or
24:56
transactional, and
24:58
relocating your application to a different platform
25:00
just to receive some data that's
25:03
shared with you doesn't make sense. So in this
25:05
many-to-many environment where there's really high
25:07
cardinality — particularly when you take into account
25:10
cloud regions: you could be on
25:12
Snowflake, you could be on Databricks, you
25:14
could be on BigQuery, and then you could
25:16
easily be in different regions in the same
25:18
platform — there's a huge fragmentation
25:21
problem that just isn't solved for you unless
25:23
someone makes the move to
25:25
a different
25:28
platform.
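(Editor's note: a rough way to picture that platform-by-region cardinality problem. The decision logic, platform names, and region names below are purely illustrative — not a description of Bobsled's internals or any vendor's exact rules.)

```python
# Illustrative sketch: native in-place sharing generally requires provider and
# consumer to match on platform, and often on region too. All names are invented.
def native_sharing_path(provider: dict, consumer: dict) -> str:
    if provider["platform"] != consumer["platform"]:
        return "no native share: extract/load or a cross-cloud service needed"
    if provider["region"] != consumer["region"]:
        return "same platform, different region: replicate first, then share"
    return "direct in-place share"

a = {"platform": "snowflake", "region": "aws-us-east-1"}
b = {"platform": "snowflake", "region": "gcp-eu-central"}
c = {"platform": "bigquery", "region": "eu"}

print(native_sharing_path(a, a))  # direct in-place share
print(native_sharing_path(a, b))  # same platform, different region: replicate first, then share
print(native_sharing_path(a, c))  # no native share: extract/load or a cross-cloud service needed
```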
25:30
And that's one of the massive things that
25:32
we're solving for with Bobsled. So our
25:34
aim is to provide that really
25:36
simple, straightforward experience where
25:39
you say, I want to share these
25:41
specific views or tables, or this specific
25:43
data from my object storage, to
25:45
this person. And with Bobsled, you
25:47
say, this is where I want to share it to. And
25:50
so you can say, I want to share
25:52
it to BigQuery, I want to share it
25:54
to Azure Blob Storage, and what
25:57
the recipient experiences with Bobsled is
25:59
exactly that same straightforward share
26:01
in the cloud-native way of
26:03
the platform that they're on. And what
26:06
the provider experiences is that we either access
26:08
their data directly or we access it
26:10
via a simple share. And we
26:12
solve the problem of, like, how does the
26:15
data move from one place to the other,
26:18
and how do we maintain efficient
26:20
sharing if you're doing sharing to
26:22
multiple people in the same destination
26:25
and region, without
26:27
doing things like replicating more copies of your data. So
26:29
yeah, this allows people to maintain
26:32
that kind of shift-left simplicity,
26:35
taking responsibility for structuring their data, making
26:38
it analytics-ready and usable,
26:41
with the ability to
26:43
straightforwardly share it to someone else without anyone having
26:45
to think about all the ETL and the compute
26:48
that's involved. And that's what Bobsled
26:50
does under the hood. And then another challenge
26:53
to this data sharing question that
26:55
we touched on a little bit,
26:57
but is that question of auditability
26:59
and governance when you are sending
27:01
data to another entity because at
27:03
a certain point, there's no way
27:06
for you to maintain control anymore
27:08
because once somebody has access to
27:10
the data, even if you want
27:12
them to be analyzing it in
27:14
situ, there's always the possibility
27:16
that they're going to extract it and do
27:18
some other thing with it. And I'm wondering
27:20
how that factors into the ways that
27:23
the sociotechnical aspect comes into play
27:25
with some of the sharing agreements
27:27
and some of the regulation and
27:29
compliance aspects of doing data sharing,
27:31
particularly when you're dealing with something
27:33
like healthcare data and you're maybe
27:35
a medical provider sharing patient data
27:38
with a medical researcher for being
27:40
able to develop some new sort
27:42
of therapy, etc. And
27:44
I'm wondering about the ways
27:46
that the sharing protocol maybe can and
27:48
should incorporate that audit and access control
27:50
and governance enforcement in
27:52
the process of that access and
27:55
sharing. Okay, a lot of that
27:57
comes down to the agreements and what
27:59
the protocol can do is help people to
28:01
be very clear and have a shared
28:03
understanding of the agreement. So for example, with
28:06
something like right to be
28:08
forgotten, we can help people to standardise
28:10
on the way that they communicate things
28:12
that need to be deleted. So
28:15
if we are both signed
28:17
up and compliant to the
28:19
European data privacy rules around
28:22
that, if
28:25
you're a subprocessor for my data, and if
28:28
I pass on a right-to-be-forgotten request
28:31
to you, you need to process that and
28:33
delete the relevant data. And if that's the
28:35
contract between us, we get to
28:37
the practical level, you get into
28:39
practical questions like, well, okay, how do we do that?
28:41
How do we communicate to you the information so that
28:43
we can be reasonably confident that we've given it to
28:45
you and that you know what to do with it
28:47
and that you then process the
28:50
deletes. And there are some interesting challenges with that particular
28:52
thing of how do you keep track of the
28:54
fact that you have deleted something and
28:57
can also prove that you have deleted it, right? And you have
28:59
to keep track of like, we know that we did
29:01
delete this and can prove we deleted it, but we
29:03
don't actually have the data, because we deleted it.
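The bookkeeping problem just described — proving you deleted something without keeping it — can be sketched with a keyed hash. This is a hypothetical illustration, not a description of Bobsled's implementation; the key name and the receipt shape are invented for the example:

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical sketch: prove a record was deleted without retaining it.
# Only a keyed hash of the identifier is kept, never the identifier itself.
RECEIPT_KEY = b"per-tenant-secret"  # invented for the example; real key management would differ

def deletion_receipt(record_id: str) -> dict:
    """Record an auditable receipt for having deleted `record_id`."""
    digest = hmac.new(RECEIPT_KEY, record_id.encode(), hashlib.sha256).hexdigest()
    return {
        "deleted_id_hmac": digest,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    }

def verify_receipt(receipt: dict, claimed_id: str) -> bool:
    """Whoever holds the key can check a receipt against the id they asked us to delete."""
    expected = hmac.new(RECEIPT_KEY, claimed_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(receipt["deleted_id_hmac"], expected)

receipt = deletion_receipt("user-12345")
print(verify_receipt(receipt, "user-12345"))  # True
print(verify_receipt(receipt, "user-99999"))  # False
```

The requester can later re-present an identifier and have it checked against the receipt, while the processor never retains the raw value.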
29:05
And so I think we can really help people to
29:08
standardise around, you know, how they
29:10
communicate and share that data and things like
29:12
whether that's something that people also want to
29:14
hook up to with an API. Like,
29:16
maybe we don't just want, you
29:18
know, the kind of identifiers of things you
29:21
need to delete sitting in a table, but
29:23
also an API call that
29:26
you can make to process that, just to
29:28
trigger it. And we also
29:30
like sharing being more
29:32
rich than just the data. So
29:35
some of the things that you can
29:37
share in the platform are things like
29:39
user-defined functions, stored
29:41
procedures, and things
29:43
like that. So there you can help people
29:45
to share a function that can do things
29:48
like carry out a delete and
29:50
all of those different things: you run the
29:52
function and it generates the output. And
29:55
two-way sharing can be something that people
29:57
can use as part of a compliance
29:59
process. Can you share back to us?
30:02
The data, or something that's computed
30:04
over the data, like a checksum. So
30:06
can you provide us some kind of
30:08
receipt that shows that you've carried out
30:10
certain actions by sharing data back. And
30:13
again, something where we can really help by
30:15
providing the expertise to go from Snowflake
30:17
to Databricks, and Databricks back again to Snowflake,
30:20
which means that each party can be operating
30:22
where they have the confidence, the expertise.
30:25
You can run a query that means that you
30:27
can provide some receipts. And we
30:29
also do abstractions over things like the
30:31
telemetry and the audit logs. So we can
30:34
say, if there is a company sharing data to
30:36
someone else, you can go into Bobsled, get
30:39
your audit logs, and get the
30:42
data, as far as it's available, depending
30:44
on the destination platform. But you can get that in this
30:46
kind of single Bobsled view that abstracts it, and you don't have to
30:48
be linking the audit logs
30:50
of four or five different platforms
30:52
if you need to verify
30:55
something, some question around access to that data
30:57
itself. Part of our plans, although
30:59
it's not yet something we do, is
31:01
communicating governance rules
31:03
and requirements as well. So
31:06
one thing is that you need to agree, you
31:09
know, contractually, and say, you're going to
31:11
abide by these rules on how we're gonna process the
31:13
data. And if you're subject to something like
31:15
HIPAA, you know, I
31:17
will know that you get audited with ISO 27001
31:20
or SOC 2. We're
31:23
working towards HIPAA compliance. I
31:25
think that if you're a compliant organization, you have to comply,
31:28
and although I can't
31:30
necessarily audit you directly and say,
31:32
you know, look at what you've done,
31:34
I can have confidence that you are audited.
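One possible shape for that kind of governance communication — requirements that travel with the share, with the recipient's attestations recorded against them — could look like the following sketch. All names here are invented for illustration; this is not an existing Bobsled API:

```python
from dataclasses import dataclass, field

# Illustrative sketch (invented names, not an existing Bobsled feature): attach
# governance requirements to a share and record the recipient's attestations.
@dataclass
class ShareManifest:
    dataset: str
    requirements: list               # e.g. ["HIPAA-BAA", "ISO-27001"]
    attested: set = field(default_factory=set)

    def attest(self, requirement: str) -> None:
        # The recipient explicitly signs off on each stated requirement.
        if requirement not in self.requirements:
            raise ValueError(f"unknown requirement: {requirement}")
        self.attested.add(requirement)

    def compliant(self) -> bool:
        # Release the share only once every requirement has been attested to.
        return set(self.requirements) == self.attested

share = ShareManifest("patient_visits", ["HIPAA-BAA", "ISO-27001"])
share.attest("HIPAA-BAA")
print(share.compliant())  # False: ISO-27001 not yet attested
share.attest("ISO-27001")
print(share.compliant())  # True
```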
31:36
And one thing that we're looking
31:38
at is providing a way for people to
31:41
say what their kind of governance requirements are
31:43
and have that clearly passed along with the
31:45
data. So that the recipients of the data
31:47
clearly can see the governance
31:49
requirements of this, and can attest and say,
31:51
like, yes, we meet those requirements and
31:54
helping to kind of make that part of the data
31:56
sharing protocol and tying it up with
31:58
the business associate agreements of HIPAA. And
32:01
at what point do point-to-point connections
32:04
for data sharing reach their limits
32:06
and you need to then step
32:08
into the situation of having a
32:10
data brokerage for escrowing certain data
32:12
sets that multiple organizations need to
32:15
be able to have access to
32:17
and what are some of the
32:19
ways that the data sharing protocols
32:21
can maybe also help to reduce
32:24
friction of populating those data sets
32:26
and consuming those data sets. Yeah,
32:29
one of the things we do at Bobsled
32:31
is we kind of combine, hopefully, what is
32:33
the best of the in-place sharing and
32:35
we're doing some of the work of
32:37
moving data around and achieving efficiency. So
32:39
I talked about when you have a
32:42
kind of shift right approach, every
32:45
consumer of the data has their own copy of
32:47
data and their own compute doing ETL and so
32:50
on. When we do a data
32:52
share from one platform to another, some ETL has
32:54
to happen. Obviously, in-place sharing requires the data to
32:56
be in place. If you want in-place
32:58
sharing in Google Cloud, it's got to be in Google
33:00
Cloud, and if you want in-place sharing in
33:02
Snowflake, it's got to be in place in Snowflake. If
33:05
you want both at once, you have to have two copies of
33:07
data. What we can do is make sure that if
33:10
you're sharing the same data to 10 people
33:13
in AWS US East One, that
33:15
there is only one copy of the
33:17
data in AWS US East One and
33:19
all of those 10 people are then
33:21
consuming a view on that
33:24
data. So we can ensure that
33:26
you're getting the best possible efficiency of
33:28
what's there, and as
33:30
we think about revoking access and
33:33
things like that as well, what we do
33:35
provide is a very simple way for people to
33:37
do things like revoke access. As
33:39
you start to think about the challenges
33:41
that people face, trying to achieve data
33:43
sharing into multiple clouds, multiple
33:45
platforms. If you
33:47
want to say, hey, we want to revoke access
33:49
from someone now, the work that ends
33:52
up being involved can be quite significant, right? You have
33:54
to go into each platform and where
33:56
they might be used and individually figure out how
33:59
to revoke access, and that's for every different platform,
34:01
every different person. So
34:03
with Bobsled, you can just go into Bobsled, say revoke
34:05
that access, and we'll make sure that for every different person that
34:08
access is revoked. And where
34:10
you've got multiple people sharing from the
34:12
same dataset, you've got 10 consumers in
34:14
one particular region, we can handle things
34:16
like the kind of garbage collection and
34:18
reference counting. So we'll maintain
34:20
that data in that location until it's not
34:22
being used by anyone in that location. And
34:25
then we can evict that data. And there's obviously a
34:27
number of technical challenges in
34:30
terms of orchestration and
34:32
management and things like that for us to do.
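The garbage-collection and reference-counting behaviour described above can be sketched roughly like this — a simplified model for illustration, not Bobsled's actual orchestration code:

```python
from collections import defaultdict

# Simplified model of the behaviour described: one physical copy per
# (dataset, region), kept alive only while some consumer still holds a grant.
class RegionalCopies:
    def __init__(self):
        self.consumers = defaultdict(set)  # (dataset, region) -> set of consumer ids

    def grant(self, dataset: str, region: str, consumer: str) -> None:
        # Ten consumers in one region still mean only one stored copy.
        self.consumers[(dataset, region)].add(consumer)

    def revoke(self, dataset: str, region: str, consumer: str) -> None:
        key = (dataset, region)
        self.consumers[key].discard(consumer)
        if not self.consumers[key]:
            del self.consumers[key]
            self.evict(dataset, region)  # reference count hit zero: garbage-collect

    def evict(self, dataset: str, region: str) -> None:
        print(f"deleting unused copy of {dataset} in {region}")

copies = RegionalCopies()
for c in ("acme", "globex", "initech"):
    copies.grant("trades", "aws-us-east-1", c)
copies.revoke("trades", "aws-us-east-1", "acme")     # copy stays: two consumers left
copies.revoke("trades", "aws-us-east-1", "globex")
copies.revoke("trades", "aws-us-east-1", "initech")  # last one out: copy is evicted
```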
34:34
You mentioned kind of escrow as well. Yeah,
34:37
escrow is, there are a ton of
34:39
different understandings of it, and it is a use case
34:41
that we talk to various people
34:43
about. At the moment, one
34:46
of our approaches there has
34:48
been to say we can help to ensure that
34:51
data is available in different places and
34:53
take advantage of that. But often, if you
34:56
want to kind of escrow data in certain conditions, the
34:58
first thing to do is use cryptography
35:01
and then manage the cryptographic
35:03
key. So we can say
35:05
we can share the encrypted data between a bunch
35:08
of companies. We've experienced this personally
35:10
around source code escrow.
35:12
So when you're a startup working with
35:14
enterprises, they'll say if you go
35:16
bankrupt as a business, or
35:18
stop serving in some way, we'd like to have
35:20
the possibility that we could maybe continue to operate
35:23
the service. So you need to put your source
35:25
code in escrow. And there are some companies who
35:27
provide that kind of service, but you can also
35:29
do this sort of DIY thing where you're saying
35:31
if you want a large amount of data in
35:33
an escrow, you can encrypt it.
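That encrypt-then-escrow-only-the-key pattern can be illustrated as follows. The XOR keystream here is a toy stand-in purely for illustration; a real implementation would use a vetted cipher such as AES-GCM:

```python
import hashlib
import secrets

# Illustration of the pattern: encrypt the bulk data yourself, escrow only the key.
# The XOR keystream below is a toy for illustration; real use would take a vetted
# cipher such as AES-GCM.
def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Symmetric toy cipher: XOR data against a SHA-256-derived keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

key = secrets.token_bytes(32)             # the small secret that goes to the escrow agent
payload = b"many terabytes of source data, in principle"
ciphertext = keystream_xor(key, payload)  # this can be replicated and shared freely
print(keystream_xor(key, ciphertext) == payload)  # True: releasing the key releases the data
```

The large ciphertext can sit with every party, while the escrow agreement only ever has to handle the 32-byte key.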
35:36
Bobsled can help you then move that data
35:38
around. And
35:40
then the escrow process can just focus on the
35:42
key, a small piece of data, and you can work
35:44
with a law firm or accountant, one of the
35:46
people who provide that service to say that we'll
35:49
hold the keys to that in escrow, and then
35:51
everyone just has the encrypted data. In
35:53
terms of the boundaries
35:56
that you're crossing with these
35:58
data transfers, the technical arrangements,
36:00
the organizational arrangements, what are
36:02
some of the typical
36:05
situations in which you encounter those
36:07
types of boundaries and the ways
36:09
that they are defined and delineated?
36:11
And I imagine most of that
36:14
is just purely organizational, but what
36:16
are the cases where technical
36:18
requirements actually necessitate these data
36:21
transfer systems versus just being
36:23
able to do direct integration
36:26
between them? Yeah,
36:28
we see the kind of within
36:30
an organization boundary as
36:33
well as the between-organization boundaries you talked about before.
36:35
So sometimes it can be different
36:37
regions within an organization.
36:39
So for example,
36:42
you might have the UK
36:44
office on one system
36:46
and the South African office on another system. And
36:49
sometimes that can be necessitated by
36:51
things like regional rules
36:53
or regional availability of
36:56
services. Another thing we
36:58
can sometimes see driving
37:00
technical requirements is things like AI processing
37:02
availability and stuff like that. So people's
37:05
choices for where they want to analyze
37:07
their data may not just be
37:09
driven by the myriad of reasons that you
37:11
choose a cloud or so on, but it
37:13
might be related to specific AI or blockchain
37:15
or similar technical requirements. So if you want
37:18
to use certain OpenAI things, you maybe
37:20
need to be on a Microsoft platform. And
37:22
if you want to use certain
37:24
blockchain systems, you may need to
37:26
be on another platform or
37:29
another location. The other one that
37:31
drives kind of regional things is
37:33
compliance and the rules around that.
37:35
So you want to keep data within
37:37
certain geographical boundaries. One thing
37:40
we allow people to do is control which regions
37:42
and platforms they allow data to be shared to.
37:44
So you can have data in Bobsled and
37:46
make that data possible to be shared
37:48
to any region, any cloud platform.
37:50
You can limit it down and say, this
37:52
data is only allowed to be shared within
37:54
the EU. It could be on any platform,
37:57
but still on the regions of those platforms
37:59
that are EU. So you
38:01
see boundaries that are geographical and
38:03
regulatory, rather. There are regulatory boundaries
38:06
in the clouds too:
38:08
there are kind of GovCloud services and
38:10
some types of
38:12
healthcare clouds, and they're kept separate,
38:14
so it's like a cloud for healthcare
38:17
data. So there's obviously some compliance
38:19
boundaries there. I'm trying to think
38:21
through what other things we've seen
38:23
that might come into this. There
38:25
are also the cloud-to-cloud boundaries.
38:27
This is something that we're
38:30
aware of and keep in mind for the future. When
38:32
you have a real asymmetry between
38:34
organizations, so you maybe have quite
38:37
a small organization working with a
38:39
very large one: a small organization with limited
38:41
capacity to do sophisticated things, working
38:44
with a large organization that's able
38:46
to do very sophisticated things or expects very
38:48
sophisticated things. That creates a
38:50
kind of technical boundary of what kind
38:52
of solutions they might use,
38:54
and your small organization might want
38:56
to be using something like Google Sheets. And
38:59
It's not something we support today, but
39:01
it is something that BigQuery can do,
39:04
to say something like, I'd like
39:06
to share from a Google Sheet into
39:09
BigQuery, and then from BigQuery onwards,
39:11
using Bobsled, to anywhere. So yeah, we
39:13
can see those kind of things where someone says,
39:15
I want to go from a really quite
39:17
different kind of system, and
39:20
we can kind of take away the work of
39:22
getting data from a very different system into
39:24
or out of the cloud. And
39:27
in your experience of working in
39:29
this space of data sharing and
39:32
the socio-technical aspects that come into
39:34
play, what are some of the
39:36
most interesting or innovative or unexpected
39:39
applications of that protocol and capability
39:41
that you've seen? One
39:44
really interesting thing we've
39:46
seen from customers is auto-fulfillment
39:48
from a CRM. We
39:51
allow driving all of this through a
39:54
single API, so you can call the Bobsled
39:56
API and set up a share or transfer or make
39:58
a change. And
40:00
we've had really interesting discussions with customers where
40:03
they're kind of directly connecting a CRM
40:05
into Bobsled. So
40:08
you do some activities in your system,
40:11
in something like Salesforce or so on, and
40:13
Salesforce can send information
40:15
directly into Bobsled, Bobsled can share, Bobsled
40:18
can make webhook notifications back to
40:20
the CRM, and you can
40:22
actually achieve auto-fulfillment, with your
40:25
salesperson or account manager using the
40:27
system that they're familiar with. Having a
40:29
data share be created in reaction, and actually getting
40:31
updates back in their CRM, without the
40:33
person leaving the CRM and without the company
40:35
using it really building kind of bespoke software, because
40:38
rather than having to run some separate platform and
40:40
maintain the servers, they're able to build it into
40:42
the extensibility of a platform like
40:44
Salesforce. Which is really cool, to see that
40:46
people are able to get these things up
40:48
and running without leaving their own CRM, without
40:50
having to build their own back-end server and
40:52
their own major development process. Another thing that
40:55
I think is very cool that we do internally
40:57
is we do Bobsled
40:59
to Bobsled, so we can send
41:01
data from one
41:04
place in Bobsled to another place and
41:06
then use that data destination as a
41:08
source for further onward Bobsledding. We
41:10
use that internally, including
41:12
doing things like
41:14
using it to share data back to our
41:17
customers about things like their usage. So
41:20
if you want to get data about your
41:22
usage of Bobsled, we're working on providing that
41:25
as a Bobsled share that you can then consume in
41:28
BigQuery or in Snowflake or so on.
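The chaining idea — one share's destination becoming the source of the next — can be modelled minimally like this (hypothetical names, not the Bobsled API):

```python
from dataclasses import dataclass

# Minimal model of share chaining (hypothetical names, not the Bobsled API):
# the destination of one share becomes the source of the next.
@dataclass
class Share:
    source: str
    destination: str

def chain(*hops: str) -> list:
    """Build a pipeline of shares where each hop feeds the next one."""
    return [Share(a, b) for a, b in zip(hops, hops[1:])]

# e.g. internal usage data -> a staging location -> the customer's warehouse
pipeline = chain("internal-usage-db", "staging-share", "customer-bigquery")
for share in pipeline:
    print(f"{share.source} -> {share.destination}")
```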
41:30
Another thing that we've seen people do
41:32
is having data they've got
41:34
as a source in something like
41:36
a CSV. We support loading CSVs
41:38
into data warehouses. So they
41:40
use Bobsled to load the data
41:43
that they've got in CSVs into
41:45
Snowflake or Databricks
41:47
or BigQuery. They then set
41:49
that Snowflake or Databricks or BigQuery up as
41:52
a Bobsled source, and
41:54
then they use the capabilities of Snowflake
41:57
to make views and so on over what
41:59
they had in CSV, and
42:01
then they use that Snowflake as a Bobsled source
42:04
to then do further onward sharing. So they're
42:06
actually using Bobsled to kind of do an
42:08
ETL process and bootstrap
42:11
themselves from a kind
42:13
of non-cloud-native sharing protocol
42:15
world into a cloud-native sharing
42:17
protocol world, by using
42:19
Bobsled as a bootstrap, and Snowflake can then do onward sharing,
42:22
using Bobsled from that Snowflake. And
42:24
in your experience of building Bobsled,
42:26
working closely in this context of
42:28
organizational data sharing, what are some
42:30
of the most interesting or unexpected
42:32
or challenging lessons that you've learned
42:35
in the process? There's
42:37
always a lot of challenging lessons
42:40
from operating a startup. At an
42:43
early-stage startup, as a founder,
42:45
you're often dealing with whatever the most serious
42:47
problem in the business is at any given
42:49
time. Yeah, so one of the biggest
42:51
challenges we've seen is the
42:54
complexity of building an abstraction over
42:57
all of these different Cloud systems. So I talked
43:00
with Jake, he's my co-founder, and
43:02
we were observing that Bobsled, as a product, is kind
43:04
of a simple concept, right? One of
43:06
the simplest concepts that either of us have
43:08
worked on in our career is in certain
43:11
ways. Compared to the commitment of a
43:13
graph database. Conceptually, a graph
43:15
database is a really complex thing.
43:17
But Bobsled is a very straightforward
43:19
product for you to share data
43:22
from your storage or data warehouse to another
43:24
storage or data warehouse. What
43:27
we've seen is a real tension between the simplicity
43:29
of the concept and the
43:31
challenge of building an abstraction
43:33
over all the different Cloud and warehouse
43:36
platforms. And one of the things for
43:38
me here is that we
43:41
don't own this kind of stack all
43:43
the way down. So what we have to work
43:46
with aren't kind of the theoretical
43:48
limitations that you might have when, say,
43:50
building Raft. If you're building a
43:52
Raft system, you
43:54
can go and read the kind of PhD papers
43:56
and so on related to it, and you can
43:59
understand the constraints, such as the CAP theorem, or
44:01
the speed of light. And you basically then
44:04
are up against those challenges. You can try
44:06
and build against that and control it and
44:08
understand it. We don't have that
44:10
kind of deep tech or
44:12
hard tech challenge. There's not at
44:14
our core a really hard challenging
44:17
AI problem or a challenging CAP
44:19
theorem, distributed systems problem, or something that we're solving
44:21
for people in a really smart way. What
44:24
we're challenged with is all of these
44:26
different abstractions that are present in
44:29
Azure, Databricks, Snowflake, BigQuery,
44:31
and they superficially are
44:33
quite similar. But
44:36
as you get into trying to manage and work
44:38
with all of them, you discover that they are
44:40
different. And the devil of data engineering
44:42
lives in these details. And yeah, that's been
44:44
a really, not entirely unexpected,
44:46
challenge, but that's been where we've discovered a lot
44:49
of challenges actually to build an abstraction. Even
44:51
across object storage, we
44:53
find that in AWS, you
44:56
have the access point abstraction, which is
44:58
really great, but it's not present in the other
45:00
clouds. And so you build something on AWS, and
45:03
then you realize you can't really build a comparable
45:05
abstraction on Google Cloud Storage. So you have
45:07
to do something quite different. Or as we get
45:09
into things like executing serverless functions, you know, for
45:11
our work, we execute serverless functions to do work
45:14
in AWS, GCP and Azure. And so we have
45:16
to build out an abstraction for
45:18
managing serverless functions running on different clouds. And
45:20
that's kind of a challenge in itself
45:23
that some organizations have, like, an
45:25
entire team for, so it's
45:27
kind of a platform team
45:29
building technology that allows people to do that. And
45:32
that is one of the problems that we've
45:34
solved internally so that we can say, hey, to
45:36
do object storage sharing to all the different clouds,
45:38
we need an abstraction that means we can run
45:40
some serverless compute, means we can
45:42
make some certain assumptions about how data is
45:44
stored, and we can make some straightforward ways
45:46
of saying how we grant or revoke
45:48
access to a share. Each of those end
45:50
up being surprisingly different and nuanced between different
45:52
platforms. And for people who
45:55
are exploring the problem
45:57
space of being able to send
46:00
data from one system
46:02
to another, whether that's across organizational boundaries
46:04
or across technical boundaries. What are the
46:06
cases where bobsled is the wrong choice
46:08
or what are the cases where they
46:11
should just reconsider the entire application and
46:13
avoid data sharing entirely? I think the
46:15
biggest time when bobsled is the wrong
46:18
choice is kind of when you know
46:20
you can say, I just
46:22
don't need to. I'm a huge fan
46:24
of identifying we don't need
46:26
to do things and you can often
46:28
find yourself in a situation where you feel like you
46:30
need to do something because that's how it's done, or
46:32
other people do it, and things like that. But
46:36
a bit of analysis can show that maybe we don't. But
46:38
one of the main cases where bobsled is the wrong choice is
46:41
when something like a data clean room is the
46:43
right choice. That's when the
46:46
reassurances you want around what's visible to
46:48
someone and what's done with it, whether
46:50
or not it's kind of been extracted and
46:52
so on, are so stringent
46:54
that you need to make use of
46:56
a data clean room. And there's some
46:58
really cool technologies in that space around
47:00
things like differential privacy and things where
47:02
you can have systems that allow you
47:04
to make kind of aggregate queries that
47:07
don't reveal the underlying data but allow you to
47:09
query the data in the aggregate and things
47:11
like that. And for all of
47:13
those, bobsled is the wrong choice or would
47:15
have to be part of a much more
47:17
complicated solution architecture. At times we
47:20
talk to people,
47:22
we talk to people who want
47:24
to do a migration. They want to
47:26
migrate from GCP to AWS, their
47:28
entire stack. They say, well, that's something Bobsled
47:31
can do. Bobsled can move data from a
47:33
data store in one place to another. Can
47:36
we use Bobsled for a migration? At the moment
47:38
that's something where we would generally say, bobsled
47:40
is not the right choice. If
47:42
you're doing exactly one move from exactly one
47:44
place to another, there's probably already
47:46
a tool in the destination that's
47:49
good enough for what you want to
47:51
achieve, like Azure Data
47:53
Factory or something. If you're just concerned with
47:55
getting data in to just one platform, then
47:58
you can probably use the native tooling on that platform. And
48:00
we're usually advocates of using
48:02
the native tooling, like use the native sharing
48:05
protocols and things
48:07
like that. Yeah, and then I'm trying to think
48:09
of situations where we've come
48:11
across where we've sort of said, you know,
48:14
do you even need to do data sharing? Like,
48:16
perhaps you should rethink that. I think
48:18
there are cases where the reverse is true,
48:20
where you might be like, should this be an API?
48:22
There are situations where
48:25
people are like, we have a
48:27
JSON REST API hammer, so we're gonna
48:29
treat everything as a nail with a REST API,
48:31
JSON over HTTP. They're
48:34
coming to the question: should this
48:37
use case be something that you're managing with
48:39
analytic data sharing? Or should it be
48:41
something that's actually an API or
48:44
a webhook or something else? You
48:46
could take an event stream
48:49
and that might be right. Another way you
48:51
can do cross-organization kind of synchronization in
48:53
some of the clouds is using things like
48:56
an event stream where you can do cross-organization
48:58
listeners on it, or
49:20
any particular projects or problem areas you're
49:22
excited to explore? I might give you two
49:24
answers there. One thing I'm really excited
49:26
about or kind of passionate about is
49:29
tackling something that we call modern data
49:31
stack fatigue. So you're probably
49:33
familiar with this. There's
49:35
a whole raft of technologies that go
49:37
into the modern data stack. I have a
49:39
controversial comparison with Kubernetes. Kubernetes
49:42
has this kind of landscape diagram,
49:44
which is quite famous, showing you
49:46
like all of the Kubernetes ecosystem technologies. I
49:48
don't know if you're familiar with it, but it's incredibly
49:51
dense, unreadable unless your
49:53
screen is six foot wide, kind of
49:55
thing. And the modern data stack
49:57
is going in a similar direction, with
50:00
a whole host of different tools
50:02
for doing each different individual thing that
50:04
you might do. And yes, the same
50:06
has happened
50:09
in the world of data and data
50:11
engineering and analytics. We've had an explosion of
50:14
all of the different tools and technologies
50:17
and services and infrastructure-as-a-service
50:19
and platform-as-a-service and everything
50:21
else. And now we're in
50:23
a situation where the fatigue
50:25
sets in, sort of, oh, anyone
50:27
on my team kind of managing the stack is
50:29
like, well, I have to know like six or
50:31
seven different technologies. And then when you get into
50:33
things like hiring, you're suddenly like, well,
50:36
our stack is this particular combination. When
50:38
you're hiring, you're like, we want to hire someone who
50:40
has this exact combination of experience and you're like, well,
50:42
that person doesn't exist because there are so many different
50:45
combinations of possible things that no one has used exactly
50:47
that combination. And we're in a
50:49
macro economic environment where there isn't necessarily a
50:51
budget for everyone to have every single tool,
50:53
right? And people are a bit more focused
50:55
on what you can do being
50:57
lean. And do I really need to
51:00
have a bunch of services running for this? We
51:02
think something like DuckDB is really cool, right?
51:04
DuckDB has a kind of minimalist
51:06
approach, which means, you know, you don't have a
51:09
bunch of services necessarily running, and like you can
51:11
run analytics on your M3 laptop. And
51:15
the thing that's really exciting for us is
51:17
that we can kind of help
51:20
people approach some of that because we don't
51:22
have a kind of horse in any of
51:24
these races, right? Within that fatigue,
51:27
there are sort of different philosophical holy
51:29
wars, the kind of Emacs versus Vim
51:31
type of things, right? Like, should you
51:33
have a lake house or a warehouse
51:35
or? We can
51:37
help people to do data sharing sort of
51:39
regardless of, you know,
51:41
what technology choices they've made. And
51:44
I really hope to kind of help people achieve
51:48
simplicity in the
51:51
face of all of this complexity of
51:53
options. And yeah, I'm really excited
51:55
to see what we can do to
51:58
cover more of these bases and help people. We're
52:00
interested in incorporating
52:02
things like DuckDB
52:04
and things like that, but also in
52:06
use cases where people can analyze data
52:08
kind of directly in place. And
52:11
I think today you might need to move the data
52:14
into your cloud so you can
52:16
analyze it with your BigQuery. One
52:18
of the things that I think would be very cool is:
52:20
do you need to move
52:22
the data into the destination cloud and into
52:24
BigQuery, or can you kind of issue a
52:27
query directly, using something like DuckDB on
52:29
the source data, and we never need to
52:31
do the ETL part, right? And things like the
52:33
shortcuts some platforms provide point in that
52:35
direction. Yeah, there's a lot of things that I'm excited
52:38
about. I'm also excited about
52:40
things like two-way sharing, as
52:42
I talked about. There's various different use cases and
52:45
they're all quite interesting where people say, I
52:47
want to share something to you and then you
52:49
would, for example, enrich it. You
52:52
attach it to the data you have, then
52:54
you perform some analysis or
52:57
scoring over that data, and then
52:59
you kind of send something back to me
53:01
that is meaningfully transformed or enriched. That's one
53:03
of the things that I'm looking
53:06
forward to getting into, because it starts
53:08
to unlock higher levels of
53:10
value, enabling people to collaborate, and
53:13
as I talked about at the beginning, we can
53:15
help the industry to do things more
53:17
efficiently. We talked about how you end up duplicating data
53:19
if you have data being
53:21
copied from one place to another and then
53:23
processed and so on. Helping the
53:26
industry to be efficient but also to achieve
53:28
higher value. Because sharing is
53:30
part of a collaboration process and
53:32
if we can do two-way sharing, we can
53:34
help to unlock higher value collaboration. Indeed,
53:36
one of the things that was one of
53:39
our founding convictions is that
53:41
enabling collaboration between organizations is kind
53:43
of a net beneficial thing. It
53:47
helps improve efficiency, it helps organizations make
53:49
better decisions, and those things
53:49
are broadly in the interest of
53:51
consumers and users as well. And
53:57
are there any other aspects of the
53:59
overall space? of data sharing, both
54:01
the technical aspects, the organizational challenges, the
54:03
ways that you're approaching it at Bobsled
54:05
that we didn't discuss yet that you'd
54:07
like to cover before we close out
54:09
the show? I think
54:12
there is stuff
54:14
I love talking about around this like shift left
54:16
and shift right mentality and
54:18
like who has the responsibility for
54:20
doing things. So we talked about in
54:23
data industry, shift left is something that
54:25
we talk about and see as
54:27
a broadly good thing: this idea of
54:30
moving the responsibility and the work leftwards,
54:33
to the kind of source of the data, and
54:35
saying let's have the person who produces the data
54:37
do the work, not the users inside an organization. So
54:40
you know, we shift left to a kind of data
54:42
team in the organization and have
54:44
a central team who's ensuring
54:46
that data is clean, that it's well set
54:49
up, that it's easy to query, that it
54:51
is optimized with stuff like indexes and aggregations, and
54:54
that drives efficiency, compared with the shift
54:56
right mentality where you say we just dump the
54:58
data and all the consumers have to figure
55:00
out how they're going to use it and
55:03
how they're going to compute everything. And what
55:05
we do, one thing we can help with,
55:07
not so much within an organization,
55:09
which we don't generally tackle, but across organizations,
55:11
and within those more complex organizations
55:13
that do have boundaries, is a further
55:16
shift-left approach, where you say the people who
55:20
are best placed, the producers of the data,
55:22
structure the data
55:26
and generally manage the ongoing life of
55:28
the data and evolution
55:31
of the schema and attending of
55:33
new data and all those challenges
55:35
that really make up a lot
55:37
of the work of data engineering around
55:39
methods or the kind of the nifty
55:41
gritty stuff like oh no what happens
55:43
when a stock kind of splits or
55:46
one country changes its code or something like
55:48
that and the curve balls that you have
55:50
to deal with when you're managing a schema.
55:53
We can help that to be centralized, which is
55:55
more efficient and more effective. Within data
55:57
sharing between organizations,
55:59
we can help shift left
56:01
and we can help reduce ETL, which are
56:04
two of the major pain points
56:06
of a lot of data engineering: I spend
56:08
so much time on ETL before I can write
56:10
my analysis, or I spend so
56:13
much time on data cleaning and processing
56:15
before I actually do my analysis. I
56:17
think the work that we do can really
56:19
help tackle those for a range of organizations.
56:21
It's a really challenging realm where,
56:23
for a lot of people who work there,
56:25
the alternative is some major DIY project:
56:28
try and build some
56:30
subset of this functionality yourself, or
56:33
try and persuade a commercial partner
56:35
to make some pretty significant decision,
56:37
like doing that work across different clouds.
56:40
Absolutely. Well, for anybody who wants to get in
56:42
touch with you and follow along with the work
56:44
that you're doing, I'll have you add your preferred
56:47
contact information to the show notes. And as the
56:49
final question, I'd like to get your perspective on
56:51
what you see as being the biggest gap in
56:53
the tooling or technology that's available for data management
56:55
today. That's an interesting question
56:57
in the context of the modern data stack.
57:00
So, we have a huge range
57:02
of tools, and in some
57:04
ways, part of the challenge is that
57:06
proliferation itself. So, I have some
57:08
experience in the AI space, and
57:12
obviously, that's extremely hot
57:14
and very busy right now. I think
57:16
that there isn't, I'm not an expert,
57:18
sorry, but one thing that I think
57:21
there isn't is a really
57:23
good approach to vector databases and
57:25
embeddings. Obviously,
57:29
a lot of people are attempting
57:31
to build out good solutions around vector
57:33
databases and managing embeddings. I've
57:35
spoken to quite a lot of startup
57:38
founders and founding engineers based on my
57:40
previous experiences who are trying to do
57:45
things like similarity search and
57:47
KNN and Levenshtein
57:50
distance, and
57:53
all these kind of very
57:55
standard data analytics things over
57:58
vectors that are being produced from kind
58:00
of AI practices like large language models and deep
58:02
neural networks for computer vision. And that is an
58:04
area where, you know, I spoke to a lot
58:06
of people about what they're doing to tackle it. And
58:09
most of the existing tools, which are all
58:11
very new, are very
58:14
pricey or very inefficient beyond
58:16
kind of very small toy projects. And
58:19
a lot of people I spoke to there are
58:21
running into challenges and building their own. And
58:23
that's also where they're talking to me, around
58:25
the data infrastructure management: like, how do we
58:27
manage the infrastructure and build our own thing on
58:30
top of something like Spark and start running
58:32
these algorithms at scale over vectors? So
58:36
yeah, I guess maybe it's kind of
58:38
an obvious startup-founder
58:40
answer, but I think kind of AI and
58:42
vector database solutions is somewhere I think
58:45
there is a gap for a really good tool. All
58:47
right, well, thank you very much for taking
58:49
the time today to join me and share
58:52
your experiences of working in this space of
58:54
data transfer and enabling that for different organizations,
58:56
making that a simpler problem to solve. So
58:58
I appreciate all the time and energy that
59:00
you and your team are putting into that.
59:02
And I hope you enjoy the rest of
59:04
your day. Thank you. Thank you
59:06
very much. It was a pleasure. Thank
59:15
you for listening. Don't forget to check
59:17
out our other shows, Podcast.__init__, which covers
59:19
the Python language, its community and the
59:21
innovative ways it is being used. And
59:23
the machine learning podcast, which
59:25
helps you go from idea to production with machine
59:27
learning. Visit the site
59:29
at dataengineeringpodcast.com to subscribe to
59:32
the show, sign up for the mailing list and read
59:34
the show notes. And if you've learned something
59:36
or tried out a product from the show, then tell us about
59:38
it. Email hosts
59:40
at dataengineeringpodcast.com with your
59:42
story. And to help other people find the
59:44
show, please leave a review on Apple Podcasts
59:46
and tell your friends. Thank
59:55
you.
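As a postscript to the vector search discussion in the episode: the guest mentions teams hand-rolling similarity search and k-nearest-neighbor queries over embedding vectors before any dedicated vector database fits their scale or budget. For a sense of what that workload looks like at toy scale, here is a minimal, illustrative brute-force KNN over embeddings using cosine similarity. This is a sketch, not code from any tool discussed in the episode, and the function names are hypothetical.

```python
import math

def knn_cosine(query, corpus, k=5):
    """Return indices of the k corpus vectors most similar to `query`,
    ranked by cosine similarity (highest first)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Brute force: score every corpus vector against the query, then sort.
    # This is O(n * d) per query, which is why it only works for small
    # collections and why teams end up reaching for vector databases.
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]

# Toy usage: four 3-dimensional "embeddings".
corpus = [[1.0, 0.0, 0.0],
          [0.9, 0.1, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
print(knn_cosine([1.0, 0.05, 0.0], corpus, k=2))  # -> [0, 1]
```

At production scale, scoring every vector per query stops being feasible, which is exactly the gap the guest describes: approximate-nearest-neighbor indexes and vector databases exist to make this tractable beyond toy collections.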