Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
Use Ctrl + F to search
0:11
Hello and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Data
0:16
lakes are notoriously complex. For
0:19
data engineers who battle to build
0:21
and scale high-quality data workflows on
0:23
the data lake, Starburst powers petabyte-scale
0:25
SQL analytics fast at a fraction
0:27
of the cost of traditional methods
0:29
so that you can meet all
0:31
of your data needs, ranging from
0:33
AI to data applications to complete
0:35
analytics. Trusted by teams of all
0:37
sizes, including Comcast and DoorDash, Starburst
0:39
is a data lake analytics platform
0:41
that delivers the adaptability and flexibility
0:43
a lakehouse ecosystem promises. And
0:46
Starburst does all of this on an
0:48
open architecture, with first-class support for Apache
0:50
Iceberg, Delta Lake, and Hudi, so you
0:53
always maintain ownership of your data. Want
0:56
to see Starburst in action? Go
0:58
to dataengineeringpodcast.com/Starburst and get $500
1:00
in credits to try Starburst
1:02
Galaxy today, the easiest and
1:04
fastest way to get started
1:06
using Trino. Dagster
1:08
offers a new approach to building
1:10
and running data platforms and data
1:13
pipelines. It is an open-source, cloud-native
1:15
orchestrator for the whole development lifecycle,
1:17
with integrated lineage and observability, a
1:19
declarative programming model, and best-in-class testability.
1:22
Your team can get up and running
1:25
in minutes thanks to Dagster Cloud, an
1:27
enterprise-class hosted solution that offers serverless and
1:29
hybrid deployments, enhanced security, and on-demand ephemeral
1:32
test deployments. Go to
1:34
dataengineeringpodcast.com/dagster today to get started, and
1:36
your first 30 days are free.
1:39
Your host is Tobias Macey, and today I'm interviewing
1:41
Andy Jefferson about how to solve the problem of
1:44
data sharing. Can you start by introducing yourself? Yeah,
1:46
hi Tobias. I'm Andy. I'm the CTO
1:48
at Bobsled. We're
1:50
a Series A startup solving
1:53
the problem of data sharing for enterprises in
1:55
the cloud. And do you remember how
1:57
you first got started working in data? For me, like, software is a
1:59
very, very important thing. Software engineering has always
2:01
kind of been about moving and processing data,
2:04
whether it's getting a
2:06
tweet or an iMessage from
2:08
my phone to your phone, or
2:10
whether it's control software in
2:13
a power plant or a chemical plant
2:15
as it's taking input data from sensors
2:17
and things and then processing
2:20
that data and then creating output data for the
2:22
signals to the control software and pumps and
2:24
things like that. I started
2:26
relatively late: during my
2:28
PhD — I was doing a PhD in chemical
2:30
engineering — I started with
2:33
work on control software. What I enjoyed
2:35
about that was that you actually wrote software
2:37
that did something very tangible and
2:40
real, and it worked in the real world, and
2:42
from that I got into computer
2:44
modelling during my PhD, and
2:46
then a little later I actually quit
2:48
my PhD to work in software, and I
2:50
did that because I found I was enjoying
2:53
the software engineering more than I was
2:55
enjoying the welding together of tons of
2:57
stainless steel. And then from
2:59
there, my first job was as the database administrator
3:02
on SQL Server — Microsoft SQL Server
3:04
as it was then. So my very
3:06
first job in software engineering was in
3:09
the data realm. Doing a lot
3:11
of administration, I got
3:13
first-hand experience of the move from on-prem
3:16
to cloud. I remember
3:18
when Microsoft SQL Server on Azure was
3:20
launched, and testing it out
3:22
and being really excited about it. We
3:24
had SQL Server on-premises initially
3:26
and then moved to the cloud,
3:29
and I got the chance to see
3:31
how that transition really benefited us: no longer
3:33
having to care for the boxes and think about what
3:36
could go wrong with them, and doing
3:38
the upgrades and all that kind of stuff. And
3:40
yeah, from there I think it's been a pretty exciting
3:42
and fun career so far, moving through different
3:44
kinds of things. So from there
3:47
I moved
3:49
into
3:51
a company that was doing an OLAP
3:54
database built on top of
3:56
Cassandra, at a time when NoSQL was a big thing.
4:00
NoSQL was the new hot thing, and OLAP
4:03
was very fashionable. It was a company called
4:05
Acunu, who were building an OLAP solution on
4:07
top of Cassandra. And we worked
4:10
with some large ride-sharing firms and
4:12
things there. So I
4:15
got to see some of the power of, like, big data
4:17
and the things you could do, at
4:20
real scale. And
4:22
from Acunu, I went to Apple,
4:24
where I got my
4:27
first experience doing data sharing —
4:29
consumer data sharing. There
4:31
I was working on the kinds
4:34
of protocols and databases that were
4:37
related to how you share, not just between
4:39
devices — syncing
4:41
some of the things like photos and
4:43
updates between different devices — but we
4:45
also worked on the first protocols for
4:47
doing sharing where you could do stuff like share
4:49
photos with another person. So
4:52
we also did some big data processing work
4:54
there, and we were doing reference
4:56
counting at scale. So
4:58
when you're sharing things, you need to keep
5:00
track of how many people have shared it,
5:03
and if all of the people who've shared it
5:05
have deleted it, then you can go on to
5:07
garbage collect it and things. We were doing that with
5:09
very large Hadoop jobs — that
5:12
was the thing that was of its time; sure,
5:15
it'd probably be Spark today. After
5:17
working at Apple, I
5:19
went on to work, before Neo4j,
5:23
at a company that was building an
5:25
AI solution, and
5:27
there I worked on building data infrastructure
5:29
again, for the training of neural networks to
5:32
do computer vision. And after
5:34
that, I worked at Neo4j, which is a graph
5:36
database. So
5:38
my career has spanned quite
5:40
a range of databases and data
5:42
technologies. At Neo4j, I worked on
5:46
both the database and the service products — so,
5:48
you have managing Neo4j clusters as
5:51
a service for you — and on
5:53
the clustering algorithms, doing things like the
5:55
Raft implementation and working on
5:58
scaling and the distributed
6:00
algorithms for Neo4j, so you
6:02
could scale out your clusters to, like, thousands of nodes
6:05
if you wanted to do kind of big data
6:07
graph processing. So yeah, quite a range.
6:10
It was at Neo4j where I met Jason,
6:12
my co-founder at Bobsled. And now
6:14
for the context of this conversation, I'm
6:17
wondering if you can start by giving
6:19
some scope and framing around what we
6:21
mean when we say data sharing, because
6:23
that can mean any number of a
6:25
broad variety of things. And I'm wondering
6:27
if we can just kind of give
6:30
the proper framing for what we want
6:32
to discuss during the rest of this
6:34
conversation. I think we've been sharing data for
6:36
years. Consumers have
6:38
been sharing data for a long time — even if
6:40
you think about something like a tweet as a form
6:42
of data sharing: you write some data and then you
6:45
share it with the world on Twitter.
6:47
And businesses have been doing this
6:50
for years too. You can go well
6:52
back to businesses that used to post
6:54
the data around on CDs. Going back to, I
6:56
think, my very first role: I
6:59
was in the UK, in London, and
7:02
we used to get a CD
7:04
from a company like the Post Office.
7:07
It would be, like, the postcodes for the month, and
7:09
we had all the postcodes and the
7:11
mapping of all the postcodes into kind
7:13
of addresses and regions. And that used to
7:15
be something that people provided on a CD.
7:18
You signed up and you paid money, and they
7:20
sent you the CD in the
7:22
mail. And so data sharing between organizations
7:24
has been going on for years
7:26
in lots of ways, from
7:28
CDs in the mail through
7:30
to APIs and different
7:32
kinds of cloud-based sharing techniques. Right,
7:35
a lot of people
7:37
use APIs. A lot of people
7:39
use APIs for sharing data: I have data, and
7:42
if you call my API and I tell you, you know, tell
7:44
you about some of the data I have, we're
7:47
sharing that data and you're doing something with it. And
7:50
for this conversation, we're concerned with
7:52
data sharing between businesses, and
7:55
data that's being shared really for the purpose
7:57
of analytics — OLAP rather than OLTP. We're
8:01
thinking about a fairly large
8:03
amount of data that you're sharing with someone else so
8:06
that they can use that data in their analytics. And
8:08
the typical usage involves things like joining
8:10
that data with other data that
8:13
the recipient has. It's
8:15
pretty rare that someone just
8:17
says, hey, give me some data, I'm gonna
8:19
analyze it in isolation. You can think of
8:21
some things like in the financial world where
8:24
maybe you say, you give me all of
8:26
the stock ticker data and I'm just gonna
8:28
analyze it in isolation and then try
8:30
and use that to make predictions about what stock prices will be.
8:33
But in reality, even that usage is quite
8:35
rare, and in most other use cases you're saying,
8:38
let's share data between our two enterprises, and then
8:40
each of us — it might be one
8:42
way or two way — but the recipient is saying, join
8:45
that data with some stuff they have or use it in
8:47
their own applications. But it is a broad
8:49
scope, and that's
8:53
the scope of what we're discussing here. And
8:56
so given that context of I
8:59
at organization A want to be able
9:02
to send data to organization B or
9:04
I need to be able to request
9:06
data from organization B to use for
9:08
purposes of some sort of partnership agreement
9:10
or whatever the case might be, what
9:12
is the current state of the art
9:15
and state of the ecosystem for being
9:17
able to enable data sharing across organizational
9:19
boundaries, whether that is separate businesses or
9:21
just different business units within an enterprise
9:24
and some of the complexities that arise
9:26
because of that current state of the ecosystem. Yeah,
9:29
and within the cloud, we do see a lot of
9:31
internal organization data sharing as
9:33
well. We speak to a
9:36
number of people who have problems, particularly larger
9:38
organizations, with geography, or
9:40
where they've done things like acquire different business units
9:42
who have different platforms and things. Yeah,
9:44
it's a great question: what's the current state of
9:46
the art? I started off
9:48
talking about kind of sending CDs in
9:50
the mail. The kind of follow-on
9:52
technology from that really is SFTP —
9:54
sharing CSVs over SFTP. And
9:58
we see this is actually still the dominant
10:01
mechanism today, where
10:03
data is shared between two organizations:
10:06
someone maintains an SFTP server and
10:09
they put CSV files on it.
10:12
There's usually some dance involving sharing
10:15
RSA keys so
10:17
that you can connect over SSH to someone's
10:19
SFTP server and then you can retrieve
10:21
the files. SFTP can
10:23
operate in a push or a pull orientation,
10:26
so you could push data onto
10:28
my SFTP server, or I could pull data from
10:30
your SFTP server. That's been a dominant thing. I
10:32
think some of the shortcomings of
10:34
that are kind of obvious, because it's a very
10:37
manual way to operate this kind
10:40
of environment, so let's not dwell on it
10:42
too much. Then the follow-on from
10:44
that is really data
10:46
APIs: data that is
10:48
kind of shared with you via HTTP. It's
10:50
not fundamentally that different from data shared with
10:53
you on FTP, but it's very prevalent —
10:55
we see it in a lot of, particularly,
10:57
things like SaaS businesses. You make an API
11:00
call, often with some query
11:02
parameters, which is just necessary because you
11:04
can't pack that much data into a
11:06
single HTTP response, and
11:08
so you say, hey, give me some data, here
11:11
are some parameters that scope it down to a
11:13
reasonable size, and I make the API
11:15
call and you return that data. It's often, again,
11:17
ultimately kind of CSV- or JSON-formatted. In some
11:19
scenarios you make an API call and you kind
11:22
of get back a Parquet file or something, but
11:24
that's less common. I talked a bit
11:27
about limitations: bulk transfer is just really not something
11:29
that HTTP is built for, and there
11:31
is a cottage industry of
11:33
home-built tools that people have for
11:35
kind of scraping these APIs and
11:37
then reconstructing your complete tables via
11:39
lots and lots of queries. That
11:42
was really, I think, driven by a lot of people who were
11:44
kind of saying, like, we have a hammer. So,
11:46
you know, the hammer that people had
11:48
is the REST API
11:51
that serves JSON data, and they kind
11:53
of just took that hammer and applied it
11:55
to sharing the data. Then you
11:57
have the major state of the art: at
12:00
the moment, it's connectors. Companies
12:04
like Fivetran, Stitch, and
12:06
others who provide connectors,
12:09
either as a service or as
12:11
software — an open-source product
12:14
you can use and run yourself. That
12:18
really helps you, as a
12:20
consumer of data, to pull in data that
12:22
is shared with you from a range of
12:25
different sources. The connectors can
12:27
connect to these data APIs, they
12:30
can connect to things like FTP. And
12:32
they're based on a kind of
12:34
pull principle: the consumer
12:36
of the data takes responsibility, and they
12:38
use the connectors for getting hold of the
12:41
data that's been shared with them, and
12:43
they are moving the bytes using the connector
12:45
and then putting the data somewhere, whether it's like a
12:47
file store or a data warehouse. And
12:51
connectors, in an ecosystem
12:54
where there's lots of data sharing through
12:56
them, are inherently quite inefficient, because
12:58
every consumer has their own connector
13:01
running, has their own copy of
13:03
the data, and there's kind
13:05
of inefficiency, and there's latency, and there's
13:07
a lot of duplicated compute with lots of
13:09
different people making the same API requests to
13:11
pull the same data into different places. It
13:13
puts the responsibility on the consumer of the
13:16
data to kind of operate and maintain and
13:18
run the system. And then there is in-place
13:20
kind of cloud-native sharing. Pretty
13:22
much every major data platform or cloud
13:24
platform offers that today. So whether it's
13:26
something like S3, which
13:28
has a feature, Access Points, which is
13:31
particularly designed for sharing data between S3
13:34
buckets; whether it's Snowflake sharing;
13:36
BigQuery sharing through Analytics Hub;
13:40
Databricks has Delta Sharing; Azure has
13:42
data sharing. All the platforms offer
13:44
these things, and they do what we call
13:46
in-place sharing. The
13:48
key thing of in-place sharing is the data is
13:51
not duplicated, so you immediately get a
13:53
huge efficiency bonus, particularly if you're sharing
13:55
the same data with lots of
13:57
different users. And on the data warehouses,
14:00
they allow you to share the
14:02
data you have kind of as
14:04
you can see it: you can
14:06
share the tables — not just
14:08
the data, but things like the key constraints
14:10
and the indexes and the views and all that
14:12
stuff. So it's really a richer
14:14
as well as a more efficient approach. When
14:17
we talk about all these things, what it
14:19
shows is that data sharing isn't just
14:21
a purely technical concern. A huge
14:23
part of it is tied up in the kind
14:25
of business, socio-technical arrangement:
14:28
like, when we share data, who takes
14:30
responsibility for what? Who
14:33
takes responsibility for paying the compute cost?
14:35
Who takes responsibility for maintaining the structure
14:37
of the data and the indexes and
14:40
the foreign key constraints and things like
14:42
that? And that's all tied up
14:44
in an approach. So when I talk about connectors,
14:47
there's an implicit expectation in that
14:49
pull approach that the
14:51
consumer of the data will be paying for a lot
14:53
of the compute and stuff that happens. When we
14:56
talk about push FTP, that's the
14:58
reverse expectation of who has responsibility for
15:01
things: the consumer maintains an
15:03
FTP server, but the provider pushes the
15:05
data to it. And that's also true
15:08
of in-place sharing. And yeah, we think
15:10
that in-place sharing provides some of
15:12
the best splits of
15:14
these responsibilities. It allows things like:
15:16
the person who's analyzing
15:19
and running compute on the data
15:21
pays for the compute, but the
15:23
person who's providing the data is
15:25
generally paying for things like the storage. And
15:27
it works incredibly well for people
15:30
in terms of ease of use, because you basically eliminate ETL.
15:33
Right: here's my table in Snowflake, I'd like
15:36
to share it with you. That's it. There is no
15:38
ETL compute process required; you can start analyzing
15:40
straight away. So yeah, that's really the current state
15:42
of the art. And in
15:44
terms of those socio-technical elements
15:47
of data sharing and the
15:49
methods and motivations behind it,
15:52
one of the other complexities
15:54
also comes into the compliance
15:56
question where as the providing
15:58
organization I need to make
16:00
sure that I am eliding or masking certain
16:02
pieces of information because I can't share it
16:05
externally or I need to ensure that there
16:07
are appropriate controls on that data as it
16:09
is being shared so that it is not
16:11
accessible by some man in the middle or
16:14
a third party that is not supposed to
16:16
be involved in this sharing. And then there
16:18
are also questions of public data sharing where
16:20
I as an organization want to be able
16:23
to create and publish a public data set
16:25
that anybody can use, but I
16:27
don't want to have to pay millions of dollars
16:29
because somebody else is using all of
16:31
my compute to do analysis. And
16:33
I'm wondering if you can talk
16:35
to some of the ways that
16:38
those considerations factor into when and
16:40
how businesses decide that they actually
16:42
want to engage in these data
16:44
sharing agreements and some of the
16:46
ways that those considerations will maybe
16:48
prevent what would otherwise be an
16:50
amenable relationship. Yes, there are
16:52
two things that I think
16:54
are tied up there. The compliance
16:57
and privacy and sensitivity management
16:59
have some strong technical aspects.
17:01
It's also good to think about the
17:04
range that we see in data sharing
17:06
kind of arrangements. So we see
17:08
a real range, from
17:10
things like supplier–consumer relationships
17:13
in manufacturing, where the
17:15
kinds of data being shared are stuff like
17:17
how much stock does
17:19
a manufacturer who is providing parts
17:22
to an assembler, like, have. So
17:24
we see this in the automotive
17:26
industry, where the automotive buying organization
17:29
has huge power and they actually
17:31
have arrangements with their providers. They
17:33
say, hey, you've got to tell us how much stock you
17:35
have and how many parts you have on the shelf,
17:37
so they can manage the supply chain. And think
17:40
about that data: that's commercially sensitive information
17:42
that the parties want to keep between themselves. It's
17:45
also not subject to the kind of compliance that you
17:48
might see at the other end of the spectrum,
17:50
when you go to, like, healthcare
17:52
data and compliance like
17:54
HIPAA, and a health insurance company
17:56
in the United States that wants to share
17:58
data with a pharma
18:01
company or a hospital
18:03
organization. And that's a
18:05
very different concern. With
18:07
healthcare data you start to have to think
18:10
not just about what data is accessible, but
18:13
how is it accessed, and what is audited
18:15
and tracked. And in
18:17
Europe you also have things like the right to
18:19
be forgotten, where maybe
18:22
you don't need to change the data, but
18:24
you need to have a way to retract
18:26
data. And doing in-place sharing, on
18:28
the technical side, helps
18:31
with quite a lot
18:33
of that, because the cloud platforms provide things
18:36
that allow you to do things like audit who ran what
18:39
on the data. And if you remove
18:41
data from an in-place share, you know it's gone, whereas
18:43
if someone has copied your CSV files, there's
18:47
no real way to know; you can audit
18:49
this with in-place sharing. But a
18:51
lot of this comes down to
18:53
the kind of relationships
18:58
between organizations as well. So
19:00
organizations, where they've got the right legal agreements
19:03
and the right compliance in place, will
19:05
do sharing based on their industry
19:07
and the kind of data they wanna share. Yeah,
19:09
does that answer the question? Yeah,
19:12
and digging more into the
19:14
mechanical aspects of data sharing, as you mentioned, there
19:16
are a few different ways to think about it
19:19
where one is I have this data, I am
19:21
going to extract it from the system that I
19:23
used to maintain and I'm gonna push it into
19:26
some other system, whether that's S3 or FTP, you
19:29
can take it, do whatever it is you
19:31
want with it, I have no more visibility
19:33
or control over that data versus on the
19:35
other end of the spectrum, you have the
19:37
snowflake and BigQuery approach of I have this
19:39
table, I'm going to make it available to
19:41
you as long as you have
19:43
an account with that same provider, you can
19:45
query it, do whatever you want and I
19:47
have some level of visibility of how it's
19:49
being utilized. But I also still don't maintain
19:51
control over it once you use it because
19:54
maybe you're extracting it elsewhere and I'm wondering
19:56
if you can talk to what
19:58
are maybe some of the shortcomings of
20:00
even that more sophisticated approach of
20:02
the sharing the entire table and
20:04
its context and history and some
20:06
of the technical capabilities that need
20:08
to be present for the data
20:10
sharing solution to be
20:13
effective, whatever effective might mean
20:15
given the context. Yeah,
20:17
particularly around sensitive data and
20:19
things like that, it is as
20:22
much a business area as a technical one, so some
20:24
of it comes down to just the
20:26
contractual agreement. When you look at
20:28
data sharing, there is a
20:30
level of trust and legal enforcement in
20:32
place: you have to
20:34
agree not to do certain things. Outside
20:36
of getting into data clean rooms and differential
20:38
privacy, there's, as I said, not
20:40
a lot you can do to technically
20:42
prevent people from extracting the data you
20:44
share with them. So, some of the things that
20:48
you can do to mitigate that: you can make
20:50
use of views — I keep
20:52
coming back to views — and you can have
20:55
a lot of views, for different
20:57
consumers. And using in-place sharing
20:59
means you can do a lot more of
21:01
that than you can with kind of older
21:04
techniques, because the data doesn't need to be
21:06
duplicated for every view. When you do sharing
21:08
with something like extracted CSV files,
21:10
if you want to share different views of the
21:12
data, you have to extract all the different
21:15
possible combinations into different sets of CSV files
21:17
for different consumers. And that
21:19
obviously means it uses a lot of resources
21:22
and compute and storage and stuff. Whereas with Snowflake
21:24
or Databricks or BigQuery, you can create
21:26
a view that is exactly the data that
21:28
should be seen. And you can use that to
21:30
apply things like obfuscation,
21:32
or some of the kind of differential privacy
21:35
techniques you might use
21:37
when you're exchanging tokens.
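(Editor's note: a rough sketch of what exchanging obfuscated tokens can look like. The key, identifiers, and column names here are invented for illustration; a real implementation would typically do this inside the warehouse via views rather than in application code.)

```python
# Hypothetical sketch of exchanging obfuscated tokens instead of raw identifiers.
# Both parties derive tokens with a keyed hash (HMAC) over an agreed shared key,
# so raw identifiers never leave either organization.
import hashlib
import hmac

SHARED_KEY = b"agreed-out-of-band"  # exchanged under the sharing agreement

def tokenize(identifier: str) -> str:
    """Deterministic keyed hash: same input + key -> same token on both sides."""
    return hmac.new(SHARED_KEY, identifier.lower().encode(), hashlib.sha256).hexdigest()

# Provider publishes tokens plus only the columns the agreement allows.
provider_rows = {
    tokenize("alice@example.com"): {"segment": "premium"},
    tokenize("bob@example.com"): {"segment": "trial"},
}

# Recipient tokenizes its own identifiers the same way and joins on the token,
# learning the shared attributes only for individuals both parties already know.
recipient_ids = ["bob@example.com", "carol@example.com"]
matches = {t: provider_rows[t] for t in map(tokenize, recipient_ids) if t in provider_rows}
print(matches)  # only Bob's row matches; Carol stays unknown to the provider
```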
21:39
So, one of the things that
21:41
some people do is two-way sharing, where
21:44
I share with you an
21:46
obfuscated token, and
21:48
that allows you to identify the data that you
21:50
have that you would then share back to
21:52
me. You join on the obfuscated token and
21:55
then you share the data back with me
21:57
for only the rows that match the obfuscated
21:59
token. It's a little technical, but
22:01
it means that where
22:03
we both have data related to the same individuals, we
22:05
can ensure that we share the join
22:07
of those data without necessarily sharing the
22:09
details of what we know about those
22:12
individuals. So there is something that
22:14
can be done there. But
22:18
at the limit, you do get into clean
22:20
rooms. Once you get beyond your
22:22
confidence to operate with an
22:25
openness based on kind of contractual agreements
22:27
and the legal framework that's in place, you
22:30
get into the world of clean rooms, which
22:32
are fully controlled environments, and
22:35
they often move and maintain copies
22:37
of the data. And the clean
22:39
room solution is a little bit different from what
22:41
we do: you're actually saying, we set
22:43
up this environment, you log into it, and you have
22:45
very controlled access over what you can do in
22:47
that environment, and, like, whether you can export
22:50
data out of it. And in
22:52
terms of the work that
22:54
you're doing at Bobsled and
22:56
some of the specific problem
22:58
areas that you're trying to
23:00
solve for, what is
23:02
the kind of unique set of capabilities
23:04
that you're enabling that aren't present in
23:07
these other platforms or some of the
23:09
ways that you are approaching the problem
23:12
that is maybe vendor agnostic
23:14
or removes the constraint of
23:16
everybody having to use the
23:18
same technology platform? Yeah,
23:21
the largest problem that
23:24
people face — and I've touched on it a few times —
23:26
is that for in-place sharing to work,
23:28
the party providing the data
23:30
and the party receiving the data have
23:33
to actually be
23:36
on the same kind of cloud platform and
23:38
region. So we talked about
23:40
things like Snowflake — they're a real leader
23:42
here — but Snowflake sharing is only truly
23:44
straightforward if we're both
23:46
in the same Snowflake region on
23:48
the same platform, so we're both in US East 1
23:50
on AWS. If you're using Snowflake on GCP, in
23:52
EU Central, it isn't impossible, but
23:55
you then have to do database replication with Snowflake,
23:58
and it's no longer "I want to share this
24:00
table with you": it's actually a whole process — we
24:02
have to replicate the database and then do a
24:05
share in the different region. So even within the same
24:07
platform, there are challenges. But the major challenge and
24:09
shortcoming of doing in-place
24:11
sharing is that we have
24:13
to agree on the platform that we're
24:16
gonna use, and that's extremely difficult
24:18
in practice. It's impractical, if
24:20
you're a provider of data,
24:22
to relocate your data operation into
24:25
another cloud. Your data is
24:27
often the result of a whole process of
24:30
what's done to collect that data, so your
24:33
data ties up with other things
24:35
that you have in place, so you
24:37
don't relocate your data onto Azure because you've
24:39
got someone who wants it on Azure. And
24:41
it's almost never
24:43
practical for a recipient of data to
24:45
relocate their usage either. As we said, it's very rare for
24:47
data to be used in isolation: you want to join
24:49
it with other data you have and
24:51
feed it into your existing
24:54
processes, whether they're analytical or
24:56
transactional, and
24:58
relocating your application to a different platform
25:00
just to receive some data that's
25:03
shared with you doesn't make sense. So in this
25:05
many-to-many environment where there's really high
25:07
cardinality — particularly when you take into account
25:10
cloud regions: you could be on
25:12
Snowflake, you could be on Databricks, you
25:14
could be on BigQuery, and then you could
25:16
easily be in different regions in the same
25:18
platform — there's a huge fragmentation
25:21
problem that just isn't solved for you unless
25:23
someone makes the move to
25:25
a different
25:28
platform.
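(Editor's note: a rough way to picture that platform-by-region cardinality problem. The decision logic, platform names, and region names below are purely illustrative — not a description of Bobsled's internals or any vendor's exact rules.)

```python
# Illustrative sketch: native in-place sharing generally requires provider and
# consumer to match on platform, and often on region too. All names are invented.
def native_sharing_path(provider: dict, consumer: dict) -> str:
    if provider["platform"] != consumer["platform"]:
        return "no native share: extract/load or a cross-cloud service needed"
    if provider["region"] != consumer["region"]:
        return "same platform, different region: replicate first, then share"
    return "direct in-place share"

a = {"platform": "snowflake", "region": "aws-us-east-1"}
b = {"platform": "snowflake", "region": "gcp-eu-central"}
c = {"platform": "bigquery", "region": "eu"}

print(native_sharing_path(a, a))  # direct in-place share
print(native_sharing_path(a, b))  # same platform, different region: replicate first, then share
print(native_sharing_path(a, c))  # no native share: extract/load or a cross-cloud service needed
```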
25:30
And that's one of the massive things that
25:32
we're solving for with Bobsled. So our
25:34
aim is to provide that really
25:36
simple, straightforward experience where
25:39
you say, I want to share these
25:41
specific views or tables, or this specific
25:43
data from my object storage, to
25:45
this person. And with Bobsled, you
25:47
say, this is where I want to share it to. And
25:50
so you can say, I want to share
25:52
it to BigQuery, I want to share it
25:54
to Azure Blob Storage, and what
25:57
the recipient experiences with Bobsled is
25:59
exactly that same straightforward share
26:01
in the cloud-native way of
26:03
the platform that they're on. And what
26:06
the provider experiences is that we either access
26:08
their data directly or we access it
26:10
via a simple share. And we
26:12
solve the problem of, like, how does the
26:15
data move from one place to the other,
26:18
and how do we maintain efficient
26:20
sharing if you're doing sharing to
26:22
multiple people in the same destination
26:25
and region, without
26:27
doing things like replicating more copies of your data. So
26:29
yeah, this allows people to maintain
26:32
that kind of shift-left simplicity,
26:35
taking responsibility for structuring their data, making
26:38
it analytics-ready and usable,
26:41
with the ability to
26:43
straightforwardly share it to someone else without anyone having
26:45
to think about all the ETL and the compute
26:48
that's involved. And that's what Bobsled
26:50
does under the hood. And then another challenge
26:53
to this data sharing question that
26:55
we touched on a little bit,
26:57
but is that question of auditability
26:59
and governance when you are sending
27:01
data to another entity because at
27:03
a certain point, there's no way
27:06
for you to maintain control anymore
27:08
because once somebody has access to
27:10
the data, even if you want
27:12
them to be analyzing it in
27:14
situ, there's always the possibility
27:16
that they're going to extract it and do
27:18
some other thing with it. And I'm wondering
27:20
how that factors into the ways that
27:23
the sociotechnical aspect comes into play
27:25
with some of the sharing agreements
27:27
and some of the regulation and
27:29
compliance aspects of doing data sharing,
27:31
particularly when you're dealing with something
27:33
like healthcare data and you're maybe
27:35
a medical provider sharing patient data
27:38
with a medical researcher for being
27:40
able to develop some new sort
27:42
of therapy, etc. And
27:44
I'm wondering about the ways
27:46
that the sharing protocol maybe can and
27:48
should incorporate that audit and access control
27:50
and governance enforcement in
27:52
the process of that access and
27:55
sharing. Okay, a lot of that
27:57
comes down to the agreements and what
27:59
the protocol can do is help people to
28:01
be very clear and have a shared
28:03
understanding of the agreement. So for example, with
28:06
something like right to be
28:08
forgotten, we can help people to standardise
28:10
on the way that they communicate things
28:12
that need to be deleted. So
28:15
if we are both signed
28:17
up and compliant to the
28:19
European data privacy rules around
28:22
that, if
28:25
you're a subprocessor for my data, and if
28:28
I pass on a right-to-be-forgotten request
28:31
to you, you need to process that and
28:33
delete the relevant data. And if that's the
28:35
contract between us, we get to
28:37
the practical level, you get into
28:39
practical questions like, well, okay, how do we do that?
28:41
How do we communicate to you the information so that
28:43
we can be reasonably confident that we've given it to
28:45
you and that you know what to do with it
28:47
and that you then process the
28:50
deletes. And there are some interesting challenges with that particular
28:52
thing of how do you keep track of the
28:54
fact that you have deleted something and
28:57
can also prove that you have deleted it, right? And you have
28:59
to keep track of like, we know that we did
29:01
delete this and can prove we deleted it, but we
29:03
don't actually have the data, because we deleted it.
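The bookkeeping problem just described — proving you deleted something without keeping it — can be sketched with a keyed hash. This is a hypothetical illustration, not a description of Bobsled's implementation; the key name and the receipt shape are invented for the example:

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical sketch: prove a record was deleted without retaining it.
# Only a keyed hash of the identifier is kept, never the identifier itself.
RECEIPT_KEY = b"per-tenant-secret"  # invented for the example; real key management would differ

def deletion_receipt(record_id: str) -> dict:
    """Record an auditable receipt for having deleted `record_id`."""
    digest = hmac.new(RECEIPT_KEY, record_id.encode(), hashlib.sha256).hexdigest()
    return {
        "deleted_id_hmac": digest,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    }

def verify_receipt(receipt: dict, claimed_id: str) -> bool:
    """Whoever holds the key can check a receipt against the id they asked us to delete."""
    expected = hmac.new(RECEIPT_KEY, claimed_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(receipt["deleted_id_hmac"], expected)

receipt = deletion_receipt("user-12345")
print(verify_receipt(receipt, "user-12345"))  # True
print(verify_receipt(receipt, "user-99999"))  # False
```

The requester can later re-present an identifier and have it checked against the receipt, while the processor never retains the raw value.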
29:05
And so I think we can really help people to
29:08
standardise around, you know, how they
29:10
communicate and share that data and things like
29:12
whether that's something that people also want to
29:14
hook up to with an API. Like,
29:16
maybe we don't just want, you
29:18
know, the kind of identifiers of things you
29:21
need to delete sitting in a table, but
29:23
also an API call that
29:26
you can make to process that, just to
29:28
trigger it. And we also
29:30
like sharing being more
29:32
rich than just the data. So
29:35
some of the things that you can
29:37
share in the platform are things like
29:39
user-defined functions, stored
29:41
procedures, and things
29:43
like that. So there you can help people
29:45
to share a function that can do things
29:48
like carry out a delete and
29:50
all of those different things: you run the
29:52
function and it generates the output. And
29:55
two-way sharing can be something that people
29:57
can use as part of a compliance
29:59
process. Can you share back to us?
30:02
The data, or something that's computed
30:04
over the data, like a checksum. So
30:06
can you provide us some kind of
30:08
receipt that shows that you've carried out
30:10
certain actions by sharing data back. And
30:13
again, something where we can really help by
30:15
providing the expertise to go from Snowflake
30:17
to Databricks, and Databricks back again to Snowflake,
30:20
which means that each party can be operating
30:22
where they have the confidence, the expertise.
30:25
You can run a query that means that you
30:27
can provide some receipts. And we
30:29
also do abstractions over things like the
30:31
telemetry and the audit logs. So we can
30:34
say, if there is a company sharing data to
30:36
someone else, you can go into Bobsled, get
30:39
your audit logs, and get the
30:42
data, as far as it's available, depending
30:44
on the destination platform. But you can get that in this
30:46
kind of single Bobsled view that abstracts it, and you don't have to
30:48
be linking the audit logs
30:50
of four or five different platforms
30:52
if you need to verify
30:55
something, some question around access to that data
30:57
itself. Part of our plans, although
30:59
it's not yet something we do, is
31:01
communicating governance rules
31:03
and requirements as well. So
31:06
one thing is that you need to agree, you
31:09
know, contractually, and say, you're going to
31:11
abide by these rules on how we're gonna process the
31:13
data. And if you're subject to something like
31:15
HIPAA, you know, I
31:17
will know that you get audited with ISO 27001
31:20
or SOC 2. We're
31:23
working towards HIPAA compliance. I
31:25
think that if you're a compliant organization, you have to comply,
31:28
and although I can't
31:30
necessarily audit you directly and say,
31:32
you know, look at what you've done,
31:34
I can have confidence that you are audited.
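One possible shape for that kind of governance communication — requirements that travel with the share, with the recipient's attestations recorded against them — could look like the following sketch. All names here are invented for illustration; this is not an existing Bobsled API:

```python
from dataclasses import dataclass, field

# Illustrative sketch (invented names, not an existing Bobsled feature): attach
# governance requirements to a share and record the recipient's attestations.
@dataclass
class ShareManifest:
    dataset: str
    requirements: list               # e.g. ["HIPAA-BAA", "ISO-27001"]
    attested: set = field(default_factory=set)

    def attest(self, requirement: str) -> None:
        # The recipient explicitly signs off on each stated requirement.
        if requirement not in self.requirements:
            raise ValueError(f"unknown requirement: {requirement}")
        self.attested.add(requirement)

    def compliant(self) -> bool:
        # Release the share only once every requirement has been attested to.
        return set(self.requirements) == self.attested

share = ShareManifest("patient_visits", ["HIPAA-BAA", "ISO-27001"])
share.attest("HIPAA-BAA")
print(share.compliant())  # False: ISO-27001 not yet attested
share.attest("ISO-27001")
print(share.compliant())  # True
```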
31:36
And one thing that we're looking
31:38
at is providing a way for people to
31:41
say what their kind of governance requirements are
31:43
and have that clearly passed along with the
31:45
data. So that the recipients of the data
31:47
clearly can see the governance
31:49
requirements of this, and can attest and say,
31:51
like, yes, we meet those requirements and
31:54
helping to kind of make that part of the data
31:56
sharing protocol and tying it up with
31:58
the business associate agreements of HIPAA. And
32:01
at what point do point-to-point connections
32:04
for data sharing reach their limits
32:06
and you need to then step
32:08
into the situation of having a
32:10
data brokerage for escrowing certain data
32:12
sets that multiple organizations need to
32:15
be able to have access to
32:17
and what are some of the
32:19
ways that the data sharing protocols
32:21
can maybe also help to reduce
32:24
friction of populating those data sets
32:26
and consuming those data sets. Yeah,
32:29
one of the things we do at Bobsled
32:31
is we kind of combine, hopefully, what is
32:33
the best of the in-place sharing and
32:35
we're doing some of the work of
32:37
moving data around and achieving efficiency. So
32:39
I talked about when you have a
32:42
kind of shift right approach, every
32:45
consumer of the data has their own copy of
32:47
data and their own compute doing ETL and so
32:50
on. When we do a data
32:52
share from one platform to another, some ETL has
32:54
to happen. Obviously, in-place sharing requires the data to
32:56
be in place. If you want in-place
32:58
sharing in Google Cloud, it's got to be in Google
33:00
Cloud, and if you want in-place sharing in
33:02
Snowflake, it's got to be in place in Snowflake. If
33:05
you want both at once, you have to have two copies of
33:07
data. What we can do is make sure that if
33:10
you're sharing the same data to 10 people
33:13
in AWS US East One, that
33:15
there is only one copy of the
33:17
data in AWS US East One and
33:19
all of those 10 people are then
33:21
consuming a view on that
33:24
data. So we can ensure that
33:26
you're getting the best possible efficiency of
33:28
what's there, and as
33:30
we think about revoking access and
33:33
things like that as well, what we do
33:35
provide is a very simple way for people to
33:37
do things like revoke access. As
33:39
you start to think about the challenges
33:41
that people face, trying to achieve data
33:43
sharing into multiple clouds, multiple
33:45
platforms. If you
33:47
want to say, hey, we want to revoke access
33:49
from someone now, the work that ends
33:52
up being involved can be quite significant, right? You have
33:54
to go into each platform and where
33:56
they might be used and individually figure out how
33:59
to revoke access, and that's for every different platform,
34:01
every different person. So
34:03
with Bobsled, you can just go into Bobsled, say revoke
34:05
that access, and we'll make sure that for every different person that
34:08
access is revoked. And where
34:10
you've got multiple people sharing from the
34:12
same dataset, you've got 10 consumers in
34:14
one particular region, we can handle things
34:16
like the kind of garbage collection and
34:18
reference counting. So we'll maintain
34:20
that data in that location until it's not
34:22
being used by anyone in that location. And
34:25
then we can evict that data. And there's obviously a
34:27
number of technical challenges in
34:30
terms of orchestration and
34:32
management and things like that for us to do.
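The garbage-collection and reference-counting behaviour described above can be sketched roughly like this — a simplified model for illustration, not Bobsled's actual orchestration code:

```python
from collections import defaultdict

# Simplified model of the behaviour described: one physical copy per
# (dataset, region), kept alive only while some consumer still holds a grant.
class RegionalCopies:
    def __init__(self):
        self.consumers = defaultdict(set)  # (dataset, region) -> set of consumer ids

    def grant(self, dataset: str, region: str, consumer: str) -> None:
        # Ten consumers in one region still mean only one stored copy.
        self.consumers[(dataset, region)].add(consumer)

    def revoke(self, dataset: str, region: str, consumer: str) -> None:
        key = (dataset, region)
        self.consumers[key].discard(consumer)
        if not self.consumers[key]:
            del self.consumers[key]
            self.evict(dataset, region)  # reference count hit zero: garbage-collect

    def evict(self, dataset: str, region: str) -> None:
        print(f"deleting unused copy of {dataset} in {region}")

copies = RegionalCopies()
for c in ("acme", "globex", "initech"):
    copies.grant("trades", "aws-us-east-1", c)
copies.revoke("trades", "aws-us-east-1", "acme")     # copy stays: two consumers left
copies.revoke("trades", "aws-us-east-1", "globex")
copies.revoke("trades", "aws-us-east-1", "initech")  # last one out: copy is evicted
```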
34:34
You mentioned kind of escrow as well. Yeah,
34:37
escrow is, there are a ton of
34:39
different understandings of it, and it is a use case
34:41
that we talk to various people
34:43
about. At the moment, one
34:46
of our approaches there has
34:48
been to say we can help to ensure that
34:51
data is available in different places and
34:53
take advantage of that. But often, if you
34:56
want to kind of escrow data in certain conditions, the
34:58
first thing to do is use cryptography
35:01
and then manage the cryptographic
35:03
key. So we can say
35:05
we can share the encrypted data between a bunch
35:08
of companies. We've experienced this personally
35:10
around source code escrow.
35:12
So when you're a startup working with
35:14
enterprises, they'll say if you go
35:16
bankrupt as a business, or
35:18
stop serving in some way, we'd like to have
35:20
the possibility that we could maybe continue to operate
35:23
the service. So you need to put your source
35:25
code in escrow. And there are some companies who
35:27
provide that kind of service, but you can also
35:29
do this sort of DIY thing where you're saying
35:31
if you want a large amount of data in
35:33
an escrow, you can encrypt it.
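That encrypt-then-escrow-only-the-key pattern can be illustrated as follows. The XOR keystream here is a toy stand-in purely for illustration; a real implementation would use a vetted cipher such as AES-GCM:

```python
import hashlib
import secrets

# Illustration of the pattern: encrypt the bulk data yourself, escrow only the key.
# The XOR keystream below is a toy for illustration; real use would take a vetted
# cipher such as AES-GCM.
def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Symmetric toy cipher: XOR data against a SHA-256-derived keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

key = secrets.token_bytes(32)             # the small secret that goes to the escrow agent
payload = b"many terabytes of source data, in principle"
ciphertext = keystream_xor(key, payload)  # this can be replicated and shared freely
print(keystream_xor(key, ciphertext) == payload)  # True: releasing the key releases the data
```

The large ciphertext can sit with every party, while the escrow agreement only ever has to handle the 32-byte key.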
35:36
Bobsled can help you then move that data
35:38
around. And
35:40
then the escrow process can just focus on the
35:42
key, a small piece of data, and you can work
35:44
with a law firm or accountant, one of the
35:46
people who provide that service to say that we'll
35:49
hold the keys to that in escrow, and then
35:51
everyone just has the encrypted data. In
35:53
terms of the boundaries
35:56
that you're crossing with these
35:58
data transfers, the technical arrangements,
36:00
the organizational arrangements, what are
36:02
some of the typical
36:05
situations in which you encounter those
36:07
types of boundaries and the ways
36:09
that they are defined and delineated?
36:11
And I imagine most of that
36:14
is just purely organizational, but what
36:16
are the cases where technical
36:18
requirements actually necessitate these data
36:21
transfer systems versus just being
36:23
able to do direct integration
36:26
between them? Yeah,
36:28
we see the kind of within
36:30
an organization boundary as
36:33
well as the between-organization boundaries you talked about before.
36:35
So sometimes it can be different
36:37
regions within an organization.
36:39
So for example,
36:42
you might have the UK
36:44
office on one system
36:46
and the South African office on another system. And
36:49
sometimes that can be necessitated by
36:51
things like regional rules
36:53
or regional availability of
36:56
services. Another thing we
36:58
can sometimes see driving
37:00
technical requirements is things like AI processing
37:02
availability and stuff like that. So people's
37:05
choices for where they want to analyze
37:07
their data may not just be
37:09
driven by the myriad of reasons that you
37:11
choose a cloud or so on, but it
37:13
might be related to specific AI or blockchain
37:15
or similar technical requirements. So if you want
37:18
to use certain OpenAI things, you maybe
37:20
need to be on a Microsoft platform. And
37:22
if you want to use certain
37:24
blockchain systems, you may need to
37:26
be on another platform or
37:29
another location. The other one that
37:31
drives kind of regional things is
37:33
compliance and the rules around that.
37:35
So you want to keep data within
37:37
certain geographical boundaries. One thing
37:40
we allow people to do is control which regions
37:42
and platforms they allow data to be shared to.
37:44
So you can have data in Bobsled and
37:46
make that data possible to be shared
37:48
to any region, any cloud platform.
37:50
You can limit it down and say, this
37:52
data is only allowed to be shared within
37:54
the EU. It could be on any platform,
37:57
but still on the regions of those platforms
37:59
that are EU. So you
38:01
see boundaries that are geographical and
38:03
regulatory, rather. There are regulatory boundaries
38:06
in the clouds too:
38:08
there are kind of GovCloud services and
38:10
some types of
38:12
healthcare clouds, and they're kept separate,
38:14
so it's like a cloud for healthcare
38:17
data. So there's obviously some compliance
38:19
boundaries there. I'm trying to think
38:21
through what other things we've seen
38:23
that might come into this. There
38:25
are also the cloud-to-cloud boundaries.
38:27
This is something that we're
38:30
aware of and keep in mind for the future. When
38:32
you have a real asymmetry between
38:34
organizations, so you maybe have quite
38:37
a small organization working with a
38:39
very large one: a small organization with limited
38:41
capacity to do sophisticated things, working
38:44
with a large organization that's able
38:46
to do very sophisticated things or expects very
38:48
sophisticated things. That creates a
38:50
kind of technical boundary of what kind
38:52
of solutions they might use,
38:54
and your small organization might want
38:56
to be using something like Google Sheets. And
38:59
It's not something we support today, but
39:01
it is something that BigQuery can do,
39:04
to say something like, I'd like
39:06
to share from a Google Sheet into
39:09
BigQuery, and then from BigQuery onwards,
39:11
using Bobsled, to anywhere. So yeah, we
39:13
can see those kind of things where someone says,
39:15
I want to go from a really quite
39:17
different kind of system, and
39:20
we can kind of take away the work of
39:22
getting data from a very different system into
39:24
or out of the cloud. And
39:27
in your experience of working in
39:29
this space of data sharing and
39:32
the socio-technical aspects that come into
39:34
play, what are some of the
39:36
most interesting or innovative or unexpected
39:39
applications of that protocol and capability
39:41
that you've seen? One
39:44
really interesting thing we've
39:46
seen from customers is auto-fulfillment
39:48
from a CRM. We
39:51
allow driving all of this through a
39:54
single API, so you can call the Bobsled
39:56
API and set up a share or transfer or make
39:58
a change. And
40:00
we've had really interesting discussions with customers where
40:03
they're kind of directly connecting a CRM
40:05
into Bobsled. So
40:08
you do some activities in your system,
40:11
in something like Salesforce or so on, and
40:13
Salesforce can send information
40:15
directly into Bobsled, Bobsled can share, Bobsled
40:18
can make webhook notifications back to
40:20
the CRM, and you can
40:22
actually achieve auto-fulfillment, with your
40:25
salesperson or account manager using the
40:27
system that they're familiar with. Having a
40:29
data share be created in reaction, and actually getting
40:31
updates back in their CRM, without the
40:33
person leaving the CRM and without the company
40:35
using it really building kind of bespoke software, because
40:38
rather than having to run some separate platform and
40:40
maintain the servers, they're able to build it into
40:42
the extensibility of a platform like
40:44
Salesforce. Which is really cool, to see that
40:46
people are able to get these things up
40:48
and running without leaving their own CRM, without
40:50
having to build their own back-end server and
40:52
their own major development process. Another thing that
40:55
I think is very cool that we do internally
40:57
is we do Bobsled
40:59
to Bobsled, so we can send
41:01
data from one
41:04
place in Bobsled to another place and
41:06
then use that data destination as a
41:08
source for further onward Bobsledding. We
41:10
use that internally, including
41:12
doing things like
41:14
using it to share data back to our
41:17
customers about things like their usage. So
41:20
if you want to get data about your
41:22
usage of Bobsled, we're working on providing that
41:25
as a Bobsled share that you can then consume in
41:28
BigQuery or in Snowflake or so on.
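The chaining idea — one share's destination becoming the source of the next — can be modelled minimally like this (hypothetical names, not the Bobsled API):

```python
from dataclasses import dataclass

# Minimal model of share chaining (hypothetical names, not the Bobsled API):
# the destination of one share becomes the source of the next.
@dataclass
class Share:
    source: str
    destination: str

def chain(*hops: str) -> list:
    """Build a pipeline of shares where each hop feeds the next one."""
    return [Share(a, b) for a, b in zip(hops, hops[1:])]

# e.g. internal usage data -> a staging location -> the customer's warehouse
pipeline = chain("internal-usage-db", "staging-share", "customer-bigquery")
for share in pipeline:
    print(f"{share.source} -> {share.destination}")
```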
41:30
Another thing that we've seen people do
41:32
is having data they've got
41:34
as a source in something like
41:36
a CSV. We support loading CSVs
41:38
into data warehouses. So they
41:40
use Bobsled to load the data
41:43
that they've got in CSVs into
41:45
Snowflake or Databricks
41:47
or BigQuery. They then set
41:49
that Snowflake or Databricks or BigQuery up as
41:52
a Bobsled source, and
41:54
then they use the capabilities of Snowflake
41:57
to make views and so on over what
41:59
they had in CSV, and
42:01
then they use that Snowflake as a Bobsled source
42:04
to then do further onward sharing. So they're
42:06
actually using Bobsled to kind of do an
42:08
ETL process and bootstrap
42:11
themselves from a kind
42:13
of non-cloud-native sharing protocol
42:15
world into a cloud-native sharing
42:17
protocol world, by using
42:19
Bobsled as a bootstrap, and Snowflake can then do onward sharing,
42:22
using Bobsled from that Snowflake. And
42:24
in your experience of building Bobsled,
42:26
working closely in this context of
42:28
organizational data sharing, what are some
42:30
of the most interesting or unexpected
42:32
or challenging lessons that you've learned
42:35
in the process? There's
42:37
always a lot of challenging lessons
42:40
from operating a startup. At an
42:43
early-stage startup, as a founder,
42:45
you're often dealing with whatever the most serious
42:47
problem in the business is at any given
42:49
time. Yeah, so one of the biggest
42:51
challenges we've seen is the
42:54
complexity of building an abstraction over
42:57
all of these different Cloud systems. So I talked
43:00
with Jake, he's my co-founder, and
43:02
we were observing that Bobsled, as a product, is kind
43:04
of a simple concept, right? One of
43:06
the simplest concepts that either of us have
43:08
worked on in our career is in certain
43:11
ways. Compared to the commitment of a
43:13
graph database. Conceptually, a graph
43:15
database is a really complex thing.
43:17
But Bobsled is a very straightforward
43:19
product for you to share data
43:22
from your storage or data warehouse to another
43:24
storage or data warehouse. What
43:27
we've seen is a real tension between the simplicity
43:29
of the concept and the
43:31
challenge of building an abstraction
43:33
over all the different Cloud and warehouse
43:36
platforms. And one of the things for
43:38
me here is that we
43:41
don't own this kind of stack all
43:43
the way down. So what we have to work
43:46
with aren't kind of the theoretical
43:48
limitations that you might have when, say,
43:50
building Raft. If you're building a
43:52
Raft system, you
43:54
can go and read the kind of PhD papers
43:56
and so on related to it, and you can
43:59
understand the constraints, such as the CAP theorem, or
44:01
the speed of light. And you basically then
44:04
are up against those challenges. You can try
44:06
and build against that and control it and
44:08
understand it. We don't have that
44:10
kind of deep tech or
44:12
hard tech challenge. There's not at
44:14
our core a really hard challenging
44:17
AI problem or a challenging CAP
44:19
theorem, distributed systems problem, or something that we're solving
44:21
for people in a really smart way. What
44:24
we're challenged with is all of these
44:26
different abstractions that are present in
44:29
Azure, Databricks, Snowflake, BigQuery,
44:31
and they superficially are
44:33
quite similar. But
44:36
as you get into trying to manage and work
44:38
with all of them, you discover that they are
44:40
different. And the devil of data engineering
44:42
lives in these details. And yeah, that's been
44:44
a really, not entirely unexpected,
44:46
challenge, but that's been where we've discovered a lot
44:49
of challenges actually to build an abstraction. Even
44:51
across object storage, we
44:53
find that in AWS, you
44:56
have the access point abstraction, which is
44:58
really great, but it's not present in the other
45:00
clouds. And so you build something on AWS, and
45:03
then you realize you can't really build a comparable
45:05
abstraction on Google Cloud Storage. So you have
45:07
to do something quite different. Or as we get
45:09
into things like executing serverless functions, you know, for
45:11
our work, we execute serverless functions to do work
45:14
in AWS, GCP and Azure. And so we have
45:16
to build out an abstraction for
45:18
managing serverless functions running on different clouds. And
45:20
that's kind of a challenge in itself
45:23
that some organizations have, like, an
45:25
entire team for, so it's
45:27
kind of a platform team
45:29
building technology that allows people to do that. And
45:32
that is one of the problems that we've
45:34
solved internally so that we can say, hey, to
45:36
do object storage sharing to all the different clouds,
45:38
we need an abstraction that means we can run
45:40
some serverless compute, means we can
45:42
make some certain assumptions about how data is
45:44
stored, and we can make some straightforward ways
45:46
of saying how we grant or revoke
45:48
access to a share. Each of those end
45:50
up being surprisingly different and nuanced between different
45:52
platforms. And for people who
45:55
are exploring the problem
45:57
space of being able to send
46:00
data from one system
46:02
to another, whether that's across organizational boundaries
46:04
or across technical boundaries. What are the
46:06
cases where bobsled is the wrong choice
46:08
or what are the cases where they
46:11
should just reconsider the entire application and
46:13
avoid data sharing entirely? I think the
46:15
biggest time when bobsled is the wrong
46:18
choice is kind of when you know
46:20
you can say, I just
46:22
don't need to. I'm a huge fan
46:24
of identifying we don't need
46:26
to do things and you can often
46:28
find yourself in a situation where you feel like you
46:30
need to do something because that's how it's done, or
46:32
other people do it, and things like that. But
46:36
a bit of analysis can show that maybe we don't. But
46:38
one of the main cases where bobsled is the wrong choice is
46:41
when something like a data clean room is the
46:43
right choice. That's when the
46:46
reassurances you want around what's visible to
46:48
someone and what's done with it, whether
46:50
or not it's kind of been extracted and
46:52
so on, are so stringent
46:54
that you need to make use of
46:56
a data clean room. And there's some
46:58
really cool technologies in that space around
47:00
things like differential privacy and things where
47:02
you can have systems that allow you
47:04
to make kind of aggregate queries that
47:07
don't reveal the underlying data but allow you to
47:09
query the data in the aggregate and things
47:11
like that. And for all of
47:13
those, bobsled is the wrong choice or would
47:15
have to be part of a much more
47:17
complicated solution architecture. At times we
47:20
talk to people,
47:22
we talk to people who want
47:24
to do a migration. They want to
47:26
migrate from GCP to AWS, their
47:28
entire stack. They say, well, that's something Bobsled
47:31
can do. Bobsled can move data from a
47:33
data store in one place to another. Can
47:36
we use Bobsled for a migration? At the moment
47:38
that's something where we would generally say, bobsled
47:40
is not the right choice. If
47:42
you're doing exactly one move from exactly one
47:44
place to another, there's probably already
47:46
a tool in the destination that's
47:49
good enough for what you want to
47:51
achieve, like Azure Data
47:53
Factory or something. If you're just concerned with
47:55
getting data in to just one platform, then
47:58
you can probably use the native tooling on that platform. And
48:00
we're usually advocates of using
48:02
the native tooling, like use the native sharing
48:05
protocols and things
48:07
like that. Yeah, and then I'm trying to think
48:09
of situations where we've come
48:11
across where we've sort of said, you know,
48:14
do you even need to do data sharing? Like,
48:16
perhaps you should rethink that. I think
48:18
there are cases where the reverse is true,
48:20
where you might be like, should this be an API?
48:22
There are situations where
48:25
people are like, we have a
48:27
JSON REST API hammer, so we're gonna
48:29
treat everything as a nail with a REST API,
48:31
JSON over HTTP. They're
48:34
coming to the question: should this
48:37
use case be something that you're managing with
48:39
analytic data sharing? Or should it be
48:41
something that's actually an API or
48:44
a webhook or something else? You
48:46
could take an event stream
48:49
and that might be right. Another way you
48:51
can do cross-organization kind of synchronization in
48:53
some of the clouds is using things like
48:56
an event stream where you can do cross-organization
48:58
listeners on it, or
49:20
any particular projects or problem areas you're
49:22
excited to explore? I might give you two
49:24
answers there. One thing I'm really excited
49:26
about or kind of passionate about is
49:29
tackling something that we call modern data
49:31
stack fatigue. So you're probably
49:33
familiar with this. There's
49:35
a whole raft of technologies that go
49:37
into the modern data stack. I have a
49:39
controversial comparison with Kubernetes. Kubernetes
49:42
has this kind of landscape diagram,
49:44
which is quite famous, showing you
49:46
like all of the Kubernetes ecosystem technologies. I
49:48
don't know if you're familiar with it, but it's incredibly
49:51
dense, unreadable unless your
49:53
screen is six foot wide, kind of
49:55
thing. And the modern data stack
49:57
is going in a similar direction, with
50:00
a whole host of different tools
50:02
for doing each different individual thing that
50:04
you might do. And yes, the same
50:06
has happened
50:09
in the world of data and data
50:11
engineering and analytics. We've had an explosion of
50:14
all of the different tools and technologies
50:17
and services and infrastructure-as-a-service
50:19
and platform-as-a-service and everything
50:21
else. And now we're in
50:23
a situation where the fatigue
50:25
sets in, sort of, oh, anyone
50:27
on my team kind of managing the stack is
50:29
like, well, I have to know like six or
50:31
seven different technologies. And then when you get into
50:33
things like hiring, you're suddenly like, well,
50:36
our stack is this particular combination. When
50:38
you're hiring, you're like, we want to hire someone who
50:40
has this exact combination of experience and you're like, well,
50:42
that person doesn't exist because there are so many different
50:45
combinations of possible things that no one has used exactly
50:47
that combination. And we're in a
50:49
macro economic environment where there isn't necessarily a
50:51
budget for everyone to have every single tool,
50:53
right? And people are a bit more focused
50:55
on what you can do being
50:57
lean. And do I really need to
51:00
have a bunch of services running for this? We
51:02
think something like DuckDB is really cool, right?
51:04
DuckDB has a kind of minimalist
51:06
approach, which means, you know, you don't have a
51:09
bunch of services necessarily running, and like you can
51:11
run analytics on your M3 laptop. And
51:15
the thing that's really exciting for us is
51:17
that we can kind of help
51:20
people approach some of that because we don't
51:22
have a kind of horse in any of
51:24
these races, right? Within that fatigue,
51:27
there are sort of different philosophical holy
51:29
wars, the kind of Emacs versus Vim
51:31
type of things, right? Like, should you
51:33
have a lake house or a warehouse
51:35
or? We can
51:37
help people to do data sharing sort of
51:39
regardless of, you know,
51:41
what technology choices they've made. And
51:44
I really hope to kind of help people achieve
51:48
simplicity in the
51:51
face of all of this complexity of
51:53
options. And yeah, I'm really excited
51:55
to see what we can do to
51:58
cover more of these bases and help people. We're
52:00
interested in incorporating
52:02
things like DuckDB
52:04
and things like that, but also in
52:06
use cases where people can analyze data
52:08
kind of directly in place. And
52:11
I think today you might need to move the data
52:14
into your cloud so you can
52:16
analyze it with your BigQuery. One
52:18
of the things that I think would be very cool is:
52:20
do you need to move
52:22
the data into the destination cloud and into
52:24
BigQuery, or can you kind of issue a
52:27
query directly, using something like DuckDB on
52:29
the source data, and we never need to
52:31
do the ETL part, right? And things like the
52:33
shortcuts some platforms provide point in that
52:35
direction. Yeah, there's a lot of things that I'm excited
52:38
about. I'm also excited about
52:40
things like two-way sharing, as
52:42
I talked about. There's various different use cases and
52:45
they're all quite interesting where people say, I
52:47
want to share something to you and then you
52:49
would, for example, enrich it. You
52:52
attach it to the data you have, then
52:54
you perform some analysis or
52:57
scoring over that data, and then
52:59
you kind of send something back to me
53:01
that is meaningfully transformed or enriched. That's one
53:03
of the things that I'm looking
53:06
forward to getting into, because it starts
53:08
to unlock higher levels of
53:10
value, enabling people to collaborate, and
53:13
as I talked about at the beginning, we can
53:15
help the industry to do things more
53:17
efficiently. We talked about how you end up duplicating data
53:19
if you have data being
53:21
copied from one place to another and then
53:23
processed and so on. Helping the
53:26
industry to be efficient but also to achieve
53:28
higher value. Because sharing is
53:30
part of a collaboration process and
53:32
if we can do two-way sharing, we can
53:34
help to unlock higher value collaboration. Indeed,
53:36
one of the things that was one of
53:39
our founding convictions is that
53:41
enabling collaboration between organizations is kind
53:43
of a net beneficial thing. It
53:47
helps improve efficiency, it helps organizations make
53:49
better decisions, and those things
53:49
are broadly in the interest of
53:51
consumers and users as well. And
53:57
are there any other aspects of the
53:59
overall space? of data sharing, both
54:01
the technical aspects, the organizational challenges, the
54:03
ways that you're approaching it at Bobsled
54:05
that we didn't discuss yet that you'd
54:07
like to cover before we close out
54:09
the show? I think
54:12
there is stuff
54:14
I love talking about around this like shift left
54:16
and shift right mentality and
54:18
like who has the responsibility for
54:20
doing things. So we talked about in
54:23
data industry, shift left is something that
54:25
we talk about and see as
54:27
a broadly good thing: this idea of
54:30
moving the responsibility and the work leftwards,
54:33
to the kind of source of the data, and
54:35
saying let's have the person who produces the data
54:37
do the work, not the users inside an organization. So
54:40
you know, we shift left to a kind of data
54:42
team in the organization and have
54:44
a central team who's ensuring
54:46
that data is clean, that it's well set
54:49
up, that it's easy to query, that it
54:51
is optimized with stuff like indexes and aggregations, and
54:54
that drives efficiency, compared with the shift
54:56
right mentality where you say we just dump the
54:58
data and all the consumers have to figure
55:00
out how they're going to use it and
55:03
how they're going to compute everything. And what
55:05
we do, one thing we can help with,
55:07
not so much within an organization,
55:09
which we don't generally tackle, but across organizations,
55:11
and within those more complex organizations
55:13
that do have boundaries, is a further
55:16
shift-left approach, where you say the people who
55:20
are best placed, the producers of the data,
55:22
structure the data
55:26
and generally manage the ongoing life of
55:28
the data and evolution
55:31
of the schema and attending of
55:33
new data and all those challenges
55:35
that really make up a lot
55:37
of the work of data engineering around
55:39
methods or the kind of the nifty
55:41
gritty stuff like oh no what happens
55:43
when a stock kind of splits or
55:46
one country changes its code or something like
55:48
that and the curve balls that you have
55:50
to deal with when you're managing a schema.
55:53
We can help that to be centralized, which is
55:55
more efficient and more effective. Within data
55:57
sharing between organizations,
55:59
we can help shift left
56:01
and we can help reduce ETL, which are
56:04
two of the major pain points
56:06
of a lot of data engineering: I spend
56:08
so much time on ETL before I can write
56:10
my analysis, or I spend so
56:13
much time on data cleaning and processing
56:15
before I actually do my analysis. I
56:17
think the work that we do can really
56:19
help tackle those for a range of organizations.
56:21
It's a really challenging realm where,
56:23
for a lot of people who work there,
56:25
the alternative is some major DIY project:
56:28
try and build some
56:30
subset of this functionality yourself, or
56:33
try and persuade a commercial partner
56:35
to make some pretty significant decision,
56:37
like doing that work across different clouds.
56:40
Absolutely. Well, for anybody who wants to get in
56:42
touch with you and follow along with the work
56:44
that you're doing, I'll have you add your preferred
56:47
contact information to the show notes. And as the
56:49
final question, I'd like to get your perspective on
56:51
what you see as being the biggest gap in
56:53
the tooling or technology that's available for data management
56:55
today. That's an interesting question
56:57
in the context of the modern data stack.
57:00
So, we have a huge range
57:02
of tools, and in some
57:04
ways, part of the challenge is that
57:06
proliferation itself. So, I have some
57:08
experience in the AI space, and
57:12
obviously, that's extremely hot
57:14
and very busy right now. I think
57:16
that there isn't, I'm not an expert,
57:18
sorry, but one thing that I think
57:21
there isn't is a really
57:23
good approach to vector databases and
57:25
embeddings. Obviously,
57:29
a lot of people are attempting
57:31
to build out good solutions around vector
57:33
databases and managing embeddings. I've
57:35
spoken to quite a lot of startup
57:38
founders and founding engineers based on my
57:40
previous experiences who are trying to do
57:45
things like similarity search and
57:47
KNN and Levenshtein
57:50
distance, and
57:53
all these kind of very
57:55
standard data analytics things over
57:58
vectors that are being produced from kind
58:00
of AI practices like large language models and deep
58:02
neural networks for computer vision. And that is an
58:04
area where, you know, I spoke to a lot
58:06
of people about what they're doing to tackle it. And
58:09
most of the existing tools, which are all
58:11
very new, are very
58:14
pricey or very inefficient beyond
58:16
kind of very small toy projects. And
58:19
a lot of people I spoke to there are
58:21
running into challenges and building their own. And
58:23
that's also where they're talking to me, around
58:25
the data infrastructure management: like, how do we
58:27
manage the infrastructure and build our own thing on
58:30
top of something like Spark and start running
58:32
these algorithms at scale over vectors? So
58:36
yeah, I guess maybe it's kind of
58:38
an obvious startup-founder
58:40
answer, but I think kind of AI and
58:42
vector database solutions is somewhere I think
58:45
there is a gap for a really good tool. All
58:47
right, well, thank you very much for taking
58:49
the time today to join me and share
58:52
your experiences of working in this space of
58:54
data transfer and enabling that for different organizations,
58:56
making that a simpler problem to solve. So
58:58
I appreciate all the time and energy that
59:00
you and your team are putting into that.
59:02
And I hope you enjoy the rest of
59:04
your day. Thank you. Thank you
59:06
very much. It was a pleasure. Thank
59:15
you for listening. Don't forget to check
59:17
out our other shows, Podcast.__init__, which covers
59:19
the Python language, its community and the
59:21
innovative ways it is being used. And
59:23
the machine learning podcast, which
59:25
helps you go from idea to production with machine
59:27
learning. Visit the site
59:29
at dataengineeringpodcast.com to subscribe to
59:32
the show, sign up for the mailing list and read
59:34
the show notes. And if you've learned something
59:36
or tried out a product from the show, then tell us about
59:38
it. Email hosts
59:40
at dataengineeringpodcast.com with your
59:42
story. And to help other people find the
59:44
show, please leave a review on Apple Podcasts
59:46
and tell your friends. Thank
59:55
you.
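As a postscript to the vector search discussion in the episode: the guest mentions teams hand-rolling similarity search and k-nearest-neighbor queries over embedding vectors before any dedicated vector database fits their scale or budget. For a sense of what that workload looks like at toy scale, here is a minimal, illustrative brute-force KNN over embeddings using cosine similarity. This is a sketch, not code from any tool discussed in the episode, and the function names are hypothetical.

```python
import math

def knn_cosine(query, corpus, k=5):
    """Return indices of the k corpus vectors most similar to `query`,
    ranked by cosine similarity (highest first)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Brute force: score every corpus vector against the query, then sort.
    # This is O(n * d) per query, which is why it only works for small
    # collections and why teams end up reaching for vector databases.
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]

# Toy usage: four 3-dimensional "embeddings".
corpus = [[1.0, 0.0, 0.0],
          [0.9, 0.1, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
print(knn_cosine([1.0, 0.05, 0.0], corpus, k=2))  # -> [0, 1]
```

At production scale, scoring every vector per query stops being feasible, which is exactly the gap the guest describes: approximate-nearest-neighbor indexes and vector databases exist to make this tractable beyond toy collections.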