Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:11
Hello,
0:11
and welcome to the Data Engineering Podcast,
0:13
the show about modern data management.
0:17
Introducing RudderStack Profiles. RudderStack
0:19
Profiles takes the SaaS guesswork and SQL
0:22
grunt work out of building complete customer profiles
0:24
so you can quickly ship actionable, enriched
0:26
data to every downstream team. You
0:29
specify the customer traits, then Profiles
0:32
runs the joins and computations for you to create
0:34
complete customer profiles. Get
0:36
all of the details and try the new product today
0:39
at DataEngineeringPodcast.com slash RudderStack.
0:42
You shouldn't have to throw away the database to build
0:44
with fast-changing data. You should be
0:46
able to keep the familiarity of SQL and the
0:49
proven architecture of cloud warehouses but
0:51
swap the decades-old batch computation model
0:53
for an efficient incremental engine to get complex
0:56
queries that are always up to date.
0:58
With Materialize, you can. It's the
1:00
only true SQL streaming database built
1:02
from the ground up to meet the needs of modern data
1:04
products. Whether it's real-time
1:06
dashboarding and analytics, personalization
1:08
and segmentation, or automation and alerting,
1:11
Materialize gives you the ability to work with fresh,
1:13
correct, and scalable results, all in
1:15
a familiar SQL interface. Go to
1:17
DataEngineeringPodcast.com slash
1:19
Materialize today to get two weeks free.
1:22
Your host is Tobias Macey, and today
1:24
I'm interviewing Ranjit Raghunath about tactical
1:27
elements of a data product strategy. So
1:29
Ranjit, can you start by introducing yourself?
1:31
Absolutely. Firstly, Tobias, thanks for the opportunity
1:34
to have me on as a delegate
1:36
of CX Data Labs on your podcast,
1:38
a big fan. So thank you. So
1:40
my name is Ranjit Raghunath. I'm a managing principal
1:43
over at a company called CX Data Labs. We're
1:45
a data and analytics strategy and implementation
1:47
services company, and we focus
1:51
on optimizing customer experiences
1:53
in the retail, life sciences, and financial
1:55
services verticals
1:57
using data engineering and
1:59
data platforms as our core set
2:01
of picks and shovels,
2:04
effectively to kind of tie these systems together
2:07
so that businesses can see a
2:09
holistic view of the customer and then action
2:11
on it. And some of the things that they do as
2:13
a result of the work that we've done is increase
2:16
their ability to personalize on certain content
2:19
that they present or better understand
2:21
their marketing spend in terms of what
2:23
resonates well with customer acquisition costs
2:26
or simply optimizing wait times
2:29
as people call into a call center. And so
2:31
those are some of the examples. And for me
2:33
personally, this has been a long time
2:35
coming and I've been in the
2:38
data analytics field for roughly 17 years.
2:40
I've done nothing but various forms
2:43
of engineering software and data all
2:45
under the vicinity of either
2:48
producing data solutions or data products.
2:51
And just an overall geek and then
2:53
a nerd as it comes to data. Do you remember
2:55
how you first got started working in data? Yeah,
2:57
I do. I was an intern over at a company
2:59
called USAA and they were
3:02
working on a billback model. And
3:05
the core problem that they were trying to solve was they had
3:07
a set of infrastructure that they
3:09
wanted to go through and bill back all
3:12
the way to the business
3:14
teams utilizing those applications
3:17
so that they were getting value
3:19
from it. And one of my
3:21
tasks was to come in and help the
3:24
team go through and provide
3:26
this costing model. And so as I
3:28
came in, they were using Excel and
3:30
they were using Access to do
3:32
some of these computations.
3:34
And I kind of looked at them and I said, hey, you know, why
3:37
don't we start writing data pipelines to do this?
3:39
Which I didn't know they were called data pipelines, but
3:41
I was an electrical engineering graduate coming
3:43
in as an intern. All I knew is well,
3:46
maybe we can optimize it and do it differently. And
3:48
then soon got introduced to dimensional
3:50
modeling and said facts and dimensions is how
3:52
you can do that. Oh, well, what if we
3:54
turned it around? Why are we sending these reports to
3:57
them? Can we bring them over and then have them take
3:59
a look at it? Self-service reporting with business
4:02
intelligence. So a lot of it, I didn't have
4:04
the names for it per se, but
4:07
that's how I started cutting my teeth into it and just
4:09
started kind of navigating it, all to optimize
4:12
and to kind of lower the ratio in terms of the
4:14
work done for people getting the value that
4:16
they need. Yeah, it's amazing how
4:18
much in the technology industry
4:21
in particular, but probably any industry really,
4:23
that if you don't know the right terms that
4:25
people are using, then you just end up rebuilding
4:27
it yourself because you didn't know that it was already done.
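The billback model and the facts-and-dimensions idea from this exchange can be sketched in a few lines; the table names, fields, and numbers below are hypothetical illustrations of the pattern, not the actual USAA model.

```python
# A minimal facts-and-dimensions sketch of a billback model: a fact table of
# cost events joined to dimensions so costs roll up to business teams.
# All names and figures are made up for illustration.

# Dimension: business teams (hypothetical)
dim_team = {
    1: {"team_name": "Claims", "division": "Insurance"},
    2: {"team_name": "Lending", "division": "Bank"},
}

# Dimension: applications the infrastructure supports (hypothetical)
dim_app = {
    10: {"app_name": "PolicyAdmin", "team_id": 1},
    11: {"app_name": "LoanOrigination", "team_id": 2},
}

# Fact: monthly infrastructure cost events, keyed by dimension ids
fact_costs = [
    {"app_id": 10, "month": "2023-01", "cost_usd": 1200.0},
    {"app_id": 11, "month": "2023-01", "cost_usd": 800.0},
    {"app_id": 10, "month": "2023-02", "cost_usd": 1300.0},
]

def billback_by_team(facts, apps, teams):
    """Join facts to dimensions and roll costs up to the team level."""
    totals = {}
    for row in facts:
        team_id = apps[row["app_id"]]["team_id"]
        name = teams[team_id]["team_name"]
        totals[name] = totals.get(name, 0.0) + row["cost_usd"]
    return totals

print(billback_by_team(fact_costs, dim_app, dim_team))
# {'Claims': 2500.0, 'Lending': 800.0}
```

The point of the structure is exactly what the conversation describes: once the facts and dimensions exist, each new billing question is a join and an aggregation rather than a fresh spreadsheet.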
4:30
100%, 100%. And
4:32
a lot of it also is, the
4:35
thing that I've always loved about data and data
4:37
analytics is, it's an
4:39
objective way to make decisions. It
4:41
also provides some
4:44
eye-opening opportunities when you put things in front
4:46
of people and say, hey, these are the
4:48
observations, right? I mean, we could
4:50
debate what, how
4:52
we use it and what that means in the context
4:55
of the scenario, but this
4:57
is what we're observing,
4:58
right?
4:59
And oftentimes in the real world, if
5:02
you contrast that experience, we
5:04
both could be seeing the same events, but we could be
5:06
interpreting it very differently. And in
5:08
the data and analytics world, we have observations
5:10
and we can see that, and then we have inferences
5:13
that we can draw on it, but we have that dissected
5:15
framework to lean on. So
5:18
that's another kind of thing that has always
5:21
motivated me to kind of be
5:23
a disciple
5:25
of
5:25
the field, so to speak. And
5:28
to that point of shared definitions,
5:31
shared vocabulary, before we get too far
5:33
into the conversation at hand today of
5:35
data product strategies, let's
5:37
just start by identifying a shared understanding
5:40
of what we mean when we say data products
5:42
and how those might differ from data assets,
5:44
like a dashboard or a table or
5:47
a report that gets delivered quarterly.
5:50
And what is necessary? What
5:53
are the surrounding attributes for a piece
5:55
of data or a grouping of data
5:57
for it to be qualified as a product? Sure.
5:59
Sure, sure, sure. And I mean, there's
6:02
probably multiple definitions around it. So I'm going to give
6:04
you my rendition of what that means. And
6:06
so the disclaimer here is, you know, it is
6:08
not the definition, but it's a perspective and a point
6:10
of view. When I think about a product, a
6:13
product services a need
6:15
that a customer has, or a certain segment
6:18
of people that fit into a persona, right? So
6:20
you have a need, and then there is
6:22
an outlay for that need, and it
6:24
gets serviced. And the way that it gets serviced is
6:26
through a set of features. And then you have different
6:29
set of products that help you
6:31
get to the end of the job that
6:33
you may have for a particular experience that you
6:36
want to deliver. So okay, all of that
6:38
sounds very nebulous. But what does that mean in the context
6:40
of data? You use the word data asset. For
6:43
me, an asset means something that you can harness
6:45
value from, and log transactions
6:47
against. So there's a cost, and then
6:50
there's revenue that comes in.
6:52
A data asset, as you talked about, is
6:54
a type of data product. A dashboard
6:57
is a type of data product. A model
7:00
is also a type of data product.
7:02
And so these are interfaces
7:04
that you have customers use to
7:07
harness value, and then also assimilate
7:09
costs across it, right? For me, simply
7:11
put, a data product is thinking about
7:14
the customer and then the way that they
7:16
use data and its attributes
7:18
to make decisions and cataloging them into
7:21
a set of features that you
7:23
can then have expanded teams
7:25
put together that deliver it. But then
7:28
can also have long runtime roadmaps
7:30
that you kind of have, that you kind of can nurture
7:32
and then kind of grow over time. So what does that
7:34
mean? So let's say that we say that we're going to
7:36
build a C360 data
7:38
asset, right? So we're going to break
7:40
that down and try to identify the different features
7:43
that we would need to correctly depict
7:46
what a customer would be, and we would
7:48
think about it in a product. So what does that mean? A product
7:50
has a life cycle, you know,
7:52
it has release notes, it has releases,
7:55
it has a team that's long-lived that goes
7:57
through and produces this. We also
7:59
take care of regressions, we also take
8:01
care of things that we may need to deprecate
8:04
over time. We may think of
8:06
features that add on to these different modules.
8:08
And so thinking about the customer 360
8:11
data asset as a product and then
8:16
putting together a release roadmap that says these
8:19
are the features that are coming
8:21
out in this quarter, who's
8:23
going to be a technology evangelist
8:25
versus an adventure
8:27
enthusiast, those could be multiple variables
8:30
that come in. And they could be spread across two
8:32
different releases. And so I look
8:35
at the concept of a data
8:37
product as one that builds on top of each
8:39
other
8:40
and really thinks about the customer and the
8:42
way they use it and
8:44
how they use it
8:45
and provides them with interfaces
8:48
so that it's easier for them to use. So
8:51
the last thing that I'll say before I hand it back over
8:53
to you is the usage modality here
8:55
could be that we would like to
8:58
give a customer ID and we would like to get
9:00
to know if this person is a technology enthusiast
9:02
or not. And the best way to do that may
9:05
be consuming that data set through a RESTful
9:07
interface where it has a certain set
9:10
of specifications in the form of a contract
9:12
so that I can go through and enable
9:15
real time decision making. Great, that's
9:17
an interface into an asset that we have
9:20
culminating into a product that we go through and sell.
9:22
There could be other interfaces to it, which
9:24
is, hey, I want to consume all
9:26
of these records and batch and then make decisions
9:28
and go through and drive. Well, that's another interface,
9:31
again, into the same asset. So
9:33
it's just breaking this concept
9:36
of data usage and
9:39
what it means into a set of constructs
9:42
that we just kind of talked about.
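The two usage modalities described here, a RESTful point lookup for real-time decisions and a bulk export for batch consumers, can be sketched as two thin interfaces over one shared asset. The trait name and records below are hypothetical stand-ins, not a real profile store.

```python
# Two interfaces into the same data asset: a point lookup (the shape a
# RESTful contract would wrap) and a bulk export for batch jobs.
# The asset and its "technology_enthusiast" trait are illustrative.

PROFILE_ASSET = {
    "cust-001": {"technology_enthusiast": True},
    "cust-002": {"technology_enthusiast": False},
}

def lookup_trait(customer_id: str, trait: str):
    """Point lookup: one customer id in, one contracted answer out,
    suitable for real-time decision making."""
    profile = PROFILE_ASSET.get(customer_id)
    if profile is None:
        return {"customer_id": customer_id, "found": False}
    return {"customer_id": customer_id, "found": True, trait: profile.get(trait)}

def export_batch(trait: str):
    """Bulk interface: every record at once, for downstream batch consumers."""
    return [{"customer_id": cid, trait: p.get(trait)}
            for cid, p in PROFILE_ASSET.items()]

print(lookup_trait("cust-001", "technology_enthusiast"))
print(len(export_batch("technology_enthusiast")))
```

Both functions read the same asset; only the interface and its contract differ, which is the distinction being drawn in the conversation.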
9:44
And so with that shared
9:46
definition of what it means
9:48
to have a data product, what
9:50
are the pieces that we need to strategize about?
9:53
Why do we need a strategy? What purpose
9:55
does that strategy serve? And how does that
9:57
inform the work to be done?
9:58
Very good, very good,
9:59
because this is something that I think about
10:02
quite a bit. We talked about different types
10:04
of data products. We talked about type model.
10:06
We talked about type dashboard. We
10:08
talked about type data asset. You
10:10
can have further categories like information asset
10:12
as well. And so if you were to condense
10:15
these into patterns
10:17
patterns, then you start taking a look
10:20
at a value chain that comes from
10:22
a set of activities that, when done in unison,
10:25
produce an artifact, right? That
10:28
being a data asset. Okay, so
10:30
then the thinking here is you have patterns,
10:33
you have a set of activities that line up to
10:35
these patterns and you have a set of artifacts that are produced
10:38
as a result of that. Effectively what we
10:40
think of when we say a data product
10:42
strategy is the formulation
10:46
of that so that
10:48
we can go through and industrialize its production,
10:50
right? So that from the concept of inception
10:53
all the way to industrialization,
10:55
you can utilize this op model,
10:58
so to speak, to kind of produce this artifact
11:00
in a very streamlined way, okay?
11:03
So what does that mean? You know, you're producing a data
11:05
set. Let's just say it's a data asset that
11:07
you're producing. Let's go back to the other example of customer 360.
11:11
In order for you to source that information, you
11:14
may need to go to a CRM
11:16
system. And that CRM system, let's just say
11:19
it's Salesforce. The ingestion pattern associated
11:21
with ingesting data from Salesforce doesn't need to
11:23
be recreated again and again for sourcing,
11:26
let's say from one entity such as Contact or
11:28
another entity such as Account. You define
11:30
the pattern of the ingestion once and
11:33
then you go through and leverage
11:35
that highway, so to speak, for different objects as
11:37
they come along. And so you slowly start condensing
11:40
those set of patterns into
11:44
broader capabilities and then you
11:46
free up the development cycle for
11:49
producing these products and hone
11:52
in on these capabilities, right? And so
11:54
effectively what you end up doing is you make
11:56
the marginal cost of producing the next product
11:59
simpler and faster. And so effectively,
12:01
you harden this set of capabilities,
12:03
right. And so that's kind of that
12:06
whole piece of the puzzle is we
12:08
produce and develop a strategy. And that's why you
12:10
need one. Otherwise, what ends up happening
12:12
if you don't have one is the cost of producing
12:15
a product is just the same
12:17
or expensive again and again and again,
12:19
right. And so what you want to do is you want that cost curve
12:22
to come down. So hopefully that answered
12:24
the question. Yeah. And for those
12:26
elements of defining
12:28
or establishing what the strategy should be,
12:30
who is responsible for
12:33
guiding that process? Who are the people that need to be
12:35
involved from a kind of roles and
12:38
persona perspective? And
12:40
what are the things that might trigger
12:42
the development of a given strategy?
12:45
Yeah, yeah. So I think
12:47
you always got to start off with
12:49
the consumers of analytics in
12:52
mind. So they're very important stakeholders.
12:54
These are the folks who consume the analytics
12:58
being produced and action on it. So
13:00
think about somebody in the office of the CEO,
13:03
think about a chief credit risk officer who
13:05
takes a look at, you know, the
13:07
analytics being produced and says, Hey, this
13:10
is my overall risk for my portfolio within
13:12
the sector that I manage. This is how I
13:14
can curtail
13:17
my bookends with respect to, you know, certain hedges
13:19
that I'm performing. But that's, that's
13:21
a cohort, right? That's, that's a segment of the population
13:23
that provides you with, Hey, here's what
13:25
I'm going to do with the analytics
13:28
that you provide me. And this is what I decision
13:30
on and action on. And oh, by the way, this is
13:30
why I need what I need when
13:33
I need it. And that typically for
13:37
us is use cases, right? And they come from our stakeholders.
13:40
And the stakeholders kind of closer in the business
13:42
go through and drive that out. Those needs,
13:45
and that level of dialogue that goes on
13:47
that says, where in this business process,
13:49
do you actually embed this level of analytics?
13:52
How do you use it? Oh, what is the time
13:54
taken for you to go through and provide it? Is
13:56
there any sensitivity to the information being
13:58
provided? All of those kinds
14:01
of questions and answers that need to
14:03
kind of bring the use case to life in
14:05
our world is brought together through
14:07
the lens of a data product manager.
14:10
And in some organizations, you know, that could be
14:13
further bolstered with a product owner that's
14:15
a little bit more tactical, kind of taking
14:17
those needs and helping them kind of see the
14:20
technical definitions around it, or
14:22
it's fully owned by the product manager themselves.
14:25
And then what we have is we have a
14:28
set of, you know, software data
14:30
engineering managers who kind of sort of go
14:32
through and break this down in terms of, hey,
14:34
here's what the needs mean in
14:37
the way of thinking about non-functionals. How
14:39
do these come into play? And that's where
14:41
we really see the software engineering manager, data
14:44
engineering managers come in. They got to hear the
14:46
functional needs and then start saying, well, here are the non-functionals.
14:49
This is why this is what we need to do. Okay,
14:51
well, we're going to produce it in this way. We
14:53
should have some logging measures being
14:55
put into place. We need to have some telemetry. We
14:57
need to have some monitoring. And then they also take
15:00
all of the needs being articulated and
15:02
put them into functional requirements, and
15:04
then they start breaking them down. And the breaking
15:06
them down part really is where we
15:08
see, you know, TPMs
15:10
or scrum masters, or, you know, however you
15:12
want to call them, but effectively folks who can
15:14
take a set of functional requirements,
15:17
a set of non-functional requirements, and then
15:19
kind of divide them into
15:21
a plan of action that the team can execute
15:23
on. And then you have a set of developers, right?
15:27
Now they fit into multiple different brackets.
15:29
You know, it could be platform engineers.
15:31
It could be data engineers. It could be software engineers,
15:33
but they all kind of listen to these needs
15:36
that have been kind of dissected into stories.
15:38
And they start saying, okay, well, this is, if
15:41
we do this, then we can achieve this. Do
15:43
you agree with this? And that whole negotiation
15:47
going back and forth happens internally
15:49
to the team, and then also with the product manager,
15:52
and then ultimately signed off by the
15:55
software development manager or the data engineering
15:57
manager, and then it gets formulated into a set
15:59
of... release artifacts that we go
16:01
through and produce and provide out that
16:04
ultimately gets embedded into the business
16:06
workflow. Now all of this stuff
16:09
is going to be useless if we don't
16:11
have a
16:12
really good
16:13
business enablement, customer success
16:16
driven viewpoint in which we're doing change
16:18
management both on the technology side
16:20
but then also on the business side which is now you're going to get
16:22
this new analytical component.
16:25
How are you going to use it? So for example,
16:27
let's say there's a propensity for failure of paying
16:30
back a loan. How are you going to use
16:32
it when you make the loan origination decision?
16:35
When should you pull the lever to
16:37
say this model doesn't make sense, these answers
16:39
don't make sense? And oh by the way,
16:41
how do you tune it? Where do we monitor
16:44
that and how do you make decisions on
16:46
it? So it's a combination of
16:49
different items coming together along with
16:51
different roles and they encompass
16:53
all the way from change managers. Sometimes
16:57
these are played by the product manager and then some
17:00
analytical translators or folks or
17:02
business analysts directly in the business. It just really
17:04
depends on the company that you're in
17:06
and the role that they play. But that
17:09
is what the left-to-right side
17:11
of the equation would look like to produce a
17:14
data product
17:14
and the different activities
17:16
that would go into it.
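The "when should you pull the lever" question about the loan-propensity model can be made concrete with a small monitoring check. The baseline and drift threshold below are assumed values for illustration, not a prescribed policy.

```python
# A hedged sketch of model monitoring for a loan-default propensity model:
# compare the recent average score to a release-time baseline and flag the
# model for human review once it drifts past a threshold. The numbers are
# made up for illustration.

BASELINE_MEAN = 0.12   # average propensity observed at release (assumed)
DRIFT_LIMIT = 0.05     # how far the mean may move before someone investigates

def needs_review(recent_scores, baseline=BASELINE_MEAN, limit=DRIFT_LIMIT):
    """Return True if recent output has drifted enough that the business
    should stop trusting the model blindly and tune or retrain it."""
    if not recent_scores:
        return False
    mean = sum(recent_scores) / len(recent_scores)
    return abs(mean - baseline) > limit

print(needs_review([0.10, 0.13, 0.11]))  # close to baseline: False
print(needs_review([0.30, 0.28, 0.35]))  # drifted: True, pull the lever
```

A real deployment would track more than a single mean (input distributions, outcome rates, segment-level drift), but the shape is the same: an agreed threshold that turns "does this model still make sense?" into an operational check.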
17:20
This episode is brought to you by DataFold, a
17:22
testing automation platform for data engineers
17:25
that finds data quality issues before the code
17:27
and data are deployed to production. DataFold
17:30
leverages data diffing to compare production and development
17:33
environments and column level lineage to
17:35
show you the exact impact of every code change
17:37
on data, metrics and BI tools, keeping
17:40
your team productive and stakeholders happy.
17:43
DataFold integrates with DBT, the modern data
17:45
stack, and seamlessly plugs into your data
17:48
CI for team-wide and automated testing.
17:51
If you are migrating to a modern data stack, DataFold
17:53
can also help you automate data and code validation
17:56
to speed up the migration. Learn more about
17:58
DataFold by visiting DataEngineeringPodcast.com
18:01
slash datafold today. And
18:05
regardless of whether you actively engage
18:08
in defining and implementing a particular
18:10
strategy, there's always going to be
18:12
a strategy. It's just a matter of whether you are
18:15
explicit and purposeful about it,
18:17
or if it is just something that is emergent. And I'm
18:19
curious what you have seen as some of the
18:22
juxtaposition of teams that are very deliberate
18:24
about the definition and execution
18:27
of a product strategy for their
18:29
data assets versus teams that
18:31
just leave it to, well, this is what we're doing.
18:34
It'll just
18:34
emerge and we'll figure it out as we go
18:36
kind of approach. Yeah, I think it's a good question.
18:39
I think when you
18:41
have very product centric teams
18:44
that are exclusively focused on
18:46
enabling, let's say a product
18:48
that they're going through and releasing,
18:51
and they have analytics as a tie-in
18:53
to that product, I see them leveraging
18:56
and kind of latching on to the product strategy
18:58
itself and analytics and data kind of
19:01
sort of, they don't fall by the wayside, but
19:03
they're secondary actors within
19:05
that overall equation, right? Which means
19:07
what? You typically have one business intelligence engineer,
19:09
you have one data engineer that's within the group and
19:11
their entire purpose of existence is to help the product
19:14
manager rationalize decisions based
19:16
on either funding, customer decisioning journey,
19:19
churn, whatever it may be, ARR, like
19:21
whatever it is that they want to do, the flavor of the day
19:24
is what they work on, right?
19:26
So in that case, they're
19:28
not really
19:29
coming up for air as much and thinking about and saying,
19:31
hey, here are the 14 or 15 different
19:33
questions that I get asked. Here's how I
19:35
can start laying out the foundation so
19:38
that I don't need to do the same amount of work that
19:40
I do for answering those 14 different questions.
19:43
Let me start formulating and sourcing
19:45
data that will create these core
19:48
entities that I can then use to mash up
19:50
and oh, by the way, let me build a dashboard on top of it so
19:53
that the product manager can do it themselves,
19:55
right? So that side of
19:57
the equation is where I see less of that
19:59
and it's more where the product
20:01
is central and the product manager
20:04
is driving all of the kind of work
20:07
and centering it on the product itself, right?
20:09
So I don't see a strategy that coherently
20:11
kind of describes anything in scenarios
20:14
like that. Where I've seen companies
20:16
use or kind of dive in really
20:18
into data product strategy
20:21
is where there is a focus
20:25
on building data products. But
20:27
there is an aspect of doing it in a centralized
20:29
fashion. And it's not that everything
20:32
is centralized, but it could be that a core set
20:34
of the infrastructure
20:35
is centralized, a core set of the
20:38
assets being brought in as centralized.
20:40
So what does that mean? Let's go back to the example
20:42
that I gave about Salesforce. It's a CRM system.
20:46
It's got a ton of assets
20:48
within it. What does that mean? It's got contact.
20:50
It's got account. It's got
20:52
leads. There's leads that
20:55
are being qualified there. There are sales. There's
20:57
tons of information there. Do we want
20:59
every single team to go through and source
21:01
that information again and again and again? Probably
21:04
not. I mean, if you think about it on an ingress and egress perspective,
21:07
it doesn't make sense.
21:09
So
21:10
you have a set of teams that go through and say, hey,
21:12
here. We're going to model how
21:14
to particularly use contact and
21:16
account and the relationships between it. And
21:19
we're going to manifest it in a place that
21:22
makes it easy for teams to go through
21:24
and source it. OK. Well,
21:26
when they do that, they're effectively
21:29
centralizing that
21:31
capability of data ingest and
21:33
data rationalization. So when you're
21:35
a consumption-driven team coming
21:38
through, you have to learn that mnemonic
21:40
and you have to go forward. So in
21:42
cases like that where people are looking for efficiency
21:44
gain through central harmonization
21:49
of data, I see those
21:51
kinds of companies do data product
21:53
strategy more and more.
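The "define the ingestion pattern once and reuse the highway" idea for Salesforce objects can be sketched as a single shared ingestion function parameterized by entity. The fetch functions below are hypothetical stubs standing in for real Salesforce API calls, not an actual client.

```python
# One shared ingestion pattern, reused per source object: fetch records,
# stamp provenance, land them in a common sink. New entities (Contact,
# Account, Leads, ...) ride the same "highway" instead of rebuilding it.
# The fetchers are illustrative stubs, not real Salesforce calls.

def ingest(object_name, fetch_fn, sink):
    """Apply the shared ingestion pattern to one source object."""
    for record in fetch_fn():
        record["_source_object"] = object_name  # provenance stamp
        sink.append(record)
    return len(sink)

def fetch_contacts():   # stub for the Contact entity
    return [{"id": "C1", "email": "a@example.com"}]

def fetch_accounts():   # stub for the Account entity
    return [{"id": "A1", "name": "Acme"}]

landing_zone = []
ingest("Contact", fetch_contacts, landing_zone)
ingest("Account", fetch_accounts, landing_zone)
print([r["_source_object"] for r in landing_zone])  # ['Contact', 'Account']
```

The marginal cost of the second object is just its fetcher; everything else, the provenance stamping, the landing, and whatever monitoring wraps `ingest`, is paid for once.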
21:55
And then
21:57
the tier that sits in the middle,
22:00
they don't necessarily agree on the centralization
22:03
or the decentralization, but they agree
22:05
a ton on
22:08
the ways of working and the standardization of
22:10
the ways of working. So if you think about, hey,
22:14
what does continuous integration look like in
22:17
the concept of data engineering? What
22:19
does continuous deployment look like? And what does that mean?
22:22
And so if you have teams that are really software
22:25
focused but then are trying to enable data
22:27
products, they
22:30
hinge on the ways of working. They say, hey, well, let's
22:32
have a repo structure that's
22:34
conducive to working on data engineering
22:36
efforts. So they go
22:39
through and drive out a set of standardization
22:41
there. And that's hinged on kind
22:43
of what is a data product and how do we enable
22:46
that? So those are the three kind of verticals
22:48
that I see as I've
22:50
kind of scavenged the field.
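The "what does continuous integration look like for data engineering" question raised here often comes down to checks that run against the data on every change and fail the build before a bad release ships. The rules and sample records below are illustrative, not a specific team's standard.

```python
# A sketch of CI for data engineering: assertion-style data quality checks
# a pipeline job would run on every change. The rules and field names
# (customer_id, churn_score) are made up for illustration.

def run_checks(rows):
    """Return a list of failed-check messages; an empty list means the
    build passes and the release can proceed."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
    for i, row in enumerate(rows):
        if row.get("customer_id") in (None, ""):
            failures.append(f"row {i}: missing customer_id")
        if not (0.0 <= row.get("churn_score", 0.0) <= 1.0):
            failures.append(f"row {i}: churn_score out of range")
    return failures

sample = [
    {"customer_id": "C1", "churn_score": 0.2},
    {"customer_id": "", "churn_score": 1.7},
]
print(run_checks(sample))
```

In a repo structured for data engineering work, checks like these live next to the transformation code they guard, so "continuous integration" means the data contract is tested the same way the software is.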
22:52
You mentioned the different roles and responsibilities
22:55
throughout the process of designing
22:57
and implementing a product strategy.
23:00
But of course, not every company is going to
23:02
have the same sets of people, the
23:04
same titles, or a given title
23:06
might not even exist or might be spread across separate people. And
23:09
I'm curious how you have seen the
23:11
size and structure of
23:14
different teams, both within and
23:17
adjacent to the data
23:19
capabilities influence the ways
23:21
that people approach the concept of how
23:24
to strategize what the scope of
23:26
a given product looks like, et cetera. Yeah,
23:28
and I think this really depends on verticals
23:30
and the kind of vertical that you belong to and
23:33
the importance given to data within
23:36
that vertical. So if you take a look at the insurance
23:38
business, they've been using data to make decisions
23:41
for a very, very long time. So their maturity
23:44
around data
23:46
management and the need for it is
23:49
super high because when you make rate changes on insurance
23:51
policies, you need to file this data. So guess
23:53
what? Like you automatically are thinking about data
23:55
retention. You're automatically thinking about
23:58
the
23:58
governance around
23:59
that data that's used to make those decisions.
24:02
You're automatically thinking about the
24:04
fact that, you know, you have a deadline to submit
24:07
these things and you have SLAs
24:09
in place, okay, well, they need to be done
24:11
in a particular order so that it can be good. It
24:13
can go through and be deployed. Every,
24:16
like, you know, if you take an actuarial
24:18
scientist, right, like there's a particular way that they
24:20
go through and do their business. So there's a
24:23
certain hygiene around the way that they
24:25
think about processing the data so that
24:27
they can then answer the questions that they seek.
24:30
So based on the vertical that
24:32
you're in, or the industry vertical that
24:34
you're in, and the importance that data has within
24:36
its own, its relevance, right,
24:38
will dictate how much you think about
24:41
data and how much you think about
24:43
all of the -ilities that come with it. And so
24:46
in the insurance example that I gave you, they
24:48
inherently have a strategy, but it's embedded
24:50
in the way that the vertical exists. So
24:53
you may not need a business analyst. You could have an actuarial
24:55
scientist, you know, maybe a junior one who
24:57
functions as one. And they write documents,
24:59
they write requirements in the sense that that
25:02
depicts the process flow where things need to happen
25:04
or not. And so that's one example, right?
25:06
In another example, you could have, and
25:09
they may need to have, you know, separation
25:12
of roles, because they're probably in a regulated
25:14
business where, you know, the person doing the
25:16
math cannot be the one that checks the math. And
25:18
therefore, you know, the way that they've kind of written these rules
25:21
says, you know, we need to physically have them as
25:23
being separate so as to guarantee a level
25:25
of quality that they go through and drive out. It
25:27
is that way in the life sciences business,
25:29
for example, right? There's a large
25:32
focus on QC, because imagine
25:34
getting a drug that hasn't
25:36
been QC'd as much as it should
25:39
have. So there's a certain
25:41
set of operating protocols
25:43
and procedures that have been
25:45
grounded on mitigating risk and
25:48
increasing quality. And that's kind of
25:50
led to an op model where you have different people
25:53
doing a different set of roles. And that's dictating
25:55
the way that the industry operates and
25:57
drives.
25:58
So that's another factor,
25:59
which is the industry drives the kind of
26:02
roles based on the way that they segment things. The
26:04
third is where data is used
26:06
as an enabler, but the cost of getting it wrong is not that high,
26:09
and you need it for directional
26:11
correctness rather than, you know,
26:13
exactness, right?
26:15
So in the example of life sciences or even
26:17
in cyber, right, like, or in security,
26:20
you cannot get
26:22
things right on average. Those things don't
26:24
happen in those verticals, right? You have to get
26:26
it right every single time. Versus
26:29
in retail, for example, right? You're
26:31
less likely to give your address
26:35
to a person who says, hey, can I get your address right
26:37
at the checkout desk, versus,
26:39
you know, like, you're within a financial services
26:41
institution and they ask you what your address is so they
26:44
can send you statements. I guarantee you,
26:46
one has a higher likelihood of you giving the most accurate
26:48
information compared to the other. So
26:51
you get a ton of garbage in, right,
26:54
in some of these verticals, like, you know, for example, in retail. So
26:57
then you start saying, okay, well, you
27:00
know, you're going to have to formulate
27:02
a ton of rules to get it
27:04
right, and so you start to say,
27:07
okay, well, there's a lot of definition here. There
27:10
isn't a lot of criteria
27:13
that we put on the docket. We just need to
27:15
do a ton of iterations and go
27:17
through and get the answer right. So what I've seen in industries
27:19
like that is you don't have a ton of rules. You
27:22
have, you know, one developer, one solution
27:24
that goes deep, right? They may do the business
27:26
analysis reporting. They may do the data engineering.
27:29
They may also help the business
27:31
in doing the governance itself by
27:33
flagging elements that are out of sync
27:36
or out of place. So there's
27:38
a very long-winded way to say, Tobias, like,
27:41
depending on the industry vertical that you're in and
27:44
the place that data has in
27:46
the relevance of the decision-making process
27:50
and the kind of inputs that they get, you know,
27:52
high quality versus not, the cost
27:54
of getting it wrong versus being directionally correct, all
27:57
define the number of rules that are being put
27:59
in place. It's almost like a spectrum, right?
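The rules-heavy approach described here for high-garbage verticals like retail can be sketched as a set of flagging rules that surface what is out of place rather than rejecting records outright. The rules and field names below are invented purely for illustration:

```python
# Illustrative data-quality rules of the kind described for "garbage in"
# verticals like retail: each rule flags a record rather than rejecting
# it, so one person can triage what is out of sync or out of place.
def check_record(record):
    """Return a list of rule names that the record violates."""
    flags = []
    if not record.get("email") or "@" not in record["email"]:
        flags.append("invalid_email")
    zip_code = record.get("zip", "")
    if not (zip_code.isdigit() and len(zip_code) == 5):
        flags.append("suspect_zip")
    if record.get("address", "").strip().lower() in {"", "n/a", "none", "123 main st"}:
        flags.append("placeholder_address")
    return flags

records = [
    {"email": "a@example.com", "zip": "02139", "address": "1 Elm Street"},
    {"email": "asdf", "zip": "00000x", "address": "n/a"},
]
flagged = {i: check_record(r) for i, r in enumerate(records) if check_record(r)}
print(flagged)  # only the second record trips the rules
```

In a retail-style setting the list of such rules tends to grow with every new source of junk input, which is the iteration loop described above.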
28:02
So that's kind of one way to take
28:04
a look at it. And so
28:06
now, what is the commonality
28:08
that you see regardless of the industry that you're in?
28:11
I think it comes down to the artifacts. Regardless
28:13
of how many roles you have and
28:16
which industry that you're in, a
28:18
technique that I've seen work well, at
28:21
least from my perspective, when
28:23
you formulate these kinds of strategies is
28:25
a set of interviews with stakeholders. These
28:27
are people typically who are making
28:29
decisions using the analytics that you're providing. Taking
28:33
that set of use cases, rolling
28:36
that up into a set of capabilities
28:40
that need to be invested in,
28:43
dissecting that, building programs which
28:46
have projects within it that then get
28:48
executed on, that
28:51
then kind of tie into metrics that
28:53
say, this is why we did what we did and
28:55
this is the value that we're going to get. I know
28:57
by the way, through that process,
28:59
this is the data sets that we're governing and this is
29:01
how we're governing it without maybe using
29:04
those words is what I've seen work
29:06
well. And it also disarms
29:09
organizations because many times when you
29:11
go in and you say, hey, we're gonna stand up, like
29:14
the outcome of a data product
29:16
strategy is a team that we need
29:18
to build up of like 50 people.
29:20
In an organization that doesn't have data
29:22
within the decision-making nomenclature,
29:26
right? That's gonna be a tough one
29:28
to stomach. But even in one that is
29:31
driven in that way, 50 is
29:34
a large number, they're gonna buck anyways and
29:36
they're gonna say, well, we're getting efficiency with
29:38
one person, why would we need to do it differently? So
29:40
I think just focusing on the artifacts and
29:42
really thinking about how do you take these
29:45
use cases, hydrate that up
29:48
into a set of portfolios and programs
29:51
that you can execute on, dovetailing with metrics
29:53
and governance is the way to
29:55
go regardless of who does it.
29:57
The other interesting element of a data
30:00
product is the audience,
30:02
where, because data
30:04
has so many different potential stakeholders and
30:06
consumers, that will drastically influence
30:09
the overall user experience
30:12
they're trying to optimize for, because the
30:15
core element of something being a product
30:18
is that it is consumable out of
30:20
the box, versus just, "here's
30:22
some data, good luck, you know, you can pull
30:24
it from this S3 bucket if you want." You
30:27
know, as a product, you know, you go
30:29
to a Netflix, that's a product. You
30:31
go to Amazon, that's a product for e-commerce.
30:34
If you are a data consumer, and,
30:36
you know, if I give you a CSV
30:38
file and I'm an average person
30:41
who is just trying to answer a question, what
30:43
is the CSV file gonna do for me? But if I
30:45
have a search box where I can type a question
30:47
and then you're using that underlying data to give an answer,
30:50
that's a better experience. Whereas
30:52
if I'm a data engineer, you can give me a CSV
30:54
and a little bit of documentation on what to do with it.
30:57
And so I'm curious, what
30:59
are some of the useful questions that teams need to be
31:01
asking in the development of that product
31:03
strategy that will inform the
31:06
implementation details and the types of technologies
31:08
that they need to bring to bear on the solution?
31:10
Yeah, thank you for kind of highlighting
31:13
the importance that interfaces play
31:16
in the role of a data product. So I think
31:18
one of the things that you kind of mentioned, in
31:20
the examples you gave about Netflix and Amazon and
31:22
everything else, is, you know, let's just take
31:25
the example of maybe Amazon, right? You come in, you search
31:27
for a product, and you buy it, right? But
31:29
why did you search for that product? You had a need. You
31:32
know, let's say you're buying household goods, you know,
31:34
you're buying a household cleaner, right?
31:37
You're going in there and you're trying to search for something
31:39
because you want to clean your house, right?
31:42
And you want to do it in a self-sufficient way,
31:44
you know, you want to buy a product. But then,
31:46
as you browse, okay, well, there's different
31:48
choices that you have, you know. But the
31:50
point I'm trying to make is that
31:53
whole concept of Amazon and search
31:56
is in the context of a need that
31:58
the customer has,
31:59
and
32:00
the life cycle that they're
32:03
in, that this fits into.
32:06
And so what
32:08
does that mean in the context of data products?
32:12
As we start collecting these use cases, a big thing
32:15
that we do and we emphasize on is how
32:17
are they going to be used and how
32:19
often are they going to be used and in what context
32:21
are they going to be used. So
32:23
for example, if someone says, I need this information,
32:27
let's say that they produce a propensity score
32:29
for the person's ability
32:31
to either default or not on the loan. I
32:34
need it within five seconds. Every five seconds, it
32:36
needs to be refreshed. My question always
32:39
is, let's say that I give it to you in four seconds.
32:42
What are you going to do with the one second that you save? Let's
32:45
say I do that. What are you going to do the next five seconds
32:47
before the data comes in? And that leads to a very
32:49
interesting conversation, because what effectively
32:52
you're trying to unravel is
32:54
what's next? What do you do next? In
32:56
the context of the Amazon example that you gave, I
33:00
take it and then that
33:02
spray comes home and then I clean my
33:05
table with it. Well,
33:08
that's good. What do you do?
33:09
Well, yeah, and then I store it. Well,
33:12
in the context of data products, in the
33:14
context of the example that I gave you, well,
33:17
I take that output that
33:19
you provide and then I make a decision off of
33:21
it. What do you do with that decision? Well,
33:24
basically, in the flow of the
33:26
application, the loan origination application,
33:28
the customer is going to be able to see if
33:31
they got a yes or no in terms
33:33
of the loan that they were asking for because I take
33:35
this variable and I weight that
33:37
by 70%
33:38
because I heavily weight
33:41
this to say, if this is a yes
33:43
or no,
33:44
it kind of determines if they get the loan or not.
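The decision flow described here, where a default-propensity score is one heavily weighted input to the loan outcome, might look roughly like this. The 70% weight comes from the conversation; the threshold, second signal, and function names are illustrative assumptions:

```python
# A rough sketch of the described flow: a propensity score (here,
# likelihood the customer is a good risk) is combined with other
# underwriting signals, weighted 70/30, to approve or decline a loan.
# The threshold and second signal are invented for illustration.
def loan_decision(propensity_good, other_signal, weight=0.7, threshold=0.6):
    """Combine a model score with other underwriting signals."""
    combined = weight * propensity_good + (1 - weight) * other_signal
    return "approve" if combined >= threshold else "decline"

print(loan_decision(propensity_good=0.9, other_signal=0.4))  # approve
print(loan_decision(propensity_good=0.3, other_signal=0.9))  # decline
```

Because the score dominates the decision, its freshness and delivery interface become product requirements, which is exactly where the conversation goes next.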
33:46
Oh, wow. OK, that's interesting. So
33:48
now you start walking backwards from there and you start
33:50
saying, OK, well, a CSV
33:53
file probably won't scale for that. How
33:55
are you going to do a reach in for this? Well,
33:57
hey, we typically, like within
33:59
the application
33:59
that we have.
34:01
We use RESTful interfaces for doing everything
34:03
that we go through and drive
34:05
out. Okay. All right. So now you start
34:07
saying, okay, well, now we need to start using APIs. Okay.
34:10
They need to be discoverable. Well, how, like,
34:13
what kind of validation do you do on this to make
34:15
sure that it isn't something that's so wild?
34:17
What happens as a result of that? Okay. Well, then
34:19
as you start having these conversations with
34:22
your customer in the way that they are going to be
34:24
using that analytics, you start formulating
34:27
the interfaces that they're going to be using, the channels
34:29
that they're going to use to soak
34:31
up the intelligence that you're providing, whether
34:33
it's core data or insights
34:36
or information, knowledge, you name it.
34:39
You know, that's what it is. Right. And so
34:41
that starts to formulate the way that,
34:43
you know, you start providing these interfaces
34:46
and
34:46
the same data set or
34:48
information asset or data
34:50
asset, these different types of data products could
34:53
have multiple channels. Right. For example,
34:55
one of the things here could be that, you
34:58
know, in the context of the persona that
35:00
you gave, right, of a data engineer, they could
35:02
be wanting that data set
35:04
through an S3 interface. Yeah. Like something that they
35:07
can consume and batch and then do some reconciliation
35:09
on. So the way that the consumer
35:11
utilizes it in the context of the decision-making
35:14
will dictate the interfaces. And
35:16
those interfaces are what we build that
35:18
then says, Hey, here's the product
35:20
that we're building. Here's how we, here's
35:23
how we deliver it to you so that you
35:25
can consume it. What are your consumption patterns?
35:27
And you got to keep that front and center as you walk in and through,
35:30
because the last mile optimization
35:32
on that is driven off of those items. And
35:34
then there's some interesting nuances as well. Right.
35:37
In the last mile consumption piece, you're less worried
35:39
about duplication. You're less worried about,
35:41
you know, Oh my God, am I copying this data or
35:43
am I copying this in 14 different ways? You're
35:45
more worried about is, is the interface
35:48
optimal for the
35:50
consumption, right? Versus optimal
35:52
for storage and distribution.
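The "same asset, multiple channels" idea above can be sketched as one governed data set exposed through two consumption interfaces: a low-latency lookup for the in-flow application and a batch export a data engineer could reconcile against. All names and values here are illustrative:

```python
# Sketch of one data product served through two channels: an API-style
# single-record lookup, and a bulk CSV export (e.g. dropped to S3).
import csv
import io

SCORES = {"cust-1": 0.82, "cust-2": 0.35}  # the governed data asset

def api_lookup(customer_id):
    """Low-latency, single-record channel for an application flow."""
    return {"customer_id": customer_id, "score": SCORES[customer_id]}

def batch_export():
    """Bulk CSV channel for batch consumption and reconciliation."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["customer_id", "score"])
    for cid, score in sorted(SCORES.items()):
        writer.writerow([cid, score])
    return buf.getvalue()

print(api_lookup("cust-1"))
print(batch_export())
```

Note the deliberate duplication: the same values exist behind both channels, optimized for consumption rather than for storage, which is the trade-off described above.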
35:57
As more people start using AI for projects,
35:59
two things are clear. It's a rapidly
36:01
advancing field, and it's tough to navigate.
36:04
How can you get the best results for your use case?
36:06
Instead of being subjected to a bunch
36:09
of buzzword bingo, hear directly from
36:11
pioneers in the developer and data science
36:13
space on how they use GraphTech to build
36:15
AI-powered apps. Attend
36:17
the Dev and ML Talks at Nodes 2023, a free online conference
36:22
on October 26th featuring some of
36:24
the brightest minds in tech. Check
36:26
out the agenda and register today at neo4j.com
36:30
slash nodes. That's N-E-O,
36:33
the number four, j.com
36:35
slash N-O-D-E-S. I'm
36:40
curious how technical debt
36:42
factors into the overall process
36:45
of the development
36:47
and consideration around what
36:49
the strategy is and how to approach
36:52
implementation, both in terms of,
36:54
I already have this existing technical debt, and
36:56
so that constrains the available
36:59
set of capabilities that I have, or it will extend
37:01
the delivery timeline. But
37:03
also, this is the strategy that
37:05
I want to implement. This is the timeline I'm committing
37:07
to. So now I need to consciously take on
37:10
this additional technical debt. I'm just curious
37:12
how that plays out in the overall process.
37:14
It's a good question. And I say that
37:16
only because we all accumulate
37:18
technical debt, and I haven't quite seen, to
37:21
the extent that I would like to, out in
37:23
the wild, including
37:26
when I used to be on the
37:28
other side of the fence, like leading data teams in
37:30
corporations, both in tech and non-tech, do
37:33
it well. And so here's
37:35
one of the ways that I've seen get
37:40
close to doing it well. It's really
37:42
negotiating a percentage of your
37:44
execution backlog to be dedicated
37:47
to technical debt that the
37:50
engineering team has accountability
37:52
for, in terms of prioritizing so
37:54
that the overall cost of delivery comes down.
37:57
So what does that mean? In the backlog, you could have
37:59
new features, bug fixes,
38:02
and technical debt all be
38:05
commingled. And so what we've seen work
38:07
well is you take about 30% of that backlog
38:10
and you say, hey, we're going to dedicate
38:12
this to technical debt and we're going to give the accountability
38:14
to the data engineering managers or the software
38:16
engineering managers to go through and drive it. They
38:19
prioritize it, they put it on there so that you go
38:21
through and see it, you go through and move it accordingly. The
38:23
product manager should be able to see if they're doing
38:25
a good job of it or not by tracking
38:28
the overall cost of production and the
38:30
operating and maintenance costs associated with
38:33
the product itself. So
38:35
the lower amount of tech debt that you have, I
38:38
think you can see it in a couple of different ways.
38:40
One is your O&M costs are going to go lower.
38:43
And typically, O&M costs, operating and maintenance
38:46
costs are roughly in the 50% mark. So if you can bring
38:48
that down by 20 and kind
38:50
of bring it even into 30 or if you're super
38:53
optimal into the 15% range, that's
38:55
a good indicator that you're resolving
38:57
your tech debt as much as you can. The
39:00
other nascent thing to look at
39:02
is attrition. When you have
39:04
a really poorly built product
39:06
and it's going to be tough for
39:09
you to maintain people on the operating
39:11
and maintenance side of the equation to go through and drive that
39:13
out. So that's one other
39:15
side of the equation. The other thing that I've seen
39:17
work really well is in terms of
39:19
tech debt, because when people go through
39:22
and provide
39:24
these strategies or
39:26
even these patterns and drive
39:29
them out, at that point in time,
39:31
they were probably the best
39:34
and the greatest. But over time,
39:36
like anything else, everything deteriorates. Technology
39:39
is moving at a faster clip rate. So
39:41
having a dedicated time during
39:44
your execution mechanism, like one
39:46
sprint out of seven in
39:48
a classic PI type setting with agile,
39:51
sorry, SAFe Agile, could be one
39:53
where you kind of take a step back and you allow the practitioners
39:56
on the floor to drive and step forward
39:58
who are actually the ones that are closest to
40:00
the pain to say, hey, there are
40:03
these new ways of building things. Can we go through
40:05
and try and implement them and see where they
40:07
go? And that kind of raises
40:10
the bar in terms of making sure
40:12
that not just tech debt stays in check,
40:14
but then you're innovating.
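The O&M heuristic mentioned earlier, operating-and-maintenance spend at roughly the 50% mark as typical, 30% as good, and 15% as highly optimized, could be tracked with a simple indicator like this. The thresholds come from the conversation; the cost figures and function name are made up:

```python
# Illustrative tech-debt health indicator: O&M spend as a share of
# total product cost. Thresholds follow the heuristic in the episode
# (~50% typical, ~30% good, ~15% super optimal).
def om_health(build_cost, om_cost):
    ratio = om_cost / (build_cost + om_cost)
    if ratio <= 0.15:
        return ratio, "excellent"
    if ratio <= 0.30:
        return ratio, "good"
    if ratio <= 0.50:
        return ratio, "typical"
    return ratio, "tech debt likely piling up"

print(om_health(build_cost=700, om_cost=300))  # (0.3, 'good')
```

A product manager could watch this ratio over time, alongside attrition, as the two signals described for whether the dedicated tech-debt backlog slice is working.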
40:16
In some cases, what I've seen is teams
40:19
completely stop delivery of net new
40:21
features. And saying that, you
40:23
know what, the way that we're gonna resolve this is we're gonna
40:25
take care of all the bugs, right? So we're gonna have something
40:27
called a bug bash, and then take
40:29
that completely down,
40:31
right? Maybe they do like a month long
40:33
worth of effort there.
40:35
And then they go through and throttle
40:38
their backlog, so to speak, so to
40:40
make sure that they can get back in line. So these are different
40:42
ways that I've seen teams go through
40:44
and manage this concept of tech debt. And
40:46
then the last thing that I'll mention is, the concept
40:48
that I talked about,
40:50
use cases being hydrated up into
40:52
a set of patterns and these patterns kind
40:54
of going into capabilities.
40:56
It's really important to kind of go through and score
40:59
those capabilities on a yearly basis to say, how
41:01
well are we doing, right? And sanitize
41:03
that and say, and that's another way to measure architecture
41:05
as well. And that I
41:08
have yet to see teams do a good job in because
41:10
they just don't think of
41:12
architecture and scoring the architecture in that way.
41:14
You know, someone writes a blueprint, you know, it's super
41:17
high level, somebody goes through and implements it, and
41:19
we never score those capabilities. Like for example,
41:22
how are our data ingestion capabilities? Is it a nine
41:24
out of 10?
41:25
Why is it a nine out of 10?
41:26
Well, guess what folks, we can't ingest TSVs,
41:28
okay. How important is it?
41:31
Do we have any use cases that go through that? Well,
41:33
yeah, we do. You know, we have five
41:35
out of 20 use cases that are doing that, okay.
41:38
Well, how much time are we spending as a result of that? Well,
41:40
our sprint points are X, you know, for
41:43
these kinds of things. That kind of telemetry
41:45
walking that backwards and then saying, hey, this
41:47
is how we score architecture. I haven't
41:49
seen that as much in the wild,
41:52
if any. But I think that's another
41:54
way to kind of score architecture based on the capabilities
41:56
that you've driven and to make sure that these
41:59
tech debt items
41:59
kind of get brought to the surface.
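The capability scorecard described here, scoring each capability and walking the telemetry backwards through affected use cases, could be kept as something as simple as the following. The entries are illustrative; the TSV ingestion gap is the example from the conversation:

```python
# A minimal yearly capability scorecard: score each capability, note
# the gap, and tie it to how many use cases the gap touches, so tech
# debt items get brought to the surface.
capabilities = [
    {"name": "data ingestion", "score": 9, "gap": "cannot ingest TSVs",
     "use_cases_affected": 5, "total_use_cases": 20},
    {"name": "access control", "score": 6, "gap": "manual grants",
     "use_cases_affected": 12, "total_use_cases": 20},
]

def prioritize(caps):
    """Rank gaps by the share of use cases they touch, worst first."""
    return sorted(caps,
                  key=lambda c: c["use_cases_affected"] / c["total_use_cases"],
                  reverse=True)

for cap in prioritize(capabilities):
    print(f'{cap["name"]}: {cap["score"]}/10, {cap["gap"]} '
          f'({cap["use_cases_affected"]}/{cap["total_use_cases"]} use cases)')
```

Adding sprint points spent working around each gap, as suggested above, would turn this from a static blueprint into the telemetry-backed architecture score the speaker says he rarely sees.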
42:02
And circling back on the interface
42:05
of the product, there's also the question
42:07
of customer education of how
42:10
much context and how
42:12
much familiarity do they need to have
42:14
of the data, of the statistical
42:17
aspects of that data in order to be able
42:19
to use it to effectively
42:22
make decisions or is the
42:24
understanding that they're reaching actually accurate
42:26
based on their background? And
42:29
I'm wondering how you've seen teams try
42:31
to approach that element of delivering
42:34
the data product, delivering the
42:36
guardrails or surrounding
42:39
capabilities so that the
42:41
end user is able to actually
42:43
effectively make use of that product
42:45
without having to have somebody sitting beside
42:47
them saying, okay, this is what you need to know. These
42:50
are the steps to actually use this thing. These are
42:52
the other things that you need to do after the fact,
42:53
etc.
42:54
Good question. I'll start off with a story,
42:56
right? I think all of us will be
42:58
very familiar with this one. That number is
43:00
incorrect. And they're like, why is that
43:03
number incorrect? Because the person did the
43:05
roll up in the wrong way. Okay,
43:07
well, it was obvious that the column
43:09
was there, so I ended up rolling it up. Well, what you didn't
43:12
do is you didn't apply a filter because
43:14
it's not a column actually, you have to apply
43:16
a filter for this column and then do an aggregation
43:19
and then you'll get the right number because effectively what you've
43:21
done right now is you've made it 10x the
43:23
number that it is. And so these
43:26
kinds of stories, right, I've genericized
43:28
it, but these kinds of stories are pervasive, like all
43:30
of us have heard it, right?
43:32
And so if you think about it and ask,
43:34
well, how did that come to fruition?
43:37
People think just because you have the data, you can just
43:39
kind of give it out and not knowing
43:42
the persona group that the person belongs to
43:44
and how the consumption experience has
43:47
been defined for that persona.
43:50
You'll often hear people say, hey, just give me access to the
43:52
data, I'll figure it out, you know? And oftentimes
43:54
you end up with stories like this. So I've
43:57
seen well and done well and kind
43:59
of something that...
43:59
we practice and both preach
44:02
is that the interface that sits on top
44:04
of the data
44:05
needs to walk backwards from the set of questions
44:07
that we're trying to answer.
44:09
What are the kind of roll-ups that we're trying to do?
44:11
What is it that we need
44:13
to do in order to make sure that we put a definition
44:15
around the roll-ups so that it's
44:18
relevant? What are the filter conditions
44:20
that are relevant for those roll-ups
44:22
versus not? And in
44:25
this particular instance, I'm talking strictly about
44:27
dashboards so that you have
44:30
those items outlined so that when people
44:32
come through and consume this, the
44:35
number of toggles or
44:36
inputs that you can use that
44:38
you can get an outcome with is limited
44:41
so that you can go through and drive that out. And
44:44
so that level of metering is super, super important.
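The roll-up story above comes down to aggregating without the filter the column requires. A minimal sketch of the pitfall, with invented data, where the table holds one row per order event rather than one row per order:

```python
# The roll-up bug in miniature: rows carry one record per (order,
# status) event, so summing the amount column without filtering to the
# final status double-counts every order.
rows = [
    {"order": "A", "status": "created",   "amount": 100},
    {"order": "A", "status": "fulfilled", "amount": 100},
    {"order": "B", "status": "created",   "amount": 250},
    {"order": "B", "status": "fulfilled", "amount": 250},
]

naive_total = sum(r["amount"] for r in rows)  # double-counted
correct_total = sum(r["amount"] for r in rows
                    if r["status"] == "fulfilled")  # filter, then aggregate

print(naive_total, correct_total)  # 700 350
```

Baking the filter into the interface (a metered dashboard toggle or a pre-filtered view) is exactly the "limit the inputs" remedy described above.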
44:47
Now on the other aspect of educating
44:50
the user about the data and
44:52
what it means, what I've
44:54
seen specifically
44:56
in the modeling arena is boundary
44:59
conditions and self-throttling, even
45:01
before you get the results out,
45:03
right? To say, hey, these kinds of breaches are
45:05
out-of-bound conditions
45:07
and therefore this needs a second set of review.
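The boundary-condition gating described here, preferred over pages of documentation, can be sketched as a simple check that routes out-of-bound results to a second review before they ever reach the consumer. The bounds and names are illustrative:

```python
# Sketch of boundary-condition gating: model results that breach the
# out-of-bound checks are routed for a second set of review instead of
# flowing straight to the consumer. Bounds are invented for illustration.
def gate_score(score, low=0.0, high=1.0):
    """Return (score, route); out-of-bound scores go to human review."""
    if score < low or score > high:
        return score, "second review"
    return score, "auto release"

print(gate_score(0.42))  # (0.42, 'auto release')
print(gate_score(1.7))   # (1.7, 'second review')
```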
45:10
What I have seen, at its worst,
45:12
is a ton of very
45:15
detailed documents spanning
45:17
multiple pages that exactly explains
45:19
what that is or in fact
45:21
even a user session, you know, that every time
45:24
you get on board, I sit with you and I walk you through what that means.
45:26
That's another thing that I don't see used
45:28
very well. So our preference
45:30
and what we typically like to do is a set
45:33
of tests that are run to make sure that
45:35
the data that you're actually consuming is
45:37
accurate and of high quality and of integrity.
45:39
And then on the consumption side, really limiting
45:41
the inputs to the outputs, right? Like, you
45:44
know, like if there's a country where they don't use zip
45:46
code or they use
45:48
another form of zip code, then don't show that
45:50
option, you know? Just limiting it considerably
45:53
and then lining that up to the questions that you're asking.
45:56
And in your experience of
45:58
working in this space, of helping
46:01
data teams understand what is the
46:03
customer experience that they're trying to
46:05
satisfy, how can they actually go about
46:08
delivering those capabilities? What
46:10
are some of the most interesting or innovative or unexpected
46:12
ways that you've seen teams either go
46:15
through the process of developing and
46:17
executing a given strategy or some
46:19
of the most interesting formulations
46:22
of that strategy that you've seen?
46:24
Yeah, I think when we think about
46:26
customer experience, let's just
46:28
kind of ground ourselves a little bit on the definition of
46:31
how we bring that to life.
46:33
The inputs to customer experience is really
46:35
kind of taking a look at your business and saying, these
46:37
are the different touch points that
46:40
our customers produce as
46:42
they interact with our digital
46:44
as well as our analog real estate. And
46:47
so right there, you can take the analog
46:49
real estate out, you know, and
46:51
you pretty much have the digital real estate and you said, okay,
46:53
well, these are the different interaction
46:56
points that we have. All right, so now
46:58
that we have that, we use those as
47:00
the input to then drive
47:02
decisions that then the
47:05
customer experiences. And
47:09
that whole process could be, how do we optimize
47:11
the loan registration process for the
47:14
lowest number of clicks, right, to get to a decision? It
47:16
could be, you know, how do we make sure that,
47:18
you know, Tobias gets the
47:20
most relevant content that gets presented on screen
47:22
so that he quickly makes a decision on
47:25
buying a product that is relevant to
47:27
their need, right? So what
47:29
I'm trying to get at is the way that I've
47:31
seen teams do a really, really good
47:33
job of that is asking
47:35
the question as to what is the core metric that we need to hedge
47:38
on that clearly defines is the customer
47:40
experience optimal or not? Is
47:42
it the number of clicks?
47:44
Is it the time taken per page? Is it
47:46
the number of items that he's left in
47:48
a basket? What is that? Data teams and
47:51
I haven't seen many data teams do it, but I've seen a lot
47:53
of business intelligence teams do it, which is they
47:55
really, really anchor and they ask the question as to what is the
47:57
metric that we need to be really optimizing for,
47:59
and getting that formulated, getting that listed
48:02
out accurately and done well, right? The
48:04
next thing from there that I have
48:06
seen data teams do well
48:09
is take that and think about all
48:12
of the data elements that come through
48:14
and formulate that answer and start
48:16
putting in early signs
48:18
of failure, right? So for example,
48:21
in order to determine the number of clicks, we get that from
48:23
five different systems. And we know that this
48:25
one system, when we get it from
48:27
that one system, we have to
48:29
make sure that the integrity and the quality is extremely
48:31
high. Okay. But we produce
48:33
this on a weekly basis. Should we flag
48:36
this at the end of the week? Or can we
48:38
flag this as and when the
48:40
data is coming in to say this is out
48:42
of bounds and this doesn't make any sense. There's
48:44
a new ordinal value that we need to flag. Oh,
48:47
these two systems are no longer in sync because our join
48:49
structures are gonna be off. And oh, by the way,
48:51
now this is gonna lead to a massive skew. So
48:54
to summarize where I've seen data
48:56
teams do really, really well is build those capabilities
48:58
around observability and monitoring.
49:01
And for me, there are two distinct things. Monitoring is the
49:03
things that you actually know that you can monitor for. And
49:05
then observability is everything else that you see
49:08
coming through the pipeline that you're
49:10
able to kind of decipher and kind of understand.
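The monitoring-versus-observability split described here, known checks applied as records arrive plus broader signals like slow drift, can be sketched like this. The value sets, thresholds, and function names are invented for illustration:

```python
# Illustrative split: monitoring is the known checks (e.g. a value set
# we expect), applied as records come in rather than at week's end;
# drift is a crude observability-style signal on the running mean.
KNOWN_STATUSES = {"created", "fulfilled", "cancelled"}

def monitor(record):
    """Known-failure checks: flag a new ordinal value immediately."""
    if record["status"] not in KNOWN_STATUSES:
        return ["unknown_status"]
    return []

def drift(values, baseline_mean, tolerance=0.5):
    """Crude drift signal: has the running mean slid off the baseline?"""
    mean = sum(values) / len(values)
    return abs(mean - baseline_mean) > tolerance

print(monitor({"status": "refunded"}))            # ['unknown_status']
print(drift([1.0, 1.2, 2.6], baseline_mean=1.0))  # True
```

In practice teams replace the crude mean check with learned baselines, the machine-learning-assisted pattern detection mentioned above, but the shape of the infrastructure is the same.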
49:12
And then using
49:13
machine learning almost to help you understand
49:16
the patterns and behaviors, the slow drift that's
49:18
going on. And relying less on the operational
49:21
systems to tell you where the problems are. Because
49:24
the operational systems, if they have issues
49:26
going on, they can easily flag it. But otherwise, they
49:28
kind of go through and drive whatever
49:30
it is they need to do. And they can kind
49:33
of go through and keep producing the results, right? So
49:35
having a lot of that infrastructure built on the data engineering
49:37
side to drive that out is where I've seen
49:39
data engineering teams innovate and
49:41
excel, right? Because the
49:44
alternate is, oh, why
49:46
don't we see a lot of data teams innovate on the
49:48
KPI side or pushing the business to think
49:50
more about that? They don't, it's almost
49:52
like asking the spark plugs to define the car.
49:55
It doesn't work that way. So I think it's an
49:57
unfair expectation to
49:59
have
49:59
of data teams. What I think they do really,
50:02
really well is optimizing on the infrastructure pieces
50:04
that I mentioned.
50:05
And in your experience of working
50:07
in this space, what are the most interesting or
50:09
unexpected or challenging lessons that you've learned
50:11
in the process?
50:13
Always question the
50:15
core set of assumptions coming in. Also,
50:18
you know, people will hand
50:20
you over code. I mean, oftentimes
50:23
what really happens is, you know, you're trying to build an analytics product.
50:26
And, you know, like you're trying to go through
50:28
and walk all the way back to the source system.
50:30
You're trying to analyze the data. And
50:33
you've got people telling you
50:35
how the data is manifested
50:38
in these systems. And
50:41
they will talk about it. They give
50:43
you these diagrams and all those different things. I
50:45
think taking a synthetic transaction all the way from
50:47
the left to the right in terms of, hey, here's how the
50:49
data originates. This is how it gets manifested
50:52
in these systems. These are all the assumptions that we're
50:54
making. These are the edge cases. Documenting
50:57
all those items and seeing it and living through
50:59
it, I think is not just key, but
51:01
it's paramount. Because one of the things that always
51:03
shocks me is you kind of come in and then, you know,
51:06
people will say in the operational side, right, they will
51:08
say, let's just take the example
51:10
of a trucking company. They'll say, hey, whenever our trucks
51:12
leave late, our drivers always enter the information.
51:15
And it's a part of our SOP, but we don't see that in the system.
51:18
And so why don't you see that in the system? Well, the thing is they
51:21
tried entering it in this field before.
51:23
It didn't quite work for them.
51:25
So they started using the comment field afterwards.
51:28
So yes, they are doing it, right? So
51:30
the SOP is still active and relevant. However,
51:34
that data is in the system.
51:36
It's just not where they said that it would be. One
51:38
of the good mitigation strategies that I've discovered
51:40
for this is to go out and
51:43
see and take a walk, you know, with
51:45
the actual executioners of the process
51:48
and see what that means. And that's another piece
51:50
that I also kind of bring to the top of the surface
51:53
is business process and understanding
51:55
business process and walking
51:57
that into where the data is
51:59
manifested, in what operational system, and
52:01
how it's manifested. That top to bottom kind
52:04
of viewpoint is important so that
52:06
you can tease these kinds of things
52:08
out.
52:09
And for teams who are starting
52:11
down the path of trying to incorporate
52:14
these strategic processes into
52:17
their delivery workflow, what are
52:19
the cases where going through the whole
52:21
process of building a data product
52:24
strategy, using that as the means to
52:26
identify and prioritize
52:29
work to be done is overkill and
52:31
you just need to focus on the technical
52:33
aspects and that that is actually
52:35
the core capability that you need to deliver.
52:38
I think when you're fairly small, when
52:42
I said fairly small, like you've got a team of, let's
52:44
say, a team of five people
52:46
and then you kind of provide analytics to the organization
52:50
and you formulate
52:52
and work through it in a solution by solution basis
52:54
and that's all that you have. You
52:57
can still start thinking about data and
52:59
the concept of a product and defining a strategy
53:01
but your throughput or the ammunition
53:04
that you bring to the table is gonna be far less. So
53:06
you're gonna accumulate a ton of technical debt as you go through
53:08
it. And honestly, in the beginning, it's
53:11
gonna be par for the course, right? So
53:12
in that case, the team may
53:14
not think that it's overkill
53:17
but your stakeholders may because the initial
53:19
cost of you building a data ingestion
53:22
pattern-based framework that will automatically
53:24
auto ingest data, man, the cost of that
53:26
initiative for a single use case will be extremely high.
53:29
So my suggestion is for places where
53:31
you don't have a lot of executive leadership
53:33
support, i.e. those leaders haven't come
53:36
from a very strong data background and
53:38
they can't see the need
53:40
for it but need to see hard numbers in
53:42
the context of a single use case that's very,
53:45
very myopic, this will be overkill 100%. So
53:48
then the question is, well, how do you, is
53:50
it still not right for the organization and what should we do
53:52
about it? And so I think this is where making
53:55
sure that as you work through the
53:57
use cases, you carve out a certain
53:59
set of your backlog and
54:02
use that in a very nuanced
54:04
way to start building some shared
54:06
capabilities, right? And so this
54:09
is kind of the point that I had made earlier about the fact
54:11
that your acceleration is gonna be less,
54:13
which means you're
54:16
not gonna travel as fast as you normally would. I
54:18
think that's par for the course, but that's
54:20
kind of what I would do in cases like
54:22
that. And those are the places where this
54:25
would be overkill, in areas
54:27
where you've got executive
54:29
support, you've got a set of people around
54:31
you who actually have seen the need
54:34
for building data products
54:36
at scale, and you have multiple teams
54:38
that are all producing data products
54:40
of different variety. There
54:43
may be a big aspiration to
54:46
provide some of these central capabilities
54:48
to lower the overall cost of production. Building
54:51
the use case for that, showcasing what the ROI
54:53
looks like, and doing something that
54:56
product managers do day in and day out, right? In
54:58
organizations like that, right? Where
55:01
you have 50 people all producing products,
55:04
right? Or solutions, so to speak, that
55:06
go through and get serviced by consumers.
55:09
You could start seeing these kinds of concepts
55:12
accepted more so than not. Just
55:14
to summarize, I think it's relevant in
55:16
either set of organizations,
55:18
but it's more pertinent, and the
55:21
investments are a lot easier to make, where
55:23
you have a lot of people just working
55:25
through providing data solutions,
55:27
and you kind of take a look at it, and you say, hey, didn't we just produce
55:29
that data set like last week? Yeah, that had
55:31
four columns, but this has five. So why is that other
55:34
team doing it? Why don't we just kind of take this data set and
55:36
make it into an asset, and then put that
55:38
on there? And oh, by the way, why don't we put privacy
55:40
treatments on it as well? Because that other team
55:42
did that too. How do we mimic it? Oh, you're
55:44
spinning up, you know, like
55:47
an S3 bucket in this way, right? Why
55:50
don't we use Terraform to go through and do that? Oh,
55:52
well, you know, our standards are different, or our naming
55:54
conventions are different. And so
55:56
I think these kinds of problems come at scale,
55:59
right? Because, you know, Tobias, people
55:59
can't move from team
56:00
A to team B because even though
56:02
they use the same cloud provider, the way that
56:05
they do business is different. And
56:07
so the op model is different. So these
56:09
are problems at scale versus, you know,
56:11
in smaller sizes, smaller
56:13
teams, it's more forgiving because,
56:16
you know, like the telephone problem is
56:18
not that high. And for teams and
56:20
individuals who are trying
56:22
to upskill into this space
56:25
of managing data product strategy
56:27
or understanding how best to integrate it into
56:29
their work, what are
56:30
some of the resources that you have
56:32
found useful and that you recommend people dig
56:35
into to be able to understand
56:37
more of the tactical elements of
56:39
how to bring data product strategy
56:42
into the work that they're doing for delivering data
56:44
to their various end consumers?
56:46
Really honing in and understanding software development
56:49
practices and what they mean, I think
56:51
is a good space to start off in. So
56:53
this involves everything from what
56:55
does CI/CD mean, what does
56:58
building services really look like, what do
57:00
contracts mean in this space, like
57:03
API contracts, what do
57:05
discoverable services look like? And this is very, very software
57:10
engineering oriented. And then that's where
57:12
I assume there's got to be a little bit of learning,
57:14
right, kind of coming to the table. The other
57:16
part that I think data engineering teams
57:19
and practitioners currently providing
57:21
data and analytic solutions will bring to the table
57:24
themselves is the inherent understanding that
57:26
data is different. The data assets being
57:28
produced, information assets being produced, they're
57:31
different than just core services, right? So how do
57:33
you think about the op model there? What does
57:35
that look like? And how do you take these
57:38
concepts and build them into
57:40
this? So for us, for
57:42
example, when we produce a data pipeline, do we have a baseline
57:45
data set that we can test against every single
57:47
time? Right? How do we measure drift? What
57:49
does that mean? Like, you know, should we build leaderboards
57:51
or not?
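The baseline-and-drift questions above can be made concrete with a small sketch. This is just one way to do it, with hypothetical column names and data: keep a snapshot of output from a known-good pipeline run, and on each new run compare a summary statistic of the fresh output against that baseline, flagging drift beyond a tolerance.

```python
import statistics

def check_against_baseline(baseline_rows, current_rows, column, tolerance=0.1):
    """Compare a numeric column's mean against the baseline and flag drift.

    baseline_rows / current_rows are lists of dicts, one per record;
    `tolerance` is the allowed relative change in the mean before the
    pipeline's output is considered to have drifted.
    """
    baseline_mean = statistics.mean(r[column] for r in baseline_rows)
    current_mean = statistics.mean(r[column] for r in current_rows)
    relative_change = abs(current_mean - baseline_mean) / abs(baseline_mean)
    return {
        "baseline_mean": baseline_mean,
        "current_mean": current_mean,
        "relative_change": relative_change,
        "drifted": relative_change > tolerance,
    }

# Baseline captured when the pipeline was known-good; data is made up.
baseline = [{"order_total": 100.0}, {"order_total": 110.0}, {"order_total": 90.0}]
todays_run = [{"order_total": 150.0}, {"order_total": 160.0}, {"order_total": 140.0}]

report = check_against_baseline(baseline, todays_run, "order_total")
```

A real setup would track several statistics per column (null rate, distinct count, distribution shape) rather than a single mean, but the idea of testing every run against a stored baseline is the same.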
57:52
And then using that kind
57:54
of set of introspective Q&A to
57:56
start building out capabilities to say, okay,
57:59
well, this is what it means, this
57:59
is what it looks like and start leveraging
58:02
and deep diving on those items. That's
58:04
what I would suggest now. Tactically, there
58:06
are a lot of thinkers in this space, right? Who
58:09
have all kind of provided their own
58:11
perspective on what it means. I mean, Thoughtworks
58:13
as a company I think has spent a lot of time in the space
58:16
of data products. Sanjeev Mohan,
58:18
you know, has done a lot of thinking
58:20
on the data product space. You've
58:22
got data contracts with
58:25
Chad Sanderson, you know, and so on. So I
58:27
think staying close to
58:30
all of these different vectors coming up is a
58:32
big one as well. What I found exceptionally
58:35
helpful is staying close to all the Slack channels, you
58:37
know, where different people are like really
58:39
ideating and thinking about what this means. And
58:43
our space is constantly evolving as well, right?
58:45
So if you think about metric stores, if you think
58:47
about, you know, the concept of obviously
58:50
the data mesh and data fabric have kind
58:52
of come to fruition and different people are
58:54
working on different things in that arena. But
58:56
if you think about data observability, if you
58:58
think about data contracts, like so these are all kind of
59:01
relatively new concepts coming up, right?
59:03
Like, you know, over the past three years. So they've started
59:05
to take shape and they've started to take hold, and
59:07
and thinking about how this impacts
59:10
our space is going to be the biggest one. And for us, what that
59:12
means is there's a ton of change, right? And
59:14
so when you are in these Slack channels, whether
59:16
it's for data quality, whether it's
59:19
for data observability, you know, whether it's
59:22
Bigeye or any of the other companies, you tend
59:24
to start hearing people talk about these
59:26
interdisciplinary concepts and bringing them together.
59:29
And then obviously, you know, the shameless plug
59:31
for your own podcast, Tobias. I mean, like,
59:34
I think, you know, if you're a data engineer and you're not kind of listening
59:36
to some of these things, you're probably missing the
59:38
beat on the trends going on and then kind of incorporating
59:40
that back into your own set
59:42
of practices. Right. So tactically, those are the places that
59:44
I would look for.
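The data contracts mentioned in this answer can be illustrated with a minimal sketch. This is not any particular vendor's API, just a toy example with hypothetical field names: the producing team publishes a contract declaring required fields and types, and records are checked against it before they ship downstream.

```python
# A minimal data-contract check: the contract declares required fields
# and their types; records that violate it are rejected before delivery.
CONTRACT = {
    "customer_id": str,
    "signup_date": str,
    "lifetime_value": float,
}

def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

good = {"customer_id": "c-42", "signup_date": "2023-01-05", "lifetime_value": 812.5}
bad = {"customer_id": "c-43", "lifetime_value": "812.5"}
```

In practice the contract would live in a schema registry or a tool like the ones discussed here, and enforcement would happen in CI or at the pipeline boundary, but the shape of the check is the same.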
59:46
Are there any other aspects of this
59:48
space of data product strategy,
59:50
how to think about it from a technical
59:53
perspective, how to incorporate it into
59:55
your overall work processes that we didn't
59:57
discuss yet that you'd like to cover before we close out the show?
59:59
I think we did
1:00:02
touch on it, but let me double click
1:00:04
on it further. I think this concept of
1:00:06
metrics and really gauging to see if your
1:00:08
strategy is headed in the direction that
1:00:10
it needs to head is core. When
1:00:13
we start thinking about a data product strategy,
1:00:16
the question that we need to ask ourselves is what
1:00:18
are we going to get as a result of that? Is it going to be
1:00:21
lowering the cost of producing products?
1:00:23
Is it going to be increasing throughput on
1:00:25
capabilities that we already have? Depending
1:00:28
on that and really understanding
1:00:29
why and what that means is
1:00:32
going to be key and core.
1:00:34
Also understanding if you're doing this for defense
1:00:37
or offense purposes. If you're
1:00:39
doing this to optimize cost, or you're trying to increase
1:00:41
top line. Answering those questions initially
1:00:44
and grounding yourself in why
1:00:46
you're doing what you're doing is going to be super important.
1:00:48
Otherwise,
1:00:49
this will be just like another flavor of the day.
1:00:51
You will be producing solutions and nothing more and
1:00:54
probably at twice the cost and for
1:00:57
one half the value.
1:00:58
All right. Well, for anybody who wants to get
1:01:00
in touch with you and follow along with the work that you're
1:01:02
doing, I'll have you add your preferred contact information
1:01:04
to the show notes. As the final question,
1:01:07
I'd like to get your perspective on what you see as
1:01:09
being the biggest gap in the tooling or technology
1:01:11
that's available for data management today.
1:01:13
Well, firstly, I
1:01:15
think the biggest problems that we have
1:01:18
are still about comprehension of how we use things more
1:01:20
than the technologies themselves. One aspect
1:01:23
that I see we completely
1:01:25
lack is this ability
1:01:27
to learn from the way that others are using
1:01:30
the tooling and the data within the
1:01:32
ecosystem that we have and then making
1:01:34
our systems more intelligent. One
1:01:37
of the things that we always
1:01:39
think about with respect to data management is
1:01:42
it's kind of like being a cartographer.
1:01:45
There are many cartographers all across your organization
1:01:47
that are writing these queries, merging
1:01:50
or culling through data and then formulating these side
1:01:52
roads. And oftentimes, whenever you start
1:01:55
looking at it, they're interpreting how this data
1:01:57
is being assimilated together and then creating
1:01:59
this map of the organization. When
1:02:02
one person does it, how can another
1:02:04
person not take advantage of it? And when one person
1:02:06
does it, how do we have enough confidence that that
1:02:08
end road or that side road
1:02:10
can have the right level of throughput so that we
1:02:12
can actually go through and
1:02:15
use it for other purposes, right? And then how do we
1:02:17
kind of
1:02:17
auto-migrate that up? That whole building
1:02:20
an intelligent ecosystem, right? Where
1:02:22
you have data that helps you
1:02:24
derive the way to use new data, I
1:02:27
think it's completely lacking in this business. And I
1:02:29
don't know if we're doing as much in
1:02:31
that arena or not, right? So intelligent systems
1:02:34
and using AI for BI, I think
1:02:36
is a big one that I see us having
1:02:38
a gap in.
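The cartographer idea, systems that learn from how others already query the data, could start as simply as mining query logs for tables that are repeatedly joined together and surfacing those well-traveled paths to the next analyst. A rough sketch under that assumption, with made-up queries:

```python
import re
from collections import Counter
from itertools import combinations

def table_pairs(query_log):
    """Count how often pairs of tables appear together in the same query.

    Uses a crude regex for names following FROM/JOIN; a real system
    would use a proper SQL parser and the warehouse's actual query log.
    """
    pair_counts = Counter()
    for sql in query_log:
        tables = sorted(set(re.findall(r"(?:from|join)\s+(\w+)", sql, re.IGNORECASE)))
        for pair in combinations(tables, 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical query log; the frequent (customers, orders) pairing is the
# kind of "side road" that could be promoted into a shared, trusted asset.
log = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT o.id FROM orders JOIN customers ON orders.customer_id = customers.id JOIN products ON products.id = orders.product_id",
    "SELECT count(*) FROM sessions",
]
popular = table_pairs(log)
```

Pairs that recur across many users are candidates for the auto-promotion the speaker describes: materialize the join once, with confidence derived from how often the path is traveled.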
1:02:39
All right, well, thank you very much for
1:02:41
taking the time today to join me and
1:02:43
share the work that you are doing and
1:02:46
your experience of building
1:02:48
and executing on data product
1:02:50
strategies. It's definitely a very important
1:02:53
area, one that has been growing in
1:02:55
visibility and adoption. So I appreciate
1:02:58
the time you've taken to share that with us and
1:03:00
I hope you enjoy the rest of your day. Thanks, Tobias, appreciate
1:03:02
it. Thanks for listening. Don't forget to check out
1:03:04
our other shows, Podcast.__init__, which covers the Python language, its community, and the
1:03:06
innovative ways it is being used, and the Machine Learning Podcast,
1:03:09
which helps you go from idea to production with machine learning.
1:03:11
Visit the site at dataengineeringpodcast.com to subscribe to the show, sign
1:03:13
up for the mailing list, and read the show notes. And if you've learned something
1:03:27
or tried out a product from the show, then tell us about it. Email
1:03:29
host at dataengineeringpodcast.com with your story. And to
1:03:31
help other people find the show, please leave a review on Apple Podcasts.