Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:11
Hello,
0:11
and welcome to the Data Engineering Podcast,
0:13
the show about modern data management.
0:17
Introducing RudderStack Profiles. RudderStack
0:19
Profiles takes the SaaS guesswork and SQL
0:22
grunt work out of building complete customer profiles
0:24
so you can quickly ship actionable, enriched
0:26
data to every downstream team. You
0:29
specify the customer traits, then Profiles
0:32
runs the joins and computations for you to create
0:34
complete customer profiles. Get
0:36
all of the details and try the new product today
0:39
at DataEngineeringPodcast.com slash RudderStack.
0:42
You shouldn't have to throw away the database to build
0:44
with fast-changing data. You should be
0:46
able to keep the familiarity of SQL and the
0:49
proven architecture of cloud warehouses but
0:51
swap the decades-old batch computation model
0:53
for an efficient incremental engine to get complex
0:56
queries that are always up to date.
0:58
With Materialize, you can. It's the
1:00
only true SQL streaming database built
1:02
from the ground up to meet the needs of modern data
1:04
products. Whether it's real-time
1:06
dashboarding and analytics, personalization
1:08
and segmentation, or automation and alerting,
1:11
Materialize gives you the ability to work with fresh,
1:13
correct, and scalable results, all in
1:15
a familiar SQL interface. Go to
1:17
DataEngineeringPodcast.com slash
1:19
Materialize today to get two weeks free.
1:22
Your host is Tobias Macey, and today
1:24
I'm interviewing Ranjit Raghunath about tactical
1:27
elements of a data product strategy. So
1:29
Ranjit, can you start by introducing yourself?
1:31
Absolutely. Firstly, Tobias, thanks for the opportunity
1:34
to have me on as a delegate
1:36
of CX Data Labs on your podcast,
1:38
a big fan. So thank you. So
1:40
my name is Ranjit Raghunath. I'm a managing principal
1:43
over at a company called CX Data Labs. We're
1:45
a data and analytics strategy and implementation
1:47
services company, and we focus
1:51
on optimizing customer experiences
1:53
in the retail, life sciences, and financial
1:55
services verticals
1:57
using data engineering and
1:59
data platforms as our core set
2:01
of picks and shovels,
2:04
effectively to kind of tie these systems together
2:07
so that businesses can see a
2:09
holistic view of the customer and then action
2:11
on it. And some of the things that they do as
2:13
a result of the work that we've done is increase
2:16
their ability to personalize on certain content
2:19
that they present or better understand
2:21
their marketing spend in terms of what
2:23
resonates well with customer acquisition costs
2:26
or simply optimizing wait times
2:29
as people call into a call center. And so
2:31
those are some of the examples. And for me
2:33
personally, this has been a long time
2:35
coming and I've been in the
2:38
data analytics field for roughly 17 years.
2:40
I've done nothing but various forms
2:43
of engineering software and data all
2:45
under the vicinity of either
2:48
producing data solutions or data products.
2:51
And just an overall geek and then
2:53
a nerd as it comes to data. Do you remember
2:55
how you first got started working in data? Yeah,
2:57
I do. I was an intern over at a company
2:59
called USAA and they were
3:02
working on a billback model. And
3:05
the core problem that they were trying to solve was they had
3:07
a set of infrastructure that they
3:09
wanted to go through and bill back all
3:12
the way to the business
3:14
teams utilizing those applications
3:17
so that they were getting value
3:19
from it. And one of my
3:21
tasks was to come in and help the
3:24
team go through and provide
3:26
this costing model. And so as I
3:28
came in, they were using Excel and
3:30
they were using Access to do
3:32
some of these computations.
3:34
And I kind of looked at them and I said, hey, you know, why
3:37
don't we start writing data pipelines to do this?
3:39
Which I didn't know they were called data pipelines, but
3:41
I was an electrical engineering graduate coming
3:43
in as an intern. All I knew is well,
3:46
maybe we can optimize it and do it differently. And
3:48
then soon got introduced to dimensional
3:50
modeling and said facts and dimensions is how
3:52
you can do that. Oh, well, what if we
3:54
turned it around? Why are we sending these reports to
3:57
them? Can we bring them over and then have them take
3:59
a look at it? Self-service reporting with business
4:02
intelligence. So a lot of it, I didn't have
4:04
the names for it per se, but
4:07
that's how I started cutting my teeth into it and just
4:09
started kind of navigating it, all to optimize
4:12
and to kind of lower the ratio in terms of the
4:14
work done for people getting the value that
4:16
they need. Yeah, it's amazing how
4:18
much in the technology industry
4:21
in particular, but probably any industry really,
4:23
that if you don't know the right terms that
4:25
people are using, then you just end up rebuilding
4:27
it yourself because you didn't know that it was already done.
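The billback model and the facts-and-dimensions idea from this exchange can be sketched in a few lines; the table names, fields, and numbers below are hypothetical illustrations of the pattern, not the actual USAA model.

```python
# A minimal facts-and-dimensions sketch of a billback model: a fact table of
# cost events joined to dimensions so costs roll up to business teams.
# All names and figures are made up for illustration.

# Dimension: business teams (hypothetical)
dim_team = {
    1: {"team_name": "Claims", "division": "Insurance"},
    2: {"team_name": "Lending", "division": "Bank"},
}

# Dimension: applications the infrastructure supports (hypothetical)
dim_app = {
    10: {"app_name": "PolicyAdmin", "team_id": 1},
    11: {"app_name": "LoanOrigination", "team_id": 2},
}

# Fact: monthly infrastructure cost events, keyed by dimension ids
fact_costs = [
    {"app_id": 10, "month": "2023-01", "cost_usd": 1200.0},
    {"app_id": 11, "month": "2023-01", "cost_usd": 800.0},
    {"app_id": 10, "month": "2023-02", "cost_usd": 1300.0},
]

def billback_by_team(facts, apps, teams):
    """Join facts to dimensions and roll costs up to the team level."""
    totals = {}
    for row in facts:
        team_id = apps[row["app_id"]]["team_id"]
        name = teams[team_id]["team_name"]
        totals[name] = totals.get(name, 0.0) + row["cost_usd"]
    return totals

print(billback_by_team(fact_costs, dim_app, dim_team))
# {'Claims': 2500.0, 'Lending': 800.0}
```

The point of the structure is exactly what the conversation describes: once the facts and dimensions exist, each new billing question is a join and an aggregation rather than a fresh spreadsheet.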
4:30
100%, 100%. And
4:32
a lot of it also is, the
4:35
thing that I've always loved about data and data
4:37
analytics is, it's an
4:39
objective way to make decisions. It
4:41
also provides some
4:44
eye-opening opportunities when you put things in front
4:46
of people and say, hey, these are the
4:48
observations, right? I mean, we could
4:50
debate what, how
4:52
we use it and what that means in the context
4:55
of the scenario, but this
4:57
is what we're observing,
4:58
right?
4:59
And oftentimes in the real world, if
5:02
you contrast that experience, we
5:04
both could be seeing the same events, but we could be
5:06
interpreting it very differently. And in
5:08
the data and analytics world, we have observations
5:10
and we can see that, and then we have inferences
5:13
that we can draw on it, but we have that dissected
5:15
framework to lean on. So
5:18
that's another kind of thing that has always
5:21
motivated me to kind of be
5:23
a disciple
5:25
of
5:25
the field, so to speak. And
5:28
to that point of shared definitions,
5:31
shared vocabulary, before we get too far
5:33
into the conversation at hand today of
5:35
data product strategies, let's
5:37
just start by identifying a shared understanding
5:40
of what we mean when we say data products
5:42
and how those might differ from data assets,
5:44
like a dashboard or a table or
5:47
a report that gets delivered quarterly.
5:50
And what is necessary? What
5:53
are the surrounding attributes for a piece
5:55
of data or a grouping of data
5:57
for it to be qualified as a product? Sure.
5:59
Sure, sure, sure. And I mean, there's
6:02
probably multiple definitions around it. So I'm going to give
6:04
you my rendition of what that means. And
6:06
so the disclaimer here is, you know, it is
6:08
not the definition, but it's a perspective and a point
6:10
of view. When I think about a product, a
6:13
product services a need
6:15
that a customer has, or a certain segment
6:18
of people that fit into a persona, right? So
6:20
you have a need, and then there is
6:22
an outlay for that need, and it
6:24
gets serviced. And the way that it gets serviced is
6:26
through a set of features. And then you have different
6:29
set of products that help you
6:31
get to the end of the job that
6:33
you may have for a particular experience that you
6:36
want to deliver. So okay, all of that
6:38
sounds very nebulous. But what does that mean in the context
6:40
of data? You use the word data asset. For
6:43
me, an asset means something that you can harness
6:45
value from, and log transactions
6:47
against. So there's a cost, and then
6:50
there's revenue that comes in.
6:52
A data asset, as you talked about, is
6:54
a type of data product. A dashboard
6:57
is a type of data product. A model
7:00
is also a type of data product.
7:02
And so these are interfaces
7:04
that you have customers use to
7:07
harness value, and then also assimilate
7:09
costs across it, right? For me, simply
7:11
put, a data product is thinking about
7:14
the customer and then the way that they
7:16
use data and its attributes
7:18
to make decisions and cataloging them into
7:21
a set of features that you
7:23
can then have expanded teams
7:25
put together that deliver it. But then
7:28
can also have long runtime roadmaps
7:30
that you kind of have, that you kind of can nurture
7:32
and then kind of grow over time. So what does that
7:34
mean? So let's say that we say that we're going to
7:36
build a C360 data
7:38
asset, right? So we're going to break
7:40
that down and try to identify the different features
7:43
that we would need to correctly depict
7:46
what a customer would be, and we would
7:48
think about it in a product. So what does that mean? A product
7:50
has a life cycle, you know,
7:52
it has release notes, it has releases,
7:55
it has a team that's long-lived that goes
7:57
through and produces this. We also
7:59
take care of regressions, we also take
8:01
care of things that we may need to deprecate
8:04
over time. We may think of
8:06
features that add on to these different modules.
8:08
And so thinking about the customer 360
8:11
data asset as a product and then
8:16
putting together a release roadmap that says these
8:19
are the features that are coming
8:21
out in this quarter, who's
8:23
going to be a technology evangelist
8:25
versus an adventure
8:27
enthusiast, those could be multiple variables
8:30
that come in. And they could be spread across two
8:32
different releases. And so I look
8:35
at the concept of a data
8:37
product as one that builds on top of each
8:39
other
8:40
and really thinks about the customer and the
8:42
way they use it and
8:44
how they use it
8:45
and provides them with interfaces
8:48
so that it's easier for them to use. So
8:51
the last thing that I'll say before I hand it back over
8:53
to you is the usage modality here
8:55
could be that we would like to
8:58
give a customer ID and we would like to get
9:00
to know if this person is a technology enthusiast
9:02
or not. And the best way to do that may
9:05
be consuming that data set through a RESTful
9:07
interface where it has a certain set
9:10
of specifications in the form of a contract
9:12
so that I can go through and enable
9:15
real time decision making. Great, that's
9:17
an interface into an asset that we have
9:20
culminating into a product that we go through and sell.
9:22
There could be other interfaces to it, which
9:24
is, hey, I want to consume all
9:26
of these records and batch and then make decisions
9:28
and go through and drive. Well, that's another interface,
9:31
again, into the same asset. So
9:33
it's just breaking this concept
9:36
of data usage and
9:39
what it means into a set of constructs
9:42
that we just kind of talked about.
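The two usage modalities described here, a RESTful point lookup for real-time decisions and a bulk export for batch consumers, can be sketched as two thin interfaces over one shared asset. The trait name and records below are hypothetical stand-ins, not a real profile store.

```python
# Two interfaces into the same data asset: a point lookup (the shape a
# RESTful contract would wrap) and a bulk export for batch jobs.
# The asset and its "technology_enthusiast" trait are illustrative.

PROFILE_ASSET = {
    "cust-001": {"technology_enthusiast": True},
    "cust-002": {"technology_enthusiast": False},
}

def lookup_trait(customer_id: str, trait: str):
    """Point lookup: one customer id in, one contracted answer out,
    suitable for real-time decision making."""
    profile = PROFILE_ASSET.get(customer_id)
    if profile is None:
        return {"customer_id": customer_id, "found": False}
    return {"customer_id": customer_id, "found": True, trait: profile.get(trait)}

def export_batch(trait: str):
    """Bulk interface: every record at once, for downstream batch consumers."""
    return [{"customer_id": cid, trait: p.get(trait)}
            for cid, p in PROFILE_ASSET.items()]

print(lookup_trait("cust-001", "technology_enthusiast"))
print(len(export_batch("technology_enthusiast")))
```

Both functions read the same asset; only the interface and its contract differ, which is the distinction being drawn in the conversation.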
9:44
And so with that shared
9:46
definition of what it means
9:48
to have a data product, what
9:50
are the pieces that we need to strategize about?
9:53
Why do we need a strategy? What purpose
9:55
does that strategy serve? And how does that
9:57
inform the work to be done?
9:58
Very good, very good,
9:59
because this is something that I think about
10:02
quite a bit. We talked about different types
10:04
of data products. We talked about type model.
10:06
We talked about type dashboard. We
10:08
talked about type data asset. You
10:10
can have further categories like information asset
10:12
as well. And so if you were to condense
10:15
these into patterns
10:17
patterns, then you start taking a look
10:20
at a value chain that comes from
10:22
a set of activities that, when done in unison,
10:25
produce an artifact, right? That
10:28
being a data asset. Okay, so
10:30
then the thinking here is you have patterns,
10:33
you have a set of activities that line up to
10:35
these patterns and you have a set of artifacts that are produced
10:38
as a result of that. Effectively what we
10:40
think of when we say a data product
10:42
strategy is the formulation
10:46
of that so that
10:48
we can go through and industrialize its production,
10:50
right? So that from the concept of inception
10:53
all the way to industrialization,
10:55
you can utilize this op model,
10:58
so to speak, to kind of produce this artifact
11:00
in a very streamlined way, okay?
11:03
So what does that mean? You know, you're producing a data
11:05
set. Let's just say it's a data asset that
11:07
you're producing. Let's go back to the other example of customer 360.
11:11
In order for you to source that information, you
11:14
may need to go to a CRM
11:16
system. And that CRM system, let's just say
11:19
it's Salesforce. The ingestion pattern associated
11:21
with ingesting data from Salesforce doesn't need to
11:23
be recreated again and again for sourcing,
11:26
let's say from one entity such as Contact or
11:28
another entity such as Account. You define
11:30
the pattern of the ingestion once and
11:33
then you go through and leverage
11:35
that highway, so to speak, for different objects as
11:37
they come along. And so you slowly start condensing
11:40
those set of patterns into
11:44
broader capabilities and then you
11:46
free up the development cycle for
11:49
producing these products and hone
11:52
in on these capabilities, right? And so
11:54
effectively what you end up doing is you make
11:56
the marginal cost of producing the next product
11:59
simpler and faster. And so effectively,
12:01
you harden this set of capabilities,
12:03
right. And so that's kind of that
12:06
whole piece of the puzzle is we
12:08
produce and develop a strategy. And that's why you
12:10
need one. Otherwise, what ends up happening
12:12
if you don't have one is the cost of producing
12:15
a product is just the same
12:17
or expensive again and again and again,
12:19
right. And so what you want to do is you want that cost curve
12:22
to come down. So hopefully that answered
12:24
the question. Yeah. And for those
12:26
elements of defining
12:28
or establishing what the strategy should be,
12:30
who is responsible for
12:33
guiding that process? Who are the people that need to be
12:35
involved from a kind of roles and
12:38
persona perspective? And
12:40
what are the things that might trigger
12:42
the development of a given strategy?
12:45
Yeah, yeah. So I think
12:47
you always got to start off with
12:49
the consumers of analytics in
12:52
mind. So they're very important stakeholders.
12:54
These are the folks who consume the analytics
12:58
being produced and action on it. So
13:00
think about somebody in the office of the CEO,
13:03
think about a chief credit risk officer who
13:05
takes a look at, you know, the
13:07
analytics being produced and says, Hey, this
13:10
is my overall risk for my portfolio within
13:12
the sector that I manage. This is how I
13:14
can curtail
13:17
my bookends with respect to, you know, certain hedges
13:19
that I'm performing. But that's, that's
13:21
a cohort, right? That's, that's a segment of the population
13:23
that provides you with, Hey, here's what
13:25
I'm going to do with the analytics
13:28
that you provide me. And this is what I decision
13:30
on and action on. And oh, by the way, this is
13:30
why I need what I need when
13:33
I need it. And that typically for
13:37
us is use cases, right? And they come from our stakeholders.
13:40
And the stakeholders kind of closer in the business
13:42
go through and drive that out. Those needs,
13:45
and that level of dialogue that goes on
13:47
that says, where in this business process,
13:49
do you actually embed this level of analytics?
13:52
How do you use it? Oh, what is the time
13:54
taken for you to go through and provide it? Is
13:56
there any sensitivity to the information being
13:58
provided? All of those kinds
14:01
of questions and answers that need to
14:03
kind of bring the use case to life in
14:05
our world is brought together through
14:07
the lens of a data product manager.
14:10
And in some organizations, you know, that could be
14:13
further bolstered with a product owner that's
14:15
a little bit more tactical, kind of taking
14:17
those needs and helping them kind of see the
14:20
technical definitions around it, or
14:22
it's fully owned by the product manager themselves.
14:25
And then what we have is we have a
14:28
set of, you know, software data
14:30
engineering managers who kind of sort of go
14:32
through and break this down in terms of, hey,
14:34
here's what the needs mean in
14:37
the way of thinking about non-functionals. How
14:39
do these come into play? And that's where
14:41
we really see the software engineering manager, data
14:44
engineering managers come in. They got to hear the
14:46
functional needs and then start saying, well, here are the non-functionals.
14:49
This is why this is what we need to do. Okay,
14:51
well, we're going to produce it in this way. We
14:53
should have some logging measures being
14:55
put into place. We need to have some telemetry. We
14:57
need to have some monitoring. And then they also take
15:00
all of the needs being articulated and
15:02
put them into functional requirements, and
15:04
then they start breaking them down. And the breaking
15:06
them down part really is where we
15:08
see, you know, TPMs
15:10
or scrum masters, or, you know, however you
15:12
want to call them, but effectively folks who can
15:14
take a set of functional requirements,
15:17
a set of non-functional requirements, and then
15:19
kind of divide them into
15:21
a plan of action that the team can execute
15:23
on. And then you have a set of developers, right?
15:27
Now they fit into multiple different brackets.
15:29
You know, it could be platform engineers.
15:31
It could be data engineers. It could be software engineers,
15:33
but they all kind of listen to these needs
15:36
that have been kind of dissected into stories.
15:38
And they start saying, okay, well, this is, if
15:41
we do this, then we can achieve this. Do
15:43
you agree with this? And that whole negotiation
15:47
going back and forth happens internally
15:49
to the team, and then also with the product manager,
15:52
and then ultimately signed off by the
15:55
software development manager or the data engineering
15:57
manager, and then it gets formulated into a set
15:59
of... release artifacts that we go
16:01
through and produce and provide out that
16:04
ultimately gets embedded into the business
16:06
workflow. Now all of this stuff
16:09
is going to be useless if we don't
16:11
have a
16:12
really good
16:13
business enablement, customer success
16:16
driven viewpoint in which we're doing change
16:18
management both on the technology side
16:20
but then also on the business side which is now you're going to get
16:22
this new analytical component.
16:25
How are you going to use it? So for example,
16:27
let's say there's a propensity for failure of paying
16:30
back a loan. How are you going to use
16:32
it when you make the loan origination decision?
16:35
When should you pull the lever to
16:37
say this model doesn't make sense, these answers
16:39
don't make sense? And oh by the way,
16:41
how do you tune it? Where do we monitor
16:44
that and how do you make decisions on
16:46
it? So it's a combination of
16:49
different items coming together along with
16:51
different roles and they encompass
16:53
all the way from change managers. Sometimes
16:57
these are played by the product manager and then some
17:00
analytical translators or folks or
17:02
business analysts directly in the business. It just really
17:04
depends on the company that you're in
17:06
and the role that they play. But that
17:09
is what the left-to-right side
17:11
of the equation would look like to produce a
17:14
data product
17:14
and the different activities
17:16
that would go into it.
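The "when should you pull the lever" question about the loan-propensity model can be made concrete with a small monitoring check. The baseline and drift threshold below are assumed values for illustration, not a prescribed policy.

```python
# A hedged sketch of model monitoring for a loan-default propensity model:
# compare the recent average score to a release-time baseline and flag the
# model for human review once it drifts past a threshold. The numbers are
# made up for illustration.

BASELINE_MEAN = 0.12   # average propensity observed at release (assumed)
DRIFT_LIMIT = 0.05     # how far the mean may move before someone investigates

def needs_review(recent_scores, baseline=BASELINE_MEAN, limit=DRIFT_LIMIT):
    """Return True if recent output has drifted enough that the business
    should stop trusting the model blindly and tune or retrain it."""
    if not recent_scores:
        return False
    mean = sum(recent_scores) / len(recent_scores)
    return abs(mean - baseline) > limit

print(needs_review([0.10, 0.13, 0.11]))  # close to baseline: False
print(needs_review([0.30, 0.28, 0.35]))  # drifted: True, pull the lever
```

A real deployment would track more than a single mean (input distributions, outcome rates, segment-level drift), but the shape is the same: an agreed threshold that turns "does this model still make sense?" into an operational check.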
17:20
This episode is brought to you by DataFold, a
17:22
testing automation platform for data engineers
17:25
that finds data quality issues before the code
17:27
and data are deployed to production. DataFold
17:30
leverages data diffing to compare production and development
17:33
environments and column level lineage to
17:35
show you the exact impact of every code change
17:37
on data, metrics and BI tools, keeping
17:40
your team productive and stakeholders happy.
17:43
DataFold integrates with DBT, the modern data
17:45
stack, and seamlessly plugs into your data
17:48
CI for team-wide and automated testing.
17:51
If you are migrating to a modern data stack, DataFold
17:53
can also help you automate data and code validation
17:56
to speed up the migration. Learn more about
17:58
DataFold by visiting DataEngineeringPodcast.com
18:01
slash datafold today. And
18:05
regardless of whether you actively engage
18:08
in defining and implementing a particular
18:10
strategy, there's always going to be
18:12
a strategy. It's just a matter of whether you are
18:15
explicit and purposeful about it,
18:17
or if it is just something that is emergent. And I'm
18:19
curious what you have seen as some of the
18:22
juxtaposition of teams that are very deliberate
18:24
about the definition and execution
18:27
of a product strategy for their
18:29
data assets versus teams that
18:31
just leave it to, well, this is what we're doing.
18:34
It'll just
18:34
emerge and we'll figure it out as we go
18:36
kind of approach. Yeah, I think it's a good question.
18:39
I think when you
18:41
have very product centric teams
18:44
that are exclusively focused on
18:46
enabling, let's say a product
18:48
that they're going through and releasing,
18:51
and they have analytics as a tie-in
18:53
to that product, I see them leveraging
18:56
and kind of latching on to the product strategy
18:58
itself and analytics and data kind of
19:01
sort of, they don't fall by the wayside, but
19:03
they're secondary actors within
19:05
that overall equation, right? Which means
19:07
what? You typically have one business intelligence engineer,
19:09
you have one data engineer that's within the group and
19:11
their entire purpose of existence is to help the product
19:14
manager rationalize decisions based
19:16
on either funding, customer decisioning journey,
19:19
churn, whatever it may be, ARR, like
19:21
whatever it is that they want to do, the flavor of the day
19:24
is what they work on, right?
19:26
So in that case, they're
19:28
not really
19:29
coming up for air as much and thinking about and saying,
19:31
hey, here are the 14 or 15 different
19:33
questions that I get asked. Here's how I
19:35
can start laying out the foundation so
19:38
that I don't need to do the same amount of work that
19:40
I do for answering those 14 different questions.
19:43
Let me start formulating and sourcing
19:45
data that will create these core
19:48
entities that I can then use to mash up
19:50
and oh, by the way, let me build a dashboard on top of it so
19:53
that the product manager can do it themselves,
19:55
right? So that side of
19:57
the equation is where I see less of that
19:59
and it's more where the product
20:01
is central and the product manager
20:04
is driving all of the kind of work
20:07
and centering it on the product itself, right?
20:09
So I don't see a strategy that coherently
20:11
kind of describes anything in scenarios
20:14
like that. Where I've seen companies
20:16
use or kind of dive in really
20:18
into data product strategy
20:21
is where there is a focus
20:25
on building data products. But
20:27
there is an aspect of doing it in a centralized
20:29
fashion. And it's not that everything
20:32
is centralized, but it could be that a core set
20:34
of the infrastructure
20:35
is centralized, a core set of the
20:38
assets being brought in as centralized.
20:40
So what does that mean? Let's go back to the example
20:42
that I gave about Salesforce. It's a CRM system.
20:46
It's got a ton of assets
20:48
within it. What does that mean? It's got contact.
20:50
It's got account. It's got
20:52
leads. There's leads that
20:55
are being qualified there. There are sales. There's
20:57
tons of information there. Do we want
20:59
every single team to go through and source
21:01
that information again and again and again? Probably
21:04
not. I mean, if you think about it on an ingress and egress perspective,
21:07
it doesn't make sense.
21:09
So
21:10
you have a set of teams that go through and say, hey,
21:12
here. We're going to model how
21:14
to particularly use contact and
21:16
account and the relationships between it. And
21:19
we're going to manifest it in a place that
21:22
makes it easy for teams to go through
21:24
and source it. OK. Well,
21:26
when they do that, they're effectively
21:29
centralizing that
21:31
capability of data ingest and
21:33
data rationalization. So when you're
21:35
a consumption-driven team coming
21:38
through, you have to learn that mnemonic
21:40
and you have to go forward. So in
21:42
cases like that where people are looking for efficiency
21:44
gain through central harmonization
21:49
of data, I see those
21:51
kinds of companies do data product
21:53
strategy more and more.
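The "define the ingestion pattern once and reuse the highway" idea for Salesforce objects can be sketched as a single shared ingestion function parameterized by entity. The fetch functions below are hypothetical stubs standing in for real Salesforce API calls, not an actual client.

```python
# One shared ingestion pattern, reused per source object: fetch records,
# stamp provenance, land them in a common sink. New entities (Contact,
# Account, Leads, ...) ride the same "highway" instead of rebuilding it.
# The fetchers are illustrative stubs, not real Salesforce calls.

def ingest(object_name, fetch_fn, sink):
    """Apply the shared ingestion pattern to one source object."""
    for record in fetch_fn():
        record["_source_object"] = object_name  # provenance stamp
        sink.append(record)
    return len(sink)

def fetch_contacts():   # stub for the Contact entity
    return [{"id": "C1", "email": "a@example.com"}]

def fetch_accounts():   # stub for the Account entity
    return [{"id": "A1", "name": "Acme"}]

landing_zone = []
ingest("Contact", fetch_contacts, landing_zone)
ingest("Account", fetch_accounts, landing_zone)
print([r["_source_object"] for r in landing_zone])  # ['Contact', 'Account']
```

The marginal cost of the second object is just its fetcher; everything else, the provenance stamping, the landing, and whatever monitoring wraps `ingest`, is paid for once.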
21:55
And then
21:57
the tier that sits in the middle,
22:00
they don't necessarily agree on the centralization
22:03
or the decentralization, but they agree
22:05
a ton on
22:08
the ways of working and the standardization of
22:10
the ways of working. So if you think about, hey,
22:14
what does continuous integration look like in
22:17
the concept of data engineering? What
22:19
does continuous deployment look like? And what does that mean?
22:22
And so if you have teams that are really software
22:25
focused but then are trying to enable data
22:27
products, they
22:30
hinge on the ways of working. They say, hey, well, let's
22:32
have a repo structure that's
22:34
conducive to working on data engineering
22:36
efforts. So they go
22:39
through and drive out a set of standardization
22:41
there. And that's hinged on kind
22:43
of what is a data product and how do we enable
22:46
that? So those are the three kind of verticals
22:48
that I see as I've
22:50
kind of scavenged the field.
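The "what does continuous integration look like for data engineering" question raised here often comes down to checks that run against the data on every change and fail the build before a bad release ships. The rules and sample records below are illustrative, not a specific team's standard.

```python
# A sketch of CI for data engineering: assertion-style data quality checks
# a pipeline job would run on every change. The rules and field names
# (customer_id, churn_score) are made up for illustration.

def run_checks(rows):
    """Return a list of failed-check messages; an empty list means the
    build passes and the release can proceed."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
    for i, row in enumerate(rows):
        if row.get("customer_id") in (None, ""):
            failures.append(f"row {i}: missing customer_id")
        if not (0.0 <= row.get("churn_score", 0.0) <= 1.0):
            failures.append(f"row {i}: churn_score out of range")
    return failures

sample = [
    {"customer_id": "C1", "churn_score": 0.2},
    {"customer_id": "", "churn_score": 1.7},
]
print(run_checks(sample))
```

In a repo structured for data engineering work, checks like these live next to the transformation code they guard, so "continuous integration" means the data contract is tested the same way the software is.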
22:52
You mentioned the different roles and responsibilities
22:55
throughout the process of designing
22:57
and implementing a product strategy.
23:00
But of course, not every company is going to
23:02
have the same sets of people, the
23:04
same titles, or a given title
23:06
might not even exist or might be spread across separate people. And
23:09
I'm curious how you have seen the
23:11
size and structure of
23:14
different teams, both within and
23:17
adjacent to the data
23:19
capabilities influence the ways
23:21
that people approach the concept of how
23:24
to strategize what the scope of
23:26
a given product looks like, et cetera. Yeah,
23:28
and I think this really depends on verticals
23:30
and the kind of vertical that you belong to and
23:33
the importance given to data within
23:36
that vertical. So if you take a look at the insurance
23:38
business, they've been using data to make decisions
23:41
for a very, very long time. So their maturity
23:44
around data
23:46
management and the need for it is
23:49
super high because when you make rate changes on insurance
23:51
policies, you need to file this data. So guess
23:53
what? Like you automatically are thinking about data
23:55
retention. You're automatically thinking about
23:58
the
23:58
governance around
23:59
that data that's used to make those decisions.
24:02
You're automatically thinking about the
24:04
fact that, you know, you have a deadline to submit
24:07
these things and you have SLAs
24:09
in place, okay, well, they need to be done
24:11
in a particular order so that it can be good. It
24:13
can go through and be deployed. Every,
24:16
like, you know, if you take an actuarial
24:18
scientist, right, like there's a particular way that they
24:20
go through and do their business. So there's a
24:23
certain hygiene around the way that they
24:25
think about processing the data so that
24:27
they can then answer the questions that they seek.
24:30
So based on the vertical that
24:32
you're in, or the industry vertical that
24:34
you're in, and the importance that data has within
24:36
its own, its relevance, right,
24:38
will dictate how much you think about
24:41
data and how much you think about
24:43
all of the -ilities that come with it. And so
24:46
in the insurance example that I gave you, they
24:48
inherently have a strategy, but it's embedded
24:50
in the way that the vertical exists. So
24:53
you may not need a business analyst. You could have an actuarial
24:55
scientist, you know, maybe a junior one who
24:57
functions as one. And they write documents,
24:59
they write requirements in the sense that that
25:02
depicts the process flow where things need to happen
25:04
or not. And so that's one example, right?
25:06
In another example, you could have, and
25:09
they may need to have, you know, separation
25:12
of roles, because they're probably in a regulated
25:14
business where, you know, the person doing the
25:16
math cannot be the one that checks the math. And
25:18
therefore, you know, the way that they've kind of written these rules
25:21
says, you know, we need to physically have them as
25:23
being separate so as to guarantee a level
25:25
of quality that they go through and drive out. It
25:27
is that way in the life sciences business,
25:29
for example, right? There's a large
25:32
focus on QC, because imagine
25:34
getting a drug that hasn't
25:36
been QC'd as much as it should
25:39
have. So there's a certain
25:41
set of operating protocols
25:43
and procedures that have been
25:45
grounded on mitigating risk and
25:48
increasing quality. And that's kind of
25:50
led to an op model where you have different people
25:53
doing a different set of roles. And that's dictating
25:55
the way that the industry operates and
25:57
drives.
25:58
So that's another factor,
25:59
which is the industry drives the kind of
26:02
roles based on the way that they segment things. The
26:04
third is where data is used
26:06
as an enabler, but the cost of getting it wrong is not that high,
26:09
and you need it for directional
26:11
correctness rather than, you know,
26:13
exactness, right?
26:15
So in the example of life sciences or even
26:17
in cyber, right, like, or in security,
26:20
you cannot get
26:22
things right on average. Those things don't
26:24
happen in those verticals, right? You have to get
26:26
it right every single time. Versus
26:29
in retail, for example, right? You're
26:31
less likely to give your address
26:35
to a person who says, hey, can I get your address right
26:37
at the checkout desk, versus,
26:39
you know, like, you're within a financial services
26:41
institution and they ask you what your address is so they
26:44
can send you statements. I guarantee you,
26:46
one has a higher likelihood of you giving the most accurate
26:48
information compared to the other. So
26:51
you get a ton of garbage in, right,
26:54
in some of these verticals, like, you know, for example, in retail. So
26:57
then you start saying, okay, well, you
27:00
know, you're going to have to formulate
27:02
a ton of rules to get it
27:04
right, and so you start to say,
27:07
okay, well, there's a lot of definition here. There
27:10
isn't a lot of criteria
27:13
that we put on the docket. We just need to
27:15
do a ton of iterations and go
27:17
through and get the answer right. So what I've seen in industries
27:19
like that is you don't have a ton of rules. You
27:22
have, you know, one developer, one solution
27:24
that goes deep, right? They may do the business
27:26
analysis reporting. They may do the data engineering.
27:29
They may also help the business
27:31
in doing the governance itself by
27:33
flagging elements that are out of sync
27:36
or out of place. So there's
27:38
a very long-winded way to say, Tobias, like,
27:41
depending on the industry vertical that you're in and
27:44
the place that data has in
27:46
the relevance of the decision-making process
27:50
and the kind of inputs that they get, you know,
27:52
high quality versus not, the cost
27:54
of getting it wrong versus being directionally correct, all
27:57
define the number of rules that are being put
27:59
in place. It's almost like a spectrum, right?
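The rules-heavy approach described here for high-garbage verticals like retail can be sketched as a set of flagging rules that surface what is out of place rather than rejecting records outright. The rules and field names below are invented purely for illustration:

```python
# Illustrative data-quality rules of the kind described for "garbage in"
# verticals like retail: each rule flags a record rather than rejecting
# it, so one person can triage what is out of sync or out of place.
def check_record(record):
    """Return a list of rule names that the record violates."""
    flags = []
    if not record.get("email") or "@" not in record["email"]:
        flags.append("invalid_email")
    zip_code = record.get("zip", "")
    if not (zip_code.isdigit() and len(zip_code) == 5):
        flags.append("suspect_zip")
    if record.get("address", "").strip().lower() in {"", "n/a", "none", "123 main st"}:
        flags.append("placeholder_address")
    return flags

records = [
    {"email": "a@example.com", "zip": "02139", "address": "1 Elm Street"},
    {"email": "asdf", "zip": "00000x", "address": "n/a"},
]
flagged = {i: check_record(r) for i, r in enumerate(records) if check_record(r)}
print(flagged)  # only the second record trips the rules
```

In a retail-style setting the list of such rules tends to grow with every new source of junk input, which is the iteration loop described above.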
28:02
So that's kind of one way to take
28:04
a look at it. And so
28:06
now, what is the commonality
28:08
that you see regardless of the industry that you're in?
28:11
I think it comes down to the artifacts. Regardless
28:13
of how many roles you have and
28:16
which industry that you're in, a
28:18
technique that I've seen work well, at
28:21
least from my perspective, when
28:23
you formulate these kinds of strategies is
28:25
a set of interviews with stakeholders. These
28:27
are people typically who are making
28:29
decisions using the analytics that you're providing. Taking
28:33
that set of use cases, rolling
28:36
that up into a set of capabilities
28:40
that need to be invested in,
28:43
dissecting that, building programs which
28:46
have projects within it that then get
28:48
executed on, that
28:51
then kind of tie into metrics that
28:53
say, this is why we did what we did and
28:55
this is the value that we're going to get. I know
28:57
by the way, through that process,
28:59
this is the data sets that we're governing and this is
29:01
how we're governing it without maybe using
29:04
those words is what I've seen work
29:06
well. And it also disarms
29:09
organizations because many times when you
29:11
go in and you say, hey, we're gonna stand up, like
29:14
the outcome of a data product
29:16
strategy is a team that we need
29:18
to build up of like 50 people.
29:20
In an organization that doesn't have data
29:22
within the decision-making nomenclature,
29:26
right? That's gonna be a tough one
29:28
to stomach. But even in one that is
29:31
driven in that way, 50 is
29:34
a large number, they're gonna buck anyways and
29:36
they're gonna say, well, we're getting efficiency with
29:38
one person, why would we need to do it differently? So
29:40
I think just focusing on the artifacts and
29:42
really thinking about how do you take these
29:45
use cases, hydrate that up
29:48
into a set of portfolios and programs
29:51
that you can execute on, dovetailing with metrics
29:53
and governance is the way to
29:55
go regardless of who does it.
29:57
The other interesting element of a data
30:00
product is the audience,
30:02
where, because data
30:04
has so many different potential stakeholders and
30:06
consumers, that will drastically influence
30:09
the overall user experience
30:12
they're trying to optimize for, because the
30:15
core element of something being a product
30:18
is that it is consumable out of
30:20
the box, versus just, "here's
30:22
some data, good luck, you know, you can pull
30:24
it from this S3 bucket if you want." You
30:27
know, as a product, you know, you go
30:29
to a Netflix, that's a product. You
30:31
go to Amazon, that's a product for e-commerce.
30:34
If you are a data consumer, and,
30:36
you know, if I give you a CSV
30:38
file and I'm an average person
30:41
who is just trying to answer a question, what
30:43
is the CSV file gonna do for me? But if I
30:45
have a search box where I can type a question
30:47
and then you're using that underlying data to give an answer,
30:50
that's a better experience. Whereas
30:52
if I'm a data engineer, you can give me a CSV
30:54
and a little bit of documentation on what to do with it.
30:57
And so I'm curious, what
30:59
are some of the useful questions that teams need to be
31:01
asking in the development of that product
31:03
strategy that will inform the
31:06
implementation details and the types of technologies
31:08
that they need to bring to bear on the solution?
31:10
Yeah, thank you for kind of highlighting
31:13
the importance that interfaces play
31:16
in the role of a data product. So I think
31:18
one of the things that you kind of mentioned, in
31:20
the examples you gave about Netflix and Amazon and
31:22
everything else, is, you know, let's just take
31:25
the example of maybe Amazon, right? You come in, you search
31:27
for a product, and you buy it, right? But
31:29
why did you search for that product? You had a need. You
31:32
know, let's say you're buying household goods, you know,
31:34
you're buying a household cleaner, right?
31:37
You're going in there and you're trying to search for something
31:39
because you want to clean your house, right?
31:42
And you want to do it in a self-sufficient way,
31:44
you know, you want to buy a product. But then,
31:46
as you browse, okay, well, there's different
31:48
choices that you have, you know. But the
31:50
point I'm trying to make is that
31:53
whole concept of Amazon and search
31:56
is in the context of a need that
31:58
the customer has,
31:59
and
32:00
the life cycle that they're
32:03
in, that this fits into.
32:06
And so what
32:08
does that mean in the context of data products?
32:12
As we start collecting these use cases, a big thing
32:15
that we do and we emphasize on is how
32:17
are they going to be used and how
32:19
often are they going to be used and in what context
32:21
are they going to be used. So
32:23
for example, if someone says, I need this information,
32:27
let's say that they produce a propensity score
32:29
for the person's ability
32:31
to either default or not on the loan. I
32:34
need it within five seconds. Every five seconds, it
32:36
needs to be refreshed. My question always
32:39
is, let's say that I give it to you in four seconds.
32:42
What are you going to do with the one second that you save? Let's
32:45
say I do that. What are you going to do the next five seconds
32:47
before the data comes in? And that leads to a very
32:49
interesting conversation, because what effectively
32:52
you're trying to unravel is
32:54
what's next? What do you do next? In
32:56
the context of the Amazon example that you gave, I
33:00
take it and then that
33:02
spray comes home and then I clean my
33:05
table with it. Well,
33:08
that's good. What do you do?
33:09
Well, yeah, and then I store it. Well,
33:12
in the context of data products, in the
33:14
context of the example that I gave you, well,
33:17
I take that output that
33:19
you provide and then I make a decision off of
33:21
it. What do you do with that decision? Well,
33:24
basically, in the flow of the
33:26
application, the loan origination application,
33:28
the customer is going to be able to see if
33:31
they got a yes or no in terms
33:33
of the loan that they were asking for because I take
33:35
this variable and I weight that
33:37
by 70%
33:38
because I heavily weight
33:41
this to say, if this is a yes
33:43
or no,
33:44
it kind of determines if they get the loan or not.
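The decision flow described here, where a default-propensity score is one heavily weighted input to the loan outcome, might look roughly like this. The 70% weight comes from the conversation; the threshold, second signal, and function names are illustrative assumptions:

```python
# A rough sketch of the described flow: a propensity score (here,
# likelihood the customer is a good risk) is combined with other
# underwriting signals, weighted 70/30, to approve or decline a loan.
# The threshold and second signal are invented for illustration.
def loan_decision(propensity_good, other_signal, weight=0.7, threshold=0.6):
    """Combine a model score with other underwriting signals."""
    combined = weight * propensity_good + (1 - weight) * other_signal
    return "approve" if combined >= threshold else "decline"

print(loan_decision(propensity_good=0.9, other_signal=0.4))  # approve
print(loan_decision(propensity_good=0.3, other_signal=0.9))  # decline
```

Because the score dominates the decision, its freshness and delivery interface become product requirements, which is exactly where the conversation goes next.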
33:46
Oh, wow. OK, that's interesting. So
33:48
now you start walking backwards from there and you start
33:50
saying, OK, well, a CSV
33:53
file probably won't scale for that. How
33:55
are you going to do a reach in for this? Well,
33:57
hey, we typically, like within
33:59
the application
33:59
that we have.
34:01
We use RESTful interfaces for doing everything
34:03
that we go through and drive
34:05
out. Okay. All right. So now you start
34:07
saying, okay, well, now we need to start using APIs. Okay.
34:10
They need to be discoverable. Well, how, like,
34:13
what kind of validation do you do on this to make
34:15
sure that it isn't something that's so wild?
34:17
What happens as a result of that? Okay. Well, then
34:19
as you start having these conversations with
34:22
your customer in the way that they are going to be
34:24
using that analytics, you start formulating
34:27
the interfaces that they're going to be using, the channels
34:29
that they're going to use to soak
34:31
up the intelligence that you're providing, whether
34:33
it's core data or insights
34:36
or information, knowledge, you name it.
34:39
You know, that's what it is. Right. And so
34:41
that starts to formulate the way that,
34:43
you know, you start providing these interfaces
34:46
and
34:46
the same data set or
34:48
information asset or data
34:50
asset, these different types of data products could
34:53
have multiple channels. Right. For example,
34:55
one of the things here could be that, you
34:58
know, in the context of the persona that
35:00
you gave, right, of a data engineer, they could
35:02
be wanting that data set
35:04
through an S3 interface. Yeah. Like something that they
35:07
can consume and batch and then do some reconciliation
35:09
on. So the way that the consumer
35:11
utilizes it in the context of the decision-making
35:14
will dictate the interfaces. And
35:16
those interfaces are what we build that
35:18
then says, Hey, here's the product
35:20
that we're building. Here's how we, here's
35:23
how we deliver it to you so that you
35:25
can consume it. What are your consumption patterns?
35:27
And you got to keep that front and center as you walk in and through,
35:30
because the last mile optimization
35:32
on that is driven off of those items. And
35:34
then there's some interesting nuances as well. Right.
35:37
In the last mile consumption piece, you're less worried
35:39
about duplication. You're less worried about,
35:41
you know, Oh my God, am I copying this data or
35:43
am I copying this in 14 different ways? You're
35:45
more worried about is, is the interface
35:48
optimal for the
35:50
consumption, right? Versus optimal
35:52
for storage and distribution.
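The "same asset, multiple channels" idea above can be sketched as one governed data set exposed through two consumption interfaces: a low-latency lookup for the in-flow application and a batch export a data engineer could reconcile against. All names and values here are illustrative:

```python
# Sketch of one data product served through two channels: an API-style
# single-record lookup, and a bulk CSV export (e.g. dropped to S3).
import csv
import io

SCORES = {"cust-1": 0.82, "cust-2": 0.35}  # the governed data asset

def api_lookup(customer_id):
    """Low-latency, single-record channel for an application flow."""
    return {"customer_id": customer_id, "score": SCORES[customer_id]}

def batch_export():
    """Bulk CSV channel for batch consumption and reconciliation."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["customer_id", "score"])
    for cid, score in sorted(SCORES.items()):
        writer.writerow([cid, score])
    return buf.getvalue()

print(api_lookup("cust-1"))
print(batch_export())
```

Note the deliberate duplication: the same values exist behind both channels, optimized for consumption rather than for storage, which is the trade-off described above.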
35:57
As more people start using AI for projects,
35:59
two things are clear. It's a rapidly
36:01
advancing field, and it's tough to navigate.
36:04
How can you get the best results for your use case?
36:06
Instead of being subjected to a bunch
36:09
of buzzword bingo, hear directly from
36:11
pioneers in the developer and data science
36:13
space on how they use GraphTech to build
36:15
AI-powered apps. Attend
36:17
the Dev and ML Talks at Nodes 2023, a free online conference
36:22
on October 26th featuring some of
36:24
the brightest minds in tech. Check
36:26
out the agenda and register today at neo4j.com
36:30
slash nodes. That's N-E-O,
36:33
the number four, j.com
36:35
slash N-O-D-E-S. I'm
36:40
curious how technical debt
36:42
factors into the overall process
36:45
of the development
36:47
and consideration around what
36:49
the strategy is and how to approach
36:52
implementation, both in terms of,
36:54
I already have this existing technical debt, and
36:56
so that constrains the available
36:59
set of capabilities that I have, or it will extend
37:01
the delivery timeline. But
37:03
also, this is the strategy that
37:05
I want to implement. This is the timeline I'm committing
37:07
to. So now I need to consciously take on
37:10
this additional technical debt. I'm just curious
37:12
how that plays out in the overall process.
37:14
It's a good question. And I say that
37:16
only because we all accumulate
37:18
technical debt, and I haven't quite seen, to
37:21
the extent that I would like to, out in
37:23
the wild, including
37:26
when I used to be on the
37:28
other side of the fence, like leading data teams in
37:30
corporations, both in tech and non-tech, do
37:33
it well. And so here's
37:35
one of the ways that I've seen get
37:40
close to doing it well. It's really
37:42
negotiating a percentage of your
37:44
execution backlog to be dedicated
37:47
to technical debt that the
37:50
engineering team has accountability
37:52
for, in terms of prioritizing so
37:54
that the overall cost of delivery comes down.
37:57
So what does that mean? In the backlog, you could have
37:59
new features, bug fixes,
38:02
and technical debt all be
38:05
commingled. And so what we've seen work
38:07
well is you take about 30% of that backlog
38:10
and you say, hey, we're going to dedicate
38:12
this to technical debt and we're going to give the accountability
38:14
to the data engineering managers or the software
38:16
engineering managers to go through and drive it. They
38:19
prioritize it, they put it on there so that you go
38:21
through and see it, you go through and move it accordingly. The
38:23
product manager should be able to see if they're doing
38:25
a good job of it or not by tracking
38:28
the overall cost of production and the
38:30
operating and maintenance costs associated with
38:33
the product itself. So
38:35
the lower amount of tech debt that you have, I
38:38
think you can see it in a couple of different ways.
38:40
One is your O&M costs are going to go lower.
38:43
And typically, O&M costs, operating and maintenance
38:46
costs are roughly in the 50% mark. So if you can bring
38:48
that down by 20 and kind
38:50
of bring it even into 30 or if you're super
38:53
optimal into the 15% range, that's
38:55
a good indicator that you're resolving
38:57
your tech debt as much as you can. The
39:00
other nascent thing to look at
39:02
is attrition. When you have
39:04
a really poorly built product
39:06
and it's going to be tough for
39:09
you to maintain people on the operating
39:11
and maintenance side of the equation to go through and drive that
39:13
out. So that's one other
39:15
side of the equation. The other thing that I've seen
39:17
work really well is in terms of
39:19
tech debt, because when people go through
39:22
and provide
39:24
these strategies or
39:26
even these patterns and drive
39:29
them out, at that point in time,
39:31
they were probably the best
39:34
and the greatest. But over time,
39:36
like anything else, everything deteriorates. Technology
39:39
is moving at a faster clip rate. So
39:41
having a dedicated time during
39:44
your execution mechanism, like one
39:46
sprint out of seven in
39:48
a classic PI type setting with agile,
39:51
sorry, SAFe Agile, could be one
39:53
where you kind of take a step back and you allow the practitioners
39:56
on the floor to drive and step forward
39:58
who are actually the ones that are closest to
40:00
the pain to say, hey, there are
40:03
these new ways of building things. Can we go through
40:05
and try and implement them and see where they
40:07
go? And that kind of raises
40:10
the bar in terms of making sure
40:12
that not just tech debt stays in check,
40:14
but then you're innovating.
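The O&M heuristic mentioned earlier, operating-and-maintenance spend at roughly the 50% mark as typical, 30% as good, and 15% as highly optimized, could be tracked with a simple indicator like this. The thresholds come from the conversation; the cost figures and function name are made up:

```python
# Illustrative tech-debt health indicator: O&M spend as a share of
# total product cost. Thresholds follow the heuristic in the episode
# (~50% typical, ~30% good, ~15% super optimal).
def om_health(build_cost, om_cost):
    ratio = om_cost / (build_cost + om_cost)
    if ratio <= 0.15:
        return ratio, "excellent"
    if ratio <= 0.30:
        return ratio, "good"
    if ratio <= 0.50:
        return ratio, "typical"
    return ratio, "tech debt likely piling up"

print(om_health(build_cost=700, om_cost=300))  # (0.3, 'good')
```

A product manager could watch this ratio over time, alongside attrition, as the two signals described for whether the dedicated tech-debt backlog slice is working.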
40:16
In some cases, what I've seen is teams
40:19
completely stop delivery of net new
40:21
features. And saying that, you
40:23
know what, the way that we're gonna resolve this is we're gonna
40:25
take care of all the bugs, right? So we're gonna have something
40:27
called a bug bash, and then take
40:29
that completely down,
40:31
right? Maybe they do like a month long
40:33
worth of effort there.
40:35
And then they go through and throttle
40:38
their backlog, so to speak, so to
40:40
make sure that they can get back in line. So these are different
40:42
ways that I've seen teams go through
40:44
and manage this concept of tech debt. And
40:46
then the last thing that I'll mention is, the concept
40:48
that I talked about,
40:50
use cases being hydrated up into
40:52
a set of patterns and these patterns kind
40:54
of going into capabilities.
40:56
It's really important to kind of go through and score
40:59
those capabilities on a yearly basis to say, how
41:01
well are we doing, right? And sanitize
41:03
that and say, and that's another way to measure architecture
41:05
as well. And that I
41:08
have yet to see teams do a good job in because
41:10
they just don't think of
41:12
architecture and scoring the architecture in that way.
41:14
You know, someone writes a blueprint, you know, it's super
41:17
high level, somebody goes through and implements it, and
41:19
we never score those capabilities. Like for example,
41:22
how are our data ingestion capabilities? Is it a nine
41:24
out of 10?
41:25
Why is it a nine out of 10?
41:26
Well, guess what folks, we can't ingest TSVs,
41:28
okay. How important is it?
41:31
Do we have any use cases that go through that? Well,
41:33
yeah, we do. You know, we have five
41:35
out of 20 use cases that are doing that, okay.
41:38
Well, how much time are we spending as a result of that? Well,
41:40
our sprint points are X, you know, for
41:43
these kinds of things. That kind of telemetry
41:45
walking that backwards and then saying, hey, this
41:47
is how we score architecture. I haven't
41:49
seen that as much in the wild,
41:52
if any. But I think that's another
41:54
way to kind of score architecture based on the capabilities
41:56
that you've driven and to make sure that these
41:59
tech debt items
41:59
kind of get brought to the surface.
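The capability scorecard described here, scoring each capability and walking the telemetry backwards through affected use cases, could be kept as something as simple as the following. The entries are illustrative; the TSV ingestion gap is the example from the conversation:

```python
# A minimal yearly capability scorecard: score each capability, note
# the gap, and tie it to how many use cases the gap touches, so tech
# debt items get brought to the surface.
capabilities = [
    {"name": "data ingestion", "score": 9, "gap": "cannot ingest TSVs",
     "use_cases_affected": 5, "total_use_cases": 20},
    {"name": "access control", "score": 6, "gap": "manual grants",
     "use_cases_affected": 12, "total_use_cases": 20},
]

def prioritize(caps):
    """Rank gaps by the share of use cases they touch, worst first."""
    return sorted(caps,
                  key=lambda c: c["use_cases_affected"] / c["total_use_cases"],
                  reverse=True)

for cap in prioritize(capabilities):
    print(f'{cap["name"]}: {cap["score"]}/10, {cap["gap"]} '
          f'({cap["use_cases_affected"]}/{cap["total_use_cases"]} use cases)')
```

Adding sprint points spent working around each gap, as suggested above, would turn this from a static blueprint into the telemetry-backed architecture score the speaker says he rarely sees.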
42:02
And circling back on the interface
42:05
of the product, there's also the question
42:07
of customer education of how
42:10
much context and how
42:12
much familiarity do they need to have
42:14
of the data, of the statistical
42:17
aspects of that data in order to be able
42:19
to use it to effectively
42:22
make decisions or is the
42:24
understanding that they're reaching actually accurate
42:26
based on their background? And
42:29
I'm wondering how you've seen teams try
42:31
to approach that element of delivering
42:34
the data product, delivering the
42:36
guardrails or surrounding
42:39
capabilities so that the
42:41
end user is able to actually
42:43
effectively make use of that product
42:45
without having to have somebody sitting beside
42:47
them saying, okay, this is what you need to know. These
42:50
are the steps to actually use this thing. These are
42:52
the other things that you need to do after the fact,
42:53
etc.
42:54
Good question. I'll start off with a story,
42:56
right? I think all of us will be
42:58
very familiar with this one. That number is
43:00
incorrect. And they're like, why is that
43:03
number incorrect? Because the person did the
43:05
roll up in the wrong way. Okay,
43:07
well, it was obvious that the column
43:09
was there, so I ended up rolling it up. Well, what you didn't
43:12
do is you didn't apply a filter because
43:14
it's not a column actually, you have to apply
43:16
a filter for this column and then do an aggregation
43:19
and then you'll get the right number because effectively what you've
43:21
done right now is you've made it 10x the
43:23
number that it is. And so these
43:26
kinds of stories, right, I've genericized
43:28
it, but these kinds of stories are pervasive, like all
43:30
of us have heard it, right?
43:32
And so if you think about it and ask,
43:34
well, how did that come to fruition?
43:37
People think just because you have the data, you can just
43:39
kind of give it out and not knowing
43:42
the persona group that the person belongs to
43:44
and how the consumption experience has
43:47
been defined for that persona.
43:50
You'll often hear people say, hey, just give me access to the
43:52
data, I'll figure it out, you know? And oftentimes
43:54
you end up with stories like this. So I've
43:57
seen well and done well and kind
43:59
of something that...
43:59
we practice and both preach
44:02
is that the interface that sits on top
44:04
of the data
44:05
needs to walk backwards from the set of questions
44:07
that we're trying to answer.
44:09
What are the kind of roll-ups that we're trying to do?
44:11
What is it that we need
44:13
to do in order to make sure that we put a definition
44:15
around the roll-ups so that it's
44:18
relevant? What are the filter conditions
44:20
that are relevant for those roll-ups
44:22
versus not? And in
44:25
this particular instance, I'm talking strictly about
44:27
dashboards so that you have
44:30
those items outlined so that when people
44:32
come through and consume this, the
44:35
number of toggles or
44:36
inputs that you can use that
44:38
you can get an outcome with is limited
44:41
so that you can go through and drive that out. And
44:44
so that level of metering is super, super important.
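The roll-up story above comes down to aggregating without the filter the column requires. A minimal sketch of the pitfall, with invented data, where the table holds one row per order event rather than one row per order:

```python
# The roll-up bug in miniature: rows carry one record per (order,
# status) event, so summing the amount column without filtering to the
# final status double-counts every order.
rows = [
    {"order": "A", "status": "created",   "amount": 100},
    {"order": "A", "status": "fulfilled", "amount": 100},
    {"order": "B", "status": "created",   "amount": 250},
    {"order": "B", "status": "fulfilled", "amount": 250},
]

naive_total = sum(r["amount"] for r in rows)  # double-counted
correct_total = sum(r["amount"] for r in rows
                    if r["status"] == "fulfilled")  # filter, then aggregate

print(naive_total, correct_total)  # 700 350
```

Baking the filter into the interface (a metered dashboard toggle or a pre-filtered view) is exactly the "limit the inputs" remedy described above.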
44:47
Now on the other aspect of educating
44:50
the user about the data and
44:52
what it means, what I've
44:54
seen specifically
44:56
in the modeling arena is boundary
44:59
conditions and self-throttling, even
45:01
before you get the results out,
45:03
right? To say, hey, these kinds of breaches are
45:05
out-of-bound conditions
45:07
and therefore this needs a second set of review.
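The boundary-condition gating described here, preferred over pages of documentation, can be sketched as a simple check that routes out-of-bound results to a second review before they ever reach the consumer. The bounds and names are illustrative:

```python
# Sketch of boundary-condition gating: model results that breach the
# out-of-bound checks are routed for a second set of review instead of
# flowing straight to the consumer. Bounds are invented for illustration.
def gate_score(score, low=0.0, high=1.0):
    """Return (score, route); out-of-bound scores go to human review."""
    if score < low or score > high:
        return score, "second review"
    return score, "auto release"

print(gate_score(0.42))  # (0.42, 'auto release')
print(gate_score(1.7))   # (1.7, 'second review')
```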
45:10
What I have seen, at its worst,
45:12
is a ton of very
45:15
detailed documents spanning
45:17
multiple pages that exactly explains
45:19
what that is or in fact
45:21
even a user session, you know, that every time
45:24
you get on board, I sit with you and I walk you through what that means.
45:26
That's another thing that I don't see used
45:28
very well. So our preference
45:30
and what we typically like to do is a set
45:33
of tests that are run to make sure that
45:35
the data that you're actually consuming is
45:37
accurate and of high quality and of integrity.
45:39
And then on the consumption side, really limiting
45:41
the inputs to the outputs, right? Like, you
45:44
know, like if there's a country where they don't use zip
45:46
code or they use
45:48
another form of zip code, then don't show that
45:50
option, you know? Just limiting it considerably
45:53
and then lining that up to the questions that you're asking.
45:56
And in your experience of
45:58
working in this space, of helping
46:01
data teams understand what is the
46:03
customer experience that they're trying to
46:05
satisfy, how can they actually go about
46:08
delivering those capabilities? What
46:10
are some of the most interesting or innovative or unexpected
46:12
ways that you've seen teams either go
46:15
through the process of developing and
46:17
executing a given strategy or some
46:19
of the most interesting formulations
46:22
of that strategy that you've seen?
46:24
Yeah, I think when we think about
46:26
customer experience, let's just
46:28
kind of ground ourselves a little bit on the definition of
46:31
how we bring that to life.
46:33
The inputs to customer experience is really
46:35
kind of taking a look at your business and saying, these
46:37
are the different touch points that
46:40
our customers produce as
46:42
they interact with our digital
46:44
as well as our analog real estate. And
46:47
so right there, you can take the analog
46:49
real estate out, you know, and
46:51
you pretty much have the digital real estate and you said, okay,
46:53
well, these are the different interaction
46:56
points that we have. All right, so now
46:58
that we have that, we use those as
47:00
the input to then drive
47:02
decisions that then the
47:05
customer experiences. And
47:09
that whole process could be, how do we optimize
47:11
the loan registration process for the
47:14
lowest number of clicks, right, to get to a decision? It
47:16
could be, you know, how do we make sure that,
47:18
you know, Tobias gets the
47:20
most relevant content that gets presented on screen
47:22
so that he quickly makes a decision on
47:25
buying a product that is relevant to
47:27
their need, right? So what
47:29
I'm trying to get at is the way that I've
47:31
seen teams do a really, really good
47:33
job of that is asking
47:35
the question as to what is the core metric that we need to hedge
47:38
on that clearly defines is the customer
47:40
experience optimal or not? Is
47:42
it the number of clicks?
47:44
Is it the time taken per page? Is it
47:46
the number of items that he's left in
47:48
a basket? What is that? Data teams and
47:51
I haven't seen many data teams do it, but I've seen a lot
47:53
of business intelligence teams do it, which is they
47:55
really, really anchor and they ask the question as to what is the
47:57
metric that we need to be really optimizing for,
47:59
and getting that formulated, getting that listed
48:02
out accurately and done well, right? The
48:04
next thing from there that I have
48:06
seen data teams do well
48:09
is take that and think about all
48:12
of the data elements that come through
48:14
and formulate that answer and start
48:16
putting in early signs
48:18
of failure, right? So for example,
48:21
in order to determine the number of clicks, we get that from
48:23
five different systems. And we know that this
48:25
one system, when we get it from
48:27
that one system, we have to
48:29
make sure that the integrity and the quality is extremely
48:31
high. Okay. But we produce
48:33
this on a weekly basis. Should we flag
48:36
this at the end of the week? Or can we
48:38
flag this as and when the
48:40
data is coming in to say this is out
48:42
of bounds and this doesn't make any sense. There's
48:44
a new ordinal value that we need to flag. Oh,
48:47
these two systems are no longer in sync because our join
48:49
structures are gonna be off. And oh, by the way,
48:51
now this is gonna lead to a massive skew. So
48:54
to summarize where I've seen data
48:56
teams do really, really well is build those capabilities
48:58
around observability and monitoring.
49:01
And for me, there are two distinct things. Monitoring is the
49:03
things that you actually know that you can monitor for. And
49:05
then observability is everything else that you see
49:08
coming through the pipeline that you're
49:10
able to kind of decipher and kind of understand.
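The monitoring-versus-observability split described here, known checks applied as records arrive plus broader signals like slow drift, can be sketched like this. The value sets, thresholds, and function names are invented for illustration:

```python
# Illustrative split: monitoring is the known checks (e.g. a value set
# we expect), applied as records come in rather than at week's end;
# drift is a crude observability-style signal on the running mean.
KNOWN_STATUSES = {"created", "fulfilled", "cancelled"}

def monitor(record):
    """Known-failure checks: flag a new ordinal value immediately."""
    if record["status"] not in KNOWN_STATUSES:
        return ["unknown_status"]
    return []

def drift(values, baseline_mean, tolerance=0.5):
    """Crude drift signal: has the running mean slid off the baseline?"""
    mean = sum(values) / len(values)
    return abs(mean - baseline_mean) > tolerance

print(monitor({"status": "refunded"}))            # ['unknown_status']
print(drift([1.0, 1.2, 2.6], baseline_mean=1.0))  # True
```

In practice teams replace the crude mean check with learned baselines, the machine-learning-assisted pattern detection mentioned above, but the shape of the infrastructure is the same.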
49:12
And then using
49:13
machine learning almost to help you understand
49:16
the patterns and behaviors, the slow drift that's
49:18
going on. And relying less on the operational
49:21
systems to tell you where the problems are. Because
49:24
the operational systems, if they have issues
49:26
going on, they can easily flag it. But otherwise, they
49:28
kind of go through and drive whatever
49:30
it is they need to do. And they can kind
49:33
of go through and keep producing the results, right? So
49:35
having a lot of that infrastructure built on the data engineering
49:37
side to drive that out is where I've seen
49:39
data engineering teams innovate and
49:41
excel, right? Because the
49:44
alternate is, oh, why
49:46
don't we see a lot of data teams innovate on the
49:48
KPI side or pushing the business to think
49:50
more about that? They don't, it's almost
49:52
like asking the spark plugs to define the car.
49:55
It doesn't work that way. So I think it's an
49:57
unfair expectation to
49:59
have
49:59
of data teams. What I think they do really,
50:02
really well is optimizing on the infrastructure pieces
50:04
that I mentioned.
50:05
And in your experience of working
50:07
in this space, what are the most interesting or
50:09
unexpected or challenging lessons that you've learned
50:11
in the process?
50:13
Always question the
50:15
core set of assumptions coming in. Also,
50:18
you know, people will hand
50:20
you over code. I mean, oftentimes
50:23
what really happens is, you know, you're trying to build an analytics product.
50:26
And, you know, like you're trying to go through
50:28
and walk all the way back to the source system.
50:30
You're trying to analyze the data. And
50:33
you've got people telling you
50:35
how the data is manifested
50:38
in these systems. And
50:41
they will talk about it. They give
50:43
you these diagrams and all those different things. I
50:45
think taking a synthetic transaction all the way from
50:47
the left to the right in terms of, hey, here's how the
50:49
data originates. This is how it gets manifested
50:52
in these systems. These are all the assumptions that we're
50:54
making. These are the edge cases. Documenting
50:57
all those items and seeing it and living through
50:59
it, I think is not just key, but
51:01
it's paramount. Because one of the things that always
51:03
shocks me is you kind of come in and then, you know,
51:06
people will say in the operational side, right, they will
51:08
say, let's just take the example
51:10
of a trucking company. They'll say, hey, whenever our trucks
51:12
leave late, our drivers always enter the information.
51:15
And it's a part of our SOP, but we don't see that in the system.
51:18
And so why don't you see that in the system? Well, the thing is they
51:21
tried entering it in this field before.
51:23
It didn't quite work for them.
51:25
So they started using the comment field afterwards.
51:28
So yes, they are doing it, right? So
51:30
the SOP is still active and relevant. However,
51:34
that data is in the system.
51:36
It's just not where they said that it would be. One
51:38
of the good mitigation strategies that I've discovered
51:40
for this is to go out and
51:43
see and take a walk, you know, with
51:45
the actual executioners of the process
51:48
and see what that means. And that's another piece
51:50
that I also kind of bring to the top of the surface
51:53
is business process and understanding
51:55
business process and walking
51:57
that into where the data is
51:59
manifested, in what operational system, and
52:01
how it's manifested. That top to bottom kind
52:04
of viewpoint is important so that
52:06
you can tease these kinds of things
52:08
out.
52:09
And for teams who are starting
52:11
down the path of trying to incorporate
52:14
these strategic processes into
52:17
their delivery workflow, what are
52:19
the cases where going through the whole
52:21
process of building a data product
52:24
strategy, using that as the means to
52:26
identify and prioritize
52:29
work to be done is overkill and
52:31
you just need to focus on the technical
52:33
aspects and that that is actually
52:35
the core capability that you need to deliver.
52:38
I think when you're fairly small, when
52:42
I said fairly small, like you've got a team of, let's
52:44
say, a team of five people
52:46
and then you kind of provide analytics to the organization
52:50
and you formulate
52:52
and work through it in a solution by solution basis
52:54
and that's all that you have. You
52:57
can still start thinking about data and
52:59
the concept of a product and defining a strategy
53:01
but your throughput or the ammunition
53:04
that you bring to the table is gonna be far less. So
53:06
you're gonna accumulate a ton of technical debt as you go through
53:08
it. And honestly, in the beginning, it's
53:11
gonna be par for the course, right? So
53:12
in that case, the team may
53:14
not think that it's overkill
53:17
but your stakeholders may because the initial
53:19
cost of you building a data ingestion
53:22
pattern-based framework that will automatically
53:24
auto ingest data, man, the cost of that
53:26
initiative for a single use case will be extremely high.
53:29
So my suggestion is for places where
53:31
you don't have a lot of executive leadership
53:33
support, i.e. those leaders haven't come
53:36
from a very strong data background and
53:38
they can't see the need
53:40
for it but need to see hard numbers in
53:42
the context of a single use case that's very,
53:45
very myopic, this will be overkill 100%. So
53:48
then the question is, well, how do you, is
53:50
it still not right for the organization and what should we do
53:52
about it? And so I think this is where making
53:55
sure that as you work through the
53:57
use cases, you carve out a certain
53:59
set of your backlog and
54:02
use that in a very nuanced
54:04
way to start building some shared
54:06
capabilities, right? And so this
54:09
is kind of the point that I had made earlier about the fact
54:11
that your acceleration is gonna be less,
54:13
which means you're
54:16
not gonna travel as fast as you normally would. I
54:18
think that's par for the course, but that's
54:20
kind of what I would do in cases like
54:22
that. And those are the places where this
54:25
would be overkill, in areas
54:27
where you've got executive
54:29
support, you've got a set of people around
54:31
you who actually have seen the need
54:34
for building data products
54:36
at scale, and you have multiple teams
54:38
that are all producing data products
54:40
of different variety. There
54:43
may be a big aspiration to
54:46
provide some of these central capabilities
54:48
to lower the overall cost of production. Building
54:51
the use case for that, showcasing what the ROI
54:53
looks like, and doing something that
54:56
product managers do day in and day out, right? In
54:58
organizations like that, right? Where
55:01
you have 50 people all producing products,
55:04
right? Or solutions, so to speak, that
55:06
go through and get serviced by consumers.
55:09
You could start seeing these kinds of concepts
55:12
accepted more so than not. Just
55:14
to summarize, I think it's relevant in
55:16
either set of organizations,
55:18
but it's more pertinent, and the
55:21
investments are a lot easier to make, where
55:23
you have a lot of people just working
55:25
through providing data solutions,
55:27
and you kind of take a look at it, and you say, hey, didn't we just produce
55:29
that data set like last week? Yeah, that had
55:31
four columns, but this has five. So why is that other
55:34
team doing it? Why don't we just kind of take this data set and
55:36
make it into an asset, and then put that
55:38
on there? And oh, by the way, why don't we put privacy
55:40
treatments on it as well? Because that other team
55:42
did that too. How do we mimic it? Oh, you're
55:44
spinning up, you know, like
55:47
an S3 bucket in this way, right? Why
55:50
don't we use Terraform to go through and do that? Oh,
55:52
well, you know, our standards are different, or our naming
55:54
conventions are different. And so
55:56
I think these kinds of problems come at scale,
55:59
right? Because, you know, Tobias, people
55:59
can't move from team
56:00
A to team B because even though
56:02
they use the same cloud provider, the way that
56:05
they do business is different. And
56:07
so the op model is different. So these
56:09
are problems at scale versus, you know,
56:11
in smaller sizes, smaller
56:13
teams, it's more forgiving because,
56:16
you know, like the telephone problem is
56:18
not that high. And for teams and
56:20
individuals who are trying
56:22
to upskill into this space
56:25
of managing data product strategy
56:27
or understanding how best to integrate it into
56:29
their work, what are
56:30
some of the resources that you have
56:32
found useful and that you recommend people dig
56:35
into to be able to understand
56:37
more of the tactical elements of
56:39
how to bring data product strategy
56:42
into the work that they're doing for delivering data
56:44
to their various end consumers?
56:46
Really honing in and understanding software development
56:49
practices and what they mean, I think
56:51
is a good space to start off in. So
56:53
this involves everything from what
56:55
does CI/CD mean, what does
56:58
building services really look like, what do
57:00
contracts mean in this space, like
57:03
API contracts, what do
57:05
discoverable services look like? And this is very, very software
57:10
engineering oriented. And then that's where
57:12
I assume there's got to be a little bit of learning,
57:14
right, kind of coming to the table. The other
57:16
part that I think data engineering teams
57:19
and practitioners currently providing
57:21
data and analytic solutions will bring to the table
57:24
themselves is the inherent understanding that
57:26
data is different. The data assets being
57:28
produced, information assets being produced, they're
57:31
different than just core services, right? So how do
57:33
you think about the op model there? What does
57:35
that look like? And how do you take these
57:38
concepts and build them into
57:40
this? So for us, for
57:42
example, when we produce a data pipeline, do we have a baseline
57:45
data set that we can test against every single
57:47
time? Right? How do we measure drift? What
57:49
does that mean? Like, you know, should we build leaderboards
57:51
or not?
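The baseline-and-drift questions above can be made concrete with a small sketch. This is just one way to do it, with hypothetical column names and data: keep a snapshot of output from a known-good pipeline run, and on each new run compare a summary statistic of the fresh output against that baseline, flagging drift beyond a tolerance.

```python
import statistics

def check_against_baseline(baseline_rows, current_rows, column, tolerance=0.1):
    """Compare a numeric column's mean against the baseline and flag drift.

    baseline_rows / current_rows are lists of dicts, one per record;
    `tolerance` is the allowed relative change in the mean before the
    pipeline's output is considered to have drifted.
    """
    baseline_mean = statistics.mean(r[column] for r in baseline_rows)
    current_mean = statistics.mean(r[column] for r in current_rows)
    relative_change = abs(current_mean - baseline_mean) / abs(baseline_mean)
    return {
        "baseline_mean": baseline_mean,
        "current_mean": current_mean,
        "relative_change": relative_change,
        "drifted": relative_change > tolerance,
    }

# Baseline captured when the pipeline was known-good; data is made up.
baseline = [{"order_total": 100.0}, {"order_total": 110.0}, {"order_total": 90.0}]
todays_run = [{"order_total": 150.0}, {"order_total": 160.0}, {"order_total": 140.0}]

report = check_against_baseline(baseline, todays_run, "order_total")
```

A real setup would track several statistics per column (null rate, distinct count, distribution shape) rather than a single mean, but the idea of testing every run against a stored baseline is the same.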
57:52
And then using that kind
57:54
of set of introspective Q&A to
57:56
start building out capabilities to say, okay,
57:59
well, this is what it means, this
57:59
is what it looks like and start leveraging
58:02
and deep diving on those items. That's
58:04
what I would suggest now. Tactically, there
58:06
are a lot of thinkers in this space, right? Who
58:09
have all kind of provided their own
58:11
perspective on what it means. I mean, Thoughtworks
58:13
as a company I think has spent a lot of time in the space
58:16
of data products. Sanjeev Mohan,
58:18
you know, has done a lot of thinking
58:20
on the data product space. You've
58:22
got data contracts with
58:25
Chad Sanderson, you know, and so on. So I
58:27
think staying close to
58:30
all of these different vectors coming up is a
58:32
big one as well. What I found exceptionally
58:35
helpful is staying close to all the Slack channels, you
58:37
know, where different people are like really
58:39
ideating and thinking about what this means. And
58:43
our space is constantly evolving as well, right?
58:45
So if you think about metric stores, if you think
58:47
about, you know, the concept of obviously
58:50
the data mesh and data fabric have kind
58:52
of come to fruition and different people are
58:54
working on different things in that arena. But
58:56
if you think about data observability, if you
58:58
think about data contracts, like so these are all kind of
59:01
relatively new concepts coming up, right?
59:03
Like, you know, over the past three years. So they've started
59:05
to take shape and they've started to take hold, and
59:07
and thinking about how this impacts
59:10
our space is going to be the biggest one. And for us, what that
59:12
means is there's a ton of change, right? And
59:14
so when you are in these Slack channels, whether
59:16
it's for data quality, whether it's
59:19
for data observability, you know, whether it's
59:22
Bigeye or any of the other companies, you tend
59:24
to start hearing people talk about these
59:26
interdisciplinary concepts and bringing them together.
59:29
And then obviously, you know, the shameless plug
59:31
for your own podcast, Tobias. I mean, like,
59:34
I think, you know, if you're a data engineer and you're not kind of listening
59:36
to some of these things, you're probably missing the
59:38
beat on the trends going on and then kind of incorporating
59:40
that back into your own set
59:42
of practices. Right. So tactically, those are the places that
59:44
I would look for.
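The data contracts mentioned in this answer can be illustrated with a minimal sketch. This is not any particular vendor's API, just a toy example with hypothetical field names: the producing team publishes a contract declaring required fields and types, and records are checked against it before they ship downstream.

```python
# A minimal data-contract check: the contract declares required fields
# and their types; records that violate it are rejected before delivery.
CONTRACT = {
    "customer_id": str,
    "signup_date": str,
    "lifetime_value": float,
}

def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

good = {"customer_id": "c-42", "signup_date": "2023-01-05", "lifetime_value": 812.5}
bad = {"customer_id": "c-43", "lifetime_value": "812.5"}
```

In practice the contract would live in a schema registry or a tool like the ones discussed here, and enforcement would happen in CI or at the pipeline boundary, but the shape of the check is the same.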
59:46
Are there any other aspects of this
59:48
space of data product strategy,
59:50
how to think about it from a technical
59:53
perspective, how to incorporate it into
59:55
your overall work processes that we didn't
59:57
discuss yet that you'd like to cover before we close out the show?
59:59
I think we did
1:00:02
touch on it, but let me double click
1:00:04
on it further. I think this concept of
1:00:06
metrics and really gauging to see if your
1:00:08
strategy is headed in the direction that
1:00:10
it needs to head is core. When
1:00:13
we start thinking about a data product strategy,
1:00:16
the question that we need to ask ourselves is what
1:00:18
are we going to get as a result of that? Is it going to be
1:00:21
lowering the cost of producing products?
1:00:23
Is it going to be increasing throughput on
1:00:25
capabilities that we already have? Depending
1:00:28
on that and really understanding
1:00:29
why and what that means is
1:00:32
going to be key and core.
1:00:34
Also understanding if you're doing this for defense
1:00:37
or offense purposes. If you're
1:00:39
doing this to optimize cost, or you're trying to increase
1:00:41
top line. Answering those questions initially
1:00:44
and grounding yourself in why
1:00:46
you're doing what you're doing is going to be super important.
1:00:48
Otherwise,
1:00:49
this will be just like another flavor of the day.
1:00:51
You will be producing solutions and nothing more and
1:00:54
probably at twice the cost and for
1:00:57
one half the value.
1:00:58
All right. Well, for anybody who wants to get
1:01:00
in touch with you and follow along with the work that you're
1:01:02
doing, I'll have you add your preferred contact information
1:01:04
to the show notes. As the final question,
1:01:07
I'd like to get your perspective on what you see as
1:01:09
being the biggest gap in the tooling or technology
1:01:11
that's available for data management today.
1:01:13
Well, firstly, I
1:01:15
think the biggest problems that we have
1:01:18
are still about comprehension of how we use things more
1:01:20
than the technologies themselves. One aspect
1:01:23
that I see we completely
1:01:25
lack is this ability
1:01:27
to learn from the way that others are using
1:01:30
the tooling and the data within the
1:01:32
ecosystem that we have and then making
1:01:34
our systems more intelligent. One
1:01:37
of the things that we always
1:01:39
think about with respect to data management is
1:01:42
it's kind of like being a cartographer.
1:01:45
There are many cartographers all across your organization
1:01:47
that are writing these queries, merging
1:01:50
or culling through data and then formulating these side
1:01:52
roads. And oftentimes, whenever you start
1:01:55
looking at it, they're interpreting how this data
1:01:57
is being assimilated together and then creating
1:01:59
this map of the organization. When
1:02:02
one person does it, how can another
1:02:04
person not take advantage of it? And when one person
1:02:06
does it, how do we have enough confidence that that
1:02:08
end road or that side road
1:02:10
can have the right level of throughput so that we
1:02:12
can actually go through and
1:02:15
use it for other purposes, right? And then how do we
1:02:17
kind of
1:02:17
auto-migrate that up? That whole building
1:02:20
an intelligent ecosystem, right? Where
1:02:22
you have data that helps you
1:02:24
derive the way to use new data, I
1:02:27
think it's completely lacking in this business. And I
1:02:29
don't know if we're doing as much in
1:02:31
that arena or not, right? So intelligent systems
1:02:34
and using AI for BI, I think
1:02:36
is a big one that I see us having
1:02:38
a gap in.
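The cartographer idea, systems that learn from how others already query the data, could start as simply as mining query logs for tables that are repeatedly joined together and surfacing those well-traveled paths to the next analyst. A rough sketch under that assumption, with made-up queries:

```python
import re
from collections import Counter
from itertools import combinations

def table_pairs(query_log):
    """Count how often pairs of tables appear together in the same query.

    Uses a crude regex for names following FROM/JOIN; a real system
    would use a proper SQL parser and the warehouse's actual query log.
    """
    pair_counts = Counter()
    for sql in query_log:
        tables = sorted(set(re.findall(r"(?:from|join)\s+(\w+)", sql, re.IGNORECASE)))
        for pair in combinations(tables, 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical query log; the frequent (customers, orders) pairing is the
# kind of "side road" that could be promoted into a shared, trusted asset.
log = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT o.id FROM orders JOIN customers ON orders.customer_id = customers.id JOIN products ON products.id = orders.product_id",
    "SELECT count(*) FROM sessions",
]
popular = table_pairs(log)
```

Pairs that recur across many users are candidates for the auto-promotion the speaker describes: materialize the join once, with confidence derived from how often the path is traveled.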
1:02:39
All right, well, thank you very much for
1:02:41
taking the time today to join me and
1:02:43
share the work that you are doing and
1:02:46
your experience of building
1:02:48
and executing on data product
1:02:50
strategies. It's definitely a very important
1:02:53
area, one that has been growing in
1:02:55
visibility and adoption. So I appreciate
1:02:58
the time you've taken to share that with us and
1:03:00
I hope you enjoy the rest of your day. Thanks, Tobias, appreciate
1:03:02
it. Thanks for listening. Don't forget to check out
1:03:04
our other shows, Podcast.__init__, which covers the Python language, its community, and the
1:03:06
innovative ways it is being used, and the Machine Learning Podcast,
1:03:09
which helps you go from idea to production with machine learning.
1:03:11
Visit the site at dataengineeringpodcast.com to subscribe to the show, sign
1:03:13
up for the mailing list, and read the show notes. And if you've learned something
1:03:27
or tried out a product from the show, then tell us about it. Email
1:03:29
host at dataengineeringpodcast.com with your story. And to
1:03:31
help other people find the show, please leave a review on Apple Podcasts.