Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:00
Managing data and access to data is one of the biggest challenges a company can face. It's common for data to be siloed into independent sources that are difficult to access in a unified and integrated way. One approach to solving this problem is to build a layer on top of the heterogeneous data sources. This layer can serve as an interface for the data and provide governance and access control. Cube is a semantic layer between the data source and data applications. Artyom Keydunov is the founder of Cube, and he joins the show to talk about the approach Cube is taking.
0:33
This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.
1:24
Artyom, welcome to Software Engineering Daily.

Thank you.
1:26
Thank you for having me today. I'm excited about today's conversation.
1:30
Great, great. So let's make sure we're all on the same page to get started. Let's first talk about some fundamental definitions. The term "data silo": when I think of a data silo, what I think of primarily is independent data sources that contain interrelated data, data that's meant to work together. What do you think of that definition? Is it a good definition, or what would you enhance it with?
1:56
Yeah, I think it's a good definition. Essentially, the database or the specific warehouse or data store where the data is located becomes disconnected from other places, and it becomes a silo. That's what people usually think about with data silos, and I think it's a good enough definition. The only way I would enhance it is to also think about metadata and data definition silos. That's an interesting problem, and one we solve at Cube: maybe in your Power BI, or somewhere else in the organization, or in some Python Django app, you don't actually hold the data, but you have a lot of SQL scripts that do analysis of the data. They try to calculate some metrics, and those become metric silos, or data definition silos, because you calculate something to show data to, say, a customer or a partner, or internally, and people get some idea out of that data. Maybe the data was correct, but the definition was siloed. So that's an interesting enhancement to the idea of data silos: it may be not only the data itself, but a data definition or metric definition silo.
3:11
Got it. Yeah. So it's not just the data, it's how the data is used, how it's defined, and the meaning of the data as well. That's actually a great extension to that definition. So why are data silos a critical issue for data modeling in general, or data management and data usage in general?
3:32
Yeah. If you zoom out, I think the whole purpose of having data is to help the business drive decisions, right? We want to be data-driven as an organization. People at all levels, execs, management, and individual contributors, all want to make their day-to-day decisions and operations data-driven, so they need to have access to data. And what's happening is that we try to create more and more touch points for people with data, but naturally, by creating those, we also create silos as a side effect. We move some data closer to marketing, and they start to use it, but then it becomes a silo. And by silo I mean, as we just defined, something disconnected, whether the data itself or the definition itself. The problem is that it becomes really hard to keep them in sync. So you end up in a situation where the organization becomes data-driven, they do work with data, they have access to the data, but is this data correct? Does it show the real number? Does it stay in sync? If your company decided to change a specific metric, maybe five out of your seven silos have been updated, but the rest haven't, and now some of your departments are looking at the old definition. If I try to generalize it, and this is a software engineering podcast, I would call it a repetition problem. We have the DRY idea, "don't repeat yourself," in software engineering. Essentially, what's happening is that we repeat data, or repeat data definitions, in many places, and then we need to keep them in sync. And as engineers, we know that's just bad, right? It's really, really hard to repeat things and then keep those things in sync.
5:23
So go into an example, just for those who aren't following. You've got data that's collected via an engineering department, and then marketing wants to make use of some of that data. So they take part of that data and bring it into one of their systems. Now the data is duplicated into a different system. They take that data, perhaps enhance it with some other information they have, reformat or restructure it or reanalyze it in a different way, and make different use of it for different purposes. Even though the fundamental data has common roots, the data itself is now out of sync, because there are differences and different interpretations. Not only is the data itself a little different, but the interpretation of the data is considerably different. And that's the sort of out-of-sync, non-DRY, if you will, problem that you're talking about here. Is that correct?
6:17
Yeah, exactly. That's exactly it.
6:19
So you've created something that's called a semantic layer on top of data silos. I think that's actually what you're calling it as well. So tell me a little bit about what a semantic layer looks like on top of siloed data.
6:34
Right, yeah. If we think about the semantic layer, many of us have worked with different types of semantic layers every time we've worked with a BI tool. Essentially, the semantic layer has usually been a part of the BI. In every BI tool, we can drag and drop measures and dimensions to build charts. Every time you work with these high-level, business-level metric definitions, you work with the semantic layer. What the semantic layer does is take these definitions and translate them into SQL queries; it knows the underlying database and warehouse structure. So the semantic layer is a bridge between the metrics the business utilizes and the underlying data structure.
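That translation step can be sketched in a few lines of Python. This is a toy illustration of the general idea, not Cube's actual implementation; the model, table, and column names are all invented.

```python
# Toy semantic layer: it maps business-level measure/dimension names
# onto the underlying warehouse schema and generates SQL.
# All names here are invented for illustration.

MODEL = {
    "measures": {
        "total_revenue": "SUM(orders.amount)",
        "order_count":   "COUNT(*)",
    },
    "dimensions": {
        "order_month": "DATE_TRUNC('month', orders.created_at)",
    },
    "table": "orders",
}

def to_sql(measures, dimensions):
    """Translate a business-level query into SQL against the model."""
    select = [f"{MODEL['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{MODEL['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {MODEL['table']}"
    if dimensions:
        cols = ", ".join(MODEL["dimensions"][d] for d in dimensions)
        sql += f" GROUP BY {cols}"
    return sql

print(to_sql(["total_revenue"], ["order_month"]))
```

The consumer only ever names "total_revenue" and "order_month"; the layer knows which table and expressions those map to.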
7:21
Now, the problem is that every BI has its own semantic layer.
7:26
So just to make sure everyone's on the same page: by BI, you're talking about the business intelligence tools that make use of the data, whether that's a marketing use or some other use?
7:34
Right, right, exactly. So now, if an organization has 10 business intelligence, data visualization, data consumption tools, the semantic layer is in every one of those tools, which creates silos; it creates data definition silos. Now, that's a problem for all the reasons we just talked about: it's not DRY, and it's going to get out of sync. The solution to that is a universal semantic layer. That's what we're building at Cube: the idea is to take this piece of your stack where you have repetition, specifically the semantic layer, and put it into a single, universal place, from which you can reuse the definitions across all your data visualization, data consumption, and business intelligence tools. So now it makes your system, your data architecture, DRY at scale.
8:30
Okay, so you didn't do anything with the data itself. The data may or may not be DRY; it probably is at some level, hopefully. But for the definitions, you've created a DRY understanding of what the various pieces of data mean and how to interpret them, and created one standard for how to use that data.
8:51
Yes, yes, exactly. So Cube becomes a universal semantic layer; Cube becomes an interface to your data that holds all the definitions in one place. It knows about the underlying data silos, it knows the underlying data storages, it knows about potential issues with the data, but none of those issues need to be exposed to the consumers. The consumers only communicate with Cube, and Cube communicates with all the underlying sources. So Cube becomes an interface to the data for the data consumers, like a facade; there's the facade pattern in software engineering, and Cube is essentially a facade over the data for its consumers.
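The facade comparison can be sketched as follows. This is a hypothetical illustration of the pattern, not Cube's architecture; the store classes and routing table are invented.

```python
# Sketch of the facade idea: consumers talk to one object that hides
# which underlying store actually answers each request.
# Store classes and the routing table are invented examples.

class SnowflakeStore:
    def query(self, metric):
        return f"snowflake-result:{metric}"

class PostgresStore:
    def query(self, metric):
        return f"postgres-result:{metric}"

class SemanticLayerFacade:
    """Single interface to the data; consumers never see the silos."""
    def __init__(self):
        self.routes = {"revenue": SnowflakeStore(), "signups": PostgresStore()}

    def get(self, metric):
        # The consumer names a metric; the facade picks the source.
        return self.routes[metric].query(metric)

layer = SemanticLayerFacade()
print(layer.get("revenue"))
```

A consumer calling `layer.get("revenue")` has no idea, and no need to know, which warehouse answered.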
9:36
This episode of Software Engineering Daily is brought to you by HookDeck. Building event-driven applications just got significantly easier with HookDeck, your go-to event gateway for managing webhooks and asynchronous messaging between first- and third-party APIs and services. With HookDeck, you can receive, transform, and filter webhooks from third-party services and throttle the delivery to your own infrastructure. You can securely send webhooks triggered from your own platform to your customers' endpoints, ingest events at scale from IoT devices or SDKs, and use HookDeck as your asynchronous API infrastructure. No matter your use case, HookDeck is built to support your full software development lifecycle. Use the HookDeck CLI to receive events on your localhost. Automate dev, staging, and prod environment creation using the HookDeck API or Terraform provider. And gain full visibility of all events using the HookDeck logging and metrics in the HookDeck dashboard. Start building reliable and scalable event-driven applications today. Visit hookdeck.com/SEDaily and sign up to get a three-month trial of the HookDeck team plan for free.
10:51
So maybe it will help if we go into a specific example. Let's assume we have an e-commerce store; that's a great example that everyone loves to use. An e-commerce store is going to be collecting data from multiple places and for multiple purposes. It's going to be collecting data about website hits and clicks and all that sort of stuff. It's going to be collecting data about advertisements and the effectiveness of those advertisements, and who's driving what traffic to the site. Then there's going to be data from cart adds and cart deletes and checkouts, which ultimately turns into order data. And then shipment information from a warehouse, and that's going to be a different set of data. And there are 20 or 30 other pieces of data that we haven't talked about, but those are basically the different sources and types of data we're talking about. Using that example, walk through what a semantic layer might do and who might be able to take advantage of it.
11:56
Right. So imagine you collect all this data into, say, a warehouse like Snowflake. In our example, we can keep it simple and have one single warehouse. You collect all your data into that warehouse. Maybe you use an ETL tool like Fivetran to ETL some Stripe data and your Shopify orders, and then you enhance that data with some analytics coming from your websites through Segment. By the end of the day, all the data arrives in Snowflake. Now, in your organization, you have Tableau, you have Power BI, you have Excel, and you also need to display data to customers through dashboards. So you start building different metrics. Say you want to build an average order value. You define that in a Tableau workbook, you define that in Power BI, and then you define it in some SQL script that's powering your customer-facing analytics. At some point, you probably want to change that, because the business evolves, the definitions of the metrics evolve, or you need to fix some discrepancies in the definitions. So you go and redefine it, maybe in Tableau, but you forget to redefine it in Power BI, and maybe in some of the charts in the customer-facing analytics, because you have 100 SQL queries powering different charts. The more metrics you need to change, the bigger the problem becomes. At some point, they all become out of sync.
13:31
So the solution to that would be: do we really need to define them in the visualization layer in the first place? What if the visualization layer were thin, very dumb from a logic perspective, there just to render things, and went to a universal semantic layer for the definitions? Does average order value really need to be defined in Tableau? Tableau can just go and say: hey, Cube, give me average order value. So without a semantic layer, your Tableau goes directly to Snowflake. With a semantic layer, Tableau goes to Cube, and Cube goes to Snowflake. Cube becomes the hub that receives all the queries, translates them using the real definitions of the data, goes to Snowflake to query the data, and sends the results back to Tableau. So it's like a universal proxy or gateway to the data for all the tools, and it holds all the definitions. And by having it, you can make the visualization layer very thin, without any logic attached to it.
14:36
So if you need an average order value, you create an average order value piece of data, if you will, that's calculated, and that calculation occurs within the semantic layer.
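The "define the calculation once, in the layer" idea might look like this in miniature. A hedged sketch: the order records and field names are invented, and a real semantic layer would generate SQL against the warehouse rather than compute in application code.

```python
# Sketch: "average order value" is defined once, in the layer, so
# Tableau, Power BI, and customer-facing dashboards all get the same
# calculation. Records and field names are invented.

ORDERS = [
    {"id": 1, "amount": 40.0},
    {"id": 2, "amount": 60.0},
    {"id": 3, "amount": 50.0},
]

def average_order_value(orders):
    """The single shared definition: total revenue / order count."""
    return sum(o["amount"] for o in orders) / len(orders)

# Every consumer calls the same definition instead of re-deriving it:
for tool in ("tableau", "power_bi", "customer_dashboard"):
    print(tool, average_order_value(ORDERS))
```

If the definition ever changes, it changes in one place, and every tool picks it up.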
14:57
WorkOS is a modern identity platform built for B2B SaaS. It provides seamless APIs for authentication, user identity, and complex enterprise features like SSO and SCIM provisioning. It's a drop-in replacement for Auth0 and supports up to 1 million monthly active users for free. It's perfect for B2B SaaS companies frustrated with the high costs, opaque pricing, and lack of enterprise capabilities supported by legacy auth vendors. The APIs are flexible and easy to use, designed to provide an effortless experience from your first user all the way to your largest enterprise customer. Today, hundreds of high-growth scaleups are already powered by WorkOS, including ones you probably know, like Vercel, Webflow, and Loom. Check out workos.com/SED to learn more.
15:49
So there are obviously DRY-code advantages to this. Are there performance advantages as well?
15:56
Yes. There are a couple of additional benefits: performance, and unified governance and access control. On performance: once you define everything in the universal semantic layer, it means you're going to query everything through the semantic layer, so the semantic layer becomes a place where caching starts to make sense, since it's the single place all the requests go through. Once your whole system goes through the universal semantic layer, you can do caching there. Cube specifically has a few different implementations and caching strategies that can help with that, but the idea is that because you query through the semantic layer, it's an ideal place to cache, so there are definitely a lot of opportunities to improve performance there. And the same idea applies to access control. Access control also tends not to be centralized; it lives in the different business intelligence and visualization tools. But if you have a single place that you query your data through, that's an opportunity to centralize access control. So those are two additional benefits that a semantic layer architecture can provide as well.
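The caching argument, that a single query path creates one natural cache point, can be sketched like this. A toy illustration, not Cube's caching implementation; the warehouse stub and cache policy are invented.

```python
# Sketch of why the semantic layer is a natural cache point: every
# consumer's request funnels through one place, so identical queries
# can be answered from cache. The "warehouse" here is a counting stub.

calls = {"warehouse": 0}

def run_in_warehouse(sql):
    calls["warehouse"] += 1          # stands in for an expensive query
    return f"rows-for:{sql}"

cache = {}

def query(measures, dimensions):
    # Normalize the request so equivalent queries share a cache key.
    key = (tuple(sorted(measures)), tuple(sorted(dimensions)))
    if key not in cache:
        cache[key] = run_in_warehouse(str(key))
    return cache[key]

query(["revenue"], ["month"])
query(["revenue"], ["month"])        # second call is served from cache
print(calls["warehouse"])            # the warehouse was only hit once
```

If each BI tool queried the warehouse directly, none of them could share this cache; funneling through one layer is what makes it possible.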
17:06
Okay, okay. And you mentioned universal governance, which I know is tied to access control, but when I think of universal governance, I think of a lot more than that too. I'm assuming there are other benefits, like regulatory compliance and things like that. Do you want to talk about that a little bit?
17:23
Yeah. Once you have all the metric definitions and dimensions in place in the semantic layer, you can classify them and say: this is PII data, this is data under that compliance regime. You can also manage the owners of the data, so you have different groups and teams responsible for making sure the data is up to date. You can apply the whole set of governance features that you would expect.
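Attaching governance metadata to centralized definitions might look like this in miniature. A hypothetical sketch: the catalog fields, tags, and role rules are invented, not Cube's governance model.

```python
# Sketch: once definitions live in one place, governance metadata
# (classification, ownership, access rules) can attach to them and be
# enforced at query time. All fields and roles are invented.

CATALOG = {
    "email":       {"sql": "users.email", "tags": {"pii"}, "owner": "data-eng"},
    "order_count": {"sql": "COUNT(*)",    "tags": set(),   "owner": "data-eng"},
}

ROLE_CAN_SEE_PII = {"finance": True, "marketing": False}

def accessible_fields(role):
    """Return only the fields this role is allowed to query."""
    return [
        name for name, meta in CATALOG.items()
        if "pii" not in meta["tags"] or ROLE_CAN_SEE_PII.get(role, False)
    ]

print(accessible_fields("marketing"))   # PII fields filtered out
print(accessible_fields("finance"))
```

Because every query passes through the layer, this filter is enforced rather than merely documented.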
17:51
I think the difference between a semantic-layer-first architecture and a more classical governance architecture is that the semantic layer is active, meaning that you make all your queries directly through the semantic layer. In a traditional governance approach, the governance system usually sits on top of your stack, and a little bit to the side, meaning that your Tableau queries Snowflake directly, but you have a governance platform that describes the data and talks about the data. There's no strong way to enforce anything, right? Because it sits on the side.

You can't filter when you're not inline.
18:29
Exactly, exactly, yeah.

Okay, so governance is a useful case for this as well. And governance changes: a lot of governance rules change on a regular basis, and having to go through and change the rules in multiple tools can be problematic, and error-prone for that matter. This allows you to change them in a single location. Am I characterizing that correctly?
18:51
Yeah, exactly. I think you're spot on. It's the same problem as with the data model, right? You sprinkle all these definitions across all the tools, and then it's not centralized, it's not DRY. The same idea applies to the data model and to governance. In fact, I believe that governance should be closely connected to the data model, because it's all about that: what kind of metric is this? What kind of people can access that metric or that specific dimension? Is it a PII dimension? And so on and so forth.
19:24
Yeah, yeah. In some ways, governance is just part of the metadata associated with the data, and you need to tie it to the data. That's usually not the way it's done, but a layer like this can certainly enforce those sorts of restrictions.
19:47
Is your code getting dragged down by joins and long query times? The problem might be your database. Try simplifying the complex with graph. A graph database lets you model data the way it looks in the real world, instead of forcing it into rows and columns. Stop asking relational databases to do more than they were made for. Graphs work well for use cases with lots of data connections, like supply chain, fraud detection, real-time analytics, and Gen AI. With Neo4j, you can code in your favorite programming language and against any driver. Plus, it's easy to integrate into your tech stack. People are solving some of the world's biggest problems with graphs. Now it's your turn. Visit neo4j.com/developer to get started. That's neo4j.com/developer.
20:38
So in the example that we gave, the e-commerce example, we were collecting data from multiple sources, essentially ETLing it all into a single data warehouse and then putting the semantic layer on top of that. But there actually are scenarios where there isn't a universal, single data warehouse; there's data from multiple sources that is disjoint and isn't uniform. First of all, can you deal with that? I'm assuming you can with Cube. But do you recommend centralizing your data before you put a semantic layer on top, or are there advantages to leaving the data decentralized and putting a semantic layer on top?
21:16
Yeah, great question. The semantic layer, and Cube specifically, can work both ways: on centralized data, and on data located in different places and in different styles. The way Cube works is that it can dynamically connect to different data warehouses based on the data model definition. So when you design your data model, you can say: this part of the data comes from Snowflake, that part of the data comes from, maybe, BigQuery. And that's how we access the data, and how we go between data sources if needed.
21:51
I think there are different approaches to modeling data. Obviously there's the cloud data warehouse way; then you have something like a lakehouse, which to some extent is very close to the data warehouse architecture; and then you have a full zero-ETL approach, where you just keep the data in the places where it is and use some federation engine like Trino to access it. From a semantic layer standpoint, we fit into any of these. I don't think we position the semantic layer as a federation engine specifically right now, so we would rather rely on something like Trino if you really need to do heavy federation. We can federate data, we can join across multiple data sources, but if you need to go deep into more complicated use cases, where you need to push down some compute to the source and then bring it back and massage the data, there are engines like Trino that are really good at that. But at the end of the day, we can work with both: fully ETLed data, or data that is not ETLed and needs to be federated.
23:04
In terms of the advantages, I think there are a lot of advantages to having the data in a single place, but at the same time, there's always the cost of moving data. If the cost of moving data is really high, for a variety of reasons, then it's probably not worth it. But if there's an opportunity to centralize data in a single warehouse or lakehouse architecture, that feels like the preferable solution.
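The multi-source data model described here, where each part of the model declares which warehouse it lives in, can be sketched as follows. A toy illustration under invented names; a real federation engine like Trino does far more than this grouping step.

```python
# Sketch of a multi-source data model: each part of the model declares
# its source, and the layer routes (or federates) accordingly.
# Sources and model entries are invented examples.

DATA_MODEL = {
    "orders":   {"source": "snowflake", "sql_table": "prod.orders"},
    "sessions": {"source": "bigquery",  "sql_table": "analytics.sessions"},
}

def plan(cubes):
    """Group requested cubes by source; >1 source implies federation."""
    by_source = {}
    for cube in cubes:
        src = DATA_MODEL[cube]["source"]
        by_source.setdefault(src, []).append(cube)
    return {
        "queries": by_source,
        "needs_federation": len(by_source) > 1,
    }

print(plan(["orders"]))                 # single-source query
print(plan(["orders", "sessions"]))     # cross-source join needed
```

The consumer's query never mentions Snowflake or BigQuery; the routing decision lives entirely in the model.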
23:25
That makes sense. I think one of the things that I see when we talk about low-ETL data processing, where you have multiple sources and you leave the data in multiple sources, is that interpretation of the data can be an issue. As a simple example, you have data coming from, let's say, your Shopify order processing, to keep using our example here. Engineering is looking at that data, marketing is looking at that data, finance is looking at that data, and they all think it looks different than it really does, because they make assumptions. Every single one of them makes assumptions about what the data means that may or may not be true. Finance may assume every transaction ends up... I'm kind of drawing a blank on specific examples here. But what I'm trying to get at is that when you look at data from a given source, different consumers of that data can make different assumptions about what that data actually means or contains, assumptions that may or may not be true. And so you get different interpretations. Tell me how a semantic layer helps remove that problem.
24:36
Yeah, good question. At the end of the day, we're talking about trust in data. Do our data consumers trust the data? When they go and consume data in Tableau, do they trust it? I feel like the way to solve it is to provide as much context as possible to the data consumer, so they understand where the data is coming from, how it's being processed, and how it's being calculated. But at the same time, the problem is that all these steps are very technical, right? We have a lot of pipelines, transformation code, and data modeling code, essentially telling you: yeah, you can look at it, but you can only understand it if you can read code. I think the problem here is how to build the bridge that tells data consumers how the data is calculated.
25:28
Imagine an organization, an e-commerce store, that has a data engineer, and marketing is looking at the data, finance is looking at the data, and they don't trust the data. So they come to the data engineer and ask: hey, how did you calculate that? If the engineer can explain in more detail how the calculation is done, that creates a higher level of trust in that data. So the question is: is it possible to optimize that? I feel like that's connected to governance, or to what is mostly a catalog solution for the data. That's something we think a lot about at Cube, because we've become the place where all the data model definitions live today. Is there an opportunity to surface that knowledge to the data consumers, in a way that lets them learn about the data, what kind of data they have, the lineage of the data, the definitions, and then take that knowledge, use the data, and trust the data?
26:28
That makes sense. You know, I think one of the problems that I run into a lot when I do these sorts of data analysis is that you've got a definition of what you want to try to get out of your data, and how you get that data changes the results. So, a simple example: I'm trying to get the average click-through rate for social-media-engaged sessions coming into my site. I made that up, but something like that. Well, what does the average for that mean? Let's look over 90 days. Okay, so which 90 days? The last quarter, or the last 90 days, or the last 90 days minus today, or the last 90 days plus yesterday? You see what I'm saying: the definition of how I get what seems like a very well-described piece of information is different, and different people have different interpretations. What ends up happening is that different people come up with radically different answers, and in fact, how you collect the data can actually cause the data to be more or less usable as well. Tell me how a semantic layer like this can help with that problem.
27:36
a perfect example
27:38
of a problem that we're trying to
27:40
solve with a semantic layer because Because
27:43
even a single definition can have
27:45
a lot of variations, right? Like
27:48
as in your example average Click-through
27:51
rate for different people. It may mean
27:53
different things and then if
27:55
you let you know like
27:58
this definition to sprinkle across
28:00
organization to be located in different places. Anyway,
28:02
we will have a lot of different definitions,
28:04
right? And everyone would like create their own
28:07
like definitions here, like on a sideways, or
28:09
no one is like looking at and then
28:11
kind of use that definition in some presentation
28:14
that will end up at the board level
28:16
and the board level would be like, what
28:18
the number is that? Problem is that
28:21
we need a centralized governance
28:23
of the metrics. And like
28:26
people can have different metrics. That's
28:28
totally fine. We just need a
28:30
framework to be able to develop
28:32
them and then document
28:35
and then just share with the rest of
28:37
the organization. So if someone would come to, you
28:40
know, like to me as a data engineer and say, like, I want
28:42
this average click through rate, I will
28:44
ask a lot of follow up questions of how exactly
28:46
we want to do this, right? But do you
28:49
want it to calculate it this way? Look at the 90 days.
28:51
Are we talking about the rolling? Are we
28:54
talking about any specific filters and all of
28:56
that? So through that conversation, we will come
28:58
up with some definition that we mutually agree,
29:00
right? The person who wants the metrics
29:03
and I'm as a data engineer and I can say,
29:05
okay, we have the data that
29:07
we collect in a specific way that we can
29:10
support you. Calculations that we
29:12
can give you that metric. Now I
29:14
would go into my semantic layer. I will create
29:16
that as code. Essentially it's
29:18
all kind of a code base at the end of
29:20
the day. You put this definition in a code
29:23
base and then you have this
29:25
metric. And now that metric is going
29:27
to be available in a table or
29:29
other places. Now I do
29:31
need to document it. Of course, I need to create
29:34
a really good description of how it's
29:36
being calculated. So the other
29:38
person who comes to that metric can
29:40
read this definition and understand, okay, this
29:43
is an average click-through rate, and that's
29:45
exactly how it was calculated. And maybe they
29:47
need another one with a different type of
29:50
calculation. They come to me and say, hey,
29:52
I need to make a change. That's
29:55
fine. And we can make the change and
29:57
then we can create a second one.
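As a rough illustration of keeping a metric definition like this in a code base, here is a minimal Python sketch; the class and field names are hypothetical, not Cube's actual modeling syntax:

```python
from dataclasses import dataclass, field

# Hypothetical metric-definition record kept under version control;
# the fields mirror the follow-up questions discussed above.
@dataclass
class MetricDefinition:
    name: str
    sql: str                  # how the metric is computed
    description: str          # documents the mutually agreed meaning
    window_days: int = 90     # e.g. look at the last 90 days
    rolling: bool = True
    filters: list = field(default_factory=list)

avg_ctr = MetricDefinition(
    name="avg_click_through_rate",
    sql="SUM(clicks) / NULLIF(SUM(impressions), 0)",
    description="Average CTR over a rolling 90-day window, excluding bot traffic.",
    filters=["is_bot = false"],
)

print(avg_ctr.name, avg_ctr.window_days)
```

A second, slightly different definition would simply be another record alongside this one, with its own description.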
30:00
I think there is a potential challenge in that solution. That's
30:02
sort of the state of today, right? This
30:04
is how all users use Cube and
30:07
how they leverage the whole stack.
30:09
It might create a little bit of work
30:12
for the data engineer to make
30:14
sure that everything is defined and documented.
30:17
I think that's where I believe
30:19
that AI can help us. We
30:21
haven't talked about AI yet, but I felt like
30:23
we should have, right? It's just like this. Yeah,
30:26
right. I think
30:28
what I think we can do is,
30:30
as long as we keep everything as a code
30:33
base, AI is really good
30:35
at writing code. I think
30:37
that's the best use case for
30:39
the modern day AI is just to
30:41
generate code and generate text
30:43
descriptions. So essentially we can use AI:
30:46
if you need a new metric, maybe
30:48
you don't need to go to the data engineer, you
30:50
go to AI, and AI can go and
30:52
take that metric definition, generate the code to
30:54
create this metric definition and then send a
30:57
pull request, and then the data engineer
30:59
only needs to review that. Your
31:01
point about AI, I think it's illegal in
31:03
today's society not to talk about AI when
31:05
you're talking about data, so yeah, I
31:08
agree with that. But I hear
31:10
what you're saying, and I mostly agree, but I'm not
31:12
sure I completely agree with you. It makes a
31:14
lot of sense, but I think one of the
31:16
problems you still can run
31:18
into, let's keep to the data engineer
31:20
example, and then we'll extend that
31:22
in a second. When someone
31:25
comes to you and wants the definition
31:27
and you create that definition of some
31:30
piece of data and give it to
31:32
them, ideally you'd want everyone else to
31:34
use that same definition. Now you
31:36
mentioned someone else is going to come, they want
31:38
something slightly different and so you create a
31:40
second copy of the data definition with a
31:43
slight variation and then a
31:45
third copy with a slight variation and a
31:47
fourth copy with a slight variation, and sooner
31:49
or later you still have a
31:51
semantic layer, but rather than having 20 definitions
31:54
that are scattered throughout your organization, you
31:56
have 20 definitions side by
31:59
side in the semantic
32:01
layer, and they're all different definitions. How
32:03
do you avoid that problem? And doesn't
32:05
AI actually make that problem worse by
32:07
making it easier to create new ones?
32:11
Yeah, that's a good question. I
32:13
think I still believe that's a
32:15
better state of
32:17
things rather than having these 20
32:19
definitions hidden, without any
32:21
understanding whether they've been used or
32:23
not at all. Like once we
32:25
have them in a central place,
32:28
let's say 20 definitions, we
32:30
might understand that maybe
32:32
15 of them are
32:34
really legacy right now. They essentially should
32:37
be deprecated because they're not being used
32:39
and then we can centrally govern them
32:41
and sort of remove them
32:43
from the stack entirely. Once
32:46
we have that central place, we can see the
32:48
lineage and we see, oh, there are no
32:50
charts that have really been powered by these definitions.
32:52
Or maybe there are still charts, but people don't
32:54
use them, so let's go and deprecate those charts. So
32:57
it's a place that helps us to control
32:59
that and evolve. I think we need to
33:01
accept that the definitions are going to change.
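With all definitions in one governed place, the usage audit described here becomes a simple query over the code base. A minimal sketch; the registry structure below is an illustrative assumption, not a real semantic-layer API:

```python
# Hypothetical registry: each metric definition lists the charts that use it.
definitions = {
    "ctr_v1": {"used_by": ["weekly_dashboard", "exec_report"]},
    "ctr_v2": {"used_by": []},           # nothing references it
    "ctr_v3": {"used_by": ["ab_test_chart"]},
    "ctr_v4": {"used_by": []},           # nothing references it
}

# Any definition no chart is powered by is a deprecation candidate.
candidates = sorted(name for name, meta in definitions.items()
                    if not meta["used_by"])
print(candidates)  # ['ctr_v2', 'ctr_v4']
```

The same lineage information could drive the opposite check as well: which charts depend on a definition you are about to change.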
33:04
We just need to build a framework for how we support
33:06
that change. Got it. So
33:08
the simple fact that they're located in one area makes
33:11
change easier, and the
33:13
change is not just
33:16
duplicating now. You can also consolidate a lot
33:18
easier as well because you know which
33:20
ones are being used and who's using
33:23
them. And you could potentially, and this might
33:25
also be a good use for AI as well:
33:27
you've got 30 definitions with
33:30
minor differences between them. 20
33:33
of them are being used. Can we
33:35
consolidate them into five by making certain
33:38
changes in the definitions that might
33:40
be acceptable to the consumers? And
33:43
perhaps the AI can actually help
33:45
make those recommendations, and you can
33:48
adjust the usage models to
33:50
a better definition that's also
33:52
more uniform as well. But you can't do
33:54
that if they're scattered throughout the entire
33:56
code base. You can only do that if they're
33:59
all known, all centralized, and understood
34:01
by one system
34:06
or one entity. Is that a fair statement?
34:08
Yeah, exactly. I think you're spot on. And also
34:10
it's all a single code base, right? It's a
34:12
single framework. So you can refactor it. You
34:14
can think about, okay, how we make these
34:17
definitions more efficient. We can use
34:19
AI, as you mentioned, to help us
34:21
spot, you know, the similarities and
34:23
the differences. So, you know, it can
34:26
help to refactor it. But I
34:28
feel like having it as a code base
34:30
and in the central place that gives a
34:32
lot of downstream benefits. So if
34:34
I were to describe two problems, which do
34:36
you think is the biggest problem with data
34:38
modeling today? And then the follow on question
34:41
is, how can what
34:43
you're doing help with this? Is
34:45
the biggest problem with data modeling, is it
34:48
a lack of cohesive modeling
34:50
where data is hard to understand,
34:52
and some tools don't know what the
34:55
data is or how to make use of it?
34:58
Like for instance, AI, you can't just throw an
35:00
AI algorithm at data without any understanding of that
35:02
data. Or is the model
35:05
meaning lost or
35:08
misunderstood because it's not well documented and
35:11
hence the data is misused or misunderstood
35:13
by some tools? What do you think
35:15
is the bigger problem? I
35:18
think in general,
35:20
data modeling as a concept
35:23
is hard. Maybe it's
35:25
easier for the engineers, data people who have been
35:27
doing that, you know, all their lives,
35:29
for 10, 15 and even more years. But then
35:31
if we bring it
35:33
to the data consumers: what is a measure, what
35:35
is a dimension? All the
35:38
business intelligence tools have been showing all these,
35:40
you know, multidimensional concepts,
35:42
but it's still sometimes hard
35:44
to get an idea of what it is.
35:46
Sometimes a lot of people talk about
35:48
metrics. But really, I don't think
35:50
we even have a universal definition of a metric.
35:52
What is a metric? Is metric a measure? Or
35:55
metric is a measure with a time dimension. And
35:57
if we add a filter to that. That
36:00
is the metric. So there is like,
36:02
I feel like a lot of bag space
36:06
around the data modeling, especially
36:08
between connecting the business concepts
36:11
and the business users and data consumers with
36:13
the data engineers, because in a data engineering
36:15
and a data modeling kind of part,
36:18
it's a little bit more determined. We
36:20
have a lot of different approaches, you
36:22
know, like Kimball, we have
36:25
Data Vault, all of that stuff. So
36:27
it's a little bit more structured. But
36:29
then where the complexity comes in is
36:31
how we translate that structure,
36:34
which is inherently very complex, to the
36:36
data consumers, when they only ask you
36:38
for a metric. And you try
36:40
to explain, oh, but a metric is less
36:42
tangible, right? Like let's think about measures
36:44
and dimensions and all of this. So
36:46
I think that's, that's a
36:48
hard thing. Now, how can we
36:51
make it easier? Documentation,
36:53
I think. Just in general,
36:55
keeping everything up to date and
36:58
documenting it is not a hard problem
37:00
from a sort of brain-power perspective, but
37:02
it's a lot of manual, mundane
37:04
work that no one wants to do. And
37:06
that's an example where like AI can actually
37:09
help us. I think that's a big problem,
37:11
but it's just like something we really need
37:13
to automate, but we never had really good
37:15
tools to automate that. And then
37:17
I think we can go and kind of try
37:20
to solve these many, many problems.
37:23
And that will help us, at the end of
37:25
the day, to bring the
37:29
non-data folks closer to the data, to
37:31
better understand data, because things like better documentation
37:33
will definitely help them to just kind of
37:35
work with the data in a
37:38
better way. So mostly
37:40
what we've been talking about AI here
37:42
now is AI as a
37:45
tool to analyze the
37:47
data model and analyze the data structure
37:50
and the data, essentially automating a lot
37:52
of the things that a data engineer
37:54
does with the data. So
37:56
helping to increase the usefulness of
37:58
the data by giving
38:00
better documentation and better understanding of
38:03
the data that you have available.
38:05
And that's great. What about for
38:07
the customer that's looking to apply
38:09
AI and a
38:12
large language model to analyze their data
38:15
to help create customer-useful
38:20
information based on that data?
38:22
In other words, building with
38:24
large language models that need to
38:26
understand your data, how does
38:26
the semantic data model help that use
38:29
of AI? I think the
38:31
way that modern AI
38:34
transformers, LLMs, they work
38:37
with the data is through the
38:39
code generation. So essentially
38:41
because of the architecture they are built
38:44
on, that's probably the best way
38:46
to do that. So we cannot
38:48
really think about uploading
38:50
a lot of data right into
38:52
AI or into its context because
38:54
the context is limited by
38:57
its architecture. So the best
38:59
way the system can do
39:01
analysis is to break down
39:04
the complex task into multiple sub-tasks
39:07
and then execute the code snippets to
39:09
analyze the data. And then based on
39:11
this, come to
39:13
some conclusion, maybe generate
39:15
a code snippet again, and then arrive
39:18
at the final answer. That's how,
39:20
for example, ChatGPT's
39:22
data analysis works. If you upload an Excel
39:24
file and say, hey, run me some analysis
39:26
on top of it, it will just generate
39:28
a Python script, execute it and give you
39:30
the answer back. So now
39:32
I think it means
39:34
that many, many AI agents,
39:37
they will need to access data in
39:39
cloud data warehouses, in
39:41
lakehouses, and all these places. Now
39:43
the question is how they would be able
39:45
to do that, and the answer is by
39:48
generating and executing SQL. These systems will
39:50
generate a lot of SQL. So I feel
39:52
like a lot of SQL right now is
39:54
being written by humans or being generated by
39:56
business intelligence tools. In the next few years,
39:58
we'll see a lot of SQL being... generated by
40:00
AI agents. Now
40:03
the question is like how we can help them to generate the
40:05
SQL? Because they don't really
40:07
know anything about your data, they don't
40:09
have a context, how do they know what
40:11
columns they have? The simplest approach would be
40:13
like, oh yes, let's just take the DDL of
40:16
your database and, you know,
40:18
give it to the AI agent as context.
40:21
And people have tried that, and I
40:23
think there were some research papers and
40:25
benchmarks comparing different approaches;
40:28
that usually does not give you
40:30
really good, strong accuracy, because the
40:32
columns are cryptic and you
40:35
don't understand the relationships between entities. So
40:39
the way to improve that is
40:41
to give context about the data. It's
40:43
like, what dimensions you have here,
40:45
what measures you have, what
40:48
columns you have in the data, what
40:50
the relationships are between different entities. So
40:52
essentially, give semantics to the AI
40:54
agent. Now if you package
40:56
that as a really good context
40:59
that you can attach to
41:01
your prompt and say, now generate
41:03
SQL, then the AI agent will
41:05
obviously generate much better SQL,
41:07
with very high accuracy. So
41:09
that's how these semantic
41:11
layers can help in that architecture:
41:14
they can be the provider of this context.
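The semantic layer as a context provider can be sketched roughly like this; the model structure and field names below are assumptions for illustration, not Cube's actual API:

```python
# Hypothetical semantic model: measures, dimensions, and joins for one entity.
semantic_model = {
    "entity": "orders",
    "measures": {"revenue": "SUM(amount)", "order_count": "COUNT(*)"},
    "dimensions": {"created_at": "time", "country": "string"},
    "joins": {"customers": "orders.customer_id = customers.id"},
}

def build_context(model: dict) -> str:
    """Render the semantic model as prompt context for SQL generation."""
    lines = [f"Entity: {model['entity']}"]
    lines += [f"Measure {n}: {sql}" for n, sql in model["measures"].items()]
    lines += [f"Dimension {n} ({t})" for n, t in model["dimensions"].items()]
    lines += [f"Join {t} on {cond}" for t, cond in model["joins"].items()]
    return "\n".join(lines)

# Attach this context to the prompt instead of raw DDL.
prompt = build_context(semantic_model) + "\n\nGenerate SQL for: revenue by country"
print(prompt.splitlines()[0])  # Entity: orders
```

Compared with handing the agent a bare `CREATE TABLE` statement, this tells it which columns are measures, which are dimensions, and how entities relate.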
41:16
Yeah, that makes a lot of sense since
41:19
again it all comes down to data understanding,
41:21
right? And the AI has to understand your
41:23
data just like the humans who are doing
41:25
the analysis do. So
41:28
we've been talking a lot about
41:30
the BI use cases for data
41:32
and for the
41:34
most part those are batch
41:38
analysis. Not always, but they're not
41:41
real-time analysis for the most part. Most of
41:43
the types of analysis we've been talking about
41:45
so far are the types
41:47
of analysis that happen after the fact.
41:50
But what about real-time data analysis? How
41:52
useful is a semantic layer? Does a
41:54
semantic layer like Cube
41:57
help, or does it just delay
42:00
the processing to the point that
42:02
makes it unrealistic for real-time analysis?
42:05
How does it work in the scope of
42:07
a real-time analysis? Yeah, so
42:09
I think a few things
42:12
here. First, I think real-time is needed
42:15
very rarely, like true real-time. There
42:17
are use cases where we need
42:20
true real-time, like streaming-level real-time,
42:23
but in my experience in data, it's
42:25
a very rare case. In
42:27
many use cases, we don't need
42:30
real real-time. Now, the other
42:32
thing is, streaming-level real-time is
42:34
extremely expensive. If
42:36
an organization or team decides, oh, we
42:39
need streaming-level real-time, then
42:41
they need to be ready to pay
42:43
for that, because all these technologies that
42:45
help you to process the streaming
42:47
data, they are extremely expensive, and
42:50
your stack is going to cost a lot. And
42:53
the other thing is it's really hard. There
42:55
is no single solution. There's no
42:57
Snowflake for streaming data where you can just say, oh,
42:59
let's just stream everything into the warehouse and then
43:02
write a lot of SQL queries. People
43:04
try to do that, and there are a
43:06
lot of great companies and technologies that started
43:08
to try to address this problem. Probably
43:11
the oldest one was ksqlDB.
43:13
I don't think it's very active anymore,
43:15
but it was an attempt to bring
43:18
SQL to streaming, and
43:20
then there are newer
43:22
ones like Materialize, with interesting ideas around,
43:24
like, can we build a Snowflake-
43:26
level experience but on top of
43:28
streaming data? It's still hard. Everything is
43:30
still in progress. I'm sort of
43:32
bullish that these technologies will help us
43:34
make our life easier, but it's
43:36
still hard. Now, how it all connects
43:38
is that a semantic layer, Cube specifically, is
43:40
built to work on top of a SQL
43:43
back end. So if
43:45
there is a way to run SQL on
43:47
top of the streaming data, for example with
43:50
Materialize or with ksqlDB, you can potentially put
43:52
Cube on top of that, and that's going to
43:54
work. But Cube is not
43:56
designed to work on top of, like,
43:58
Kafka directly or something like
44:00
that; you still need to have a back end. So all
44:04
the streaming architectures we have within our
44:06
community of users, they are all very
44:08
complicated. Got it. Are
44:10
you a SaaS service or are you a
44:12
standalone application? How are you structured and how
44:14
do people engage with you? Great
44:16
question. We have a cloud offering where
44:19
we can have like a shared cloud. So
44:21
essentially it's going to be one VPC
44:24
in a specific region, in a specific cloud
44:26
that we support, and our customers
44:29
can share that VPC, essentially
44:32
a multi-tenant architecture. And then we have
44:34
a dedicated offering where essentially we spin
44:36
up a dedicated VPC instance in
44:39
a specific region that the customer selected, on a
44:41
specific platform. And then we run everything
44:43
in that VPC. And then finally,
44:45
we call it bring your
44:47
own cloud: we can bring everything inside
44:49
the customer's cloud. So it really depends, you
44:52
know, as you can imagine: the first
44:54
option is more for SMBs and
44:56
the mid-market. And then as we go
44:58
to larger enterprises with more compliance and
45:00
more regulations, it's
45:02
a more complicated deployment. That
45:04
makes sense. Who are your competitors?
45:07
So there is
45:10
Google; they bought a company
45:12
called Looker about three, four
45:14
years ago. So Looker was essentially
45:16
a business intelligence tool. They
45:19
bought Looker to sell
45:21
more BigQuery. And also they
45:23
wanted to make Looker
45:26
more like a headless, universal
45:28
semantic layer eventually, because
45:30
Looker has a really strong semantic layer, LookML,
45:32
in it. So they wanted to take
45:34
advantage of that: what
45:37
if we can use
45:39
that semantic layer not only for the Looker UI,
45:41
but across all the other tools? It's still
45:44
TBD, to be honest; it still hasn't happened.
45:46
And we know that modern Google
45:48
is not doing well in terms
45:50
of the acquisitions, right? It's not yet
45:53
clear when it's going to materialize as a
45:55
competitor and if it's going to happen at
45:57
all. We also have a company called
45:59
AtScale; they've been around a
46:01
little longer than Cube, solving the
46:03
same problem. Some of the
46:06
concepts are very much the same. I think
46:08
the difference between Cube and AtScale
46:11
is that we are very
46:13
code-first, and we approached
46:16
this problem with more of an engineering
46:18
philosophy, an engineering-rigor approach:
46:20
it's a code base, you can put
46:22
it under version control, all
46:24
the things you can do there. Whereas
46:26
AtScale is more, you know, a
46:29
visual builder, more of a traditional,
46:31
think BusinessObjects universe style
46:33
of experience. Right. So it sounds
46:35
like it's still a pretty young
46:37
space, though. Is that a fair
46:39
statement? It's a very young
46:41
space. I think it's a very new
46:43
category that is developing very fast and
46:45
I think you know like we'll see
46:47
more and more adoption of it in
46:49
the next several years, but it's
46:51
still a very young space. Right. So
46:54
Artyom is the CEO of Cube,
46:56
a data modeling company focused on
46:58
data semantics. Artyom, thank you for
47:00
joining me today on Software Engineering
47:02
Daily. Thank you for having me.
47:04
It was a great conversation.