Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:00
Managing data and access to data is one of the biggest challenges a company can face. It's common for data to be siloed into independent sources that are difficult to access in a unified and integrated way. One approach to solving this problem is to build a layer on top of the heterogeneous data sources. This layer can serve as an interface for the data and provide governance and access control. Cube is a semantic layer between the data source and data applications. Artyom Keydunov is the founder of Cube, and he joins the show to talk about the approach Cube is taking.
0:33
This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.
1:24
Artyom, welcome to Software Engineering Daily.

Thank you.
1:26
Thank you for having me today. I'm excited about today's conversation.
1:30
Great, great. So let's make sure we're all on the same page to get started. Let's first talk about some fundamental definitions. The term "data silo": when I think of a data silo, what I think of primarily is independent data sources that contain interrelated data, data that's meant to work together. What do you think of that definition? Is it a good definition, or what would you enhance it with?
1:56
Yeah, I think it's a good definition. Essentially, the database or the specific warehouse or data store where the data is located becomes disconnected from other places, and it becomes a silo. That's what people usually think about with data silos, and I think it's a good enough definition. The only way I would enhance it is to also think about metadata and data definition silos. That's an interesting problem, and one we solve at Cube: maybe in your Power BI, or somewhere else in the organization, or in some Python Django app, you don't actually hold the data, but you have a lot of SQL scripts that do analysis of the data. They try to calculate some metrics, and those become metric silos, or data definition silos, because you calculate something to show data to, say, a customer or a partner, or internally, and people get some idea out of that data. Maybe the data was correct, but the definition was siloed. So that's an interesting enhancement to the idea of data silos: it may be not only the data itself, but a data definition or metric definition silo.
3:11
Got it. Yeah. So it's not just the data, it's how the data is used, how it's defined, and the meaning of the data as well. That's actually a great extension to that definition. So why are data silos a critical issue for data modeling in general, or data management and data usage in general?
3:32
Yeah. If you zoom out, I think the whole purpose of having data is to help the business drive decisions, right? We want to be data-driven as an organization. People at all levels, execs, management, and individual contributors, all want to make their day-to-day decisions and operations data-driven, so they need to have access to data. And what's happening is that we try to create more and more touch points for people with data, but naturally, by creating those, we also create silos as a side effect. We move some data closer to marketing, and they start to use it, but then it becomes a silo. And by silo I mean, as we just defined, something disconnected, whether the data itself or the definition itself. The problem is that it becomes really hard to keep them in sync. So you end up in a situation where the organization becomes data-driven, they do work with data, they have access to the data, but is this data correct? Does it show the real number? Does it stay in sync? If your company decided to change a specific metric, maybe five out of your seven silos have been updated, but the rest haven't, and now some of your departments are looking at the old definition. If I try to generalize it, and this is a software engineering podcast, I would call it a repetition problem. We have the DRY idea, "don't repeat yourself," in software engineering. Essentially, what's happening is that we repeat data, or repeat data definitions, in many places, and then we need to keep them in sync. And as engineers, we know that's just bad, right? It's really, really hard to repeat things and then keep those things in sync.
5:23
So go into an example, just for those who aren't following. You've got data that's collected via an engineering department, and then marketing wants to make use of some of that data. So they take part of that data and bring it into one of their systems. Now the data is duplicated into a different system. They take that data, perhaps enhance it with some other information they have, reformat or restructure it or reanalyze it in a different way, and make different use of it for different purposes. Even though the fundamental data has common roots, the data itself is now out of sync, because there are differences and different interpretations. Not only is the data itself a little different, but the interpretation of the data is considerably different. And that's the sort of out-of-sync, non-DRY, if you will, problem that you're talking about here. Is that correct?
6:17
Yeah, exactly. That's exactly it.
6:19
So you've created something that's called a semantic layer on top of data silos. I think that's actually what you're calling it as well. So tell me a little bit about what a semantic layer looks like on top of siloed data.
6:34
Right, yeah. If we think about the semantic layer, many of us have worked with different types of semantic layers every time we've worked with a BI tool. Essentially, the semantic layer has usually been a part of the BI. In every BI tool, we can drag and drop measures and dimensions to build charts. Every time you work with these high-level, business-level metric definitions, you work with the semantic layer. What the semantic layer does is take these definitions and translate them into SQL queries; it knows the underlying database and warehouse structure. So the semantic layer is a bridge between the metrics the business utilizes and the underlying data structure.
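That translation step can be sketched in a few lines of Python. This is a toy illustration of the general idea, not Cube's actual implementation; the model, table, and column names are all invented.

```python
# Toy semantic layer: it maps business-level measure/dimension names
# onto the underlying warehouse schema and generates SQL.
# All names here are invented for illustration.

MODEL = {
    "measures": {
        "total_revenue": "SUM(orders.amount)",
        "order_count":   "COUNT(*)",
    },
    "dimensions": {
        "order_month": "DATE_TRUNC('month', orders.created_at)",
    },
    "table": "orders",
}

def to_sql(measures, dimensions):
    """Translate a business-level query into SQL against the model."""
    select = [f"{MODEL['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{MODEL['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {MODEL['table']}"
    if dimensions:
        cols = ", ".join(MODEL["dimensions"][d] for d in dimensions)
        sql += f" GROUP BY {cols}"
    return sql

print(to_sql(["total_revenue"], ["order_month"]))
```

The consumer only ever names "total_revenue" and "order_month"; the layer knows which table and expressions those map to.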
7:21
Now, the problem is that every BI has its own semantic layer.
7:26
So just to make sure everyone's on the same page: by BI, you're talking about the business intelligence tools that make use of the data, whether that's a marketing use or some other use?
7:34
Right, right, exactly. So now, if an organization has 10 business intelligence, data visualization, data consumption tools, the semantic layer is in every one of those tools, which creates silos; it creates data definition silos. Now, that's a problem for all the reasons we just talked about: it's not DRY, and it's going to get out of sync. The solution to that is a universal semantic layer. That's what we're building at Cube: the idea is to take this piece of your stack where you have repetition, specifically the semantic layer, and put it into a single, universal place, from which you can reuse the definitions across all your data visualization, data consumption, and business intelligence tools. So now it makes your system, your data architecture, DRY at scale.
8:30
Okay, so you didn't do anything with the data itself. The data may or may not be DRY; it probably is at some level, hopefully. But for the definitions, you've created a DRY understanding of what the various pieces of data mean and how to interpret them, and created one standard for how to use that data.
8:51
Yes, yes, exactly. So Cube becomes a universal semantic layer; Cube becomes an interface to your data that holds all the definitions in one place. It knows about the underlying data silos, it knows the underlying data storages, it knows about potential issues with the data, but none of those issues need to be exposed to the consumers. The consumers only communicate with Cube, and Cube communicates with all the underlying sources. So Cube becomes an interface to the data for the data consumers, like a facade; there's the facade pattern in software engineering, and Cube is essentially a facade over the data for its consumers.
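The facade comparison can be sketched as follows. This is a hypothetical illustration of the pattern, not Cube's architecture; the store classes and routing table are invented.

```python
# Sketch of the facade idea: consumers talk to one object that hides
# which underlying store actually answers each request.
# Store classes and the routing table are invented examples.

class SnowflakeStore:
    def query(self, metric):
        return f"snowflake-result:{metric}"

class PostgresStore:
    def query(self, metric):
        return f"postgres-result:{metric}"

class SemanticLayerFacade:
    """Single interface to the data; consumers never see the silos."""
    def __init__(self):
        self.routes = {"revenue": SnowflakeStore(), "signups": PostgresStore()}

    def get(self, metric):
        # The consumer names a metric; the facade picks the source.
        return self.routes[metric].query(metric)

layer = SemanticLayerFacade()
print(layer.get("revenue"))
```

A consumer calling `layer.get("revenue")` has no idea, and no need to know, which warehouse answered.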
9:36
This episode of Software Engineering Daily is brought to you by HookDeck. Building event-driven applications just got significantly easier with HookDeck, your go-to event gateway for managing webhooks and asynchronous messaging between first- and third-party APIs and services. With HookDeck, you can receive, transform, and filter webhooks from third-party services and throttle the delivery to your own infrastructure. You can securely send webhooks triggered from your own platform to your customers' endpoints, ingest events at scale from IoT devices or SDKs, and use HookDeck as your asynchronous API infrastructure. No matter your use case, HookDeck is built to support your full software development lifecycle. Use the HookDeck CLI to receive events on your localhost. Automate dev, staging, and prod environment creation using the HookDeck API or Terraform provider. And gain full visibility of all events using the HookDeck logging and metrics in the HookDeck dashboard. Start building reliable and scalable event-driven applications today. Visit hookdeck.com/SEDaily and sign up to get a three-month trial of the HookDeck team plan for free.
10:51
So maybe it will help if we go into a specific example. Let's assume we have an e-commerce store; that's a great example that everyone loves to use. An e-commerce store is going to be collecting data from multiple places and for multiple purposes. It's going to be collecting data about website hits and clicks and all that sort of stuff. It's going to be collecting data about advertisements and the effectiveness of those advertisements, and who's driving what traffic to the site. Then there's going to be data from cart adds and cart deletes and checkouts, which ultimately turns into order data. And then shipment information from a warehouse, and that's going to be a different set of data. And there are 20 or 30 other pieces of data that we haven't talked about, but those are basically the different sources and types of data we're talking about. Using that example, walk through what a semantic layer might do and who might be able to take advantage of it.
11:56
Right. So imagine you collect all this data into, say, a warehouse like Snowflake. In our example, we can keep it simple and have one single warehouse. You collect all your data into that warehouse. Maybe you use an ETL tool like Fivetran to ETL some Stripe data and your Shopify orders, and then you enhance that data with some analytics coming from your websites through Segment. By the end of the day, all the data arrives in Snowflake. Now, in your organization, you have Tableau, you have Power BI, you have Excel, and you also need to display data to customers through dashboards. So you start building different metrics. Say you want to build an average order value. You define that in a Tableau workbook, you define that in Power BI, and then you define it in some SQL script that's powering your customer-facing analytics. At some point, you probably want to change that, because the business evolves, the definitions of the metrics evolve, or you need to fix some discrepancies in the definitions. So you go and redefine it, maybe in Tableau, but you forget to redefine it in Power BI, and maybe in some of the charts in the customer-facing analytics, because you have 100 SQL queries powering different charts. The more metrics you need to change, the bigger the problem becomes. At some point, they all become out of sync.
13:31
So the solution to that would be: do we really need to define them in the visualization layer in the first place? What if the visualization layer were thin, very dumb from a logic perspective, there just to render things, and went to a universal semantic layer for the definitions? Does average order value really need to be defined in Tableau? Tableau can just go and say: hey, Cube, give me average order value. So without a semantic layer, your Tableau goes directly to Snowflake. With a semantic layer, Tableau goes to Cube, and Cube goes to Snowflake. Cube becomes the hub that receives all the queries, translates them using the real definitions of the data, goes to Snowflake to query the data, and sends the results back to Tableau. So it's like a universal proxy or gateway to the data for all the tools, and it holds all the definitions. And by having it, you can make the visualization layer very thin, without any logic attached to it.
14:36
So if you need an average order value, you create an average order value piece of data, if you will, that's calculated, and that calculation occurs within the semantic layer.
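The "define the calculation once, in the layer" idea might look like this in miniature. A hedged sketch: the order records and field names are invented, and a real semantic layer would generate SQL against the warehouse rather than compute in application code.

```python
# Sketch: "average order value" is defined once, in the layer, so
# Tableau, Power BI, and customer-facing dashboards all get the same
# calculation. Records and field names are invented.

ORDERS = [
    {"id": 1, "amount": 40.0},
    {"id": 2, "amount": 60.0},
    {"id": 3, "amount": 50.0},
]

def average_order_value(orders):
    """The single shared definition: total revenue / order count."""
    return sum(o["amount"] for o in orders) / len(orders)

# Every consumer calls the same definition instead of re-deriving it:
for tool in ("tableau", "power_bi", "customer_dashboard"):
    print(tool, average_order_value(ORDERS))
```

If the definition ever changes, it changes in one place, and every tool picks it up.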
14:57
WorkOS is a modern identity platform built for B2B SaaS. It provides seamless APIs for authentication, user identity, and complex enterprise features like SSO and SCIM provisioning. It's a drop-in replacement for Auth0 and supports up to 1 million monthly active users for free. It's perfect for B2B SaaS companies frustrated with the high costs, opaque pricing, and lack of enterprise capabilities supported by legacy auth vendors. The APIs are flexible and easy to use, designed to provide an effortless experience from your first user all the way to your largest enterprise customer. Today, hundreds of high-growth scaleups are already powered by WorkOS, including ones you probably know, like Vercel, Webflow, and Loom. Check out workos.com/SED to learn more.
15:49
So there are obviously DRY-code advantages to this. Are there performance advantages as well?
15:56
Yes. There are a couple of additional benefits: performance, and unified governance and access control. On performance: once you define everything in the universal semantic layer, it means you're going to query everything through the semantic layer, so the semantic layer becomes a place where caching starts to make sense, since it's the single place all the requests go through. Once your whole system goes through the universal semantic layer, you can do caching there. Cube specifically has a few different implementations and caching strategies that can help with that, but the idea is that because you query through the semantic layer, it's an ideal place to cache, so there are definitely a lot of opportunities to improve performance there. And the same idea applies to access control. Access control also tends not to be centralized; it lives in the different business intelligence and visualization tools. But if you have a single place that you query your data through, that's an opportunity to centralize access control. So those are two additional benefits that a semantic layer architecture can provide as well.
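The caching argument, that a single query path creates one natural cache point, can be sketched like this. A toy illustration, not Cube's caching implementation; the warehouse stub and cache policy are invented.

```python
# Sketch of why the semantic layer is a natural cache point: every
# consumer's request funnels through one place, so identical queries
# can be answered from cache. The "warehouse" here is a counting stub.

calls = {"warehouse": 0}

def run_in_warehouse(sql):
    calls["warehouse"] += 1          # stands in for an expensive query
    return f"rows-for:{sql}"

cache = {}

def query(measures, dimensions):
    # Normalize the request so equivalent queries share a cache key.
    key = (tuple(sorted(measures)), tuple(sorted(dimensions)))
    if key not in cache:
        cache[key] = run_in_warehouse(str(key))
    return cache[key]

query(["revenue"], ["month"])
query(["revenue"], ["month"])        # second call is served from cache
print(calls["warehouse"])            # the warehouse was only hit once
```

If each BI tool queried the warehouse directly, none of them could share this cache; funneling through one layer is what makes it possible.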
17:06
Okay, okay. And you mentioned universal governance, which I know is tied to access control, but when I think of universal governance, I think of a lot more than that too. I'm assuming there are other benefits, like regulatory compliance and things like that. Do you want to talk about that a little bit?
17:23
Yeah. Once you have all the metric definitions and dimensions in place in the semantic layer, you can classify them and say: this is PII data, this is data under that compliance regime. You can also manage the owners of the data, so you have different groups and teams responsible for making sure the data is up to date. You can apply the whole set of governance features that you would expect.
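Attaching governance metadata to centralized definitions might look like this in miniature. A hypothetical sketch: the catalog fields, tags, and role rules are invented, not Cube's governance model.

```python
# Sketch: once definitions live in one place, governance metadata
# (classification, ownership, access rules) can attach to them and be
# enforced at query time. All fields and roles are invented.

CATALOG = {
    "email":       {"sql": "users.email", "tags": {"pii"}, "owner": "data-eng"},
    "order_count": {"sql": "COUNT(*)",    "tags": set(),   "owner": "data-eng"},
}

ROLE_CAN_SEE_PII = {"finance": True, "marketing": False}

def accessible_fields(role):
    """Return only the fields this role is allowed to query."""
    return [
        name for name, meta in CATALOG.items()
        if "pii" not in meta["tags"] or ROLE_CAN_SEE_PII.get(role, False)
    ]

print(accessible_fields("marketing"))   # PII fields filtered out
print(accessible_fields("finance"))
```

Because every query passes through the layer, this filter is enforced rather than merely documented.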
17:51
I think the difference between a semantic-layer-first architecture and a more classical governance architecture is that the semantic layer is active, meaning that you make all your queries directly through the semantic layer. In a traditional governance approach, the governance system usually sits on top of your stack, and a little bit to the side, meaning that your Tableau queries Snowflake directly, but you have a governance platform that describes the data and talks about the data. There's no strong way to enforce anything, right? Because it sits on the side.

You can't filter when you're not inline.
18:29
Exactly, exactly, yeah.

Okay, so governance is a useful case for this as well. And governance changes: a lot of governance rules change on a regular basis, and having to go through and change the rules in multiple tools can be problematic, and error-prone for that matter. This allows you to change them in a single location. Am I characterizing that correctly?
18:51
Yeah, exactly. I think you're spot on. It's the same problem as with the data model, right? You sprinkle all these definitions across all the tools, and then it's not centralized, it's not DRY. The same idea applies to the data model and to governance. In fact, I believe that governance should be closely connected to the data model, because it's all about that: what kind of metric is this? What kind of people can access that metric or that specific dimension? Is it a PII dimension? And so on and so forth.
19:24
Yeah, yeah. In some ways, governance is just part of the metadata associated with the data, and you need to tie it to the data. That's usually not the way it's done, but a layer like this can certainly enforce those sorts of restrictions.
19:47
Is your code getting dragged down by joins and long query times? The problem might be your database. Try simplifying the complex with graph. A graph database lets you model data the way it looks in the real world, instead of forcing it into rows and columns. Stop asking relational databases to do more than they were made for. Graphs work well for use cases with lots of data connections, like supply chain, fraud detection, real-time analytics, and Gen AI. With Neo4j, you can code in your favorite programming language and against any driver. Plus, it's easy to integrate into your tech stack. People are solving some of the world's biggest problems with graphs. Now it's your turn. Visit neo4j.com/developer to get started. That's neo4j.com/developer.
20:38
So in the example that we gave, the e-commerce example, we were collecting data from multiple sources, essentially ETLing it all into a single data warehouse and then putting the semantic layer on top of that. But there actually are scenarios where there isn't a universal, single data warehouse; there's data from multiple sources that is disjoint and isn't uniform. First of all, can you deal with that? I'm assuming you can with Cube. But do you recommend centralizing your data before you put a semantic layer on top, or are there advantages to leaving the data decentralized and putting a semantic layer on top?
21:16
Yeah, great question. The semantic layer, and Cube specifically, can work both ways: on centralized data, and on data located in different places and in different styles. The way Cube works is that it can dynamically connect to different data warehouses based on the data model definition. So when you design your data model, you can say: this part of the data comes from Snowflake, that part of the data comes from, maybe, BigQuery. And that's how we access the data, and how we go between data sources if needed.
21:51
I think there are different approaches to modeling data. Obviously there's the cloud data warehouse way; then you have something like a lakehouse, which to some extent is very close to the data warehouse architecture; and then you have a full zero-ETL approach, where you just keep the data in the places where it is and use some federation engine like Trino to access it. From a semantic layer standpoint, we fit into any of these. I don't think we position the semantic layer as a federation engine specifically right now, so we would rather rely on something like Trino if you really need to do heavy federation. We can federate data, we can join across multiple data sources, but if you need to go deep into more complicated use cases, where you need to push down some compute to the source and then bring it back and massage the data, there are engines like Trino that are really good at that. But at the end of the day, we can work with both: fully ETLed data, or data that is not ETLed and needs to be federated.
23:04
In terms of the advantages, I think there are a lot of advantages to having the data in a single place, but at the same time, there's always the cost of moving data. If the cost of moving data is really high, for a variety of reasons, then it's probably not worth it. But if there's an opportunity to centralize data in a single warehouse or lakehouse architecture, that feels like the preferable solution.
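The multi-source data model described here, where each part of the model declares which warehouse it lives in, can be sketched as follows. A toy illustration under invented names; a real federation engine like Trino does far more than this grouping step.

```python
# Sketch of a multi-source data model: each part of the model declares
# its source, and the layer routes (or federates) accordingly.
# Sources and model entries are invented examples.

DATA_MODEL = {
    "orders":   {"source": "snowflake", "sql_table": "prod.orders"},
    "sessions": {"source": "bigquery",  "sql_table": "analytics.sessions"},
}

def plan(cubes):
    """Group requested cubes by source; >1 source implies federation."""
    by_source = {}
    for cube in cubes:
        src = DATA_MODEL[cube]["source"]
        by_source.setdefault(src, []).append(cube)
    return {
        "queries": by_source,
        "needs_federation": len(by_source) > 1,
    }

print(plan(["orders"]))                 # single-source query
print(plan(["orders", "sessions"]))     # cross-source join needed
```

The consumer's query never mentions Snowflake or BigQuery; the routing decision lives entirely in the model.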
23:25
That makes sense. I think one of the things that I see when we talk about low-ETL data processing, where you have multiple sources and you leave the data in multiple sources, is that interpretation of the data can be an issue. As a simple example, you have data coming from, let's say, your Shopify order processing, to keep using our example here. Engineering is looking at that data, marketing is looking at that data, finance is looking at that data, and they all think it looks different than it really does, because they make assumptions. Every single one of them makes assumptions about what the data means that may or may not be true. Finance may assume every transaction ends up... I'm kind of drawing a blank on specific examples here. But what I'm trying to get at is that when you look at data from a given source, different consumers of that data can make different assumptions about what that data actually means or contains, assumptions that may or may not be true. And so you get different interpretations. Tell me how a semantic layer helps remove that problem.
24:36
Yeah, good question. At the end of the day, we're talking about trust in data. Do our data consumers trust the data? When they go and consume data in Tableau, do they trust it? I feel like the way to solve it is to provide as much context as possible to the data consumer, so they understand where the data is coming from, how it's being processed, and how it's being calculated. But at the same time, the problem is that all these steps are very technical, right? We have a lot of pipelines, transformation code, and data modeling code, essentially telling you: yeah, you can look at it, but you can only understand it if you can read code. I think the problem here is how to build the bridge that tells data consumers how the data is calculated.
25:28
Imagine an organization, an e-commerce store, that has a data engineer, and marketing is looking at the data, finance is looking at the data, and they don't trust the data. So they come to the data engineer and ask: hey, how did you calculate that? If the engineer can explain in more detail how the calculation is done, that creates a higher level of trust in that data. So the question is: is it possible to optimize that? I feel like that's connected to governance, or to what is mostly a catalog solution for the data. That's something we think a lot about at Cube, because we've become the place where all the data model definitions live today. Is there an opportunity to surface that knowledge to the data consumers, in a way that lets them learn about the data, what kind of data they have, the lineage of the data, the definitions, and then take that knowledge, use the data, and trust the data?
26:28
That makes sense. You know, I think one of the problems that I run into a lot when I do these sorts of data analysis is that you've got a definition of what you want to try to get out of your data, and how you get that data changes the results. So, a simple example: I'm trying to get the average click-through rate for social-media-engaged sessions coming into my site. I made that up, but something like that. Well, what does the average for that mean? Let's look over 90 days. Okay, so which 90 days? The last quarter, or the last 90 days, or the last 90 days minus today, or the last 90 days plus yesterday? You see what I'm saying: the definition of how I get what seems like a very well-described piece of information is different, and different people have different interpretations. What ends up happening is that different people come up with radically different answers, and in fact, how you collect the data can actually cause the data to be more or less usable as well. Tell me how a semantic layer like this can help with that problem.
27:36
a perfect example
27:38
of a problem that we're trying to
27:40
solve with a semantic layer because Because
27:43
even a single definition can have
27:45
a lot of variations, right? Like
27:48
as in your example average Click-through
27:51
rate for different people. It may mean
27:53
different things and then if
27:55
you let you know like
27:58
this definition to sprinkle across
28:00
organization to be located in different places. Anyway,
28:02
we will have a lot of different definitions,
28:04
right? And everyone would like create their own
28:07
like definitions here, like on a sideways, or
28:09
no one is like looking at and then
28:11
kind of use that definition in some presentation
28:14
that will end up at the board level
28:16
and the board level would be like, what
28:18
the number is that? Problem is that
28:21
we need a centralized governance
28:23
of the metrics. And like
28:26
people can have different metrics. That's
28:28
totally fine. We just need a
28:30
framework to be able to develop
28:32
them and then document
28:35
and then just share with the rest of
28:37
the organization. So if someone would come to, you
28:40
know, like to me as a data engineer and say, like, I want
28:42
this average click through rate, I will
28:44
ask a lot of follow up questions of how exactly
28:46
we want to do this, right? But do you
28:49
want it to calculate it this way? Look at the 90 days.
28:51
Are we talking about the rolling? Are we
28:54
talking about any specific filters and all of
28:56
that? So through that conversation, we will come
28:58
up with some definition that we mutually agree,
29:00
right? The person who wants the metrics
29:03
and I'm as a data engineer and I can say,
29:05
okay, we have the data that
29:07
we collect in a specific way that we can
29:10
support you. Calculations that we
29:12
can give you that metric. Now I
29:14
would go into my semantic layer. I will create
29:16
that as code. Essentially it's
29:18
all kind of a code base at the end of
29:20
the day. You put this definition in a code
29:23
base and then you have this
29:25
metric. And now that metric is going
29:27
to be available in a table or
29:29
other places. Now I do
29:31
need to document it. Of course, I need to create
29:34
a really good description of how it's
29:36
being calculated. So the other
29:38
person who comes to that metric can
29:40
read this definition and understand, okay, this
29:43
is an average click-through rate, and that's
29:45
exactly how it was calculated. And maybe they
29:47
need another one with a different type of
29:50
calculation. They come to me and say, hey,
29:52
I need to make a change. That's
29:55
fine. And we can make the change and
29:57
then we can create a second one.
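As a rough illustration of keeping a metric definition like this in a code base, here is a minimal Python sketch; the class and field names are hypothetical, not Cube's actual modeling syntax:

```python
from dataclasses import dataclass, field

# Hypothetical metric-definition record kept under version control;
# the fields mirror the follow-up questions discussed above.
@dataclass
class MetricDefinition:
    name: str
    sql: str                  # how the metric is computed
    description: str          # documents the mutually agreed meaning
    window_days: int = 90     # e.g. look at the last 90 days
    rolling: bool = True
    filters: list = field(default_factory=list)

avg_ctr = MetricDefinition(
    name="avg_click_through_rate",
    sql="SUM(clicks) / NULLIF(SUM(impressions), 0)",
    description="Average CTR over a rolling 90-day window, excluding bot traffic.",
    filters=["is_bot = false"],
)

print(avg_ctr.name, avg_ctr.window_days)
```

A second, slightly different definition would simply be another record alongside this one, with its own description.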
30:00
I think there is a potential challenge in that solution. That's
30:02
sort of the state of today, right? This
30:04
is how all users use Cube and
30:07
how they leverage the whole stack.
30:09
It might create a little bit of work
30:12
for the data engineer to make
30:14
sure that everything is defined and documented.
30:17
I think that's where I believe
30:19
that AI can help us. We
30:21
haven't talked about AI yet, but I felt like
30:23
we should have, right? It's just like this. Yeah,
30:26
right. I think
30:28
what I think we can do is,
30:30
as long as we keep everything as a code
30:33
base, AI is really good
30:35
at writing code. I think
30:37
that's the best use case for
30:39
the modern day AI is just to
30:41
generate code and generate text
30:43
descriptions. So essentially we can use AI:
30:46
if you need a new metric, maybe
30:48
you don't need to go to the data engineer, you
30:50
go to AI, and AI can go and
30:52
take that metric definition, generate the code to
30:54
create this metric definition and then send a
30:57
pull request, and then the data engineer
30:59
only needs to review that. Your
31:01
point about AI, I think it's illegal in
31:03
today's society not to talk about AI when
31:05
you're talking about data, so yeah, I
31:08
agree with that. But I hear
31:10
what you're saying, and I mostly agree, but I'm not
31:12
sure I completely agree with you. It makes a
31:14
lot of sense, but I think one of the
31:16
problems you still can run
31:18
into, let's keep to the data engineer
31:20
example, and then we'll extend that
31:22
in a second. When someone
31:25
comes to you and wants the definition
31:27
and you create that definition of some
31:30
piece of data and give it to
31:32
them, ideally you'd want everyone else to
31:34
use that same definition. Now you
31:36
mentioned someone else is going to come, they want
31:38
something slightly different and so you create a
31:40
second copy of the data definition with a
31:43
slight variation and then a
31:45
third copy with a slight variation and a
31:47
fourth copy with a slight variation, and sooner
31:49
or later you still have a
31:51
semantic layer, but rather than having 20 definitions
31:54
that are scattered throughout your organization, you
31:56
have 20 definitions side by
31:59
side in the semantic
32:01
layer, and they're all different definitions. How
32:03
do you avoid that problem? And doesn't
32:05
AI actually make that problem worse by
32:07
making it easier to create new ones?
32:11
Yeah, that's a good question. I
32:13
think I still believe that's a
32:15
better state of
32:17
things rather than having these 20
32:19
definitions hidden, without any
32:21
understanding whether they've been used or
32:23
not at all. Like once we
32:25
have them in a central place,
32:28
let's say 20 definitions, we
32:30
might understand that maybe
32:32
15 of them are
32:34
really legacy right now. They essentially should
32:37
be deprecated because they're not being used
32:39
and then we can centrally govern them
32:41
and sort of remove them
32:43
from the stack entirely. Once
32:46
we have that central place, we can see the
32:48
lineage and we see, oh, there are no
32:50
charts that have really been powered by these definitions.
32:52
Or maybe there are still charts, but people don't
32:54
use them, so let's go and deprecate those charts. So
32:57
it's a place that helps us to control
32:59
that and evolve. I think we need to
33:01
accept that the definitions are going to change.
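With all definitions in one governed place, the usage audit described here becomes a simple query over the code base. A minimal sketch; the registry structure below is an illustrative assumption, not a real semantic-layer API:

```python
# Hypothetical registry: each metric definition lists the charts that use it.
definitions = {
    "ctr_v1": {"used_by": ["weekly_dashboard", "exec_report"]},
    "ctr_v2": {"used_by": []},           # nothing references it
    "ctr_v3": {"used_by": ["ab_test_chart"]},
    "ctr_v4": {"used_by": []},           # nothing references it
}

# Any definition no chart is powered by is a deprecation candidate.
candidates = sorted(name for name, meta in definitions.items()
                    if not meta["used_by"])
print(candidates)  # ['ctr_v2', 'ctr_v4']
```

The same lineage information could drive the opposite check as well: which charts depend on a definition you are about to change.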
33:04
We just need to build a framework for how we support
33:06
that change. Got it. So
33:08
the simple fact that they're located in one area makes
33:11
change easier, and the
33:13
change is not just
33:16
duplicating now. You can also consolidate a lot
33:18
easier as well because you know which
33:20
ones are being used and who's using
33:23
them. And you could potentially, and this might
33:25
also be a good use for AI as well:
33:27
you've got 30 definitions with
33:30
minor differences between them. 20
33:33
of them are being used. Can we
33:35
consolidate them into five by making certain
33:38
changes in the definitions that might
33:40
be acceptable to the consumers? And
33:43
perhaps the AI can actually help
33:45
make those recommendations, and you can
33:48
adjust the usage models to
33:50
a better definition that's also
33:52
more uniform as well. But you can't do
33:54
that if they're scattered throughout the entire
33:56
code base. You can only do that if they're
33:59
all known, all centralized, and understood
34:01
by one system
34:06
or one entity. Is that a fair statement?
34:08
Yeah, exactly. I think you're spot on. And also
34:10
it's all a single code base, right? It's a
34:12
single framework. So you can refactor it. You
34:14
can think about, okay, how we make these
34:17
definitions more efficient. We can use
34:19
AI, as you mentioned, to help us
34:21
spot, you know, the similarities and
34:23
the differences. So, you know, it can
34:26
help to refactor it. But I
34:28
feel like having it as a code base
34:30
and in the central place that gives a
34:32
lot of downstream benefits. So if
34:34
I were to describe two problems, which do
34:36
you think is the biggest problem with data
34:38
modeling today? And then the follow on question
34:41
is, how can what
34:43
you're doing help with this? Is
34:45
the biggest problem with data modeling, is it
34:48
a lack of cohesive modeling
34:50
where data is hard to understand,
34:52
and some tools don't know what the
34:55
data is or how to make use of it?
34:58
Like for instance, AI, you can't just throw an
35:00
AI algorithm at data without any understanding of that
35:02
data. Or is the model
35:05
meaning lost or
35:08
misunderstood because it's not well documented and
35:11
hence the data is misused or misunderstood
35:13
by some tools? What do you think
35:15
is the bigger problem? I
35:18
think in general,
35:20
data modeling as a concept
35:23
is hard. Maybe it's
35:25
easier for the engineers, data people who have been
35:27
doing that, you know, all their lives,
35:29
for 10, 15 and even more years. But then
35:31
if we bring it
35:33
to the data consumers: what is a measure, what
35:35
is a dimension? All the
35:38
business intelligence tools have been showing all these,
35:40
you know, multidimensional concepts,
35:42
but it's still sometimes hard
35:44
to get an idea of what it is.
35:46
Sometimes a lot of people talk about
35:48
metrics. But really, I don't think
35:50
we even have a universal definition of a metric.
35:52
What is a metric? Is metric a measure? Or
35:55
metric is a measure with a time dimension. And
35:57
if we add a filter to that. That
36:00
is the metric. So there is like,
36:02
I feel like a lot of bag space
36:06
around the data modeling, especially
36:08
between connecting the business concepts
36:11
and the business users and data consumers with
36:13
the data engineers, because in a data engineering
36:15
and a data modeling kind of part,
36:18
it's a little bit more determined. We
36:20
have a lot of different approaches, you
36:22
know, like Kimball, we have
36:25
Data Vault, all of that stuff. So
36:27
it's a little bit more structured. But
36:29
then where the complexity comes in is
36:31
how we translate that structure,
36:34
which is inherently very complex, to the
36:36
data consumers, when they only ask you
36:38
for a metric. And you try
36:40
to explain, oh, but a metric is less
36:42
tangible, right? Like let's think about measures
36:44
and dimensions and all of this. So
36:46
I think that's, that's a
36:48
hard thing. Now, how can we
36:51
make it easier? Documentation,
36:53
I think. Just in general,
36:55
keeping everything up to date and
36:58
documenting it is not a hard problem
37:00
from a sort of brain-power perspective, but
37:02
it's a lot of manual, mundane
37:04
work that no one wants to do. And
37:06
that's an example where like AI can actually
37:09
help us. I think that's a big problem,
37:11
but it's just like something we really need
37:13
to automate, but we never had really good
37:15
tools to automate that. And then
37:17
I think we can go and kind of try
37:20
to solve these many, many problems.
37:23
And that will help us, at the end of
37:25
the day, to bring the
37:29
non-data folks closer to the data, to
37:31
better understand data, because things like better documentation
37:33
will definitely help them to just kind of
37:35
work with the data in a
37:38
better way. So mostly
37:40
what we've been talking about AI here
37:42
now is AI as a
37:45
tool to analyze the
37:47
data model and analyze the data structure
37:50
and the data, essentially automating a lot
37:52
of the things that a data engineer
37:54
does with the data. So
37:56
helping to increase the usefulness of
37:58
the data by giving
38:00
better documentation and better understanding of
38:03
the data that you have available.
38:05
And that's great. What about for
38:07
the customer that's looking to apply
38:09
AI and a
38:12
large language model to analyze their data
38:15
to help create customer-useful
38:20
information based on that data?
38:22
In other words, building with
38:24
large language models that need to
38:26
understand your data, how does
38:26
the semantic data model help that use
38:29
of AI? I think the
38:31
way that modern AI
38:34
transformers, LLMs, they work
38:37
with the data is through the
38:39
code generation. So essentially
38:41
because of the architecture they are built
38:44
on, that's probably the best way
38:46
to do that. So we cannot
38:48
really think about uploading
38:50
a lot of data right into
38:52
AI or into its context because
38:54
the context is limited by
38:57
its architecture. So the best
38:59
way the system can do
39:01
analysis is to break down
39:04
the complex task into multiple sub-tasks
39:07
and then execute the code snippets to
39:09
analyze the data. And then based on
39:11
this, come to
39:13
some conclusion, maybe generate
39:15
a code snippet again, and then arrive
39:18
at the final answer. That's how,
39:20
for example, ChatGPT's
39:22
data analysis works. If you upload an Excel
39:24
file and say, hey, run me some analysis
39:26
on top of it, it will just generate
39:28
a Python script, execute it and give you
39:30
the answer back. So now
39:32
I think it means
39:34
that many, many AI agents,
39:37
they will need to access data in
39:39
cloud data warehouses, in
39:41
lakehouses, and all these places. Now
39:43
the question is how they would be able
39:45
to do that, and the answer is by
39:48
generating and executing SQL. These systems will
39:50
generate a lot of SQL. So I feel
39:52
like a lot of SQL right now is
39:54
being written by humans or being generated by
39:56
business intelligence tools. In the next few years,
39:58
we'll see a lot of SQL being... generated by
40:00
AI agents. Now
40:03
the question is like how we can help them to generate the
40:05
SQL? Because they don't really
40:07
know anything about your data, they don't
40:09
have a context, how do they know what
40:11
columns they have? The simplest approach would be
40:13
like, oh yes, let's just take the DDL of
40:16
your database and, you know,
40:18
give it to the AI agent as context.
40:21
And people have tried that, and I
40:23
think there were some research papers and
40:25
benchmarks comparing different approaches;
40:28
that usually does not give you
40:30
really good, strong accuracy, because the
40:32
columns are cryptic and you
40:35
don't understand the relationships between entities. So
40:39
the way to improve that is
40:41
to give context about the data. It's
40:43
like, what dimensions you have here,
40:45
what measures you have, what
40:48
columns you have in the data, what
40:50
the relationships are between different entities. So
40:52
essentially, give semantics to the AI
40:54
agent. Now if you package
40:56
that as a really good context
40:59
that you can attach to
41:01
your prompt and say, now generate
41:03
SQL, then the AI agent will
41:05
obviously generate much better SQL,
41:07
with very high accuracy. So
41:09
that's how these semantic
41:11
layers can help in that architecture:
41:14
they can be the provider of this context.
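The semantic layer as a context provider can be sketched roughly like this; the model structure and field names below are assumptions for illustration, not Cube's actual API:

```python
# Hypothetical semantic model: measures, dimensions, and joins for one entity.
semantic_model = {
    "entity": "orders",
    "measures": {"revenue": "SUM(amount)", "order_count": "COUNT(*)"},
    "dimensions": {"created_at": "time", "country": "string"},
    "joins": {"customers": "orders.customer_id = customers.id"},
}

def build_context(model: dict) -> str:
    """Render the semantic model as prompt context for SQL generation."""
    lines = [f"Entity: {model['entity']}"]
    lines += [f"Measure {n}: {sql}" for n, sql in model["measures"].items()]
    lines += [f"Dimension {n} ({t})" for n, t in model["dimensions"].items()]
    lines += [f"Join {t} on {cond}" for t, cond in model["joins"].items()]
    return "\n".join(lines)

# Attach this context to the prompt instead of raw DDL.
prompt = build_context(semantic_model) + "\n\nGenerate SQL for: revenue by country"
print(prompt.splitlines()[0])  # Entity: orders
```

Compared with handing the agent a bare `CREATE TABLE` statement, this tells it which columns are measures, which are dimensions, and how entities relate.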
41:16
Yeah, that makes a lot of sense since
41:19
again it all comes down to data understanding,
41:21
right? And the AI has to understand your
41:23
data just like the humans who are doing
41:25
the analysis do. So
41:28
we've been talking a lot about
41:30
the BI use cases for data
41:32
and for the
41:34
most part those are batch
41:38
analysis. Not always, but they're not
41:41
real-time analysis for the most part. Most of
41:43
the types of analysis we've been talking about
41:45
so far are the types
41:47
of analysis that happen after the fact.
41:50
But what about real-time data analysis? How
41:52
useful is a semantic layer? Does a
41:54
semantic layer like Cube
41:57
help, or does it just delay
42:00
the processing to the point that
42:02
makes it unrealistic for real-time analysis?
42:05
How does it work in the scope of
42:07
a real-time analysis? Yeah, so
42:09
I think a few things
42:12
here. First, I think real-time is needed
42:15
very rarely, like true real-time. There
42:17
are use cases where we need
42:20
true real-time, like streaming-level real-time,
42:23
but in my experience in data, it's
42:25
a very rare case. In
42:27
many use cases, we don't need
42:30
real real-time. Now, the other
42:32
thing is, streaming-level real-time is
42:34
extremely expensive. If
42:36
an organization or team decides, oh, we
42:39
need streaming-level real-time, then
42:41
they need to be ready to pay
42:43
for that, because all these technologies that
42:45
help you to process the streaming
42:47
data, they are extremely expensive, and
42:50
your stack is going to cost a lot. And
42:53
the other thing is it's really hard. There
42:55
is no single solution. There's no
42:57
Snowflake for streaming data where you can just say, oh,
42:59
let's just stream everything into the warehouse and then
43:02
write a lot of SQL queries. People
43:04
try to do that, and there are a
43:06
lot of great companies and technologies that started
43:08
to try to address this problem. Probably
43:11
the oldest one was ksqlDB.
43:13
I don't think it's very active anymore,
43:15
but it was an attempt to bring
43:18
SQL to streaming, and
43:20
then there are newer
43:22
ones like Materialize, with interesting ideas around,
43:24
like, can we build a Snowflake-
43:26
level experience but on top of
43:28
streaming data? It's still hard. Everything is
43:30
still in progress. I'm sort of
43:32
bullish that these technologies will help us
43:34
make our life easier, but it's
43:36
still hard. Now, how it all connects
43:38
is that a semantic layer, Cube specifically, is
43:40
built to work on top of a SQL
43:43
back end. So if
43:45
there is a way to run SQL on
43:47
top of the streaming data, for example with
43:50
Materialize or with ksqlDB, you can potentially put
43:52
Cube on top of that, and that's going to
43:54
work. But Cube is not
43:56
designed to work on top of, like,
43:58
Kafka directly or something like
44:00
that; you still need to have a back end. So all
44:04
the streaming architectures we have within our
44:06
community of users, they are all very
44:08
complicated. Got it. Are
44:10
you a SaaS service or are you a
44:12
standalone application? How are you structured and how
44:14
do people engage with you? Great
44:16
question. We have a cloud offering where
44:19
we can have like a shared cloud. So
44:21
essentially it's going to be one VPC
44:24
in a specific region, in a specific cloud
44:26
that we support, and our customers
44:29
can share that VPC, essentially
44:32
a multi-tenant architecture. And then we have
44:34
a dedicated offering where essentially we spin
44:36
up a dedicated VPC instance in
44:39
a specific region that the customer selected, on a
44:41
specific platform. And then we run everything
44:43
in that VPC. And then finally,
44:45
we call it bring your
44:47
own cloud: we can bring everything inside
44:49
the customer's cloud. So it really depends, you
44:52
know, as you can imagine: the first
44:54
option is more for SMBs and
44:56
the mid-market. And then as we go
44:58
to larger enterprises with more compliance and
45:00
more regulations, it's
45:02
a more complicated deployment. That
45:04
makes sense. Who are your competitors?
45:07
So there is
45:10
Google; they bought a company
45:12
called Looker about three, four
45:14
years ago. So Looker was essentially
45:16
a business intelligence tool. They
45:19
bought Looker to sell
45:21
more BigQuery. And also they
45:23
wanted to make Looker
45:26
more like a headless, universal
45:28
semantic layer eventually, because
45:30
Looker has a really strong semantic layer, LookML,
45:32
in it. So they wanted to take
45:34
advantage of that: what
45:37
if we can use
45:39
that semantic layer not only for the Looker UI,
45:41
but across all the other tools? It's still
45:44
TBD, to be honest; it still hasn't happened.
45:46
And we know that modern Google
45:48
is not doing well in terms
45:50
of the acquisitions, right? It's not yet
45:53
clear when it's going to materialize as a
45:55
competitor and if it's going to happen at
45:57
all. We also have a company called
45:59
AtScale; they've been around a
46:01
little longer than Cube, solving the
46:03
same problem. Some of the
46:06
concepts are very much the same. I think
46:08
the difference between Cube and AtScale
46:11
is that we are very
46:13
code-first, and we approached
46:16
this problem with more of an engineering
46:18
philosophy, an engineering-rigor approach:
46:20
it's a code base, you can put
46:22
it under version control, all
46:24
the things you can do there. Whereas
46:26
AtScale is more, you know, a
46:29
visual builder, more of a traditional,
46:31
think BusinessObjects universe style
46:33
of experience. Right. So it sounds
46:35
like it's still a pretty young
46:37
space, though. Is that a fair
46:39
statement? It's a very young
46:41
space. I think it's a very new
46:43
category that is developing very fast and
46:45
I think you know like we'll see
46:47
more and more adoption of it in
46:49
the next several years, but it's
46:51
still a very young space. Right. So
46:54
Artyom is the CEO of Cube,
46:56
a data modeling company focused on
46:58
data semantics. Artyom, thank you for
47:00
joining me today on Software Engineering
47:02
Daily. Thank you for having me.
47:04
It was a great conversation.