Episode Transcript
0:11
Hello,
0:11
and welcome to the Data Engineering Podcast,
0:13
the show about modern data management.
0:17
Introducing RudderStack Profiles. RudderStack
0:19
Profiles takes the SaaS guesswork and SQL
0:21
grunt work out of building complete customer profiles
0:24
so you can quickly ship actionable, enriched
0:26
data to every downstream team. You
0:29
specify the customer traits, then Profiles
0:31
runs the joins and computations for you to create
0:34
complete customer profiles. Get
0:36
all of the details and try the new product today
0:38
at DataEngineeringPodcast.com slash RudderStack.
0:42
You shouldn't have to throw away the database to build
0:44
with fast-changing data. You should be able
0:46
to keep the familiarity of SQL and the proven
0:49
architecture of cloud warehouses, but swap
0:51
the decades-old batch computation model for
0:53
an efficient incremental engine to get complex
0:55
queries that are always up to date.
0:57
With Materialize, you can. It's the
1:00
only true SQL streaming database built
1:02
from the ground up to meet the needs of modern data
1:04
products. Whether it's real-time
1:06
dashboarding and analytics, personalization
1:08
and segmentation, or automation and alerting,
1:11
Materialize gives you the ability to work with fresh,
1:13
correct, and scalable results, all in
1:15
a familiar SQL interface. Go to
1:17
DataEngineeringPodcast.com slash
1:19
Materialize today to get two weeks free.
1:22
Your host is Tobias Macey, and today
1:24
I'm interviewing Tanya Bragin about her views
1:26
on the database products market. So, Tanya,
1:28
can you start by introducing yourself? Thank
1:30
you, Tobias, and it's great to be on the show. So
1:33
as you mentioned, my name is Tanya Bragin. I've been
1:35
in the data space for roughly
1:38
a decade and a half now. My beginnings
1:40
were really coming into the space more from a consulting
1:43
perspective. I was a student of computer
1:45
science and I worked for Deloitte and then went back
1:47
to grad school. And kind of how I got into the data space
1:49
is I was looking for my next job out
1:52
of grad school. And the advice
1:54
I got was, you know, go and interview for product
1:56
management jobs. And I happened to land at
1:58
a startup in the Seattle area called
1:59
ExtraHop Networks. And this was my first
2:02
data startup. It was specifically
2:04
in the networking kind of niche, but I learned a lot
2:06
about building analytics for
2:09
large amounts of data. And from there, I went on
2:11
to Elastic, the company behind Elasticsearch.
2:14
And this is really where I would say
2:16
the majority of my experience in a
2:18
data space has formed. And
2:20
in the past couple of years, I moved on to a company
2:22
called Clickhouse, which is another
2:24
company similarly to Elastic focused
2:27
on data analytics.
2:28
And you mentioned a bit about your history. Do
2:30
you remember where in that journey
2:32
you first started working in the data
2:35
space and what it is about it that made
2:37
you want to keep going in that trajectory?
2:39
Yeah, so at ExtraHop, I didn't
2:41
think of myself as really working in a data space because
2:43
we were building a solution specifically for network
2:46
engineers. But of course, a big aspect
2:48
of it was capturing all this networking data.
2:50
And we actually had a custom database
2:52
that we built specifically to run on these network
2:55
appliances. This was in the era when really
2:57
a lot of companies still were on premise and
2:59
how they captured network data was in these big appliances.
3:02
And to run efficiently inside that appliance, ExtraHop
3:04
built a custom database. And I knew of course, a lot
3:07
about it, but it wasn't something that we sold to the
3:09
general market. With Elastic, things are very different.
3:11
Elastic was one of the first, I would say,
3:13
really popular analytical databases
3:16
that was open source and just widely
3:18
adopted first for search and then for logging. And
3:20
that's when I really sort of got very interested
3:23
in the aspect of what a database, simply
3:25
just a database can enable in terms of use cases.
3:27
Because the kind of use cases Elastic enabled were
3:30
really, really broad and wide. And this is also
3:32
where I really just started enjoying working
3:34
with open source technologies and communities. For me,
3:36
this was a big just revelation
3:39
of how much you can learn from just somebody picking
3:41
up your product and using it for something unexpected.
3:44
And that was a large reason for why I joined
3:46
Clickhouse. This is also an open source database
3:49
and growing in popularity, primarily
3:51
due to the open source distribution. And as
3:53
somebody working on the product
3:56
side of a database vendor,
3:59
what are some of the
3:59
aspects of the database
4:02
market and the technology that you're
4:04
focused on, and what are the pieces
4:09
of the technology and the ecosystem
4:12
that are most relevant to
4:14
your specific role and the types of
4:16
end users that you're interacting
4:18
with to get feedback on the product?
4:20
So as you kind of pointed out, I think even
4:23
just by asking this question, database in
4:25
the end is simply infrastructure. It enables
4:27
storing data. In the end, what
4:29
users want to do with it is enable real world
4:32
use cases, something that they're building, an application
4:34
that they're building. And those are the things that I really
4:36
look at. What are people building? Why
4:39
are they building it? Why does this specific technology
4:41
and not that one become a lever for
4:43
them to build it faster and better? And
4:46
why does this sometimes just
4:48
cause a completely new technology
4:51
to come to market. But at the time, the interesting
4:53
part was search, right? This
4:55
was in the era when websites
4:58
were still kind of new to having search as an
5:00
experience on their website. Of course, now we're all
5:02
very used to having a search bar. If you come to a website
5:05
and there's no search bar, you would be like, this is nuts.
5:08
Everybody must have a search bar. But when Elasticsearch became
5:10
popular, it wasn't yet the case. And
5:12
so the explosion of interest in building search
5:15
technologies, or search experiences rather,
5:17
enabled by search technologies, is what really
5:19
caused Elasticsearch to appear as a
5:22
really prominent player there. And for
5:24
me, I continue to watch new applications.
5:26
To me, what's really interesting is what
5:28
is the next trend? What is the next application
5:31
that everyone is going to build? And what will
5:33
they need for that? Because that's what ultimately
5:35
a database technology enables.
5:36
And going from Elastic
5:39
to Clickhouse, they're very
5:41
different engines, very different
5:43
target use cases. I'm sure that there's some
5:46
overlap in terms of the ways
5:48
that they're being applied. I'm wondering what
5:50
are some of the aspects of your learnings
5:52
from your time at Elastic that you've been able to bring
5:55
into Clickhouse to help inform some
5:57
of the product direction that you want
5:59
to drive towards?
5:59
Yeah, it's interesting that you say that Elastic and Clickhouse
6:02
are different. They're actually very similar in
6:04
many ways. Elastic started off as,
6:07
again, known as primarily the search technology.
6:09
So the main data structure that it used
6:11
was an inverted index to get a bunch of documents
6:14
indexed for very fast search. But then very
6:16
quickly, it added a columnar store to enable
6:18
analytics. And why? It's because
6:20
a search bar usually then results
6:23
in an experience of then looking at the
6:25
actual results that are brought back
6:27
and analyzing them. So it made sense to pair
6:29
this inverted index with a columnar store for analytics.
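The pairing she describes, an inverted index for fast search plus a columnar store for analyzing the matching results, can be sketched in a few lines of Python. This is an illustrative toy, not Elastic's actual implementation; the documents and field names are invented.

```python
from collections import defaultdict

# Toy document set: searchable text plus a numeric field to aggregate.
docs = [
    {"id": 0, "text": "error connecting to database", "latency_ms": 120},
    {"id": 1, "text": "database query succeeded", "latency_ms": 15},
    {"id": 2, "text": "error parsing request", "latency_ms": 40},
]

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc in docs:
    for term in doc["text"].split():
        index[term].add(doc["id"])

# Columnar store: one array per field, aligned by document id, so an
# aggregation touches only the single column it needs.
latency_col = [doc["latency_ms"] for doc in docs]

def search_avg_latency(term):
    """Search via the inverted index, then aggregate one column."""
    ids = index.get(term, set())
    if not ids:
        return None
    return sum(latency_col[i] for i in ids) / len(ids)

print(search_avg_latency("error"))  # 80.0: mean latency of matching docs
```

A real engine adds tokenization, compression, and distribution, but the shape is the same: one structure finds matching document ids, the other aggregates columns over them.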
6:32
And so during my time in Elastic, I was actually
6:34
responsible for what was then called the
6:36
logging product line. We really thought of
6:39
analytics as just analyzing logs. Any
6:41
event was a log. And that's where
6:43
the biggest overlap is with technologies like Clickhouse
6:46
and other OLAP databases. So while Elastic
6:48
didn't call themselves an OLAP database, they were
6:50
absolutely one, and they still are, right?
6:53
They just called themselves a search engine and kind
6:55
of stuck with that. They called everything a search use
6:57
case. But in reality, they had a very, and
6:59
they still have a very popular analytic solution.
7:01
In terms of Clickhouse, I'll get back
7:04
to it, but kind of going to your original question,
7:06
like what aspects of my
7:08
Elastic experience apply now at Clickhouse?
7:10
Again, a lot. Both databases are
7:12
open source. And so what I find is that
7:15
in product management, working with open source
7:17
products versus fully commercial products, it's
7:19
a very different ballgame. In open source,
7:21
you have this community of users that you
7:24
may never meet and you cannot necessarily
7:26
interview. So it's almost like the elements of consumer-oriented
7:29
product management come in. You have to almost
7:32
measure the sentiment in your user base
7:34
as opposed to knowing every
7:35
commercial user of your product.
7:37
You have to look again at adoption trends
7:40
versus buying trends. And it's
7:42
really interesting. Certainly my learning there
7:45
from Elastic mapped very much onto
7:47
my experience currently at Clickhouse. The
7:49
second part that maps very well is working
7:52
for a venture-backed fast-growing
7:54
company. Once you have venture investment,
7:56
it's just a very different ballgame versus, say, bootstrapping
7:59
a company or simply working on an open source
8:01
project that doesn't have that aspect. At
8:03
Elastic, again, this was a really great learning.
8:05
It was just a rocket ship in terms of growth.
8:08
And so learning how to stick
8:10
with the pace of the company growth, how to evolve
8:12
during that time, was something that
8:15
I took forward with me. And the last part
8:17
is leading teams, which I think kind of comes with growth.
8:20
If you work for a fast-growing company, often
8:23
you are in a position to step into a leadership
8:25
role if you wish, certainly there's opportunity.
8:27
And then how do you then bring
8:29
new talent into the company? How do you
8:31
motivate new people to take
8:33
on the challenges that maybe you're doing today?
8:36
Those aspects absolutely map.
8:37
And another interesting aspect
8:40
of this particular area of the
8:42
industry is that databases
8:44
are kind of their own category of product
8:47
where there's a lot of pieces of data infrastructure,
8:49
but the database is typically
8:52
something that requires a certain amount of
8:54
time and diligence before
8:57
just bringing it into an infrastructure
8:59
because it is likely going to outlast
9:02
pretty much every other aspect of the application that's being
9:04
built on top of it because of the
9:06
weight of the data that is stored
9:09
there. And for people who are
9:11
thinking about database technologies,
9:13
how they want to structure their applications,
9:16
can we start by just enumerating the overarching
9:19
categories within the database
9:21
product market as it exists today?
9:23
Yeah, you're absolutely right about
9:26
databases being so sticky, right?
9:28
Like being the center of gravity, almost of the infrastructure.
9:31
So yeah, like where
9:33
to start? So first of all, I would say
9:36
transactional databases are still the workhorse
9:38
of just a typical data workload.
9:41
And why? Because a
9:44
lot of the data is well served
9:46
by transactional databases. And
9:49
this is why Postgres, MySQL,
9:51
also like traditionally the document databases
9:53
that have evolved to have more transaction capabilities
9:56
like MongoDB, those are commonplace. If
9:58
you're picking up a new application,
11:29
trends
12:00
of the industry. You mentioned that when you started at
12:02
Elastic, it was still fairly early on.
12:04
Search was an up and coming experience
12:08
that consumers were starting to grow
12:10
accustomed to and expect. I'm wondering
12:12
what are some of the major trends
12:15
in the industry, both as far as the
12:18
consumer patterns, the ways
12:20
the databases are being incorporated into
12:22
applications and infrastructure that have
12:24
driven the development and growth
12:27
of some of these new and emerging categories,
12:29
particularly for the very niche use cases.
12:32
Yeah. So I think, you
12:34
know, in addition to search, as I mentioned, even
12:36
during the Elasticsearch era, this area
12:38
of analyzing data was already becoming
12:41
big and there's so many sub use cases
12:43
there. And the trend again of needing
12:45
an analytical database for some
12:48
of these interactive applications continues. I'll give
12:50
a couple of examples and actually here I'll start with Clickhouse
12:52
just because again, it's a newer technology driven
12:54
a little bit by some of the newer trends.
12:57
So originally Clickhouse and the name stands
12:59
for Clickstream Data Warehouse was
13:01
developed for a web analytical workload
13:04
basically. So Google Analytics is probably the most
13:06
common example that might come to mind if
13:08
you want to analyze the performance on your website, you
13:10
put something in your website
13:13
like a snippet of JavaScript and that sends events
13:15
back as to who visits your website and
13:17
why and you can go and analyze that
13:19
data. So that kind of data, which
13:22
is append mostly, right, and you
13:25
know, not changing, usually again, it's like a log of data,
13:28
but it comes at a really high rate
13:30
and the results and the kind
13:32
of analysis that you do looks both at
13:34
the most recent data and historical data and
13:37
asks questions of just a few columns of the data.
13:39
So it's a very typical kind of OLAP workload.
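The workload she describes, append-mostly events queried over just a couple of columns across recent and historical data, can be sketched in Python. This is a toy illustration with invented event fields, not actual Clickhouse SQL:

```python
from collections import Counter

# Append-only log of page-view events: (timestamp, page, country).
# New events are only ever appended, never updated in place.
events = [
    (1, "/home", "US"),
    (2, "/pricing", "DE"),
    (3, "/home", "US"),
    (4, "/docs", "US"),
    (5, "/home", "FR"),
]

def page_views_since(ts):
    """Count views per page from `ts` onward, reading just two of the
    three columns; the rest of each row never needs to be touched."""
    return Counter(page for t, page, _ in events if t >= ts)

print(page_views_since(3))  # Counter({'/home': 2, '/docs': 1})
```

The same query shape works over the full history or only the most recent slice, which is why a columnar layout that skips unread columns pays off at high ingest rates.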
13:41
This is the workload for which Clickhouse was originally
13:44
kind of built. But interestingly, like the kind of use
13:46
cases and applications I see now that
13:48
are being built on top of Clickhouse
13:50
and similar technologies are really
13:52
driven by this trend to build, I would say
13:54
productivity tools across all industry
13:57
verticals. So marketing professionals
13:59
as an example.
13:59
More and more tools are being
14:02
built to make marketing professionals
14:04
more effective. And why? Because ad tech
14:06
continues to grow. There's so many things
14:09
that a marketer needs to do today to optimize
14:11
spend in terms of driving leads.
14:14
It is absolutely a data-oriented
14:16
job. There's no way for you to do a good
14:18
job as a marketer without having access to
14:20
data and effective tools on top of that data to make
14:22
decisions. Basically, it's a must. So
14:25
everybody that has an effective marketing department is
14:27
buying these tools and drives development
14:29
of all of these SaaS startups in the marketing
14:32
space.
14:32
Same in the sales space. If you're a seller
14:35
today, in order for you to be
14:37
effective and to have an edge over the competition,
14:40
again, the answer is to use data to really understand
14:42
the trends in your region, to really understand
14:44
some of the view maybe that your marketing colleagues have,
14:47
but with a kind of a lens of a salesperson.
14:49
So again, all of these sales productivity
14:51
startups
14:52
need to analyze a lot of data and they have
14:54
to choose a database to do it at scale and
14:56
also efficiently, because if these are
14:58
SaaS services, it's not just about delivering
15:00
fast results. The database has to be optimized
15:03
for your workload for you to have positive margins.
15:05
And so this is why more specialized
15:07
analytical databases are getting adopted
15:10
for building some of these very data-intensive
15:13
interactive applications that ultimately
15:15
drive ROI for many businesses.
15:18
And I can talk about more applications, but
15:20
I wanted to hit on that because, again, it's really data-intensive
15:24
applications that need interaction,
15:26
that need real-time decision-making.
15:31
This episode is brought to you by Datafold, a
15:33
testing automation platform for data engineers
15:36
that finds data quality issues before the code
15:38
and data are deployed to production. Datafold
15:41
leverages data diffing to compare production
15:43
and development environments and column-level
15:45
lineage to show you the exact impact of every
15:47
code change on data, metrics, and BI
15:49
tools, keeping your team productive
15:52
and stakeholders happy. Datafold
15:54
integrates with dbt, the modern data
15:56
stack, and seamlessly plugs into your data CI
15:59
for team-wide and...
15:59
automated testing.
16:01
If you are migrating to a modern data stack,
16:03
Datafold can also help you automate data and
16:06
code validation to speed up the migration.
16:08
Learn more about Datafold by visiting dataengineeringpodcast.com
16:12
slash datafold today. Absolutely.
16:16
And within the different,
16:18
particularly newer segments
16:20
of the database market, what are the pieces
16:23
that you see growing most rapidly
16:25
or
16:27
at least gaining the most attention and potentially
16:30
leading to accelerated growth?
16:32
Yeah. So again,
16:34
going back to some of the newest trends,
16:37
again, unless you've been under a rock, you've
16:39
heard of OpenAI, you've heard of ChatGPT,
16:42
and you've heard of GenAI applications.
16:44
I think a lot of people are asking themselves right now,
16:47
first of all, how much attention
16:50
should I be paying to this trend? Is this something
16:52
that's going to completely change the way
16:54
I build products in my sector? Or
16:56
is it just incremental? And if it's
16:58
more disruptive, does it mean
17:01
that I need to change the way I build applications?
17:04
What does it mean to consume results
17:06
from a large language model? Do I have
17:08
to actually train one myself? So a lot of
17:10
people are asking those questions. And in
17:13
terms of application building, what's
17:15
becoming really clear is that
17:17
while hosted large language models
17:19
are quite adept, in order to get
17:22
really good results for any particular domain,
17:24
you do have to fine tune
17:25
those results.
17:26
And in order to fine tune those results, at some
17:28
point, you have to, again, if
17:30
you know the space, you'll know the terminology, but you have
17:33
to develop these embeddings based
17:35
on the data that you have and combine that
17:37
with results that are coming back from a
17:40
pre-trained model that maybe you're consuming.
17:43
So there's a question right now of whether
17:45
to build an application that is
17:48
somehow powered by an LLM, that
17:50
you have to have a way to host your own embeddings, or
17:52
can you do this in some other hosted scenario?
17:54
So it becomes kind of a question for a lot of engineers
17:57
and developers out there is, do I need a specialized
17:59
vector store or can I just use Postgres
18:02
and the built-in Postgres kind of vector store,
18:04
is that going to be enough? Same with
18:07
an OLAP database. If you're using
18:09
Clickhouse, the question becomes, well, is Clickhouse
18:11
vector search
18:12
sufficient for my purposes or do I need
18:14
something even more specialized like Pinecone? I
18:16
believe it's still an open question. However,
18:19
if there's anything I've seen kind of in terms of trends
18:21
in technology space in general, it
18:24
is usually toward simplicity
18:26
and consolidation. So I think if it's possible
18:28
for existing databases to build in those
18:30
capabilities in a way that's sufficiently
18:33
performant and resource efficient,
18:35
then it will happen. If it's simply impossible,
18:37
if the architectures are so divergent and
18:40
these workloads are that important, there
18:42
may be
18:42
a third class of databases that gets developed.
18:45
But I think it's an open question. Yeah,
18:47
it's definitely interesting and
18:49
early days for the vector database
18:51
market. And yes, everybody has
18:53
their opinions as to which one is going to
18:55
win out, particularly if you happen to work
18:57
for that vector database vendor.
19:00
For sure.
19:01
And again, the way I see it is
19:03
like, certainly, again, I think transactional
19:06
and analytical databases should be developing
19:08
these capabilities. Because if it's possible for you to serve
19:10
even a fraction of that market, somebody doesn't
19:13
have to get a new database. I'll give you an example for why
19:15
our customers ask for it. So we have customers
19:17
in a fraud analytics space where they're analyzing
19:20
a lot of information in order to make a decision as
19:22
to whether, say, a transaction is fraudulent or
19:24
some behavior is undesirable.
19:26
And they do it based on heuristics. So they have
19:29
an analytical database for that purpose. And
19:31
it was working very well for them. And now they want to augment
19:33
it with a couple of fraud detection
19:35
methods that are maybe reliant on
19:38
LLMs. They don't want to move all of this data. And
19:40
ideally, they don't want to host two databases
19:42
with overlapping data. If possible,
19:45
they just want to host embeddings in Clickhouse and
19:47
combine that with the data they already have in Clickhouse.
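A minimal sketch of that pattern, combining an existing heuristic filter with similarity against stored embeddings in a single query path, might look like this in Python. The transactions, embedding dimensions, and thresholds are all invented for illustration:

```python
import math

# Each transaction carries the analytical columns the customer already
# stores, plus an embedding vector kept alongside them.
transactions = [
    {"id": 1, "amount": 25.0,   "embedding": [0.9, 0.1]},
    {"id": 2, "amount": 9000.0, "embedding": [0.1, 0.95]},
    {"id": 3, "amount": 8200.0, "embedding": [0.2, 0.9]},
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def suspicious(known_fraud_vec, min_amount=1000.0, min_sim=0.8):
    """Heuristic filter (amount) combined with vector similarity to a
    known-fraud embedding: the hybrid query the customer wants to run
    in one database instead of two."""
    return [
        t["id"] for t in transactions
        if t["amount"] >= min_amount
        and cosine(t["embedding"], known_fraud_vec) >= min_sim
    ]

print(suspicious([0.0, 1.0]))  # [2, 3]
```

The point of keeping embeddings next to the analytical data is exactly this: one pass applies both the heuristic predicate and the similarity predicate, with no cross-database join or data movement.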
19:49
So if we can deliver them performance that is sufficient
19:52
for their use case, of course, we will try to do that. Does
19:54
it mean that there's no even
19:57
more advanced use case for which a vector database
19:59
is necessary? No, it doesn't mean that. So it's
20:01
possible that both need to exist, that existing
20:04
databases need to add embeddings
20:06
and vector search capabilities, but still for
20:08
more specialized use cases, you may need a
20:10
dedicated vector database.
20:13
Circling back to
20:15
the stickiness of databases
20:17
as a piece of infrastructure, we've touched
20:19
on a few of the types of questions that teams should be
20:21
thinking about in that selection process,
20:24
but wondering if you can just talk through
20:26
some of the core elements
20:29
of performing proper due diligence
20:31
on this technology selection, some of the
20:34
technology concerns, some of the organizational
20:36
concerns, and just some of the ways
20:38
that teams should be approaching this
20:41
step of identifying, do
20:43
I even need a new database? Do I need a database
20:45
at all? And if so, which is the right
20:48
one for this particular use case?
20:50
Right. I was thinking about this
20:52
question ahead of time. And it's a
20:54
tough challenge, actually, because in order
20:56
to select a database, you have to really understand your
20:58
workload. And sometimes you don't, like you start
21:00
building an application and you don't yet know
21:03
what the shape of your workload is going to look like until
21:05
you've built the app or prototyped the app,
21:07
or really kind of got to a point where real
21:10
world usage is driving certain
21:13
shapes of your workload. You may not know ahead
21:15
of time exactly how many columns
21:17
you're going to have in your data or which column,
21:21
for instance, will end up having a
21:23
certain cardinality of value. So you just simply don't know.
21:26
You could have a hypothesis, but you may not know.
21:28
So one thing I will say, you probably will
21:30
make the wrong decision at some point, like if
21:32
you have a database that simply doesn't scale, the
21:35
question then is how quickly can you migrate
21:38
or move some of that workload to another technology?
21:40
This is why at Clickhouse, we actually do focus specifically
21:43
on making that part of the journey easier. We just
21:45
anticipate that, of course, a lot of existing
21:47
folks who are users at some point will hit a
21:50
scaling limitation and they will need to quickly
21:52
onboard onto Clickhouse. And making that path
21:54
very simple is important. And then as
21:56
far as trying to do it upfront, I guess I
21:59
would say that, yes, just knowing that there
22:01
is even a transactional versus analytical workload
22:03
distinction is important because they are quite different.
22:06
Transactional workloads ultimately are
22:08
more static, right? You have rows, and of course they
22:10
can grow over time, but you're mostly
22:13
updating existing data in place. It's
22:15
a slower, I would say, growing workload,
22:17
whereas analytical workloads are kind
22:20
of more like changes. Imagine you've got a
22:22
more static inventory of
22:24
products. Your analytical workload would be anything
22:26
that has to do with changes in inventory. And of course, that
22:28
data set, kind of time-indexed, is going
22:30
to grow a lot faster. So anything
22:33
that grows really fast because it's really
22:35
more about changes in some other static data
22:37
set, that is an analytical workload. So knowing
22:40
that is the case, I would say from early
22:42
on establishing this pattern where you have both
22:44
a transactional and an analytical database
22:46
is valuable and then kind of basing
22:49
your technology decisions
22:51
on that and kind of anticipating that
22:53
that is the case. I'm also seeing increasingly,
22:57
again, database vendors and database technologies
22:59
anticipating that for users in the first place. So
23:01
there's transactional databases building more
23:04
and more foreign data wrappers for analytical databases
23:06
and even almost helping their users detect
23:08
when they hit some sort of scaling limits in the transactional
23:11
database and saying, okay, like move it to an analytical
23:13
database and we'll still give you ability to kind
23:15
of query
23:16
across both and vice versa. Analytical
23:19
data stores build CDC,
23:21
change data capture, capabilities to very
23:23
quickly detect changes in transactional
23:25
databases and onboard those
23:28
workloads. So hopefully that helps. Like I would say
23:30
just even knowing that transactional versus
23:32
analytical workloads exist already helps
23:34
a lot.
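The transactional-versus-analytical distinction she describes can be made concrete with a small sketch: a table updated in place (the slow-growing current state) next to an append-only change log that an analytical query reads. The product names and quantities are invented:

```python
# Transactional side: current state, updated in place (OLTP-style).
inventory = {"widget": 10, "gadget": 5}

# Analytical side: every change appended, never rewritten (OLAP-style).
changes = []

def sell(product, qty):
    inventory[product] -= qty        # in-place update of current state
    changes.append((product, -qty))  # append-only event about the change

sell("widget", 3)
sell("widget", 2)
sell("gadget", 1)

# The state table stays the same size; the change log keeps growing.
print(inventory["widget"])  # 5
print(len(changes))         # 3

# Analytical query: total units sold per product, from the change log.
sold = {}
for product, delta in changes:
    sold[product] = sold.get(product, 0) - delta
print(sold)  # {'widget': 5, 'gadget': 1}
```

The change log grows with every event while the inventory table only mutates, which is why the two sides end up wanting different storage engines.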
23:35
And another interesting aspect
23:37
of this overall question of which
23:40
engine do I need, particularly
23:42
in that divide between OLTP
23:44
or online transactional processing and online
23:46
analytical processing, is do you
23:49
need both? If so, how do you
23:51
make them work better together? Transactional
23:54
engines have long been the
23:56
solid workhorse of application
23:58
development. They were even
24:00
the engines used for data
24:03
warehousing in the early days of data warehousing
24:05
before we got columnar stores and MPP
24:07
databases. And now that
24:10
we do have columnar stores available and we do
24:12
have MPP databases for being able to
24:14
parallelize that analytics, what do
24:16
you see as the major motivators for
24:19
having that be a separate set
24:21
of technologies, separate pieces of infrastructure
24:24
and some of the inefficiencies and
24:26
complexities that are driven as a result of that?
24:28
It's true, right? I think the only thing
24:30
I can think of is just the size
24:32
of analytical workloads grew,
24:34
you
24:35
know, again, exponentially, or grew
24:38
to such a point where transactional databases
24:40
became just not feasible for
24:42
the type of analysis that people want to do. And
24:45
also the expectations of the type of applications
24:47
you want to build changed. Because I think for a while,
24:49
when it came to analytics, it was sufficient to have
24:52
the kind of experience where you produce a
24:54
report, right, like you analyze something
24:56
and you produce a report and gets emailed to you,
24:58
you know, every day or every week or even
25:00
every month, you know, so like imagine kind of an internal
25:03
workload that is analytics focused,
25:05
like that was just kind of how internal teams
25:07
work for a long time. And that, of course,
25:09
would not work for any sort of, you know,
25:12
SaaS applications where interactive experience is
25:14
required. So I think the revolution actually started
25:16
with the SaaS part, people wanted to build
25:18
more interactive experiences on their websites,
25:21
and that kind of introduced technologies again,
25:23
first like Elasticsearch, you know, many others that
25:26
powered these applications. And now the question
25:28
being asked by internal teams is, why
25:30
shouldn't we adopt the same for internal
25:33
users? Why should they wait for a report?
25:35
Or why should they have the kind of query that you run
25:37
and then kind of go away and come back to in many minutes?
25:40
Those questions are being asked. And so
25:42
what we're seeing now is I think some of the things
25:45
that have made some of these SaaS services
25:47
successful, internal teams are asking themselves, why
25:49
shouldn't our internal users have that experience? Because if
25:52
they don't, they actually will go and try to consume those
25:54
SaaS services, right? And internal
25:56
teams are seeing kind of more and more demands
25:58
for interactive dashboards, interactive applications.
26:01
I would say with internal teams where this started,
26:03
at least in my experience, was on the financial
26:05
side. So financial sector for a while really
26:08
led in terms of just having high
26:10
expectations
26:10
for internal users. If you're
26:12
a trader, at the end of the day, you need to have an interactive
26:14
application that helps you make a decision
26:17
of what bets to place the next day. And
26:19
you can't wait for the following
26:21
day. You need that decision now. So any
26:23
internal stakeholders where they needed
26:26
to consume data very quickly and
26:28
interactively, I think this is what really
26:31
introduced the need for more specialized
26:34
databases and data stores for analytical workloads
26:36
that could support these interactive use
26:38
cases.
26:42
Data projects are notoriously complex.
26:44
With multiple stakeholders to manage across varying
26:47
backgrounds and tool chains, even simple
26:49
reports can become unwieldy to maintain. Miro
26:52
is your single pane of glass where everyone can discover,
26:55
track, and collaborate on your organization's data.
26:58
I especially like the ability to combine your technical
27:01
diagrams with data documentation and
27:03
dependency mapping, allowing your data
27:05
engineers and data consumers to communicate
27:07
seamlessly about your projects. Find
27:10
simplicity in your most complex projects with
27:12
Miro.
27:13
Your first three Miro boards are free when you
27:15
sign up today at dataengineeringpodcast.com
27:18
slash Miro. That's three free boards
27:21
at dataengineeringpodcast.com
27:23
slash M-I-R-O. One
27:27
of the shortcomings that is introduced
27:30
by virtue of splitting out
27:32
the analytical engine for its
27:34
speed of analysis and computation from
27:36
the transactional store that is getting
27:39
the data as it is generated is
27:41
the need for being able to either say, we're
27:43
going to batch this, and this is how long you
27:45
can expect to have data delayed when you're running
27:47
this report. Or you need to bring in something like
27:49
change data capture or some other streaming technology
27:52
to be able to feed the data directly over
27:54
to the analytical system. And a third
27:57
approach that I've seen applied
27:59
in some cases is federation
28:01
of queries where this is where things like Trino,
28:03
Presto come in. I know Clickhouse has
28:06
some support for things like foreign data
28:08
wrappers. I'm wondering what you see
28:10
as the overall trade-off,
28:13
some of the ways that teams should be thinking about how
28:15
best to make the analytical system
28:17
work as closely as possible with the transactional
28:20
store without introducing
28:22
arbitrary breakage when network connections
28:24
fail.
28:25
Yeah, there's several very interesting
28:28
topics here. So on the change
28:30
data capture side, I believe this
28:32
needs to be, again, just a built-in capability of
28:35
analytical databases. Clickhouse
28:37
handles it by... We have this concept
28:39
of a materialized Postgres and materialized
28:41
MySQL engine where we
28:44
basically... Yeah, we can create almost like a
28:46
logical view of your
28:48
MySQL or Postgres database and just query
28:51
it as well as capture changes from
28:55
these databases using these engines
28:57
that basically act as our CDC. I believe it
28:59
just needs to be built in and vice
29:01
versa. OLTP databases should
29:03
have foreign data wrappers for the most popular analytical
29:06
databases that they see kind of in their ecosystem.
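[Editor's note: as a rough sketch of the built-in CDC capability described here, ClickHouse's experimental MaterializedPostgreSQL database engine can mirror a Postgres database; host, database name, and credentials below are placeholders.]

```sql
-- Hedged sketch: replicate a Postgres database into ClickHouse using the
-- (experimental) MaterializedPostgreSQL database engine, which consumes the
-- Postgres WAL and acts as built-in CDC. Connection details are placeholders.
SET allow_experimental_database_materialized_postgresql = 1;

CREATE DATABASE pg_replica
ENGINE = MaterializedPostgreSQL('postgres-host:5432', 'app_db', 'replication_user', 'secret');

-- After the initial sync, replicated tables are queryable directly
-- and kept up to date as changes land in Postgres:
SELECT count() FROM pg_replica.orders;
```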
29:09
But you mentioned object stores
29:11
and kind of the data lake use case. This is another
29:13
really interesting evolution that we're seeing.
29:16
So again, primarily on internal
29:18
analytics side, what we've seen is
29:20
that cloud data warehouses, like
29:23
Snowflake, Redshift, BigQuery, they
29:26
of course have come to prominence in the past, say,
29:28
five years. And their big accomplishment
29:31
was moving all of these on-premise,
29:33
more traditional data warehouse workloads from Teradata,
29:36
Oracle, and so on into the cloud. And
29:38
it's great because now that these workloads are in
29:40
a cloud environment, teams, and again, primarily
29:43
it might be internal teams working on internal
29:46
analytical use cases, are asking themselves, well,
29:48
does it make sense to keep these workloads
29:51
in a monolithic data warehouse? Or
29:53
does it make sense, for instance, to put some of these workloads
29:56
into a data lake and to
29:58
query it using different
29:59
engines. And I
30:02
would say that what I'm seeing is really
30:04
more the trend toward unbundling these
30:06
cloud data warehouses. Again, not every organization
30:09
is bought into it yet, but we're definitely seeing that trend
30:11
in some of the organizations that we work with where
30:14
they're saying, okay, now that we have this
30:16
data in a more open environment, in a cloud
30:19
provider of choice, we can start again, moving
30:21
the pieces where they belong. And the way
30:23
Clickhouse fits into it is it's becoming
30:26
more like a real time engine to
30:28
work on top of data lakes, as well as next
30:30
to data lakes, and helping kind
30:32
of that trend of unbundling what has
30:34
become kind of a monolithic
30:36
version, like of an on
30:38
premise data warehouse, but in the cloud, the cloud data
30:40
warehouse.
30:41
Another element of database
30:43
engines, the ways that they fit into
30:46
in particular analytical use cases
30:48
is that they're not the only
30:50
operator in that space. There's typically
30:53
a complex web of dependencies between
30:55
different systems, data is flowing in
30:57
and flowing out for different use cases.
30:59
And so it can be difficult to understand
31:01
what is actually happening at any moment in time
31:04
when you need to debug something, which brings
31:06
in the question of data observability. And
31:09
that is a whole other market.
31:11
But from the perspective of somebody working
31:14
with teams building database engines,
31:17
what do you see as the role of the database
31:19
itself in cooperating and
31:21
enabling the observability
31:23
aspects from an analytical perspective
31:26
so that people who are operating these infrastructures
31:28
can have more confidence that they're
31:31
looking at the right things, that they understand what's going
31:33
on, and that they can tune the workloads as
31:35
needed?
31:36
As you mentioned, I'm more on
31:38
the side of a database vendor,
31:40
like working with data observability tools. So
31:42
the first thing I will mention is just how
31:45
important data observability
31:46
tools
31:47
are starting to become to stakeholders,
31:50
it does seem like there's been an inflection where it's
31:52
just an expectation. And this is in addition
31:54
to other data management tooling that we see.
31:56
So, you know, data versioning,
31:59
data orchestration. So that tooling, I would
32:01
say, we're seeing a movement where
32:03
it starts to be used, I would say, much earlier
32:06
in the adoption of a data store, especially again,
32:08
for internal analytics, when you've
32:11
got many stakeholders and they
32:13
all need to understand what is the data catalog,
32:16
what is data lineage, like how are changes propagated.
32:19
Even for our own internal data warehouse team,
32:22
you might imagine our commercial
32:25
focus is around our cloud offerings, so our finance
32:27
team just lives and dies by this MRR number,
32:30
monthly recurring revenue. Well, this number gets generated
32:32
from many sources of data and any
32:35
change that may affect
32:37
how this number gets calculated, it's
32:40
critical for us to understand. If there's anything
32:43
that occurs that may taint
32:46
how we view this number, we report
32:48
it to the board, and it's of course reported internally. So
32:51
companies have similar important
32:53
metrics and data
32:55
fields whose integrity they need to
32:58
understand. And so this is driving adoption
32:59
of tools that I've already mentioned, specifically
33:03
data orchestration, data versioning,
33:05
and data observability. The database vendor,
33:08
so what we do to enable
33:10
these tools, and there are tools that integrate
33:12
with Clickhouse. Some of them work, by the way, on
33:14
top of other tools, so for instance, DBT, a
33:17
pretty big player in this space, some
33:19
of them work very natively on top of that.
33:21
What they ask of us is a few things. One
33:24
is really good kind of self observability.
33:26
So every time anything
33:29
changes in the database, it needs to be observable.
33:32
And within Clickhouse, the way it's accomplished is,
33:34
we're a database, where would we put data about
33:36
ourselves? We put it in ourselves. Like when you spin up
33:38
Clickhouse, it has these system tables, as
33:41
we call them. Everything is in there. Any DDL
33:43
statement that you run, any log about
33:46
anything that happens is in our internal
33:48
system tables, you can query it, it's very easy,
33:50
it's right there. And we just happen to be also very efficient
33:53
at storing them. So it's not a big overhead on the database
33:55
itself. But that is what makes it
33:57
very easy for data observability partners
34:00
to integrate with us, there's nothing we have to add for
34:02
them. All the data is there and
34:04
they can query it on day one. And then
34:07
the second part I would say is ability
34:10
to go deeper if need
34:12
be. So there needs to be some ability to turn
34:14
on kind of more advanced tracing and profiling
34:17
if something goes wrong. This is
34:19
where, you know, Clickhouse and other vendors
34:21
are starting to build in open standards
34:23
based ways to kind of self monitor more internals
34:25
of the database. So OpenTelemetry
34:28
is kind of an
34:29
increasingly popular way of monitoring
34:33
specifically say traces within a database
34:35
product, that is something you would turn on optionally
34:38
and use only if needed.
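[Editor's note: a quick illustration of the self-observability described here; the system.query_log table and its columns below are real ClickHouse system tables, though the exact query is only a sketch.]

```sql
-- Sketch: ClickHouse stores data about itself in system tables, so an
-- observability tool can query operational history with plain SQL.
-- For example, the ten most recent completed queries and their timings:
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;
```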
34:40
And then from somebody who's working
34:42
on the product side, dealing
34:45
with people who are trying to
34:47
understand how a given database
34:49
engine fits within their stack and within their
34:51
use case, what are some of the elements of customer
34:54
education that you find yourself coming back to
34:56
the most or areas of
34:59
maybe misunderstanding or misconceptions that people
35:01
have going into the tool selection
35:03
process?
35:04
So that's a really interesting question. And this
35:06
one may surprise you a little bit, because with
35:09
Clickhouse... and again, at Elastic it was a little
35:11
bit different, because we were a search technology and kind
35:13
of our terminology was very search oriented.
35:15
And people came to Elasticsearch with
35:17
an expectation that it was a search engine
35:20
first primarily and then everything else second.
35:22
With Clickhouse, we're mostly
35:24
like ANSI SQL compatible. And like from
35:26
a syntax perspective, for the most part,
35:28
you can kind of take your queries and just kind of port them
35:31
over. We do have some SQL extensions for analytics.
35:34
That's extra. But if you're like coming over from transactional
35:36
world, you might look at Clickhouse and
35:38
say, ah, you know, I just take my workloads there and everything's
35:40
fine. But where things kind of break down a
35:42
little bit, and this is something to pay attention to when adopting
35:45
any new database is in the end, the
35:47
devil's in the details when it comes to specifically
35:49
data organization and semantics.
35:52
So I'll give you one example. We have a
35:54
concept of a primary key, we call this a primary
35:56
key in Clickhouse. What it means in Clickhouse
35:58
is actually the key by which we sort the data.
35:59
And why is that important is because
36:02
for analytical workloads, how you've sorted the
36:04
records based on which key, basically
36:06
the data is organized kind of in order versus not
36:09
has a huge effect on how
36:11
fast you can query columns back for specific
36:13
types of aggregation. So for analytical workloads
36:16
like Clickhouse, the sorting order
36:18
of records basically on disk is very important. So
36:21
we call, when you create a table, we
36:23
say you should use a primary key and that primary
36:25
key should be something by which you will query. And
36:27
that's what we say. But of course,
36:29
in the transactional world, primary key means something completely
36:31
different,
36:31
right? It's all about sort of constraints
36:34
and, you know, so users
36:36
get very confused. They say, like, you look like SQL and
36:38
you walk like SQL, but you have this primary key that means
36:40
something completely different. So I guess for folks
36:42
building databases, my advice is don't
36:45
take terms that mean something else in very
36:47
popular databases and make them mean
36:49
something else entirely in your database. It's going to
36:52
be confusing. For us, I think it's too late
36:54
to unroll that one. But if, like, I was
36:56
the creator of Clickhouse back in the day, I probably
36:58
would have made a different decision on the name
37:00
of, like, primary key. And there's a few other small
37:02
examples like this.
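[Editor's note: a minimal sketch of the naming nuance described here; in ClickHouse's MergeTree engine the ORDER BY clause defines the sort key that doubles as the primary key. Table and column names are illustrative.]

```sql
-- Sketch: in ClickHouse (MergeTree), the "primary key" is the on-disk
-- sort key used for data skipping, not a uniqueness constraint.
CREATE TABLE page_views
(
    site_id     UInt32,
    event_time  DateTime,
    url         String,
    duration_ms UInt32
)
ENGINE = MergeTree
ORDER BY (site_id, event_time);  -- doubles as the primary key; duplicate rows are allowed

-- Queries filtering on the leading key columns can skip most of the data:
SELECT count(), avg(duration_ms)
FROM page_views
WHERE site_id = 42 AND event_time >= now() - INTERVAL 1 DAY;
```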
37:03
Yeah, naming things is hard. Always
37:06
very hard.
37:07
But back to education, like, how do we educate
37:09
users? So, yes, we educate them on some of these
37:11
nuances. But actually, yeah, a lot of the education
37:14
goes into them understanding
37:16
that ultimately, when you're adopting an analytical
37:18
database, there's some thought that has
37:20
to happen. Some thought has to go
37:22
into how you actually organize the workloads,
37:25
because do you really just want to
37:27
take your, like, highly relational
37:29
workload as it is in the transactional world into analytics? Most
37:31
likely not. You could. It would work.
37:33
But actually, this is not how you get the most out of an analytical
37:36
database. You typically will do a little bit
37:38
more flattening of the data, not completely.
37:40
Like, Clickhouse supports joins. But to get the most
37:42
out of your use case, you may do a little
37:44
bit more, again, processing of the data
37:47
before querying it. And this is where Clickhouse
37:49
has a concept of materialized views. We can
37:51
take actually highly, sort of, you know,
37:54
normalized data and then help you, almost like
37:56
using ELT. And this is where DBT becomes important,
37:58
to transform it into something you would actually want
38:01
to query. So that is kind of built in,
38:03
but you have to understand that you have to do that.
38:05
And that's where a lot of the education happens.
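[Editor's note: a hedged sketch of the flattening pattern described here, using a ClickHouse materialized view; all table and column names are illustrative. Note that a materialized view with a join only refreshes on inserts into the left-hand table.]

```sql
-- Sketch: pre-flatten normalized data into an analytics-friendly shape
-- with a materialized view, so interactive queries avoid the join.
CREATE MATERIALIZED VIEW orders_flat
ENGINE = MergeTree
ORDER BY (order_date, customer_id)
POPULATE AS
SELECT
    o.order_date,
    o.customer_id,
    c.country,      -- denormalized from the customers dimension table
    o.amount
FROM orders AS o
INNER JOIN customers AS c ON c.id = o.customer_id;
```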
38:08
And in your experience of
38:10
working in this space, working with
38:12
end users now at Clickhouse,
38:14
also with Elastic, what are some of the most
38:16
interesting or innovative or unexpected ways
38:19
that you've seen people applying database
38:21
technologies, whether specific
38:23
to the tools that you worked on or
38:25
just more generally?
38:26
So with Elastic, there was actually a very interesting
38:29
use case. I remember it struck me, our
38:31
first user conference for Elasticsearch,
38:33
we had somebody from NASA present on
38:36
the Mars Rover use case. And that just blew my mind, right?
38:38
I mean, like the telemetry that was
38:40
created on Mars, right,
38:42
got sent to Earth and put into Elasticsearch.
38:45
And that was just very,
38:47
I don't know, surprising to me that a search technology
38:49
or analytical technology would get adopted in that
38:51
context. It shouldn't surprise me. And in the
38:53
end, from a technical perspective, that workload probably actually
38:56
even wasn't the most challenging because you don't have
38:58
that much bandwidth to transmit that much data. But
39:00
it was just very cool and very exotic.
39:03
Let's just put it that way. You know, for Clickhouse, it
39:05
also, what blows my mind is just the scale at which
39:08
this product can run. As I mentioned, it was developed
39:10
for an internet scale, kind
39:12
of web analytics use case. It can ingest
39:15
billions and trillions of rows. Just today
39:17
we published a case study with Ahrefs,
39:19
which is again, another vendor that does, basically
39:21
crawls the whole internet and stores their data
39:23
in Clickhouse. And it's just amazing the
39:25
scale at which you can run, but it doesn't mean that
39:28
you don't need it at a smaller scale. You still do, right?
39:30
And there's still these inflection points where, you
39:32
know, even for a much smaller dataset, you
39:35
need an analytical database just based on the
39:37
types of queries, interactive experiences you
39:39
can run. And in your own experience
39:42
of working in this space, what are the most interesting
39:44
or unexpected or challenging lessons that you've learned?
39:47
Unexpected lessons for me. I
39:49
think the main one, maybe I
39:51
mentioned in the beginning, which was when you
39:54
transition from commercial to open source
39:56
databases, as a product person, you do
39:58
have to think very differently. And that,
40:00
like how you leverage the community is
40:02
something that you shouldn't underestimate that
40:05
it's a huge, huge value. The community is not
40:07
just a free kind of distribution channel for
40:09
your free users. It's a big channel,
40:11
first of all, for innovation. You just mentioned interesting
40:14
use cases.
40:14
A lot of these users just come from downloading
40:17
the product. Somebody just has an idea and they just
40:19
want to download a product and use it for free to prove
40:21
out their idea. They don't have any budget. Often
40:23
it's a passion project. So these types
40:25
of community users are just gold. And this
40:27
is something that I love about working with open
40:30
source products, that these types of
40:32
individuals and their ideas get nurtured
40:34
by the fact that the technology is free at scale.
40:37
Like this is a difference from a freemium product. A freemium product
40:39
typically is sort of scale limited, whereas an
40:41
open source distribution model in databases,
40:43
which by the way, I think has
40:44
won out. I think it's pretty clear. The
40:47
typical sort of like distribution is
40:49
an at scale solution you can run. So
40:51
that's one thing that was kind of surprising to me. The second
40:53
thing was actually at Elastic
40:56
when we got to kind of be an at scale company,
40:59
we had this kind of fork in the road in terms of how
41:01
do we grow? Like from a platform perspective, we
41:03
were a really popular platform for
41:05
search and certain types of analytics, but how do we grow
41:07
the company? And the direction that the company
41:10
ultimately took was to add more
41:12
vertical solutions based on
41:14
this open platform. And so if you look at Elastic's
41:17
website right now, they talk about observability
41:19
and security and what they call enterprise search. And
41:22
how you kind of do this kind of growth is
41:24
you actually need to build out a solution based
41:26
on this database. You can try to build it organically,
41:29
but typically actually you kind of pursue an acquisition
41:31
strategy. And what was surprising to me was
41:33
with an open source product, when you do M&A,
41:36
when you look at companies that build
41:39
products and solutions, you can actually try
41:41
to find companies that have already built a product based
41:43
on your open database. And
41:45
then the integration costs are very low because
41:47
you just bring in this team, they already know your technology.
41:50
They've already built a solution on your stack,
41:52
on your technology stack. And so then the integration
41:54
play is much faster and that really helped us
41:56
out at Elastic.
41:57
And as you continue
41:59
to iterate on the product that
42:02
you're involved with as you keep an eye on the broader
42:04
database market from a competitive
42:07
standpoint, from an educational standpoint,
42:09
what are some of the predictions that you have
42:12
for the future trends in the database
42:14
market?
42:15
Okay, so a couple of things. We
42:17
talked about OLAP versus OLTP and
42:20
my prediction is that OLAP does continue
42:22
to grow in prominence. Still
42:25
today, I think that
42:26
most users start with OLTP and then
42:28
sort of almost through trial and error arrive
42:31
at needing OLAP. I
42:33
do think that in the course of a
42:35
few years, we'll see more
42:37
of a pattern where you just simply start with both.
42:39
That's one of my predictions. I don't know
42:41
that it's gonna happen this year, but I do believe
42:44
just the amount of investment that's
42:46
happening in the OLAP space, and
42:48
by the way, right now, usually folks
42:50
call it, like, the real-time analytics space, I think
42:53
is going to lead to a lot more
42:55
awareness. And again, that's not only specifically
42:58
Clickhouse. There's so many other technologies in
43:00
the space, but I think generally like the space
43:02
of OLAP and real-time analytics is going to lead
43:04
to developers starting with both. They're
43:06
gonna start with OLTP and OLAP, and this is
43:08
how they just build
43:09
out their product. That's number one.
43:12
My second prediction more on the
43:14
internal team side is this cloud
43:16
data warehouse unbundling trend
43:19
continues. I do think that data
43:21
lakes will continue to rise in prominence
43:23
just because it just makes sense. Like there's so many things
43:25
that make sense about a data lake.
43:28
You have one kind of object store that's powering
43:31
many use cases, and you can leverage different open
43:33
technologies on top of it. Just that pattern makes
43:35
sense to me. This is why it's important for us to
43:37
invest in it. It doesn't mean that you won't have
43:39
some specialized storage because in the end, like
43:41
even with Clickhouse, we work pretty fast
43:43
on top of object stores, say with Parquet
43:45
or Iceberg format, but in the end, our native format
43:48
is even faster. So for some workloads, you
43:50
may still leverage specialized store, but for
43:53
other use cases, you probably don't want to. Like if you
43:55
have a use case where you want both a data
43:57
scientist and an app to have access to the same
43:59
data, why would you duplicate it? Like you'll want
44:01
just to keep it in one place and have two kind of
44:04
analytical engines pointing to it. I think that trend
44:06
is going to continue. And finally, from
44:08
the perspective of where we talked about vector
44:11
stores and Gen AI, I mean, something is going to
44:13
happen. I don't think the hype is going to completely flame
44:15
out and we're just all going to say like this was nothing.
44:17
I think it's going to lead to new applications. I
44:19
don't know that it's going to be quite as disruptive
44:22
as, you know, some people sometimes
44:25
say, I think in the end, it comes back
44:27
to like, what experiences do we want to build? So
44:29
again, say I'm
44:29
building a product for marketing professionals.
44:32
Okay. Like I'm going to leverage large language
44:34
models to again, incorporate more aspects
44:36
of natural language into kind of
44:39
my suggestions, but I don't think
44:41
it's going to be everything. I think there's
44:43
still going to be a lot of domain knowledge that remains
44:46
outside of a large language model.
44:48
And I think that it's going to be kind of a blend
44:50
of approaches.
44:56
Thank you for listening. Don't
44:58
forget to check out our other shows. Podcast.init,
45:00
which covers the Python language, its community,
45:03
and the innovative ways it is being used. And
45:05
the Machine Learning Podcast, which
45:07
helps you go from idea to production with machine
45:09
learning. Visit the site at dataengineeringpodcast.com,
45:13
subscribe to the show, sign up for the mailing
45:15
list and read the show notes. And if you've
45:17
learned something or tried out a product from a show,
45:19
then tell us about it. Email hosts
45:21
at dataengineeringpodcast.com with your
45:24
story. And to help other people
45:26
find the show, please leave a review on Apple podcasts
45:28
and tell your friends
45:29
and family. The
45:34
promise of these tools is pretty
45:36
great,
45:36
but I think it's early days for this tooling.
45:39
And there's a few players, but I think that there's still a lot
45:42
more that these tools can do and flipping
45:44
it more on the side of database vendors.
45:46
I think database vendors need to have more
45:48
built-in observability of the database
45:50
itself.
45:51
So it's easier to build these tools across
45:53
offerings. So that's, I would say
45:55
one of the bigger gaps that I would note.
45:57
Well, thank you very much for taking the time.
45:59
today to join me and share your
46:02
perspective and experience and expertise
46:05
on database product development and
46:07
ways to be thinking about the incorporation
46:09
of databases into applications and infrastructure.
46:12
It's definitely a very interesting problem domain
46:14
and it's great to see the trajectory
46:17
of Clickhouse and so appreciate
46:19
the time and energy that you're putting into that and I hope you enjoy
46:21
the rest of your day. Thank
46:22
you for having me, Tobias.