Episode Transcript
0:11
Hello,
0:11
and welcome to the Data Engineering Podcast,
0:13
the show about modern data management.
0:17
Introducing RudderStack Profiles. RudderStack
0:19
Profiles takes the SaaS guesswork and SQL
0:21
grunt work out of building complete customer profiles
0:24
so you can quickly ship actionable, enriched
0:26
data to every downstream team. You
0:29
specify the customer traits, then Profiles
0:31
runs the joins and computations for you to create
0:34
complete customer profiles. Get
0:36
all of the details and try the new product today
0:38
at DataEngineeringPodcast.com slash RudderStack.
0:42
You shouldn't have to throw away the database to build
0:44
with fast-changing data. You should be able
0:46
to keep the familiarity of SQL and the proven
0:49
architecture of cloud warehouses, but swap
0:51
the decades-old batch computation model for
0:53
an efficient incremental engine to get complex
0:55
queries that are always up to date.
0:57
With Materialize, you can. It's the
1:00
only true SQL streaming database built
1:02
from the ground up to meet the needs of modern data
1:04
products. Whether it's real-time
1:06
dashboarding and analytics, personalization
1:08
and segmentation, or automation and alerting,
1:11
Materialize gives you the ability to work with fresh,
1:13
correct, and scalable results, all in
1:15
a familiar SQL interface. Go to
1:17
DataEngineeringPodcast.com slash
1:19
Materialize today to get two weeks free.
1:22
Your host is Tobias Macey, and today
1:24
I'm interviewing Tanya Bragin about her views
1:26
on the database products market. So, Tanya,
1:28
can you start by introducing yourself? Thank
1:30
you, Tobias, and it's great to be on the show. So
1:33
as you mentioned, my name is Tanya Bragin. I've been
1:35
in the data space for roughly
1:38
a decade and a half now. My beginnings
1:40
were really coming into the space more from a consulting
1:43
perspective. I was a student of computer
1:45
science and I worked for Deloitte and then went back
1:47
to grad school. And kind of how I got into the data space
1:49
is I was looking for my next job out
1:52
of grad school. And the advice
1:54
I got was, you know, go and interview for product
1:56
management jobs. And I happened to land at
1:58
a startup in the Seattle area called
1:59
ExtraHop Networks. And this was my first
2:02
data startup. It was specifically
2:04
in the networking kind of niche, but I learned a lot
2:06
about building analytics for
2:09
large amounts of data. And from there, I went on
2:11
to Elastic, the company behind Elasticsearch.
2:14
And this is really where I would say
2:16
the majority of my experience in a
2:18
data space has formed. And
2:20
in the past couple of years, I moved on to a company
2:22
called Clickhouse, which is another
2:24
company similarly to Elastic focused
2:27
on data analytics.
2:28
And you mentioned a bit about your history. Do
2:30
you remember where in that journey
2:32
you first started working in the data
2:35
space and what it is about it that made
2:37
you want to keep going in that trajectory?
2:39
Yeah, so at ExtraHop, I didn't
2:41
think of myself as really working in a data space because
2:43
we were building a solution specifically for network
2:46
engineers. But of course, a big aspect
2:48
of it was capturing all this networking data.
2:50
And we actually had a custom database
2:52
that we built specifically to run on these network
2:55
appliances. This was in the era when really
2:57
a lot of companies still were on premise and
2:59
how they captured network data was in these big appliances.
3:02
And to run efficiently inside that appliance, ExtraHop
3:04
built a custom database. And I knew of course, a lot
3:07
about it, but it wasn't something that we sold to the
3:09
general market. With Elastic, things are very different.
3:11
Elastic was one of the first, I would say,
3:13
really popular analytical databases
3:16
that was open source and just widely
3:18
adopted first for search and then for logging. And
3:20
that's when I really sort of got very interested
3:23
in the aspect of what a database, simply
3:25
just a database can enable in terms of use cases.
3:27
Because the kind of use cases Elastic enabled were
3:30
really, really broad and wide. And this is also
3:32
where I really just started enjoying working
3:34
with open source technologies and communities. For me,
3:36
this was a big just revelation
3:39
of how much you can learn from just somebody picking
3:41
up your product and using it for something unexpected.
3:44
And that was a large reason for why I joined
3:46
Clickhouse. This is also an open source database
3:49
and growing in popularity, primarily
3:51
due to the open source distribution. And as
3:53
somebody working on the product
3:56
side of a database vendor,
3:59
what are some of the
3:59
aspects of the database
4:02
market and the technology that you're
4:04
focused on, and what are the pieces
4:09
of the technology and the ecosystem
4:12
that are most relevant to
4:14
your specific role and the types of
4:16
end users that you're interacting
4:18
with to get feedback on the product?
4:20
So as you kind of pointed out, I think even
4:23
just by asking this question, database in
4:25
the end is simply infrastructure. It enables
4:27
storing data. In the end, what
4:29
users want to do with it is enable real world
4:32
use cases, something that they're building, an application
4:34
that they're building. And those are the things that I really
4:36
look at. What are people building? Why
4:39
are they building it? Why does this specific technology
4:41
and not that one become a lever for
4:43
them to build it faster and better? And
4:46
why does this sometimes just
4:48
cause a completely new technology
4:51
to come to market. But at the time, the interesting
4:53
part was search, right? This
4:55
was in the era when websites
4:58
were still kind of new to having search as an
5:00
experience on their website. Of course, now we're all
5:02
very used to having a search bar. If you come to a website
5:05
and there's no search bar, you would be like, this is nuts.
5:08
Everybody must have a search bar. But when Elasticsearch became
5:10
popular, it wasn't yet the case. And
5:12
so the explosion of interest in building search
5:15
technologies, or search experiences rather,
5:17
enabled by search technologies, is what really
5:19
caused Elasticsearch to appear as a
5:22
really prominent player there. And for
5:24
me, I continue to watch new applications.
5:26
To me, what's really interesting is what
5:28
is the next trend? What is the next application
5:31
that everyone is going to build? And what will
5:33
they need for that? Because that's what ultimately
5:35
a database technology enables.
5:36
And going from Elastic
5:39
to Clickhouse, they're very
5:41
different engines, very different
5:43
target use cases. I'm sure that there's some
5:46
overlap in terms of the ways
5:48
that they're being applied. I'm wondering what
5:50
are some of the aspects of your learnings
5:52
from your time at Elastic that you've been able to bring
5:55
into Clickhouse to help inform some
5:57
of the product direction that you want
5:59
to drive towards?
5:59
Yeah, it's interesting that you say that Elastic and Clickhouse
6:02
are different. They're actually very similar in
6:04
many ways. Elastic started off as,
6:07
again, known as primarily the search technology.
6:09
So the main data structure that it used
6:11
was an inverted index to get a bunch of documents
6:14
indexed for very fast search. But then very
6:16
quickly, it added a columnar store to enable
6:18
analytics. And why? It's because
6:20
a search bar usually then results
6:23
in an experience of then looking at the
6:25
actual results that are brought back
6:27
and analyzing them. So it made sense to pair
6:29
this inverted index with a columnar store for analytics.
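The pairing she describes, an inverted index for fast search plus a columnar store for analyzing the matching results, can be sketched in a few lines of Python. This is an illustrative toy, not Elastic's actual implementation; the documents and field names are invented.

```python
from collections import defaultdict

# Toy document set: searchable text plus a numeric field to aggregate.
docs = [
    {"id": 0, "text": "error connecting to database", "latency_ms": 120},
    {"id": 1, "text": "database query succeeded", "latency_ms": 15},
    {"id": 2, "text": "error parsing request", "latency_ms": 40},
]

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc in docs:
    for term in doc["text"].split():
        index[term].add(doc["id"])

# Columnar store: one array per field, aligned by document id, so an
# aggregation touches only the single column it needs.
latency_col = [doc["latency_ms"] for doc in docs]

def search_avg_latency(term):
    """Search via the inverted index, then aggregate one column."""
    ids = index.get(term, set())
    if not ids:
        return None
    return sum(latency_col[i] for i in ids) / len(ids)

print(search_avg_latency("error"))  # 80.0: mean latency of matching docs
```

A real engine adds tokenization, compression, and distribution, but the shape is the same: one structure finds matching document ids, the other aggregates columns over them.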
6:32
And so during my time in Elastic, I was actually
6:34
responsible for what was then called the
6:36
logging product line. We really thought of
6:39
analytics as just analyzing logs. Any
6:41
event was a log. And that's where
6:43
the biggest overlap is with technologies like Clickhouse
6:46
and other OLAP databases. So while Elastic
6:48
didn't call themselves an OLAP database, they were
6:50
absolutely one, and they still are, right?
6:53
They just called themselves a search engine and kind
6:55
of stuck with that. They called everything a search use
6:57
case. But in reality, they had a very, and
6:59
they still have a very popular analytic solution.
7:01
In terms of Clickhouse, I'll get back
7:04
to it, but kind of going to your original question,
7:06
like what aspects of my
7:08
Elastic experience apply now at Clickhouse?
7:10
Again, a lot. Both databases are
7:12
open source. And so what I find is that
7:15
in product management, working with open source
7:17
products versus fully commercial products, it's
7:19
a very different ballgame. In open source,
7:21
you have this community of users that you
7:24
may never meet and you cannot necessarily
7:26
interview. So it's almost like the elements of consumer-oriented
7:29
product management come in. You have to almost
7:32
measure the sentiment in your user base
7:34
as opposed to knowing every
7:35
commercial user of your product.
7:37
You have to look again at adoption trends
7:40
versus buying trends. And it's
7:42
really interesting. Certainly my learning there
7:45
from Elastic mapped very much onto
7:47
my experience currently at Clickhouse. The
7:49
second part that maps very well is working
7:52
for a venture-backed fast-growing
7:54
company. Once you have venture investment,
7:56
it's just a very different ballgame versus, say, bootstrapping
7:59
a company or simply working on an open source
8:01
project that doesn't have that aspect. At
8:03
Elastic, again, this was a really great learning.
8:05
It was just a rocket ship in terms of growth.
8:08
And so learning how to stick
8:10
with the pace of the company growth, how to evolve
8:12
during that time, was something that
8:15
I took forward with me. And the last part
8:17
is leading teams, which I think kind of comes with growth.
8:20
If you work for a fast-growing company, often
8:23
you are in a position to step into a leadership
8:25
role if you wish, certainly there's opportunity.
8:27
And then how do you then bring
8:29
new talent into the company? How do you
8:31
motivate new people to take
8:33
on the challenges that maybe you're doing today?
8:36
Those aspects absolutely map.
8:37
And another interesting aspect
8:40
of this particular area of the
8:42
industry is that databases
8:44
are kind of their own category of product
8:47
where there's a lot of pieces of data infrastructure,
8:49
but the database is typically
8:52
something that requires a certain amount of
8:54
time and diligence before
8:57
just bringing it into an infrastructure
8:59
because it is likely going to outlast
9:02
pretty much every other aspect of the application that's being
9:04
built on top of it because of the
9:06
weight of the data that is stored
9:09
there. And for people who are
9:11
thinking about database technologies,
9:13
how they want to structure their applications,
9:16
can we start by just enumerating the overarching
9:19
categories within the database
9:21
product market as it exists today?
9:23
Yeah, you're absolutely right about
9:26
databases being so sticky, right?
9:28
Like being the center of gravity, almost of the infrastructure.
9:31
So yeah, like where
9:33
to start? So first of all, I would say
9:36
transactional databases are still the workhorse
9:38
of just a typical data workload.
9:41
And why? Because a
9:44
lot of the data is well served
9:46
by transactional databases. And
9:49
this is why Postgres, MySQL,
9:51
also like traditionally the document databases
9:53
that have evolved to have more transaction capabilities
9:56
like MongoDB, those are commonplace. If
9:58
you're picking up a new application,
11:29
trends
12:00
of the industry. You mentioned that when you started at
12:02
Elastic, it was still fairly early on.
12:04
Search was an up and coming experience
12:08
that consumers were starting to grow
12:10
accustomed to and expect. I'm wondering
12:12
what are some of the major trends
12:15
in the industry, both as far as the
12:18
consumer patterns, the ways
12:20
the databases are being incorporated into
12:22
applications and infrastructure that have
12:24
driven the development and growth
12:27
of some of these new and emerging categories,
12:29
particularly for the very niche use cases.
12:32
Yeah. So I think, you
12:34
know, in addition to search, as I mentioned, even
12:36
during the Elasticsearch era, this area
12:38
of analyzing data was already becoming
12:41
big and there's so many sub use cases
12:43
there. And the trend again of needing
12:45
an analytical database for some
12:48
of these interactive applications continues. I'll give
12:50
a couple of examples and actually here I'll start with Clickhouse
12:52
just because again, it's a newer technology driven
12:54
a little bit by some of the newer trends.
12:57
So originally Clickhouse and the name stands
12:59
for Clickstream Data Warehouse was
13:01
developed for a web analytical workload
13:04
basically. So Google Analytics is probably the most
13:06
common example that might come to mind if
13:08
you want to analyze the performance on your website, you
13:10
put something in your website
13:13
like a snippet of JavaScript and that sends events
13:15
back as to who visits your website and
13:17
why and you can go and analyze that
13:19
data. So that kind of data, which
13:22
is append mostly, right, and you
13:25
know, not changing, usually again, it's like a log of data,
13:28
but it comes at a really high rate
13:30
and the results and the kind
13:32
of analysis that you do looks both at
13:34
the most recent data and historical data and
13:37
asks questions of just a few columns of the data.
13:39
So it's a very typical kind of OLAP workload.
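The workload she describes, append-mostly events queried over just a couple of columns across recent and historical data, can be sketched in Python. This is a toy illustration with invented event fields, not actual Clickhouse SQL:

```python
from collections import Counter

# Append-only log of page-view events: (timestamp, page, country).
# New events are only ever appended, never updated in place.
events = [
    (1, "/home", "US"),
    (2, "/pricing", "DE"),
    (3, "/home", "US"),
    (4, "/docs", "US"),
    (5, "/home", "FR"),
]

def page_views_since(ts):
    """Count views per page from `ts` onward, reading just two of the
    three columns; the rest of each row never needs to be touched."""
    return Counter(page for t, page, _ in events if t >= ts)

print(page_views_since(3))  # Counter({'/home': 2, '/docs': 1})
```

The same query shape works over the full history or only the most recent slice, which is why a columnar layout that skips unread columns pays off at high ingest rates.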
13:41
This is the workload for which Clickhouse was originally
13:44
kind of built. But interestingly, like the kind of use
13:46
cases and applications I see now that
13:48
are being built on top of Clickhouse
13:50
and similar technologies are really
13:52
driven by this trend to build, I would say
13:54
productivity tools across all industry
13:57
verticals. So marketing professionals
13:59
as an example.
13:59
More and more tools are being
14:02
built to make marketing professionals
14:04
more effective. And why? Because ad tech
14:06
continues to grow. There's so many things
14:09
that a marketer needs to do today to optimize
14:11
spend in terms of driving leads.
14:14
It is absolutely a data-oriented
14:16
job. There's no way for you to do a good
14:18
job as a marketer without having access to
14:20
data and effective tools on top of that data to make
14:22
decisions. Basically, it's a must. So
14:25
everybody that has an effective marketing department is
14:27
buying these tools and drives development
14:29
of all of these SaaS startups in the marketing
14:32
space.
14:32
Same in the sales space. If you're a seller
14:35
today, in order for you to be
14:37
effective and to have an edge over the competition,
14:40
again, the answer is to use data to really understand
14:42
the trends in your region, to really understand
14:44
some of the view maybe that your marketing colleagues have,
14:47
but with a kind of a lens of a salesperson.
14:49
So again, all of these sales productivity
14:51
startups
14:52
need to analyze a lot of data and they have
14:54
to choose a database to do it at scale and
14:56
also efficiently, because if these are
14:58
SaaS services, it's not just about delivering
15:00
fast results. The database has to be optimized
15:03
for your workload for you to have positive margins.
15:05
And so this is why more specialized
15:07
analytical databases are getting adopted
15:10
for building some of these very data-intensive
15:13
interactive applications that ultimately
15:15
drive ROI for many businesses.
15:18
And I can talk about more applications, but
15:20
I wanted to hit on that because, again, it's really data-intensive
15:24
applications that need interaction,
15:26
that need real-time decision-making.
15:31
This episode is brought to you by Datafold, a
15:33
testing automation platform for data engineers
15:36
that finds data quality issues before the code
15:38
and data are deployed to production. Datafold
15:41
leverages data diffing to compare production
15:43
and development environments and column-level
15:45
lineage to show you the exact impact of every
15:47
code change on data, metrics, and BI
15:49
tools, keeping your team productive
15:52
and stakeholders happy. Datafold
15:54
integrates with dbt, the modern data
15:56
stack, and seamlessly plugs into your data CI
15:59
for team-wide and...
15:59
automated testing.
16:01
If you are migrating to a modern data stack,
16:03
Datafold can also help you automate data and
16:06
code validation to speed up the migration.
16:08
Learn more about Datafold by visiting dataengineeringpodcast.com
16:12
slash datafold today. Absolutely.
16:16
And within the different,
16:18
particularly newer segments
16:20
of the database market, what are the pieces
16:23
that you see growing most rapidly
16:25
or
16:27
at least gaining the most attention and potentially
16:30
leading to accelerated growth?
16:32
Yeah. So again,
16:34
going back to some of the newest trends,
16:37
again, unless you've been under a rock, you've
16:39
heard of OpenAI, you've heard of ChatGPT,
16:42
and you've heard of GenAI applications.
16:44
I think a lot of people are asking themselves right now,
16:47
first of all, how much attention
16:50
should I be paying to this trend? Is this something
16:52
that's going to completely change the way
16:54
I build products in my sector? Or
16:56
is it just incremental? And if it's
16:58
more disruptive, does it mean
17:01
that I need to change the way I build applications?
17:04
What does it mean to consume results
17:06
from a large language model? Do I have
17:08
to actually train one myself? So a lot of
17:10
people are asking those questions. And in
17:13
terms of application building, what's
17:15
becoming really clear is that
17:17
while hosted large language models
17:19
are quite adept, in order to get
17:22
really good results for any particular domain,
17:24
you do have to fine tune
17:25
those results.
17:26
And in order to fine tune those results, at some
17:28
point, you have to, again, if
17:30
you know the space, you'll know the terminology, but you have
17:33
to develop these embeddings based
17:35
on the data that you have and combine that
17:37
with results that are coming back from a
17:40
pre-trained model that maybe you're consuming.
17:43
So there's a question right now of whether
17:45
to build an application that is
17:48
somehow powered by an LLM, that
17:50
you have to have a way to host your own embeddings, or
17:52
can you do this in some other hosted scenario?
17:54
So it becomes kind of a question for a lot of engineers
17:57
and developers out there is, do I need a specialized
17:59
vector store or can I just use Postgres
18:02
and the built-in Postgres kind of vector store,
18:04
is that going to be enough? Same with
18:07
an OLAP database. If you're using
18:09
Clickhouse, the question becomes, well, is Clickhouse
18:11
vector search
18:12
sufficient for my purposes or do I need
18:14
something even more specialized like Pinecone? I
18:16
believe it's still an open question. However,
18:19
if there's anything I've seen kind of in terms of trends
18:21
in technology space in general, it
18:24
is usually toward simplicity
18:26
and consolidation. So I think if it's possible
18:28
for existing databases to build in those
18:30
capabilities in a way that's sufficiently
18:33
performant and resource efficient,
18:35
then it will happen. If it's simply impossible,
18:37
if the architectures are so divergent and
18:40
these workloads are that important, there
18:42
may be
18:42
a third class of databases that gets developed.
18:45
But I think it's an open question. Yeah,
18:47
it's definitely interesting and
18:49
early days for the vector database
18:51
market. And yes, everybody has
18:53
their opinions as to which one is going to
18:55
win out, particularly if you happen to work
18:57
for that vector database vendor.
19:00
For sure.
19:01
And again, the way I see it is
19:03
like, certainly, again, I think transactional
19:06
and analytical databases should be developing
19:08
these capabilities. Because if it's possible for you to serve
19:10
even a fraction of that market, somebody doesn't
19:13
have to get a new database. I'll give you an example for why
19:15
our customers ask for it. So we have customers
19:17
in a fraud analytics space where they're analyzing
19:20
a lot of information in order to make a decision as
19:22
to whether, say, a transaction is fraudulent or
19:24
some behavior is undesirable.
19:26
And they do it based on heuristics. So they have
19:29
an analytical database for that purpose. And
19:31
it was working very well for them. And now they want to augment
19:33
it with a couple of fraud detection
19:35
methods that are maybe reliant on
19:38
LLMs. They don't want to move all of this data. And
19:40
ideally, they don't want to host two databases
19:42
with overlapping data. If possible,
19:45
they just want to host embeddings in Clickhouse and
19:47
combine that with the data they already have in Clickhouse.
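A minimal sketch of that pattern, combining an existing heuristic filter with similarity against stored embeddings in a single query path, might look like this in Python. The transactions, embedding dimensions, and thresholds are all invented for illustration:

```python
import math

# Each transaction carries the analytical columns the customer already
# stores, plus an embedding vector kept alongside them.
transactions = [
    {"id": 1, "amount": 25.0,   "embedding": [0.9, 0.1]},
    {"id": 2, "amount": 9000.0, "embedding": [0.1, 0.95]},
    {"id": 3, "amount": 8200.0, "embedding": [0.2, 0.9]},
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def suspicious(known_fraud_vec, min_amount=1000.0, min_sim=0.8):
    """Heuristic filter (amount) combined with vector similarity to a
    known-fraud embedding: the hybrid query the customer wants to run
    in one database instead of two."""
    return [
        t["id"] for t in transactions
        if t["amount"] >= min_amount
        and cosine(t["embedding"], known_fraud_vec) >= min_sim
    ]

print(suspicious([0.0, 1.0]))  # [2, 3]
```

The point of keeping embeddings next to the analytical data is exactly this: one pass applies both the heuristic predicate and the similarity predicate, with no cross-database join or data movement.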
19:49
So if we can deliver them performance that is sufficient
19:52
for their use case, of course, we will try to do that. Does
19:54
it mean that there's no even
19:57
more advanced use case for which a vector database
19:59
is necessary? No, it doesn't mean that. So it's
20:01
possible that both need to exist, that existing
20:04
databases need to add embeddings
20:06
and vector search capabilities, but still for
20:08
more specialized use cases, you may need a
20:10
dedicated vector database.
20:13
Circling back to
20:15
the stickiness of databases
20:17
as a piece of infrastructure, we've touched
20:19
on a few of the types of questions that teams should be
20:21
thinking about in that selection process,
20:24
but wondering if you can just talk through
20:26
some of the core elements
20:29
of performing proper due diligence
20:31
on this technology selection, some of the
20:34
technology concerns, some of the organizational
20:36
concerns, and just some of the ways
20:38
that teams should be approaching this
20:41
step of identifying, do
20:43
I even need a new database? Do I need a database
20:45
at all? And if so, which is the right
20:48
one for this particular use case?
20:50
Right. I was thinking about this
20:52
question ahead of time. And it's a
20:54
tough challenge, actually, because in order
20:56
to select a database, you have to really understand your
20:58
workload. And sometimes you don't, like you start
21:00
building an application and you don't yet know
21:03
what the shape of your workload is going to look like until
21:05
you've built the app or prototyped the app,
21:07
or really kind of got to a point where real
21:10
world usage is driving certain
21:13
shapes of your workload. You may not know ahead
21:15
of time exactly how many columns
21:17
you're going to have in your data or which column,
21:21
for instance, will end up having a
21:23
certain cardinality of value. So you just simply don't know.
21:26
You could have a hypothesis, but you may not know.
21:28
So one thing I will say, you probably will
21:30
make the wrong decision at some point, like if
21:32
you have a database that simply doesn't scale, the
21:35
question then is how quickly can you migrate
21:38
or move some of that workload to another technology?
21:40
This is why at Clickhouse, we actually do focus specifically
21:43
on making that part of the journey easier. We just
21:45
anticipate that, of course, a lot of existing
21:47
folks who are users at some point will hit a
21:50
scaling limitation and they will need to quickly
21:52
onboard onto Clickhouse. And making that path
21:54
very simple is important. And then as
21:56
far as trying to do it upfront, I guess I
21:59
would say that, yes, just knowing that there
22:01
is even a transactional versus analytical workload
22:03
distinction is important because they are quite different.
22:06
Transactional workloads ultimately are
22:08
more static, right? You have rows, and of course they
22:10
can grow over time, but you're mostly
22:13
updating existing data in place. It's
22:15
a slower, I would say, growing workload,
22:17
whereas analytical workloads are kind
22:20
of more like changes. Imagine you've got a
22:22
more static inventory of
22:24
products. Your analytical workload would be anything
22:26
that has to do with changes in inventory. And of course, that
22:28
data set, kind of time-indexed, is going
22:30
to grow a lot faster. So anything
22:33
that grows really fast because it's really
22:35
more about changes in some other static data
22:37
set, that is an analytical workload. So knowing
22:40
that is the case, I would say from early
22:42
on establishing this pattern where you have both
22:44
a transactional and an analytical database
22:46
is valuable and then kind of basing
22:49
your technology decisions
22:51
on that and kind of anticipating that
22:53
that is the case. I'm also seeing increasingly,
22:57
again, database vendors and database technologies
22:59
anticipating that for users in the first place. So
23:01
there's transactional databases building more
23:04
and more foreign data wrappers for analytical databases
23:06
and even almost helping their users detect
23:08
when they hit some sort of scaling limits in the transactional
23:11
database and saying, okay, like move it to an analytical
23:13
database and we'll still give you ability to kind
23:15
of query
23:16
across both and vice versa. Analytical
23:19
data stores build CDC,
23:21
change data capture, capabilities to very
23:23
quickly detect changes in transactional
23:25
databases and onboard those
23:28
workloads. So hopefully that helps. Like I would say
23:30
just even knowing that transactional versus
23:32
analytical workloads exist already helps
23:34
a lot.
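The transactional-versus-analytical distinction she describes can be made concrete with a small sketch: a table updated in place (the slow-growing current state) next to an append-only change log that an analytical query reads. The product names and quantities are invented:

```python
# Transactional side: current state, updated in place (OLTP-style).
inventory = {"widget": 10, "gadget": 5}

# Analytical side: every change appended, never rewritten (OLAP-style).
changes = []

def sell(product, qty):
    inventory[product] -= qty        # in-place update of current state
    changes.append((product, -qty))  # append-only event about the change

sell("widget", 3)
sell("widget", 2)
sell("gadget", 1)

# The state table stays the same size; the change log keeps growing.
print(inventory["widget"])  # 5
print(len(changes))         # 3

# Analytical query: total units sold per product, from the change log.
sold = {}
for product, delta in changes:
    sold[product] = sold.get(product, 0) - delta
print(sold)  # {'widget': 5, 'gadget': 1}
```

The change log grows with every event while the inventory table only mutates, which is why the two sides end up wanting different storage engines.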
23:35
And another interesting aspect
23:37
of this overall question of which
23:40
engine do I need, particularly
23:42
in that divide between OLTP
23:44
or online transactional processing and online
23:46
analytical processing, is do you
23:49
need both? If so, how do you
23:51
make them work better together? Transactional
23:54
engines have long been the
23:56
solid workhorse of application
23:58
development. They were even
24:00
the engines used for data
24:03
warehousing in the early days of data warehousing
24:05
before we got columnar stores and MPP
24:07
databases. And now that
24:10
we do have columnar stores available and we do
24:12
have MPP databases for being able to
24:14
parallelize that analytics, what do
24:16
you see as the major motivators for
24:19
having that be a separate set
24:21
of technologies, separate pieces of infrastructure
24:24
and some of the inefficiencies and
24:26
complexities that are driven as a result of that?
24:28
It's true, right? I think the only thing
24:30
I can think of is just the size
24:32
of analytical workloads grew,
24:34
you
24:35
know, again, exponentially, or grew
24:38
to such a point where transactional databases
24:40
became just not feasible for
24:42
the type of analysis that people want to do. And
24:45
also the expectations of the type of applications
24:47
you want to build changed. Because I think for a while,
24:49
when it came to analytics, it was sufficient to have
24:52
the kind of experience where you produce a
24:54
report, right, like you analyze something
24:56
and you produce a report and gets emailed to you,
24:58
you know, every day or every week or even
25:00
every month, you know, so like imagine kind of an internal
25:03
workload that is analytics focused,
25:05
like that was just kind of how internal teams
25:07
work for a long time. And that, of course,
25:09
would not work for any sort of, you know,
25:12
SaaS applications where interactive experience is
25:14
required. So I think the revolution actually started
25:16
with the SaaS part, people wanted to build
25:18
more interactive experiences on their websites,
25:21
and that kind of introduced technologies again,
25:23
first like Elasticsearch, you know, many others that
25:26
powered these applications. And now the question
25:28
being asked by internal teams is, why
25:30
shouldn't we adopt the same for internal
25:33
users? Why should they wait for a report?
25:35
Or why should they have the kind of query that you run
25:37
and then kind of go away and come back to in many minutes?
25:40
Those questions are being asked. And so
25:42
what we're seeing now is I think some of the things
25:45
that have made some of these SaaS services
25:47
successful, internal teams are asking themselves, why
25:49
shouldn't our internal users have that experience? Because if
25:52
they don't, they actually will go and try to consume those
25:54
SaaS services, right? And internal
25:56
teams are seeing kind of more and more demands
25:58
for interactive dashboards, interactive applications.
26:01
I would say with internal teams where this started,
26:03
at least in my experience, was on the financial
26:05
side. So financial sector for a while really
26:08
led in terms of just having high
26:10
expectations
26:10
for internal users. If you're
26:12
a trader, at the end of the day, you need to have an interactive
26:14
application that helps you make a decision
26:17
of what bets to place the next day. And
26:19
you can't wait for the following
26:21
day. You need that decision now. So any
26:23
internal stakeholders where they needed
26:26
to consume data very quickly and
26:28
interactively, I think this is what really
26:31
introduced the need for more specialized
26:34
databases and data stores for analytical workloads
26:36
that could support these interactive use
26:38
cases.
26:42
Data projects are notoriously complex.
26:44
With multiple stakeholders to manage across varying
26:47
backgrounds and tool chains, even simple
26:49
reports can become unwieldy to maintain. Miro
26:52
is your single pane of glass where everyone can discover,
26:55
track, and collaborate on your organization's data.
26:58
I especially like the ability to combine your technical
27:01
diagrams with data documentation and
27:03
dependency mapping, allowing your data
27:05
engineers and data consumers to communicate
27:07
seamlessly about your projects. Find
27:10
simplicity in your most complex projects with
27:12
Miro.
27:13
Your first three Miro boards are free when you
27:15
sign up today at dataengineeringpodcast.com
27:18
slash Miro. That's three free boards
27:21
at dataengineeringpodcast.com
27:23
slash M-I-R-O. One
27:27
of the shortcomings that is introduced
27:30
by virtue of splitting out
27:32
the analytical engine for its
27:34
speed of analysis and computation from
27:36
the transactional store that is getting
27:39
the data as it is generated is
27:41
the need for being able to either say, we're
27:43
going to batch this, and this is how long you
27:45
can expect to have data delayed when you're running
27:47
this report. Or you need to bring in something like
27:49
change data capture or some other streaming technology
27:52
to be able to feed the data directly over
27:54
to the analytical system. And a third
27:57
approach that I've seen applied
27:59
in some cases is federation
28:01
of queries where this is where things like Trino,
28:03
Presto come in. I know Clickhouse has
28:06
some support for things like foreign data
28:08
wrappers. I'm wondering what you see
28:10
as the overall trade-off,
28:13
some of the ways that teams should be thinking about how
28:15
best to make the analytical system
28:17
work as closely as possible with the transactional
28:20
store without introducing
28:22
arbitrary breakage when network connections
28:24
fail.
28:25
Yeah, there's several very interesting
28:28
topics here. So on the change
28:30
data capture side, I believe this
28:32
needs to be, again, just a built-in capability of
28:35
analytical databases. Clickhouse
28:37
handles it by... We have this concept
28:39
of a materialized Postgres and materialized
28:41
MySQL engine where we
28:44
basically... Yeah, we can create almost like a
28:46
logical view of your
28:48
MySQL or Postgres database and just query
28:51
it as well as capture changes from
28:55
these databases using these engines
28:57
that basically act as our CDC. I believe it
28:59
just needs to be built in and vice
29:01
versa. OLTP databases should
29:03
have foreign data wrappers for the most popular analytical
29:06
databases that they see kind of in their ecosystem.
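[Editor's note: as a rough sketch of the built-in CDC capability described here, ClickHouse's experimental MaterializedPostgreSQL database engine can mirror a Postgres database; host, database name, and credentials below are placeholders.]

```sql
-- Hedged sketch: replicate a Postgres database into ClickHouse using the
-- (experimental) MaterializedPostgreSQL database engine, which consumes the
-- Postgres WAL and acts as built-in CDC. Connection details are placeholders.
SET allow_experimental_database_materialized_postgresql = 1;

CREATE DATABASE pg_replica
ENGINE = MaterializedPostgreSQL('postgres-host:5432', 'app_db', 'replication_user', 'secret');

-- After the initial sync, replicated tables are queryable directly
-- and kept up to date as changes land in Postgres:
SELECT count() FROM pg_replica.orders;
```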
29:09
But you mentioned object stores
29:11
and kind of the data lake use case. This is another
29:13
really interesting evolution that we're seeing.
29:16
So again, primarily on internal
29:18
analytics side, what we've seen is
29:20
that cloud data warehouses, like
29:23
Snowflake, Redshift, BigQuery, they
29:26
of course have come to prominence in the past, say,
29:28
five years. And their big accomplishment
29:31
was moving all of these on-premise,
29:33
more traditional data warehouse workloads from Teradata,
29:36
Oracle, and so on into the cloud. And
29:38
it's great because now that these workloads are in
29:40
a cloud environment, teams, and again, primarily
29:43
it might be internal teams working on internal
29:46
analytical use cases, are asking themselves, well,
29:48
does it make sense to keep these workloads
29:51
in a monolithic data warehouse? Or
29:53
does it make sense, for instance, to put some of these workloads
29:56
into a data lake and to
29:58
query it using different
29:59
engines. And I
30:02
would say that what I'm seeing is really
30:04
more the trend toward unbundling these
30:06
cloud data warehouses. Again, not every organization
30:09
is bought into it yet, but we're definitely seeing that trend
30:11
in some of the organizations that we work with where
30:14
they're saying, okay, now that we have this
30:16
data in a more open environment, in a cloud
30:19
provider of choice, we can start again, moving
30:21
the pieces where they belong. And the way
30:23
Clickhouse fits into it is it's becoming
30:26
more like a real time engine to
30:28
work on top of data lakes, as well as next
30:30
to data lakes, and helping kind
30:32
of that trend of unbundling what has
30:34
become kind of a monolithic
30:36
version, like of an on
30:38
premise data warehouse, but in the cloud, the cloud data
30:40
warehouse.
30:41
Another element of database
30:43
engines, the ways that they fit into
30:46
in particular analytical use cases
30:48
is that they're not the only
30:50
operator in that space. There's typically
30:53
a complex web of dependencies between
30:55
different systems, data is flowing in
30:57
and flowing out for different use cases.
30:59
And so it can be difficult to understand
31:01
what is actually happening at any moment in time
31:04
when you need to debug something, which brings
31:06
in the question of data observability. And
31:09
that is a whole other market.
31:11
But from the perspective of somebody working
31:14
with teams building database engines,
31:17
what do you see as the role of the database
31:19
itself in cooperating and
31:21
enabling the observability
31:23
aspects from an analytical perspective
31:26
so that people who are operating these infrastructures
31:28
can have more confidence that they're
31:31
looking at the right things, that they understand what's going
31:33
on, and that they can tune the workloads as
31:35
needed?
31:36
As you mentioned, I'm more on
31:38
the side of a database vendor,
31:40
like working with data observability tools. So
31:42
the first thing I will mention is just how
31:45
important data observability
31:46
tools
31:47
are starting to become to stakeholders,
31:50
it does seem like there's been an inflection where it's
31:52
just an expectation. And this is in addition
31:54
to other data management tooling that we see.
31:56
So, you know, data versioning,
31:59
data orchestration. So that tooling, I would
32:01
say, we're seeing a movement where
32:03
it starts to be used, I would say, much earlier
32:06
in the adoption of a data store, especially again,
32:08
for internal analytics, when you've
32:11
got many stakeholders and they
32:13
all need to understand what is the data catalog,
32:16
what is data lineage, like how are changes propagated.
32:19
Even for our own internal data warehouse team,
32:22
you might imagine our commercial
32:25
focus is around our cloud offerings, so our finance
32:27
team just lives and dies by this MRR number,
32:30
monthly recurring revenue. Well, this number gets generated
32:32
from many sources of data and any
32:35
change that may affect
32:37
how this number gets calculated, it's
32:40
critical for us to understand. If there's anything
32:43
that occurs that may taint
32:46
how we view this number, we report
32:48
it to the board, and it's of course reported internally. So
32:51
companies have similar important
32:53
metrics and data
32:55
fields whose integrity they need to
32:58
understand. And so this is driving adoption
32:59
of tools that I've already mentioned, specifically
33:03
data orchestration, data versioning,
33:05
and data observability. The database vendor,
33:08
so what we do to enable
33:10
these tools, and there are tools that integrate
33:12
with Clickhouse. Some of them work, by the way, on
33:14
top of other tools, so for instance, DBT, a
33:17
pretty big player in this space, some
33:19
of them work very natively on top of that.
33:21
What they ask of us is a few things. One
33:24
is really good kind of self observability.
33:26
So every time anything
33:29
changes in the database, it needs to be observable.
33:32
And within Clickhouse, the way it's accomplished is,
33:34
we're a database, where would we put data about
33:36
ourselves? We put it in ourselves. Like when you spin up
33:38
Clickhouse, it has these system tables, as
33:41
we call them. Everything is in there. Any DDL
33:43
statement that you run, any log about
33:46
anything that happens is in our internal
33:48
system tables, you can query it, it's very easy,
33:50
it's right there. And we just happen to be also very efficient
33:53
at storing them. So it's not a big overhead on the database
33:55
itself. But that is what makes it
33:57
very easy for data observability partners
34:00
to integrate with us, there's nothing we have to add for
34:02
them. All the data is there and
34:04
they can query it on day one. And then
34:07
the second part I would say is ability
34:10
to go deeper if need
34:12
be. So there needs to be some ability to turn
34:14
on kind of more advanced tracing and profiling
34:17
if something goes wrong. This is
34:19
where, you know, Clickhouse and other vendors
34:21
are starting to build in open standards
34:23
based ways to kind of self monitor more internals
34:25
of the database. So OpenTelemetry
34:28
is kind of an
34:29
increasingly popular way of monitoring
34:33
specifically say traces within a database
34:35
product, that is something you would turn on optionally
34:38
and use only if needed.
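[Editor's note: a quick illustration of the self-observability described here; the system.query_log table and its columns below are real ClickHouse system tables, though the exact query is only a sketch.]

```sql
-- Sketch: ClickHouse stores data about itself in system tables, so an
-- observability tool can query operational history with plain SQL.
-- For example, the ten most recent completed queries and their timings:
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;
```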
34:40
And then from somebody who's working
34:42
on the product side, dealing
34:45
with people who are trying to
34:47
understand how a given database
34:49
engine fits within their stack and within their
34:51
use case, what are some of the elements of customer
34:54
education that you find yourself coming back to
34:56
the most or areas of
34:59
maybe misunderstanding or misconceptions that people
35:01
have going into the tool selection
35:03
process?
35:04
So that's a really interesting question. And this
35:06
one may surprise you a little bit, because with
35:09
Clickhouse... and again, at Elastic it was a little
35:11
bit different, because we were a search technology and kind
35:13
of our terminology was very search oriented.
35:15
And people came to Elasticsearch with
35:17
an expectation that it was a search engine
35:20
first primarily and then everything else second.
35:22
With Clickhouse, we're mostly
35:24
like ANSI SQL compatible. And like from
35:26
a syntax perspective, for the most part,
35:28
you can kind of take your queries and just kind of port them
35:31
over. We do have some SQL extensions for analytics.
35:34
That's extra. But if you're like coming over from transactional
35:36
world, you might look at Clickhouse and
35:38
say, ah, you know, I just take my workloads there and everything's
35:40
fine. But where things kind of break down a
35:42
little bit, and this is something to pay attention to when adopting
35:45
any new database is in the end, the
35:47
devil's in the details when it comes to specifically
35:49
data organization and semantics.
35:52
So I'll give you one example. We have a
35:54
concept of a primary key, we call this a primary
35:56
key in Clickhouse. What it means in Clickhouse
35:58
is actually the key by which we sort the data.
35:59
And why is that important is because
36:02
for analytical workloads, how you've sorted the
36:04
records based on which key, basically
36:06
the data is organized kind of in order versus not
36:09
has a huge effect on how
36:11
fast you can query columns back for specific
36:13
types of aggregation. So for analytical workloads
36:16
like Clickhouse, the sorting order
36:18
of records basically on disk is very important. So
36:21
we call, when you create a table, we
36:23
say you should use a primary key and that primary
36:25
key should be something by which you will query. And
36:27
that's what we say. But of course,
36:29
in the transactional world, primary key means something completely
36:31
different,
36:31
right? It's all about sort of constraints
36:34
and, you know, so users
36:36
get very confused. They say, like, you look like SQL and
36:38
you walk like SQL, but you have this primary key that means
36:40
something completely different. So I guess for folks
36:42
building databases, my advice is don't
36:45
take terms that mean something else in very
36:47
popular databases and make them mean
36:49
something else entirely in your database. It's going to
36:52
be confusing. For us, I think it's too late
36:54
to unroll that one. But if, like, I was
36:56
the creator of Clickhouse back in the day, I probably
36:58
would have made a different decision on the name
37:00
of, like, primary key. And there's a few other small
37:02
examples like this.
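[Editor's note: a minimal sketch of the naming nuance described here; in ClickHouse's MergeTree engine the ORDER BY clause defines the sort key that doubles as the primary key. Table and column names are illustrative.]

```sql
-- Sketch: in ClickHouse (MergeTree), the "primary key" is the on-disk
-- sort key used for data skipping, not a uniqueness constraint.
CREATE TABLE page_views
(
    site_id     UInt32,
    event_time  DateTime,
    url         String,
    duration_ms UInt32
)
ENGINE = MergeTree
ORDER BY (site_id, event_time);  -- doubles as the primary key; duplicate rows are allowed

-- Queries filtering on the leading key columns can skip most of the data:
SELECT count(), avg(duration_ms)
FROM page_views
WHERE site_id = 42 AND event_time >= now() - INTERVAL 1 DAY;
```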
37:03
Yeah, naming things is hard. Always
37:06
very hard.
37:07
But back to education, like, how do we educate
37:09
users? So, yes, we educate them on some of these
37:11
nuances. But actually, yeah, a lot of the education
37:14
goes into them understanding
37:16
that ultimately, when you're adopting an analytical
37:18
database, there's some thought that has
37:20
to happen. Some thought has to go
37:22
into how you actually organize the workloads,
37:25
because do you really just want to
37:27
take your, like, highly relational
37:29
workload as it is in the transactional world into analytics? Most
37:31
likely not. You could. It would work.
37:33
But actually, this is not how you get the most out of an analytical
37:36
database. You typically will do a little bit
37:38
more flattening of the data, not completely.
37:40
Like, Clickhouse supports joins. But to get the most
37:42
out of your use case, you may do a little
37:44
bit more, again, processing of the data
37:47
before querying it. And this is where Clickhouse
37:49
has a concept of materialized views. We can
37:51
take actually highly, sort of, you know,
37:54
normalized data and then help you, almost like
37:56
using ELT. And this is where DBT becomes important,
37:58
to transform it into something you would actually want
38:01
to query. So that is kind of built in,
38:03
but you have to understand that you have to do that.
38:05
And that's where a lot of the education happens.
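[Editor's note: a hedged sketch of the flattening pattern described here, using a ClickHouse materialized view; all table and column names are illustrative. Note that a materialized view with a join only refreshes on inserts into the left-hand table.]

```sql
-- Sketch: pre-flatten normalized data into an analytics-friendly shape
-- with a materialized view, so interactive queries avoid the join.
CREATE MATERIALIZED VIEW orders_flat
ENGINE = MergeTree
ORDER BY (order_date, customer_id)
POPULATE AS
SELECT
    o.order_date,
    o.customer_id,
    c.country,      -- denormalized from the customers dimension table
    o.amount
FROM orders AS o
INNER JOIN customers AS c ON c.id = o.customer_id;
```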
38:08
And in your experience of
38:10
working in this space, working with
38:12
end users now at Clickhouse,
38:14
also with Elastic, what are some of the most
38:16
interesting or innovative or unexpected ways
38:19
that you've seen people applying database
38:21
technologies, whether specific
38:23
to the tools that you worked on or
38:25
just more generally?
38:26
So with Elastic, there was actually a very interesting
38:29
use case. I remember it struck me, our
38:31
first user conference for Elasticsearch,
38:33
we had somebody from NASA present on
38:36
the Mars Rover use case. And that just blew my mind, right?
38:38
I mean, like the telemetry that was
38:40
created on Mars, right,
38:42
got sent to Earth and put into Elasticsearch.
38:45
And that was just very,
38:47
I don't know, surprising to me that a search technology
38:49
or analytical technology would get adopted in that
38:51
context. It shouldn't surprise me. And in the
38:53
end, from a technical perspective, that workload probably actually
38:56
even wasn't the most challenging because you don't have
38:58
that much bandwidth to transmit that much data. But
39:00
it was just very cool and very exotic.
39:03
Let's just put it that way. You know, for Clickhouse, it
39:05
also, what blows my mind is just the scale at which
39:08
this product can run. As I mentioned, it was developed
39:10
for an internet scale, kind
39:12
of web analytics use case. It can ingest
39:15
billions and trillions of rows. Just today
39:17
we published a case study with Ahrefs,
39:19
which is again, another vendor that does, basically
39:21
crawls the whole internet and stores their data
39:23
in Clickhouse. And it's just amazing the
39:25
scale at which you can run, but it doesn't mean that
39:28
you don't need it at a smaller scale. You still do, right?
39:30
And there's still these inflection points where, you
39:32
know, even for a much smaller dataset, you
39:35
need an analytical database just based on the
39:37
types of queries, interactive experiences you
39:39
can run. And in your own experience
39:42
of working in this space, what are the most interesting
39:44
or unexpected or challenging lessons that you've learned?
39:47
Unexpected lessons for me. I
39:49
think the main one, maybe I
39:51
mentioned in the beginning, which was when you
39:54
transition from commercial to open source
39:56
databases, as a product person, you do
39:58
have to think very differently. And that,
40:00
like how you leverage the community is
40:02
something that you shouldn't underestimate that
40:05
it's a huge, huge value. The community is not
40:07
just a free kind of distribution channel for
40:09
your free users. It's a big channel,
40:11
first of all, for innovation. You just mentioned interesting
40:14
use cases.
40:14
A lot of these users just come from downloading
40:17
the product. Somebody just has an idea and they just
40:19
want to download a product and use it for free to prove
40:21
out their idea. They don't have any budget. Often
40:23
it's a passion project. So these types
40:25
of community users are just gold. And this
40:27
is something that I love about working with open
40:30
source products, that these types of
40:32
individuals and their ideas get nurtured
40:34
by the fact that the technology is free at scale.
40:37
Like this is a difference from a freemium product. A freemium product
40:39
typically is sort of scale limited, whereas an
40:41
open source distribution model in databases,
40:43
which by the way, I think has
40:44
won out. I think it's pretty clear. The
40:47
typical sort of like distribution is
40:49
an at scale solution you can run. So
40:51
that's one thing that was kind of surprising to me. The second
40:53
thing was actually at Elastic
40:56
when we got to kind of be an at scale company,
40:59
we had this kind of fork in the road in terms of how
41:01
do we grow? Like from a platform perspective, we
41:03
were a really popular platform for
41:05
search and certain types of analytics, but how do we grow
41:07
the company? And the direction that the company
41:10
ultimately took was to add more
41:12
vertical solutions based on
41:14
this open platform. And so if you look at Elastic's
41:17
website right now, they talk about observability
41:19
and security and what they call enterprise search. And
41:22
how you kind of do this kind of growth is
41:24
you actually need to build out a solution based
41:26
on this database. You can try to build it organically,
41:29
but typically actually you kind of pursue an acquisition
41:31
strategy. And what was surprising to me was
41:33
with an open source product, when you do M&A,
41:36
when you look at companies that build
41:39
products and solutions, you can actually try
41:41
to find companies that have already built a product based
41:43
on your open database. And
41:45
then the integration costs are very low because
41:47
you just bring in this team, they already know your technology.
41:50
They've already built a solution on your stack,
41:52
on your technology stack. And so then the integration
41:54
play is much faster and that really helped us
41:56
out at Elastic.
41:57
And as you continue
41:59
to iterate on the product that
42:02
you're involved with as you keep an eye on the broader
42:04
database market from a competitive
42:07
standpoint, from an educational standpoint,
42:09
what are some of the predictions that you have
42:12
for the future trends in the database
42:14
market?
42:15
Okay, so a couple of things. We
42:17
talked about OLAP versus OLTP and
42:20
my prediction is that OLAP does continue
42:22
to grow in prominence. Still
42:25
today, I think that
42:26
most users start with OLTP and then
42:28
sort of almost through trial and error arrive
42:31
at needing OLAP. I
42:33
do think that in the course of a
42:35
few years, we'll see more
42:37
of a pattern where you just simply start with both.
42:39
That's one of my predictions. I don't know
42:41
that it's gonna happen this year, but I do believe
42:44
just the amount of investment that's
42:46
happening in the OLAP space, and
42:48
by the way, right now, usually folks
42:50
call it, like, the real-time analytics space, I think
42:53
is going to lead to a lot more
42:55
awareness. And again, that's not only specifically
42:58
Clickhouse. There's so many other technologies in
43:00
the space, but I think generally like the space
43:02
of OLAP and real-time analytics is going to lead
43:04
to developers starting with both. They're
43:06
gonna start with OLTP and OLAP, and this is
43:08
how they just build
43:09
out their product. That's number one.
43:12
My second prediction more on the
43:14
internal team side is this cloud
43:16
data warehouse unbundling trend
43:19
continues. I do think that data
43:21
lakes will continue to rise in prominence
43:23
just because it just makes sense. Like there's so many things
43:25
that make sense about a data lake.
43:28
You have one kind of object store that's powering
43:31
many use cases, and you can leverage different open
43:33
technologies on top of it. Just that pattern makes
43:35
sense to me. This is why it's important for us to
43:37
invest in it. It doesn't mean that you won't have
43:39
some specialized storage because in the end, like
43:41
even with Clickhouse, we work pretty fast
43:43
on top of object stores, say with Parquet
43:45
or Iceberg format, but in the end, our native format
43:48
is even faster. So for some workloads, you
43:50
may still leverage specialized store, but for
43:53
other use cases, you probably don't want to. Like if you
43:55
have a use case where you want both a data
43:57
scientist and an app to have access to the same
43:59
data, why would you duplicate it? Like you'll want
44:01
just to keep it in one place and have two kind of
44:04
analytical engines pointing to it. I think that trend
44:06
is going to continue. And finally, from
44:08
the perspective of where we talked about vector
44:11
stores and Gen AI, I mean, something is going to
44:13
happen. I don't think the hype is going to completely flame
44:15
out and we're just all going to say like this was nothing.
44:17
I think it's going to lead to new applications. I
44:19
don't know that it's going to be quite as disruptive
44:22
as, you know, some people sometimes
44:25
say, I think in the end, it comes back
44:27
to like, what experiences do we want to build? So
44:29
again, say I'm
44:29
building a product for marketing professionals.
44:32
Okay. Like I'm going to leverage large language
44:34
models to again, incorporate more aspects
44:36
of natural language into kind of
44:39
my suggestions, but I don't think
44:41
it's going to be everything. I think there's
44:43
still going to be a lot of domain knowledge that remains
44:46
outside of a large language model.
44:48
And I think that it's going to be kind of a blend
44:50
of approaches.
44:56
Thank you for listening. Don't
44:58
forget to check out our other shows. Podcast.init,
45:00
which covers the Python language, its community,
45:03
and the innovative ways it is being used. And
45:05
the Machine Learning Podcast, which
45:07
helps you go from idea to production with machine
45:09
learning. Visit the site at dataengineeringpodcast.com,
45:13
subscribe to the show, sign up for the mailing
45:15
list and read the show notes. And if you've
45:17
learned something or tried out a product from a show,
45:19
then tell us about it. Email hosts
45:21
at dataengineeringpodcast.com with your
45:24
story. And to help other people
45:26
find the show, please leave a review on Apple podcasts
45:28
and tell your friends
45:29
and family. The
45:34
promise of these tools is pretty
45:36
great,
45:36
but I think it's early days for this tooling.
45:39
And there's a few players, but I think that there's still a lot
45:42
more that these tools can do and flipping
45:44
it more on the side of database vendors.
45:46
I think database vendors need to have more
45:48
built-in observability of the database
45:50
itself.
45:51
So it's easier to build these tools across
45:53
offerings. So that's, I would say
45:55
one of the bigger gaps that I would note.
45:57
Well, thank you very much for taking the time.
45:59
today to join me and share your
46:02
perspective and experience and expertise
46:05
on database product development and
46:07
ways to be thinking about the incorporation
46:09
of databases into applications and infrastructure.
46:12
It's definitely a very interesting problem domain
46:14
and it's great to see the trajectory
46:17
of Clickhouse and so appreciate
46:19
the time and energy that you're putting into that and I hope you enjoy
46:21
the rest of your day. Thank
46:22
you for having me, Tobias.