Episode Transcript
0:11
Hello, and welcome to the Data Engineering
0:13
Podcast, the show about modern data management. Dagster
0:17
offers a new approach to building
0:19
and running data platforms and data
0:21
pipelines. It is an open source,
0:23
cloud-native orchestrator for the whole development
0:25
lifecycle, with integrated lineage and observability,
0:27
a declarative programming model, and best-in-class
0:30
testability. Your team
0:32
can get up and running in minutes
0:34
thanks to Dagster Cloud, an enterprise-class hosted
0:36
solution that offers serverless and hybrid deployments,
0:39
enhanced security, and on-demand ephemeral test deployments.
0:42
Go to dataengineeringpodcast.com/dagster today to get
0:44
started, and your first 30 days
0:46
are free. Data lakes
0:49
are notoriously complex. For
0:51
data engineers who battle to build
0:53
and scale high-quality data workflows on
0:55
the data lake, Starburst powers petabyte-scale
0:57
SQL analytics fast, at a fraction
0:59
of the cost of traditional methods,
1:01
so that you can meet all
1:03
of your data needs, ranging from
1:05
AI to data applications to complete
1:07
analytics. Trusted by teams of all
1:09
sizes, including Comcast and DoorDash, Starburst
1:11
is a data lake analytics platform
1:13
that delivers the adaptability and flexibility
1:15
a lakehouse ecosystem promises. And
1:18
Starburst does all of this on an
1:20
open architecture, with first-class support for Apache
1:22
Iceberg, Delta Lake, and Hudi, so
1:24
you always maintain ownership of your data. Want
1:28
to see Starburst in action? Go
1:30
to dataengineeringpodcast.com/starburst and get
1:32
$500 in credits to
1:34
try Starburst Galaxy today, the easiest and
1:37
fastest way to get started using Trino.
1:39
Your host is Tobias Macey, and today I'm
1:41
interviewing Paul Dix to talk about his investment
1:44
in the Apache Arrow ecosystem and how it
1:46
led him to create the latest fad in
1:48
database design. So Paul, can you start by
1:50
introducing yourself? Sure. I'm
1:52
Paul Dix. I'm the founder and CTO
1:54
of InfluxData. We are the makers
1:57
of InfluxDB, which is an open-source time
1:59
series database. Prior to that,
2:01
I have a lot of experience in industry. I'm obviously
2:03
a computer programmer by training, and I've
2:05
worked in a lot of large companies, small companies
2:08
all over. So. And
2:10
for folks who haven't listened to your
2:12
previous appearance on this show, where we
2:14
were talking about the Influx product suite
2:16
and your experience there, where you actually
2:18
hinted at the work that you've been
2:21
doing, where we're bringing you back to
2:23
talk about, can you just give a
2:25
refresher on how you first got started
2:27
working in data? So as
2:29
I mentioned, InfluxDB is a time series database.
2:31
Now how I got interested in this topic,
2:34
I mean, generally, like when I was in
2:36
school, I was interested in information retrieval, database
2:38
systems, that kind of stuff. But
2:41
in 2010, I was working
2:44
at a FinTech startup here in New York
2:46
City, and we had to
2:48
build a solution for working with a
2:50
lot of time series data. Later, when
2:52
I started this company, initially we were
2:54
building a product for doing server monitoring
2:56
and real-time application metrics and that kind
2:58
of thing. And to build a
3:00
backend for that, I had to build a solution
3:02
that was very similar to the
3:04
backend I had built for the FinTech company. So
3:07
I saw two different use cases.
3:09
One was in financial market data, and
3:11
the other in server monitoring and application
3:13
performance monitoring data. But the
3:16
backend solution for both was basically the
3:18
same thing. And at that point, I
3:20
realized building a database that could work
3:22
with time series data at scale and
3:24
make it easy for the user, was
3:26
a more interesting problem to solve. So
3:29
we pivoted the company to
3:32
focus on that, became InfluxDB,
3:34
and we've been building for that ever since.
3:37
So initially we had version 1.0,
3:40
the initial announcement of InfluxDB was in the
3:42
fall of 2013. We
3:45
released version 1.0 of InfluxDB in September
3:47
of 2016. We
3:49
released 2.0 in basically late 2019, early
3:51
2020. And
3:54
then just this last year, we released
3:56
version 3.0 of the database,
3:59
which is the... the significant
4:01
rewrite that you were hinting at
4:03
that basically caused us to adopt
4:05
all these new technologies and start
4:07
investing heavily in the Apache Arrow
4:09
ecosystem. Now, bringing
4:11
us through to this part of the
4:13
conversation, I made
4:16
a little bit of a play on
4:18
the acronym with the introduction, but the
4:20
different letters of it are F-D-A-P, and
4:22
I'm wondering if you could just start
4:25
by describing the overall context of that
4:27
stack, what the different
4:29
components are and how they combine to
4:31
provide a foundational architecture for database engines.
4:35
Yeah, so the FDAP
4:37
stack is an acronym for the
4:39
different pieces. F stands
4:41
for Flight, which is Apache Arrow
4:43
Flight or Apache Arrow Flight SQL.
4:47
A is actually Apache Arrow, which
4:49
is essentially the foundational project under
4:51
which all these components reside, so
4:54
Arrow is like the umbrella project
4:56
for everything. So
4:58
Apache Arrow is an
5:01
in-memory columnar specification, so basically it's
5:03
a format for in-memory columnar data
5:05
so that you can do quick analytics on it.
5:08
D, which is Data Fusion,
5:10
which is a SQL processor,
5:13
it's a query parser, planner,
5:15
optimizer, and execution engine for SQL.
5:18
Specifically, it also follows the
5:20
Postgres dialect of SQL. And P is
5:23
Parquet, which is a file
5:26
format for persisting columnar data, but
5:28
also structured data, so you can
5:30
have nested structures. It's
5:32
essentially an open source implementation of
5:36
the Google Dremel research paper that came
5:38
out in the early aughts.
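To make the acronym concrete, here is a minimal sketch of the D, A, and P pieces working together, with F as the natural way to serve the results: DataFusion plans and executes a SQL query over a Parquet file and hands back Arrow record batches. This assumes the `datafusion` and `tokio` crates; the table name and file path are just illustrative.

```rust
// A minimal FDAP sketch: DataFusion (D) queries a Parquet (P) file and
// returns Arrow (A) record batches, which could then be served over
// Flight / Flight SQL (F). Path and table name are hypothetical.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a Parquet file as a queryable table.
    ctx.register_parquet("metrics", "data/metrics.parquet", ParquetReadOptions::default())
        .await?;

    // Parse, plan, optimize, and execute a Postgres-dialect SQL query.
    let df = ctx
        .sql("SELECT host, avg(cpu) FROM metrics GROUP BY host")
        .await?;

    // Results come back as Arrow record batches, ready for zero-copy
    // interchange with other Arrow-native tools.
    let batches = df.collect().await?;
    println!("{} batches", batches.len());
    Ok(())
}
```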
5:41
I'm wondering if you can talk to
5:43
the design goals and constraints that
5:45
you were focused on in the
5:48
re-implementation of InfluxDB and how
5:50
that led you to the selection
5:52
of this composition of tools to
5:54
execute on that vision. Yeah,
5:57
so for InfluxDB 3.0, as
6:00
I mentioned, we basically did a
6:02
ground up rewrite of the database, which generally speaking
6:05
is not something you'd ever want to do. But
6:08
there are a number of problems we wanted to solve
6:10
for. So first
6:12
is this idea of infinite
6:14
cardinality, right? Within time series
6:17
databases, generally there's this idea
6:19
of the cardinality problem where
6:22
cardinality comes in dimensions that
6:24
you describe your data on,
6:27
right? So these could be like a
6:29
server name or a region or a
6:32
sensor ID, but you can also have
6:34
other dimensions like what user made this
6:36
request or what security token made the
6:38
request. And really when you think about it, the
6:41
dimensional data is basically just data that describes
6:44
different observations that you're
6:46
making. So when
6:48
people want infinite cardinality, they basically just want
6:50
to be able to say they want to
6:53
capture as much precision and information about these
6:55
observations that they're making. Traditional
6:58
time series databases like InfluxDB versions
7:00
one and two and others have
7:02
a problem essentially when this cardinality
7:05
gets super, super high. And
7:07
we had a bunch of, you know,
7:09
customers and users who were saying they wanted to
7:11
record this and use it for it, but
7:14
we didn't have a solution. It was basically
7:16
like a fundamental limitation of the architecture of
7:18
the database. So how do we
7:20
achieve infinite cardinality? How do
7:22
we achieve cheaper storage? Right. People
7:25
wanted to decouple the query processing and
7:27
the ingestion processing and indexing from the
7:29
actual storage of the data. And
7:31
they wanted to be able to ship historical data
7:33
off to cheaper object storage that could
7:35
be backed by spinning disk while
7:38
still making it so that queries against
7:40
recent data are super fast. Right. So
7:43
again, you're talking about a very fundamental
7:45
shift in the architecture of the database
7:47
to be able to enable, you know,
7:49
keeping everything in object storage while processing
7:52
recent data in memory and
7:54
all this other stuff. And
7:57
then the other big piece is essentially like
7:59
we wanted broader ecosystem compatibility.
8:01
InfluxDB versions
8:03
one and two have
8:06
their own query languages, their own data
8:08
formats. We wanted to
8:10
be able to integrate with a much broader
8:12
set of third-party tools. So specifically
8:14
we wanted to support SQL as
8:17
a query language in addition to
8:19
InfluxQL, our older query
8:21
language. We wanted persistent
8:24
formats that could be read and used
8:27
in tools outside of InfluxDB.
8:31
And we wanted all of this essentially to be
8:33
super performant. And basically when we looked at this,
8:35
we're like, OK, there are fundamental
8:37
architecture changes of the database, which means we're essentially
8:39
going to have to rewrite most of it. And
8:42
this was at the beginning of 2020. And
8:45
at that time, I thought,
8:47
well, one, older versions of
8:49
InfluxDB are written in Go. That's kind of an
8:51
artifact of when we created the project back in
8:53
2013. Go
8:55
was starting to become
8:57
hot then. The Go 1.0 release was in
9:00
March of 2012. But
9:02
in 2020, the beginning of 2020, I
9:05
was very interested in Rust. And I
9:07
felt that Rust as a programming language
9:10
would be essentially the best
9:12
way to implement this kind of high-performance
9:14
server-side software. And
9:17
I also thought that we could
9:19
bring in other open source tools
9:21
and libraries that would help us
9:23
get there faster. Specifically, we didn't
9:25
want to create our own SQL
9:28
execution engine from scratch. That's a very,
9:30
very big investment. And there are other
9:32
systems out there that can do it.
9:35
And initially, we thought that we might be
9:37
pulling in something that was written in
9:39
either C or C++, since
9:41
bringing that code into a Rust
9:43
project is actually fairly straightforward. And
9:45
you have zero cost abstractions and
9:47
basically a very clean way to
9:49
integrate it. When
9:52
we started looking around, we saw that there
9:54
were actually some Rust projects that were super
9:56
interesting that would enable us to do this.
9:58
So one, persistence format, right,
10:01
we wanted a format that
10:03
was more broadly addressable, right,
10:05
from other tools. And in
10:07
2020, the most obvious choice,
10:09
at least to us, was Parquet.
10:12
Parquet came out, I
10:14
think in like 2016. So it
10:17
was beyond, like, early, early adopter
10:19
phase, it was getting more usage,
10:22
starting to get more usage in like other
10:24
big data processing systems, data warehouses. And we
10:27
felt that if we use that as the
10:29
persistence format, we'd one, get
10:32
the amount of compression we needed for our
10:34
data to make it like, you know, compact
10:36
at scale. But the other
10:38
is like make it so we could share it with
10:40
other third party systems. So that was
10:42
kind of an obvious choice. Then we knew
10:44
like, we need fast analytics on
10:47
the data, right? So
10:49
that's when we started looking at Arrow
10:51
as, like, the in-memory columnar data
10:53
structure, right? One of the things I
10:55
mentioned is, you know, this need for
10:57
supporting high cardinality data. But
11:00
then the other need is essentially like doing
11:02
analytics style queries on time series data so
11:04
that you can do analysis, versions
11:06
one and two of influx DB, those kind
11:09
of analytics queries were like slow because of
11:11
the way the system was architected under the
11:14
hood. And we thought if we're
11:16
going to be able to do fast analytical queries
11:18
on time series data, it's
11:21
going to have to be in this columnar
11:23
format. So we kind of adopted Arrow's in-
11:25
memory format for this data, which
11:27
then led to, you know, these other
11:29
pieces. And then in early 2020, we
11:31
looked at a number
11:34
of different query engines we could
11:36
potentially use. We looked
11:38
at DuckDB, which was still very
11:40
nascent at that time; we looked at
11:42
ClickHouse's engine, which again was nascent
11:44
compared to where it is now. And we also
11:46
looked at data fusion. And at the
11:49
end of the day, we decided that data
11:51
fusion would be our choice because you know,
11:55
it was written in Rust. And the thing is
11:57
like all three of those projects that we evaluated,
11:59
we realized there was going to be a lot of work that
12:01
we would have to do to be able
12:04
to support the time series use cases that
12:06
we were aiming for. And
12:08
we felt that if we're going to
12:10
have to do a lot of work and end up
12:12
contributing heavily to this query engine, we might
12:15
as well do it in a language that we
12:17
intend to use, which is Rust, right? DuckDB and
12:19
ClickHouse are both implemented in C++. And
12:22
we also felt that Data Fusion being
12:24
part of the Apache Foundation and being
12:26
part of the Arrow project, we were making
12:28
a bet that it would essentially start
12:31
to gather momentum and pick up steam and
12:33
there'd be other people who would contribute to
12:35
it over time. And over
12:37
the last three and a half
12:39
years that we've been heavily developing with
12:41
it and contributing to it, we've certainly
12:43
found that to be the case. More
12:46
people have been adopting Parquet, more people
12:48
have been adopting Arrow, they've been contributing
12:50
to those two and Data Fusion.
12:53
And Flight and Flight SQL are
12:55
also becoming kind of a standard
12:58
RPC mechanism, essentially for exchanging
13:02
analytic data sets or millions of rows
13:04
quickly in a high performance way. And
13:08
each of those pieces of the
13:10
stack are definitely well engineered. They've
13:13
been gaining a lot of momentum.
13:15
There's been a lot of investment
13:17
in that overall ecosystem, but they
13:19
are all, I guess they're not
13:22
as narrowly scoped, in particular Arrow, as when
13:24
they first started, but they are all focused
13:26
on a particular portion
13:28
of the problem. And
13:30
in order to build them into a
13:32
cohesive experience, I'm curious, what was the
13:35
engineering effort that's necessary to actually build
13:38
a fully executable database
13:40
engine and platform experience on
13:42
top of those disparate parts?
13:46
Yeah, I mean, it's certainly true that when
13:49
Arrow first started, it essentially was like an
13:51
in-memory specification. And the dream there was
13:53
essentially that you have
13:55
data scientists who are trying
13:57
to do analysis in either Python or R, right?
14:00
And the thing is, they almost always have to
14:02
get their data from one place and bring it
14:05
in and exchange it to another thing. So the
14:07
vision there was essentially, how do you do data
14:10
interchange between these different data science
14:12
tools and systems that is zero
14:14
copy, zero-cost
14:16
serialization and deserialization, super, super fast.
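As a rough sketch of that interchange idea, assuming the `arrow` crate (column names are illustrative): build a columnar record batch in memory and write it to the Arrow IPC stream format, which another Arrow-native tool can consume without an expensive serialization and deserialization step.

```rust
// Build an in-memory Arrow record batch and stream it out in the Arrow
// IPC format. Another process (Python, R, etc.) can read these bytes
// without re-parsing or re-encoding the data. Names are illustrative.
use std::sync::Arc;

use arrow::array::{Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("host", DataType::Utf8, false),
        Field::new("cpu", DataType::Float64, false),
    ]));

    // A columnar batch: one Arrow array per column.
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["a", "b"])),
            Arc::new(Float64Array::from(vec![0.5, 0.9])),
        ],
    )?;

    // Stream the batch out; the bytes could go to a file, a socket, or
    // a Flight endpoint.
    let mut buf = Vec::new();
    let mut writer = StreamWriter::try_new(&mut buf, &schema)?;
    writer.write(&batch)?;
    writer.finish()?;
    println!("wrote {} IPC bytes", buf.len());
    Ok(())
}
```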
14:19
And Wes and
14:21
his team started with that and
14:23
then they saw, like, okay, wait a second,
14:25
now people also have these needs to like persist
14:27
the data. So we need a persistence
14:30
format. He brought in Parquet because he also
14:32
helped define Parquet when it was first
14:34
created. But that became
14:36
an obvious add-on and then you
14:38
know, the RPC mechanism. They're like, okay, well,
14:40
now you have servers that are running things; you
14:43
need a way to exchange the data. Again, an
14:45
obvious add-on. And Data Fusion,
14:47
again, like, you need, if
14:49
you're working with this data, like, in Python, you
14:51
have, like, pandas and R, you have, like, these,
14:53
you know, different things, like either data frames
14:56
libraries or whatever. But a lot of the time people
14:58
just want to execute a SQL query and you
15:01
need an execution engine that
15:03
can work with this arrow format
15:05
natively. That's going to be super fast, right?
15:07
Anything that's fast in Python isn't actually written
15:09
in Python. It's written in C or C++ and
15:13
then wrapped. So that's
15:16
what they realized from the data science perspective.
15:18
Now, from the perspective of people creating a
15:20
data platform, like an entire data
15:22
platform or a database server or something like that,
15:26
the thing that's tricky about it is a lot
15:29
of these formats is that they're designed for
15:31
exchanging, like, a set chunk of data, right?
15:34
Like Parquet is an immutable format, right? It's
15:36
not meant to be updated; you write a
15:38
Parquet file and that's that. Arrow,
15:41
again, like, you don't append to Arrow buffers
15:44
on the fly; you create an Arrow buffer, it's well
15:46
defined, and then you can hand it off. So having
15:50
a system that's basically able to ingest
15:52
data live, right, like
15:54
individual writes, individual rows that you're
15:56
writing in and being able to
15:58
combine that with this historic data
16:00
set that's represented either as arrow
16:02
buffers in memory or parquet files
16:05
on disk, right? Moving all that
16:07
data around, that becomes the really
16:09
like the trickiest part of creating
16:11
like a larger scale data
16:13
platform. It's like, how do you move that
16:15
data around? How do you combine the real
16:18
time data with the historical data? And how
16:20
do you make that all fast? And
16:22
how do you make it easy to use? All
16:25
of that work is basically a non-trivial
16:27
amount of effort, but it's
16:30
certainly made easier by the fact that
16:32
you no longer have to create the
16:34
lower level primitives, right, to
16:36
build that data platform. You don't have to create
16:38
the query engine. You don't have to
16:40
create the file format, right? Those things
16:43
basically just exist. And
16:45
there, you know, I have
16:47
heard Wes refer to it as basically the composable
16:49
data stack, right? Which is you can
16:52
kind of pick and choose these pieces that you
16:54
want to work with, right? You can use
16:56
the Data Fusion query engine, but
16:59
not use Parquet at all. And,
17:01
you know, not use Flight if you don't want
17:03
to. It uses Arrow under the hood, so that
17:05
kind of like comes along for the ride. But
17:08
yeah, like all of these different pieces are kind of like,
17:10
you know, they're designed to
17:13
be modular so that you can pick a
17:15
different persistence format if you want that. You
17:17
can pick a different execution engine, right? Within
17:20
the arrow ecosystem, one of the
17:22
things that Voltron Data,
17:24
the company that Wes ended up starting
17:27
with some other people that backs a
17:29
lot of the Arrow stuff as well,
17:31
one of the things they
17:33
created was this project called, I don't know how
17:35
to pronounce it, Velox, basically, V-E-L-O-X,
17:38
which is basically like this
17:40
execution engine that was created
17:42
in conjunction with some work
17:44
at Facebook to do stuff,
17:47
right? So the idea is you can pick and
17:49
choose these components and kind of tie them all
17:51
together into a larger,
17:54
like, operational system where you're
17:56
essentially solving problems around data
17:58
warehousing, real-time
18:00
analytics and essentially just
18:02
like working with what I
18:04
would say observational data at scale,
18:07
right? Where observational data could
18:09
be data from your
18:11
servers, applications, sensors, logs, whatever
18:13
it is. Are
18:18
you sick and tired of salesy data conferences?
18:20
You know, the ones run by large tech companies
18:23
and cloud vendors? Well, so am I
18:25
and that's why I started Data Council,
18:27
the best vendor neutral, no BS data
18:30
conference around. I'm Pete
18:32
Soderling and I'd like to personally invite you to
18:34
Austin this March 26 to 28th
18:37
where I'll play host to hundreds of attendees, 100 plus
18:40
top speakers and dozens of hot startups
18:42
on the cutting edge of data science,
18:44
engineering and AI. The
18:46
community that attends Data Council are some
18:48
of the smartest founders, data scientists, lead
18:50
engineers, CTOs, heads of data, investors and
18:53
community organizers who are all working together
18:55
to build the future of data and
18:57
AI. And as a
18:59
listener to the Data Engineering podcast, you can
19:01
join us. Get a special
19:04
discount off tickets by using the
19:06
promo code DEPOD20. That's D-E-P-O-D-2-0. I
19:11
guarantee that you'll be inspired by the folks at the
19:13
event and I can't wait to see you there. Another
19:18
interesting element of
19:20
building your platform on
19:22
top of all these open source components
19:25
is that by virtue
19:27
of it being a layered stack,
19:29
you can have additional integrations that
19:31
can come in at each of
19:33
those different layers rather than having
19:35
the main interface be the only
19:37
way of accessing the data that
19:40
it contains. It
19:42
also gives you the benefit of being
19:45
able to capitalize on the overall ecosystem
19:47
of investment and the network effects that
19:49
you get from those different open source
19:52
projects. So I'm wondering if
19:54
you can comment on some of the ways
19:56
that you've seen that benefit materialize in your
19:58
work of building this data platform
20:00
on top of these different components? Yeah,
20:03
so this is actually like one of
20:05
the things I'm most excited about for
20:08
these different pieces and for the work
20:10
we're doing, which is I think
20:14
we actually need to add another letter
20:16
to the acronym, the FDAP acronym, and
20:18
maybe like jumble them up. But
20:20
basically the other letter is I
20:22
for Apache Iceberg. So
20:25
Iceberg is essentially a catalog standard
20:27
for creating a data catalog of
20:30
essentially Parquet files in object storage, right?
20:34
And we're basically building first class support for that
20:37
in InfluxDB 3.0, where all of the data
20:40
that's ingested into an InfluxDB
20:42
3.0 server can be exposed
20:44
essentially as Iceberg catalogs, which
20:47
is awesome because that's
20:49
a standard that was originally developed at
20:51
Netflix and that was open sourced out
20:53
into the Apache Foundation. And
20:55
it's quickly being adopted by
20:57
other companies, right? So Snowflake
21:00
just added support for Iceberg
21:02
as a format. Even
21:04
Databricks is adding support for it,
21:06
even though they have a competing
21:08
standard called Delta Lake. And
21:11
a lot within Amazon, the
21:13
Amazon Web Services, for example, they're
21:16
adding first class support for Iceberg
21:18
so that if you have
21:20
data that's exposed as an Iceberg catalog in
21:22
S3, you can then
21:25
query that data using any
21:28
of the Amazon query services
21:30
like Athena or Redshift or
21:32
all these different pieces. So
21:35
that I think is a really
21:38
interesting integration because it makes it
21:40
so that you can access this
21:42
data in bulk, right? So if you want
21:45
to train a machine learning model
21:47
or whatever, or query against this
21:49
data for doing large scale analytical
21:51
queries, and be totally outside,
21:54
for InfluxDB 3.0, for example, totally outside
21:56
the operational envelope of the system that's
21:58
managing all this real-time
22:00
data movement, being able to query in real-time,
22:04
you can basically do all these analytics
22:06
tasks completely disconnected from that. And
22:09
again, you could use
22:11
Data Fusion for that, but
22:13
you could also use Athena, right? Which
22:15
is based on a Java query engine
22:18
called Trino or
22:21
Presto or whatever it is now. Or
22:24
you could use DuckDB or ClickHouse or
22:26
any one of these other systems to
22:29
do your query processing and analytics against
22:31
that data.
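As a conceptual sketch of that decoupling, assuming the `parquet` crate and a hypothetical file path: once an Iceberg snapshot tells you which Parquet data files make up a table, any Parquet-capable reader can scan them entirely outside the database that wrote them. The Rust Iceberg libraries were still maturing at the time of this conversation, so the catalog lookup itself is elided here.

```rust
// Read a Parquet data file that a database persisted, entirely outside
// that database, using the `parquet` crate's Arrow reader. In a real
// setup the path would come from an Iceberg table snapshot, typically
// an s3:// URI. Path is hypothetical.
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical: one of the data files listed by the Iceberg catalog.
    let file = File::open("iceberg/data/00001.parquet")?;

    // The reader yields Arrow record batches, so any Arrow-native engine
    // (DataFusion, but equally DuckDB, ClickHouse, Athena, ...) can take
    // over from here for large-scale analytical queries.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    for batch in reader {
        let batch = batch?;
        println!("{} rows", batch.num_rows());
    }
    Ok(())
}
```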
22:34
So that integration, I think, is super interesting. The other
22:36
one that I think is interesting is within
22:40
the Arrow project. So
22:42
they have Flight SQL, which is
22:44
basically like an RPC mechanism for
22:46
essentially sending SQL queries to a
22:48
server and getting back millions of
22:50
rows really, really quickly.
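A hedged sketch of that request path, assuming the `arrow-flight` crate's Flight SQL client plus the `tonic` and `futures` crates; the endpoint address and query are hypothetical. One call plans the query, and the returned tickets stream the result batches back.

```rust
// Flight SQL round trip: send a SQL string, get back Arrow batches.
// Endpoint, port, and table name are hypothetical; error handling is
// minimal for brevity.
use arrow_flight::sql::client::FlightSqlServiceClient;
use futures::TryStreamExt;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://localhost:8082").connect().await?;
    let mut client = FlightSqlServiceClient::new(channel);

    // One round trip plans the query; the returned FlightInfo carries
    // tickets for the endpoints that actually stream the rows.
    let info = client
        .execute("SELECT * FROM metrics LIMIT 10".to_string(), None)
        .await?;

    for endpoint in info.endpoint {
        if let Some(ticket) = endpoint.ticket {
            // Each ticket streams back Arrow record batches.
            let mut stream = client.do_get(ticket).await?;
            while let Some(batch) = stream.try_next().await? {
                println!("{} rows", batch.num_rows());
            }
        }
    }
    Ok(())
}
```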
22:52
And they have basically a new standard that
22:54
they've created that's kind of like
22:56
competing with ODBC. ODBC
22:59
is obviously the database connection standard.
23:01
It was for essentially transactional
23:04
databases and relational databases. The
23:07
Arrow one, once that
23:09
becomes a thing, I think it will be
23:11
really, like, a standard way to
23:14
connect to analytical data
23:16
stores of any kind, whether it's
23:18
data warehouses or real-time data systems
23:20
or whatever. And I think those,
23:24
like having those things
23:26
be standards and have them contributed to
23:28
by many different companies, not just supported
23:30
by a single vendor, I
23:33
think will accelerate
23:35
the pace of innovation in this
23:37
space for these large
23:39
scale data use
23:41
cases which are only
23:43
gonna continue to increase and multiply. I
23:45
think it makes it so that we can
23:48
have basically many more
23:50
tools that can integrate with
23:52
each other. If
23:54
you look at data warehousing for the last 20
23:57
years, it's
23:59
largely been, like, your data
24:02
warehouses are basically kind of
24:04
like data roach motels. Like
24:07
your data goes in and you have
24:09
to get all the data in the data warehouse, but then
24:11
if you want to do anything with it, you have to
24:13
send the query to the data warehouse and like all this
24:15
other stuff, right? And there's just not, there's
24:17
not this really good integration, like the data
24:19
warehouse just becomes this one place. So
24:22
being able to access it from a bunch
24:24
of different tools without having
24:27
one piece of software be the arbiter of the
24:29
entire thing, I think is really interesting. Absolutely.
24:32
And to your point
24:34
of Flight SQL being
24:36
a new RPC mechanism to
24:38
unlock a lot of potential and reduce
24:40
a lot of the pains, it just
24:43
makes me sad that I obtained all
24:45
of that scar tissue around ODBC for
24:47
nothing. I
24:49
mean, I think ODBC is going to be around
24:51
for a very long time. I don't think it's
24:54
going away. Yeah, absolutely.
24:57
And the counterpoint to
24:59
the benefits that you get building on
25:01
top of open source is that particularly
25:03
when you have a business that is
25:06
being powered by these components, you
25:08
adopt some measure of platform risk
25:10
because you're not the only person
25:13
who has a vision for the
25:15
future direction of these technologies. And
25:18
some of that future direction may or
25:20
may not be compatible with the vision
25:22
that you have for it. And I'm
25:24
curious how you think about that platform
25:26
risk and the mitigating factors that you
25:28
have in the engineering that you're doing
25:31
to account for any potential future shift
25:33
in the kind of vision and direction
25:35
of those products. Yeah,
25:38
I mean, you can wrap the
25:40
libraries with your own abstractions, but
25:42
the problem is that comes with
25:44
a high price, a high cost.
25:47
And the truth is even if you wrap it with your own
25:49
abstractions, if the libraries end up changing
25:51
significantly and you're like, okay, we need to replace it
25:54
with something else, it's gonna
25:56
be like a non-trivial task. The
25:58
best insurance is
26:00
essentially to have enough people contributing to the
26:02
core of the thing to be
26:04
able to have some level of
26:07
influence on the direction of the project. Ultimately,
26:10
there's gonna be platform risk,
26:13
but I think, take
26:16
it from the other side, which is we
26:18
decide to develop all this stuff ourselves and keep
26:21
it closed source and just whatever. Well,
26:24
the risk there is like, I mean,
26:26
that's just an absolute mountain of work
26:28
to do. And
26:30
I think it's like, as
26:34
these projects have matured, like I said, we've
26:36
seen other people contributing to them. So now
26:39
we regularly get performance improvements in
26:41
the query engine or new functions
26:43
in the query language. And all
26:45
of this stuff, we help manage
26:47
the project, we have people contributing
26:49
to it, we make significant
26:52
investments into the open source pieces. But
26:55
those are things that we kind of get for
26:58
free as a
27:00
result, essentially, it means
27:02
that the risk we have if we kept
27:04
it all closed source is that our pace of
27:06
development would be outpaced, outmatched
27:09
by the set of people contributing
27:11
to this open thing, right?
27:13
We may be able to
27:15
get somewhere initially,
27:18
but eventually, the open source
27:20
people are gonna like outpace a
27:22
small team of proprietary developers. Now,
27:24
if you have unlimited resources, and
27:27
you can basically just like, you
27:29
know, create, you
27:32
know, a long lived team of people that you're able to fund
27:34
forever, then the situation changes.
27:37
But I think for startups in
27:40
the technology space, like, their
27:42
best bet is to adopt platform
27:44
pieces that are not, that, you know, that
27:47
you can contribute to, that can form the
27:49
basis of the things you're building, right, like,
27:51
and this is, you know, you
27:53
don't create your own operating system, right? You
27:55
use Linux, and you don't create your own
27:57
programming language, you use whatever language you're gonna
27:59
use there. And I think all
28:02
that stuff happens, it happens
28:05
higher and higher. All these pieces
28:07
kind of like build on each other. In
28:09
this case, like when we're talking about the
28:11
FDAP stack and all these different components, they're
28:14
essentially the toolkit that you would
28:16
use to build a database, an
28:18
analytical database or a data warehouse,
28:20
right? So why create
28:22
those things from scratch, right? Your ultimate
28:24
goal is not really to create a
28:27
data warehouse, it's to deliver value for
28:29
your customers who are actually paying for
28:31
the solution. And they don't really care about
28:33
a data warehouse per se, they care
28:35
about solving their data problem for their
28:37
customers. So as
28:39
much as you can adopt and say,
28:41
like, okay, this isn't gonna be
28:43
our thing that we innovate on. This is gonna be that, that's
28:46
not how we actually add value to this
28:48
market, to this thing that we're selling. This
28:51
is basically just like a barrier to entry.
28:54
And if you can adopt an open source
28:56
thing that like reduces the barrier, then right.
28:59
Absolutely, and by virtue of being
29:01
involved with and participating in the
29:03
open source projects that you're relying
29:05
on, you also get the benefit
29:07
of early warning of knowing that,
29:10
okay, this is the future direction that
29:12
the community would like to see. And
29:15
so now I can proactively plan for
29:17
those shifts in the underlying technology so
29:19
that I can accommodate them in the
29:21
end result that I am building on
29:24
top of it. Yeah,
29:26
well, and ultimately like the absolute
29:28
worst case scenario, right, is like
29:30
the community is gonna make some
29:32
weirdo changes that are just completely incompatible
29:34
with what we need to do.
29:38
Great, then we can just fork the
29:40
project from whatever that last point was.
29:43
It's permissively licensed open source. We can fork
29:45
a project and then we have two options.
29:47
Do we make our fork closed
29:50
source? Or do we
29:52
make our fork something publicly available and you
29:54
just continue on from there, right? And
29:57
at that point you haven't adopted any more
30:00
risk than you would have
30:02
had anyway, you know, with your closed-source
30:04
thing. Although I will say, like,
30:07
I mentioned, we, we spend a
30:10
lot of time contributing to these
30:12
community projects. So
30:14
there's a good amount of effort that we
30:17
put forward that essentially doesn't benefit us directly.
30:19
Right? It's not that we're doing
30:21
this community thing or managing these like
30:23
efforts with different people contributing or
30:26
whatever, because it's something we need specifically
30:28
for our products. But
30:30
again, the bet is that, you know, like,
30:34
okay, there are a bunch of things we'll do, they're
30:36
not of direct benefit to us, but there are other things
30:38
coming in from the community, so it
30:41
all kind of like, evens out.
30:43
And actually, in our, you know, in my
30:45
experience, it doesn't even out; like, we get
30:47
far more out of it than we give,
30:49
than we put in, even though,
30:51
like I said, we try to
30:53
put in as much as we possibly can. It's
30:56
just that when you have, you
30:58
know, dozens of developers from around the
31:00
world and different companies contributing
31:02
to this thing, like, the
31:05
sum is going to be greater than
31:07
what any one individual or one company produces
31:09
and puts into it. And
31:11
so looking at the
31:14
component pieces of this stack and
31:16
the overall architecture
31:18
and system requirements for a
31:21
database engine, what are the
31:24
additional pieces that you had to
31:26
build custom? What is the work
31:28
involved in building a polished user
31:31
experience on top of these different
31:33
components? And some of the
31:35
ways that you're thinking about what are the
31:37
appropriate abstraction layers? Or what are the appropriate
31:39
system boundaries for what these four
31:42
pieces of the stack do and the eventual inclusion
31:44
of iceberg? And what is the
31:46
responsibility of influx as
31:49
the database experience that needs to be built
31:51
on top of it? Yeah, so
31:53
I mean, basically, like, these components are
31:55
really just libraries, right? They're just programming
31:57
libraries that we use. So they're not
32:00
actually a piece of running software that will
32:02
do anything on its own. I mean
32:05
Data Fusion does have like a command
32:07
line tool where you can say like point
32:09
it at you know a file and execute
32:11
a query against it if it's CSV or
32:13
JSON or Parquet, right? But beyond
32:15
that it's not like a process
32:18
that will run on a server that will respond to
32:20
requests and all this other stuff. So you kind of
32:22
have to build all that scaffolding
32:24
around it, right? You have to build a
32:26
server process and you have to just decide
32:28
what your API is going to be, right?
32:31
For writing data in, most people are not
32:33
going to want to write you
32:36
know, Arrow record batches or
32:38
Parquet files in, because those
32:40
two formats actually aren't super
32:43
easy to create yourself. Like
32:45
usually when people create those
32:47
formats they do it as a
32:49
transform from some other data
32:51
that's easier to work with like
32:53
CSV or JSON or whatever, right?
32:55
So you have to decide
32:57
like, how do you write data in, what's that format,
32:59
how do you translate it to Arrow or
33:02
Parquet.
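As a sketch of that write path, assuming the `arrow` and `parquet` crates with illustrative column names: row-oriented input, of the kind that's easy to produce over the wire, is pivoted into Arrow arrays and then persisted as an immutable Parquet file.

```rust
// Accept easy-to-produce row data (think line protocol or JSON), build
// a columnar Arrow record batch from it, and persist it as Parquet.
// Column names, values, and the output path are illustrative.
use std::fs::File;
use std::sync::Arc;

use arrow::array::{Float64Array, StringArray, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Rows as they might arrive over the wire, one observation each.
    let rows = vec![
        ("host-a", 0.42_f64, 1_700_000_000_000_000_000_i64),
        ("host-b", 0.87_f64, 1_700_000_000_500_000_000_i64),
    ];

    let schema = Arc::new(Schema::new(vec![
        Field::new("host", DataType::Utf8, false),
        Field::new("cpu", DataType::Float64, false),
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
    ]));

    // Pivot row-oriented input into column-oriented Arrow arrays.
    let hosts: StringArray = rows.iter().map(|r| Some(r.0)).collect();
    let cpus: Float64Array = rows.iter().map(|r| Some(r.1)).collect();
    let times: TimestampNanosecondArray = rows.iter().map(|r| Some(r.2)).collect();
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(hosts), Arc::new(cpus), Arc::new(times)],
    )?;

    // Persist the batch as an (immutable) Parquet file.
    let file = File::create("metrics.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```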
33:04
You need to decide, like, for the query interface, like
33:06
SQL, the language, but then how are
33:08
they going to make the request, right?
33:10
Is it going to be HTTP, gRPC, whatever,
33:13
and then what is the response format going to be. Do
33:16
you want to give them Arrow? Do you want to give
33:18
them Parquet? Do you want to give them CSV, JSON, something
33:21
else, right? So all
33:24
those pieces you kind of have to decide on
33:26
and create, right?
33:28
Basically the entire like piece
33:30
of server software and then there's you
33:33
know all the operational pieces which is if you
33:36
have to run this in
33:38
a Kubernetes cluster if you have to run this in the
33:40
cloud or whatever and also for
33:43
us, for InfluxDB 3,
33:46
you know we have currently what we have
33:48
is a distributed version of the database where
33:50
it's comprised
33:52
of a number of different services that
33:55
run inside a Kubernetes cluster, right? And
33:57
we've separated out the ingestion tier from
33:59
the query tier from compaction
34:02
from a catalog that runs,
34:04
right? So
34:07
we basically had to create services
34:09
for each of those and APIs for how they
34:11
interact with each other. And then a bunch
34:13
of like tooling and stuff like that
34:15
to actually monitor, you know, spin this up on
34:17
the fly and monitor it, run it, all
34:20
that separate stuff. So I mean, it's still
34:22
like, if you're going to adopt
34:24
these components to build, you
34:26
know, a data system, there's still a
34:28
lot of work to do, but yeah. For
34:34
people who are interested
34:36
in building some database engine, or they
34:38
are interested in the functionality of any
34:40
of these different pieces, I'm curious what
34:43
you see as some of the other types
34:47
of projects that would benefit from
34:49
the capabilities of any or all
34:51
of those pieces of the stack,
34:54
and maybe some of the other elements
34:56
that could be built up and added
34:58
to that ecosystem to maybe reduce the
35:00
barrier to entry that you've had to
35:02
pay. Yeah,
35:04
I mean, so what
35:06
it seems like a
35:09
bunch of different kinds of projects are starting
35:11
to adopt and companies are starting to
35:13
adopt these pieces of the stack. So, you
35:16
know, I just saw one yesterday,
35:19
there was basically like a new
35:21
stream processing engine that essentially is
35:23
using Data Fusion and thus also
35:25
Arrow as the way to
35:28
do, you know, processing within the
35:30
stream processing engine, right? So
35:32
you can execute SQL queries against like data
35:34
coming in a stream, whatever. So there's that,
35:37
there are different kinds of
35:39
database systems, either time series
35:41
database or document database or
35:43
data warehouse or whatever. Like
35:45
I've seen a number of projects
35:49
in either open source or in companies that
35:51
are starting to use those components. There's
35:55
another project right now where
35:57
contributors from Apple are basically
36:00
putting in essentially a Spark execution
36:02
engine, which
36:05
is based on Data Fusion. Essentially
36:08
this is a replacement for the
36:10
open source Java Spark implementation that's
36:13
supposed to be faster and stuff like that. So basically
36:15
you see like one component within
36:17
Spark is being replaced with Data
36:20
Fusion as part of this. And
36:22
actually the creator of Data Fusion,
36:24
Andy Grove, was originally creating
36:27
Data Fusion for that use case inside
36:29
of NVIDIA. So
36:32
you see like all these different companies
36:34
like creating those different pieces. I
36:37
think it's still early in
36:39
the Rust ecosystem of tools
36:43
to see what's gonna happen, like what open source
36:45
projects are going to become kind of big, right?
36:48
Right now when you think of like big
36:50
data processing tools, most of that
36:52
environment is in Java, right? It started
36:54
with Hadoop and then continued
36:57
with Spark and like all
36:59
the different components there and right and Kafka's
37:01
written in Java and Flink's written in Java,
37:03
right? So you have different stream processing systems
37:05
and all these things kind of integrate together.
37:08
What I anticipate is that over
37:11
the next 10 years, you see
37:13
a lot of those systems rewritten,
37:15
recreated, using Rust and
37:18
using Data Fusion and Arrow and
37:20
Parquet as the underlying
37:22
primitives. And ideally they wouldn't
37:24
just recreate the exact same thing, you
37:26
know, but instead of Java, it's in
37:28
Rust. There will certainly be some of that, but
37:31
ideally what they will do is they will take,
37:34
you know, a lot of lessons learned from those
37:36
previous versions of those pieces of software to
37:38
ask, like, okay, how can we make the
37:41
user experience better, right? So it's easier to
37:43
express the kind of things we wanna express
37:45
or how do we make operations better? So
37:47
it's easier to like operate these systems at
37:49
scale. So I think
37:51
it's really early yet though. It's
37:53
not clear to me like from
37:56
an open source perspective, what projects are gonna be
37:58
the winners here that eventually
38:02
supersede the previous Java
38:05
systems. Absolutely. And
38:07
I've definitely been seeing a little bit of that
38:09
as well, even three to five years
38:12
ago of C++ being the
38:15
implementation target, particularly built around
38:17
the Seastar framework for being
38:19
able to take advantage of
38:21
multi-CPU architectures, most notably
38:24
the ScyllaDB project as a
38:26
target to re-implement Cassandra and
38:28
then Redpanda taking on
38:31
the Kafka ecosystem. And
38:34
another interesting aspect of this
38:37
space is Arrow as the
38:39
focal point of that data
38:41
interchange has been gaining a
38:43
lot of ground. It started off as a
38:46
very nascent project. There's been a lot of
38:48
effort put into making that more of the
38:51
first target rather than
38:53
being a second consideration. And
38:56
it's been working on integrating with the
38:58
majority of the components of the data
39:00
ecosystem. I'm wondering what you
39:02
see as some of the remaining gaps
39:05
in coverage or some of the white
39:08
spaces in the overall Arrow ecosystem that
39:10
are either immature or completely absent
39:13
and spaces that you would like
39:15
to see the overarching data community
39:17
invest in building out
39:19
more capabilities and capacity. So
39:23
I think there's still probably some work
39:25
to be done within Arrow as a
39:27
specification itself for representing data
39:30
in a more compact form. For
39:34
some kinds of like columnar data, it's just not
39:36
as efficient as I think it could be. But
39:39
originally, I think that was a
39:42
result of one of the design goals, which
39:44
was essentially O of one lookup for any
39:46
individual element within the set.
39:49
I think if that constraint is loosened,
39:51
it opens up the possibility for other
39:53
kinds of compression techniques
39:55
and stuff like that that will make it
39:58
a better format for compressed
40:01
data in memory, which I think is
40:03
something that would be potentially interesting. I
40:07
think there's still a
40:09
question of like, okay, if
40:11
we're going to have a
40:13
stream processing system that uses
40:16
these tools, what does that look like?
40:18
Because Arrow as a format
40:20
actually is not well suited
40:23
for stream processing, right? Because it's a
40:25
columnar format, so the
40:28
conceit there is that you are sending
40:30
in many, many rows at the same
40:32
time, whereas when you think of stream
40:34
processing, you think of either
40:36
micro batching or individual rows like
40:38
one by one, right? So there's
40:41
no good translation layer between, okay,
40:43
if you're moving, if you care
40:45
about doing stream processing and you
40:47
want to move to Arrow or
40:49
batch processing, larger scale data
40:51
processing, how do you make that transition? And
40:55
what do the tools look like
40:57
for that? I think that's still very difficult,
40:59
right? And it's certainly like something
41:01
we've done in InfluxDB 3, which is like
41:04
translating line protocol,
41:06
individual rows being written in, to
41:08
the Arrow stuff.
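One common way to bridge that gap is micro-batching: buffer individual rows and flip them into a columnar record batch every N rows. A minimal sketch, assuming the `arrow` crate; the row type and threshold are illustrative.

```rust
// Micro-batching bridge between row-at-a-time streams and columnar
// Arrow batches: rows accumulate in a row-oriented buffer, and every
// `max_rows` they are pivoted into one RecordBatch.
use std::sync::Arc;

use arrow::array::{Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::record_batch::RecordBatch;

// Illustrative row type, e.g. one parsed line-protocol point.
struct Row { host: String, cpu: f64 }

struct MicroBatcher {
    schema: SchemaRef,
    rows: Vec<Row>,
    max_rows: usize,
}

impl MicroBatcher {
    fn new(max_rows: usize) -> Self {
        let schema = Arc::new(Schema::new(vec![
            Field::new("host", DataType::Utf8, false),
            Field::new("cpu", DataType::Float64, false),
        ]));
        Self { schema, rows: Vec::new(), max_rows }
    }

    /// Buffer one row; emit a columnar batch once the threshold is hit.
    fn push(&mut self, row: Row) -> Option<RecordBatch> {
        self.rows.push(row);
        (self.rows.len() >= self.max_rows).then(|| self.flush())
    }

    fn flush(&mut self) -> RecordBatch {
        let hosts: StringArray = self.rows.iter().map(|r| Some(r.host.as_str())).collect();
        let cpus: Float64Array = self.rows.iter().map(|r| Some(r.cpu)).collect();
        self.rows.clear();
        RecordBatch::try_new(self.schema.clone(), vec![Arc::new(hosts), Arc::new(cpus)])
            .expect("arrays match schema")
    }
}

fn main() {
    let mut batcher = MicroBatcher::new(2);
    batcher.push(Row { host: "a".into(), cpu: 0.1 });
    if let Some(batch) = batcher.push(Row { host: "b".into(), cpu: 0.2 }) {
        println!("flushed batch with {} rows", batch.num_rows());
    }
}
```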
41:10
I think the
41:12
distributed query processing is something
41:15
that is probably going to
41:17
get more work. It's
41:20
definitely something that needs more work within the Data
41:22
Fusion piece itself. I
41:25
think later this year, I think in
41:27
a couple of months, hopefully they're going to
41:29
vote on whether Data Fusion becomes its own
41:32
top level Apache project outside of Arrow. My
41:35
best guess is that's going to happen. And
41:37
then what we'll probably see is like
41:39
Data Fusion will then have some sub
41:41
projects, one of which I think will
41:44
be around distributed query processing, which I
41:46
think will be important for
41:48
it really to become a contender
41:50
and a competitor in the larger
41:52
scale data warehousing space. What
41:57
else? I
42:00
don't know, like Parquet has gotten some
42:02
interesting improvements along the way. I think,
42:04
I don't know, there was like Geo
42:06
Parquet for representing geospatial data. I think
42:08
that's going to be super important. So
42:11
yeah. This might
42:13
be a little bit too far afield or
42:16
too deep in the weeds, but there was
42:18
also for a little while a bit of
42:20
contest between Parquet and ORC as the preferred
42:22
columnar serialization format. I'm wondering if you have
42:25
seen that the dust settle around that and
42:27
there has been a general consensus around one
42:29
or the other, or if those are still
42:31
kind of a case by case basis, do
42:34
what you think is right for a different
42:36
use cases. I
42:39
may just be biased because I'm rooting
42:42
for Parquet, but I don't, I remember
42:44
that being a thing and I remember
42:46
looking at formats from a high level
42:49
back in the day, but I don't really
42:52
see ORC as a format
42:55
coming up nearly as much. It
42:58
seems to me that Parquet
43:00
has kind of won
43:03
the mind share largely, and that's what
43:05
people kind of coalesced around. Now of
43:07
course, because we're talking
43:09
about data at scale, there are probably
43:12
like mountains of data in people's data
43:14
lakes and data warehouses that is represented
43:16
as ORC. So that's not going
43:18
to go away, but by and
43:20
large, what I see is that Parquet
43:23
seems to be the standard format
43:25
that all the big data vendors are coalescing
43:28
around. I've been
43:30
seeing a similar thing. And
43:33
then to the point of
43:35
streaming and record-based ingest of data
43:37
versus the columnar approach for
43:39
Parquet and Arrow, I know
43:42
that Avro and Parquet have
43:44
a defined kind of translation
43:46
method of being able to
43:48
compact multiple Avro records into
43:50
a Parquet file. And
43:53
I'm curious if you're seeing anything
43:56
analogous for the Arrow ecosystem of
43:58
being able to maybe manage
44:00
that translation of multiple Avro
44:02
records batched into an Arrow
44:04
buffer that can then subsequently
44:06
be persisted into Parquet or
44:09
using that Avro to Parquet translation as
44:11
the intermediary to then get loaded into
44:13
an Arrow buffer? I
44:16
mean, I haven't really seen that. I mean, there's because,
44:18
I mean, it's pretty easy
44:20
to go from Arrow to Parquet or Parquet to
44:22
Arrow, right? Because, you know,
44:25
Parquet's within the Arrow umbrella.
44:28
So the people, you know, in
44:30
the various projects have created
44:32
a bunch of like translation layers to do that.
44:35
I haven't seen, I really haven't
44:37
seen any like rise of like, oh, these
44:40
like, row-based formats into either
44:42
Arrow or Parquet,
44:45
it just seems to be like, kind
44:47
of one-off. Honestly,
44:49
I don't see Avro come
44:52
up that much. So mainly,
44:54
I think what I see the most what
44:56
people care about is like JSON data, just
44:59
because it's so easy, you know,
45:01
to change between different languages and different
45:03
services. And honestly, I
45:05
think protobuf more than,
45:07
more than Avro or anything else. I
45:10
think that's mean maybe because of, you
45:12
know, the popularity of gRPC. As
45:16
you have been investing in
45:18
this ecosystem, building on top of the different
45:20
components, I'm wondering what are some of the
45:23
most interesting or innovative or unexpected ways that
45:25
you have seen some or all of those
45:27
pieces used together? So
45:30
honestly, stream processing was a surprise for
45:32
me, because I like I didn't, when
45:35
I think of like ARO and, and Data
45:38
Fusion, like I wasn't originally thinking that people
45:40
would use these things for stream processing systems,
45:43
right? I think more like, it's there around
45:45
like, fast processing, and, you know,
45:47
I execute a query against this data, whatever.
45:50
So seeing people pull that
45:52
stuff into the stream processing systems has
45:55
been very surprising. Elsewhere,
45:57
I'm not sure. So
46:01
I've seen a few observability
46:04
solutions start to look seriously at using
46:06
Parquet as the persistence format. That's
46:08
a little surprising too, mainly because
46:12
when I think about observability, it's largely like,
46:14
oh, you think of, like, metrics, logs, traces,
46:16
right? And generally what people
46:18
have done is they've created specialized formats
46:21
and backends for each of those individual
46:23
use cases. So I've
46:26
seen some people start to look
46:28
seriously at having Parquet represent
46:30
like any of that kind of data, which
46:33
I think to me that's
46:35
definitely like one of our visions
46:37
long term is that being
46:39
able to store any kind of observational data
46:41
in influx and thus in Parquet, but
46:44
to see more observability vendors start to look at that
46:46
seriously has been a bit of a surprise too. And
46:51
in your experience of working in
46:53
this space, rebuilding the Influx database
46:55
and investing more into the Arrow
46:57
ecosystem, what are some of the
46:59
most interesting or unexpected or challenging
47:01
lessons that you've learned in the
47:03
process? I mean, one of
47:07
the lessons, which is somehow a lesson I always
47:09
really need to relearn as a software developer is
47:11
things always take longer than you expect them to
47:13
take. So this
47:16
project, like I said, you know, we
47:18
started seriously thinking about it about four
47:21
years ago, really serious development on it
47:23
for the last three and a half.
47:25
It's basically just a long, a long
47:27
road to create this kind of system.
47:30
I've
47:33
been pleasantly surprised by the adoption
47:35
and, actually,
47:37
the level, the level of
47:39
contribution from outside people
47:42
at actually companies
47:44
of a very significant
47:46
size has been also a bit
47:49
of a surprise. I think
47:54
for companies that reach like
47:56
crazy scale, which
47:58
are companies that you know the names of.
48:00
Like, I think many
48:02
of them are contributing to these projects, because
48:05
they kind of have to like, create
48:07
their own things, because literally nobody on earth has
48:09
the kind of scale problems they have, except for
48:11
maybe like 10 or 20 different
48:13
companies. So they end up having
48:16
to roll their own solution. And again,
48:18
I think the
48:20
fact that these companies are contributing is something
48:23
I didn't expect, particularly this
48:25
early on. And
48:28
I think that speaks to, you know, the
48:30
thing we were talking about earlier, which is
48:32
like, what kind of platform risk is there
48:34
to adopting this code? And it's like, well,
48:36
the alternative is, you create all
48:38
this closed source software that is really, like, not
48:40
the problem you're trying to solve. This is just like
48:42
the problem you have to solve to get to the
48:44
problem you're trying to solve. So that's,
48:47
that's been, like, I think
48:49
a pleasant surprise, seeing this,
48:51
you know, mature over the last
48:54
few years. And for
48:56
people who are looking to build data
48:59
systems, data processing engines, what are
49:01
the cases where the FDAP stack
49:03
is the wrong choice? So
49:07
I, I don't think
49:09
it's particularly designed for OLTP workloads,
49:11
right? So we have traditional relational databases
49:13
and stuff like that. Like, there
49:16
are places where you know, you can, it
49:19
would make sense to have it as like, essentially
49:22
like an interface point. But,
49:25
I mean, you can certainly use like data
49:27
Fusion as your query engine in an
49:30
OLTP workload. But to me, it
49:32
wouldn't make sense to use, like, Arrow as
49:34
a way to ingest data, or Parquet. Because
49:36
really, when you think about OLTP workloads,
49:38
you think about individual requests with individual record
49:41
updates and stuff like that. So
49:43
I really do think these tools are
49:45
more geared towards larger
49:48
scale analytical workloads against,
49:51
you know, data that you can largely view
49:53
as immutable, right? This is like observational data
49:56
and stuff like that. So, yeah.
50:00
And as you continue to build and
50:02
iterate on the new version of InfluxDB
50:04
and invest in the Arrow ecosystem and
50:06
the components we've been discussing, what are
50:08
some of the things you have planned
50:10
for the near to medium term or
50:12
any particular projects or problem areas you're
50:14
excited to dig into? So
50:16
as I mentioned, the thing
50:18
I'm most excited about is essentially like more
50:20
integration, adding support
50:23
for Apache Iceberg. So
50:26
what that's going to, so there's already like a
50:28
Rust project to do Apache Iceberg, but it's not
50:30
like fully baked yet. So we may need to
50:32
contribute to that, or maybe the people who are
50:34
working on it will get it fully baked before
50:36
we actually get to the point where we're pulling
50:38
it in. So
50:42
Apache Iceberg is a big thing. I
50:44
think in the medium term,
50:46
the distributed processing stuff in Data Fusion
50:48
is going to be super interesting. And
50:51
then from InfluxDB's perspective, as
50:53
I mentioned, like we have right now
50:56
our commercial distributed version of
50:58
the database, but this year
51:00
we're coming out with the
51:02
open source version of the
51:04
monolithic single server version of
51:06
the database. And getting that
51:09
open source piece out there with
51:11
like a new version 3 API
51:13
that kind of represents a much
51:15
richer data model than previous versions
51:17
of InfluxDB that takes advantage of
51:19
what you can do with Arrow and
51:21
Parquet as the
51:23
formats that I'm
51:25
actually really, really excited about. Because then I
51:28
really think that from
51:30
a technology perspective, InfluxDB will actually
51:32
be able to fulfill the like
51:34
vision that we've had all along, which is that essentially
51:37
it is useful
51:39
for any kind of observational data you
51:41
could think of, not just like metrics
51:43
data from your servers or networks or
51:45
your apps. Are there
51:47
any other aspects of the work
51:50
that you've been doing on the
51:52
InfluxDB engine, the work that you've
51:54
been doing investing in and building
51:56
on top of the Arrow ecosystem
51:58
or the overall space of how
52:01
the Arrow ecosystem might influence the future
52:03
direction of the data processing ecosystem that
52:05
we didn't discuss yet that you'd like
52:07
to cover before we close out the
52:09
show? I don't think
52:11
so. Like, I think we kind of, I
52:13
mean, I guess like more
52:16
broadly, like the way I
52:18
view like the data space right now
52:20
when you're talking about these like analytical
52:22
data is there's this
52:25
kind of like distinct
52:27
separation between like data warehousing on
52:29
one side, which is these large
52:31
scale analytical queries and stuff like
52:33
this, and like stream processing on
52:36
the other, which is more about like real time
52:38
data as it arrives. I think the
52:41
trend like really when I think about those two
52:43
things, like ultimately, like what developer
52:45
wants and what users want is basically some
52:48
magical oracle in the sky that they can
52:50
like send a query to
52:52
where the result will come back in, you know, some
52:54
50 milliseconds. And
52:57
if we had that, we wouldn't need stream processing, we wouldn't
52:59
need like all these different things. But
53:02
I think as the technology improves
53:04
and things get better and better,
53:06
data warehousing is going to become more
53:08
real time. And the real time pieces
53:11
are going to, you know, move more
53:13
towards like data warehousing, because ultimately, like
53:15
people don't want to think about separating
53:17
stream from data warehousing, whatever. And
53:20
one of the things I'm excited about is essentially
53:22
the idea that these different building
53:25
blocks could potentially be the things
53:28
that people use to kind of close that
53:30
gap and create, you
53:32
know, a big data solution that
53:34
works either for real time data or for,
53:37
you know, big scale data warehousing. But
53:40
I thought people liked reinventing the lambda architecture.
53:45
Oh, no, yes, they do. They
53:49
just like to call it something new. Maybe
53:51
it's the kappa architecture. All
53:54
right. Well, for anybody who wants to get in
53:56
touch with you and follow along with the work
53:58
that you're doing, I'll have you add
54:01
your preferred contact information to the show notes. And
54:03
as the final question, I'd like to get your
54:05
perspective on what you see as being the biggest
54:07
gap in the tooling or technology that's available for
54:09
data management today. The biggest gap?
54:12
Oh, I
54:15
don't know. I don't
54:17
know, actually. I mean, obviously,
54:20
like, I
54:22
think the most interesting side of this is
54:24
essentially like, you know, time series data and
54:27
basically being able to represent, being able to
54:29
do analysis on data as time series. So
54:32
that's our focus. That's
54:34
what I think is the most interesting thing right now.
54:37
But yeah,
54:41
I still think that's an unsolved problem by
54:43
us or anybody else. So that's what we're
54:45
working towards. All right. Well,
54:48
thank you very much for taking the
54:50
time today to join me and share
54:52
the work that you've been doing, both
54:54
contributing to and building on top of
54:56
the Arrow ecosystem and the components thereof.
54:58
It's definitely a very interesting
55:00
area of effort. It's great to see the work that
55:03
you and your team are doing to help bring all
55:05
of us forward in that space. I appreciate the time
55:07
and energy you're putting into that, and I hope you
55:09
enjoy the rest of your day. Cool.
55:12
Thank you. Thank
55:19
you for listening. Don't forget to check
55:22
out our other shows, Podcast.__init__, which covers
55:24
the Python language, its community, and the
55:26
innovative ways it is being used. And
55:28
the Machine Learning Podcast, which helps you
55:30
go from idea to production with machine
55:32
learning. Visit the site at dataengineeringpodcast.com, subscribe
55:35
to the show, sign up for the
55:37
mailing list and read the show notes.
55:40
And if you've learned something or tried out a product from the
55:42
show, then tell us about it. Email
55:44
hosts@dataengineeringpodcast.com with your
55:46
story. And to help other people
55:48
find the show, please leave a review on Apple
55:51
Podcasts