Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

Released Sunday, 25th February 2024

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.

0:11

Hello, and welcome to the Data Engineering

0:13

Podcast, the show about modern data management. Dagster

0:17

offers a new approach to building

0:19

and running data platforms and data

0:21

pipelines. It is an open source,

0:23

cloud-native orchestrator for the whole development

0:25

lifecycle, with integrated lineage and observability,

0:27

a declarative programming model, and best-in-class

0:30

testability. Your team

0:32

can get up and running in minutes

0:34

thanks to Dagster Cloud, an enterprise-class hosted

0:36

solution that offers serverless and hybrid deployments,

0:39

enhanced security, and on-demand ephemeral test deployments.

0:42

Go to dataengineeringpodcast.com/dagster today to get

0:44

started, and your first 30 days

0:46

are free. Data lakes

0:49

are notoriously complex. For

0:51

data engineers who battle to build

0:53

and scale high-quality data workflows on

0:55

the data lake, Starburst powers petabyte-scale

0:57

SQL analytics fast, at a fraction

0:59

of the cost of traditional methods,

1:01

so that you can meet all

1:03

of your data needs, ranging from

1:05

AI to data applications to complete

1:07

analytics. Trusted by teams of all

1:09

sizes, including Comcast and DoorDash, Starburst

1:11

is a data lake analytics platform

1:13

that delivers the adaptability and flexibility

1:15

a lakehouse ecosystem promises. And

1:18

Starburst does all of this on an

1:20

open architecture, with first-class support for Apache

1:22

Iceberg, Delta Lake, and Hudi, so

1:24

you always maintain ownership of your data. Want

1:28

to see Starburst in action? Go

1:30

to dataengineeringpodcast.com/starburst and get

1:32

$500 in credits to

1:34

try Starburst Galaxy today, the easiest and

1:37

fastest way to get started using Trino.

1:39

Your host is Tobias Macey, and today I'm

1:41

interviewing Paul Dix to talk about his investment

1:44

in the Apache Arrow ecosystem and how it

1:46

led him to create the latest FAD in

1:48

database design. So Paul, can you start by

1:50

introducing yourself? Sure. I'm

1:52

Paul Dix. I'm the founder and CTO

1:54

of Influx Data. We are the makers

1:57

of InfluxDB, which is an open-source time

1:59

series database. Prior to that,

2:01

I have a lot of experience in industry. I'm obviously

2:03

a computer programmer by training, and I've

2:05

worked in a lot of large companies, small companies

2:08

all over. So. And

2:10

for folks who haven't listened to your

2:12

previous appearance on this show, where we

2:14

were talking about the Influx product suite

2:16

and your experience there, where you actually

2:18

hinted at the work that you've been

2:21

doing, where we're bringing you back to

2:23

talk about, can you just give a

2:25

refresher on how you first got started

2:27

working in data? So as

2:29

I mentioned, InfluxDB is a time series database.

2:31

Now how I got interested in this topic,

2:34

I mean, generally, like when I was in

2:36

school, I was interested in information retrieval, database

2:38

systems, that kind of stuff. But

2:41

in 2010, I was working

2:44

at a FinTech startup here in New York

2:46

City, and we had to

2:48

build a solution for working with a

2:50

lot of time series data. Later, when

2:52

I started this company, initially we were

2:54

building a product for doing server monitoring

2:56

and real-time application metrics and that kind

2:58

of thing. And to build a

3:00

backend for that, I had to build a solution

3:02

that was very similar to the

3:04

backend I had built for the FinTech company. So

3:07

I saw two different use cases.

3:09

One was in financial market data, and

3:11

the other in server monitoring and application

3:13

performance monitoring data. But the

3:16

backend solution for both was basically the

3:18

same thing. And at that point, I

3:20

realized building a database that could work

3:22

with time series data at scale and

3:24

make it easy for the user, was

3:26

a more interesting problem to solve. So

3:29

we pivoted the company to

3:32

focus on that, became InfluxDB,

3:34

and we've been building for that ever since.

3:37

So initially we had version 1.0,

3:40

the initial announcement of InfluxDB was in the

3:42

fall of 2013. We

3:45

released version 1.0 of InfluxDB in September

3:47

of 2016. We

3:49

released 2.0 in basically late 2019, early

3:51

2020. And

3:54

then just this last year, we released

3:56

version 3.0 of the database,

3:59

which is the... the significant

4:01

rewrite that you were hinting at

4:03

that basically caused us to adopt

4:05

all these new technologies and start

4:07

investing heavily in the Apache Arrow

4:09

ecosystem. Now, bringing

4:11

us through to this part of the

4:13

conversation, I made

4:16

a little bit of a play on

4:18

the acronym with the introduction, but the

4:20

different letters of it are F-D-A-P, and

4:22

I'm wondering if you could just start

4:25

by describing the overall context of that

4:27

stack, what the different

4:29

components are and how they combine to

4:31

provide a foundational architecture for database engines.

4:35

Yeah, so the FDAP

4:37

stack is an acronym for the

4:39

different pieces. F stands

4:41

for Flight, which is Apache Arrow

4:43

Flight or Apache Arrow Flight SQL.

4:47

A is actually Apache Arrow, which

4:49

is essentially the foundational project under

4:51

which all these components reside, so

4:54

Arrow is like the umbrella project

4:56

for everything. So

4:58

Apache Arrow is an

5:01

in-memory columnar specification, so basically it's

5:03

a format for in-memory columnar data

5:05

so that you can do quick analytics on it.

5:08

D, which is Data Fusion,

5:10

which is a SQL processor,

5:13

it's a query parser, planner,

5:15

optimizer, and execution engine for SQL.

5:18

Specifically, it also follows the

5:20

Postgres dialect of SQL. And P is

5:23

Parquet, which is a file

5:26

format for persisting columnar data, but

5:28

also structured data, so you can

5:30

have nested structures. It's

5:32

essentially an open source implementation of

5:36

the Google Dremel research paper that came

5:38

out in the early aughts.
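To make the stack concrete, here is a minimal Rust sketch of how the pieces compose, assuming the datafusion and tokio crates and a hypothetical local file named metrics.parquet; the table and column names are invented, and Flight (the RPC layer) is not shown.

    // Minimal sketch: query a Parquet file with DataFusion and get Arrow batches back.
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();

        // P: register a Parquet file as a queryable table.
        ctx.register_parquet("metrics", "metrics.parquet", ParquetReadOptions::default())
            .await?;

        // D: DataFusion parses, plans, optimizes, and executes the SQL (Postgres dialect).
        let df = ctx
            .sql("SELECT host, avg(usage) AS avg_usage FROM metrics GROUP BY host")
            .await?;

        // A: results come back as Arrow record batches, i.e. in-memory columnar data.
        for batch in df.collect().await? {
            println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
        }
        Ok(())
    }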

5:41

I'm wondering if you can talk to

5:43

the design goals and constraints that

5:45

you were focused on in the

5:48

re-implementation of InfluxDB and how

5:50

that led you to the selection

5:52

of this composition of tools to

5:54

execute on that vision. Yeah,

5:57

so for InfluxDB 3.0, as

6:00

I mentioned, we basically did a

6:02

ground up rewrite of the database, which generally speaking

6:05

is not something you'd ever want to do. But

6:08

there are a number of problems we wanted to solve

6:10

for. So first

6:12

is this idea of infinite

6:14

cardinality, right? Within time series

6:17

databases, generally there's this idea

6:19

of the cardinality problem where

6:22

cardinality comes in dimensions that

6:24

you describe your data on,

6:27

right? So these could be like a

6:29

server name or a region or a

6:32

sensor ID, but you can also have

6:34

other dimensions like what user made this

6:36

request or what security token made the

6:38

request. And really when you think about it, the

6:41

dimensional data is basically just data that describes

6:44

different observations that you're

6:46

making. So when

6:48

people want infinite cardinality, they basically just want

6:50

to be able to say they want to

6:53

capture as much precision and information about these

6:55

observations that they're making. Traditional

6:58

time series databases like InfluxDB versions

7:00

one and two and others have

7:02

a problem essentially when this cardinality

7:05

gets super, super high. And

7:07

we had a bunch of, you know,

7:09

customers and users who were saying they wanted to

7:11

record this and use it, but

7:14

we didn't have a solution. It was basically

7:16

like a fundamental limitation of the architecture of

7:18

the database. So how do we

7:20

achieve infinite cardinality? How do

7:22

we achieve cheaper storage? Right. People

7:25

wanted to decouple the query processing and

7:27

the ingestion processing and indexing from the

7:29

actual storage of the data. And

7:31

they wanted to be able to ship historical data

7:33

off to cheaper object storage that could

7:35

be backed by spinning disk while

7:38

still making it so that queries against

7:40

recent data are super fast. Right. So

7:43

again, you're talking about a very fundamental

7:45

shift in the architecture of the database

7:47

to be able to enable, you know,

7:49

keeping everything in object storage while processing

7:52

recent data in memory and

7:54

all this other stuff. And

7:57

then the other big piece is essentially like

7:59

we wanted broader ecosystem compatibility.

8:01

InfluxDB versions

8:03

one and two have

8:06

their own query languages, their own data

8:08

formats. We wanted to

8:10

be able to integrate with a much broader

8:12

set of third-party tools. So specifically

8:14

we wanted to support SQL as

8:17

a query language in addition to

8:19

InfluxQL, our older query

8:21

language. We wanted persistent

8:24

formats that could be read and used

8:27

in tools outside of InfluxDB.

8:31

And we wanted all of this essentially to be

8:33

super performant. And basically when we looked at this,

8:35

we're like, OK, there are fundamental

8:37

architecture changes of the database, which means we're essentially

8:39

going to have to rewrite most of it. And

8:42

this was at the beginning of 2020. And

8:45

at that time, I thought,

8:47

well, one, older versions of

8:49

InfluxDB are written in Go. That's kind of an

8:51

artifact of when we created the project back in

8:53

2013. Go

8:55

was starting to become

8:57

hot then. The Go 1.0 release was in

9:00

March of 2012. But

9:02

in 2020, the beginning of 2020, I

9:05

was very interested in Rust. And I

9:07

felt that Rust as a programming language

9:10

would be essentially the best

9:12

way to implement this kind of high-performance

9:14

server-side software. And

9:17

I also thought that we could

9:19

bring in other open source tools

9:21

and libraries that would help us

9:23

get there faster. Specifically, we didn't

9:25

want to create our own SQL

9:28

execution engine from scratch. That's a very,

9:30

very big investment. And there are other

9:32

systems out there that can do it.

9:35

And initially, we thought that we might be

9:37

pulling in something that was written in

9:39

either C or C++, which meant

9:41

bringing that code into a Rust

9:43

project is actually fairly straightforward. And

9:45

you have zero cost abstractions and

9:47

basically a very clean way to

9:49

integrate it. When

9:52

we started looking around, we saw that there

9:54

were actually some Rust projects that were super

9:56

interesting that would enable us to do this.

9:58

So one, persistence format, right,

10:01

we wanted a format that

10:03

was more broadly addressable, right,

10:05

from other tools. And in

10:07

2020, the most obvious choice,

10:09

at least to us, was Parquet. It

10:12

was still, like... Parquet came out, I

10:14

think in like 2016. So it

10:17

was beyond like early, early adopter

10:19

phase, it was getting more usage,

10:22

starting to get more usage in like other

10:24

big data processing systems, data warehouses. And we

10:27

felt that if we use that as the

10:29

persistence format, we'd one, get

10:32

the amount of compression we needed for our

10:34

data to make it like, you know, compact

10:36

at scale. But the other

10:38

is like make it so we could share it with

10:40

other third party systems. So that was

10:42

kind of an obvious choice. Then we knew

10:44

like, we need fast analytics on

10:47

the data, right? So

10:49

that's when we started looking at arrow

10:51

as the, like, in-memory columnar data

10:53

structure, right? One of the things I

10:55

mentioned is, you know, this need for

10:57

supporting high cardinality data. But

11:00

then the other need is essentially like doing

11:02

analytics style queries on time series data so

11:04

that you can do analysis, versions

11:06

one and two of influx DB, those kind

11:09

of analytics queries were like slow because of

11:11

the way the system was architected under the

11:14

hood. And we thought if we're

11:16

going to be able to do fast analytical queries

11:18

on time series data, it's

11:21

going to have to be in this columnar

11:23

format. So we kind of adopted Arrow's in-

11:25

memory format for this data, which

11:27

then led to, you know, these other

11:29

pieces. And then in early 2020, we

11:31

looked at a number

11:34

of different query engines we could

11:36

potentially use. We looked

11:38

at DuckDB, which was still very

11:40

nascent at that time. We looked at

11:42

ClickHouse's engine, which again was nascent

11:44

compared to where it is now. And we also

11:46

looked at data fusion. And at the

11:49

end of the day, we decided that data

11:51

fusion would be our choice because you know,

11:55

it was written in Rust. And the thing is

11:57

like all three of those projects that we evaluated,

11:59

we realized there was going to be a lot of work that

12:01

we would have to do to be able

12:04

to support the time series use cases that

12:06

we were aiming for. And

12:08

we felt that if we're going to

12:10

have to do a lot of work and end up

12:12

contributing heavily to this query engine, we might

12:15

as well do it in a language that we

12:17

intend to use, which is Rust, right? DuckDB and

12:19

Clickhouse are both implemented in C++. And

12:22

we also felt that Data Fusion being

12:24

part of the Apache Foundation and being

12:26

part of the Arrow project, we're making

12:28

a bet that it would essentially start

12:31

to gather momentum and pick up steam and

12:33

there'd be other people who would contribute to

12:35

it over time. And over

12:37

the last three and a half

12:39

years that we've been heavily developing with

12:41

it and contributing to it, we've certainly

12:43

found that to be the case. More

12:46

people have been adopting Parquet, more people

12:48

have been adopting Arrow, they've been contributing

12:50

to those two and Data Fusion.

12:53

And Flight and Flight SQL are

12:55

also becoming kind of a standard

12:58

RPC mechanism, essentially for exchanging

13:02

analytic data sets or millions of rows

13:04

quickly in a high performance way. And

13:08

each of those pieces of the

13:10

stack are definitely well engineered. They've

13:13

been gaining a lot of momentum.

13:15

There's been a lot of investment

13:17

in that overall ecosystem, but they

13:19

are all, I guess they're not

13:22

as narrowly scoped in particular Arrow as when

13:24

they first started, but they are all focused

13:26

on a particular portion

13:28

of the problem. And

13:30

in order to build them into a

13:32

cohesive experience, I'm curious, what was the

13:35

engineering effort that's necessary to actually build

13:38

a fully executable database

13:40

engine and platform experience on

13:42

top of those disparate parts?

13:46

Yeah, I mean, it's certainly true that when

13:49

Arrow first started, it essentially was like an

13:51

in-memory specification. And the dream there was

13:53

essentially that you have

13:55

data scientists who are trying

13:57

to do analysis in either Python or R. Right

14:00

and the thing is they almost always have to

14:02

get their data from one place and bring it

14:05

in and Exchange it to another thing. So the

14:07

vision there was essentially how do you do? Data

14:10

interchange between these different data science

14:12

tools and systems that is zero

14:14

copy zero cost

14:16

serialization deserialization writes super super fast

14:19

and Wes and

14:21

his team started with that and

14:23

then they saw, like, okay, wait a second,

14:25

now people also have these needs to like persist

14:27

the data. So we need a persistent

14:30

format. He brought in Parquet because he also

14:32

helped found Parquet when it was first

14:34

created. But that became

14:36

an obvious add-on and then you

14:38

know, the RPC mechanism. They're like, okay, well,

14:40

now you have servers that are running things, you

14:43

need a way to exchange the data again an

14:45

obvious add-on. And Data Fusion

14:47

again, like, you need, if

14:49

you're working with this data like in Python, you

14:51

have like pandas and R, you have like these,

14:53

you know, different things, like either data frames

14:56

libraries or whatever But a lot of time people

14:58

just want to execute a SQL query and you

15:01

need an execution engine that

15:03

can work with this arrow format

15:05

natively. That's going to be super fast, right?

15:07

Anything that's fast in Python isn't actually written

15:09

in Python. It's written in C or C++ and

15:13

then wrapped. So that's

15:16

what they realized from the data science perspective

15:18

now from the perspective of people creating a

15:20

data platform Like an entire data

15:22

platform or a database server or something like that

15:26

The thing that's tricky about it is a lot

15:29

of these formats are actually they're designed for

15:31

Exchanging like a set chunk of data, right?

15:34

Like parquet is an immutable format, right? It's

15:36

not meant to be updated you write a

15:38

parquet file and that's that Arrow

15:41

again, like you don't append to arrow buffers

15:44

on the fly like you create an arrow buffer It's well

15:46

defined and then you can hand it off. So Having

15:50

a system that's basically able to ingest

15:52

data live, right, like

15:54

individual writes, individual rows that you're

15:56

writing in and being able to

15:58

combine that with this historic data

16:00

set that's represented either as arrow

16:02

buffers in memory or parquet files

16:05

on disk, right? Moving all that

16:07

data around, that becomes the really

16:09

like the trickiest part of creating

16:11

like a larger scale data

16:13

platform. It's like, how do you move that

16:15

data around? How do you combine the real

16:18

time data with the historical data? And how

16:20

do you make that all fast? And

16:22

how do you make it easy to use?
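One hedged Rust sketch of what that combination can look like, assuming the datafusion and tokio crates: recent rows live as an in-memory Arrow table while older data sits in a hypothetical history.parquet file, and a single SQL query spans both. The schema and names are illustrative, not InfluxDB's actual design.

    use std::sync::Arc;

    use datafusion::arrow::array::{Float64Array, StringArray, TimestampNanosecondArray};
    use datafusion::arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use datafusion::arrow::record_batch::RecordBatch;
    use datafusion::datasource::MemTable;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
            Field::new("host", DataType::Utf8, false),
            Field::new("usage", DataType::Float64, false),
        ]));

        // Recent writes, already converted from rows into one columnar batch.
        let recent = RecordBatch::try_new(
            schema.clone(),
            vec![
                Arc::new(TimestampNanosecondArray::from(vec![1_700_000_000_000_000_000])),
                Arc::new(StringArray::from(vec!["host-a"])),
                Arc::new(Float64Array::from(vec![0.42])),
            ],
        )?;

        let ctx = SessionContext::new();
        // Hot data: an in-memory Arrow table. Cold data: an immutable Parquet file.
        ctx.register_table("recent", Arc::new(MemTable::try_new(schema, vec![vec![recent]])?))?;
        ctx.register_parquet("history", "history.parquet", ParquetReadOptions::default())
            .await?;

        // One query covers both the real-time and the historical side.
        ctx.sql("SELECT * FROM recent UNION ALL SELECT * FROM history ORDER BY time DESC LIMIT 10")
            .await?
            .show()
            .await?;
        Ok(())
    }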

16:25

All of that work is basically a non-trivial

16:27

amount of effort, but it's

16:30

certainly made easier by the fact that

16:32

you no longer have to create the

16:34

lower level primitives, right, to

16:36

build that data platform. You don't have to create

16:38

the query engine. You don't have to

16:40

create the file format, right? Those things

16:43

basically just exist. And

16:45

there, you know, I have

16:47

heard Wes refer to it as basically the composable

16:49

data stack, right? Which is you can

16:52

kind of pick and choose these pieces that you

16:54

want to work with, right? You can use

16:56

the Data Fusion query engine, but

16:59

not use parquet at all. And,

17:01

you know, not use Flight if you don't want

17:03

to. It uses arrow under the hood, so that

17:05

kind of like comes along for the ride. But

17:08

yeah, like all of these different pieces are kind of like,

17:10

you know, they're designed to

17:13

be modular so that you can pick a

17:15

different persistence format if you want that. You

17:17

can pick a different execution engine, right? Within

17:20

the arrow ecosystem, one of the

17:22

things that Voltron data,

17:24

the company that Wes ended up starting

17:27

with some other people that backs a

17:29

lot of the arrow stuff as well,

17:31

one of the things they

17:33

created was this project called, I don't know how

17:35

to pronounce it, Velox, basically, V-E-L-O-X,

17:38

which is basically like this

17:40

execution engine that was created

17:42

in conjunction with some work

17:44

at Facebook to do stuff,

17:47

right? So the idea is you can pick and

17:49

choose these components and kind of tie them all

17:51

together into a larger,

17:54

like, operational system where you're

17:56

essentially solving problems around data

17:58

warehousing, real-time

18:00

analytics and essentially just

18:02

like working with what I

18:04

would say observational data at scale,

18:07

right? Where observational data could

18:09

be data from your

18:11

servers, applications, sensors, logs, whatever

18:13

it is. Are

18:18

you sick and tired of sales at data conferences?

18:20

You know, the ones run by large tech companies

18:23

and cloud vendors? Well, so am I

18:25

and that's why I started Data Council,

18:27

the best vendor neutral, no BS data

18:30

conference around. I'm Pete

18:32

Soderling and I'd like to personally invite you to

18:34

Austin this March 26 to 28th

18:37

where I'll play host to hundreds of attendees, 100 plus

18:40

top speakers and dozens of hot startups

18:42

on the cutting edge of data science,

18:44

engineering and AI. The

18:46

community that attends Data Council are some

18:48

of the smartest founders, data scientists, lead

18:50

engineers, CTOs, heads of data, investors and

18:53

community organizers who are all working together

18:55

to build the future of data and

18:57

AI. And as a

18:59

listener to the Data Engineering podcast, you can

19:01

join us. Get a special

19:04

discount off tickets by using the

19:06

promo code DEPOD20. That's D-E-P-O-D-2-0. I

19:11

guarantee that you'll be inspired by the folks at the

19:13

event and I can't wait to see you there. Another

19:18

interesting element of

19:20

building your platform on

19:22

top of all these open source components

19:25

is that by virtue

19:27

of it being a layered stack,

19:29

you can have additional integrations that

19:31

can come in at each of

19:33

those different layers rather than having

19:35

the main interface be the only

19:37

way of accessing the data that

19:40

it contains. It

19:42

also gives you the benefit of being

19:45

able to capitalize on the overall ecosystem

19:47

of investment and the network effects that

19:49

you get from those different open source

19:52

projects. So I'm wondering if

19:54

you can comment on some of the ways

19:56

that you've seen that benefit materialize in your

19:58

work of building this data platform

20:00

on top of these different components? Yeah,

20:03

so this is actually like one of

20:05

the things I'm most excited about for

20:08

these different pieces and for the work

20:10

we're doing, which is I think

20:14

we actually need to add another letter

20:16

to the acronym, the FDAP acronym, and

20:18

maybe like jumble them up. But

20:20

basically the other letter is I

20:22

for Apache Iceberg. So

20:25

Iceberg is essentially a catalog standard

20:27

for creating a data catalog of

20:30

essentially parquet files in object storage, right?

20:34

And we're basically building first class support for that

20:37

in InfluxDB 3.0 where all of the data

20:40

that's ingested into an InfluxDB

20:42

3.0 server can be exposed

20:44

essentially as Iceberg catalogs, which

20:47

is awesome because that's

20:49

a standard that was originally developed at

20:51

Netflix and that was open sourced out

20:53

into the Apache Foundation. And

20:55

it's quickly being adopted by

20:57

other companies, right? So Snowflake

21:00

just added support for Iceberg

21:02

as a format. Even

21:04

Databricks is adding support for it,

21:06

even though they have a competing

21:08

standard called Delta Lake. And

21:11

a lot within Amazon, the

21:13

Amazon Web Services, for example, they're

21:16

adding first class support for Iceberg

21:18

so that if you have

21:20

data that's exposed as an Iceberg catalog in

21:22

S3, you can then

21:25

query that data using any

21:28

of the Amazon query services

21:30

like Athena or Redshift or

21:32

all these different pieces. So

21:35

that I think is a really

21:38

interesting integration because it makes it

21:40

so that you can access this

21:42

data in bulk, right? So if you want

21:45

to need to train a machine learning model

21:47

or whatever, or query against this

21:49

data for doing large scale analytical

21:51

queries, and be,

21:54

for InfluxDB 3.0, for example, totally outside

21:56

the operational envelope of the system that's

21:58

managing all this real-time

22:00

data movement, being able to query in real-time,

22:04

you can basically do all these analytics

22:06

tasks completely disconnected from that. And

22:09

again, you could use

22:11

Data Fusion for that, but

22:13

you could also use Athena, right? Which

22:15

is based on a Java query engine

22:18

called Trino or

22:21

Presto or whatever it is now. Or

22:24

you could use DuckDB or Clickhouse or

22:26

any one of these other systems to

22:29

do your query processing and analytics against

22:31

that data. So that

22:34

integration I think is super interesting. The other

22:36

one that I think is interesting is within

22:40

the Arrow project. So

22:42

they have FlightSQL is

22:44

basically like an RPC mechanism for

22:46

essentially sending SQL queries to a

22:48

server and getting back millions of

22:50

rows really, really quickly.
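As a rough illustration of that flow, here is a client-side sketch in Rust using the arrow-flight crate's Flight SQL support; the endpoint and query are hypothetical, and exact method signatures vary by crate version.

    use arrow_flight::sql::client::FlightSqlServiceClient;
    use futures::TryStreamExt;
    use tonic::transport::Channel;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let channel = Channel::from_static("http://localhost:8082").connect().await?;
        let mut client = FlightSqlServiceClient::new(channel);

        // The server plans the query and hands back one or more endpoints with tickets.
        let info = client
            .execute("SELECT * FROM metrics LIMIT 1000".to_string(), None)
            .await?;

        // Redeem each ticket; results stream back as Arrow record batches,
        // so millions of rows never go through row-by-row serialization.
        for endpoint in info.endpoint {
            if let Some(ticket) = endpoint.ticket {
                let mut stream = client.do_get(ticket).await?;
                while let Some(batch) = stream.try_next().await? {
                    println!("received {} rows", batch.num_rows());
                }
            }
        }
        Ok(())
    }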

22:52

And they have basically a new standard that

22:54

they've created that's kind of like

22:56

competing with ODBC. ODBC

22:59

is obviously the database connection standard.

23:01

It was for essentially transactional

23:04

databases and relational databases. The

23:07

Arrow one, once that

23:09

becomes a thing, I think it will be

23:11

a really like a standard way to

23:14

connect to analytical data

23:16

stores of any kind, whether it's

23:18

data warehouses or real-time data systems

23:20

or whatever. And I think those,

23:24

like having those things

23:26

be standards and have them contributed to

23:28

by many different companies, not just supported

23:30

by a single vendor, I

23:33

think will make it

23:35

the pace of innovation in this

23:37

space for these large

23:39

scale data use

23:41

cases which are only

23:43

gonna continue to increase and multiply. I

23:45

think it makes it so that we can

23:48

have basically many more

23:50

tools that can integrate with

23:52

each other. If

23:54

you look at data warehousing for the last 20

23:57

years, it's

23:59

largely been like your data

24:02

warehouses are basically kind of

24:04

like data roach motels. Like

24:07

your data goes in and you have

24:09

to get all the data in the data warehouse, but then

24:11

if you want to do anything with it, you have to

24:13

send the query to the data warehouse and like all this

24:15

other stuff, right? And there's just not, there's

24:17

not this really good integration, like the data

24:19

warehouse just becomes this one place. So

24:22

being able to access it from a bunch

24:24

of different tools without having

24:27

one piece of software be the arbiter of the

24:29

entire thing, I think is really interesting. Absolutely.

24:32

And to your point

24:34

of Flight SQL being

24:36

a new RPC mechanism to

24:38

unlock a lot of potential and reduce

24:40

a lot of the pains, it just

24:43

makes me sad that I obtained all

24:45

of that scar tissue around ODBC for

24:47

nothing. I

24:49

mean, I think ODBC is going to be around

24:51

for a very long time. I don't think it's

24:54

going away. Yeah, absolutely.

24:57

And the counterpoint to

24:59

the benefits that you get building on

25:01

top of open source is that particularly

25:03

when you have a business that is

25:06

being powered by these components, you

25:08

adopt some measure of platform risk

25:10

because you're not the only person

25:13

who has a vision for the

25:15

future direction of these technologies. And

25:18

some of that future direction may or

25:20

may not be compatible with the vision

25:22

that you have for it. And I'm

25:24

curious how you think about that platform

25:26

risk and the mitigating factors that you

25:28

have in the engineering that you're doing

25:31

to account for any potential future shift

25:33

in the kind of vision and direction

25:35

of those products. Yeah,

25:38

I mean, you can wrap the

25:40

libraries with your own abstractions, but

25:42

the problem is that comes with

25:44

a high price, a high cost.

25:47

And the truth is even if you wrap it with your own

25:49

abstractions, if the libraries end up changing

25:51

significantly and you're like, okay, we need to replace it

25:54

with something else, it's gonna

25:56

be like a non-trivial task. The

25:58

best insurance is

26:00

essentially to have enough people contributing to the

26:02

core of the thing to be

26:04

able to have some level of

26:07

influence on the direction of the project. Ultimately,

26:10

there's gonna be platform risk,

26:13

but I think, take

26:16

it from the other side, which is we

26:18

decide to develop all this stuff ourselves and keep

26:21

it close source and just whatever. Well,

26:24

the risk there is like, I mean,

26:26

that's just an absolute mountain of work

26:28

to do. And

26:30

I think it's like, as

26:34

these projects have matured, like I said, we've

26:36

seen other people contributing to them. So now

26:39

we regularly get performance improvements in

26:41

the query engine or new functions

26:43

in the query language. And all

26:45

of this stuff, we help manage

26:47

the project, we have people contributing

26:49

to, we make significant

26:52

investments into the open source pieces. But

26:55

those are things that we kind of get for

26:58

free as a

27:00

result, essentially, it means

27:02

that the risk we have if we kept

27:04

it all closed source is that our pace of

27:06

development would be outpaced, outmatched

27:09

by the set of people contributing

27:11

to this open thing, right?

27:13

We may be able to

27:15

get somewhere initially,

27:18

but eventually, the open source

27:20

people are gonna like outpace a

27:22

small team of proprietary developers. Now,

27:24

if you have unlimited resources, and

27:27

you can basically just like, you

27:29

know, create, you

27:32

know, a long lived team of people that you're able to fund

27:34

forever, then the situation changes.

27:37

But I think for startups in

27:40

the technology space, like, their

27:42

best bet is to adopt platform

27:44

pieces that are not, that, you know, that

27:47

you can contribute to, that can form the

27:49

basis of the things you're building, right, like,

27:51

and this is, you know, you

27:53

don't create your own operating system, right? You

27:55

use Linux, and you don't create your own

27:57

programming language, you use whatever language you're gonna

27:59

use there. And I think all

28:02

that stuff happens, it happens

28:05

higher and higher. All these pieces

28:07

kind of like build on each other. In

28:09

this case, like when we're talking about the

28:11

FDAP stack and all these different components, they're

28:14

essentially the toolkit that you would

28:16

use to build a database, an

28:18

analytical database or a data warehouse,

28:20

right? So why create

28:22

those things from scratch, right? Your ultimate

28:24

goal is not really to create a

28:27

data warehouse, it's to deliver value for

28:29

your customers who are actually paying for

28:31

the solution. And they don't really care about

28:33

a data warehouse per se, they care

28:35

about solving their data problem for their

28:37

customers. So as

28:39

much as you can adopt to say

28:41

like, okay, this isn't gonna be

28:43

our thing that we innovate on. This is gonna be that, that's

28:46

not how we actually add value to this

28:48

market, to this thing that we're selling. This

28:51

is basically just like a barrier to entry.

28:54

And if you can adopt an open source

28:56

thing that like reduces the barrier, then right.

28:59

Absolutely, and by virtue of being

29:01

involved with and participating in the

29:03

open source projects that you're relying

29:05

on, you also get the benefit

29:07

of early warning of knowing that,

29:10

okay, this is the future direction that

29:12

the community would like to see. And

29:15

so now I can proactively plan for

29:17

those shifts in the underlying technology so

29:19

that I can accommodate them in the

29:21

end result that I am building on

29:24

top of it. Yeah,

29:26

well, and ultimately like the absolute

29:28

worst case scenario, right, is like

29:30

the community is gonna make some

29:32

weirdo changes. They're just completely incompatible

29:34

with what we need to do.

29:38

Great, then we can just fork the

29:40

project from whatever that last point was.

29:43

It's permissively licensed open source. We can fork

29:45

a project and then we have two options.

29:47

Do we make our fork closed

29:50

source? Or do we

29:52

make our fork something publicly available and you

29:54

just continue on from there, right? And

29:57

at that point you haven't adopted any more

30:00

risk than you would have

30:02

had anyways, you know, with your closed source

30:04

thing. Although I will say, like, like

30:07

I mentioned, we, we spend a

30:10

lot of time contributing to these

30:12

community projects. So there's a

30:14

there's a good amount of effort that we

30:17

put forward that essentially doesn't benefit us directly.

30:19

Right? It's not that we're doing

30:21

this community thing or managing these like

30:23

efforts, with different people contributing or

30:26

whatever, because it's something we need specifically

30:28

for our products. But

30:30

again, the bet is that, you know, like,

30:34

okay, there are a bunch of things we'll do, they're

30:36

not direct benefit to us, but there are other things

30:38

coming in from the community that are so it

30:41

all kind of like, evens out.

30:43

And actually, in our, you know, in my

30:45

experience, it doesn't even out like we get

30:47

far more out of it than we

30:49

put in, even though we,

30:51

like, like I said, we try to

30:53

put in as much as we possibly can. It's

30:56

just that when you have, you

30:58

know, dozens of developers from around the

31:00

world and different companies contributing

31:02

to this thing, like, the

31:05

sum is going to be greater than

31:07

what any one individual or one company produces

31:09

and puts into it. And

31:11

so looking at the

31:14

component pieces of this stack and

31:16

the overall architecture

31:18

and system requirements for a

31:21

database engine, what are the

31:24

additional pieces that you had to

31:26

build custom? What is the work

31:28

involved in building a polished user

31:31

experience on top of these different

31:33

components? And some of the

31:35

ways that you're thinking about what are the

31:37

appropriate abstraction layers? Or what are the appropriate

31:39

system boundaries for what these four

31:42

pieces of the stack do and the eventual inclusion

31:44

of iceberg? And what is the

31:46

responsibility of influx as

31:49

the database experience that needs to be built

31:51

on top of it? Yeah, so

31:53

I mean, basically, like, these components are

31:55

really just libraries, right? They're just programming

31:57

libraries that we use. So they're not

32:00

actually a piece of running software that will

32:02

do anything on its own. I mean

32:05

Data Fusion does have like a command

32:07

line tool where you can say like point

32:09

it at you know a file and execute

32:11

a query against it if it's CSV or

32:13

JSON or parquet right. But beyond

32:15

that it's not like a process

32:18

that will run on a server that will respond to

32:20

requests and all this other stuff. So you kind of

32:22

have to build all that scaffolding

32:24

around it right. You have to build a

32:26

server process and you have to just decide

32:28

what your API is going to be right.

32:31

For writing data in most people are not

32:33

going to want to write you

32:36

know arrow record batches or

32:38

parquet files in because those

32:40

two formats actually aren't super

32:43

easy to create yourself. Like

32:45

usually when people create those

32:47

formats they do it as a

32:49

transform from some other data

32:51

that's easier to work with like

32:53

CSV or JSON or whatever right.

32:55

So you have to decide

32:57

like how do you write data in what's that format

32:59

how do you translate it to arrow or

33:02

Parquet.
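For example, one common pattern, sketched here with the Rust arrow and parquet crates, is to accept an easier row-oriented format such as CSV, read it into Arrow record batches, and persist those with an ArrowWriter; the file names and schema are invented, and builder method names differ slightly across crate versions.

    use std::fs::File;
    use std::sync::Arc;

    use arrow::csv::ReaderBuilder;
    use arrow::datatypes::{DataType, Field, Schema};
    use parquet::arrow::ArrowWriter;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("host", DataType::Utf8, false),
            Field::new("usage", DataType::Float64, false),
        ]));

        // Rows arrive as CSV; the reader yields Arrow record batches.
        let input = File::open("writes.csv")?;
        let reader = ReaderBuilder::new(schema.clone())
            .with_header(true)
            .build(input)?;

        // Each batch is appended to a single immutable Parquet file.
        let output = File::create("writes.parquet")?;
        let mut writer = ArrowWriter::try_new(output, schema, None)?;
        for batch in reader {
            writer.write(&batch?)?;
        }
        writer.close()?;
        Ok(())
    }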

33:04

You need to decide like for the query interface like

33:06

SQL the language but then how are

33:08

they going to make the request right.

33:10

Is it going to be HTTP, gRPC, whatever,

33:13

and then what is the response format going to be. Do

33:16

you want to give them arrow? Do you want to give

33:18

them parquet? Do you want to give them CSV, JSON, something

33:21

else right. So all

33:24

those pieces you kind of have to decide on

33:26

and create right.

33:28

Basically the entire like piece

33:30

of server software and then there's you

33:33

know all the operational pieces which is if you

33:36

have to run this in

33:38

a Kubernetes cluster if you have to run this in the

33:40

cloud or whatever and also for

33:43

us for InfluxDB 3,

33:46

you know we have currently what we have

33:48

is a distributed version of the database where

33:50

it's comprised

33:52

of a number of different services that

33:55

run inside a Kubernetes cluster right and

33:57

we've separated out the ingestion tier from

33:59

the query tier from compaction

34:02

from a catalog that runs,

34:04

right? So

34:07

we basically had to create services

34:09

for each of those and APIs for how they

34:11

interact with each other. And then a bunch

34:13

of like tooling and stuff like that

34:15

to actually monitor, you know, spin this up on

34:17

the fly and monitor it, run it, all

34:20

that separate stuff. So I mean, it's still

34:22

like, if you're going to adopt

34:24

these components to build, you

34:26

know, a data system, there's still a

34:28

lot of work to do, but yeah. For

34:34

people who are interested

34:36

in building some database engine, or they

34:38

are interested in the functionality of any

34:40

of these different pieces, I'm curious what

34:43

you see as some of the other types

34:47

of projects that would benefit from

34:49

the capabilities of any or all

34:51

of those pieces of the stack,

34:54

and maybe some of the other elements

34:56

that could be built up and added

34:58

to that ecosystem to maybe reduce the

35:00

barrier to entry that you've had to

35:02

pay. Yeah,

35:04

I mean, so what

35:06

it seems like a

35:09

bunch of different kinds of projects are starting

35:11

to adopt and companies are starting to

35:13

adopt these pieces of the stack. So, you

35:16

know, I just saw one yesterday,

35:19

there was basically like a new

35:21

stream processing engine that essentially is

35:23

using Data Fusion and thus also

35:25

Arrow as the way to

35:28

do, you know, processing within the

35:30

stream processing engine, right? So

35:32

you can execute SQL queries against like data

35:34

coming in a stream, whatever. So there's that,

35:37

there are different kinds of

35:39

database systems, either time series

35:41

database or document database or

35:43

data warehouse or whatever. Like

35:45

I've seen a number of projects

35:49

in either open source or in companies that

35:51

are starting that to use those components. There's

35:55

another project right now where

35:57

contributors from Apple are basically

36:00

putting in essentially a Spark execution

36:02

engine, which

36:05

is based on Data Fusion. Essentially

36:08

this is a replacement for the

36:10

open source Java Spark implementation that's

36:13

supposed to be faster and stuff like that. So basically

36:15

you see like one component within

36:17

Spark is being replaced with Data

36:20

Fusion as part of this. And

36:22

actually the creator of Data Fusion,

36:24

Andy Grove, was originally creating

36:27

Data Fusion for that use case inside

36:29

of NVIDIA. So

36:32

you see like all these different companies

36:34

like creating those different pieces. I

36:37

think it's still early

36:39

for the Rust ecosystem of tools

36:43

to see what's gonna happen, like what open source

36:45

projects are going to become kind of big, right?

36:48

Right now when you think of like big

36:50

data processing tools, most of that

36:52

environment is in Java, right? It started

36:54

with Hadoop and then continued

36:57

with Spark and like all

36:59

the different components there and right and Kafka's

37:01

written in Java and Flink's written in Java,

37:03

right? So you have different stream processing systems

37:05

and all these things kind of integrate together.

37:08

What I anticipate is that over

37:11

the next 10 years, you see

37:13

a lot of those systems rewritten,

37:15

recreated, using Rust and

37:18

using Data Fusion and Arrow and

37:20

Parquet as the underlying

37:22

primitives. And ideally they wouldn't

37:24

just recreate the exact same thing, you

37:26

know, but instead of Java, it's at

37:28

Rust. There will certainly be some of that, but

37:31

ideally what they will do is they will take,

37:34

you know, a lot of lessons learned from those

37:36

previous versions of those pieces of software to

37:38

the like, okay, how can we make the

37:41

user experience better, right? So it's easier to

37:43

express the kind of things we wanna express

37:45

or how do we make operations better? So

37:47

it's easier to like operate these systems at

37:49

scale. So I think

37:51

it's really early yet though. It's

37:53

not clear to me like from

37:56

an open source perspective, what projects are gonna be

37:58

the winners here that eventually

38:02

supersede the previous Java

38:05

systems. Absolutely. And

38:07

I've definitely been seeing a little bit of that

38:09

as well, even three to five years

38:12

ago of C++ being the

38:15

implementation target, particularly built around

38:17

the Seastar framework for being

38:19

able to take advantage of

38:21

multi-CPU architectures, most notably

38:24

the ScyllaDB project as a

38:26

target to re-implement Cassandra and

38:28

then Red Panda taking on

38:31

the Kafka ecosystem. And

38:34

another interesting aspect of this

38:37

space is Arrow as the

38:39

focal point of that data

38:41

interchange has been gaining a

38:43

lot of ground. It started off as a

38:46

very nascent project. There's been a lot of

38:48

effort put into making that more of the

38:51

first target rather than

38:53

being a second consideration. And

38:56

it's been working on integrating with the

38:58

majority of the components of the data

39:00

ecosystem. I'm wondering what you

39:02

see as some of the remaining gaps

39:05

in coverage or some of the white

39:08

spaces in the overall Arrow ecosystem that

39:10

are either immature or completely absent

39:13

and spaces that you would like

39:15

to see the overarching data community

39:17

invest in building out

39:19

more capabilities and capacity. So

39:23

I think there's still probably some work

39:25

to be done within Arrow as a

39:27

specification itself for representing data

39:30

in a more compact form. For

39:34

some kinds of like columnar data, it's just not

39:36

as efficient as I think it could be. But

39:39

originally, I think that was a

39:42

result of one of the design goals, which

39:44

was essentially O(1) lookup for any

39:46

individual element within the set.

39:49

I think if that constraint is loosened,

39:51

it opens up the possibility for other

39:53

kinds of compression techniques

39:55

and stuff like that that will make it

39:58

a better format for compressed

40:01

data in memory, which I think is

40:03

something that would be potentially interesting. I

40:07

think there's still a

40:09

question of like, okay, if

40:11

we're going to have a

40:13

stream processing system that uses

40:16

these tools, what does that look like?

40:18

Because Arrow as a format

40:20

actually is not well suited

40:23

for stream processing, right? Because it's a

40:25

columnar format, so the

40:28

conceit there is that you are sending

40:30

in many, many rows at the same

40:32

time, whereas when you think of stream

40:34

processing, you think of either

40:36

micro batching or individual rows like

40:38

one by one, right? So there's

40:41

no good translation layer between, okay,

40:43

if you're moving, if you care

40:45

about doing stream processing and you

40:47

want to move to Arrow or

40:49

batch processing, larger scale data

40:51

processing, how do you make that transition? And

40:55

what do the tools look like

40:57

for that? I think that's still very difficult,

40:59

right? And it's certainly like something

41:01

we've done in InfluxDB 3, which is like

41:04

translating line protocol,

41:06

individual rows being written in, to

41:08

the Arrow stuff.
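A minimal sketch of that row-to-columnar hand-off, using the Rust arrow crate; the Point struct, field names, and batching policy are hypothetical rather than InfluxDB's actual ingest path.

    use std::sync::Arc;

    use arrow::array::{Float64Builder, StringBuilder, TimestampNanosecondBuilder};
    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    struct Point {
        time_ns: i64,
        host: String,
        usage: f64,
    }

    // Buffer individually written points, then flush them as one Arrow record batch.
    fn points_to_batch(points: &[Point]) -> Result<RecordBatch, ArrowError> {
        let mut times = TimestampNanosecondBuilder::new();
        let mut hosts = StringBuilder::new();
        let mut usages = Float64Builder::new();

        // Arrow buffers are built once and treated as immutable afterwards,
        // so rows accumulate in builders rather than being appended to a finished batch.
        for p in points {
            times.append_value(p.time_ns);
            hosts.append_value(&p.host);
            usages.append_value(p.usage);
        }

        let schema = Arc::new(Schema::new(vec![
            Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
            Field::new("host", DataType::Utf8, false),
            Field::new("usage", DataType::Float64, false),
        ]));

        RecordBatch::try_new(
            schema,
            vec![
                Arc::new(times.finish()),
                Arc::new(hosts.finish()),
                Arc::new(usages.finish()),
            ],
        )
    }

    fn main() -> Result<(), ArrowError> {
        let batch = points_to_batch(&[Point {
            time_ns: 1_700_000_000_000_000_000,
            host: "host-a".into(),
            usage: 0.42,
        }])?;
        println!("{} rows buffered into one batch", batch.num_rows());
        Ok(())
    }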

41:10

I think the

41:12

distributed query processing is something

41:15

that is probably going to

41:17

get more work. It's

41:20

definitely something that needs more work within the Data

41:22

Fusion piece itself. I

41:25

think later this year, I think in

41:27

a couple of months, hopefully they're going to

41:29

vote on whether Data Fusion becomes its own

41:32

top level Apache project outside of Arrow. My

41:35

best guess is that's going to happen. And

41:37

then what we'll probably see is like

41:39

Data Fusion will then have some sub

41:41

projects, one of which I think will

41:44

be around distributed query processing, which I

41:46

think will be important for

41:48

it really to become a contender

41:50

and a competitor in the larger

41:52

scale data warehousing space. What

41:57

else? I

42:00

don't know, like Parquet has gotten some

42:02

interesting improvements along the way. I think,

42:04

I don't know, there was like Geo

42:06

Parquet for representing geospatial data. I think

42:08

that's going to be super important. So

42:11

yeah. This might

42:13

be a little bit too far afield or

42:16

too deep in the weeds, but there was

42:18

also for a little while a bit of

42:20

contest between Parquet and ORC as the preferred

42:22

columnar serialization format. I'm wondering if you have

42:25

seen the dust settle around that and

42:27

there has been a general consensus around one

42:29

or the other, or if those are still

42:31

kind of a case by case basis, do

42:34

what you think is right for different

42:36

use cases. I

42:39

may just be biased because I'm looking

42:42

for Parquet, but I don't, I remember

42:44

that being a thing and I remember

42:46

looking at formats from a high level

42:49

back in the day, but I don't really

42:52

see ORC as a format

42:55

coming up nearly as much. It

42:58

seems to me that Parquet

43:00

has kind of won

43:03

the mind share largely, and that's what

43:05

people kind of coalesced around. Now of

43:07

course, because we're talking

43:09

about data at scale, there are probably

43:12

like mountains of data in people's data

43:14

lakes and data warehouses that is represented

43:16

as ORC. So that's not going

43:18

to go away, but by and

43:20

large, what I see is that Parquet

43:23

seems to be the standard format

43:25

that all the big data vendors are coalescing

43:28

around. I've been

43:30

seeing a similar thing. And

43:33

then to the point of

43:35

streaming and record-based ingest of data

43:37

versus the columnar approach for

43:39

Parquet and Arrow, I know

43:42

that Avro and Parquet have

43:44

a defined kind of translation

43:46

method of being able to

43:48

compact multiple Avro records into

43:50

a Parquet file. And

43:53

I'm curious if you're seeing anything

43:56

analogous for the Arrow ecosystem of

43:58

being able to maybe manage

44:00

that translation of multiple AVRO

44:02

records batched into an Arrow

44:04

buffer that can then subsequently

44:06

be persisted into parquet or

44:09

using that Avro to Parquet translation as

44:11

the intermediary to then get loaded into

44:13

an Arrow buffer? I

44:16

mean, I haven't really seen that. I mean, there's because,

44:18

I mean, it's pretty easy

44:20

to go from Arrow to Parquet or Parquet to

44:22

Arrow, right? Because, you know,

44:25

Parquet is within the Arrow umbrella.

44:28

So people

44:30

in the various projects have created

44:32

a bunch of like translation layers to do that.

44:35

I haven't seen, I really haven't

44:37

seen any like rise of like, oh, these

44:40

like row based formats into either

44:42

Arrow or Parquet,

44:45

it just seems to be like, kind

44:47

of one off. I honestly,

44:49

I don't see AVRO come

44:52

up that much. So mainly,

44:54

I think what I see the most what

44:56

people care about is like JSON data, just

44:59

because it's so easy, you know,

45:01

to change between different languages and different

45:03

services. And honestly, I

45:05

think Protobuf more than,

45:07

more than AVRO or anything else. I

45:10

think that's mean maybe because of, you

45:12

know, the popularity of gRPC. As

45:16

you have been investing in

45:18

this ecosystem, building on top of the different

45:20

components, I'm wondering what are some of the

45:23

most interesting or innovative or unexpected ways that

45:25

you have seen some or all of those

45:27

pieces used together? So

45:30

honestly, stream processing was a surprise for

45:32

me, because I like I didn't, when

45:35

I think of like Arrow and, and Data

45:38

Fusion, like I wasn't originally thinking that people

45:40

would use these things for stream processing systems,

45:43

right? I think more like, it's there around

45:45

like fast processing and do it, you know,

45:47

I execute a query against this data, whatever.

45:50

So having people seeing people pull that

45:52

stuff into the stream processing systems has

45:55

been very surprising. Elsewhere,

45:57

I'm not sure, like I think So

46:01

I've seen a few observability

46:04

solutions start to look seriously at using

46:06

Parquet as the persistence format. That's

46:08

a little surprising too, mainly because

46:12

when I think about observability, it's largely like,

46:14

oh, you think of like metrics, log traces,

46:16

right? And generally what people

46:18

have done is they've created specialized formats

46:21

and backends for each of those individual

46:23

use cases. So I've

46:26

seen some people start to look

46:28

seriously at having Parquet represent

46:30

like any of that kind of data, which

46:33

I think to me that's

46:35

definitely like one of our visions

46:37

long term is that being

46:39

able to store any kind of observational data

46:41

in influx and thus in Parquet, but

46:44

to see more observability vendors start to look at that

46:46

seriously has been a bit of a surprise too. And

46:51

in your experience of working in

46:53

this space, rebuilding the influx database

46:55

and investing more into the Arrow

46:57

ecosystem, what are some of the

46:59

most interesting or unexpected or challenging

47:01

lessons that you've learned in the

47:03

process? I mean, one of

47:07

the lessons, which is somehow a lesson I always

47:09

really need to relearn as a software developer is

47:11

things always take longer than you expect them to

47:13

take. So this

47:16

project, like I said, you know, we

47:18

started seriously thinking about it about four

47:21

years ago, really serious development on it

47:23

for the last three and a half.

47:25

It's basically just a long, a long

47:27

road to create this kind of system.

47:30

So, yeah. I've

47:33

been pleasantly surprised by the adoption

47:35

by actually

47:37

the level, the level of

47:39

contribution from outside people

47:42

at actually companies

47:44

of a very significant

47:46

size has been also a bit

47:49

of a surprise. I think

47:54

for companies that reach like

47:56

crazy scale, which

47:58

are companies that you know, the names of.

48:00

Like, I think many

48:02

of them are contributing to these projects, because

48:05

they kind of have to like, create

48:07

their own things, because literally nobody on earth has

48:09

the kind of scale problems they have, except for

48:11

maybe like 10 or 20 different

48:13

companies. So they end up having

48:16

to roll their own solution. And again,

48:18

I think the

48:20

fact that these companies are contributing is something

48:23

I didn't expect, particularly this

48:25

early on. And

48:28

I think that speaks to, you know, the

48:30

thing we were talking about earlier, which is

48:32

like, what kind of platform risk is there

48:34

to adopting this code? And it's like, well,

48:36

the alternative is, you create all

48:38

this closed source software that is really like, there's not

48:40

a problem you're trying to solve. This is just like

48:42

the problem you have to solve to get to the

48:44

problem you're trying to solve. So that's,

48:47

that's been, like, I think

48:49

a pleasant surprise, seeing seeing this,

48:51

you know, mature over the last

48:54

few years. And for

48:56

people who are looking to build data

48:59

systems, data processing engines, what are

49:01

the cases where the FDAP stack

49:03

is the wrong choice? So

49:07

I, I don't think

49:09

it's particularly designed for OLTP workloads,

49:11

right? So we have traditional relational databases

49:13

and stuff like that. Like, there

49:16

are places where you know, you can, it

49:19

would make sense to have it as like, essentially

49:22

like an interface point. But,

49:25

I mean, you can certainly use like data

49:27

fusion as your query engine in an

49:30

OLTP workload. But to me, it

49:32

wouldn't make sense to use like Arrow as

49:34

a way to ingest data or parquet. Because

49:36

really, when you think about OLTP workloads,

49:38

you think about individual requests with individual record

49:41

updates and stuff like that. So

49:43

I really do think these tools are

49:45

more geared towards larger

49:48

scale analytical workloads against,

49:51

you know, data that you can largely view

49:53

as immutable, right? This is like observational data

49:56

and stuff like that. So, yeah.

50:00

And as you continue to build and

50:02

iterate on the new version of InfluxDB

50:04

and invest in the Arrow ecosystem and

50:06

the components we've been discussing, what are

50:08

some of the things you have planned

50:10

for the near to medium term or

50:12

any particular projects or problem areas you're

50:14

excited to dig into? So

50:16

as I mentioned, the thing

50:18

I'm most excited about is essentially like more

50:20

integration, adding support

50:23

for Apache Iceberg. So

50:26

what that's going to, so there's already like a

50:28

Rust project to do Apache Iceberg, but it's not

50:30

like fully baked yet. So we may need to

50:32

contribute to that, or maybe the people who are

50:34

working on it will get it fully baked before

50:36

we actually get to the point where we're pulling

50:38

it in. So

50:42

Apache Iceberg is a big thing. I

50:44

think in the medium term,

50:46

the distributed processing stuff and data fusion

50:48

is going to be super interesting. And

50:51

then from InfluxDB's perspective, as

50:53

I mentioned, like we have right now

50:56

our commercial distributed version of

50:58

the database, but this year

51:00

we're coming out with the

51:02

open source version of the

51:04

monolithic single server version of

51:06

the database. And getting that

51:09

open source piece out there with

51:11

like a new version 3 API

51:13

that kind of represents a much

51:15

richer data model than previous versions

51:17

of InfluxDB that takes advantage of

51:19

what you can do with Arrow and

51:21

Parquet as the

51:23

formats that I'm

51:25

actually really, really excited about. Because then I

51:28

really think that from

51:30

a technology perspective, InfluxDB will actually

51:32

be able to fulfill the like

51:34

vision that we've had all along, which is that essentially

51:37

it is useful

51:39

for any kind of observational data you

51:41

could think of, not just like metrics

51:43

data from your servers or networks or

51:45

your apps. Are there

51:47

any other aspects of the work

51:50

that you've been doing on the

51:52

InfluxDB engine, the work that you've

51:54

been doing investing in and building

51:56

on top of the Arrow ecosystem

51:58

or the overall space of how

52:01

the Arrow ecosystem might influence the future

52:03

direction of the data processing ecosystem that

52:05

we didn't discuss yet that you'd like

52:07

to cover before we close out the

52:09

show? I don't think

52:11

so. Like, I think we kind of, I

52:13

mean, I guess like more

52:16

broadly, like the way the way I

52:18

view like the data space right now

52:20

when you're talking about these like analytical

52:22

data is there's this

52:25

kind of like distinct

52:27

separation between like data warehousing on

52:29

one side, which is these large

52:31

scale analytical queries and stuff like

52:33

this, and like stream processing on

52:36

the other, which is more about like real time

52:38

data as it arrives. I think the

52:41

trend like really when I think about those two

52:43

things, like ultimately, like what developer

52:45

wants and what users want is basically some

52:48

magical oracle in the sky that they can

52:50

like send a query to

52:52

where the result will come back in, you know, some

52:54

50 milliseconds. And

52:57

if we had that, we wouldn't need stream processing, we wouldn't

52:59

need like all these different things. But

53:02

I think as the technology improves

53:04

and things get better and better,

53:06

data warehousing is going to become more

53:08

real time. And the real time pieces

53:11

are going to, you know, move more

53:13

towards like data warehousing, because ultimately, like

53:15

people don't want to think about separating

53:17

stream from data warehousing, whatever. And

53:20

one of the things I'm excited about is essentially

53:22

the idea that these different building

53:25

blocks could potentially be the things

53:28

that people use to kind of close that

53:30

gap and create, you

53:32

know, a big data solution that

53:34

works either for real time data or for,

53:37

you know, big scale data warehousing. But

53:40

I thought people liked reinventing the lambda architecture.

53:45

Oh, no, yes, they do. They

53:49

just like to call it something new. Maybe

53:51

it's the kappa architecture. All

53:54

right. Well, for anybody who wants to get in

53:56

touch with you and follow along with the work

53:58

that you're doing on the web, I'll have you add

54:01

your preferred contact information to the show notes. And

54:03

as the final question, I'd like to get your

54:05

perspective on what you see as being the biggest

54:07

gap in the tooling or technology that's available for

54:09

data management today. The biggest gap?

54:12

Oh, I

54:15

don't know. I don't

54:17

know, actually. I mean, obviously,

54:20

like, I

54:22

think the most interesting side of this is

54:24

essentially like, you know, time series data and

54:27

basically being able to represent, being able to

54:29

do analysis on data as time series. So

54:32

that's our focus. That's

54:34

what I think is the most interesting thing right now.

54:37

But yeah,

54:41

I still think that's an unsolved problem by

54:43

us or anybody else. So that's what we're

54:45

working towards. All right. Well,

54:48

thank you very much for taking the

54:50

time today to join me and share

54:52

the work that you've been doing, both

54:54

contributing to and building on top of

54:56

the Arrow ecosystem and the components thereof.

54:58

It's definitely a very interesting

55:00

area of effort. It's great to see the work that

55:03

you and your team are doing to help bring all

55:05

of us forward in that space. I appreciate the time

55:07

and energy you're putting into that, and I hope you

55:09

enjoy the rest of your day. Cool.

55:12

Thank you. Thank

55:19

you for listening. Don't forget to check

55:22

out our other shows, Podcast.__init__, which covers

55:24

the Python language, its community, and the

55:26

innovative ways it is being used. And

55:28

the Machine Learning Podcast, which helps you

55:30

go from idea to production with machine

55:32

learning. Visit the site at dataengineeringpodcast.com, subscribe

55:35

to the show, sign up for the

55:37

mailing list and read the show notes.

55:40

And if you've learned something or tried out a product from the

55:42

show, then tell us about it. Email

55:44

hosts at dataengineeringpodcast.com with your

55:46

story. And to help other people

55:48

find the show, please leave a review on Apple

55:51

Podcasts
