Open source, on-disk vector search with LanceDB

Released Tuesday, 19th December 2023

Episode Transcript
0:07
Welcome to Practical AI. If you work in artificial intelligence, aspire to, or are curious how AI-related technologies are changing the world, this is the show for you. Thank you to our partners at Fastly for shipping all of our pods super fast to wherever you listen. Check them out at fastly.com. And to our friends at Fly: deploy your app servers and database close to your users. No ops required. Learn more at fly.io.

0:44
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am CEO and founder at Prediction Guard. And I'm joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

0:59
I'm doing good today. How's it going, Daniel?

1:01
Oh, it's going great. We were just remarking, before actually starting the recording, that one of the great things about doing these episodes is that we get the excuse to bring on the show the coolest open source tooling and other projects that I'm using day to day and get the chance to interact with. And one of those is LanceDB. We're really excited today to have with us Chang She, who is the CEO and co-founder at LanceDB. Welcome.

1:34
Thanks. Hey, guys. Super excited to be here. Thanks for having me on.

1:39
Yeah, yeah. Well, first off, congrats on all your success. I was scrolling through LinkedIn and saw a video of LanceDB up on the NASDAQ screen in Times Square. So that was cool to see.

1:52
That must mean good things, I'm assuming.

1:55
Yeah, that was made possible by Brex and also SMCC. So big thanks goes out to them.

2:03
Cool. Cool. Yeah. Well, as I mentioned, I've had a chance to look through some of what you're doing and actually use it day to day. That was actually a result of a previous episode, I think titled "Vector databases beyond the hype," with Prashant. The question that we asked him was, oh, there are all these vector databases, you've compared all of them; what are some of the things, or some of the vector databases, that stand out in terms of what they're doing technically or how they're approaching things? And one of the ones he called out was LanceDB. In particular, he was talking about the on-disk index stuff, and I'm sure we'll get into that in a little bit more, but that's how I got into it. So I recommend listeners maybe go back and get some context from that episode. But as we get into things, could you maybe give us a little bit of a picture as to how LanceDB came about? I know there's a lot of hyped vector database stuff out there, and people might not realize how these things were developed, how they came about, what the motivation was. So if you could just give us a little bit of a sense of that, at least for LanceDB.

3:19
Yeah, absolutely. And first I wanted to give a big shout out to Prashant as well. As you were saying, there's a lot of hype and noise in this area. There are a lot of different choices, and for users and developers who are building generative AI tooling and applications, it's always kind of confusing: which one is good? Should you listen to the marketing from one tool versus another? So it's great to see someone with an engineering background who can write so well actually take the time to try out a ton of different tools, interview a bunch of different companies, and come to his own conclusions. I am super happy and excited that he's a fan of LanceDB, and we hope to make that better for him and also for all of our users.

4:04
So, you know, back to LanceDB. We started the company two years ago at this point, and we didn't start out as a vector database company, actually, because if you kind of remember, ChatGPT is barely one year old.

4:20
The dawn of AI.

4:22
Yes, exactly. And so the original motivation was actually serving companies building computer vision, building new data infrastructure for computer vision. I had been working in this space for a long time; I've been building data and machine learning tooling for almost two decades at this point. I started out my career as a financial quant and then became involved in Python open source. I was one of the original co-authors of the pandas library. And that really got me excited about open source, about Python, and about building tools for data scientists and machine learning engineers.

And so at the time, this was in 2020 and 2021, what I observed was at the company I was working for, Tubi, the streaming company. We dealt with both machine learning problems for tabular data and also for unstructured data, like images and video assets and things like that. And what I had noticed was that anytime a project touched this multimodal data for AI, from images, to the text for, let's say, subtitles or summaries, to the poster images, these projects always took a lot longer, they were much harder to maintain, and it was difficult to actually put into production.

At the same time, my co-founder Lei, whom I had met during my days at Cloudera, was working at Cruise and dealing with the same issues. And so we put our heads together, and our conclusion was that, hey, it's not the top application or workflow layer or orchestration layer that's the problem; it's the underlying data infrastructure. If you look at what's been out there, Parquet and ORC have been around, and they've been great for tabular data, but they really, really suck for managing unstructured data. And so we essentially said, hey, what would it take to build a single source of truth where we can toss in the tabular data plus the unstructured data, and give much better performance at a much lower total cost of ownership, an easier foundation to build on top of for companies dealing with a lot of vision data. And so this comes in handy when you want to explore your large vision data sets for, let's say, autonomous driving. This comes in really handy for things like recommender systems and things like that.

So we started out building that storage layer in the open source. And that took about a year's worth of effort to really get to a shape that is usable, kind of like Parquet or ORC and the other formats and tools. And that was when generative AI burst onto the scene and became sort of a revolutionary technology. What happened at the time was, we had originally built a vector index for our computer vision users, to say, hey, let's deduplicate a bunch of images, or let's find the most relevant samples for training, for active learning, and things like that. And it was that open source community that discovered, hey, this can be really good for generative AI as well. That's when we separated out another repo to say, hey, this is a vector database. It's much easier to communicate with the community than to say, hey, you're looking for vector search? Use this columnar format. And so that's how we got onto this path.

8:00
It's been a couple of moments now as we were going through that, but I was just curious: when you were talking about going through the analysis on the top workflow versus whether it was infrastructure, you said y'all concluded infrastructure, and then you kind of went on past that. I was wondering, how did y'all come to that determination? For those of us who are not deeply into that thought process, I was wondering where your head was at when you were doing that.

8:25
Yeah, it wasn't an easy decision or conclusion. Looking back, it was 2022, and it initially seemed pretty crazy when we first came up on it. If you think about it, why would you make a new data format in 2022? Parquet has been working so well. I think it was really observing the pain in our own teams, and also we went out and interviewed a lot of folks managing unstructured data. For them, one, data was split into many different places. The metadata might be managed in Parquet, and then raw assets are just dumped onto local hard drives or S3, and then you might have other tabular data managed in other systems, and they would always talk about how painful it is to stitch everything together and manage it all together. Some of the outcomes are that it's really hard to maintain those data sets in production. You have a Parquet data set that has the metadata, and then links to S3 or something like that for all the images, and then somebody moves the S3 directory or something like that, and now all of your data sets are broken. Or we would interview folks, like, hey, what are you doing to explore your visual data sets and things like that? They're like, well, I use a MacBook, and there's this app on it called Finder, and if you single click on a folder, it shows you a bunch of thumbnails. It's sort of this horrible way to actually work with your data, but it was because it was so hard to manage all of that. Machine learning engineers and researchers were stuck with subpar tools.

10:05
You mentioned this transition of thinking, from some of the original use cases that you were talking about with computer vision, to this world of generative AI that we're living in now. From my impression, from an outsider's perspective, it seems like LanceDB has positioned itself very well to serve these kinds of generative AI use cases, which I'm sure we'll talk about in a lot more detail later on. I'm wondering, from your perspective, how has that overwhelming demand for the generative AI use case changed your mindset and direction as a company and a project and open source tooling and all of that? And how do you envision what you're targeting as the use cases moving forward, I guess?

moving forward, I guess? I

10:54

think certainly generative AI has

10:56

brought in a lot of

10:58

different changes in new thinking.

11:01

One was the sort of

11:04

focus around use cases of semantic

11:07

search and just retrieval in general.

11:09

I think with

11:11

the advent of generative AI, I

11:13

think retrieval becomes much more important

11:16

and then ubiquitous. For

11:18

us, what that means is, you

11:20

know, increased investments in terms of

11:23

getting the index to work really well and

11:25

really scalable. Then sort

11:28

of making that data management piece to

11:30

work really well as well and

11:33

integrating with frameworks for RAG and

11:36

for agents and for just

11:38

generative AI in particular. When

11:40

we started out, inevitably

11:43

we were dealing with multi-terabyte

11:46

to petabyte scale like

11:48

vision data sets and things like that.

11:50

We're still dealing with a lot of

11:52

that. But for generative AI, I think

11:55

there was a renewed focus on

11:57

ease of use because a lot of users are

11:59

coming in who don't have

12:02

years of experience in data engineering

12:04

or machine learning engineering. What

12:07

they're looking for is an easy

12:09

to use and easy to

12:11

install package that doesn't require

12:14

you to be an expert in any

12:17

of these underlying technologies. We

12:19

also spent some effort into, okay, that

12:22

was the motivation behind us making

12:24

LAMDB, so vector database, one open

12:26

source, and two, embed it. Because

12:29

we felt like there were lots

12:31

of options on the market

12:33

that required you to figure out,

12:35

okay, what is the instance I need?

12:37

How many instances do I need? What

12:39

type of it? Okay, now

12:41

I have to chart the data and blah, blah, blah. Coming

12:45

from that data background, what

12:47

I had been working with a lot is SQLite

12:50

or DuckDB that just

12:52

runs as part of your application code

12:54

and would just talk to files that

12:56

live anywhere. It was super

12:59

easy to install and use. That's

13:02

what gave us that inspiration to

13:04

make an embedded vector database. You

13:07
You had just got into this idea of embedded databases, which, well, embeddings are related, but that's another topic. But the idea that LanceDB is embedded, and you mentioned DuckDB and other things that kind of operate in the same sort of sphere: I'm wondering, for those that are maybe trying to position LanceDB's vector database tooling within a wider ecosystem of vector databases and plug-ins to other databases that support vector search, could you explain a little bit about what it means that LanceDB is embedded? What does that mean practically for the user? Maybe people aren't familiar with that term quite as much. And are there other general ways that you would differentiate LanceDB's tooling and the database versus some other things out there?

14:07
So I love geeking out about these topics. At the very bottom layer, in terms of technology, I think there are a couple of things that fundamentally set LanceDB apart. One, as you mentioned, is the fact that it's embedded, or runs in process. I think we are one of two that can run in process in Python, and we're the only one in JavaScript that runs in process. Number two is the fact that we have a totally new storage layer through the Lance columnar format. What this allows us to do is add data management features on top of the index. And then number three is the fact that the indices, the vector indices and others in LanceDB, are disk-based rather than memory-based, which allows us to separate compute and storage and allows us to scale up a lot better. So those are the big value propositions that these technological choices bring to users of LanceDB: number one, ease of use; number two, hyperscalability; number three, cost effectiveness; and then number four, the ability to manage all of your data together, not just the vectors but also, if you think about it, the metadata and the raw assets, whether they're images, text, or videos.

text, or videos. Could

15:27

you describe a typical

15:29

use case of a developer

15:32

doing this, where you're taking

15:34

those features that are distinguishing

15:36

LanceDB from other possibilities, other

15:38

competition, but just talk about

15:40

what that workflow looks like,

15:42

or if there is a major one

15:45

or a couple, and just get it very

15:47

grounded. So somebody that's listening can understand how

15:50

they're going to do it from A to Z when

15:52

they're integrating LanceDB into their workflow. So

15:54
So there are a couple of prototypical workflows that we see from our users. At the smaller scale for LanceDB, you're installing it via pip or npm or something like that. In general, you get some input data that comes in as, like, a pandas data frame, or maybe a Polars data frame. And then you interface with an embedding model. You can do that yourself, or you can actually configure the LanceDB table and say, hey, use OpenAI embeddings, or hey, use these Hugging Face embeddings; LanceDB can actually take care of all that. So it's a pretty quick path from data frame to LanceDB, and then you can search it, and the results come out as data frames or Python dicts or things like that. That plugs into the rest of your workflow, which is likely data frame or Pydantic or Python dict based. So that's number one.
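For listeners who want to see the shape of that small-scale workflow, here's a minimal sketch, assuming a recent release of the `lancedb` Python package; the directory, table name, and toy vectors are illustrative.

```python
import lancedb
import pandas as pd

# Input data arrives as a pandas DataFrame; embeddings here are toy 2-d
# vectors computed elsewhere (or via LanceDB's embedding configuration).
df = pd.DataFrame({
    "text": ["hello world", "vector databases", "on-disk indices"],
    "vector": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

db = lancedb.connect("./my-lancedb")      # embedded: just a local directory
table = db.create_table("docs", data=df)  # schema inferred from the DataFrame

# Search with a query vector; results come back as a DataFrame.
results = table.search([0.1, 0.25]).limit(2).to_pandas()
print(results)
```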

16:51
And then number two is really these large scale use cases, where some of our users have anywhere from, like, 100 million to multiple billions of vectors in one table. And that's a much bigger production deployment. Typically, what makes LanceDB stand out in that area is, one, it's easy for them to process the data using a distributed engine like Spark, and they can write concurrently and get that done really quickly. I think we're one of the few that offer GPU acceleration for indexing, so even for those really large data sets you can index pretty quickly. And then number three is, because we're able to actually separate the compute and storage, even at that large vector size you don't really need that many query nodes. You can actually just have one or two fairly average, commodity query nodes that run on your storage of choice, depending on what latency requirements you have, and then just have a very simple architecture. For these types of architectures, the query nodes are stateless and they don't need to talk to each other. So when you need to scale up, or when a node drops out and has to come back in, there's no leader election, there's no coordination. It really lowers the complexity of that whole stack.

18:12
So another great example of this kind of architecture and the benefits it brings is Neon, the Neon database. I think Nikita, the founder, recently had a good Twitter thread about the difference between Neon and other databases. He called it shared data versus shared nothing architecture. And I think that's also what we strive to deliver in LanceDB versus other vector databases.

other vector databases. Yeah,

18:42

I know one of the things that I

18:45

really enjoyed in trying out

18:47

a lot of things with LansDB is

18:49

I can pull up a

18:51

collab notebook and try out,

18:53

I can import LansDB. I can import

18:56

a subset of the kind of database

18:58

that I'm going to be working, or

19:00

the data that I'm working with. It

19:02

all runs fine. I don't have to

19:04

set up some client server type of

19:06

scenario. And then

19:09

when people ask, well, how are you going to push

19:11

this out to a larger

19:13

scale, the appeal of just saying, hey, well,

19:15

we can just throw up this LansDB

19:18

database on S3 and

19:21

then connect to it. That's a

19:23

very appealing thing for people because

19:25

also those storage layers are available

19:27

everywhere from on prem to

19:30

cloud to whatever sort of scenarios you're

19:32

working with. So it's very, very flexible

19:34

for people. Could you explain a little

19:36

bit? Because this is something like I've

19:39

been asked a couple times. So this

19:41

is my selfish question because I have

19:43

you on the line. So

19:45

you're helping me with my own day

19:48

to day work. But when I'm talking

19:50

to some people, clients that

19:52

I'm working with, I'm like, oh, we can just push this up

19:54

on S3 and then access

19:56

it. Usually their question is something

19:58

like, well, like, Because they have in

20:01

their mind a database has a compute

20:03

node and somehow

20:06

the performance of queries into the

20:08

database is tied to the

20:10

sizing of that compute node and maybe

20:12

how that's sort

20:14

of clustered or

20:16

sharded across the database. And

20:19

then this idea, oh, I'm just going to

20:21

have even just a lambda function that connects

20:23

to S3 and does a query. In

20:27

some ways it like breaks things in people's mind.

20:30

And so a lot of times their question

20:32

was like, how does that work? How can

20:34

a query to this large amount of data

20:36

be efficient when the data is just like

20:38

sitting there in S3 or in

20:41

another place? So could you help me

20:43

with my answer, I guess, is what I'm asking.

20:45
Yeah, absolutely. So this goes back to what we talked about earlier with separation of compute and storage. If you've been steeped in data warehousing, data engineering land, this has been a big arc of data warehouse innovation in the past decade: allowing us to scale up the storage versus the compute separately. This is the thing that makes these systems seem magical, where you can process a huge amount of data on what seems like pretty commodity or pretty weak compute. And so the analogy that I like to make in these situations is, a lot of us are familiar with, let's say, DuckDB demos or videos. You can see instances where DuckDB is processing hundreds of gigabytes of data on just a laptop, very fast, and it's able to spit out results almost interactively. There are companies, from MotherDuck to a new company called Valplan, that are looking to essentially distribute DuckDB queries on AWS Lambdas. It's basically the same thing. It's all about the separation of compute and storage. And that's only possible if you have the right underlying data architecture for storing vectors and the data itself.
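As a rough illustration of that pattern, the same embedded API can point at object storage instead of local disk. A minimal sketch, assuming the `lancedb` Python package and AWS credentials in the environment; the bucket path and table name are illustrative.

```python
import lancedb

# The query process stays stateless: index pages and columns are read
# from S3 on demand rather than held on a dedicated database server.
db = lancedb.connect("s3://my-bucket/lancedb")
table = db.open_table("docs")

# Each query fetches only the index partitions and columns it needs.
hits = table.search([0.1, 0.25]).limit(5).to_pandas()
print(hits)
```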

22:09
And just for someone that is not a database developer, can you describe, in any words, the generalities of that data structure that enables such a thing?

such a thing? Yeah,

22:22

so it's two things. One is the

22:25

columnar format. So typically, from Gen

22:27

AI to machine learning, you can

22:29

have very wide tables. But typically,

22:31

a single query only needs a

22:34

couple of columns. So columnar format

22:36

allows you to only have to

22:38

fetch and look at a very

22:40

small subset of that data. Number

22:42

two is that columnar

22:44

format needs to be

22:46

paired with an index, like the

22:49

vector index in this particular scenario.

22:51

And that vector index, in order

22:53

to give this separation of compute and

22:55

storage, has to be based on disk.

22:57

So you have to store

22:59

the data on disk, not force the user

23:02

to hold everything into memory, and

23:04

then be able to access that very quickly. And

23:07

then number three is how to connect

23:10

that index with the columnar format.

23:12

So a columnar format like parquet

23:15

does not give you the ability to

23:17

do fast random access. So even if

23:19

you have that good index, using parquet,

23:21

you would not be able to get

23:23

interactive performance in terms of queries. And

23:25

it's only by having a

23:28

new columnar format like LANs that can

23:30

give you random access and

23:32

fast scans that you can successfully

23:34

put these two together and deliver the things.

23:37

So those are the three big

23:39

pillars that I think in our data architecture

23:41

that makes us possible. While we were
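To tie those pillars together, here's a sketch of building a disk-based ANN index on a LanceDB table. The IVF-PQ parameters are illustrative rather than tuned recommendations, the API is assumed from a recent `lancedb` Python release, and the table is assumed to hold 768-dimensional embeddings.

```python
import lancedb

db = lancedb.connect("./my-lancedb")
table = db.open_table("items")

# IVF-PQ: vectors are partitioned (IVF) and compressed (PQ); the resulting
# index lives on disk next to the columnar data rather than in RAM.
table.create_index(
    metric="cosine",     # distance metric for the vector column
    num_partitions=256,  # number of IVF partitions
    num_sub_vectors=96,  # PQ sub-vectors; trades index size against recall
)

# Queries now go through the index; random access in the Lance format
# fetches only the rows that match.
hits = table.search([0.1] * 768).limit(10).to_pandas()
```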

23:43
While we were talking here, I've been going through GitHub, on your repo and stuff, and was surprised at something that kind of prompted the next question. It looks like you're really addressing a wide range of different types of needs. There's obviously Python, as you mentioned and as one would expect, but you have JavaScript. And then I was delighted to discover that there's a Rust client in there; when I'm not doing AI-specific things, that's my language of choice these days. Could you talk a little bit about two things: the broader picture, like what you're trying to achieve, how you choose what languages to support, and how you're getting there? And then, if you'll scratch my itch, what is your intention with that Rust client? Is it ready? What does it do? Just because I'm fascinated with that. Sorry.

24:31
Yeah, absolutely. I love talking about Rust. The Rust package is actually not a client; the core of both the data format and the vector database is actually in Rust. So the Rust crate that we have is actually the database, the embedded database. And we actually build, for example, the JavaScript package on top of that. Again, it's the same thing with JavaScript: it's not just a client, it's also an embedded database in JavaScript, and that is actually based on top of the Rust crate, kind of like you have in Polars or something like that, where you have a Rust core and then you connect that into JavaScript.

We had actually started out in 2022 writing in C++, because Parquet is written in C++; you know, serious data people and database people write in C++.

25:23
Until they find Rust, of course.

25:26
Right. And there was a hack project during Christmas time at the end of 2022, a hack project for a customer, actually, where we had to partially re-implement the read path for the Lance format. And what we found was it was just so good that we decided to actually rewrite everything in Rust. I think the biggest things were, one, we were a lot more productive: we rewrote roughly six months of solid C++ development in about three weeks with Rust, and this was us learning Rust as beginners as we went along. A lot of that initial Rust code has again been rewritten over the past year, but it just made us feel a lot more productive. And number two is the safety that Rust offers has been amazing. With C++, with every release I didn't have a good feeling; it was almost like, you know, where's that next segfault going to come from? Whereas with Rust, we felt very confident making multiple releases per week with major features, and we did not see anywhere near the sort of issues that we saw with C++. So everything has been really great, and Rust has become really popular now, even with vector databases: Qdrant is in Rust, and Pinecone, they're not open source, but they've publicly said that they've rewritten their whole stack in Rust as well.

27:03

Ross as well. so when we're question

27:06

from he along the same line before

27:08

I let it go because we fit

27:10

that. A sweet spot that I love

27:12

you think in this is not specific

27:15

to land Speedy but based on what

27:17

you're saying clearly you're thinking ahead on

27:19

these things on his we go forward

27:22

and you see both the the ai.

27:24

Applications and you see the different types

27:26

of workflows and infrastructures you know becoming

27:28

broader and more supportive. The multi language

27:31

aspect of getting out of only Python

27:33

for instance. Do you do for see

27:35

that as a convergence for you're seeing

27:38

language agnosticism developing in the space as

27:40

it has in other areas of computer

27:42

science? Or do you think that we're

27:45

still kind of be kind a locked

27:47

in on the current sets of In

27:49

for Surfer and tooling. Very Python oriented

27:51

for the indefinite future. What does your

27:54

thinking? Along those lines, so

27:56

i think generative i'd definitely changes the

27:58

picture and that i think there a

28:00

very large TypeScript

28:02

JavaScript community that

28:04

has been brought into

28:06

the arena to build AI

28:09

tools. And so I think

28:12

this is also an underserved

28:14

segment where it's not

28:16

just vector databases, but data tooling in

28:18

general lags far behind

28:21

in JavaScript's life, TypeScript land

28:23

versus Python. And I

28:25

think there's a real opportunity for

28:27

the open source community to create

28:30

good tools for this part of the

28:32

community as well. I

28:34
I want to hear about some of the actual use cases that you've seen people implement with LanceDB. Maybe there are ones that stand out, like, oh, this was cool because, whatever it was, they used it at scale, or it fits a very typical generative AI use case, or whatever. And then maybe something that surprised you, because when you put a project out into the world, there are these things where, oh, I really didn't expect people to be using it that way, but that sort of makes sense. So can you think of anything that fits into one or both of those categories?

29:14
The use cases for LanceDB in the community that I see fall into three or four large buckets. One is, of course, generative AI, RAG, and things like that. And I think it's not so much the use of LanceDB that I think is really cool, but the applications that people build with it that are really cool and amazing. A lot of the applications that really take advantage of LanceDB are things where you need RAG to be very agile, and you need it to be really tightly bundled with your application, so you can call this RAG from anywhere and have it return pretty quickly and without too much complexity. And so this is where I see a lot of folks, from your standard chat bots and chat-with-documentation, to things like productivity tools, where they build things that help people organize their daily schedules, to much higher stakes things in production, like code generation or healthcare and legal and things like that. There, I think, you typically see vector data set sizes from the tens of thousands up to single digit millions of vectors. And so production means you really scale up both the number of data sets that you have and the number of vectors that you have.

30:51
One of the cool things that I've seen that takes advantage of LanceDB and the Lance format uniquely is a code analysis tool that analyzes your GitHub repository and plugs it into a RAG-like customer success tool. And what they want to be able to do is query the state of the database as of today versus yesterday versus a week ago, to say, hey, was this issue fixed or not, and what's still outstanding? LanceDB uniquely gives you the ability to version your table and also do time travel. Any vector database can do, give me the top most similar things to this input; what LanceDB uniquely gives you the ability to do is say, give me the top most similar as of yesterday, or as of a week ago. And we do that sort of automatically for you.
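A sketch of that versioning and time-travel idea, assuming a recent `lancedb` Python release (treat the method names as an assumption); the table name, version number, and query vector are illustrative.

```python
import lancedb

db = lancedb.connect("./my-lancedb")
table = db.open_table("issues")

print(table.list_versions())  # every write creates a new table version

# "Top most similar as of a week ago": check out an older version, search it.
table.checkout(2)             # read-only view of version 2
old_hits = table.search([0.1, 0.25]).limit(5).to_pandas()

table.checkout_latest()       # back to the current state of the table
new_hits = table.search([0.1, 0.25]).limit(5).to_pandas()
```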

31:47
Yeah. And I think the other big buckets are e-commerce, search, and recommender engines. This is the traditional use case for vector databases, and there you tend to see much bigger single data sets. Say I want to store, like, item embeddings; maybe that's up to a couple of million, up to 10 million, and in some cases that could get up to hundreds of millions. You don't have as many tables, but you potentially have very large tables. And then, of course, the last bucket is this AI-native computer vision: either generative computer vision, or things like autonomous vehicles and things like that. And there's a whole combination of more complicated use cases there that enables active learning, data deduplication, things like that. And the thing that is very unique about the use case of LanceDB there is that companies are managing all of their training data in LanceDB and the Lance format as well. So you can use the vector database to find the most interesting samples, and then you can actually use the tooling on top of the format to essentially keep your GPU utilization high and keep your GPUs fed very quickly during training, or if you're fine tuning, or if you're running evals and things like that.

33:04

and things like that. Yeah, so

33:06

cool. I, one of the things

33:09

that has been most fun for

33:11

me recently is this combination of

33:14

an LLM, Lance DB and

33:16

Duck DB, where like you

33:18

can create these really cool.

33:20

So if I'm using an

33:23

open LM that can generate like

33:25

SQL queries or something, but I

33:27

have like all of these different

33:30

SQL tables, like what we're doing

33:32

is like putting descriptions of the

33:34

SQL fields and tables in

33:36

Lance DB and actually on the fly,

33:39

like matching and pulling those to

33:41

generate a prompt, which goes to the LM

33:43

to generate the SQL code, which is executed

33:45

with Duck DB. And this gives you like

33:48

the kind of really nice natural

33:51

language query to

33:53

your data type of scenario, which has been

33:55

really fun to play with. That's really good

33:57

to hear. Actually, sorry to interrupt. So because

34:00

You kind of nerd sight me. So get it

34:02

out there. So one of the things that's really

34:04

cool about DuckDV is its

34:06

extension mechanism. So

34:09

I think they've also published

34:11

like a extension framework for

34:13

Rust-based extensions. And so we have sort

34:15

of a basic integration going there. And

34:17

I think in New Year, what you

34:19

can expect from us is actually we're

34:21

going to be spending a little more

34:24

time to make that

34:26

integration be more rich, meaning

34:29

our goal is for you to be

34:31

able to say, to write like a DuckDV

34:33

UDF to do vector search. And

34:36

then the results come back as like

34:38

a DuckDV table where you can then

34:41

run additional queries, like DuckDV queries on

34:43

top of that. And so,

34:45

and sort of the same thing with

34:48

like polars, right? So you can, and

34:51

the goal is to essentially make it so

34:54

that like vector database is no longer a

34:56

thing that you even have to think about.

34:59

People are generally more familiar with like DuckDV

35:01

or polars as the sort of that

35:03

tool that just stitches together the workflow. So

35:06

we just want that to make it

35:08

feel even smoother and more transparent. A
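Even before that richer UDF integration lands, DuckDB can already query LanceDB data through Arrow. A minimal sketch, assuming the `lancedb` and `duckdb` Python packages; the table name and query are illustrative.

```python
import duckdb
import lancedb

db = lancedb.connect("./my-lancedb")
docs = db.open_table("docs").to_arrow()  # materialize the table as Arrow

# DuckDB's Python API can scan Arrow objects referenced by local variable
# name, so plain SQL composes with data managed by LanceDB.
print(duckdb.sql("SELECT text, count(*) AS n FROM docs GROUP BY text").df())
```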

35:11
A couple of moments ago, when you were talking about the use cases, you were talking about autonomous vehicles and stuff, and I was wondering if we could pull that thread a little bit more. It seems like it is a fantastic, Chris... like drones. Yeah, I love drones. And I love things that are not by data centers; I love things that are off on the edge, whether it be for inference or even training, where you may not have all the things that we're so spoiled with with our cloud providers out there. And it seems like there are many types of opportunities to use that. What's your thinking around that? Have you seen any use cases? Any ideas for the future in that kind of autonomous, on-the-edge world?

35:53
Yeah, definitely. So we certainly have... some of our users are, like, robotics or device companies.

38:00
Now, in this sort of practical AI space, because that's where you're living: what excites you about, whatever it is, the next six months, the next year? What do you think is coming as this tooling rolls out there further and further and people learn to apply it better and better? What's exciting for you?

What's exciting for you? That's a

38:20

great question. I think there are

38:22

lots of things that I think hold

38:24

a lot of promise in the next

38:26

six to 12 months. I think

38:29

we'll see one is

38:31

this explosion of

38:34

retrieval, kind of information retrieval tools.

38:36

So we already see a lot

38:38

of companies are adding like generative

38:40

AI in customer success

38:43

management and

38:45

like documentation and things like that. And

38:47

so I think we'll see a lot

38:49

of applications providing value

38:52

that is, you know, that can be

38:54

also personalized and, you know, not just

38:57

like chat GPT stop answers, but actually

38:59

personalized to their own data or their

39:01

own, you know, cases or things like

39:03

that. And then number two is

39:06

I see a lot of

39:08

successes in very domain

39:10

specific agents that are able to

39:12

dive deep into legal

39:14

or healthcare or some domain

39:16

very specifically and build things

39:18

that seem sort of magical,

39:20

whether it's compliance or

39:23

driving better outcomes or, you know,

39:25

creating things that would democratize

39:28

a lot of these sort of

39:30

like very deep expertise type of

39:33

domains. And then I

39:35

think a little bit further out are generalized,

39:38

like low code and no code tools

39:41

for you to build, you know, very

39:43

sophisticated applications using generative

39:46

AI through code generation and sort

39:48

of creative, let's say creative interfaces

39:50

and things like that. So those are

39:52

things I think we'll deliver in the

39:55

short term. And then, you

39:57

know, personally, like I love

39:59

games and I'm I'm actually super excited about

40:01

what genitive ad brings to gaming. We

40:04

talked about open world and things like that.

40:06

And this is, this can

40:08

be really open where you

40:10

could just get lost for a

40:12

long, long time in a generative world. Awesome.

40:15
Awesome. Thank you so much for taking the time to talk with us, and please pass on my thanks to the LanceDB team for making me look good in my day job by giving me great, great tools that work really well. I appreciate what you all are doing, and I'm looking forward to seeing what comes over the coming months. I encourage our listeners to check out the show notes and all the links to LanceDB and try it out; it only takes a few minutes. Hope to talk to you again soon. Thanks so much.

40:47
Thank you, Daniel, and thank you, Chris. It was super fun talking with you guys. And if you have any feedback, please let us know. We hope to make you look even better in the new year.

41:05
Thank you for listening to Practical AI. Your next step is to subscribe now, if you haven't already. And if you're a long time listener of the show, help us reach more people by sharing Practical AI with your friends and colleagues. Thanks once again to Fastly and Fly for partnering with us to bring you all Changelog podcasts. Check out what they're up to at fastly.com and fly.io. And to our beat freaking resident, Breakmaster Cylinder, for continuously cranking out the best beats in the biz. That's all for now. We'll talk to you again next time.
