AI Infrastructure

Released Friday, 28th April 2023

Episode Transcript

0:00

Episode 359 of CppCast

0:03

with guest Ashot Vardanyan recorded

0:05

24th of April 2023.

0:08

This episode is sponsored by JetBrains,

0:11

smart IDEs to help with C++,

0:13

and Sonar, the home of clean code.

0:31

In this episode, we talk about some

0:33

new blog posts, the

0:35

annual developer survey, and

0:38

Sonar support in Compiler Explorer.

0:43

Then, we are joined by Ashot Vardanyan.

0:47

Ashot talks to us about AI and improving

0:50

the infrastructure that it runs on.

1:00

Welcome to episode 359 of

1:02

CppCast, the first podcast for

1:04

C++ developers by C++ developers. I'm

1:07

your host, Timur Doumler, joined by

1:09

my co-host, Phil Nash. Phil,

1:11

how are you doing today? I'm

1:13

all right, Timur. Just back from the ACCU conference

1:16

last week, which I know you were at too,

1:18

because I saw you there. So how about you?

1:20

Back at home in Finland now? Yeah, exactly.

1:22

I arrived last night. I'm still quite

1:25

tired from the whole thing, but kind

1:27

of recovering. Yeah,

1:28

it was an awesome conference.

1:30

Yeah, things do seem to be back to sort of almost

1:32

normal levels. Last year was definitely

1:35

way down on attendance. So it's interesting to see

1:37

conferences getting more back to normal. So quite

1:40

hopeful for C++ on Sea this year.

1:43

All right. At the top of every episode, I'd

1:45

like to read a piece of feedback.

1:46

This time, we got an email from Peter, who

1:48

was commenting about episode 356 with

1:51

Andreas Weis about safety-critical C++.

1:54

Andreas had said that the V-model implies

1:57

waterfall. Peter writes, Actually,

2:00

it's not a very strong implies. Waterfall

2:03

is a way to organize activities that implies a strict

2:06

ordering and stage gates, etc.

2:08

The V-model talks about appropriate levels

2:10

of system decomposition and having tests that correspond

2:12

to the elaborated requirements at each level.

2:15

The two concepts are orthogonal to each other, although

2:17

many organizations using the V-model do

2:20

follow waterfall. As a counterexample,

2:22

there is a standard called AAMI-TIR45

2:26

called Guidance on the Use of Agile Practices

2:29

in the Development of Medical Device Software that

2:31

describes how to practice agile development

2:34

in a way that the FDA will accept. This

2:36

features the V-model but no waterfall. Well,

2:39

thank you very much Peter for the clarification. It's

2:41

much appreciated and we will put the link to

2:43

AAMI-TIR45 in

2:45

the show notes.

2:46

I'm sure everybody's familiar with that document already,

2:48

but yeah, we'll put it in the show notes.

2:52

Talking of feedback, we actually got quite a bit

2:54

of feedback, still more feedback

2:56

about the ongoing RSS

2:58

feed saga.

3:00

It does seem that it wasn't quite as fixed as

3:02

I thought it was last time. And

3:04

the weird thing is that different people were seeing

3:06

different behaviours. Some

3:08

weren't seeing the last episode with Matthew

3:10

Benson. Some weren't seeing the episode

3:13

before that with Herb Sutter. Some

3:15

were seeing that one twice and some weren't seeing

3:17

either of them. So after

3:19

a bit of digging, I discovered that both of

3:21

those episodes were missing a GUID, which

3:24

is actually not strictly required by the RSS

3:26

spec, but

3:27

many clients do rely on it. And

3:30

you can see why that

3:31

may have caused different behaviours in different clients.

3:33

So hopefully that explains it all now.

3:36

I've added those GUIDs back in as

3:38

well as an extra check just to make sure that

3:40

doesn't happen again, hopefully.

3:42

And I've checked with all of the clients that I know about

3:44

and they all now seem to be complete and up to

3:46

date everywhere. So again, sorry

3:48

about that, but if you do still see anything

3:51

not quite right with the feed, do

3:53

please continue to let us know. We want to make sure

3:55

that

3:55

we get it right.

3:57

That also means that some people may

3:59

now have... one or two older episodes

4:02

added back in their feed.

4:04

So

4:05

if you don't recognize the episode

4:07

with Herb Sutter or Matthew Benson,

4:09

or you think you've maybe missed an episode somewhere, do have a

4:11

look back in your podcast player

4:13

to see if it's back in your history unplayed.

4:15

You may have a bonus episode, so maybe

4:18

there's a plus side to this as well. But hopefully that's

4:20

all sorted now.

4:22

Thank you Phil for fixing all of this. I don't really know

4:24

how any of this works. Me neither,

4:26

apparently. Really appreciate

4:28

you sorting that out.

4:30

We'd like to hear your thoughts about the show. You

4:32

can always reach out to us on Twitter or Mastodon

4:35

or email us at feedback at cppcast.com.

4:39

Joining us today is Ashot Vardanyan,

4:41

also called Ash. Ash is

4:43

the founder of Unum and the

4:45

organizer of Armenia's C++ user group.

4:48

His work lies in the intersection of theoretical

4:50

computer science, high performance computing,

4:53

and systems design, including everything

4:55

from GPU algorithms and SIMD assembly

4:57

for x86 and Arm, to drivers and

4:59

Linux kernel bypass for storage and

5:01

networking I/O. Ash, welcome to the show.

5:03

Hello

5:04

guys, happy to be here.

5:06

Good to have you here. So your bio is really

5:08

quite interesting,

5:09

but I did actually find another bio

5:12

from you with most of the

5:14

same information on your GitHub page.

5:17

It also says there that you're an artificial intelligence

5:19

and computer science researcher. And

5:21

it also says, and that really caught my attention,

5:23

you have a background in astrophysics.

5:26

Now I have a background in astrophysics too, so I'm

5:28

very curious what your astrophysics background is all

5:30

about. Well, if only I remembered

5:32

much. So for some

5:35

time, I was really curious about

5:37

theoretical physics, but then I didn't

5:39

feel like I'm smart enough to do it for a lifetime.

5:42

I didn't feel like I could

5:44

contribute that much. For some time,

5:47

almost in the free time, I was building up some

5:49

simulation software.

5:51

And back then people were just using packages

5:53

like ROOT and many others that were written

5:55

at CERN for physics simulations

5:58

and stuff. And I always.

5:59

kind of went the opposite direction.

6:02

So I approached all of my researchers,

6:05

advisors, whatever,

6:07

I convinced them that instead of using an

6:09

existing package for all kinds of simulations, I'll

6:11

just rewrite everything for

6:14

GPUs.

6:15

And then when I came back to my

6:17

advisors and showed them that I can run their simulation

6:21

on the laptop faster than they do it on the cluster,

6:24

they were all kind of shocked. And then I kind of started

6:26

combining all of this with some of my theoretical computer

6:28

science research and decided it's time

6:30

to leave the university and just focus on AI

6:33

for the rest of my life.

6:34

That is such a cool story. So I also

6:37

did do some astrophysical simulations

6:40

in my time. I remember there was

6:42

lots of horrible Fortran code written

6:44

by Soviet professors in

6:46

the 70s. But I never

6:48

had any brilliant ideas like that. But

6:51

yeah, so that's quite fascinating what you're saying

6:53

about the GPU stuff. So you will

6:56

get more into your bio and your work in just a few minutes.

6:58

But we do have a couple of news articles to talk about.

7:01

So feel free to comment on any of these, OK?

7:03

Sure, sure. All right, so the first thing,

7:05

upcoming conferences. So

7:08

you received an email from Inbal Levi,

7:11

who is one of the organizers of Core

7:13

C++. And she

7:15

writes, our conference Core

7:17

C++ is approaching fast. And we're really excited

7:19

about it. I was wondering if you could mention

7:22

it in your podcast. Well, dear Inbal, yes,

7:24

we can. Core C++ is

7:26

taking place in Tel Aviv, Israel. It's

7:28

actually in a new venue, although the old venue

7:30

was also pretty awesome. So I'm curious what the new venue

7:32

is like. From June 6 until 7,

7:35

with workshops on June 5 and 8,

7:38

they have amazing speakers. The keynotes

7:40

will be given by Bjarne Stroustrup and Daisy

7:42

Hollman.

7:44

And tickets are available at CoreCPP.org.

7:47

And I received another email from meeting C++ that was

7:49

not specifically for

7:51

CPP cards. It was kind

7:53

of just a regular mailing

7:55

that they sent out. But it was also very exciting, actually,

7:58

because they have announced the dates for Meeting C++ 2023.

8:01

So that's a big conference in Berlin. It's

8:04

going to take place on the 12th to the 14th

8:06

of November this year in Berlin and

8:09

be aware this is Sunday to Tuesday and

8:11

it's not Thursday to Saturday as it has always

8:13

been in the past. Interesting. Uh, it will be a hybrid

8:15

conference with three tracks on site, one

8:18

pre-recorded online track,

8:20

and they have announced two keynotes already, one by

8:22

Kevlin Henney and another by Lydia Pintscher.

8:24

And the third keynote, the closing keynote will

8:26

be announced later.

8:28

So that's good to hear that that conference

8:31

is also coming back. Or actually both of them are coming

8:33

back this year. Been to both

8:35

of them. They're pretty awesome. Both of them. I

8:37

think Phil, you've also been to, have you been to

8:39

Core C++? I went to the first one. Yes.

8:42

Not been to one since, unfortunately. Yeah,

8:44

I was at the one last year as well. It was also pretty

8:46

awesome.

8:47

Um, so just for completeness sake, there

8:50

are a few more upcoming C++ conferences in the

8:52

next couple of months. I just want to briefly mention them as well

8:54

to kind of remind everyone that this is happening.

8:56

So there's a C++ now in Aspen,

8:59

Colorado, which is just a couple of weeks away at

9:01

this point is from the seventh through the 12th

9:03

of May,

9:04

and it's capped at 140 participants

9:07

and they sent out a mail blast as well, uh,

9:10

this week saying that they still have 20 slots

9:12

left. So you

9:14

want to grab one of the last slots to go

9:16

to C++ now in beautiful Aspen, Colorado,

9:18

buy your ticket now.

9:20

And there's obviously C++ on Sea, which is your conference,

9:22

Phil. Do you want to talk about that one?

9:25

Of course. Yeah. So the full schedule, but

9:27

by the time this airs, that should be available. Uh,

9:29

we'll be going live shortly after we've recorded this

9:32

and it will also announce our third

9:34

keynote speaker. So, uh, like, uh, again,

9:37

as with Meeting C++, we held one of them

9:39

back until a bit later. So

9:41

a little bit of a news teaser as well,

9:43

but the rest of the schedule will all be online

9:45

by the time you hear this.

9:47

Right. And then there is the one in Madrid coming

9:49

up actually this week. So by the time

9:52

you hear it, the conference is already going

9:54

to be

9:54

happening. So it's probably too late to

9:56

direct people to that one. But there's also the Italian

9:59

C++ conference in

10:02

Rome on the 10th of June. That's

10:05

also not very far away. And

10:07

finally, there's also CppNorth in Toronto,

10:10

Canada coming up 17th to 19th of July. So

10:13

that's another one I'm really looking forward to.

10:17

Okay, enough conferences. There's another

10:19

thing. There is

10:22

the 2023 annual C++ developer survey, which

10:24

is now out. It is an annual

10:26

survey by the ISO C++ standards committee

10:28

and the standard C++ Foundation. And

10:31

they would really appreciate your feedback to

10:33

share your experiences as a C++ developer as it

10:36

only takes 10 minutes to complete.

10:38

So please participate. If you have 10 minutes to

10:40

spare, you can do so on surveymonkey.com

10:42

slash r slash ISO CPP minus 2023.

10:46

This link will be in the show notes on cppcast.com.

10:49

And a summary of the survey

10:51

results will be posted publicly on isocpp.org.

10:56

This is one of the three big C++ community surveys that go out

10:58

every year: this one, JetBrains

11:00

do their own survey and

11:02

Meeting C++ has their ongoing survey. Sometimes

11:07

it has new questions as well. But between

11:09

the three of those, I know from having worked with

11:11

a couple of tool vendors now that we do watch

11:13

those closely to see what the trends are and

11:15

who's using what. So please

11:18

do fill those out because it helps the

11:20

whole community to do that. All right.

11:22

And we have one news item from the tooling world that you already

11:25

mentioned. So it's a nice segue. Thank you, Phil. Actually

11:28

back to you because this is about the company where you work.

11:31

So Sonar has announced that you can now run Sonar

11:33

static analysis inside Compiler

11:34

Explorer. Yeah.

11:37

I'm really, really excited about this because it

11:39

really makes a

11:41

huge difference being able to just enable Sonar analysis

11:43

on some code you already have running

11:45

in Compiler Explorer. And you'll get

11:47

a much more

11:49

detailed set of

11:52

warnings, or rules,

11:53

that can really break down

11:55

what might be wrong with your code.

11:57

Even if it compiles,

11:59

you might still get some more insight to help you

12:01

clean it up.

12:02

And I've actually been working on

12:04

some videos to go along with that. So

12:06

if they're ready by the time this airs, I'll put some links

12:08

to those in the show notes, just

12:11

to show you some use cases. Right.

12:14

And so finally there were two blog posts

12:17

the last couple of weeks that caught my attention. The first

12:19

one was by Victor Zverovich,

12:21

the guy who wrote std::format and the {fmt}

12:24

library.

12:25

It's called C++ 20 modules

12:27

in Clang.

12:28

So Clang 16, which we already discussed a

12:30

little bit, I think a few episodes back,

12:32

has pretty good support for C++ 20 modules,

12:35

kind of out of the box.

12:36

So Victor actually went ahead and compiled his

12:39

{fmt} library with modules.

12:41

It requires a bit of manual work, but you can make

12:43

it work. So this has been done before by Daniela

12:45

Engert. She has a talk about that called

12:48

A Short Tour of C++ Modules that

12:50

she did a while back.

12:51

But yeah, so Victor now has repeated this

12:54

exercise. I think with Clang 16, it's quite

12:56

a lot simpler now to do. But surprisingly,

12:58

Victor found that there wasn't actually a measurable

13:00

speed up in compile times, which is kind of one of

13:02

the things that, you know, modules

13:05

was promising to, to give us. And

13:08

so, so he was digging a bit deeper in that

13:10

blog post and he traced the issue down to the fact that

13:12

Clang is actually ignoring extern template.

13:15

So it's kind of recompiling the template instantiations

13:18

all over again. So this is kind of

13:20

exactly the thing that, you know, modules were supposed

13:22

to get rid of. So

13:25

yeah, it's kind of interesting to see if, if, and when

13:27

Clang is going to fix that or

13:30

improve that on what the underlying issue

13:32

is. I'm not

13:33

an expert there, but

13:34

definitely interesting to see that there

13:36

is still some work there to do, but it's kind of, kind of works,

13:39

but kind of also

13:40

doesn't quite give you all the benefits yet.
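(For reference, a minimal sketch of what extern template promises, with made-up names; the point is that only one translation unit should pay for the instantiation:)

    // widget.h, included everywhere:
    template <typename T>
    T square(T x) { return x * x; }

    // Tell every includer NOT to instantiate square<int> itself:
    extern template int square<int>(int);

    // widget.cpp, compiled once, provides the one instantiation:
    // template int square<int>(int);

(The blog post's finding is that Clang's modules build was re-instantiating anyway.)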

13:42

Yeah. That seemed to be a quality

13:45

of implementation issue,

13:46

hopefully, which means that we

13:49

can move past that and we might still get the benefits

13:51

of it

13:51

down the line. Yeah. I did

13:53

like the fact that

13:56

Victor actually starts off the

13:58

post saying that the three headline

13:59

C++20 features are

14:01

modules, coroutines,

14:03

and the third one.

14:05

So it's up to you to fill in the blank there. Well,

14:09

I mean, if he means core language features,

14:11

I guess he means concepts. You

14:13

decide. I mean, it could also be

14:15

ranges. I mean,

14:18

I don't know. Interesting.

14:21

Very interesting. Okay.

14:24

And one last interesting blog post that I want to

14:26

mention on the show that caught my attention was

14:28

called Horrible Code Clean Performance by

14:30

Ivica Bogosavljević. I hope

14:33

I pronounced this name not too

14:36

completely wrong. So the

14:38

title of that blog post is actually a homage

14:41

to another blog post and video

14:43

that came out a couple of months ago, Clean

14:45

Code Horrible Performance,

14:48

which is also super interesting and kind of controversial.

14:50

And we haven't really covered it in the show, but like, yeah,

14:53

look up that one as well. That's super interesting. But

14:55

yeah, this one is called Horrible Code Clean Performance.

14:58

And basically what he's doing there is that Ivica

15:01

is implementing a simple substring search

15:03

algorithm. So you kind of have a bigger string, and

15:05

then you have a smaller string, and you search for the first

15:07

occurrence of the smaller in the bigger string.

15:10

And he's kind of implementing it naively first.

15:12

So he has like a kind of a

15:14

char pointer and a size basically for both of

15:16

them. And then he's implementing like the naive loop that you

15:18

would do. And then he

15:21

does like an optimization where he does

15:23

it in a much more ugly way, but then it kind of stores

15:27

the first bit of the string that you're searching for

15:29

in a really clever way. So he ends up with this like

15:31

much more convoluted code, but he

15:33

finds that that actually runs 20% faster.

15:36

And so I thought that was really interesting.

15:38

There was also quite an interesting Reddit discussion about

15:41

that too.

15:43

So I did actually see an interesting backwards and forwards in

15:45

the comments with someone called

15:47

Timur Doumler, who

15:49

said he wasn't able to reproduce Ivica's

15:51

results on his own machine.

15:54

So looks like that's actually still ongoing.

15:56

So that may have even changed by the time this airs. But

15:59

uh

15:59

Did you get anywhere with that, Timur? Yeah, so

16:02

the first thing I thought,

16:04

when I was reading that blog post, I was like, that can't be.

16:06

Compilers are more clever than this. So I kind of

16:09

did the benchmark myself and it came out as

16:11

there is no benefit. And then I

16:14

told Ivica about this and he was like, oh yeah, but

16:17

you need to make sure that

16:19

the function isn't inline. Then you need to make sure that the

16:21

compiler doesn't actually see the string you're looking

16:24

for because it doesn't see its size. Otherwise

16:26

it can do clever

16:27

compile time optimizations. And I was like, oh yeah, obviously

16:30

this is what's going on

16:32

in my benchmarks. So that's all

16:34

wrong. So I need to redo it. But we

16:36

did agree on the fact that if

16:38

you do the same benchmark just with std string

16:40

find, then it's actually faster than

16:42

either of these versions on either of those machines.

16:45

It's interesting that you're mentioning this. std

16:48

string find is a bit

16:50

weird. So I always feel like

16:53

some of the libraries can have a few

16:55

more specialized versions for different

16:57

substring search. So I was

16:59

doing a few

17:02

measurements and benchmarks in the previous years.

17:04

And substring search is one of my favorite problems,

17:06

like tiny ones to tackle. So you

17:09

can always take a smart algorithm and

17:11

then try to optimize it in

17:13

the CS way and expect some

17:16

performance improvements. But the best thing that worked

17:18

for me in substring search is very

17:20

trivial heuristics. And especially

17:22

combining them with some

17:24

single instruction multiple data intrinsics,

17:27

you get the biggest benefits.

17:30

So if you take, let's say, substring

17:32

search in the case where you have the needle that

17:35

is at least four characters long, and

17:37

you essentially cast it to 32-bit

17:40

unsigned integer, and then go

17:42

through the haystack comparing at every

17:45

byte step, the following four

17:47

bytes cast to a uint32 against your uint32, and you get an

17:49

improvement over std

17:53

string find and over a few other things,

17:56

which is a bit surprising because this is such an obvious

17:58

thing.
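(For listeners who want it in code, a minimal sketch of the heuristic being described here, assuming a needle of at least four bytes; the function name is made up, and std::memcpy is used to sidestep unaligned loads:)

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    std::size_t find_prefix32(std::string_view haystack, std::string_view needle) {
        if (needle.size() < 4 || haystack.size() < needle.size())
            return std::string_view::npos;
        std::uint32_t prefix;
        std::memcpy(&prefix, needle.data(), 4); // first four bytes as one integer
        for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i) {
            std::uint32_t window;
            std::memcpy(&window, haystack.data() + i, 4);
            // Only verify the rest of the needle when the 4-byte prefix matches.
            if (window == prefix &&
                std::memcmp(haystack.data() + i + 4,
                            needle.data() + 4, needle.size() - 4) == 0)
                return i;
        }
        return std::string_view::npos;
    }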

17:59

Of course, it doesn't work in the rare cases when the first four characters

18:03

are matching quite often. But

18:05

like, say, the fifth one is a

18:07

different one. And then the coolest thing

18:09

that worked out was actually taking the AVX

18:12

registers, the AVX2 registers.

18:15

And they can fit 256 bits. So

18:18

that's how many bytes.

18:21

That's 32 bytes. And

18:24

within 32 bytes, you will be

18:26

able to do how many of

18:29

such comparisons? Eight such comparisons.

18:31

So what you can do, you can prefetch

18:34

four AVX2 registers

18:37

at one-byte offsets. And

18:40

this way, with a 32-byte step, you

18:43

can actually compare 35 characters at a time, checking

18:48

if any one of those offsets matches to

18:50

your four-byte thing. So it sounds

18:52

a bit convoluted. And this is exactly the point,

18:54

I guess, of the article, like convoluted

18:56

code, good performance, versus

18:58

the opposite. But the difference is staggering.
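(A rough sketch of the register trick, assuming AVX2; only the candidate test for one 32-byte block is shown. It deliberately reads 35 bytes per 32-byte step, which is where the 35-characters figure comes from. Tail handling and the full-needle verification are omitted:)

    #include <cstdint>
    #include <immintrin.h>

    // Candidate test for one block: compare the 4-byte needle prefix at
    // byte offsets 0..3, covering all 32 starting positions in the block.
    // The caller must guarantee that text[0..34] is readable.
    bool block_may_match(const char* text, std::uint32_t prefix) {
        __m256i p = _mm256_set1_epi32(static_cast<int>(prefix));
        __m256i any = _mm256_setzero_si256();
        for (int off = 0; off != 4; ++off) {
            __m256i w = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(text + off));
            any = _mm256_or_si256(any, _mm256_cmpeq_epi32(w, p));
        }
        return !_mm256_testz_si256(any, any); // some lane matched the prefix
    }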

19:01

So when you take std::string and you

19:03

call find, most of the time, you're getting 1.5

19:06

gigabytes per second worth of throughput,

19:08

give or take. Well, I guess me

19:10

and you, Timur, have an astrophysics

19:13

background. Anything within the 10x

19:16

order of magnitude difference is accurate enough.

19:19

It's on par,

19:21

right? Yeah, exactly. But

19:24

what I was able to show at some of the conferences

19:27

was that with this basic intrinsic

19:30

thing, you can actually get to 12 gigabytes

19:32

per second per core. So

19:35

it's exceptionally efficient.

19:37

It's much faster than the libc implementations

19:40

and std::string. And you can literally

19:42

fit it in so many places. So I was doing

19:44

benchmarks for both AVX instructions,

19:47

AVX-512 and Arm Neon. And

19:49

back then, SVE, Scalable Vector Extensions,

19:52

were not available on ARM. So I couldn't

19:54

do those. But even on ARM, the performance

19:56

benefits were huge and the energy efficiency

19:58

was absurd compared

19:59

to any code that even a super

20:02

smart compiler like Intel's compiler

20:04

can optimize. Yeah,

20:07

that is such a cool trick. That is

20:09

very, very interesting. Do you... Let me

20:11

just... Because this is my kind of topic. I love

20:13

this kind of stuff. So I just need to follow up on this very

20:15

briefly. Do you actually

20:18

have to write the SIMD instructions

20:20

by hand, or use a SIMD library, or do you just

20:22

write the algorithm in a way that lends itself to

20:25

auto vectorization?

20:26

I don't know. So actually, part

20:28

of this talk was specifically about measuring

20:30

auto vectorization capabilities versus naive

20:33

code versus handwritten intrinsics. And

20:35

there were people from Intel's teams validating

20:38

my numbers and checking if their

20:40

compilers can actually reproduce some of

20:42

the vectorization that is quite easy to

20:44

do by hand. And maybe since that

20:46

point, like five-ish years

20:48

ago, six-ish years ago, I almost

20:51

completely abandoned the idea of writing

20:53

a library for SIMD instructions. Whenever

20:55

I need top tier performance, I just

20:58

manually implement a few different

21:00

versions. Generally, I don't go

21:02

down to the level of assembly, but I almost

21:04

always use the intrinsics. And

21:06

I would generally have, let's say,

21:09

a function object

21:10

that is templated that will have,

21:13

let's say, a few different instantiations.

21:16

One of them will be, let's say, with linear code,

21:19

with serial code. Another one will be with

21:21

ARM Neon, another one with ARM SVE,

21:24

another one with AVX, another one with SSE,

21:27

and potentially the last one would be AVX 512

21:30

whenever necessary.
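(The shape of that might look roughly like this; the tag types and the function object name are invented for illustration, and the intrinsic bodies are elided:)

    #include <cstddef>

    struct serial_k {}; struct neon_k {}; struct sve_k {};
    struct sse_k {}; struct avx_k {}; struct avx512_k {};

    template <typename isa_kind> struct dot_product;

    // Portable serial fallback; the compiler may still auto-vectorize it.
    template <> struct dot_product<serial_k> {
        float operator()(float const* a, float const* b, std::size_t n) const {
            float sum = 0;
            for (std::size_t i = 0; i != n; ++i)
                sum += a[i] * b[i];
            return sum;
        }
    };

    // template <> struct dot_product<neon_k>   { /* vmlaq_f32 ... */ };
    // template <> struct dot_product<avx512_k> { /* _mm512_fmadd_ps ... */ };

(Selecting the specialization can then happen per target at compile time, or at run time after querying the CPU's feature flags.)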

21:32

That's interesting. So actually, in audio, there's a

21:34

very, very similar problem. You often have

21:36

to do convolution.

21:39

And you can do convolution in Fourier space, and this

21:41

is kind of the proper way to do it. But sometimes you want to

21:43

do convolution in time domain. And then it's

21:45

essentially the same thing, except you don't compare,

21:48

but you multiply. Also,

21:50

you have a big array

21:52

and a small

21:54

array, and you multiply it, and then you move it

21:56

by one sample, you multiply it, you move it by one

21:58

frame and multiply it, and so on and so on.
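(A minimal sketch of that time-domain loop; strictly speaking this computes correlation, since true convolution would also flip the kernel, and it assumes the signal is at least as long as the kernel:)

    #include <cstddef>
    #include <vector>

    std::vector<float> convolve_naive(std::vector<float> const& signal,
                                      std::vector<float> const& kernel) {
        std::vector<float> out(signal.size() - kernel.size() + 1, 0.0f);
        for (std::size_t i = 0; i != out.size(); ++i)        // slide by one sample
            for (std::size_t j = 0; j != kernel.size(); ++j) // multiply-accumulate
                out[i] += signal[i + j] * kernel[j];
        return out;
    }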

21:59

And so this is why I

22:02

found that this is one of the things that

22:04

like the compiler just can't

22:06

figure out for you if you don't write it

22:08

in a particular way that like

22:10

kind of lends itself to auto-vectorization or like

22:12

typically I think I've seen that you kind of have to

22:15

use a SIMD library to get this right. And

22:17

I've never tried it with just raw intrinsics because

22:19

that's kind of

22:21

non-portable and you have to do it multiple times. So

22:23

I clearly haven't

22:25

dug into this as deeply as you have, but

22:27

I should

22:29

look into this again because I think it's kind of a really

22:32

interesting problem

22:32

space. Just for sake of completeness, what

22:35

talk is that? Where can people

22:37

see that? Because we should totally put that in the show

22:39

notes, I think.

22:40

That's actually interesting.

22:43

I'm pretty sure there's a GitHub repository

22:46

that implements this. I think the talk

22:48

specifically that one was in

22:50

Russian for CPP Russia. I think

22:53

the AI tools should advance a little bit and we'll

22:55

have automatic translation. Or maybe

22:57

I can just repeat this talk somewhere, but it also

23:00

touches on a few topics that almost

23:02

no other talk I've ever seen

23:04

kind of covers. So there is like this

23:07

rumor that AVX 512 and

23:09

a few other different instruction

23:12

subsets kind of affect the

23:14

frequency of the CPU so that like once you

23:16

truly load all the cores with AVX-512,

23:20

you kind of lose all the CPU

23:22

boost. The frequency really drops

23:24

and people just

23:26

kind of understand that it's the

23:28

case, but no one quantified it or like no

23:30

one really goes into the documentation to

23:33

describe what CPU licensing

23:35

levels are. So same way

23:37

as you have like CPU cache levels like

23:39

level zero or level one,

23:41

two, three. There's also like CPU

23:43

frequency licenses. And this

23:46

is like a weird, completely separate topic that's

23:48

almost completely undocumented.

23:51

And I was doing a lot of research kind of trying

23:53

to understand like how many lines of

23:55

AVX 512 or like how many intrinsics

23:58

can I actually fire, so

24:00

that the CPU doesn't

24:02

turn down the frequency. Or let's say, how

24:05

many more do I need for all the CPU frequencies

24:07

across all the cores to be downgraded, even

24:10

if the remaining cores are not doing any AVX-512 and they're

24:12

just doing serial code. So

24:15

it's kind of an interesting thing. There's a repository

24:17

that you can run. It's on my GitHub.

24:19

And my handle everywhere is identical. It's

24:22

ashvardanian, Ash Vardanyan.

24:26

So I think the repository

24:28

is called Substring Search Benchmarks or something

24:30

like that. I'll share the link. Yeah,

24:33

we'll put that in the show notes. Thank you.

24:35

And it sounds like we're sort of transitioning into

24:39

the main content of this episode

24:41

already. So a great segue there.

24:43

Thank you. But actually, before we get to the low-level stuff,

24:46

I wanted to start at least

24:47

maybe at the higher level, if

24:49

that's OK,

24:50

because you founded your own company

24:53

and it consists of a whole set of projects,

24:55

which on first glance seem to be almost unrelated.

24:58

They're mostly open source. There's

25:00

a GitHub repo that we'll put in the show

25:02

notes as well. But

25:03

the tagline there

25:04

is, Rebuilding Infrastructure for

25:07

the Age of AI.

25:08

So the AI really stands out

25:11

there. So what sort of AI

25:12

is that? And how do all these libraries relate

25:14

to it?

25:15

So you're right in the fact that

25:17

we have too many things that to

25:19

most people would seem unrelated. And

25:22

to be honest, this is just the tip of the iceberg. So

25:25

I started this company seven and a half years ago.

25:27

I've been working on it essentially every single day since,

25:30

day and night. And

25:32

when I left the university and focused exclusively

25:35

on this, this is when the true science began. I

25:37

was reading like 1,000 papers a year, almost

25:40

everything that was published on the computer

25:42

science part of archive or

25:45

the AI and ML parts. I

25:47

at least glanced through it. And a lot of things

25:49

were implemented. So our

25:51

open source libraries include UStore,

25:54

which is an open source multimodal

25:57

database that abstracts away the underlying

25:59

key-value store. So kind

26:01

of you can take any key value store and you

26:03

add the database logic on top of it such as

26:05

being able to store different forms of data, query

26:08

them, add wrappers and drivers for different

26:10

programming languages like Python. Then there

26:13

is a library called UCall, which is essentially

26:16

a kernel bypass thing, a tiny

26:19

single or like two-file project

26:21

that took me like a couple of days to write

26:24

and one of my junior developers maintains

26:26

it now adding TLS support which

26:29

seems to be like one of the fastest networking

26:31

libraries built or at least like within

26:33

the C, C++ open source domain. It's

26:36

a JSON-RPC library that uses

26:38

the most recent io_uring 5.19 features, I mean

26:40

kernel 5.19 and

26:43

higher and it also uses

26:45

simdjson and a bunch of other SIMD-accelerated

26:48

libraries for processing the packets.

26:51

There is also a project called UForm, which has

26:53

nothing to do with C or C++. It's

26:56

a pure Python thing but kind of uses

26:58

PyTorch in a nice way to be

27:01

able to run multi-modal AI models

27:03

with mid-fusion. It's essentially the kind

27:05

of setup when you have multiple transformers and

27:08

the signal between them is kind of exchanged

27:10

before it reaches the top output

27:13

of a neural network. There's

27:15

a bunch of other libraries as well but this

27:17

is just the open source stuff. Almost every one of those

27:19

has like a proprietary counterpart that

27:22

is far far more advanced. It took years

27:24

to build, has tons of

27:26

assembly in it. A lot of GPU accelerated

27:29

code includes such

27:31

remote things as Regex parsing libraries,

27:34

probably the second fastest in the world after Intel

27:36

Hyperscan, and the only

27:38

one that also exceeds 10 gigabytes per second per

27:40

core. Then there is

27:43

a graph library, one of the largest

27:45

graph algorithm collections. There

27:48

is a BLAS library, so basically

27:51

linear algebra subroutines, but unlike

27:54

classical BLAS we don't just target dense-dense

27:56

matrix multiplications. We also do sparse-sparse

27:59

too. And we do it both

28:01

on the CPU side and the GPU side, and

28:03

we also do it in a manner invariant

28:06

to the semiring. So let's say

28:08

there's this notion of algebraic

28:10

graph theory, where you can

28:13

replace a lot of graph processing algorithms with

28:15

matrix multiplications if you know how

28:17

to parameterize them as matrix multiplication

28:19

kernel, replacing the dot product,

28:21

essentially the plus and the multiply operations, with

28:24

something else, a different ring or a different semiring.

28:27

So a cool thing that most CS

28:29

people often don't realize, even

28:31

though they are familiar with the subject, is

28:33

that if you look at the Floyd-Warshall algorithm

28:36

on the Wikipedia page,

28:38

it's just three nested for loops over

28:41

i, j, and k. And then within it,

28:43

it's almost exactly the same as matrix multiplication.

28:46

But instead of addition and multiplication

28:48

on scalars, it's doing the minimum and addition

28:51

operation. So if you take the

28:53

matrix multiplication kernel, design

28:55

it as a template, and then

28:57

pass a different operation into it rather than

28:59

plus and multiply, your matrix multiplication

29:02

kernel immediately becomes a graph processing algorithm.
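(To make that concrete, a hedged sketch of a kernel templated on the two operations; with plus and times it is plain matrix multiplication, and with min and plus it relaxes shortest paths:)

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // c must be pre-filled with the identity of `add`:
    // 0 for plus, +infinity for min. Matrices are n x n, row-major.
    template <typename scalar_t, typename add_t, typename mul_t>
    void semiring_matmul(std::vector<scalar_t>& c,
                         std::vector<scalar_t> const& a,
                         std::vector<scalar_t> const& b,
                         std::size_t n, add_t add, mul_t mul) {
        for (std::size_t i = 0; i != n; ++i)
            for (std::size_t j = 0; j != n; ++j)
                for (std::size_t k = 0; k != n; ++k)
                    c[i * n + j] = add(c[i * n + j], mul(a[i * n + k], b[k * n + j]));
    }

    // Shortest-path flavour: repeatedly multiplying a distance matrix by
    // itself under (min, +) converges to all-pairs shortest paths.
    // semiring_matmul(next, d, d, n,
    //     [](float x, float y) { return std::min(x, y); },  // "plus" is min
    //     [](float x, float y) { return x + y; });          // "times" is plus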

29:05

So there's a lot of such seemingly

29:07

unrelated things, but my vision from the very

29:09

beginning was that those

29:12

can compose into AI of scale

29:15

that we've never seen before. So all of

29:17

the modern AI is almost exclusively

29:19

built on dense matrix multiplications

29:22

and very simple feed-forward layers

29:24

or very basic attention.

29:26

And then another part is that it

29:28

almost exclusively works on the stuff

29:30

that fits in memory or fits

29:33

within VRAM, so the memory attached

29:35

to your GPU. And those volumes

29:37

are tiny. So in

29:39

our case, I was always curious, how

29:42

can I optimize and vertically

29:44

integrate the whole stack so that even

29:47

external storage, such as the modern

29:49

high bandwidth SSDs, can

29:51

actually become part of your AI pipeline,

29:54

streaming the data, reorganizing the data,

29:56

let's say, stored on SSDs with

29:58

the participation of AI, or let's

30:00

say helping to train AI by having a much faster

30:02

data lake. So the idea

30:05

there is that modern CPUs

30:08

can have what, like one, two terabytes of RAM

30:10

per socket, but they can have

30:12

also like 400 terabytes of NVMe

30:14

storage attached to

30:17

that same socket. So like if you're not

30:20

able to address and properly use external

30:22

memory, you're really limiting yourself

30:24

to like very small part of

30:27

what's accessible.

30:28

And the additional part that kind

30:30

of adds up here is that, yes,

30:33

you can build up a good data lake to help

30:36

with AI and the AI industry, but

30:38

you can also use AI to improve

30:40

the data lake itself. It's like

30:42

very reminiscent of the Silicon Valley

30:45

series, like the guys were building compression

30:47

to kind of build AI

30:50

or and then ended up building AI

30:52

to build compression or vice versa, I think

30:54

kind of had the different order.

30:57

In our case, like if you look at the databases

30:59

like Postgres, MongoDB and many others,

31:02

they focus almost exclusively

31:04

on deterministic indexing, such as

31:06

inverted indexes or something

31:09

like that, where you just explicitly

31:11

search by a specific key or a specific string.

31:14

And you only search for exact matches or

31:17

even at best fuzzy string matches. But

31:19

with AI, we can actually search on structured

31:21

data. So by combining vector

31:23

search, by combining a database and

31:25

a multimodal pre-trained AI, what

31:28

we can do, we can actually embed

31:30

some media documents into a vector space,

31:33

and then just search through those vectors, finding

31:35

all forms of potentially unrelated content,

31:38

or hopefully related content, but across different

31:41

modalities. So being able to search

31:43

videos with a textual query, being

31:45

able to search images with a video query,

31:48

being able to search JSON documents

31:50

with a video query and so on. So I

31:52

guess this kind of gives you a glimpse of how everything

31:54

connects together and hopefully makes the list

31:57

make a little bit of sense.
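(The mechanical core of that kind of cross-modal search is small: once every document, image, or video is an embedding, candidate similarity is just a cosine between vectors. A minimal sketch, assuming equal-length, non-zero vectors:)

    #include <cmath>
    #include <cstddef>
    #include <vector>

    float cosine_similarity(std::vector<float> const& a,
                            std::vector<float> const& b) {
        float dot = 0, norm_a = 0, norm_b = 0;
        for (std::size_t i = 0; i != a.size(); ++i) {
            dot += a[i] * b[i];
            norm_a += a[i] * a[i];
            norm_b += b[i] * b[i];
        }
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }

(The hard parts are elsewhere: training encoders whose embeddings share one space across modalities, and indexing billions of such vectors.)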

31:59

Yeah, that sounds very good.

31:59

Very, very interesting. Thank

32:02

you. We're gonna dig into a couple of those libraries

32:04

in a minute, I think. But

32:06

just taking a step back, you mentioned Python

32:08

a couple of times there. Seems like you're continuing

32:10

the AI ML tradition of having

32:12

a Python front end with

32:14

C++ doing the heavy lifting. Is that fair

32:16

to say?

32:17

Yeah, I guess everyone just converged to

32:19

this idea that this is the way to go. For

32:21

many years, I'd

32:24

confess I wasn't a super

32:26

big fan of Python. I was

32:28

too obsessed with performance to

32:31

touch a tool that kind of almost entirely

32:34

abandons the concept of performance. But

32:36

then I realized the value that it brings to

32:38

my life and my developer experience.

32:41

So I thought we should bridge

32:43

the two worlds and we are not the only company

32:45

doing this. So famously all AI and ML

32:47

frameworks are written in C++. But

32:50

at the front end, people kind of use Python

32:52

exclusively, almost

32:55

exclusively. And this kind of

32:57

spills outside of AI in all the data

32:59

science and data analytics tooling as

33:01

well. So NVIDIA is famously

33:03

one of those companies that builds

33:05

a lot of obviously GPUs and they

33:08

have CUDA as a language. They have a compiler

33:10

for CUDA, a lot of low level stuff. But

33:13

I would also say they have by far the best

33:15

tooling in the Python, on the Python level

33:18

to actually leverage those GPUs.

33:21

So you can take libraries like Pandas, NetworkX,

33:23

or NumPy, which are all targeting

33:26

only CPUs and are written purely in Python.

33:28

And you can replace those with libraries like cuDF

33:32

as a replacement of Pandas, cuGraph

33:34

as a replacement of NetworkX, and

33:37

CuPy as a replacement of NumPy. And

33:40

genuinely like this is some of the

33:42

best software I've ever used and kind

33:44

of a really good benchmark for us to

33:46

compete with in a sense that they do it

33:49

for parallel programming and

33:51

we do it for external memory.

33:54

So yeah, so I can see how that

33:56

sort of fits into the infrastructure story that

33:58

enables

33:59

better AI implementations.

34:01

So we'll

34:02

dig into some of those libraries in just a moment,

34:04

but just going to have a little break because this

34:06

episode is supported by JetBrains.

34:09

And JetBrains has a range of C++ IDEs

34:12

to help you avoid the typical pitfalls and headaches

34:14

that are often associated with coding in C++. And

34:18

exclusively for CPP cast, JetBrains

34:20

is offering a 25% discount for

34:23

purchasing or renewing the yearly individual

34:25

license on the C++ tool of your choice.

34:28

CLion, ReSharper

34:31

C++ or Rider. Use the

34:33

coupon code JetBrains for CPP

34:35

cast, all one word,

34:37

during checkout at JetBrains.com.

34:40

So there were a couple of projects on

34:43

the

34:44

Unum repo that jumped out at me

34:47

that

34:47

I just wanted to bring up. And the first one was

34:49

UCall, which I think you did mention earlier,

34:51

which claims to be, as you mentioned

34:54

yourself, a JSON RPC

34:56

library that is up to,

34:58

and I don't know how much work up to is doing here, but 100% faster

35:00

than FastAPI.

35:02

100x, not 100%. So this is important.

35:06

Yeah, no, 100 times.

35:08

Yeah. Now, I know a little bit about FastAPI, although

35:11

I haven't actually used it myself, but

35:12

I do have a few web servers that I've

35:15

written and maintain,

35:16

some of them just serving JSON,

35:18

built on Python web frameworks. And I usually

35:21

use Flask.

35:22

And I know that FastAPI

35:25

is also a Python web framework.

35:26

It's meant to be significantly faster than Flask.

35:29

And I've not heard people saying that

35:31

Flask is a particularly slow

35:33

framework on its own. So if you're saying you're 100% faster, sorry,

35:36

I'll say it again, 100 times faster

35:40

than FastAPI, that sounds

35:42

like quite a big claim. So how do you actually

35:44

achieve that? Sure, I'll be

35:46

happy to explain. So at

35:49

first, people may think that if a project is

35:51

popular, then it kind of optimizes

35:53

something, and it's really good at something. Even

35:56

though FastAPI has fast in its name,

35:58

it's not particularly fast, to be

35:59

honest. So one

36:02

of the things that they do really well is

36:04

like they're very simple to use. They're

36:06

very developer friendly.

36:09

So you just put a Python decorator on top of

36:11

your Python function. And all of a sudden, this

36:13

is a RESTful web server. So

36:16

I guess maybe by fast, they meant that the

36:18

developer experience is fast, but not maybe the

36:20

runtime itself. So the story

36:22

is more or less the following. I was playing

36:24

with our neural networks. And

36:27

they're very lightweight. So we

36:29

looked at neural networks like OpenAI

36:32

clip. And we wanted to replace

36:34

those multimodal encoders with something that would

36:36

work much faster and can be deployed

36:38

on edge, maybe like even IoT

36:41

devices. So we really squeeze those transformers

36:43

made them a lot faster. And if you take

36:45

a server such as like the DGX

36:47

A100 by NVIDIA, you

36:50

will end up serving 300,000 or like 200,000 inferences per

36:52

second across the eight

36:57

GPUs of that machine. So this

36:59

is a very high mark for AI

37:02

inference. And the question

37:04

is like, how do you serve it? Because

37:06

the first idea is let's take the most

37:09

commonly used Python library for

37:11

web servers, let's connect it

37:13

to PyTorch or something else. And

37:16

let's just serve the embeddings. So

37:18

when I tried to do this, I wasn't actually

37:21

even on the ggx, I just took a MacBook.

37:23

And when I built up a server and just ran it on

37:25

my machine, it was an Intel Core i9. I think my

37:28

response latency was close to six milliseconds.

37:31

So just the client on the server on the same machine,

37:34

and I'm waiting for six milliseconds to get the response,

37:37

I was just shocked

37:39

by the result. So

37:42

obviously, there was a lot to optimize. And

37:44

then I thought like, how far can I go? I haven't

37:47

done much

37:49

networking development

37:51

in the last couple of years. But I've done a

37:53

lot of storage related stuff. And

37:56

I loved io_uring for all of its like

37:58

new advances and the performance that it brings. Of

38:01

course, sometimes we have to go

38:03

beyond that, so we also work with SPDK and

38:05

DPDK as pure

38:07

user-space drivers for kernel bypass, but

38:10

io_uring by itself is also pretty good. So

38:12

if you take a very recent Linux kernel

38:15

like 5.19, it adds

38:17

a lot of really cool features

38:19

for stateful networking. So

38:22

essentially, the idea is the following. Whenever you have a

38:24

TCP connection on the socket, you listen

38:26

for new requests and queries, and

38:28

whenever they come, you create

38:30

a new connection for every one

38:33

of the incoming clients.

38:35

One

38:38

of the system calls that you would oftentimes

38:40

do in this case gets you a new file descriptor

38:43

for the communication over a channel to a specific

38:45

client. One of the things

38:47

that io_uring in 5.19 brings is

38:50

a managed pool of those file descriptors

38:53

that can also be taken using the

38:55

io_uring interface without any system

38:57

calls. So with this out of the way,

39:00

almost

39:00

every system call that

39:02

we could have done that would have cost

39:04

an interrupt and a context

39:07

switch on the CPU side is now gone.
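(A minimal liburing sketch of that accept path, assuming kernel 5.19+ and liburing 2.2+; error handling and the per-client I/O are omitted, and this illustrates the mechanism rather than UCall's actual code:)

    #include <liburing.h>

    void accept_loop(int listen_fd) {
        io_uring ring;
        io_uring_queue_init(256, &ring, 0);

        // One multishot accept SQE keeps producing a CQE per client,
        // with no further accept() system calls from our side.
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);
        io_uring_prep_multishot_accept(sqe, listen_fd, nullptr, nullptr, 0);
        io_uring_submit(&ring);

        for (;;) {
            io_uring_cqe* cqe;
            io_uring_wait_cqe(&ring, &cqe);
            int client_fd = cqe->res; // a newly accepted connection
            // ... queue recv/send SQEs for client_fd here ...
            io_uring_cqe_seen(&ring, cqe);
        }
    }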

39:10

And even with a single server

39:12

thread,

39:13

we managed to get to 230,000 requests per second,

39:17

even on a machine with what are generally

39:19

considered efficiency cores

39:22

rather than high-performance cores. While

39:25

FastAPI was only serving 3000 responses

39:28

per second or requests per second. So 3000

39:31

to 230,000 is a

39:34

huge gap. But at this point, we're kind

39:36

of comparing an implementation in Python and

39:38

implementation in C.

39:40

So we wrote a pure CPython

39:42

layer as a wrapper

39:44

on top of our C library. The

39:47

result was that we kind of dropped from 230,000 to 210,000,

39:49

still a major improvement over

39:54

FastAPI. And

39:56

aside from FastAPI, it's also faster,

39:58

seemingly faster than most of the other networking

40:00

libraries, including gRPC, which

40:03

many people use as the go-to

40:05

high-performance RPC implementation. But

40:08

gRPC doesn't implement that

40:10

level of kernel bypass, let

40:13

alone the fact that parsing protocol buffers

40:15

is actually oftentimes slower than parsing

40:18

JSON with simdjson. We

40:20

win on both fronts, the packet processing

40:22

speed, and also the way we interact

40:25

with the socket. Here we go, 100x faster.

40:28

Nothing you said there sounds unreasonable,

40:30

but the numbers still sound too big. I'm

40:32

definitely going to be playing with UCall

40:35

on one of my little projects and see whether

40:37

that makes a difference because I'm looking to step

40:39

up the performance. Sure. Let us know. We would really

40:42

love feedback. Yeah. I will report back.

40:45

Thank you for that.

40:46

You mentioned the term multimodal

40:49

a few times there.

40:50

What is that exactly? I think that's like a

40:52

term of art in AI, isn't it?

40:55

Yeah. AI people use it a lot,

40:57

especially these days with what they call foundation

40:59

models, or the next

41:01

step of LLMs, large language models.

41:04

Just doing language is not enough these days.

41:06

People want multimodality,

41:08

which means essentially being able to work with multiple

41:10

forms of data at once, like images,

41:13

video content, audio content,

41:15

anything actually. An example

41:19

of multimodal AI would be something

41:21

like a text-to-image generation pipeline.

41:24

Like you put a text in

41:26

and you get an image. Another example

41:28

would be an encoder that understands

41:31

both forms of data and it produces

41:34

embeddings of vectors that can be compared

41:36

with each other. You can say if an image is

41:38

semantically similar to a textual

41:41

description that sits beneath it, for example,

41:43

like on the webpage. In

41:45

the context of, let's say, databases or

41:47

anything else, we also started

41:50

to use this term to make

41:53

the vocabulary a little bit more

41:56

universal across different parts of our repositories.

41:59

So a multimodal database for us would be

42:01

a database that across different

42:03

collections of the same store can

42:05

keep different forms of data without

42:07

sacrificing the remaining properties. And

42:10

the most important property for us in a database

42:12

would be transactions and support for like

42:15

ACID guarantees: atomicity, consistency,

42:18

isolation, and durability. So

42:21

if you can do a transaction where within

42:23

one transaction you are updating multiple collections

42:26

and in one of them you are storing a metadata

42:28

of an object and another one you're storing

42:31

maybe like a poster or a photo

42:33

of a specific document or something like that. And

42:35

if you can do it in one transaction with

42:37

all the guarantees included, this is multi-modal

42:40

for us.

42:41

Right, yeah, I think I follow

42:43

that.

42:44

And it's interesting you started talking about

42:46

databases there. I think you've done my transition

42:48

for me again because I was going to ask about another

42:50

one of your projects, which is UStore, which

42:53

at the time of writing I think on

42:55

your site is still somewhere called UKV.

42:57

Looks like you're in the middle of naming that.

42:59

So just in case people go looking for it and they

43:01

find UKV it's the same thing I believe.

43:04

Yeah, sure. Well

43:08

the readme describes it as a build your own database

43:10

toolkit. But also that

43:13

it's four to five times faster, at least

43:15

in your benchmarks,

43:16

than RocksDB.

43:18

And I hadn't heard of that so I had to go and look it up. But

43:20

it

43:21

sounds like RocksDB is meant to

43:23

be at least 34% faster than MongoDB.

43:26

So I'm sure that's something people have heard of to

43:28

get an idea now.

43:31

So we're talking about

43:32

almost an order of magnitude faster than something

43:35

like MongoDB.

43:36

So that again is very impressive.

43:38

How do you achieve that?

43:40

So there are

43:42

a couple of stories here and I've

43:45

done a really bad job naming some of

43:47

the projects and it really seems like

43:50

a bit convoluted, like too much is happening.

43:52

So let me just give you a bird's

43:55

eye view of how the storage

43:57

is built today. So let's say

43:59

if you use something like a distributed database.

44:01

You have the distributed layer at the very top,

44:04

which is responsible for consensus and the ordering

44:06

of the transactions. Then

44:08

whenever you choose the leader and the master

44:10

node, we can dive

44:12

deep into that specific node. And on that node,

44:15

you have essentially an isolated single instance

44:17

solution. Within the single instance

44:19

solution, what you have is a database

44:21

layer, a key value store layer, and a file

44:23

system layer. And beneath it is

44:26

the operating system and the block storage.

44:29

So we haven't reached the distributed

44:31

layer so far. We almost exclusively

44:34

focused on vertical scaling in most

44:36

of our projects. Even though,

44:38

as we've just mentioned, you call networking is

44:40

also important for us. It's just that

44:43

we take

44:44

certain steps in specific

44:46

order.

44:47

For now, distributed hasn't been part of

44:49

the agenda. It will be this year. So

44:52

what we've done, we've built up something

44:55

that remotely resembles the strategy of

44:57

Redis. So I guess everyone is

44:59

familiar with Redis. It's essentially like a hash

45:01

table on steroids. What

45:04

they've done, they kind of focused on building a key

45:06

value store. And they allow

45:09

a lot of different additional features, essentially

45:12

adding multimodality to

45:14

the underlying binary key value store. So

45:17

now, let's kind of disassemble this into parts. A

45:20

key value store is just an associative

45:22

container, like a hash

45:24

table or a binary tree, B-tree,

45:27

log structured merge tree, anything

45:29

actually. And Redis

45:32

added pieces such as Redis JSON,

45:34

Redis search, and Redis

45:37

graph as essentially forms of converting

45:40

different modalities of data and

45:42

kind of serializing them down into a key value

45:44

store. So every

45:47

modality is just like a feature of

45:49

the underlying storage engine.

45:52

So what has been happening

45:54

on our side,

45:55

we thought, oh, cool.

45:58

Let's take a key value store.

46:00

that we love building

46:02

and in our case it's called U-disc. Let's

46:05

take other key value stores and let's

46:07

create a shared abstraction. So that's why it was

46:09

briefly mentioned as build

46:11

your database toolkit. So we thought

46:14

if Redis knows how to abstract away

46:16

the key value store and

46:17

add a

46:19

lot of features on top of it, we can actually

46:22

do something similar and just give it out to

46:24

the world for everyone to use it. So

46:26

essentially you can take any key value store that you like,

46:29

and if by any chance you love designing

46:32

associative containers and you

46:34

code in C++, it's very easy for you to

46:36

actually build up your own hash table. Take

46:39

this project which is now called U-store and

46:42

use it as an intermediate representation

46:44

layer essentially or just like a C interface.

46:47

If you add the C interface on top of

46:49

your hash table or associative

46:51

container that would be ordered,

46:53

you're getting a lot of support for different

46:55

forms of data on top of it.
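(To illustrate the shape, a purely hypothetical C interface over an associative container; these names are invented for this example and are not UStore's real API:)

    #include <cstddef>

    extern "C" {
        typedef void* kv_store_h;

        kv_store_h kv_open(char const* config);
        int  kv_put(kv_store_h store, void const* key, std::size_t key_len,
                    void const* value, std::size_t value_len);
        int  kv_get(kv_store_h store, void const* key, std::size_t key_len,
                    void** value, std::size_t* value_len);
        void kv_close(kv_store_h store);
    }

(Anything that can implement these few calls, whether a hash table, a B-tree, or an LSM tree, could then inherit the document, graph, and vector features layered on top.)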

46:57

You also get bindings and SDKs

47:00

for languages like C,

47:01

C++, Python,

47:04

as well as Golang and Java that have partial

47:07

support. We also had some contributions

47:09

from the community, people trying to implement Rust

47:11

bindings around it. So you've

47:14

briefly mentioned some of the benchmarks and the performance

47:16

numbers,

47:17

and I can elaborate on them as well. So

47:20

in our case,

47:22

one thing that really

47:25

surprised me a few years ago was

47:27

that I was so focused on AI and

47:30

compute and high-performance computing, I

47:32

didn't really think much about storage. When

47:35

I just tried to bring up the systems

47:37

together and compose them into one

47:39

product or one solution, I still

47:41

needed some storage. So I took

47:43

RocksDB, which is an open source

47:45

key value store by Facebook, which

47:48

seemingly is the most commonly used

47:50

database engine today. And whenever

47:52

there's a new database company, there is a very

47:55

non-zero chance that

47:58

they're using RocksDB as their underlying engine.

48:01

So essentially what's happening is the database

48:03

is adding its own logic for specific workloads

48:05

such as processing graphs in the case

48:07

of Neo4j. And beneath

48:10

it what's happening

48:11

is that

48:13

this is all converted into binary

48:15

data and is stored in RocksDB or some

48:17

other key value store. So

48:19

in reality, at least from my perspective,

48:22

the absolute majority of work that has to

48:24

be done is in the key value store layer. This

48:28

specific example of Neo4j is like

48:31

one of my biggest pains, because

48:33

I've been always fond of graphs and

48:35

graph theory. And Neo4j is

48:37

kind of synonymous with graph and graph

48:39

databases today. This company

48:42

has raised $755 million. So

48:45

their product must

48:48

be as polished as possible. And they've

48:50

been around for over a decade.

48:52

But every single time that I try to run

48:54

this database, it crashes with classical

48:57

Java errors. And as a C++

48:59

community, it's almost our obligation to

49:01

kind of make jokes about the Java

49:03

runtime and all

49:06

the garbage collection issues that people face

49:09

in that land. So I was facing them

49:11

all the time. There wasn't a case where I

49:13

would try to put a graph even remotely

49:15

interesting to me in terms of size,

49:17

and Neo4j wouldn't crash.

49:20

Either I, after 20 years of programming,

49:23

am so bad that I cannot even like start up a

49:25

database, or something is really

49:27

wrong on their infrastructure level.

49:29

And something was actually wrong. Until a couple of years ago, they were not using RocksDB; they had an internal key value store. And in 2019, they decided to switch to RocksDB as a new, faster engine. But even before they switched to RocksDB, similar to companies like CockroachDB, Yugabyte, and countless others, half of which have this premise of taking Postgres and putting Postgres' query execution engine on top of RocksDB, we had already realized that RocksDB is way too slow for us.

50:04

So our ambitions were much higher than even the best expectations that other databases had for their future a few years down the road. So we kind of went into the lab. I moved to Armenia, of all places. We ordered a bunch of super high-end equipment: we run on the fastest SSDs on earth, 64-core liquid-cooled CPUs, Ampere GPUs for the last couple of years, and 200-gigabit InfiniBand networking. And we used all that state-of-the-art hardware to actually push the limits of what software can do. Because when your hardware is so freaking fast, every single bottleneck that remains is on your side, the software developer's side.

50:47

And what we've done is create a key value store that is faster than RocksDB today in almost every single workload. Today, the only workload in which we're just a little bit slower is range scans, but this is relatively easy to fix in upcoming versions. And in some crucial forms of workloads, such as batch insertions and batch read operations, when you randomly gather or scatter tons of information onto or from persistent memory, we are five to seven times faster than RocksDB, which is a number so absurd that most companies, especially smaller startups, didn't believe it is possible. The only companies that realized it, and were familiar with my prior work in previous years, are generally super large, trillion-dollar-plus American tech companies. They knew some of my proprietary work before that, and when they started testing it last year, they were just shocked that this is even possible.

51:48

So our database engine can be faster than the file system. And the only company that has ever shown that these numbers are possible was Intel, a couple of years ago, on their Optane SSDs. They did it using SPDK, which is a user-space driver that they design and maintain, and they reached 10 million operations per second, most likely with 24 SSDs. But that is a purely synthetic workload. We've managed to reach 9.5 million operations per second on our lab setup, with Intel people present and validating those numbers, on a setup with three times fewer SSDs, and with not synthetic read and write operations but actual database operations. So this was an incredible milestone last year, a culmination of seven years of my work, investments, and teaching experience, I guess.
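For a taste of the style of I/O that numbers like these require (deep queues and batched submissions, rather than one read per syscall), here is a minimal sketch using Linux's io_uring via liburing. To be clear, this is a generic illustration, not Unum's engine and not Intel's SPDK; the file path and sizes are made up.

    // Generic batched-read sketch with io_uring (liburing). Build with -luring.
    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        io_uring ring;
        if (io_uring_queue_init(256, &ring, 0) != 0) return 1; // queue depth 256

        int fd = open("/tmp/testfile", O_RDONLY); // stand-in for an NVMe device
        if (fd < 0) return 1;

        constexpr int batch = 64;
        constexpr unsigned block = 4096;
        static char bufs[batch][block];

        // Queue a whole batch of reads, then cross into the kernel once.
        for (int i = 0; i != batch; ++i) {
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], block, (unsigned long long)i * block);
        }
        io_uring_submit(&ring); // one submission covers all 64 reads

        // Reap completions; they may arrive in any order.
        for (int i = 0; i != batch; ++i) {
            io_uring_cqe* cqe = nullptr;
            io_uring_wait_cqe(&ring, &cqe);
            if (cqe->res < 0) std::fprintf(stderr, "read failed: %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }

The same shape, pushed further with registered buffers and files, or with a polled user-space driver like SPDK, is how a handful of threads can keep eight SSDs saturated.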

52:42

Very impressive numbers, of course, and I'm definitely going to try them out myself to see how they stack up. I'm sure a lot of people listening will be fascinated by this, just as we've been while discussing it. But since we are a C++ podcast, and I know you've mentioned the use of C++ in many places: is there anything you can say about how you've used C++ to achieve these results?

53:04

Yeah, sure. So C++ is essentially the only language where I can do this. There is no way around it. I tried other languages; C++ wasn't the first language I used, and it wasn't the last one that I adopted or tried. Almost every one of those projects is implemented in C++. Our stores are implemented in C++; every single one of our internal libraries is implemented in C++.

53:30

But as a person who's been doing C++ for well over 10 years now, I think it's not a single language; it's a pile of languages mixed together. And every more or less senior person picks their own subset of what they allow within the code base. I guess most of the people who stay in the profession this long develop a taste and a lot of strong opinions about the stuff they like and dislike. So I am this kind of... code Nazi within my team, who is super aggressive about not allowing some features of the language to be used, while pushing everyone to adopt other features that they may not have been familiar with from school.

54:14

So in our case, the things that I don't like and don't use would oftentimes be related to dynamic polymorphism, exceptions, and related stuff. I guess you can understand that, especially in a low-latency environment. We really hate memory allocations. We don't use new or delete. It's very important for us to have full control of the memory system. We use NUMA-aware allocators; we design some of them ourselves.
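As a sketch of what "no new, no delete" can look like in practice, here is a minimal bump arena. A real NUMA-aware allocator would additionally bind each arena's memory to a specific node; that part is omitted, and the code is illustrative rather than Unum's actual allocator.

    // Minimal bump-pointer arena: one upfront allocation, no per-object new/delete.
    #include <cstddef>
    #include <cstdlib>

    class arena_t {
        std::byte* begin_ = nullptr;
        std::size_t capacity_ = 0;
        std::size_t used_ = 0;

      public:
        explicit arena_t(std::size_t cap)
            : begin_(static_cast<std::byte*>(std::malloc(cap))), capacity_(cap) {}
        ~arena_t() { std::free(begin_); }

        arena_t(arena_t const&) = delete;
        arena_t& operator=(arena_t const&) = delete;

        // Align (power-of-two alignment assumed), bump, return;
        // individual objects are never freed.
        void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
            std::size_t aligned = (used_ + align - 1) & ~(align - 1);
            if (aligned + size > capacity_) return nullptr; // caller handles exhaustion
            used_ = aligned + size;
            return begin_ + aligned;
        }

        // Everything is released at once, e.g. at the end of a batch or request.
        void reset() { used_ = 0; }
    };

Allocation becomes an add and a compare, and lifetime is managed per batch instead of per object, which is exactly what you want on a hot path.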

54:43

But then on the other side, there are features that we can't live without. Essentially, being able to compose very low-level abstractions with super high-level abstractions is the special thing about C++. As I've mentioned, we oftentimes build function objects: essentially a templated structure with an overloaded call operator, the open-brackets-close-brackets operator. What we then do is instantiate this template in a few different forms and specialize it for all kinds of different assembly targets. We would have an implementation for x86 and for ARM, and within x86 and ARM we also target a few different generations of CPUs. I guess this is one of the things that we really love.
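A hedged sketch of that pattern follows: a templated function object specialized per target. The tag types and the dot-product kernel are invented for the example; real kernels are of course more involved.

    // Function-object pattern, specialized per instruction-set target.
    #include <cstddef>

    struct target_serial_t {};
    struct target_avx2_t {};
    // ... target_neon_t and per-generation tags would follow the same shape.

    template <typename target_at>
    struct dot_product_gt; // primary template intentionally left undefined

    template <>
    struct dot_product_gt<target_serial_t> {
        float operator()(float const* a, float const* b, std::size_t n) const {
            float sum = 0;
            for (std::size_t i = 0; i != n; ++i) sum += a[i] * b[i];
            return sum;
        }
    };

    #if defined(__AVX2__)
    #include <immintrin.h>
    template <>
    struct dot_product_gt<target_avx2_t> {
        float operator()(float const* a, float const* b, std::size_t n) const {
            __m256 acc = _mm256_setzero_ps();
            std::size_t i = 0;
            for (; i + 8 <= n; i += 8) // 8 floats per step
                acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                                       _mm256_loadu_ps(b + i)));
            alignas(32) float lanes[8];
            _mm256_store_ps(lanes, acc);
            float sum = 0;
            for (float lane : lanes) sum += lane;
            for (; i != n; ++i) sum += a[i] * b[i]; // scalar tail
            return sum;
        }
    };
    #endif

A thin dispatcher can then pick the right instantiation at runtime after querying the CPU, so one binary carries kernels for several CPU generations.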

55:35

Another thing is that we always stick to the newest compiler. In our case, this is mostly GCC and LLVM. We also use NVC++ and NVCC, and we occasionally use the compilers from Intel's oneAPI performance toolkit. I guess they are as bad at naming as we are, renaming them almost every year; I don't know which name, ICC or ICX, they go by now.

56:05

So those things are crucial. Using a recent C++ standard is also important, because when you do a lot of templates and metaprogramming in C++11, it's constant std::enable_if. Once I start remembering all those horrors of 2011 and 2012, I almost lose consciousness. And then, when if constexpr appeared with C++17, we immediately adopted it. We now use C++20 where we can.
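For anyone who did not live through the enable_if era, a small generic before-and-after (not taken from their code base) shows why if constexpr, which arrived in C++17, was such a relief:

    #include <type_traits>

    // C++11: two overloads, selected via std::enable_if in the return type.
    template <typename scalar_at>
    typename std::enable_if<std::is_integral<scalar_at>::value, scalar_at>::type
    twice(scalar_at v) { return v << 1; }

    template <typename scalar_at>
    typename std::enable_if<std::is_floating_point<scalar_at>::value, scalar_at>::type
    twice(scalar_at v) { return v * 2; }

    // C++17: one template; the untaken branch is discarded at compile time.
    template <typename scalar_at>
    scalar_at twice17(scalar_at v) {
        if constexpr (std::is_integral_v<scalar_at>)
            return v << 1;
        else
            return v * 2;
    }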

56:42

Some features, unfortunately, don't work for us for now. Coroutines still allocate, and when you reach 10 million IOPS on eight SSDs, or aim to get to, let's say, 20 million IOPS, heap allocations are not good. So we cannot use coroutines there; we have to rely on pure C interfaces. But overall, I would claim one more time that C++ is essentially the only language where we can achieve this.

57:13

Yeah, a lot of what you said sounds quite familiar to me, coming from an audio processing background: not wanting to do allocations, avoiding branches and runtime polymorphism, writing your own allocators, and all of that stuff; all the low-latency stuff sounds familiar. I guess the big difference is that in music software you typically can't use the latest and greatest compilers, because, A, you typically have to ship on macOS, and Apple Clang is a bit behind; and B, if you're shipping an audio plugin, you typically have to support older versions of macOS, so you're also constrained in what standard library versions you can use, because the stuff might just not be there on an older macOS version. I guess you don't have any of those problems, so you can really take full advantage of the latest and greatest.

59:59

...this, and ask a completely different question, if you don't mind. So we talked a little bit about AI, and obviously you're working on the kind of plumbing that makes all of this work. But if you zoom out really far: something that I found really striking over the last year or so is how AI systems like ChatGPT or DALL-E or Midjourney have transformed how people do things. And I wonder what your thoughts are on this latest generation of AI. Are they going to wipe us all out as humanity within the next few years? Is AI going to rule the planet? Or do you have any thoughts on that?

1:00:41

Well, I wouldn't be that pessimistic. Of course, I just had to say that, because I'm a massive science fiction nerd, and it's the kind of thing that people have been worrying about for quite a long time. One part we do have to take seriously is the fact that work is changing and jobs will be replaced, obviously. Some people are very frightened by this, and it's understandable; change is always frightening. But on the other side, we can now get much more efficient with AI, and people can unlock a lot more of their creativity. So there is a lot of opportunity for people who may get replaced now to actually adopt a new skill, which will not be easy, obviously. And we have to be compassionate with them and help them adopt AI to get into a new labor market.

1:01:31

But in general, I was always optimistic about AI. I think people are perfectly fine finding ways to kill each other even without AI. So if there's something coming to kill us, I think it's more likely ourselves rather than an artificial form of intelligence. I would definitely bet on humans any time of the day, if this question is asked. But on the opposite side...

1:01:59

And just looking back

1:02:02

on the last couple of months on the chat GPT

1:02:04

release,

1:02:06

I would say for people who are inside

1:02:08

the industry and have been pre-training actively,

1:02:11

and there are not too many teams like that. So there's

1:02:14

one major team, one major cluster

1:02:16

that is the US and maybe UK with

1:02:19

DeepMind, another cluster is maybe

1:02:21

like Russia and now South Caucasus

1:02:23

where a lot of this talent has moved, ourselves

1:02:26

included. And another major cluster would

1:02:28

be China where people actually have the

1:02:30

resources to pre-train those models because this is not

1:02:32

cheap. Like you need thousands of GPUs,

1:02:35

this puts your budget,

1:02:36

starting budget at $100 million and

1:02:38

above. So like small

1:02:40

labs cannot really compete within this

1:02:44

modern heavyweight category.

1:02:47

So there are not many teams, but the people who are inside those teams have been familiar with the incremental steps and the incremental progress that was happening. I don't think the ChatGPT release was a shocker for many of them; many of them have been working on similar technology and have seen every preceding paper that came before it. Still, it's lovely to see the attention on the industry. I've seen a lot of hype cycles in the last couple of decades. Crypto was the most recent one, and I think people are still confused about any application or any effect that crypto can have on our everyday lives. But with AI, we have such an insane level of adoption already: something like a billion people have interacted with AI over the course of the last few months who had never touched AI or AI-related tools before. So I'm very passionate.

1:03:39

Right. Yeah, I mean, what you say about the job market, I've actually been thinking about that too, because I'm a developer advocate. So the things that I do include writing blog posts or recording videos about how to do something. And basically, now I can let ChatGPT write the script for the video, and then I can train an AI to read it out in my voice, so basically I don't have to do anything anymore. So that's kind of an interesting thought.

1:04:07

Okay, shifting gears completely again, because you mentioned the South Caucasus and how there's a lot of talent there. Your bio says that you actually founded a C++ meetup in Armenia, and that's obviously something I'm very excited about. So can you tell us just a little bit about the C++ scene in Armenia and that meetup that you've started, and how that's going?

1:04:30

Yeah, sure. It's actually absolutely wonderful. When I moved to Armenia a couple of years ago, as a person who wasn't born or raised there (I just had some ancestry), most people thought I was crazy to do this, because I had a few other opportunities in other countries where I could go. But now they tend to realize what an undervalued gem Armenia is within the region. Essentially, in Armenia we have a ton of hardware and chip design companies. Armenia has one of the largest offices of Synopsys, the chip design EDA company based in the United States; their second largest office, as far as I know, is in Armenia, and the third largest is in India, and India has a 750 times larger population than Armenia. Nvidia and Mellanox now also have a presence in Armenia; their office is only five minutes' walking distance from mine. There's AMD and Xilinx; there's Siemens EDA. And whenever you hear "chip design" and "hardware", you immediately get that these are low-level people, and it's likely that they use low-level languages. So obviously C++ is a major part of their professional life, and sometimes now part of their after-work interactions.

1:05:46

So the problem I saw when I arrived was that there were a lot of professionals who use C++ daily, but they are not always familiar with the newest standards; they don't meet together too often to discuss how people tackle different problems; and the overall exchange of ideas between junior developers and senior developers is not as rapid as, let's say, within the United States or within Russia, places where the developer ecosystem is much more developed. So I thought it made sense to help ignite this activity a little bit. In the last couple of years, we had maybe six live meetings. We grew from just 10 attendees to maybe 750 members who have gone through those meetups and our groups, and who chat and discuss things in our vicinity. But there are definitely a lot more developers who do C++ but haven't been part of a community so far; I guess there are a few thousand more.

1:06:43

And overall, we now have a decacorn in Armenia, a unicorn, and five to seven other companies that are about to become unicorns. This density, in a city with less than a million people, dwarfs, or at least competes with, maybe half of Scandinavia, if not all of Scandinavia combined. So I'm very excited. I really want to invite everyone to come visit us in our country in the South Caucasus, and I'll try to do my part in this. Hopefully next year, just like Phil is organizing C++ on Sea, we'll do something similar. We were not lucky enough to get access to a sea, but we have beautiful mountains; maybe we should call our conference next year "C++ in the Mountains". The capital already sits a kilometer above sea level, but we can go even higher than that, somewhere in a beautiful place with a beautiful view, and just chat about C++ with all the brightest from all over the world. So come visit us.

1:07:40

So, Ash, that sounds absolutely amazing. I've never actually been to Armenia, but I've always wanted to go. So if something like that were to happen there, I would totally show up. I think that sounds really exciting.

1:07:53

And we'd be very excited to have you.

1:07:56

And if we have listeners in Armenia who don't know about your meetup, I'm sure there's a link you can give us that we'll put in the show notes, and they can tie up with you and obviously spread the word, because there are obviously a lot of technical people there who enjoy the show.

1:08:12

Thank you very much for sharing all this with us. We've run way over time once again, but is there anything else you want to tell us before we let you go?

1:08:21

Just that both of you are amazing hosts. I'm happy to be here, and I would be happy to chat about any one of our projects, on any podcast or within our Discord groups. A lot of the projects are open source. Come try them. Share your experience. Don't hesitate to ping us whenever you see bugs, because there are a lot of them, I believe. Especially with the build and compilation times; sometimes the packages are a bit outdated, but we're doing the best we can to keep the best software, the fastest software, always available to our users.

1:08:57

Where can people reach you, Ash?

1:08:59

I have accounts that I often check and use on places like LinkedIn, GitHub, Facebook, and Twitter. I had essentially read-only accounts on Twitter for a few years, but considering how many tech people are on Twitter, I guess I have to change my policy in that regard and start becoming more active. Again, my name is the same everywhere: Ashot Vardanyan. But aside from this, there's also a Discord channel that you can find by opening any one of our open source repositories. There are a few icons at the top, and one of those is a Discord link, which you can click to connect not just with me, but with every one of the engineers on my teams in Armenia and abroad.

1:09:43

Thanks, Ash. We'll put some of those links into the show notes as well.

1:09:48

Lovely, guys. Pleasure talking to you.

1:09:51

Thank you so much, Ash, for being a guest today.

1:09:56

Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a guest or topic; we'd love to hear about that too. You can email all your thoughts to feedback@cppcast.com. We'd also appreciate it if you can follow CppCast on Twitter or Mastodon. You can also follow me and Phil individually on Twitter or Mastodon. All those links, as well as the show notes, can be found on the podcast website at cppcast.com. The theme music for this episode was provided by podcastthemes.com.
