Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
0:00
Episode 359 of CppCast
0:03
with guest Ashot Vardanyan recorded
0:05
24th of April 2023.
0:08
This episode is sponsored by JetBrains,
0:11
smart IDEs to help with C++,
0:13
and Sonar, the home of clean code.
0:31
In this episode, we talk about some
0:33
new blog posts, the
0:35
annual developer survey, and
0:38
Compiler Explorer support in Sonar.
0:43
Then, we are joined by Ashot Vardanyan.
0:47
Ashot talks to us about AI and improving
0:50
the infrastructure that it runs on.
1:00
Welcome to episode 359 of
1:02
CppCast, the first podcast for
1:04
C++ developers by C++ developers. I'm
1:07
your host, Timur Doumler, joined by
1:09
my co-host, Phil Nash. Phil,
1:11
how are you doing today? I'm
1:13
all right, Timur. Just back from the ACCU conference
1:16
last week, which I know you were at too,
1:18
because I saw you there. So how about you?
1:20
Back at home in Finland now? Yeah, exactly.
1:22
I arrived last night. I'm still quite
1:25
tired from the whole thing, but kind
1:27
of recovering. Yeah,
1:28
it was an awesome conference.
1:30
Yeah, things do seem to be back to sort of almost
1:32
normal levels. Last year was definitely
1:35
way down on attendance. So it's interesting to see
1:37
conferences getting more back to normal. So quite
1:40
hopeful for C++ on Sea this year.
1:43
All right. At the top of every episode, I'd
1:45
like to read a piece of feedback.
1:46
This time, we got an email from Peter, who
1:48
was commenting about episode 356 with
1:51
Andreas Weis about safety-critical C++.
1:54
Andreas had said that the model implies
1:57
waterfall. Peter writes, Actually,
2:00
it's not a very strong implies. Waterfall
2:03
is a way to organize activities that implies a strict
2:06
ordering and stage gates, etc.
2:08
The V-model talks about appropriate levels
2:10
of system decomposition and having tests that correspond
2:12
to the elaborated requirements at each level.
2:15
The two concepts are orthogonal to each other, although
2:17
many organizations using the V-model do
2:20
follow waterfall. As a counterexample,
2:22
there is a standard called AAMI-TIR45
2:26
called Guidance on the Use of Agile Practices
2:29
in the Development of Medical Device Software that
2:31
describes how to practice agile development
2:34
in a way that the FDA will accept. This
2:36
features the V-model but no waterfall. Well,
2:39
thank you very much Peter for the clarification. It's
2:41
much appreciated and we will put the link to
2:43
AAMI-TIR45 in
2:45
the show notes.
2:46
I'm sure everybody's familiar with that document already,
2:48
but yeah, we'll put it in the show notes.
2:52
Talking of feedback, we actually got quite a bit
2:54
of feedback, still more feedback
2:56
about the ongoing RSS
2:58
feed saga.
3:00
It does seem that it wasn't quite as fixed as
3:02
I thought it was last time. And
3:04
the weird thing is that different people were seeing
3:06
different behaviours. Some
3:08
weren't seeing the last episode with Matthew
3:10
Benson. Some weren't seeing the episode
3:13
before that with Herb Sutter. Some
3:15
were seeing that one twice and some weren't seeing
3:17
either of them. So after
3:19
a bit of digging, I discovered that both of
3:21
those episodes were missing a GUID, which
3:24
is actually not strictly required by the RSS
3:26
spec, but
3:27
many clients do rely on it. And
3:30
you can see why that
3:31
may cause different behaviours in different clients.
3:33
So hopefully that explains it all now.
3:36
I've added those GUIDs back in as
3:38
well as an extra check just to make sure that
3:40
doesn't happen again, hopefully.
3:42
And I've checked with all of the clients that I know about
3:44
and they all now seem to be complete and up to
3:46
date everywhere. So again, sorry
3:48
about that, but if you do still see anything
3:51
not quite right with the feed, do
3:53
please continue to let us know. We want to make sure
3:55
that
3:55
we get it right.
3:57
That also means that some people may
3:59
now have one or two older episodes
4:02
added back in their feed.
4:04
So
4:05
if you don't recognize the episode
4:07
with Herb Sutter or Matthew Benson,
4:09
or you think you've maybe missed an episode somewhere, do have a
4:11
look back in your podcast player
4:13
to see if it's back in your history unplayed.
4:15
You may have a bonus episode, so maybe
4:18
there's a plus side to this as well. But hopefully that's
4:20
all sorted now.
4:22
Thank you Phil for fixing all of this. I don't really know
4:24
how any of this works. Me neither,
4:26
apparently. Really appreciate
4:28
you sorting that out.
4:30
We'd like to hear your thoughts about the show. You
4:32
can always reach out to us on Twitter or Mastodon
4:35
or email us at feedback at cppcast.com.
4:39
Joining us today is Ashot Vardanyan,
4:41
also called Ash. Ash is
4:43
the founder of Unum and the
4:45
organizer of Armenia's C++ user group.
4:48
His work lies in the intersection of theoretical
4:50
computer science, high performance computing,
4:53
and systems design, including everything
4:55
from GPU algorithms and SIMD assembly
4:57
for x86 and Arm, to drivers and
4:59
Linux kernel bypass for storage and
5:01
networking I.O. Ash, welcome to the show.
5:03
Hello
5:04
guys, happy to be here.
5:06
Good to have you here. So your bio is really
5:08
quite interesting,
5:09
but I did actually find another bio
5:12
from you with most of the
5:14
same information on your GitHub page.
5:17
It also says there that you're an artificial intelligence
5:19
and computer science researcher. And
5:21
it also says, and that really caught my attention,
5:23
you have a background in astrophysics.
5:26
Now I have a background in astrophysics too, so I'm
5:28
very curious what your astrophysics background is all
5:30
about. Well, if only I remembered
5:32
much. So for some
5:35
time, I was really curious about
5:37
theoretical physics, but then I didn't
5:39
feel like I'm smart enough to do it for a lifetime.
5:42
I didn't feel like I could
5:44
contribute that much. So for some time,
5:47
almost in my free time, I was building up some
5:49
simulation software.
5:51
And back then people were just using packages
5:53
like ROOT and many others that were written
5:55
at CERN for physics simulations
5:58
and stuff. And I always
5:59
kind of went the opposite direction.
6:02
So I approached all of my researchers,
6:05
advisors, whatever,
6:07
I convinced them that instead of using an
6:09
existing package for all kinds of simulations, I'll
6:11
just rewrite everything for
6:14
GPUs.
6:15
And then when I came back to my
6:17
advisors and showed them that I can run their simulation
6:21
on the laptop faster than they do it on the cluster,
6:24
they were all kind of shocked. And then I kind of started
6:26
combining all of this with some of my theoretical computer
6:28
science research and decided it's time
6:30
to leave the university and just focus on AI
6:33
for the rest of my life.
6:34
That is such a cool story. So I also
6:37
did do some astrophysical simulations
6:40
in my time. I remember there was
6:42
lots of horrible Fortran code written
6:44
by Soviet professors in
6:46
the 70s. But I never
6:48
had any brilliant ideas like that. But
6:51
yeah, so that's quite fascinating what you're saying
6:53
about the GPU stuff. So you will
6:56
get more into your bio and your work in just a few minutes.
6:58
But we do have a couple of news articles to talk about.
7:01
So feel free to comment on any of these, OK?
7:03
Sure, sure. All right, so the first thing,
7:05
upcoming conferences. So
7:08
we received an email from Inbal Levi,
7:11
who is one of the organizers of Core
7:13
C++. And she
7:15
writes, our conference Core
7:17
C++ is approaching fast. And we're really excited
7:19
about it. I was wondering if you could mention
7:22
it in your podcast. Well, dear Inbal, yes,
7:24
we can. Core C++ is
7:26
taking place in Tel Aviv, Israel. It's
7:28
actually in a new venue, although the old venue
7:30
was also pretty awesome. So I'm curious what the new venue
7:32
is like. From June 6 until 7,
7:35
with workshops on June 5 and 8,
7:38
they have amazing speakers. The keynotes
7:40
will be given by Bjarne Stroustrup and Daisy
7:42
Hollman.
7:44
And tickets are available at CoreCPP.org.
7:47
And I received another email from meeting C++ that was
7:49
not specifically for
7:51
CppCast. It was kind
7:53
of just a regular mailing
7:55
that they sent out. But it was also very exciting, actually,
7:58
because they have announced the dates for Meeting C++ 2023.
8:01
So that's a big conference in Berlin. It's
8:04
going to take place on the 12th to the 14th
8:06
of November this year in Berlin and
8:09
be aware this is Sunday to Tuesday and
8:11
it's not Thursday to Saturday as it has always
8:13
been in the past. Interesting. Uh, it will be a hybrid
8:15
conference with three tracks on site, one
8:18
pre-recorded online track,
8:20
and they have announced two keynotes already, one by
8:22
Kevlin Henney and another by Lydia Pintscher.
8:24
And the third keynote, the closing keynote will
8:26
be announced later.
8:28
So that's good to hear that that conference
8:31
is also coming back. Or actually both of them are coming
8:33
back this year. Been to both
8:35
of them. They're pretty awesome. Both of them. I
8:37
think Phil, you've also been to, have you been to
8:39
Core C++? I went to the first one. Yes.
8:42
Not been to one since, unfortunately. Yeah,
8:44
I was at the one last year as well. It was also pretty
8:46
awesome.
8:47
Um, so just for completeness sake, there
8:50
are a few more upcoming C++ conferences in the
8:52
next couple of months. I just want to briefly mention them as well
8:54
to kind of remind everyone that this is happening.
8:56
So there's C++Now in Aspen,
8:59
Colorado, which is just a couple of weeks away at
9:01
this point, from the 7th through the 12th
9:03
of May,
9:04
and it's capped at 140 participants
9:07
and they sent out a mail blast as well, uh,
9:10
this week saying that they still have 20 slots
9:12
left. So if you
9:14
want to grab one of the last slots to go
9:16
to C++ now in beautiful Aspen, Colorado,
9:18
buy your ticket now.
9:20
And there's obviously C++ on Sea, which is your conference,
9:22
Phil. Do you want to talk about that one?
9:25
Of course. Yeah. So the full schedule,
9:27
by the time this airs, should be available. Uh,
9:29
we'll be going live shortly after we've recorded this,
9:32
and it will also announce our third
9:34
keynote speaker. So, like, again,
9:37
as with Meeting C++, we held one of them
9:39
back until a bit later. So
9:41
a little bit of news still to come as well,
9:43
but the rest of the schedule will all be online
9:45
by the time you hear this.
9:47
Right. And then there is the one in Madrid coming
9:49
up actually this week. So by the time
9:52
you hear it, the conference is already going
9:54
to be
9:54
happening. So it's probably too late to
9:56
direct people to that one. But there's also the Italian
9:59
C++ Conference in
10:02
Rome on the 10th of June. That's
10:05
also not very far away. And
10:07
finally, there's also CppNorth in Toronto,
10:10
Canada coming up 17th to 19th of July. So
10:13
that's another one I'm really looking forward to.
10:17
Okay, enough conferences. There's another
10:19
thing. There is
10:22
the 2023 annual C++ developer survey, which
10:24
is now out. It is an annual
10:26
survey by the ISO C++ standards committee
10:28
and the standard C++ Foundation. And
10:31
they would really appreciate your feedback to
10:33
share your experiences as a C++ developer as it
10:36
only takes 10 minutes to complete.
10:38
So please participate. If you have 10 minutes to
10:40
spare, you can do so on surveymonkey.com
10:42
/r/ISOCPP-2023.
10:46
This link will be in the show notes on cppcast.com.
10:49
And a summary of the survey
10:51
results will be posted publicly on isocpp.org.
10:56
This is one of the three big C++ community surveys that go out
10:58
every year: this one, JetBrains
11:00
do their own survey, and
11:02
Meeting C++ has its ongoing survey. Sometimes
11:07
it has new questions as well. But between
11:09
the three of those, I know from having worked with
11:11
two of the tool vendors now that we do watch
11:13
those closely to see what the trends are and
11:15
who's using what. So please
11:18
do fill those out because it helps the
11:20
whole community to do that. All right.
11:22
And we have one news item from the tooling world that you already
11:25
mentioned. So it's a nice segue. Thank you, Phil. Actually
11:28
back to you because this is about the company where you work.
11:31
So Sonar has announced that you can now run Sonar
11:33
static analysis inside Compiler
11:34
Explorer. Yeah.
11:37
I'm really, really excited about this because it
11:39
really makes a
11:41
huge difference being able to just enable Sonar analysis
11:43
on some code you already have running
11:45
in Compiler Explorer. And you'll get
11:47
a much more
11:49
detailed set of
11:52
warnings, or rules,
11:53
that can really break down what might be wrong with your code.
11:55
Even if it compiles,
11:57
you might still get some more insight
11:59
to help you
12:01
clean it up.
12:02
And I've actually been working on some
12:04
videos to go along with that. So
12:06
if they're ready by the time this airs, I'll put some links
12:08
to those in the show notes, just
12:11
to show you some use cases. Right.
12:14
And so finally there were two blog posts
12:17
the last couple of weeks that caught my attention. The first
12:19
one was by Victor Zverovich,
12:21
the guy who wrote std::format and the {fmt}
12:24
library.
12:25
It's called C++ 20 modules
12:27
in Clang.
12:28
So Clang 16, which we already discussed a
12:30
little bit, I think a few episodes back,
12:32
has pretty good support for C++ 20 modules,
12:35
kind of out of the box.
12:36
So Victor actually went ahead and compiled his
12:39
{fmt} library with modules.
12:41
It requires a bit of manual work, but you can make
12:43
it work. So this has been done before by Daniela
12:45
Engert. She has a talk about that called
12:48
A Short Tour of C++ Modules that
12:50
she did a while back.
12:51
But yeah, so Victor now has repeated this
12:54
exercise. I think with Clang 16, it's quite
12:56
a lot simpler now to do. But surprisingly,
12:58
Victor found that there wasn't actually a measurable
13:00
speed up in compile times, which is kind of one of
13:02
the things that, you know, modules
13:05
were promising to give us. And
13:08
so, so he was digging a bit deeper in that
13:10
blog post and he traced the issue down to the fact that
13:12
Clang is actually ignoring extern template.
13:15
So it's kind of recompiling the template instantiations
13:18
all over again. So this is kind of
13:20
exactly the thing that, you know, modules were supposed
13:22
to get rid of. So
13:25
yeah, it's kind of interesting to see if and when
13:27
Clang is going to fix that or
13:30
improve that, or what the underlying issue
13:32
is. I'm not
13:33
an expert there, but
13:34
definitely interesting to see that there
13:36
is still some work there to do, but it's kind of, kind of works,
13:39
but kind of also
13:40
doesn't quite give you all the benefits yet.
13:42
Yeah. That seemed to be a quality
13:45
of implementation issue,
13:46
hopefully, which means that we
13:49
can move past that and we might still get the benefits
13:51
of it
13:51
down the line. Yeah. I did
13:53
like the fact that
13:56
Victor actually starts off the
13:58
post saying that the three headline
13:59
C++20 features are
14:01
modules, coroutines,
14:03
and the third one.
14:05
So it's up to you to fill in the blank there. Well,
14:09
I mean, if he means core language features,
14:11
I guess he means concepts. You
14:13
decide. I mean, it could also be
14:15
ranges. I mean,
14:18
I don't know. Interesting.
14:21
Very interesting. Okay.
14:24
And one last interesting blog post that I want to
14:26
mention on the show that caught my attention was
14:28
called Horrible Code Clean Performance by
14:30
Ivica Bogosavljević. I hope
14:33
I pronounced this name not too
14:36
completely wrong. So the
14:38
title of that blog post is actually a homage
14:41
to another blog post and video
14:43
that came out a couple of months ago, Clean
14:45
Code Horrible Performance,
14:48
which is also super interesting and kind of controversial.
14:50
And we haven't really covered it in the show, but like, yeah,
14:53
look up that one as well. That's super interesting. But
14:55
yeah, this one is called Horrible Code Clean Performance.
14:58
And basically what he's doing there is that Ivica
15:01
is implementing a simple substring search
15:03
algorithm. So you kind of have a bigger string, and
15:05
then you have a smaller string, and you search for the first
15:07
occurrence of the smaller string in the bigger string.
15:10
And he's kind of implementing it naively first.
15:12
So he has like a kind of a
15:14
char pointer and a size basically for both of
15:16
them. And then he's implementing like the naive loop that you
15:18
would do. And then he
15:21
does like an optimization where he does
15:23
it in a much more ugly way, but then it kind of stores
15:27
the first bit of the string that you're searching for
15:29
in a really clever way. So he ends up with this like
15:31
much more convoluted code, but he
15:33
finds that that actually runs 20% faster.
15:36
And so I thought that was really interesting.
15:38
There was also quite an interesting Reddit discussion about
15:41
that too.
15:43
So I did actually see an interesting back and forth in
15:45
the comments with someone called
15:47
Timur Doumler, who
15:49
said he wasn't able to reproduce Ivica's
15:51
results on his own machine.
15:54
So looks like that's actually still ongoing.
15:56
So that may have even changed by the time this airs, but
15:59
uh
15:59
Did you get anywhere with that, Timur? Yeah, so
16:02
the first thing I thought,
16:04
when I was reading that blog post, I was like, that can't be.
16:06
Compilers are more clever than this. So I kind of
16:09
did the benchmark myself and it came out as
16:11
there is no benefit. And then I
16:14
told Ivica about this and he was like, oh yeah, but
16:17
you need to make sure that
16:19
the function isn't inline. Then you need to make sure that the
16:21
compiler doesn't actually see the string you're looking
16:24
for because it doesn't see its size. Otherwise
16:26
it can do clever
16:27
compile time optimizations. And I was like, oh yeah, obviously
16:30
this is what's going on
16:32
in my benchmarks. So that's all
16:34
wrong. So I need to redo it. But we
16:36
did agree on the fact that if
16:38
you do the same benchmark just with std string
16:40
find, then it's actually faster than
16:42
either of these versions on either of those machines.
16:45
It's interesting that you're mentioning this. std
16:48
string find is a bit
16:50
weird. So I always feel like
16:53
some of the libraries can have a few
16:55
more specialized versions for different
16:57
substring search. So I was
16:59
doing a few
17:02
measurements and benchmarks in the previous years.
17:04
And substring search is one of my favorite problems,
17:06
like tiny ones to tackle. So you
17:09
can always take a smart algorithm and
17:11
then try to optimize it in
17:13
the CSE way and expect some
17:16
performance improvements. But the best thing that worked
17:18
for me in substring search is very
17:20
trivial heuristics. And especially
17:22
combining them with some
17:24
single instruction multiple data intrinsics,
17:27
you get the biggest benefits.
17:30
So if you take, let's say, substring
17:32
search in the case where you have the needle that
17:35
is at least four characters long, and
17:37
you essentially cast it to 32-bit
17:40
unsigned integer, and then go
17:42
through the haystack comparing at every
17:45
byte step, the following four
17:47
bytes cast to a uint32, against your uint32, you get an
17:49
improvement over std
17:53
string find and over a few other things,
17:56
which is a bit surprising because this is such an obvious
17:58
thing. Of course, it
17:59
doesn't work in the rare cases when the four characters
18:03
are matching quite often. But
18:05
like, say, the fifth one is a
18:07
different one. And then the coolest thing
18:09
that worked out was actually taking the AVX
18:12
registers, the AVX2 registers.
18:15
And they can fit 256 bits. So
18:18
that's how many bytes.
18:21
That's 32 bytes. And
18:24
within 32 bytes, you will be
18:26
able to do how many of
18:29
such comparisons? Eight such comparisons.
18:31
So what you can do, you can prefetch
18:34
four AVX2 registers
18:37
at one byte offsets. And
18:40
this way, with a 35-byte step, you
18:43
can actually compare 35 characters at a time, checking
18:48
if any one of those offsets matches to
18:50
your four-byte thing. So it sounds
18:52
a bit convoluted. And this is exactly the point,
18:54
I guess, of the article, like convoluted
18:56
code, good performance, versus
18:58
the opposite. But the difference is staggering.
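To make the trick concrete for readers, here is a minimal sketch of the scalar version of the heuristic Ash describes. It is illustrative only, not his benchmark code; the function name and signature are our own.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: treat the first four bytes of the needle as a 32-bit integer
// and compare it against every byte offset of the haystack; only run a
// full comparison when that 4-byte prefix matches.
std::size_t find_4byte_prefix(char const* haystack, std::size_t h_len,
                              char const* needle, std::size_t n_len) {
    if (n_len < 4 || h_len < n_len) return std::size_t(-1);
    std::uint32_t prefix;
    std::memcpy(&prefix, needle, 4); // memcpy avoids unaligned-access UB
    for (std::size_t i = 0; i + n_len <= h_len; ++i) {
        std::uint32_t window;
        std::memcpy(&window, haystack + i, 4);
        if (window == prefix &&
            std::memcmp(haystack + i, needle, n_len) == 0)
            return i;
    }
    return std::size_t(-1); // not found
}
```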
19:01
So when you take std::string and you
19:03
call find, most of the time, you're getting 1.5
19:06
gigabytes per second worth of throughput,
19:08
give or take. Well, I guess me
19:10
and you, Timur, have an astrophysics
19:13
background. Anything within the 10x
19:16
order of magnitude difference is accurate enough.
19:19
It's on par,
19:21
right? Yeah, exactly. But
19:24
what I was able to show on some of the conferences
19:27
was that with this basic intrinsic
19:30
thing, you can actually get to 12 gigabytes
19:32
per second per core. So
19:35
it's exceptionally efficient.
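And here is a hedged sketch of the AVX2 variant described a moment ago, again with names of our own invention rather than the actual benchmark code: broadcast the 4-byte needle prefix, load the haystack at four consecutive one-byte offsets, compare 32-bit lanes, and verify candidates with memcmp. For brevity, candidates within a block are checked in offset order, so a production version would need extra care to return the leftmost match. Requires x86-64 with AVX2 (compile with -mavx2).

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

std::size_t find_avx2(char const* h, std::size_t h_len,
                      char const* n, std::size_t n_len) {
    if (n_len < 4 || h_len < n_len) return std::size_t(-1);
    std::uint32_t prefix32;
    std::memcpy(&prefix32, n, 4);
    __m256i const prefix = _mm256_set1_epi32((int)prefix32);
    std::size_t i = 0;
    // Each 32-byte step inspects a 35-byte window: four unaligned loads
    // at one-byte offsets, each compared as eight 32-bit lanes.
    for (; i + 35 <= h_len; i += 32) {
        for (std::size_t off = 0; off != 4; ++off) {
            __m256i text =
                _mm256_loadu_si256((__m256i const*)(h + i + off));
            unsigned mask = (unsigned)_mm256_movemask_epi8(
                _mm256_cmpeq_epi32(text, prefix));
            while (mask) { // a matching lane sets four consecutive bits
                unsigned bit = (unsigned)__builtin_ctz(mask);
                std::size_t pos = i + off + (bit / 4) * 4;
                if (pos + n_len <= h_len &&
                    std::memcmp(h + pos, n, n_len) == 0)
                    return pos;
                mask &= ~(0xFu << bit); // clear this lane's bits
            }
        }
    }
    for (; i + n_len <= h_len; ++i) // scalar tail
        if (std::memcmp(h + i, n, n_len) == 0) return i;
    return std::size_t(-1);
}
```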
19:37
It's much faster than the libc implementations
19:40
and std::string. And you can literally
19:42
fit it in so many places. So I was doing
19:44
benchmarks for both AVX instructions,
19:47
AVX-512 and Arm Neon. And
19:49
back then, SVE, Scalable Vector Extensions,
19:52
were not available on ARM. So I couldn't
19:54
do those. But even on ARM, the performance
19:56
benefits were huge and the energy efficiency
19:58
was absurd compared
19:59
to any code that even a super
20:02
smart compiler like Intel's compiler
20:04
can optimize. Yeah,
20:07
that is such a cool trick. That is
20:09
very, very interesting. Do you... Let me
20:11
just... Because this is my kind of topic. I love
20:13
this kind of stuff. So I just need to follow up on this very
20:15
briefly. Do you actually
20:18
have to write the SIMD instructions
20:20
by hand? Do you use a SIMD library, or do you just
20:22
write the algorithm in a way that lends itself to
20:25
auto vectorization?
20:26
I don't know. So actually, part
20:28
of this talk was specifically about measuring
20:30
auto vectorization capabilities versus naive
20:33
code versus handwritten intrinsics. And
20:35
there were people from Intel's teams validating
20:38
my numbers and checking if their
20:40
compilers can actually reproduce some of
20:42
the vectorization that is quite easy to
20:44
do by hand. And maybe since that
20:46
point, like five-ish years
20:48
ago, six-ish years ago, I almost
20:51
completely abandoned the idea of writing
20:53
a library for SIMD instructions. Whenever
20:55
I need top tier performance, I just
20:58
manually implement a few different
21:00
versions. Generally, I don't go
21:02
down to the level of assembly, but I almost
21:04
always use the intrinsics. And
21:06
I would generally have, let's say,
21:09
a function object
21:10
that is templated that will have,
21:13
let's say, a few different instantiations.
21:16
One of them will be, let's say, with linear code,
21:19
with serial code. Another one will be with
21:21
ARM Neon, another one with ARM SVE,
21:24
another one with AVX, another one with SSE,
21:27
and potentially the last one would be AVX 512
21:30
whenever necessary.
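As a sketch of the pattern Ash describes (the tag types and the dot-product payload are illustrative assumptions, not Unum's code): one templated function object, with one explicit specialization per instruction set, so callers pick an implementation at compile time or behind a runtime CPUID check.

```cpp
#include <cstddef>

struct serial_t {};
struct avx2_t {};
// ... neon_t, sve_t, avx512_t tags would follow the same pattern.

template <typename isa_t>
struct dot_product_t;

// Portable serial fallback.
template <>
struct dot_product_t<serial_t> {
    float operator()(float const* a, float const* b, std::size_t n) const {
        float sum = 0.f;
        for (std::size_t i = 0; i != n; ++i) sum += a[i] * b[i];
        return sum;
    }
};

#if defined(__AVX2__)
#include <immintrin.h>
// Hand-written intrinsics version, eight floats per iteration.
template <>
struct dot_product_t<avx2_t> {
    float operator()(float const* a, float const* b, std::size_t n) const {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(
                acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                   _mm256_loadu_ps(b + i)));
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, acc);
        float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3] +
                    lanes[4] + lanes[5] + lanes[6] + lanes[7];
        for (; i != n; ++i) sum += a[i] * b[i]; // tail
        return sum;
    }
};
#endif
```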
21:32
That's interesting. So actually, in audio, there's a
21:34
very, very similar problem. You often have
21:36
to do convolution.
21:39
And you can do convolution in Fourier space, and this
21:41
is kind of the proper way to do it. But sometimes you want to
21:43
do convolution in time domain. And then it's
21:45
essentially the same thing, except you don't compare,
21:48
but you multiply. Also,
21:50
you have a big array
21:52
and a small
21:54
array, and you multiply it, and then you move it
21:56
by one frame, you multiply it, you move it by one
21:58
frame and multiply it, and so on and so on.
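A minimal sketch of the time-domain convolution Timur is describing; structurally it is the same sliding loop as naive substring search, with multiply-accumulate in place of comparison.

```cpp
#include <cstddef>
#include <vector>

// Slide the short kernel over the long signal, accumulating products.
std::vector<float> convolve(std::vector<float> const& signal,
                            std::vector<float> const& kernel) {
    std::vector<float> out(signal.size() + kernel.size() - 1, 0.f);
    for (std::size_t i = 0; i != signal.size(); ++i)
        for (std::size_t j = 0; j != kernel.size(); ++j)
            out[i + j] += signal[i] * kernel[j];
    return out;
}
```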
21:59
And so this is why I
22:02
found that this is one of the things that
22:04
like the compiler just can't
22:06
figure out for you if you don't write it
22:08
in a particular way that like
22:10
kind of lends itself to auto-vectorization or like
22:12
typically I think I've seen that you kind of have to
22:15
use a SIMD library to get this right. And
22:17
I've never tried it with just raw intrinsics because
22:19
that's kind of
22:21
non-portable and you have to do it multiple times. So
22:23
I clearly haven't
22:25
dug into this as deeply as you have, but
22:27
I should
22:29
look into this again because I think it's kind of a really
22:32
interesting problem
22:32
space. Just for sake of completeness, what
22:35
talk is that? Where can people
22:37
see that? Because we should totally put that in the show
22:39
notes, I think.
22:40
That's actually interesting.
22:43
I'm pretty sure there's a GitHub repository
22:46
that implements this. I think the talk
22:48
specifically that one was in
22:50
Russian, for C++ Russia. I think
22:53
the AI tools should advance a little bit and we'll
22:55
have automatic translation. Or maybe
22:57
I can just repeat this talk somewhere, but it also
23:00
touches on a few topics that almost
23:02
no other talk I've ever seen
23:04
kind of covers. So there is like this
23:07
rumor that AVX 512 and
23:09
a few other different instruction
23:12
subsets kind of affect the
23:14
frequency of the CPU so that like once you
23:16
truly load all the cores with AVX-512,
23:20
you kind of lose all the CPU
23:22
boost. The frequency really drops
23:24
and people just
23:26
kind of understand it that it's the
23:28
case, but no one quantified it or like no
23:30
one really goes into the documentation to
23:33
describe what CPU licensing
23:35
levels are. So same way
23:37
as you have like CPU cache levels like
23:39
level zero or level one,
23:41
two, three. There's also like CPU
23:43
frequency licenses. And this
23:46
is like a weird, completely separate topic that's
23:48
almost completely undocumented.
23:51
And I was doing a lot of research kind of trying
23:53
to understand like how many lines of
23:55
AVX 512 or like how many intrinsics
23:58
can I actually fire, so
24:00
that the CPU doesn't
24:02
turn down the frequency. Or let's say, how
24:05
many more should I put for all the CPU frequencies
24:07
across all the cores to be downgraded, even
24:10
if the remaining cores are not doing any AVX-512 and they're
24:12
just doing serial code. So
24:15
it's kind of an interesting thing. There's a repository
24:17
that you can run. It's on my GitHub.
24:19
And my handle everywhere is identical. It's
24:22
ashvardanian, Ash Vardanyan.
24:26
So I think the repository
24:28
is called Substring Search Benchmarks or something
24:30
like that. I'll share the link. Yeah,
24:33
we'll put that in the show notes. Thank you.
24:35
And it sounds like we're sort of transitioning into
24:39
the main content of this episode
24:41
already. So a great segue there.
24:43
Thank you. But actually, before we get to the low-level stuff,
24:46
I wanted to start at least
24:47
maybe at the higher level, if
24:49
that's OK,
24:50
because you founded your own company
24:53
and it consists of a whole set of projects,
24:55
which at first glance seem to be almost unrelated.
24:58
They're mostly open source. There's
25:00
a GitHub repo that we'll put in the show
25:02
notes as well. But
25:03
the tagline there
25:04
is, Rebuilding Infrastructure for
25:07
the Age of AI.
25:08
So the AI really stands out
25:11
there. So what sort of AI
25:12
is that? And how do all these libraries relate
25:14
to it?
25:15
So you're right in the fact that
25:17
we have too many things that to
25:19
most people would seem unrelated. And
25:22
to be honest, this is just the tip of the iceberg. So
25:25
I started this company seven and a half years ago.
25:27
I've been working on it essentially every single day since,
25:30
day and night. And
25:32
when I left the university and focused exclusively
25:35
on this, this is when the true science began. I
25:37
was reading like 1,000 papers a year, almost
25:40
everything that was published on the computer
25:42
science part of arXiv or
25:45
the AI and ML parts. I
25:47
at least glanced through it. And a lot of things
25:49
were implemented. So our
25:51
open source libraries include UStore,
25:54
which is an open source multimodal
25:57
database that abstracts away the layer
25:59
of the key-value store. So kind
26:01
of you can take any key value store and you
26:03
add the database logic on top of it such as
26:05
being able to store different forms of data, query
26:08
them, add wrappers and drivers for different
26:10
programming languages like Python. Then there
26:13
is a library called UCall, which is essentially
26:16
a kernel bypass thing, a tiny
26:19
single or like two-file project
26:21
that took me like a couple of days to write
26:24
and one of my junior developers maintains
26:26
it now adding TLS support which
26:29
seems to be like one of the fastest networking
26:31
libraries built or at least like within
26:33
the C, C++ open source domain. It's
26:36
a JSON-RPC library that uses
26:38
the most recent io_uring 5.19 features, I mean
26:40
kernel 5.19 and
26:43
higher and it also uses
26:45
simdjson and a bunch of other SIMD-accelerated
26:48
libraries for processing the packets.
26:51
There is also a project called UForm, which has
26:53
nothing to do with C or C++. It's
26:56
a pure Python thing but kind of uses
26:58
PyTorch in a nice way to be
27:01
able to run multi-modal AI models
27:03
with mid-fusion. It's essentially the kind
27:05
of setup when you have multiple transformers and
27:08
the signal between them is kind of exchanged
27:10
before it reaches the top output
27:13
of a neural network. There's
27:15
a bunch of other libraries as well but this
27:17
is just the open source stuff. Almost every one of those
27:19
has like a proprietary counterpart that
27:22
is far far more advanced. It took years
27:24
to build, has tons of
27:26
assembly in it. A lot of GPU accelerated
27:29
code includes such
27:31
remote things as Regex parsing libraries,
27:34
probably the second fastest in the world after Intel
27:36
HyperScan and the only
27:38
one that also exceeds 10 gigabytes per second per
27:40
core. Then there is
27:43
a graph library, one of the largest
27:45
graph algorithm collections. There
27:48
is a BLAS library, so basically
27:51
linear algebra subroutines, but unlike
27:54
classical BLAS we don't just target dense-dense
27:56
matrix multiplications. We also do sparse-sparse
27:59
as well. And we do it both
28:01
on the CPU side and the GPU side, and
28:03
we also do it in a manner invariant
28:06
to the semiring. So let's say
28:08
there's this notion of algebraic
28:10
graph theory, where you can
28:13
replace a lot of graph processing algorithms with
28:15
matrix multiplications if you know how
28:17
to parameterize them as matrix multiplication
28:19
kernel, replacing the dot product,
28:21
essentially the plus and the multiply operations, with
28:24
something else, a different ring or a different semiring.
28:27
So a cool thing that most CS
28:29
people often don't realize, even
28:31
though they are familiar with the subject, is
28:33
that if you look at the Floyd-Warshall algorithm
28:36
on the Wikipedia page,
28:38
it's just three nested for loops over
28:41
i, j, and k. And then within it,
28:43
it's almost exactly the same as matrix multiplication.
28:46
But instead of addition and multiplication
28:48
on scalars, it's doing the minimum and addition
28:51
operation. So if you take the
28:53
matrix multiplication kernel, design
28:55
it as a template, and then
28:57
pass a different operation into it rather than
28:59
plus and multiply, your matrix multiplication
29:02
kernel immediately becomes a graph processing algorithm.
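To illustrate the point (our own sketch, not Unum's kernel): write the three nested loops once, parameterized on the two operations. With plus and times you get matrix-multiply accumulation; with min and plus over a distance matrix, k outermost and updated in place, it is exactly the Floyd-Warshall relaxation, the min-plus or tropical semiring.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using matrix_t = std::vector<std::vector<double>>;

// One kernel, two pluggable operations: `reduce` replaces +, `combine`
// replaces *. The k loop is outermost, as Floyd-Warshall requires.
template <typename reduce_t, typename combine_t>
void semiring_kernel(matrix_t& c, matrix_t const& a, matrix_t const& b,
                     reduce_t reduce, combine_t combine) {
    std::size_t n = c.size();
    for (std::size_t k = 0; k != n; ++k)
        for (std::size_t i = 0; i != n; ++i)
            for (std::size_t j = 0; j != n; ++j)
                c[i][j] = reduce(c[i][j], combine(a[i][k], b[k][j]));
}

// Usage: plus-times accumulates C += A * B; min-plus with the distance
// matrix d passed as all three arguments runs all-pairs shortest paths.
void shortest_paths(matrix_t& d) {
    semiring_kernel(d, d, d,
                    [](double x, double y) { return std::min(x, y); },
                    [](double x, double y) { return x + y; });
}
```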
29:05
So there's a lot of such seemingly
29:07
unrelated things, but my vision from the very
29:09
beginning was that those
29:12
can compose into AI of scale
29:15
that we've never seen before. So all of
29:17
the modern AI is almost exclusively
29:19
built on dense matrix multiplications
29:22
and very simple feed-forward layers
29:24
or very basic attention.
29:26
And then another part is that it
29:28
almost exclusively works on the stuff
29:30
that fits in memory or fits
29:33
within VRAM, so the memory attached
29:35
to your GPU. And those volumes
29:37
are tiny. So in
29:39
our case, I was always curious, how
29:42
can I optimize and vertically
29:44
integrate the whole stack so that even
29:47
external storage, such as the modern
29:49
high bandwidth SSDs, can
29:51
actually become part of your AI pipeline,
29:54
streaming the data, reorganizing the data,
29:56
let's say, stored on SSDs with
29:58
the participation of AI, or let's
30:00
say helping to train AI by having a much faster
30:02
data lake. So the idea
30:05
there is that modern CPUs
30:08
can have what, like one, two terabytes of RAM
30:10
per socket, but they can have
30:12
also like 400 terabytes of NVMe
30:14
storage attached to
30:17
that same socket. So like if you're not
30:20
able to address and properly use external
30:22
memory, you're really limiting yourself
30:24
to like very small part of
30:27
what's accessible.
30:28
And the additional part that kind
30:30
of adds up here is that, yes,
30:33
you can build up a good data lake to help
30:36
with AI and the AI industry, but
30:38
you can also use AI to improve
30:40
the data lake itself. It's like
30:42
very reminiscent of the Silicon Valley
30:45
series, like the guys were building compression
30:47
to kind of build AI
30:50
and then ended up building AI
30:52
to build compression, or vice versa, I think
30:54
it was the other way around.
30:57
In our case, like if you look at the databases
30:59
like Postgres, MongoDB and many others,
31:02
they focus almost exclusively
31:04
on deterministic indexing, such as
31:06
inverted indexes or something
31:09
like that, where you just explicitly
31:11
search by a specific key or a specific string.
31:14
And you only search for exact matches or
31:17
even at best fuzzy string matches. But
31:19
with AI, we can actually search unstructured
31:21
data. So by combining vector
31:23
search, by combining a database and
31:25
a multimodal pre-trained AI, what
31:28
we can do, we can actually embed
31:30
some media documents into a vector space,
31:33
and then just search through those vectors, finding
31:35
all forms of potentially unrelated content,
31:38
or hopefully related content, but across different
31:41
modalities. So being able to search
31:43
videos with a textual query, being
31:45
able to search images with a video query,
31:48
being able to search JSON documents
31:50
with a video query and so on. So I
31:52
guess this kind of gives you a glimpse of how everything
31:54
connects together and hopefully makes the list
31:57
make a little bit of sense.
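For readers new to the idea, here is a toy sketch of the cross-modal search just described. Once every document (text, image, video frame) has been embedded into the same vector space, "search" is nearest-neighbor by cosine similarity; real systems replace this linear scan with an approximate index such as HNSW.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Return the index of the stored embedding most similar to the query.
std::size_t most_similar(std::vector<std::vector<float>> const& index,
                         std::vector<float> const& query) {
    auto cosine = [](std::vector<float> const& a,
                     std::vector<float> const& b) {
        float dot = 0.f, na = 0.f, nb = 0.f;
        for (std::size_t i = 0; i != a.size(); ++i) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
    };
    std::size_t best = 0;
    float best_score = -2.f; // cosine similarity lies in [-1, 1]
    for (std::size_t i = 0; i != index.size(); ++i) {
        float score = cosine(index[i], query);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}
```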
31:59
Yeah, that sounds very good.
31:59
Very, very interesting. Thank
32:02
you. We're gonna dig into a couple of those libraries
32:04
in a minute, I think. But
32:06
just taking a step back, you mentioned Python
32:08
a couple of times there. Seems like you're continuing
32:10
the AI ML tradition of having
32:12
a Python front end with
32:14
C++ doing the heavy lifting. Is that fair
32:16
to say?
32:17
Yeah, I guess everyone just converged to
32:19
this idea that this is the way to go. For
32:21
many years, I'd
32:24
confess I wasn't a super
32:26
big fan of Python. I was
32:28
too obsessed with performance to
32:31
touch a tool that kind of almost entirely
32:34
abandons the concept of performance. But
32:36
then I realized the value that it brings to
32:38
my life and my developer experience.
32:41
So I thought we should bridge
32:43
the two worlds and we are not the only company
32:45
doing this. So famously all AI and ML
32:47
frameworks are written in C++. But
32:50
at the front end, people kind of use Python
32:52
exclusively, almost
32:55
exclusively. And this kind of
32:57
spills outside of AI in all the data
32:59
science and data analytics tooling as
33:01
well. So NVIDIA is famously
33:03
one of those companies that builds
33:05
a lot of obviously GPUs and they
33:08
have CUDA as a language. They have a compiler
33:10
for CUDA, a lot of low level stuff. But
33:13
I would also say they have by far the best
33:15
tooling in the Python, on the Python level
33:18
to actually leverage those GPUs.
33:21
So you can take libraries like Pandas, NetworkX,
33:23
or NumPy, which are all targeting
33:26
only CPUs and are written purely in Python.
33:28
And you can replace those with libraries like cuDF
33:32
as a replacement of Pandas, cuGraph
33:34
as a replacement of NetworkX, and
33:37
CuPy as a replacement of NumPy. And
33:40
genuinely like this is some of the
33:42
best software I've ever used and kind
33:44
of a really good benchmark for us to
33:46
compete with in a sense that they do it
33:49
for parallel programming and
33:51
we do it for external memory.
33:54
So yeah, so I can see how that
33:56
sort of fits into the infrastructure story that
33:58
enables
33:59
better AI implementations.
34:01
So we'll
34:02
dig into some of those libraries in just a moment,
34:04
but just going to have a little break because this
34:06
episode is supported by JetBrains.
34:09
And JetBrains has a range of C++ IDEs
34:12
to help you avoid the typical pitfalls and headaches
34:14
that are often associated with coding in C++. And
34:18
exclusively for CPP cast, JetBrains
34:20
is offering a 25% discount for
34:23
purchasing or renewing the yearly individual
34:25
license on the C++ tool of your choice.
34:28
CLion, ReSharper
34:31
C++, or Rider. Use the
34:33
coupon code JETBRAINSFORCPPCAST,
34:35
all one word,
34:37
during checkout at JetBrains.com.
34:40
So there were a couple of projects on
34:43
the
34:44
Unum repo that jumped out at me
34:47
that
34:47
I just wanted to bring up. And the first one was
34:49
UCall, which I think you did mention earlier,
34:51
which it claims to be, you mentioned
34:54
yourself, a JSON RPC
34:56
library that is up to,
34:58
and I don't know how much work up to is doing here, but 100% faster
35:00
than Fast API.
35:02
100x, not 100%. So this is important.
35:06
Yeah, no, 100 times.
35:08
Yeah. Now, I know a little bit about Fast API, although
35:11
I haven't actually used it myself, but
35:12
I do have a few web servers that I've
35:15
written and maintain,
35:16
some of them just serving JSON,
35:18
built on Python web frameworks. And I usually
35:21
use Flask.
35:22
And I know that Fast API
35:25
is also a Python web framework.
35:26
It's meant to be significantly faster than Flask.
35:29
And I've not heard people saying that
35:31
Flask is a particularly slow
35:33
framework on its own. So if you're saying you're 100% faster, sorry,
35:36
said it again, 100 times faster
35:40
than Fast API, that sounds
35:42
like quite a big claim. So how do you actually
35:44
achieve that? Sure, I'll be
35:46
happy to explain. So at
35:49
first, people may think that if a project is
35:51
popular, then it kind of optimizes
35:53
something, and it's really good at something. Even
35:56
though Fast API has fast in its name,
35:58
it's not particularly fast, to be
35:59
honest. So one
36:02
of the things that they do really well is
36:04
like they're very simple to use. They're
36:06
very developer friendly.
36:09
So you just put a Python decorator on top of
36:11
your Python function. And all of a sudden, this
36:13
is a RESTful web server. So
36:16
I guess maybe by fast, they meant that the
36:18
developer experience is fast, but not maybe the
36:20
runtime itself. So the story
36:22
is more or less the following. I was playing
36:24
with our neural networks. And
36:27
they're very lightweight. So we
36:29
looked at neural networks like OpenAI
36:32
clip. And we wanted to replace
36:34
those multimodal encoders with something that would
36:36
work much faster and can be deployed
36:38
on edge, maybe like even IoT
36:41
devices. So we really squeeze those transformers
36:43
made them a lot faster. And if you take
36:45
a server such as like the DGX
36:47
A100 by NVIDIA, you
36:50
will end up serving 300,000 or like 200,000 inferences per
36:52
second across the eight
36:57
GPUs of that machine. So this
36:59
is a very high mark for AI
37:02
inference. And the question
37:04
is like, how do you serve it? Because
37:06
the first idea is let's take the most
37:09
commonly used Python library for
37:11
web servers, let's connect it
37:13
to PyTorch or something else. And
37:16
let's just serve the embeddings. So
37:18
when I tried to do this, I wasn't actually
37:21
even on the DGX, I just took a MacBook.
37:23
And when I built up a server and just ran it on
37:25
my machine, it was an Intel Core i9. I think my
37:28
response latency was close to six milliseconds.
37:31
So just the client on the server on the same machine,
37:34
and I'm waiting for six milliseconds to get the response,
37:37
I was just shocked
37:39
by the result. So
37:42
obviously, there was a lot to optimize. And
37:44
then I thought like, how far can I go? I haven't
37:47
done much
37:49
networking development
37:51
in the last couple of years. But I've done a
37:53
lot of storage related stuff. And
37:56
I loved io_uring for all of its like
37:58
new advances and the performance that it brings. Of
38:01
course, sometimes we have to go
38:03
beyond that, so we also work with SPDK and
38:05
DPDK as pure
38:07
user-space drivers for kernel bypass, but
38:10
io_uring by itself is also pretty good. So
38:12
if you take a very recent Linux kernel
38:15
like 5.19, it adds up
38:17
a lot of really cool features
38:19
for stateful networking. So
38:22
essentially, the idea is the following. Whenever you have a
38:24
TCP connection on the socket, you listen
38:26
for new requests and queries, and
38:28
whenever they come, you create
38:30
a new connection for every one
38:33
of the incoming clients
38:35
or a new client. One
38:38
of the system calls that you would oftentimes
38:40
do in this case gives you a new file descriptor
38:43
for the communication over a channel to a specific
38:45
client. One of the things
38:47
that io_uring in 5.19 brings is
38:50
a managed pool of those file descriptors
38:53
that can also be taken using the
38:55
io_uring interface without any system
38:57
calls. So with this out of the way,
39:00
almost
39:00
every system call that
39:02
we could have done that would have cost
39:04
an interrupt and a context
39:07
switch on the CPU side is now gone.
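Below is a hedged sketch of the kernel-5.19-era pattern being described, using liburing; it is our illustration, not UCall's code, and error handling, socket setup, and completion tagging are elided. The key idea: a multishot direct accept parks accepted sockets straight into a ring-registered descriptor table, so neither accepting a client nor reading from it costs a dedicated system call.

```cpp
#include <liburing.h>

int main() {
    io_uring ring;
    if (io_uring_queue_init(256, &ring, 0) != 0) return 1;

    // Pre-register a sparse file table; the kernel will drop accepted
    // sockets into free slots ("direct" descriptors).
    io_uring_register_files_sparse(&ring, 1024);

    int listen_fd = -1; // socket()/bind()/listen() elided for brevity

    // One multishot accept SQE keeps producing a CQE per new client,
    // with no accept() syscall per connection.
    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_multishot_accept_direct(sqe, listen_fd,
                                          nullptr, nullptr, 0);
    io_uring_submit(&ring);

    char buf[4096];
    for (;;) {
        io_uring_cqe* cqe;
        if (io_uring_wait_cqe(&ring, &cqe) != 0) break;
        // For accept completions, res is an index into the registered
        // table. (A real server tags SQEs with user_data to tell accept
        // and recv completions apart.)
        int client_index = cqe->res;
        io_uring_cqe_seen(&ring, cqe);
        if (client_index < 0) continue;

        // Queue a receive on the direct descriptor; IOSQE_FIXED_FILE
        // marks the fd as a table index rather than a normal fd.
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv(sqe, client_index, buf, sizeof buf, 0);
        sqe->flags |= IOSQE_FIXED_FILE;
        io_uring_submit(&ring);
        // ... parse the JSON-RPC payload and queue a send the same way.
    }
    io_uring_queue_exit(&ring);
}
```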
39:10
And even with a single server
39:12
thread,
39:13
we managed to get to 230,000 requests per second,
39:17
even on our machine, with what would generally be
39:19
considered efficiency cores
39:22
rather than high-performance cores. While
39:25
FastAPI was only serving 3000 responses
39:28
per second or requests per second. So 3000
39:31
to 230,000 is a
39:34
huge gap. But at this point, we're kind
39:36
of comparing an implementation in Python and
39:38
implementation in C.
39:40
So we wrote a pure CPython
39:42
layer as a wrapper
39:44
on top of our C library. The
39:47
result was that we kind of dropped from 230,000 to 210,000,
39:49
still a major improvement over
39:54
FastAPI. And
39:56
aside from FastAPI, it's also faster,
39:58
seemingly faster than most of the other networking
40:00
libraries, including GRPC, which
40:03
many people use as the go-to
40:05
high-performance RPC implementation. But
40:08
GRPC doesn't implement such
40:10
level of kernel bypass, let
40:13
alone the fact that parsing protocol buffers
40:15
is actually oftentimes slower than parsing
40:18
JSON with simdjson. We
40:20
win on both fronts, the packet processing
40:22
speed, and also the way we interact
40:25
with the socket. Here we go, 100x faster.
40:28
Nothing you said there sounds unreasonable,
40:30
but the numbers still sound too big. I'm
40:32
definitely going to be playing with UCall
40:35
on one of my little projects and see whether
40:37
that makes a difference because I'm looking to step
40:39
up the performance. Sure. Let us know. We would really
40:42
love feedback. Yeah. I will report back.
40:45
Thank you for that.
40:46
You mentioned the term multimodal
40:49
a few times there.
40:50
What is that exactly? I think that's like a
40:52
term of art in AI, isn't it?
40:55
Yeah. AI people use it a lot,
40:57
especially these days with what they call foundation
40:59
models, or the next
41:01
step of LLMs, large language models.
41:04
Just doing language is not enough these days.
41:06
People want multimodality,
41:08
which means essentially being able to work with multiple
41:10
forms of data at once, like images,
41:13
video content, audio content,
41:15
anything actually. An example
41:19
of multimodal AI would be something
41:21
like a text-to-image generation pipeline.
41:24
Like you put a text in
41:26
and you get an image. Another example
41:28
would be an encoder that understands
41:31
both forms of data and it produces
41:34
embeddings of vectors that can be compared
41:36
with each other. You can say if an image is
41:38
semantically similar to a textual
41:41
description that sits beneath it, for example,
41:43
like on the webpage. In
41:45
the context of, let's say, databases or
41:47
anything else, we also started
41:50
to use this term to make
41:53
the vocabulary a little bit more
41:56
universal across different parts of our repositories.
41:59
So a multimodal database for us would be
42:01
a database that across different
42:03
collections of the same store can
42:05
keep different forms of data without
42:07
sacrificing the remaining properties. And
42:10
the most important property for us in a database
42:12
would be transactions and support for like
42:15
ACID guarantees: atomicity, consistency,
42:18
isolation, and durability. So
42:21
if you can do a transaction where within
42:23
one transaction you are updating multiple collections
42:26
and in one of them you are storing a metadata
42:28
of an object and another one you're storing
42:31
maybe like a poster or a photo
42:33
of a specific document or something like that. And
42:35
if you can do it in one transaction with
42:37
all the guarantees included, this is multi-modal
42:40
for us.
42:41
Right, yeah, I think I follow
42:43
that.
42:44
And it's interesting you started talking about
42:46
databases there. I think you've done my transition
42:48
for me again because I was going to ask about another
42:50
one of your projects which is UStore, which
42:53
at the time of writing I think on
42:55
your site is still somewhere called UKV.
42:57
Looks like you're in the middle of renaming that.
42:59
So just in case people go looking for it and they
43:01
find UKV it's the same thing I believe.
43:04
Yeah, sure. Well
43:08
the readme describes it as a build your own database
43:10
toolkit. But also that
43:13
it's four to five times faster, at least
43:15
in your benchmarks,
43:16
than RocksDB.
43:18
And I hadn't heard of that so I had to go and look it up. But
43:20
it
43:21
sounds like RocksDB is meant to
43:23
be at least 34% faster than MongoDB.
43:26
So I'm sure that's something people have heard of to
43:28
get an idea now.
43:31
So we're talking about
43:32
almost an order of magnitude faster than something
43:35
like MongoDB.
43:36
So that again is very impressive.
43:38
How do you achieve that?
43:40
So there are
43:42
a couple of stories here and I've
43:45
done a really bad job naming some of
43:47
the projects and it really seems like
43:50
a bit convoluted, like too much is happening.
43:52
So let me just give you a bird's
43:55
eye view of how the storage
43:57
is built today. So let's say
43:59
if you use something like a distributed database.
44:01
You have the distributed layer at the very top,
44:04
which is responsible for consensus and the ordering
44:06
of the transactions. Then
44:08
whenever you choose the leader and the master
44:10
node, we can dive
44:12
deep into that specific node. And on that node,
44:15
you have essentially an isolated single instance
44:17
solution. Within the single instance
44:19
solution, what you have is a database
44:21
layer, a key value store layer, and a file
44:23
system layer. And beneath it is
44:26
the operating system and the block storage.
44:29
So we haven't reached the distributed
44:31
layer so far. We almost exclusively
44:34
focused on vertical scaling in most
44:36
of our projects. Even though,
44:38
as we've just mentioned with UCall, networking is
44:40
also important for us. It's just that
44:43
we take
44:44
certain steps in specific
44:46
order.
44:47
For now, distributed hasn't been part of
44:49
the agenda. It will be this year. So
44:52
what we've done, we've built up something
44:55
that remotely resembles the strategy of
44:57
Redis. So I guess everyone is
44:59
familiar with Redis. It's essentially like a hash
45:01
table on steroids. What
45:04
they've done, they kind of focused on building a key
45:06
value store. And they allow
45:09
a lot of different additional features, essentially
45:12
adding multimodality to
45:14
the underlying binary key value store. So
45:17
now, let's kind of disassemble this into parts. A
45:20
key value store is just an associative
45:22
container, like a hash
45:24
table or a binary tree, B-tree,
45:27
log structured merge tree, anything
45:29
actually. And Redis
45:32
added pieces such as Redis JSON,
45:34
Redis search, and Redis
45:37
graph as essentially forms of converting
45:40
different modalities of data and
45:42
kind of serializing them down into a key value
45:44
store. So every
45:47
modality is just like a feature of
45:49
the underlying storage engine.
45:52
So what has been happening
45:54
on our side,
45:55
we thought, oh, cool.
45:58
Let's take a key value store.
46:00
that we love building
46:02
and in our case it's called UDisk. Let's
46:05
take other key value stores and let's
46:07
create a shared abstraction. So that's why it was
46:09
briefly mentioned as build
46:11
your own database toolkit. So we thought
46:14
if Redis knows how to abstract away
46:16
the key value store and
46:17
add a
46:19
lot of features on top of it, we can actually
46:22
do something similar and just give it out to
46:24
the world for everyone to use it. So
46:26
essentially you can take any key value store that you like,
46:29
and if by any chance you love designing
46:32
associative containers and you
46:34
code in C++, it's very easy for you to
46:36
actually build up your own hash table. Take
46:39
this project which is now called U-store and
46:42
use it as an intermediate representation
46:44
layer essentially or just like a C interface.
46:47
If you add the C interface on top of
46:49
your hash table or associative
46:51
container that would be ordered,
46:53
you're getting a lot of support for different
46:55
forms of data on top of it.
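Here is a toy sketch of the layering being described, with hypothetical names (this is not UStore's real C API): expose an ordered associative container through a tiny C ABI, and everything above that boundary, the document, graph, and vector modalities plus the language bindings, can be built without knowing which engine sits underneath.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

extern "C" {
typedef void* kv_store_t;

// Open/close an engine instance; std::map stands in for a real engine.
kv_store_t kv_open() {
    return new std::map<std::string, std::string>();
}
void kv_close(kv_store_t s) {
    delete static_cast<std::map<std::string, std::string>*>(s);
}

void kv_put(kv_store_t s, char const* key, size_t key_len,
            char const* val, size_t val_len) {
    auto* m = static_cast<std::map<std::string, std::string>*>(s);
    (*m)[std::string(key, key_len)] = std::string(val, val_len);
}

// Returns bytes copied into `out`, or 0 if the key is absent.
size_t kv_get(kv_store_t s, char const* key, size_t key_len,
              char* out, size_t out_cap) {
    auto* m = static_cast<std::map<std::string, std::string>*>(s);
    auto it = m->find(std::string(key, key_len));
    if (it == m->end()) return 0;
    size_t n = it->second.size() < out_cap ? it->second.size() : out_cap;
    std::memcpy(out, it->second.data(), n);
    return n;
}
} // extern "C"
```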
46:57
You also get bindings and SDKs
47:00
for languages like C,
47:01
C++, Python,
47:04
as well as Golang and Java that have partial
47:07
support. We also had some contributions
47:09
from the community, people trying to implement Rust
47:11
bindings around it. So you've
47:14
briefly mentioned some of the benchmarks and the performance
47:16
numbers,
47:17
and I can elaborate on them as well. So
47:20
in our case,
47:22
one thing that really
47:25
surprised me a few years ago was
47:27
that I was so focused on AI and
47:30
compute and high-performance computing, I
47:32
didn't really think much about storage. When
47:35
I just tried to bring up the systems
47:37
together and compose them into one
47:39
product or one solution, I still
47:41
needed some storage. So I took
47:43
RocksDB, which is an open source
47:45
key value store by Facebook, which
47:48
seemingly is the most commonly used
47:50
database engine today. And whenever
47:52
there's a new database company, there is a very
47:55
non-zero chance that
47:58
they're using RocksDB as their underlying engine.
48:01
So essentially what's happening is the database
48:03
is adding its own logic for specific workloads
48:05
such as processing graphs in the case
48:07
of Neo4j. And beneath
48:10
it what's happening
48:11
is that
48:13
this is all converted into binary
48:15
data and is stored in RocksDB or some
48:17
other key value store. So
48:19
in reality, at least from my perspective,
48:22
the absolute majority of work that has to
48:24
be done is in the key value store layer. This
48:28
specific example in Neo4j is like
48:31
one of my biggest pains, because
48:33
I've been always fond of graphs and
48:35
graph theory. And Neo4j is
48:37
kind of synonymous with graph and graph
48:39
databases today. This company
48:42
has raised $755 million. So
48:45
their product must
48:48
be as polished as possible. And they've
48:50
been around for over a decade.
48:52
But every single time that I try to run
48:54
this database, it crashes with classical
48:57
Java errors. And as a C++
48:59
community, it's almost our obligation to
49:01
kind of make jokes about the Java
49:03
runtime and all
49:06
the garbage collection issues that people face
49:09
in that land. So I was facing them
49:11
all the time. There wasn't a case where I
49:13
would try to put a graph even remotely
49:15
interesting to me in terms of size,
49:17
and Neo4j wouldn't crash.
49:20
Either I, after 20 years of programming,
49:23
am so bad that I cannot even like start up a
49:25
database, or something is really
49:27
wrong on their infrastructure level.
49:29
And something was actually wrong. So
49:32
until a couple of years ago, they were not using
49:34
RocksDB, they had like an internal key value store.
49:37
And in 2019, they decided
49:39
that they're kind of switching to RocksDB as a
49:41
new faster engine.
49:43
But even before they switched to RocksDB,
49:45
similar to companies like CockroachDB, Yugabyte,
49:48
and countless other companies,
49:50
half of which have this
49:52
premise of let's take Postgres and
49:54
put Postgres' engine, like
49:57
for query execution on top of RocksDB.
49:59
Even before they started doing this, we realized
50:02
RocksDB is way too slow for us. So
50:04
our ambitions were much higher than
50:07
even the best expectations that other
50:09
databases had for their future
50:11
a few years down the road. So we
50:13
kind of went into the lab,
50:15
I moved to Armenia, of
50:17
all places. We ordered a bunch
50:20
of super, super high-end equipment. So
50:22
we run on the fastest SSDs on earth, 64 core
50:26
liquid-cooled CPUs, Ampere GPUs
50:29
for the last couple of years, we run on 200 gigabit
50:31
InfiniBand networking.
50:33
And we used all the state-of-the-art hardware
50:35
to actually push the limits of what software can
50:37
do.
50:38
Because when your hardware is so freaking
50:40
fast,
50:41
every single bottleneck that remains is on
50:43
your side, the software developer
50:46
side.
50:47
And what we've done, we've
50:48
created the key value
50:50
store
50:51
that is faster than RocksDB today
50:54
in almost every single workload. So
50:56
today, the only workload in which we're just
50:59
a little bit slower is the range scans,
51:02
but this is relatively easy to fix in the
51:04
upcoming versions. But in some crucial
51:06
forms of workloads, such as batch
51:08
insertions and batch read operations,
51:11
when you random-gather or random-scatter
51:14
tons of information onto
51:16
persistent memory or from persistent
51:18
memory, we are five to seven times faster
51:20
than RocksDB, which is a number so
51:23
absurd that most companies,
51:25
especially like smaller startups, didn't believe this
51:28
is possible. The only companies that
51:30
kind of realized it and were
51:32
familiar with my prior work from previous
51:34
years are generally, like, super large,
51:36
trillion-dollar-plus American tech companies.
51:39
And they kind of knew some of my proprietary
51:42
work before that. And when they started testing
51:44
it last year, they were just shocked
51:46
that this is even possible.
51:48
So our database engine can be faster
51:50
than the file system.
51:52
And the only company that has ever
51:54
shown that these numbers are
51:56
possible was Intel, a couple
51:58
of years ago, on their Optane SSDs.
51:59
They
52:01
did it using SPDK, which is a
52:03
user space driver that they design and maintain.
52:06
And they reached 10
52:09
million operations per second, most likely
52:11
with 24 SSDs. But this is a purely
52:13
synthetic workload. We've managed to reach 9.5
52:16
million operations per second
52:18
on our lab
52:21
setup, with Intel people present
52:23
and validating those numbers, on a
52:25
setup with three times fewer SSDs
52:28
and not synthetic read and write operations,
52:30
but actual database operations.
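For a sense of what a batch insertion workload looks like at the API level, here is a minimal sketch against RocksDB's public interface; the guest's own engine and its API are not shown in the episode, so this stands in only as a familiar reference point:

```cpp
// Minimal sketch: a batched write against RocksDB, the baseline engine
// being compared here. Keys, values, and sizes are arbitrary illustrations.
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <string>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    if (!rocksdb::DB::Open(options, "/tmp/batch_kv", &db).ok()) return 1;

    // One WriteBatch amortizes write-ahead-log and locking costs over
    // many entries, instead of paying them once per individual Put().
    rocksdb::WriteBatch batch;
    for (int i = 0; i < 100000; ++i) {
        batch.Put("key" + std::to_string(i), "value" + std::to_string(i));
    }
    rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
    delete db;
    return s.ok() ? 0 : 1;
}
```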
52:32
So this was like an incredible milestone last year,
52:35
a culmination of seven years
52:37
of my work, investments, and
52:40
teaching experience, I guess.
52:42
Very impressive numbers, of course, and
52:44
I'm definitely going to be trying them out myself
52:47
to see how they stack up.
52:49
But I'm sure a lot of people listening will
52:51
be fascinated by this, just
52:53
as we've been discussing it. But
52:54
since we are a C++ podcast, I know you've
52:57
mentioned
52:57
the use of C++ in many cases, but is
53:00
there anything you can say about how you've used
53:02
C++ to achieve these results?
53:04
Yeah, sure. So C++ is essentially the only
53:06
language where I can do this. There
53:09
is no way around it. I tried other
53:11
languages. C++ wasn't the first
53:13
language I used, wasn't the last one
53:16
that I adopted or tried.
53:19
So almost every one of those projects
53:21
is implemented in C++. Our
53:24
stores are implemented in C++, every single one
53:26
of our internal libraries is implemented in C++. But
53:30
as a person who's been doing C++ for well
53:32
over 10 years now, I think it's
53:34
not a single language. It's
53:37
just like a pile of languages mixed together. And
53:39
every more or less senior person kind of
53:42
picks his own subset of what he kind
53:44
of allows within the code base. And
53:47
I guess most of the people who kind of stay in
53:49
the profession for this long, they develop
53:52
a taste and a lot of strong opinions about
53:54
stuff they like and dislike. So
53:57
I am this kind of... Code
54:00
Nazi within my team,
54:02
who is super aggressive in terms of not
54:05
allowing some of the features of the language to
54:07
be used while pushing
54:09
everyone to adopt other features that they
54:11
may have not been familiar with from school. So
54:14
in our case, things that I don't
54:16
like and don't use
54:19
oftentimes would be related to dynamic polymorphism,
54:23
exceptions, and related
54:25
stuff. I guess you can understand,
54:27
especially in a low latency environment.
54:29
We really hate memory allocations.
54:32
We don't use new or delete.
54:35
It's very important for us to have
54:37
full control of the memory system. We
54:40
use NUMA-aware allocators. We design some
54:42
of them.
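For readers who haven't met NUMA-aware allocation, below is a minimal sketch of an STL-compatible allocator that binds memory to a chosen NUMA node via libnuma. It illustrates the general technique, assuming libnuma is available; it is not the team's actual allocator, and NumaNodeAllocator is a hypothetical name:

```cpp
// Minimal sketch of a NUMA-aware allocator using libnuma (link with -lnuma).
#include <cstddef>
#include <new>
#include <vector>
#include <numa.h>

template <typename T>
struct NumaNodeAllocator {
    using value_type = T;
    int node;  // the NUMA node this allocator binds memory to

    explicit NumaNodeAllocator(int node = 0) : node(node) {}
    template <typename U>
    NumaNodeAllocator(const NumaNodeAllocator<U>& o) : node(o.node) {}

    T* allocate(std::size_t n) {
        // numa_alloc_onnode returns page-aligned memory on the given node.
        void* p = numa_alloc_onnode(n * sizeof(T), node);
        if (!p) throw std::bad_alloc{};
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        numa_free(p, n * sizeof(T));
    }
};

template <typename T, typename U>
bool operator==(const NumaNodeAllocator<T>& a, const NumaNodeAllocator<U>& b) {
    return a.node == b.node;
}
template <typename T, typename U>
bool operator!=(const NumaNodeAllocator<T>& a, const NumaNodeAllocator<U>& b) {
    return !(a == b);
}

// Usage: keep this vector's storage on NUMA node 1.
std::vector<float, NumaNodeAllocator<float>> local_buffer{
    NumaNodeAllocator<float>{1}};
```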
54:43
But then on the other side, there are features
54:45
that we can't live without. So
54:48
essentially, being able to compose very
54:50
low level abstractions with super high level
54:52
abstractions is the
54:55
kind of special thing
54:57
about C++. So as I've
55:00
mentioned, we oftentimes build
55:02
function objects that would be
55:04
essentially a templated structure with
55:07
overloaded call operators. So open
55:10
brackets, close brackets operator.
55:13
What we then do, we essentially
55:15
instantiate this template in a few different forms
55:18
and specialize it for all kinds
55:20
of different assembly targets.
55:24
We would have an implementation for x86
55:26
and for ARM. And within
55:28
x86 and ARM, we'll also target a few different generations
55:31
of CPUs. So I guess
55:33
this is one of the things that we really love.
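The pattern being described, a templated function object with an overloaded call operator specialized per assembly target, can be sketched roughly like this. It is a simplified illustration under my own naming, not the guest's actual code:

```cpp
// Minimal sketch: one function object template, specialized per target,
// with the instruction set chosen at compile time.
#include <cstddef>

enum class Target { Generic, X86_AVX2 };

template <Target T>
struct DotProduct {
    // Generic fallback: a plain scalar loop any compiler can handle.
    float operator()(const float* a, const float* b, std::size_t n) const {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
        return sum;
    }
};

#if defined(__AVX2__)
#include <immintrin.h>

// Explicit specialization for AVX2-capable x86 CPUs.
template <>
struct DotProduct<Target::X86_AVX2> {
    float operator()(const float* a, const float* b, std::size_t n) const {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        }
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, acc);
        float sum = 0.0f;
        for (float lane : lanes) sum += lane;
        for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
        return sum;
    }
};
#endif
```

A dispatcher would then pick `DotProduct<Target::X86_AVX2>` or the generic version per build target, with ARM NEON or per-generation variants added as further specializations.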
55:35
Another thing is that we always stick to
55:37
the newest compiler.
55:40
In our case, this is mostly GCC and
55:43
LLVM. We also use NVC++
55:45
and NVCC. We
55:49
also occasionally use Intel's oneAPI
55:51
performance toolkit compilers.
55:54
I guess they are as bad at naming
55:56
as we are, renaming them almost
55:58
every year. I don't know which
56:01
name, ICC or ICX, they go by
56:03
now.
56:05
So those things are
56:07
crucial. Using the recent
56:10
C++ standard is also important
56:13
because when you do a lot of templates
56:15
and metaprogramming in
56:18
C++11, it's constantly std::enable_if.
56:20
Once
56:23
I start remembering all those horrors
56:25
of 2011 and 2012, I
56:28
kind of almost lose consciousness. And
56:31
then when if constexpr appeared with
56:33
C++17, if I'm not mistaken,
56:35
or C++14, whatever,
56:38
we kind of immediately adopted this.
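To show the contrast being described, here is the same compile-time dispatch written both ways; a minimal sketch of my own, not code from the guest's projects:

```cpp
#include <cstring>
#include <new>
#include <type_traits>

// C++11 style: two overloads gated by std::enable_if (SFINAE).
template <typename T>
typename std::enable_if<std::is_trivially_copyable<T>::value>::type
store(void* dst, const T& v) { std::memcpy(dst, &v, sizeof(T)); }

template <typename T>
typename std::enable_if<!std::is_trivially_copyable<T>::value>::type
store(void* dst, const T& v) { new (dst) T(v); }

// C++17 style: one function, with the branch resolved at compile time.
template <typename T>
void store17(void* dst, const T& v) {
    if constexpr (std::is_trivially_copyable_v<T>)
        std::memcpy(dst, &v, sizeof(T));
    else
        new (dst) T(v);  // placement-construct a copy
}
```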
56:40
We now use C++20 where we can.
56:42
Some features, unfortunately,
56:45
don't work for us
56:46
for now. So coroutines still allocate.
56:49
And when you reach 10
56:52
million IOPS on eight SSDs
56:55
or aim to get to, let's say, 20
56:57
million IOPS, heap allocations
57:00
are not good. So we cannot use coroutines
57:02
there. We have to rely on PRC
57:04
interfaces. But overall,
57:07
I would claim one more time that
57:09
C++ is essentially the only language where we can achieve
57:11
this.
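On the coroutine point: a C++20 coroutine frame is heap-allocated by default (the allocation can sometimes be elided, but not reliably). One common mitigation, sketched below with hypothetical names and a toy pool, is to give the promise type its own operator new/delete so frames come from custom storage:

```cpp
#include <coroutine>
#include <cstddef>
#include <cstdlib>

// Toy frame pool; a real one would be lock-free and NUMA-aware.
struct FramePool {
    void* allocate(std::size_t n) { return std::malloc(n); }
    void release(void* p) { std::free(p); }
};
inline FramePool g_frame_pool;

struct Task {
    struct promise_type {
        Task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
        // The compiler routes the coroutine frame through these.
        static void* operator new(std::size_t n) { return g_frame_pool.allocate(n); }
        static void operator delete(void* p) { g_frame_pool.release(p); }
    };
};

Task example() { co_return; }  // frame comes from g_frame_pool
```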
57:13
Yeah. So a lot of what you said sounds quite familiar
57:15
to me, coming from an audio processing background:
57:18
not wanting to do allocations,
57:20
avoiding branches and runtime polymorphism,
57:23
and
57:24
writing your own allocators and all of that stuff, all
57:26
the low latency stuff sounds familiar. I guess the big difference
57:28
is that in music
57:30
software, typically you can't use the latest
57:33
and greatest compilers because A, you have
57:35
to typically ship on macOS and Apple Clang
57:37
is a bit behind.
57:38
And B, if
57:40
you're shipping an audio plugin, you typically have to support
57:43
older versions of macOS. So you're also
57:45
constrained on what standard library
57:47
versions you can use
57:49
because the
57:50
stuff might just not be there on an
57:53
older macOS version.
57:55
I guess you don't have any of those problems. So you can
57:57
really take
57:58
full advantage of the latest and greatest.
59:59
this and ask you a completely different
1:00:02
question if you don't mind. So we
1:00:04
talked a little bit about AI and obviously you're
1:00:06
kind of working on the
1:00:08
kind of plumbing that makes all
1:00:10
of this work. But
1:00:12
if you zoom out really far, like something
1:00:14
that I found really striking over the last
1:00:16
year or so is how AI
1:00:18
systems like ChatGPT or DALL·E
1:00:21
or Midjourney have kind of transformed
1:00:23
how people do things. And
1:00:25
I wonder what your thoughts are
1:00:27
on this latest generation of AI. Are
1:00:30
they going to wipe us all out
1:00:33
as humanity within the next few years?
1:00:35
And like, AI is going to rule the planet
1:00:37
or do
1:00:38
you have any thoughts on that? Well,
1:00:41
I wouldn't be that pessimistic.
1:00:44
So of course, I just had to say that because I'm a massive
1:00:47
science fiction nerd. And it's kind of a thing that
1:00:49
people have been worrying about for quite a long
1:00:52
time.
1:00:53
So one part we have to take seriously is
1:00:55
the fact that work is changing
1:00:57
and, like, jobs will be replaced, obviously.
1:01:01
Some people are very frightened by this and
1:01:03
it's understandable. Change is always frightening.
1:01:06
But on the other side, we can
1:01:09
now get much more efficient with AI and
1:01:12
people can unlock a lot more of their creativity.
1:01:14
So there is a lot of opportunity
1:01:17
for people who may kind of get
1:01:19
replaced now to actually adopt a
1:01:21
new skill, which will not be easy, obviously.
1:01:24
And we have to be compassionate with them and
1:01:26
help them kind of adopt AI to
1:01:28
kind of get into a new labor
1:01:31
market.
1:01:31
But in general, I was
1:01:34
always optimistic about AI. So
1:01:37
I
1:01:38
think people are perfectly fine
1:01:41
finding ways to kill each other, even
1:01:43
without AI. So like
1:01:46
if there's something coming to kill us, I think
1:01:48
it's more likely ourselves rather than an artificial
1:01:51
form of intelligence. So I would
1:01:53
definitely bet on humans any
1:01:55
time of the day, if this question is asked.
1:01:58
But on the opposite side.
1:01:59
And just looking back
1:02:02
at the last couple of months since the ChatGPT
1:02:04
release,
1:02:06
I would say for people who are inside
1:02:08
the industry and have been pre-training actively,
1:02:11
and there are not too many teams like that. So there's
1:02:14
one major team, one major cluster
1:02:16
that is the US and maybe UK with
1:02:19
DeepMind, another cluster is maybe
1:02:21
like Russia and now South Caucasus
1:02:23
where a lot of this talent has moved, ourselves
1:02:26
included. And another major cluster would
1:02:28
be China where people actually have the
1:02:30
resources to pre-train those models because this is not
1:02:32
cheap. Like you need thousands of GPUs,
1:02:35
this puts your
1:02:36
starting budget at $100 million and
1:02:38
above. So like small
1:02:40
labs cannot really compete within this
1:02:44
modern heavyweight category.
1:02:47
So there are not many teams, but the
1:02:49
people who are inside of those teams have
1:02:51
been familiar with the incremental steps
1:02:54
and the incremental progress that was happening. So
1:02:56
I don't think for many of them the ChatGPT release
1:02:58
was a shocker. Many of them have been
1:03:00
working on similar technology and have seen
1:03:02
every preceding paper that
1:03:05
came before that. Still it's
1:03:07
lovely to see the attention
1:03:09
to the industry. And I've seen a lot of hype
1:03:11
cycles in the last couple of decades. Like
1:03:15
crypto was the most recent one and I think
1:03:17
people are still confused about any
1:03:19
application or any effect that crypto
1:03:22
can have on our everyday lives. But
1:03:24
with AI, we have such an insane
1:03:27
level of adoption already. So I think
1:03:29
there's been like a billion people who have interacted
1:03:31
with AI over the course of the last few months who
1:03:33
have never touched AI or AI related
1:03:36
tools before that. So I'm very passionate.
1:03:39
Right.
1:03:39
Yeah. I mean, what you say about the job
1:03:41
market, I've actually been thinking about that too, because
1:03:44
I'm a developer advocate. So things that I do
1:03:47
include things like writing
1:03:48
blog posts or recording videos about
1:03:50
how to do something.
1:03:52
And basically now I can
1:03:55
get an AI to write the script for
1:03:57
the video and then I can train an AI to read
1:03:59
it out in my voice, so basically I don't have to do
1:04:02
anything anymore. So
1:04:04
that's kind of an interesting thought.
1:04:07
Okay. Shifting gears now again completely,
1:04:09
because you mentioned the
1:04:11
South Caucasus and how there's a lot
1:04:13
of talent there. So you,
1:04:16
your bio says that you actually
1:04:18
founded a C++ meetup in Armenia.
1:04:20
And it's obviously something I'm very excited about.
1:04:23
So can you tell us just a little bit about the
1:04:25
C++ scene in Armenia and that meetup that
1:04:27
you've started and how that's going? Yeah,
1:04:30
sure. It's actually absolutely wonderful.
1:04:32
So when I moved to Armenia a couple of years ago,
1:04:34
as a person who wasn't born or raised there, I
1:04:37
just had some ancestry. Most people
1:04:39
thought I was kind of crazy to do this,
1:04:41
because I had a few other opportunities
1:04:44
in other countries where I could go, but
1:04:46
now they tend to realize what an
1:04:49
undervalued gem Armenia
1:04:51
is within, like, the
1:04:54
local region. So essentially
1:04:56
in Armenia, we have a ton of hardware and
1:04:58
like chip design companies. So
1:05:00
Armenia has one of the largest offices
1:05:02
of Synopsys, which is like a chip design,
1:05:04
EDA company based in the
1:05:07
United States. Their second largest office,
1:05:09
as far as I know, is in Armenia. And the third largest
1:05:12
is in India and India has a 750 times
1:05:14
larger population than Armenia.
1:05:16
And now Nvidia
1:05:18
and Mellanox also have a presence in Armenia.
1:05:21
Their office is only like five minutes walking
1:05:23
distance away from mine. There's
1:05:25
AMD and Xilinx. There's Siemens EDA.
1:05:28
And whenever you hear like chip design and hardware,
1:05:31
you kind of immediately get that these are
1:05:33
low level people. It's likely that they
1:05:35
use low level languages. So obviously C++
1:05:38
is a major part of their professional
1:05:39
life, and sometimes now
1:05:42
like part of their after-
1:05:44
work interactions. So the
1:05:46
problem I saw when I arrived was
1:05:48
that there were a lot of professionals who use
1:05:51
C++ daily, but they are not always familiar
1:05:53
with the newest standards. They don't
1:05:56
meet together too often to discuss how
1:05:58
people tackle different problems. And
1:06:01
the overall exchange of ideas
1:06:03
between junior developers and senior developers
1:06:05
is not as rapid as maybe let's say within the United
1:06:07
States or within Russia, the places
1:06:10
where like the developer ecosystem is much more
1:06:12
developed. So I thought it makes
1:06:14
sense to help ignite this
1:06:16
activity a little bit. In the last couple of months,
1:06:19
sorry, a couple of years, we had maybe like six live
1:06:21
meetings. We grew from
1:06:23
just 10 attendees to maybe 750
1:06:26
members who kind of went through
1:06:28
those meetups and our groups
1:06:29
and kind of chat and discuss within
1:06:33
our vicinity. But there are definitely
1:06:35
a lot more developers who do C++
1:06:38
but haven't been part of a community so far.
1:06:40
So I guess there are a few thousand more. And
1:06:43
overall, we now have a decacorn in
1:06:45
Armenia, a unicorn, and five
1:06:47
to seven other companies that are about to become unicorns.
1:06:50
So this density
1:06:52
in a city with less than a million population
1:06:55
kind of dwarfs or like at least
1:06:57
competes with maybe half
1:06:59
of Scandinavia, if not all of Scandinavia
1:07:02
combined. So I'm very
1:07:04
excited. I really want to
1:07:06
invite everyone to come visit us in our country
1:07:08
in the South Caucasus. And I'll
1:07:10
try to do my part in this and hopefully next
1:07:13
year, just like Phil is organizing
1:07:15
C++ on Sea. We were not that
1:07:17
lucky to get access to a sea,
1:07:20
but we have beautiful mountains. Maybe we
1:07:22
should call our conference next
1:07:24
year C++ in the Mountains.
1:07:26
The capital is already elevated
1:07:29
a kilometer above sea level,
1:07:31
but we can go even higher than that somewhere in a
1:07:33
beautiful place with a beautiful view
1:07:36
and just chat about C++ with all the brightest
1:07:38
from all over the world. So come visit us.
1:07:40
So Ash, that sounds absolutely amazing.
1:07:43
I've never actually been to Armenia, but I always wanted
1:07:45
to go. So if
1:07:48
something like that were to happen there, I would totally
1:07:50
show up. I think that sounds really exciting.
1:07:53
And we'll be very excited to have you.
1:07:56
And if we have listeners in Armenia
1:07:58
who don't know about your meetup,
1:07:59
I'm sure there's a link that you can give us that we'll put
1:08:02
in the show notes, and
1:08:03
they can link up with you, and obviously spread the word on
1:08:05
our media, because there's obviously a lot of
1:08:08
technical people there that enjoy
1:08:10
the show.
1:08:12
Thank you very much for sharing all this
1:08:14
with us. We've run way over time
1:08:16
once again, but is there anything else you want to tell
1:08:19
us before we let you go?
1:08:21
Just that you, both of you, are
1:08:23
amazing hosts. I'm
1:08:25
happy to be here and would be happy
1:08:27
to chat about any one of our projects, both
1:08:31
on any podcast or within
1:08:33
our Discord groups. A lot
1:08:35
of the projects are open source. Come
1:08:37
try it. Share your experience. Don't
1:08:41
hesitate to ping us whenever you see bugs
1:08:43
because there are a lot of them, I believe.
1:08:45
Especially with the build and compilation
1:08:47
times, sometimes the packages are a bit outdated,
1:08:50
but we're doing all the best we can to
1:08:52
actually keep the best software, the fastest
1:08:54
software, always available to our users.
1:08:57
Where can people reach you, Ash?
1:08:59
I have accounts
1:09:02
that I often check
1:09:04
and use on places like LinkedIn,
1:09:08
GitHub, Facebook, and Twitter.
1:09:10
I had essentially read-only accounts on
1:09:13
Twitter for a few years, but I guess,
1:09:15
considering how many tech people are
1:09:17
on Twitter, I have to change my policy
1:09:19
in that regard and start becoming more active. Again,
1:09:22
my name is the same everywhere, Ashot Vardanyan,
1:09:25
but aside from this, there's also a Discord channel
1:09:27
that you can find opening any
1:09:29
one of our open source
1:09:29
repositories. There are a few glyphs at the
1:09:32
top, and one of those is a Discord
1:09:34
link which you
1:09:36
can click to connect not
1:09:38
just with me, but with every one
1:09:41
of the engineers on my teams in Armenia
1:09:43
and abroad. Thanks, Ash. We'll put
1:09:45
some of those links into the show notes as
1:09:48
well. Lovely, guys. Pleasure
1:09:51
talking to you. Thank you so much, Ash,
1:09:53
for being a guest today.
1:09:56
Thanks so much for listening in as we chat about C++.
1:09:58
We'd love to hear what you think
1:10:00
of the podcast. Please let us know if
1:10:02
we're discussing the stuff you're interested in, or
1:10:05
if you have a suggestion for a guest or topic we'd
1:10:07
love to hear about that too. You can email
1:10:09
all your thoughts through feedback at cppcast.com.
1:10:12
We'd also appreciate it if you can follow Cppcast
1:10:15
on Twitter or Mastodon. You can
1:10:17
also follow me and Phil individually on
1:10:19
Twitter or Mastodon. All those links, as
1:10:22
well as the show notes, can be found on the podcast
1:10:24
website at cppcast.com.
1:10:28
The theme music for this episode was provided by
1:10:30
podcastthemes.com.