Episode Transcript
Transcripts are displayed as originally observed. Some content, including advertisements, may have changed.
2:00
primarily focus on the
2:02
stability, deployment, and cost reduction
2:05
of the big data analytics clusters
2:09
on AWS such as
2:12
Druid, Trino, Spark,
2:16
etc. I'm also
2:19
involved with Kafka
2:21
production issues and cost
2:24
reduction issues, but mainly in
2:26
production and stability issues on
2:29
Kafka. Prior to
2:31
that, I was in SRE for
2:34
a company called Cognite, where
2:38
I was in charge of the
2:41
stability of big
2:43
data clusters on prem,
2:46
for Linux on prem, mainly
2:48
for Spark streaming, Spark batch,
2:51
and Presto on HDFS
2:54
at various customer sites.
2:57
Prior to that, I was a Java
3:00
backend engineer for about
3:02
10 years. And
3:05
I recently published a book
3:07
called Kafka Troubleshooting in Production,
3:11
parts of which will be
3:13
the main
3:15
topics of what we will discuss today. And
3:18
it talks about how to handle production
3:21
issues in Kafka clusters, both
3:24
on prem and on the
3:26
cloud on AWS. And
3:28
do you remember how you first got started working
3:30
in data and what it is about that space
3:32
that keeps you interested? So
3:34
I started working
3:37
on big data about
3:39
seven years ago, when I developed
3:43
my first Spark streaming
3:45
application, consuming from Kafka
3:48
and persisting to HDFS on
3:50
Linux on prem clusters. Before
3:52
that, I just wrote backend
3:55
applications that were reading
3:57
from some databases and writing
4:00
into some databases, but it wasn't formally
4:02
big data. And
4:07
five years ago, I understood that on
4:09
big data clusters there were
4:13
production issues that I hadn't seen before,
4:17
and so
4:19
I moved from being a developer
4:22
to being an SRE,
4:24
solving stability
4:26
issues in these clusters, which
4:28
meant knowing better not my
4:30
code, but the
4:34
infrastructure, the cluster that my code,
4:37
or some code, some big
4:39
data code runs on, whether
4:41
these are Presto clusters, Spark
4:45
clusters, Spark batches, Spark
4:47
streaming, Kafka, HDFS,
4:52
and it was a whole new
4:55
world for me. And
4:58
that's what brought me specifically
5:03
to handle Kafka issues,
5:05
which I have been doing,
5:09
in production, since 2018, but
5:13
I started doing it a year before that. So
5:15
maybe the question is
5:18
more what brought me to stop
5:20
being a developer and
5:23
become an SRE, let's say,
5:26
of big data clusters, and
5:28
specifically focusing on the cost reduction,
5:30
whether on-prem or on the cloud,
5:38
and focusing on
5:41
Kafka. So
5:43
I will elaborate on the Kafka
5:45
issues. So I understood that Kafka
5:47
is stuck in the
5:49
middle of everything. If there is
5:51
a problem with Kafka, your
5:54
whole data pipeline is just stuck.
5:57
Producers can't write, consumers can't
5:59
read. And
6:01
it was a combination
6:04
of two things. One,
6:06
the understanding that the
6:09
best use of my time would be to focus
6:11
on Kafka. And the
6:13
second thing was that I
6:16
stumbled into a book called
6:18
Systems Performance by Brendan Gregg, who
6:22
was the lead performance engineer at
6:24
Netflix. And now he
6:26
works under the Intel
6:28
CTO on
6:31
distributed clusters. And
6:34
this really opened up
6:36
a world that
6:38
I wasn't aware of before, of
6:41
monitoring and detecting bottlenecks in
6:44
Linux clusters. And
6:47
not only understanding what's
6:50
your bottleneck, what's the cluster bottleneck,
6:52
but also it opened the way
6:54
to reduce costs. And
6:56
what I saw when I stumbled
6:59
into production issues in Kafka,
7:02
Spark, Presto, Druid,
7:06
for that matter, any service
7:08
running on a Linux
7:11
cluster, is that
7:13
understanding how to diagnose issues
7:15
in a Linux cluster can
7:18
really help you detect what's
7:21
the problem in many cases. And
7:23
also allows you to
7:25
reduce costs on the cloud. In
7:29
terms of the Kafka focus, you
7:31
mentioned that you were
7:33
working as a backend developer, you
7:35
were interested in the use cases
7:37
for Kafka, the production requirements around
7:39
it, how to make sure that
7:41
it was stable and performant. And
7:44
I'm wondering if you can talk to, at
7:46
least at a high level, some of the
7:48
different environments in which you've had experience working
7:50
with Kafka and some of the categories
7:52
of operational challenge that you've had to
7:54
deal with in that process. Sure,
7:57
so I started with on-prem.
10:01
of signals that can say
10:03
that something is going to go
10:05
wrong, or something is wrong in the
10:07
cluster. So
10:09
as an SRE on-prem, you
10:11
reach the problems once the cluster
10:13
halts sometimes. But
10:16
there are also, on the
10:18
cloud, the cloud provider will
10:20
tell you, okay, you don't need to handle
10:22
these failures. However,
10:29
even on the cloud, a broker can halt
10:32
because of disk deterioration,
10:34
and you wouldn't know of it
10:36
until you get into lags. But
10:39
you don't need to replace the disk,
10:41
you just can replace the broker, which
10:45
is much easier in
10:47
the cloud versus on-prem, where you need
10:49
to get
10:52
into the drawer of disks and
10:54
replace the disk. Another
10:56
big difference is the scaling. If
10:59
you have more traffic, you can
11:01
scale out. On the cloud, you can just scale out
11:03
or scale up by
11:06
spinning up a new cluster
11:09
pretty easily, while on-prem,
11:13
it's very, very tough. Because
11:15
think of it like if you want to scale up,
11:17
you need to have disks on
11:19
site. And sometimes you don't have these disks
11:21
on site. What happens if you need to ship
11:24
disks to another site, and your customer
11:26
is in another country? Another
11:29
issue is, like,
11:32
what happens when you need more RAM? On
11:35
the cloud, you just spin
11:38
a cluster with instances with more RAM.
11:42
But on-prem, you need to check
11:44
that you have enough slots in
11:46
order to insert DIMM
11:48
sticks, memory sticks, into
11:52
the cluster, into each machine. And
11:55
you need to do it manually. So these are
12:01
the two main differences: the scaling
12:05
options, which are much easier on the
12:07
cloud, and
12:09
handling failures, hardware
12:12
failures, which,
12:16
when you are
12:18
the SRE of on-prem, it's on
12:20
you
12:23
to manage the issue, while
12:25
in the cloud you just
12:28
need to detect
12:30
the issue and then replace the machine.
12:33
So, I came from
12:33
on-prem, not
12:35
only with Kafka but
12:38
also with Spark applications, and it's
12:40
pretty much the same between on-prem
12:43
and the cloud: these two
12:45
issues. However, the
12:47
benefit of deploying on-prem is
12:49
the fact that you don't pay
12:53
for every hour, you just
12:56
pay once, and
13:01
there is I think today there is a
13:03
growing discussion whether
13:06
a cloud-based company
13:10
maybe they need to go on-prem for some
13:12
of their clusters so
13:16
it's that the math
13:18
behind this calculation sometimes,
13:22
I think, favors on-prem
13:24
even though you need
13:26
to handle this:
13:28
the scaling is very tough,
13:30
and it's on
13:33
you to detect the hardware
13:35
issues but maybe the cost
13:37
sometimes justifies this. Yeah
13:40
the on-premise versus in cloud
13:42
debate is definitely ongoing and
13:44
always very nuanced and to
13:46
your point yeah where on-prem
13:50
the cost over time is much
13:52
lower because you own the hardware so you
13:54
don't have to pay a continuous upkeep for
13:56
it but there's the opportunity cost
13:58
of having to move slowly. and
14:00
more deliberately and perhaps
14:03
reducing the number of chances or experiments that
14:05
you take because of the fact that it
14:07
is so much lead time to bring in
14:09
that hardware and scale up the cluster. And
14:12
also on the point of Kafka
14:14
that's an interesting aspect as well
14:16
where my understanding
14:18
of the way that Kafka itself is
14:20
designed and some of the aspects of
14:24
having to define upfront the number of
14:26
topics in order to accommodate a certain
14:28
number of clients seems as though it
14:30
lends itself more readily to that fixed
14:32
installment on-prem environment versus the
14:34
cloud environment where you are
14:37
incentivized to elastically scale up
14:39
and down and I'm wondering
14:41
what you see as some of the challenges
14:44
of bringing Kafka into the cloud because
14:46
of that potential for elasticity in the
14:48
clusters. I
14:52
see that if
14:55
you don't care about the money,
14:57
use the cloud. The
14:59
money is
15:01
the only reason for going on-prem, because
15:03
on-prem is indeed tough.
15:05
But it's interesting that you mention that
15:09
Kafka might sound like
15:12
a good candidate to
15:14
be deployed on-prem. And
15:17
by the way, there are companies
15:19
where most of their clusters
15:23
hosting
15:25
open source analytics tools or
15:28
messaging tools are deployed on the cloud, but
15:31
Kafka is deployed on-prem.
15:33
I know
15:35
of two not small Israeli
15:38
companies that have their Kafka
15:40
deployed on-prem and
15:42
And from
15:44
my experience, because I was
15:47
an SRE for various
15:50
customers when I was working
15:52
on-prem, I saw
15:56
several examples.
15:59
And
18:00
the CPU utilization is like
18:02
ridiculous. It's 10% user
18:04
time. And the
18:06
RAM usage, well, you never know what's your RAM
18:09
usage because we might later
18:11
talk about the page cache, but it's very
18:13
hard to understand what's the RAM usage in
18:15
Kafka. But the disk
18:18
storage utilization
18:20
was really high because of
18:22
this retention. And then
18:24
you come to a point, and I dedicated
18:26
a chapter in the book only for this,
18:29
to whether you use RAID, RAID
18:31
1 plus
18:34
RAID 0, or use JBOD. Because
18:36
I can give a real example
18:39
of a customer that had 17 brokers,
18:45
just in order to satisfy the retention
18:47
requirements of the customer. And
18:49
then the customer decided to
18:52
double the amount of retention because
18:54
that's the order that he got.
18:56
It was some government law enforcement.
19:00
So you don't mess with them. They tell
19:02
you, OK, I need to double my retention.
19:04
So you need to satisfy this. And
19:09
the first reaction was, OK, let's double
19:11
the number of brokers. But
19:14
then I convinced
19:19
my manager that we just
19:22
need to switch from RAID 10 to
19:25
JBOD. RAID 10 gives you
19:27
double the amount of replication. So if you
19:29
have replication factor of three, you will get
19:31
six copies per segment if you use RAID
19:33
10. But if you use
19:36
JBOD, then you save
19:38
half of the disks. And
19:41
then you don't need to add
19:43
even one broker. Just use
19:46
the same number of disks. And if you
19:48
want even more storage, you can just add
19:50
more disks. But then you run into the
19:53
issue of, OK, I don't have enough
19:55
disks in my drawer. So
19:58
you need to add another broker, or add
20:00
a drawer to each broker. These
20:03
are things that on the cloud you don't
20:05
even think about. People on the cloud
20:07
don't even know about them. But
20:10
if I had to pick one thing that
20:12
really
20:17
makes provisioning Kafka on-prem
20:19
tough, it's the retention
20:22
requirements. And managers for
20:25
on-prem clusters tend to be very sensitive
20:27
for retention. They want 10 times
20:30
the amount of retention
20:33
or 20 times the amount of
20:35
retention that internet companies have, for
20:37
example. And bringing us
20:40
around to the book that you wrote,
20:42
which you mentioned, what was your motivation
20:44
for bringing this all together in the
20:46
written form? And what are the overall
20:48
goals that you have for the book
20:50
and the people who are reading it?
20:55
So back in,
20:57
I remember myself as a
21:00
back-end engineer trying to deploy
21:03
my application, my streaming application
21:05
on Kafka, and
21:07
something didn't work in the Kafka on
21:09
dev. So I remember
21:11
going to the DevOps
21:14
room, and everyone is afraid
21:16
when you go to the DevOps room
21:19
because they are the most important, the
21:21
critical part of the
21:23
organization. And I
21:25
asked them, okay, I don't know what's wrong
21:28
with my Kafka. And they
21:30
had no clue, and I didn't get an answer because
21:32
they had no time. And then I went out to
21:34
the room. I decided I will know Kafka from
21:37
all the angles. That was
21:40
my really initial motivation. But
21:43
if to be more serious, when
21:47
I started handling production issues in
21:49
Kafka and I went over
21:51
to sites, customer
21:53
sites, I saw that
21:55
people just looked clueless about
21:58
what's wrong with Kafka. Everyone
22:02
blamed the other part.
22:06
The DevOps blamed consumers, consumers
22:08
blamed producers, producers blamed Kafka. And
22:11
so I wanted to address that, and
22:16
Kafka is in the middle of the data
22:18
pipeline. So that was my
22:20
motivation. Like the best return
22:22
on my time was definitely
22:24
Kafka and the Linux operating system,
22:26
like understanding the metrics and
22:28
how to diagnose
22:31
problems in Kafka using Linux metrics.
22:34
Anyhow,
22:36
this is why I wanted
22:38
to learn Kafka
22:41
from real production issues. My
22:44
motivation for the book was
22:46
that support engineers and
22:48
DevOps and developers who encounter
22:51
issues in Kafka, where it
22:53
is not a managed service, so
22:55
they manage it themselves, will
22:57
have a cookbook,
23:01
like recipes, for understanding
23:05
how to handle production issues in Kafka. And
23:07
this is why the book
23:10
is split into three sections,
23:13
three logical sections: the
23:15
data section, the Linux
23:17
OS section and the Kafka metrics section.
23:21
And most
23:23
of it is cloud-based, but
23:25
also two chapters dedicated specifically
23:28
to on-prem. Because
23:32
some of the problems that I saw
23:35
on-prem just repeated themselves when I was
23:37
starting to work on the cloud. On
23:40
the cloud is just easier because you have monitoring
23:43
in front of your eyes, you don't need to get logs
23:46
from support engineers. But
23:51
I saw that there
23:53
was no book. There was nothing
23:56
that even resembled a book about
23:58
real production issues in Kafka. And
24:00
there is a real need
24:02
for this, because there are
24:04
so many Kafka deployments out there,
24:06
on-prem and on the cloud.
24:09
And people are
24:11
just... I talked to a
24:13
CTO of a startup
24:15
who told me, I must use Kafka
24:17
because that's the message bus today, but
24:19
it's hell. Managing
24:21
it is so tough.
24:24
I heard it after that
24:26
from different people,
24:29
different managers,
24:31
ops managers,
24:33
tech leaders in some
24:36
small companies.
24:38
To this day they describe
24:41
common problems, and
24:44
then they pushed for
24:46
the book.
24:48
I mean, it didn't come to me
24:50
right away to write a
24:52
book, and it
24:55
took me some time because it's a lot of
24:57
work. And then I
24:59
decided to gather all
25:01
the stuff that I compiled over
25:04
the six years of me working with
25:07
Kafka. And that
25:09
was my motivation.
25:11
Like I'm not saying that people
25:13
shouldn't use Amazon MSK or,
25:15
let's say, Confluent or Aiven.
25:18
I think that
25:21
for those who already manage
25:23
Kafka, they should have
25:25
a better guide,
25:27
a good guide, first of
25:29
all, and also for those who
25:31
consider whether to keep managing Kafka themselves
25:34
or say, okay, it's
25:36
too much, so we need
25:38
to pay a license and move to
25:40
some managed service.
25:43
I think that if they have a good,
25:45
detailed guide, then it
25:47
can save them money.
25:50
I work with open source tools that have helped
25:52
us for twenty years, and like
25:54
I say, this is my small
25:56
contribution to the open source
25:59
community. To the
26:01
point of saving money and whether
26:03
to run your own Kafka cluster
26:05
or use a managed service, there
26:07
are a lot of considerations that
26:09
go into all of that, as
26:12
well as the use cases of what
26:14
you're going to be building on top
26:17
of Kafka. And what
26:19
are the sizing and scaling requirements?
26:21
I'm wondering for people who are
26:24
in that position of deciding, do I
26:26
want to use Kafka? How do I
26:28
want to use Kafka as far as self
26:31
managed or managed service? And
26:33
what are the parameters along which I need
26:35
to project what my cost is going to
26:38
be? What are the different
26:40
elements that they need to be considering and
26:42
planning for as they start to evaluate and
26:44
do those initial deployments? So
26:48
first of all, you need to understand your traffic.
26:50
Okay, so let's
26:53
assume you are not a small
26:56
company and that you are at
26:58
the decision point of whether you
27:00
need to use Kafka at all,
27:02
and then if you use Kafka, whether
27:05
you're going to use cloud or you
27:07
manage it yourself. So regarding
27:10
alternatives to Kafka, I'm not
27:12
aware of any alternatives, however,
27:15
I never searched for
27:17
alternatives because Kafka is just everywhere.
27:20
So let's say you chose Kafka and
27:22
you have enough traffic so that you
27:26
can build even a
27:28
minimal cluster of three brokers. But
27:32
let's assume you have more and let's
27:34
assume you have multiple Kafka clusters per
27:36
each team or service or group or
27:38
whatever. So in that case, first
27:40
of all, you need to know your traffic, the
27:43
number of topics in the cluster,
27:46
the number of partitions, and,
27:49
per topic, who are your consumers
27:52
and who are your producers. So
27:55
understanding the producing rate, the
27:57
consuming rate, and
27:59
then then you need to provision a
28:02
cluster. Let's say even before
28:04
you know whether it's cloud based or not,
28:06
you need to understand how much CPU you're
28:08
going to require, how much RAM
28:11
in order to support not the Kafka process
28:13
itself, but the page cache, because you want
28:15
data to be read from the page cache
28:17
and not from the disk. And
28:20
allocate enough disks, cheap
28:22
disks, in order, as
28:24
cheaply as possible, to handle the
28:27
retention requirement. And
28:31
now, after you have this estimation
28:33
and you know how many
28:37
brokers you have and what's the size of the
28:39
broker, now comes
28:41
the decision of whether it's cloud based
28:44
or not. If you are cloud
28:46
based, you will usually pick
28:49
cloud based. But
28:51
again, some companies decide, OK, I will put
28:54
my Kafka on prem because they know it
28:56
cannot run on spot instances. But
28:59
most of the companies which are cloud
29:01
based will deploy it on the cloud.
29:04
And they will deploy it probably
29:06
on on demand instances if they're on
29:08
AWS. Some,
29:10
if they have a strong
29:12
DevOps team, they
29:15
will deploy it on
29:17
Kubernetes maybe. But
29:22
I don't have an experience with Kafka on
29:24
Kubernetes so I can't say anything about it.
29:26
I also didn't write about it, of course.
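The back-of-envelope provisioning described above — know your traffic, size the RAM for the page cache, allocate cheap disks for the retention — can be sketched as a toy calculation. Every number and ratio here is an illustrative assumption, not a recommendation from this conversation:

```python
# Rough Kafka cluster sizing sketch. All thresholds and ratios here are
# illustrative assumptions, not official guidance.

def size_cluster(ingress_mb_per_s: float,
                 retention_hours: float,
                 replication_factor: int = 3,
                 disk_tb_per_broker: float = 2.0,
                 headroom: float = 0.4) -> dict:
    """Estimate broker count from retention storage, plus a page-cache hint."""
    # Total raw storage: every byte is kept for the retention window,
    # multiplied by the replication factor.
    raw_tb = ingress_mb_per_s * 3600 * retention_hours * replication_factor / 1e6
    # Leave headroom so the disks never sit at 100% (the "100% storage" trap).
    usable_tb_per_broker = disk_tb_per_broker * (1 - headroom)
    brokers = max(3, -(-raw_tb // usable_tb_per_broker))  # ceil, minimum 3
    # Page-cache hint: enough RAM to keep recent data hot, so consumers read
    # from memory and not from disk (assume ~30 minutes of traffic stays hot).
    hot_gb_per_broker = ingress_mb_per_s * 1800 / 1000 / brokers
    return {"raw_storage_tb": round(raw_tb, 2),
            "brokers": int(brokers),
            "page_cache_gb_per_broker": round(hot_gb_per_broker, 1)}

print(size_cluster(ingress_mb_per_s=50, retention_hours=72, replication_factor=3))
```

The minimum of three brokers mirrors the minimal cluster mentioned earlier; the 30-minute hot-data window is an arbitrary placeholder you would tune to how far behind your consumers actually run.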
29:30
And
29:33
if you're on prem, you will deploy it on prem,
29:35
of course. Now, the question of
29:38
whether to deploy it on,
29:40
like purchase a license for
29:43
Confluent or for Amazon MSK or
29:48
for Aiven, these are some
29:50
of the managed services, or
29:52
manage it yourself. It
29:56
really depends on how
30:00
your ops team, whether it's DevOps
30:02
or DataOps, know
30:05
how to handle production
30:07
issues. And let's
30:10
say, you know, ops teams that are
30:13
really good, I
30:15
mean, are rare. And
30:20
I think that
30:22
you need
30:24
an excellent monitoring of
30:27
Kafka. And when I say excellent, I don't
30:29
mean hundreds of metrics.
30:32
I mean a
30:34
minimal set of
30:36
metrics that will show you where
30:39
production issues can occur. And
30:41
in my book, I have two chapters,
30:43
one on producer metrics, one
30:45
on consumer metrics, and three
30:50
chapters on CPU,
30:52
RAM and disks for the broker
30:54
themselves. So I think like half
30:56
of the book talks about how
31:00
to diagnose issues, and
31:03
from this, the reader can
31:05
understand what to monitor. So
31:07
you need an excellent monitoring, but not
31:09
too large. You need like
31:12
specific monitoring in
31:16
order to deploy it yourself, because if
31:18
you're blind in Kafka, you
31:20
will pay a big price, you will have
31:22
downtime, and you will just lose
31:25
data. So you must have
31:27
an excellent, excellent monitoring,
31:30
and a team who knows how to,
31:33
how to manage the Kafka cluster. And
31:37
otherwise, choose the
31:40
cloud-based option. Because
31:42
again, the
31:45
knowledge of handling Kafka clusters
31:47
is pretty rare, what
31:49
happens in some places,
31:52
with some customers, and I
31:54
saw it on-prem, is that
31:58
sometimes Kafka halts and they lose their data,
32:00
because there's not enough knowledge.
32:04
That's not, like, that's not the
32:06
kind of situation at my current
32:08
company. We have an excellent DevOps
32:11
team. And its team leader
32:13
was also the technical editor for my
32:15
book, and he influenced
32:17
the content a lot,
32:20
and also
32:23
which parts
32:25
to focus on. His name is Oaronon, and
32:28
I am very
32:30
grateful to him,
32:33
that he invested the time
32:37
to read and to edit the book. But
32:40
again, some companies
32:42
have an excellent DevOps team
32:44
and some don't.
32:47
And if you're willing to lose data at some
32:49
time, then use the
32:51
open source. And if not, then go
32:54
to, then
32:56
pay the license for Managed Kafka.
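The "minimal but excellent monitoring" idea can be sketched as a checklist. The metric names and thresholds below are illustrative assumptions, loosely echoing the signals discussed in this conversation (sustained CPU system time, disk utilization, under-replicated partitions, consumer lag):

```python
# A minimal Kafka health checklist, as a sketch. Metric names and thresholds
# are assumptions for illustration, roughly following the talk.

THRESHOLDS = {
    "cpu_system_pct": 10.0,            # sustained system time above this is suspect
    "disk_util_pct": 95.0,             # iostat %util pinned near 100% is a red flag
    "under_replicated_partitions": 0,  # anything above zero needs attention
    "max_consumer_lag": 100_000,       # messages; depends entirely on the workload
}

def health_alerts(metrics: dict) -> list:
    """Return the names of metrics that crossed their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

sample = {"cpu_system_pct": 14.2, "disk_util_pct": 99.8,
          "under_replicated_partitions": 0, "max_consumer_lag": 1_200}
print(health_alerts(sample))  # cpu_system_pct and disk_util_pct fire
```

The point is the shape, not the numbers: a handful of metrics with hard limits beats hundreds of dashboards nobody reads, and each name here maps to a chapter-sized topic in the book.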
33:01
Data lakes are notoriously complex. For
33:04
data engineers who battle to build and
33:07
scale high quality data workflows on the
33:09
data lake, Starburst powers petabyte scale SQL
33:11
analytics fast at a fraction of the
33:13
cost of traditional methods so that you
33:16
can meet all of your data needs,
33:18
ranging from AI to data applications to
33:20
complete analytics. Trusted by teams of all
33:22
sizes, including Comcast and DoorDash, Starburst is
33:24
a data lake analytics platform that delivers
33:27
the adaptability and flexibility a lakehouse
33:29
ecosystem promises. And
33:31
Starburst does all of this on an
33:34
open architecture with first-class support for Apache
33:36
Iceberg, Delta Lake, and Hoodie, so
33:38
you always maintain ownership of your data. Want
33:41
to see Starburst in action? Go
33:43
to dataengineeringpodcast.com/Starburst and get $500
33:46
in credits to try Starburst
33:48
Galaxy today, the easiest and
33:50
fastest way to get started
33:52
using Trino. And on that note of
33:54
data loss and the... Cluster
34:00
uptime and stability. What
34:03
are some of the failure conditions
34:05
that cluster operators need to be
34:07
thinking about that might lead to
34:09
data loss and some of the
34:12
ways to mitigate that and plan
34:14
for it in order to be able to
34:16
reduce the time to resolution? In
34:19
terms of preventing a data
34:22
loss, let's start with prevention and then
34:24
go to what
34:27
can cause a data loss.
34:29
First of all, you have replication.
34:32
Now replication goes side by side with the
34:35
size of the cluster because the higher
34:37
the replication factor, then each
34:40
segment will have more copies on different
34:42
brokers. However, you will
34:44
need more storage and more
34:46
storage
34:49
sometimes means
34:52
more brokers. And
34:54
even if not more brokers, it means
34:56
more disks. So
34:59
a higher replication factor
35:02
will cost you more. Also, you
35:04
can add another level of
35:07
assurance and deploy
35:09
RAID, let's say RAID 10, RAID 1
35:11
plus RAID 0, and this
35:13
will double the amount of storage
35:15
that you need. So
35:18
replication is one thing. Making
35:21
sure that your producers
35:24
receive
35:26
acks, acknowledgements, from the
35:28
broker is another thing, although it
35:30
may affect latency.
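The acknowledgement point can be made concrete. Below is a hedged sketch of durability-leaning producer settings; the keys are standard Kafka producer configuration names, while the specific values are illustrative trade-offs (as noted, acks=all costs latency):

```python
# Durability-leaning Kafka producer settings, as a sketch. The keys are
# standard producer configuration names; the values are illustrative.

def durable_producer_config(bootstrap: str) -> dict:
    return {
        "bootstrap.servers": bootstrap,
        "acks": "all",                  # wait for all in-sync replicas
        "enable.idempotence": True,     # avoid duplicates on producer retry
        "retries": 2147483647,          # retry transient broker errors
        "delivery.timeout.ms": 120000,  # total time budget per record
        # The price of acks=all is latency: each send waits for the
        # slowest in-sync replica before it is acknowledged.
    }

cfg = durable_producer_config("broker1:9092")
print(cfg["acks"])
```

The broker address here is a made-up placeholder; in a latency-sensitive pipeline you might deliberately relax acks and accept the data-loss window instead.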
35:34
Retention policy. It's a
35:36
pretty simple policy, and you
35:38
wouldn't believe the number of
35:41
times that consumers lose data
35:43
because of it, because you
35:46
can define it by time or by
35:48
size or by both, and then the
35:50
threshold is the minimum between both, and
35:53
if you
35:55
define it by size, and
35:58
suddenly the topic traffic increases
36:01
even by a small percentage
36:03
and your consumer lags, you will
36:05
lose data or some consumers will lose
36:08
data. So if
36:13
the audience takes one thing regarding
36:16
retention, it's that they should
36:18
highly consider configuring retention
36:20
by time and
36:23
not by size because you really
36:26
don't know the traffic at any
36:28
point of time and
36:31
some producer can increase the traffic by
36:33
a multiple of
36:35
10 because there
36:37
was some filter and the developer
36:39
just removed that filter and
36:42
suddenly you get 10 times the traffic.
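That size-based retention trap is easy to put in numbers: the byte cap stays fixed, so when traffic multiplies by ten, the time window a lagging consumer has to catch up shrinks by the same factor. A small illustrative calculation, with made-up rates:

```python
# How long does data survive under size-based retention? The window shrinks
# in proportion to traffic. All numbers below are illustrative.

def retention_window_hours(retention_bytes: float, ingress_bytes_per_s: float) -> float:
    """Hours of data that fit under the retention.bytes cap."""
    return retention_bytes / ingress_bytes_per_s / 3600

cap = 1_000_000_000_000                              # 1 TB retention.bytes cap
normal = retention_window_hours(cap, 10_000_000)     # 10 MB/s of traffic
spike = retention_window_hours(cap, 100_000_000)     # the 10x traffic spike

print(round(normal, 1), round(spike, 1))  # ~27.8 hours vs ~2.8 hours
```

A consumer that tolerated a day of downtime before the spike now loses data after less than three hours, which is exactly why time-based retention is the safer default for consumers.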
36:45
However, if again
36:47
nothing is simple in Kafka, if you
36:50
configure retention by time then
36:52
you might get to
36:54
100% storage. So
36:57
again this equilibrium between retention
37:00
size, retention time and
37:03
storage. Again we reach storage.
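The other side of that equilibrium — how much raw disk a time-based retention target costs — is also simple arithmetic, folding in the replication factor and the RAID 10 doubling mentioned earlier. The traffic and retention figures below are made up for illustration:

```python
# Raw disk needed for time-based retention, as a sketch. Illustrative numbers.

def raw_disk_tb(ingress_mb_per_s: float, retention_hours: float,
                replication_factor: int, raid10: bool) -> float:
    """Total raw TB across the cluster for a time-based retention target."""
    tb = ingress_mb_per_s * 3600 * retention_hours * replication_factor / 1e6
    # RAID 10 mirrors every disk, so each Kafka replica is stored twice.
    return tb * (2 if raid10 else 1)

jbod = raw_disk_tb(20, 168, 3, raid10=False)   # 20 MB/s, one week, RF 3, JBOD
raid = raw_disk_tb(20, 168, 3, raid10=True)    # same cluster on RAID 10

print(round(jbod, 1), round(raid, 1))
```

This is the arithmetic behind the 17-broker story earlier: with RAID 10 every segment effectively exists six times at replication factor three, so switching to JBOD halves the raw disk bill for the same retention.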
37:07
Hardware failures: monitor hardware failures.
37:09
Now on-prem you
37:11
can just check the status of
37:13
the disks. I repeat again, disk,
37:16
disk, disk, this is the
37:18
cheapest part of Kafka
37:21
but the most critical part of Kafka
37:23
and the cause for
37:25
many, many problems. So
37:27
on-prem there are tools with which
37:30
you can monitor the state of the disks,
37:34
and here is
37:36
a recommendation that can save
37:39
several clusters, I assume. If
37:42
a disk becomes read-only, so
37:45
you can just read and not write, usually
37:48
it means you need to replace the disk because
37:52
even running FSCK on the disk
37:55
will help maybe for several minutes, hours or
37:57
days, but after that it will become
37:59
read-only again,
38:01
and producers cannot write to it, and it
38:03
will create a whole mess. So
38:06
monitor the disks. On-prem
38:08
you have the tools to
38:10
check the disk itself,
38:13
and on the cloud, you can check
38:18
the disk utilization. So the
38:20
iostat tool, the Linux tool
38:22
called iostat; iostat -x,
38:25
printed every
38:27
one second, can show you
38:29
that if the disk utilization is 100% all the
38:31
time, then
38:33
something is wrong with your disk. Another
38:36
metric is, well, there are lots
38:38
of metrics, but another one that
38:41
happens to be around every production
38:44
issue is the system time, the CPU system
38:46
time. There are four main
38:48
CPU metrics. You have the user time,
38:50
system time, iowait
38:53
time, which is wait
38:55
time for disk or wait time for network, and
38:57
context switches. If the system time goes,
39:00
let's say, above 10%, then
39:03
you should suspect something is not good
39:05
in your cluster. If your
39:07
context switches time reaches,
39:10
let's say, more than three or four percent, then
39:14
check what happens there.
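Those CPU signals can be derived from two snapshots of the per-state counters in /proc/stat — the same numbers that top and mpstat report. A minimal sketch, with fabricated sample counters chosen so that the system-time rule of thumb fires:

```python
# Derive user / system / iowait percentages from two /proc/stat-style
# "cpu" counter snapshots (jiffies). The sample numbers are made up.

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")

def cpu_percentages(before: tuple, after: tuple) -> dict:
    """Percent of elapsed time spent in each CPU state between two snapshots."""
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    return {name: round(100 * d / total, 1) for name, d in zip(FIELDS, delta)}

# Two fabricated snapshots, nominally one second apart.
t0 = (1000, 0, 300, 8000, 100, 10, 20)
t1 = (1030, 0, 345, 8010, 120, 11, 22)
pct = cpu_percentages(t0, t1)
print(pct)
# With system time well above ~10%, the rule of thumb from the conversation
# says: go investigate this broker.
```

On a real host you would read the first line of /proc/stat twice instead of using canned tuples; the thresholds remain the conversational rules of thumb, not hard limits.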
39:18
A common cause, by the way,
39:20
to a high context
39:22
switch time percentage
39:25
could be that you just
39:28
have too many disk operations compared to
39:30
the number of disks that you have. And
39:32
if you want to be
39:34
very cautious, you
39:40
can back up regularly
39:42
your data, either by
39:44
using the tiered storage feature
39:46
of Kafka, which I admit
39:48
I didn't use until now,
39:51
or you can just add another
39:54
consumer and just take
39:57
the data from the topic and persist it to
39:59
some... cheaper storage
40:01
like HDFS or
40:04
S3 where the storage is
40:06
separate from the compute. So
40:09
this is regarding some ways
40:11
to prevent data loss, but
40:14
these
40:17
are what happens
40:19
after you already have a problem.
40:23
But the question is how you prevent
40:25
the problem, and from
40:27
my experience with Kafka in
40:29
production the last six years
40:32
is that Kafka clusters talk before,
40:35
they hint you before they break
40:37
down. They tell you stuff, and they
40:40
tell you this through the disk utilization,
40:43
which you can
40:45
see using the iostat tool; they tell you this
40:47
using the top command: check the
40:49
system time, the CPU system time,
40:52
check the CPU context switches, the CPU
40:54
wait time. If you
40:56
have GC issues in your Kafka
40:59
process then it will tell you
41:01
this through spikes
41:03
in the user time during a full GC.
41:05
You can use the jstat tool,
41:08
which is part of the JDK tools, in order
41:10
to check for full
41:12
GCs, the frequency of full GCs. Again,
41:16
this is why half
41:18
the book, almost half the chapters,
41:20
talks about
41:22
various cases that
41:24
I stumbled upon regarding
41:28
what can lead you to data
41:31
loss I had a
41:33
colleague asking me I
41:37
gave him a copy of the book
41:39
and he asked me okay well but
41:41
you didn't talk about the issue of
41:43
under-replicated partitions, and
41:45
I replied to him well every
41:49
problem in Kafka can result
41:51
in under-replicated partitions or
41:54
in partitions that
41:56
don't appear in the list of
41:59
in-sync replicas, the ISR. So,
42:03
and there are tens of problems which
42:05
might not even be on the Kafka
42:07
itself. Like it could be in the
42:10
operating system because something is like
42:14
sometimes it's even not in the Kafka again.
42:16
Like if
42:18
you deploy another service, if you
42:20
deploy an antivirus
42:26
or a firewall, and
42:26
they scan the segments
42:28
all the time and you will have
42:30
high disk utilization, this is not the fault
42:32
of Kafka, but
42:34
it will cripple your Kafka. So,
42:38
and there are some on-prem
42:40
clusters where you shut down the
42:42
firewall and then it comes up again
42:44
because of some policy. So,
42:47
you need to remember that you don't always
42:49
run alone on the
42:51
broker. But asking
42:53
what can cause data
42:56
loss is
43:00
like asking what should I monitor in
43:02
Kafka and how to deal with
43:04
it, and that's the
43:06
topic of the book in
43:08
general. And given your experience
43:11
both working as a back-end
43:13
engineer and operating Kafka clusters
43:15
at scale, I'm wondering
43:17
if you were to be in
43:19
the room today redesigning Kafka from
43:21
the ground up, what are some
43:23
of the aspects of the system
43:25
design that you might choose to
43:28
revisit or revitalize? I
43:34
think the design
43:37
makes a lot of sense, the
43:39
log-based approach. I
43:43
don't have many changes, I
43:46
don't think of any change other than
43:48
one change which is for me it's
43:50
a bug, I don't know if they declare it as a feature
43:54
that they
43:56
spread: if you have more than one disk,
43:58
then the partitions are being
44:00
spread among
44:03
the disks by the number of partitions
44:05
per disk, and not by the
44:08
amount of storage per disk, which
44:11
I don't think that the Kafka
44:13
community understands how it affects
44:16
some Kafka cluster owners to
44:18
decide whether they go RAID or JBOD.
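The imbalance can be illustrated with a small sketch; the partition sizes are hypothetical, and the count-based strategy is only a rough stand-in for Kafka's actual assignment:

```python
# Sketch: count-based vs size-aware partition placement on JBOD disks.
# Sizes (GB) and the round-robin stand-in are illustrative assumptions.

def assign_by_count(partition_sizes, n_disks):
    """Round-robin by partition count, ignoring partition size."""
    disks = [0] * n_disks
    for i, size in enumerate(partition_sizes):
        disks[i % n_disks] += size
    return disks

def assign_by_size(partition_sizes, n_disks):
    """Greedy size-aware placement: each partition goes to the
    currently emptiest disk."""
    disks = [0] * n_disks
    for size in sorted(partition_sizes, reverse=True):
        disks[disks.index(min(disks))] += size
    return disks

sizes = [100, 5, 100, 5]  # two hot partitions, two small ones
print(assign_by_count(sizes, 2))  # → [200, 10]: both hot partitions on one disk
print(assign_by_size(sizes, 2))   # → [105, 105]: balanced by storage
```

A cluster owner sizing disks for the count-based worst case ends up buying headroom on every disk, which is the cost being referred to here.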
44:22
It's something that, if it were
44:25
fixed, the amount of cost that
44:27
would be saved on
44:29
disks, and maybe on clusters, would be pretty
44:32
big because
44:35
there are other
44:37
teams that say, okay, I don't want to
44:39
go JBOD because of this. I
44:42
don't want to handle the spread of the data among
44:44
the disks. But other
44:47
than this feature slash
44:49
bug of spreading the data
44:51
according to
44:53
the number of partitions and not
44:55
the amount of storage, I
44:59
must admit that I'm not
45:02
in the applicative side
45:05
anymore for a long time because to be
45:08
honest, it interests me much less
45:11
than the ops side. So
45:13
my focus is on the ops side and
45:17
then it comes down
45:20
to like, this question comes
45:22
down to like, it
45:24
could be asked regarding every big
45:27
data cluster, but in Kafka,
45:31
it's really hard to
45:33
detect, to understand the root cause of
45:35
production issues. To be honest,
45:37
I don't know why fixing
45:40
a problem in Druid or Spark
45:42
or Spark Streaming takes much
45:45
less time than understanding why
45:49
data was
45:52
lost on Kafka or
45:54
why the disk utilization is so high. I
45:57
don't fully understand it. Like
46:00
if I take the amount of time that I
46:02
invested in every Kafka production issue
46:04
compared to other clusters, it would be, I'm
46:06
not kidding, like 10 times, between
46:09
five times and 10 times more. So
46:11
I think that, and
46:13
this brings me back to the motivation of writing
46:15
the book, that Kafka
46:17
uses the operating system in a
46:20
way that no other
46:22
open source that I know of
46:24
uses it, especially,
46:26
not the CPU by the
46:28
way, it's really like on the CPU,
46:30
it's like very simple, user time
46:32
and that's it, if you do it correct.
46:36
But the usage of the page cache, the thrashing
46:38
of the page cache in cases where you have
46:40
lags, and the
46:42
amount of stress that
46:46
comes down on the disks is
46:48
something that I don't
46:50
think that those
46:53
who created Kafka, or those
46:55
who develop it or
46:58
contribute to it, fully appreciate. I
47:00
think there's some split between those
47:03
who develop and those who maintain
47:06
and there's not enough
47:09
connection between the two
47:11
sides. And
47:13
maybe this is another one, I think of it,
47:17
I'm thinking out loud now, this
47:19
is another reason why the book
47:21
is important both for developers and
47:23
for ops team in order to
47:25
not only understand Kafka but also
47:27
understand for developers to understand the
47:29
ops team, because
47:32
there are so many problems that
47:34
are being caused by one
47:37
disk that goes wrong, one
47:40
disk, you
47:42
can have a cluster with tens of
47:44
disks, and one disk goes bad, and
47:47
this can cause the whole
47:49
cluster to halt and
47:52
a healthy cluster
47:54
should not suffer from such
47:56
an issue. I remember getting a
47:58
call about five years ago
48:01
from a customer site that had
48:03
three brokers, six disks each,
48:05
HDD disks, configured
48:08
in RAID 10, and
48:11
one disk went bad. So you
48:13
see, but in RAID, you
48:16
see only one logical disk. And
48:19
the utilization in iostat
48:21
shows the
48:24
highest utilization among all the disks in the RAID, and
48:26
then you see 100%. So
48:30
I had to guess that
48:33
it was one machine, one disk
48:35
that got screwed, and
48:38
I told the support engineer, go to the room
48:40
and check if
48:42
the light blinks on
48:45
one of the disks. And he told me, yes, the
48:47
light blinks on one of them. But
48:51
imagine what you need to do. I
48:53
needed to resort to this
48:55
because of the lack
48:57
of monitoring. So when
49:00
I think of it,
49:02
something needs to change,
49:06
though not in the work on
49:08
the disk, because this is what Kafka does. It writes
49:11
massive amounts of data to the disk and
49:13
it reads, it tries not to
49:15
read from the disk, but at
49:17
many customer sites or cluster
49:21
it reads from disk and
49:23
first of all, often it reads from
49:25
disk. Secondly,
49:29
many clusters have a RAID deployment.
49:31
So how can you
49:34
assist them in understanding that one
49:36
disk got screwed? They
49:39
don't have a way. And in the
49:41
cloud also, by the way, when
49:43
one disk goes bad, the
49:48
cloud provider won't tell you that, because they
49:51
can't really tell you, okay, this disk
49:53
is on 80% utilization for 30 minutes.
49:59
We don't know if it's good or not, so we will
50:01
not tell you everything. And then
50:03
you get into high iowait and you replace
50:05
the broker. So the
50:08
mitigation for these issues in
50:10
Kafka is
50:12
brute
50:15
force, and in figuring
50:17
out that you have this issue you
50:20
need to be a magician in order to
50:23
know this and
50:26
and developers of Kafka just
50:28
assume that okay someone
50:31
will handle it, it's not us, we just develop.
50:34
and I think for me
50:36
coming, I did a career
50:38
shift, which is not common, like going
50:40
from developer to ops so
50:43
I understand the frustration from both
50:45
sides and
50:47
I think that there should be more people that
50:50
know both paradigms
50:54
and because a lot of production
50:56
issues originate
50:58
from the lack of understanding, the
51:01
lack of cooperation or knowledge
51:03
sharing, between the two.
51:05
If the Kafka community had
51:07
better communication between the developers and
51:09
the ops team then I
51:13
think it would be much easier to detect
51:15
the disk issues, which, I
51:22
bet, cause at
51:24
least a third of the problems in
51:24
Kafka yeah it's definitely
51:27
always a challenge balancing the developer
51:29
of I just want to get
51:31
something shipped and do something cool
51:33
with some fancy new feature and
51:35
the operations team of I just
51:37
want you to stop crashing my
51:39
machine so that I can sleep
51:41
at night yeah yes but
51:43
in
51:46
the Kafka community, I
51:48
think there is a lack of ops teams that will
51:50
check these
51:53
developments.
51:56
And it's not even about your own development.
51:59
I'm asking the
52:01
Kafka community, like, go to deployments
52:04
of Kafka and
52:07
check, ask customers what's
52:09
the percentage of production issues
52:12
which is caused by disks? And
52:15
then try to, I
52:17
don't know, maybe every Kafka
52:19
tool needs a better monitoring of
52:22
disks. And
52:24
also if there is monitoring, how do
52:27
you read it? Like, Brendan
52:29
Gregg has an excellent explanation of how
52:31
to read iostat, the
52:33
output of iostat. So you
52:35
have the utilization, like the service time
52:38
is obsolete, no one needs
52:40
to look at it. So besides utilization, you
52:42
have throughput, which is the read megabyte per
52:44
sec and write megabyte per sec and you
52:46
have IOPS, which is the read per sec
52:48
and write per sec. And
52:54
there is the fact that not many people
52:57
know: the saturation point of disks is 60%,
52:59
only 60%, which
53:02
means that with every increase in
53:05
disk utilization beyond that,
53:07
the situation,
53:10
the level of iowait, will become worse
53:12
and worse and worse. And
53:14
it's not like in CPU, in CPU it could be,
53:18
like the recommendation on cloud is like 75%
53:20
CPU saturation. Above
53:23
that your load average increases
53:26
in a nonlinear way. So
53:30
how should a Kafka
53:32
user
53:35
read the output of iostat? Well, it's
53:37
simple, look at the disk utilization.
53:40
I saw the value
53:42
iostat was telling me, okay, we reached
53:44
100%. Is this not good? No, this is
53:46
okay, because Kafka works in bursts.
53:49
So it writes a lot of data
53:51
for a small amount of time. So
53:53
we'll have high disk utilization caused by
53:56
write megabytes per sec and
53:58
write per sec. This
54:00
is good. And then you should see zero. And
54:03
then again, you have a burst of writes. And
54:07
but if
54:09
you have 100% utilization because of reads, then
54:12
it means that you read from the disk, which
54:15
means that you have a problem somewhere.
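This read-versus-write heuristic can be sketched as follows; the sample format and thresholds are illustrative assumptions, not something iostat produces directly:

```python
# Sketch of the iostat heuristic described above.
# Each sample is (utilization %, read MB/s, write MB/s), one per second.
# Thresholds are illustrative assumptions, not recommendations.

def classify(samples, sustained_secs=5):
    """Return 'suspect-lag' if utilization is pegged near 100% for
    several seconds and reads dominate (consumers or replicas reading
    from disk); 'write-burst' if writes dominate; 'ok' otherwise."""
    pegged = [s for s in samples[-sustained_secs:] if s[0] >= 95]
    if len(pegged) < sustained_secs:
        return "ok"
    reads = sum(s[1] for s in pegged)
    writes = sum(s[2] for s in pegged)
    return "suspect-lag" if reads > writes else "write-burst"

# Normal Kafka: bursts of writes, the page cache serves the reads.
print(classify([(100, 0, 400)] * 5))   # → write-burst
# Lagging consumer/replica: sustained reads hitting the disk.
print(classify([(100, 350, 40)] * 5))  # → suspect-lag
print(classify([(100, 0, 400), (10, 0, 5)] * 3))  # → ok (not sustained)
```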
54:17
Maybe you have a consumer lag, maybe
54:19
a replica, a broker
54:22
that replicates the data,
54:24
is lagging behind,
54:27
which means
54:30
this broker is not
54:32
in the ISR list of this partition. So
54:35
just look at the output of the
54:37
iostat -x command, which
54:39
prints every one second. And if
54:41
you have several seconds of
54:43
100% utilization from
54:46
writes, then this is OK. But if you
54:48
have the same 100% utilization
54:52
from reads, then this is not
54:54
OK. Now, if the community
54:56
will take this and make it an
54:59
alert or some monitoring tool,
55:02
and
55:04
ops teams will know how to read this, then
55:07
it will
55:10
reduce the frequency of production issues.
55:14
It is more than 10 percent for sure.
55:16
Just this one feature:
55:18
detect lags
55:22
by looking at iostat. And
55:26
in your experience of working
55:29
with Kafka clusters and helping
55:31
customers and end users manage
55:33
and ensure their uptime, what
55:35
are some of the most
55:37
interesting or unexpected or challenging
55:39
production problems that you've had
55:42
the opportunity to diagnose?
55:45
OK.
55:50
I will tell the most bizarre one,
55:52
the
55:55
most bizarre production issue, and the
55:57
most interesting one. So the most
55:59
bizarre one I already
56:03
mentioned before: it was a
56:05
cluster of three brokers, 18 disks, 6
56:07
per broker, configured in
56:11
RAID 10, where one faulty disk crippled
56:13
the cluster, because it
56:16
was on RAID 10. So
56:24
in fact, two disks
56:26
were down, and the disk utilization was
56:29
already pretty high. So this
56:31
problem just added to the party. And
56:35
then I had to guess that one disk
56:37
was faulty. And
56:39
the support engineer really saw that one disk
56:42
was faulty after he went to the server
56:44
room. And of course,
56:46
the other one, its twin disk, because
56:48
it's RAID 10, also
56:51
didn't function. So,
56:54
on top of that, the disk utilization
56:56
was already very high. So it
56:58
brought the disk utilization to 100% on
57:00
the broker. So a third
57:03
of the partition leaders were
57:05
on that broker, and then producers
57:08
stopped, couldn't write to them.
57:10
Consumers didn't read from them. And
57:13
once consumers don't read
57:15
even from one partition, then they
57:18
just cannot function. It
57:20
depends, of course, on the nature of your consumers. But
57:22
this was a streaming
57:25
application that had to read from all
57:27
partitions. So it got stuck. Even
57:30
if it weren't an application that needs to
57:32
read all partitions, then you would have
57:34
partial data. So that
57:38
was a very
57:38
bizarre problem that combined
57:41
on-prem and high disk utilization and
57:43
guessing that one disk is faulty.
57:45
The most interesting problem
57:49
that I ran into
57:52
took several
57:55
weeks, I think, and it
57:58
involved a
58:01
situation where, from a certain point
58:03
in time, every broker that was
58:05
added to the cluster, due to
58:08
some failure of
58:10
another broker,
58:12
got
58:16
at some point to 100% disk utilization, and
58:23
nobody managed to write or read
58:25
from it. Nobody managed to write,
58:27
so there was nothing to consume.
58:31
Every time we replaced this broker
58:33
with another broker and again the
58:35
same phenomenon happened.
58:40
And then, looking at the iostat output
58:43
after a long, long time of trying
58:46
to understand what was going on here, it was
58:49
like voodoo, we
58:51
noticed that at
58:55
some brokers the
58:57
disks reached 100% utilization
59:00
and stayed there
59:03
all the time, at 100% utilization.
59:07
But
59:10
when we looked at the throughput, we
59:12
saw that it was half the
59:15
throughput
59:17
that brings other disks on other
59:19
brokers to 100% utilization. So
59:22
not only did the healthy brokers reach
59:24
100% utilization just for a
59:27
few seconds, after which the
59:29
utilization went down, but
59:31
the brokers with the
59:34
faulty disks reached 100%
59:36
disk utilization and kept 100%
59:39
utilization while the throughput was
59:42
half. So just
59:45
correlating the disk utilization with
59:47
the throughput and the
59:50
amount of time the disk utilization was at
59:52
100% let
59:55
us understand that these were just
59:58
faulty disks.
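The correlation that finally exposed the disks can be sketched like this; the utilization and throughput figures are hypothetical, and the ratio threshold is an illustrative assumption:

```python
# Sketch of the correlation described above: a disk pegged at ~100%
# utilization while moving roughly half the throughput of its peers
# is suspect. All figures below are hypothetical.

def suspect_disks(stats, util_floor=95, ratio=0.6):
    """stats: {disk: (avg_util %, avg_MBps)}. Flag disks pegged at high
    utilization whose throughput is well below the fleet median."""
    mbps = sorted(v[1] for v in stats.values())
    median = mbps[len(mbps) // 2]
    return [
        d for d, (util, tput) in stats.items()
        if util >= util_floor and tput < ratio * median
    ]

stats = {
    "sda": (100, 180), "sdb": (98, 175),   # busy but healthy
    "sdc": (100, 90),                      # pegged at half the throughput
    "sdd": (40, 60),
}
print(suspect_disks(stats))  # → ['sdc']
```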
1:00:00
This took a lot of Excel sheets and
1:00:02
experience, just checking, trying
1:00:04
to correlate every
1:00:07
small asymmetry, until
1:00:09
we found it out. And
1:00:11
that was by far the most
1:00:13
interesting production issue
1:00:15
that I stumbled into in
1:00:18
Kafka. And in
1:00:20
your work of writing the book
1:00:23
and consolidating all of the information
1:00:25
and experience that you've had working
1:00:27
with Kafka, I'm wondering if there
1:00:30
are any insights that that helped
1:00:32
you gain or any new knowledge that you
1:00:34
were able to obtain in the process. Ah,
1:00:38
of course. Yeah, mainly
1:00:43
two
1:00:46
things. It
1:00:49
helped me formulate
1:00:51
the three legs that
1:00:53
Kafka stands on, that a Kafka user needs
1:00:55
to understand, let's say, which are the
1:00:57
data part, the OS part and the
1:00:59
Kafka part. I was
1:01:01
surprised to see that the Kafka part
1:01:04
is only third of the book,
1:01:07
which shows how much the data part is
1:01:09
important. How
1:01:12
much
1:01:14
the way that the data is
1:01:16
spread among the partitions
1:01:19
matters for the
1:01:21
health of the cluster. And
1:01:24
also, I was
1:01:26
surprised by how many
1:01:29
production issues originate from
1:01:31
a problem with
1:01:33
storage. And
1:01:37
also I found
1:01:40
out several producer and
1:01:42
consumer metrics that were
1:01:44
new to me, because
1:01:47
I thought
1:01:50
that many issues can be
1:01:52
fixed with tuning the linger.ms
1:01:55
and the batch size on the producer, and I
1:01:57
found out several very
1:01:59
important metrics in the consumer and producer.
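The intuition behind tuning linger.ms and batch.size can be sketched numerically; all figures below are hypothetical, for intuition only. Lingering trades a little latency for far fewer, larger produce requests:

```python
# Sketch: how linger.ms and batch.size trade latency for request count.
# All figures are hypothetical illustrations.

def produce_requests_per_sec(msgs_per_sec, msg_bytes, linger_ms, batch_bytes):
    """A batch ships when it fills up or the linger timer fires,
    whichever comes first. Returns approximate requests/sec."""
    bytes_per_linger = msgs_per_sec * msg_bytes * (linger_ms / 1000)
    if bytes_per_linger >= batch_bytes:
        # Batches fill before the timer fires: throughput-bound.
        return msgs_per_sec * msg_bytes / batch_bytes
    # Timer fires first: roughly one request per linger interval.
    return 1000 / linger_ms

# 10k msgs/s of 1 KB each, 16 KB batches:
print(produce_requests_per_sec(10_000, 1_000, 0.1, 16_384))  # ~10,000 requests/s
print(produce_requests_per_sec(10_000, 1_000, 5, 16_384))    # ~610 requests/s
```

The same message volume, but a 5 ms linger cuts the broker-facing request rate by more than an order of magnitude in this toy model.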
1:02:02
So this was also new to
1:02:05
me. I ran across them
1:02:07
during the time, like
1:02:10
from production issues that I dealt
1:02:12
with during the time that
1:02:14
I wrote the book. And
1:02:17
I must
1:02:19
say that a lot of things written in the book
1:02:22
were not things that only
1:02:24
I discovered. So I
1:02:26
worked with several people that
1:02:28
we worked together and
1:02:33
found the issues together. So in
1:02:36
part it was just documenting what
1:02:39
team of ops people
1:02:41
found out, including
1:02:43
me but also other
1:02:46
people. I
1:02:48
have like, in
1:02:51
order to make this more specific, I
1:02:54
have for example the issue
1:02:57
of the storage, just to emphasize,
1:03:01
when I wrote the part about storage usage,
1:03:03
it turned out that it's
1:03:07
more vast than what
1:03:09
I thought. So if
1:03:11
we just take this issue,
1:03:14
the storage usage issue, so
1:03:17
for example, running out of
1:03:19
disk space, due
1:03:23
to retention configuration, what
1:03:25
happens when you configure
1:03:27
both time-based and size-based retention,
1:03:31
there is
1:03:33
a way to lose data. And
1:03:35
I was surprised to see
1:03:37
how two simple configurations like this
1:03:39
can cause data loss.
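A hedged sketch of that interplay (the ingest rate is hypothetical): Kafka deletes segments when either limit is exceeded, so a size cap can silently undercut the intended time window.

```python
# Sketch: when retention.bytes undercuts retention.ms.
# Kafka deletes a segment when EITHER limit is exceeded, so the
# effective time window can be much shorter than retention.ms.
# The throughput figure below is hypothetical.

def effective_retention_hours(retention_ms, retention_bytes, mb_per_sec):
    """Hours of history actually kept per partition, approximately."""
    time_hours = retention_ms / 3_600_000
    size_hours = retention_bytes / (mb_per_sec * 1024 * 1024) / 3600
    return min(time_hours, size_hours)

# Intended: 7 days. Configured size cap: 50 GB. Ingest: 5 MB/s.
hours = effective_retention_hours(
    retention_ms=7 * 24 * 3_600_000,
    retention_bytes=50 * 1024**3,
    mb_per_sec=5,
)
print(round(hours, 1))  # ~2.8 hours of history, not 168
```

A consumer that lags, or a replay, that needs more than the shorter of the two windows finds the data already deleted.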
1:03:42
And so retention policy and
1:03:44
its effect on data loss;
1:03:48
explaining how to add storage to
1:03:50
a cluster, how it differs between on-prem
1:03:52
and on the cloud, the
1:03:54
fact that when you're
1:03:57
on-prem, this is not only a technical
1:03:59
decision, it's a managerial or
1:04:01
financial decision, because you can't tell the
1:04:03
owner of the data center, okay, I
1:04:05
made a mistake, I don't need the
1:04:07
two-terabyte disks, I need four-terabyte
1:04:09
disks, so I need to throw
1:04:11
away all the two-terabyte disks and buy four-terabyte disks.
1:04:15
So it will not go smoothly. So
1:04:18
understanding these aspects, like
1:04:21
this chapter became partially technical
1:04:23
and partially about how,
1:04:25
as a provider for
1:04:28
an on-prem customer,
1:04:30
you manage this issue. Again,
1:04:34
with DIMMs, it's the same issue.
1:04:36
Okay, it was a mistake, I
1:04:38
bought 16-gigabyte DIMMs and I need 32
1:04:41
or 64. How
1:04:43
do you pass this decision? How do you
1:04:45
mix DIMMs on-prem? And
1:04:49
also the effect of
1:04:51
the retention on data replays. Sometimes
1:04:53
you need to replay the data
1:04:55
because you did some wrong transformations.
1:04:58
So, what's the effect? Like, customers and
1:05:01
ops teams need to understand that
1:05:04
they need storage also for
1:05:07
replay. The
1:05:09
data skew, how data skew can affect data
1:05:11
loss. Even if you have a lot of
1:05:13
storage, if you don't partition the
1:05:16
data correctly, then you will
1:05:18
get data loss at some point, even in one
1:05:20
partition. And for certain
1:05:22
consumers, this is like data loss in all
1:05:24
partitions. You need to replay the data again.
1:05:27
So the data aspects were also something
1:05:29
that I learned along the way.
1:05:34
So this is only an example of
1:05:37
one chapter, the issues discussed in the
1:05:39
chapter on storage usage. But
1:05:42
it's not only that I learned while
1:05:44
writing the book. I
1:05:46
think that if I hadn't written the
1:05:49
book, I would have forgotten almost everything. So
1:05:53
for me, the
1:05:55
personal benefit for me
1:05:58
is that I
1:06:00
remember stuff, really
1:06:03
Kafka-related stuff that I knew and
1:06:06
didn't forget, but also stuff that I
1:06:08
learned along the way. So
1:06:12
I think that investing 10 months
1:06:16
of weekends, every
1:06:19
weekend over that period, to
1:06:21
write the book was beneficial
1:06:25
for my technical knowledge. Absolutely.
1:06:29
Are there any other aspects of the
1:06:31
work that you've done with Kafka, your
1:06:33
work on the book, the overall Kafka
1:06:36
operations ecosystem that we didn't discuss yet that you'd
1:06:38
like to cover before we close out the show?
1:06:42
I think we covered a lot
1:06:45
of technical stuff. We
1:06:47
can discuss the cost
1:06:49
reduction, but
1:06:52
in very short, like I would like to mention
1:06:54
that there
1:06:57
is a chapter on cost reduction
1:06:59
in Kafka. It
1:07:01
relates to... like, I brought
1:07:04
six examples, six real-world
1:07:06
examples of clusters that
1:07:08
I stumbled upon, where for
1:07:12
each example I
1:07:15
specify how much
1:07:17
CPU, RAM, and disk each cluster
1:07:19
has and
1:07:22
also the usage of each of
1:07:24
these resources. And then
1:07:26
I ask whether the cluster can be scaled
1:07:28
down or scaled in. And
1:07:31
then I discuss other metrics,
1:07:35
monitoring metrics and
1:07:37
by correlating this monitoring Kafka
1:07:39
monitoring metrics and the operating
1:07:42
system metrics usage, I
1:07:45
give recommendation regarding whether you
1:07:47
can scale in or scale
1:07:49
down the cluster. And
1:07:52
I think this
1:07:55
is like the
1:07:57
cost of the Kafka cluster. It's
1:08:00
not big, I think, compared to
1:08:02
other clusters in an organization. But
1:08:05
because for cloud-based,
1:08:10
I assume that most of the
1:08:12
deployments are on demand. So
1:08:16
even if you have reservation, again, it's on
1:08:18
demand. It's not spot. So it's important, especially
1:08:21
in today's market, to
1:08:24
squeeze out every penny that you
1:08:26
can save. So
1:08:30
the cost reduction part is
1:08:32
something that can help
1:08:35
to reduce costs on
1:08:37
Kafka. But
1:08:40
there is a part that I didn't talk
1:08:42
about, which might be a
1:08:45
bigger part, even than the machines
1:08:47
themselves, which is
1:08:49
the data transfer between consumers
1:08:51
and the brokers. Because
1:08:54
if there is no rack
1:08:56
awareness in the cluster, then
1:08:59
consumers will read data only
1:09:02
from leaders. And these
1:09:04
leaders can be... Most
1:09:08
of the leaders statistically won't be in
1:09:10
the same AZ. And
1:09:15
for some companies, this can save hundreds
1:09:18
of thousands of dollars per year
1:09:20
configuring rack awareness. But
1:09:22
since I don't have any experience with rack
1:09:25
awareness, I didn't
1:09:27
discuss it thoroughly. But
1:09:31
for those who
1:09:34
listen, checking if you can
1:09:36
configure rack awareness in
1:09:38
your consumers and
1:09:42
brokers, this can
1:09:44
be beneficial. You need to check your
1:09:46
data transfer cost. And
1:09:49
maybe it's worthwhile for you to
1:09:51
invest in deploying
1:09:55
and testing and
1:09:57
validating the rack awareness.
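For listeners who want to try it, the setup (KIP-392, "fetch from closest replica", available since Kafka 2.4) looks roughly like this; the config keys are real Kafka settings, while the rack names, servers, and group id are hypothetical:

```python
# Sketch of a rack-awareness setup (KIP-392). The config keys are real
# Kafka settings; AZ names and addresses are hypothetical.

# Broker side (server.properties), one rack/AZ label per broker:
broker_config = {
    "broker.rack": "us-east-1a",
    "replica.selector.class":
        "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}

# Consumer side: declaring the client's rack lets the broker route
# fetches to an in-AZ replica instead of a cross-AZ leader.
consumer_config = {
    "bootstrap.servers": "broker1:9092",  # hypothetical
    "group.id": "analytics-consumers",    # hypothetical
    "client.rack": "us-east-1a",
}

# A consumer whose rack matches a replica's rack can read locally.
print(consumer_config["client.rack"] == broker_config["broker.rack"])
```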
1:10:00
then you will read not from the leaders,
1:10:02
but from the closest replica and
1:10:05
save on data transfer cost. Yeah,
1:10:08
that can definitely be a substantial cost when running
1:10:10
in the cloud and that's always one of those
1:10:13
surprise gotchas when you're first getting up
1:10:15
and running in a cloud environment. Not
1:10:20
only when you're getting started, but also
1:10:22
after
1:10:25
years, when you see the
1:10:27
data transfer. This brings
1:10:29
us back to what we started with
1:10:31
regarding cloud versus on-prem. There
1:10:34
are a number of reasons
1:10:36
why,
1:10:39
I don't know,
1:10:41
for some clusters it
1:10:43
might be a good idea to check
1:10:45
on-prem. I'm
1:10:49
saying this because I came from on-prem, so
1:10:52
it's not foreign to me. And,
1:10:55
okay, so you don't have managed
1:10:57
services, but it
1:11:00
might be, from
1:11:02
the financial
1:11:05
perspective, a valid
1:11:07
choice for some clusters to
1:11:09
be deployed on-prem.
1:11:13
All right, well for anybody who wants to get
1:11:16
in touch with you and follow along with the
1:11:18
work that you're doing I'll have you add your
1:11:20
preferred contact information to the show notes and as
1:11:22
the final question I'd just like to get your
1:11:24
perspective on what you see as being the biggest
1:11:26
gap in the tooling or technology that's available for
1:11:29
data management today. So,
1:11:32
I usually work with analytics
1:11:37
clusters and
1:11:39
I think that if
1:11:44
there were a tool that
1:11:46
would show, at
1:11:48
any given point in time, the correlation between
1:11:50
the traffic, whether
1:11:53
it's the incoming traffic or
1:11:55
the query load...
1:11:58
So, correlating between the load
1:12:01
on the cluster and
1:12:04
the usage of the cluster
1:12:07
in terms of CPU,
1:12:10
RAM, disk, or
1:12:12
even internal usage. For example, let's
1:12:14
say a Druid cluster, where
1:12:18
sometimes the bottleneck is the number of workers,
1:12:21
or for Trino clusters, the number
1:12:23
of splits per query. So if
1:12:25
there was something that some tool that would
1:12:27
show correlation between the load on
1:12:30
the cluster and the real usage
1:12:32
and cost, it
1:12:35
would allow ops team to
1:12:39
better understand whether they can save
1:12:42
cost on the cluster, whether they can scale
1:12:44
it down or maybe
1:12:47
replace on-demands with spots or
1:12:49
maybe replace on-demand reservation with
1:12:51
on-demands without reservation and then
1:12:53
auto-scaling them. So something,
1:12:55
some tool, that will show
1:12:57
correlation between
1:13:00
applicative usage and resource
1:13:02
usage, and will enable saving
1:13:05
cost, because
1:13:07
especially in today's
1:13:10
economy, it
1:13:14
becomes pretty important to save cost.
1:13:17
All right, well, thank you very much
1:13:19
for taking the time today to join
1:13:21
me and share your experiences of running
1:13:24
and operating Kafka clusters and the work
1:13:26
that you've done on the book to
1:13:28
make that easier for everybody else to
1:13:30
do as well. It's definitely a very
1:13:32
challenging and necessary task. And
1:13:35
as you said, Kafka is very widely deployed.
1:13:37
So I appreciate the time and energy that
1:13:39
you put into sharing your hard-won knowledge with
1:13:41
everyone else. And I hope you enjoy the
1:13:43
rest of your day. Cool,
1:13:45
thank you very much. Again, thank you for hosting
1:13:48
me and
1:13:51
I hope that the audience will
1:13:53
gain something from this podcast.
1:13:57
Thank you. Thank
1:14:02
you for listening. Don't forget to check
1:14:05
out our other shows, Podcasts.init, which covers
1:14:07
the Python language, its community, and the
1:14:09
innovative ways it is being used, and
1:14:11
the Machine Learning Podcast, which helps you
1:14:13
go from idea to production with machine
1:14:16
learning. Visit the site at dataengineeringpodcasts.com, subscribe
1:14:18
to the show, sign up for the
1:14:20
mailing list and read the show notes.
1:14:23
And if you've learned something or tried out a product from the
1:14:25
show, then tell us about it. Email
1:14:27
hosts at dataengineeringpodcasts.com with your
1:14:30
story. And to help other people
1:14:32
find the show, please leave a review on Apple
1:14:34
Podcasts.