Troubleshooting Kafka In Production


Released Sunday, 24th December 2023

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.


2:00

primarily focus on the

2:02

stability, deployment, and cost reduction

2:05

of the big data analytics clusters

2:09

on AWS such as

2:12

Druid, Trino, Spark,

2:16

etc. And I am also

2:19

involved with Kafka

2:21

production issues and cost

2:24

reduction issues, but mainly in

2:26

production and stability issues on

2:29

Kafka. Prior to

2:31

that, I was in SRE for

2:34

a company called Cognite, where

2:38

I was in charge of the

2:41

stability of big

2:43

data clusters on prem,

2:46

for Linux on prem, mainly

2:48

for Spark streaming, Spark batch,

2:51

and Presto on HDFS

2:54

at various customer sites.

2:57

Prior to that, I was a Java

3:00

backend engineer for about

3:02

10 years. And

3:05

I recently published a book

3:07

called Kafka Troubleshooting in Production,

3:11

and parts of it will be

3:13

the main

3:15

topics of what we will discuss today. And

3:18

it talks about how to handle production

3:21

issues in Kafka clusters, both

3:24

on prem and on the

3:26

cloud on AWS. And

3:28

do you remember how you first got started working

3:30

in data and what it is about that space

3:32

that keeps you interested? So

3:34

I started working

3:37

on big data about

3:39

seven years ago, when I developed

3:43

my first Spark streaming

3:45

application, consuming from Kafka

3:48

and persisting to HDFS on

3:50

Linux on prem clusters. Before

3:52

that, I just wrote backend

3:55

applications that were reading

3:57

from some database and writing

4:00

into some databases, but it wasn't formally

4:02

big data. And

4:07

five years ago, I understood that on

4:09

big data clusters there are

4:13

production issues that I didn't see before,

4:17

and so

4:19

I reverted from being a developer,

4:22

to being an SRE

4:24

and solving stability

4:26

issues in these clusters, which

4:28

meant knowing better not my

4:30

code, but the

4:34

infrastructure, the cluster that my code,

4:37

or some code, some big

4:39

data code runs on, whether

4:41

these are Presto clusters, Spark

4:45

clusters, Spark batches, Spark

4:47

streaming, Kafka, HDFS,

4:52

and it was a whole new

4:55

world for me. And

4:58

that's what brought me specifically

5:03

to handle Kafka issues,

5:05

which I do for the last,

5:09

in production I've done it since 2018, but

5:13

I started doing it a year before. So

5:15

maybe the question is

5:18

more what brought me to be, not

5:20

to be a developer anymore, to

5:23

become an SRE, let's say

5:26

of big data clusters, and

5:28

specifically focusing on the cost reduction,

5:30

whether on-prem or on the cloud,

5:38

and focusing on

5:41

Kafka. So

5:43

I will elaborate on the Kafka

5:45

issues. So I understood that Kafka

5:47

is like stuck in the

5:49

middle of everything, everything. If there is

5:51

a problem with Kafka, your

5:54

whole data pipeline is just stuck.

5:57

Producers can't write, consumers can't

5:59

read. And that's what,

6:01

it was a combination

6:04

of two things. One,

6:06

the understanding that Kafka, the

6:09

best use of my time would be to focus

6:11

on Kafka. And the

6:13

second thing was that I

6:16

stumbled into a book called the

6:18

Systems Performance by Brendan Gregg, who

6:22

was the lead performance engineer at

6:24

Netflix. And now he

6:26

works under the Intel

6:28

CTO on

6:31

distributed clusters. And

6:34

this really opened up for me

6:36

a world that

6:38

I wasn't aware of before, of

6:41

monitoring and detecting bottlenecks in

6:44

Linux clusters. And

6:47

not only understanding what's

6:50

your bottleneck, what's the cluster bottleneck,

6:52

but also it opened the way

6:54

to reduce costs. And

6:56

what I saw when I stumbled

6:59

into production issues in Kafka,

7:02

Spark, Presto, Druid,

7:06

for this matter, any service, any

7:08

applicable service running on a Linux

7:11

cluster, is that

7:13

understanding how to diagnose issues

7:15

in a Linux cluster can

7:18

really help you detect what's

7:21

the problem in many cases. And

7:23

also allows you to

7:25

reduce costs on the cloud. In

7:29

terms of the Kafka focus, you

7:31

mentioned that you were

7:33

working as a backend developer, you

7:35

were interested in the use cases

7:37

for Kafka, the production requirements around

7:39

it, how to make sure that

7:41

it was stable and performant. And

7:44

I'm wondering if you can talk to, at

7:46

least at a high level, some of the

7:48

different environments in which you've had experience working

7:50

with Kafka and some of the categories

7:52

of operational challenge that you've had to

7:54

deal with in that process. Sure,

7:57

so I started with the on-prem.

10:01

of signals that can say

10:03

that something is going to get

10:05

wrong, something is wrong in the

10:07

cluster. So

10:09

as an SRE of on-prem, you

10:11

reach the problems once the cluster

10:13

halts sometimes. But

10:16

there are also, on the

10:18

cloud, the cloud provider will

10:20

tell you, okay, you don't need to handle

10:22

these failures. However,

10:29

even on the cloud, a broker can halt

10:32

because of disk deterioration,

10:34

and you wouldn't know of it

10:36

until you get into lags. But

10:39

you don't need to replace the disk,

10:41

you just can replace the broker, which

10:45

is much easier in

10:47

the cloud versus on-prem, where you need

10:49

to get

10:52

into the drawer of disks and

10:54

replace the disk. Another

10:56

big difference is the scaling. If

10:59

you have more traffic, you can

11:01

scale out. On the cloud, you can just scale out

11:03

or scale up by

11:06

spinning up a new cluster

11:09

pretty easily, while on-prem,

11:13

it's very, very tough. Because

11:15

think of it like if you want to scale up,

11:17

you need to have disks on

11:19

site. And sometimes you don't have these disks

11:21

on site. What happens if you shift

11:24

the cluster to another site, and your customer

11:26

is in another country? Another

11:29

issue is, like,

11:32

what happens when you need more RAM? On

11:35

the cloud, you just spin

11:38

a cluster with instances with more RAM.

11:42

But on-prem, you need to check

11:44

that you have enough slots in

11:46

order to insert DIMM

11:48

sticks, memory sticks, into

11:52

the cluster, into each machine. And

11:55

you need to do it manually. Scaling

12:01

is one of the two main differences: the scaling

12:05

option which is much easier on the

12:07

cloud and

12:09

handling failures, hardware

12:12

failures, which, when

12:16

you are the

12:18

SRE of on-prem, it's on

12:20

you or on the engineers

12:23

that manage the site, but

12:25

while in the cloud you just

12:28

detect the issue you need to detect

12:30

the issue and then replace the machine.

12:33

So these were the two. Like,

12:35

I came from on-prem not

12:38

only from Kafka but also

12:40

for Spark applications, and it's

12:43

pretty much the same issue between on-prem

12:45

and on the cloud, these two

12:47

issues. However, the

12:49

benefit of deploying on-prem is

12:53

the fact that you don't pay

12:56

for every hour, you just

12:59

pay once and

13:01

there is I think today there is a

13:03

growing discussion whether

13:06

a cloud-based company

13:10

maybe they need to go on-prem for some

13:12

of their clusters so

13:16

the math

13:18

behind this calculation sometimes

13:22

I think favors on-prem

13:24

even though you need

13:26

to handle this,

13:28

the scaling is very tough

13:30

and it's on

13:33

you to detect the hardware

13:35

issues but maybe the cost

13:37

sometimes justifies this. Yeah

13:40

the on-premise versus in cloud

13:42

debate is definitely ongoing and

13:44

always very nuanced and to

13:46

your point yeah where on-prem

13:50

the cost over time is much

13:52

lower because you own the hardware so you

13:54

don't have to pay a continuous upkeep for

13:56

it but there's the opportunity cost

13:58

of having to move slowly. and

14:00

more deliberately and perhaps

14:03

reducing the number of chances or experiments that

14:05

you take because of the fact that it

14:07

is so much lead time to bring in

14:09

that hardware and scale up the cluster. And

14:12

also on the point of Kafka

14:14

that's an interesting aspect as well

14:16

where my understanding

14:18

of the way that Kafka itself is

14:20

designed and some of the aspects of

14:24

having to define upfront the number of

14:26

topics in order to accommodate a certain

14:28

number of clients seems as though it

14:30

lends itself more readily to that fixed

14:32

installment on-prem environment versus the

14:34

cloud environment where you are

14:37

incentivized to elastically scale up

14:39

and down and I'm wondering

14:41

what you see as some of the challenges

14:44

of bringing Kafka into the cloud because

14:46

of that potential for elasticity in the

14:48

clusters. I

14:52

see that if

14:55

you don't care about the money like

14:57

use the cloud. Okay, the

14:59

money is

15:01

the only reason for going on-prem, because

15:03

it's indeed tough

15:05

but it's interesting that you mentioned that

15:09

Kafka might sound reasonable

15:12

a good candidate to

15:14

be deployed on-prem and

15:17

by the way there are companies

15:19

that most of their clusters

15:23

that host

15:25

open source analytics tools or

15:28

messaging tools are deployed on the cloud but

15:31

Kafka is deployed on-prem.

15:33

I know

15:35

of two not small Israeli

15:38

companies that have their Kafka

15:40

deployed on-prem and

15:42

from

15:44

my experience, because I was

15:47

an SRE for various

15:50

customers when I was working

15:52

on-prem, so I saw

15:56

several examples and

15:59

the main... And

18:00

the CPU utilization is like

18:02

ridiculous. It's 10% user

18:04

time. And the

18:06

RAM usage, well, you never know what's your RAM

18:09

usage because we might later

18:11

talk about the page cache, but it's very

18:13

hard to understand what's the RAM usage in

18:15

Kafka. But the disk

18:18

storage utilization

18:20

was really high because of

18:22

this retention. And then

18:24

you come to a point, and I dedicated

18:26

a chapter in the book only for this,

18:29

to whether you use RAID, RAID

18:31

1 plus

18:34

RAID 0, or use JBOD. Because

18:36

I can give a real example

18:39

of a customer that had 17 brokers. And

18:45

just in order to satisfy the retention

18:47

requirements of the customer, and

18:49

then the customer decided to

18:52

double the amount of retention because

18:54

that's the order that he got.

18:56

It was some government law enforcement.

19:00

So you don't mess with them. They tell

19:02

you, OK, I need to double my retention.

19:04

So you need to satisfy this. And

19:09

the first reaction was, OK, let's double

19:11

down on the brokers. But

19:14

then I convinced

19:19

my manager that we just

19:22

need to switch from RAID 10 to

19:25

JBOD. RAID 10 gives you

19:27

double the amount of replication. So if you

19:29

have replication factor of three, you will get

19:31

six copies per segment if you use RAID

19:33

10. But if you use

19:36

JBOD, then you save

19:38

half of the disks.
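
To make the disk math concrete, here is a minimal sketch of the calculation being described; the retained-data figure is hypothetical, not from this customer story.

```python
# Physical disk space needed for the same retained data under RAID 10
# versus JBOD, with Kafka's own replication on top (hypothetical numbers).

def physical_tb(retained_tb: float, replication_factor: int, raid10: bool) -> float:
    logical = retained_tb * replication_factor      # Kafka's copies
    # RAID 1+0 mirrors every disk, so each logical byte costs two physical bytes.
    return logical * 2 if raid10 else logical

print(physical_tb(100, 3, raid10=True))   # 600.0 TB: six copies per segment
print(physical_tb(100, 3, raid10=False))  # 300.0 TB: three copies per segment
```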

19:41

And then you don't need to add

19:43

even one broker. Just use

19:46

the same number of disks. And if you

19:48

want even more storage, you can just add

19:50

more disks. But then you run into the

19:53

issue of, OK, I don't have enough

19:55

disks in my drawer. So

19:58

you need to add another broker, add

20:00

a drawer of disks to each broker. These

20:03

are things that on the cloud you don't

20:05

even think about them. People on the cloud

20:07

even don't know about them. But

20:10

if I will take one thing that

20:12

is really like that

20:17

makes provisioning Kafka on-prem to

20:19

be tough is the retention

20:22

requirements. And managers for

20:25

on-prem clusters tend to be very sensitive

20:27

for retention. They want 10 times

20:30

the amount of retention

20:33

or 20 times the amount of

20:35

retention that internet companies have, for

20:37

example. And bringing us

20:40

around to the book that you wrote,

20:42

which you mentioned, what was your motivation

20:44

for bringing this all together in the

20:46

written form? And what are the overall

20:48

goals that you have for the book

20:50

and the people who are reading it?

20:55

So back in,

20:57

I remember myself as a

21:00

back-end engineer trying to deploy

21:03

my application, my streaming application

21:05

on Kafka, and

21:07

something didn't work in the Kafka on

21:09

dev. So I remember

21:11

going to the DevOps

21:14

room, and everyone is afraid

21:16

when you go to the DevOps room

21:19

because they are the most important, the

21:21

critical part of the

21:23

organization. And I

21:25

asked them, okay, I don't know what's wrong

21:28

with my Kafka. And they

21:30

had no clue, and I didn't get an answer because

21:32

they had no time. And then I went out to

21:34

the room. I decided I will know Kafka from

21:37

all the angles. That was

21:40

my really initial motivation. But

21:43

if to be more serious, when

21:47

I started handling production issues in

21:49

Kafka and I went over

21:51

to sites, customer

21:53

sites, I saw that

21:55

people just looked clueless about

21:58

what's wrong with Kafka. They blamed, everyone

22:02

blamed the other part.

22:06

The DevOps blamed consumers, consumers

22:08

blamed producers, producers blamed Kafka. And

22:11

so I wanted to, and

22:16

it is in the middle of the data

22:18

pipeline. So that was my

22:20

motivation. Like, the best use

22:22

for my time was definitely

22:24

Kafka and Linux operating system,

22:26

like understanding the metrics and

22:28

how to diagnose

22:31

using Linux metrics, diagnose problems in

22:34

Kafka. Anyhow,

22:36

this is why I wanted

22:38

to learn Kafka

22:41

from real production issues. My

22:44

motivation for the book was

22:46

that these support engineers and

22:48

DevOps and developers that encounter

22:51

issues in Kafka, which

22:53

is not a managed service. So

22:55

those who manage it themselves will

22:57

have a cookbook for

23:01

understanding like recipes, understanding

23:05

how to handle production issues in Kafka. And

23:07

this is why the book

23:10

is split into three sections,

23:13

three logical sections. It's the

23:15

data section, the Linux

23:17

OS section and the Kafka metrics section.

23:21

And also, most

23:23

of it is cloud-based, but

23:25

also two chapters dedicated specifically

23:28

to on-prem. Because

23:32

some of the problems that I saw

23:35

on-prem were just duplicated when I was

23:37

starting to work on the cloud. On

23:40

the cloud is just easier because you have monitoring

23:43

in front of your eyes, you don't need to get logs

23:46

from support engineers. But

23:51

I saw that there

23:53

was no book. There was nothing

23:56

that even resembled a book about

23:58

real production issues in Kafka. And

24:00

there is a real need

24:02

for this, because there are

24:04

so many Kafka deployments out there,

24:06

on-prem and on the cloud.

24:09

And people are

24:11

just, yeah, I talked to a

24:13

CTO of a startup

24:15

who told me, I must use Kafka

24:17

because that's the messaging standard today, but

24:19

it's hell. Okay? Managing

24:21

it is so tough. And

24:24

I heard it afterwards

24:26

from different people,

24:29

different managers,

24:31

ops managers,

24:33

tech leaders in some

24:36

small companies, and

24:38

to this day it's the same,

24:41

a common problem, and

24:44

then they look for

24:46

a good guide.

24:48

I mean, nobody came to me

24:50

and asked me to write a

24:52

book, and it

24:55

took me some time because it's a lot of

24:57

work. And then I

24:59

decided to compile, to gather, all

25:01

the stuff that I compiled over

25:04

the six years of me working in

25:07

Kafka. And that

25:09

was my motivation.

25:11

Like, I'm not saying that people

25:13

shouldn't use Amazon MSK or

25:15

Aiven or Confluent, to be clear.

25:18

I think that

25:21

for those who already manage

25:23

Kafka, they should have

25:25

a better guide;

25:27

they have no guide, first of

25:29

all. And also for those who

25:31

consider whether managing Kafka themselves

25:34

is

25:36

too much, so that we need

25:38

to pay a license and move to

25:40

some managed service. I

25:43

think that if they have a thick

25:45

or detailed guide, then it

25:47

can save them money. And

25:50

I work with individuals that have helped

25:52

open source for twenty years, and like

25:54

I say, this is my small

25:56

contribution to the

25:59

open source community. To the

26:01

point of saving money and whether

26:03

to run your own Kafka cluster

26:05

or use a managed service, there

26:07

are a lot of considerations that

26:09

go into all of that, as

26:12

well as the use cases of what

26:14

you're going to be building on top

26:17

of Kafka. And what

26:19

are the sizing and scaling requirements?

26:21

I'm wondering for people who are

26:24

in that position of deciding, do I

26:26

want to use Kafka? How do I

26:28

want to use Kafka as far as self

26:31

managed or managed service? And

26:33

what are the parameters along which I need

26:35

to project what my cost is going to

26:38

be? What are the different

26:40

elements that they need to be considering and

26:42

planning for as they start to evaluate and

26:44

do those initial deployments? So

26:48

first of all, you need to understand your traffic.

26:50

Okay, so let's

26:53

assume you are not a small

26:56

company and that you decide if

26:58

you are the decision point whether you

27:00

need to use Kafka first of all,

27:02

and then if you use Kafka, whether

27:05

you're going to use cloud or you

27:07

manage it yourself. So regarding

27:10

alternatives to Kafka, I'm not

27:12

aware of any alternatives, however,

27:15

I never search for

27:17

alternatives because Kafka is just everywhere.

27:20

So let's say you chose Kafka and

27:22

you have enough traffic so that you

27:26

can build even a

27:28

minimal cluster of three brokers. But

27:32

let's assume you have more and let's

27:34

assume you have multiple Kafka clusters per

27:36

each team or service or group or

27:38

whatever. So in that case, first

27:40

of all, you need to know your traffic, the

27:43

number of topics, the cluster,

27:46

the number of partitions, and

27:49

who are your consumers per

27:52

topic, who are your producers. So

27:55

understanding the producing rate, the

27:57

consuming rate, and

27:59

then then you need to provision a

28:02

cluster. Let's say even before

28:04

you know whether it's cloud based or not,

28:06

you need to understand how much CPU you're

28:08

going to require, how much RAM

28:11

in order to support not the Kafka process

28:13

itself, but the page cache, because you want

28:15

data to be read from the page cache

28:17

and not from the disk. And

28:20

allocate enough disks, cheap

28:22

disks, in order to support, as

28:24

cheaply as possible, the

28:27

retention requirement.
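
As a rough illustration of that estimation step, here is a back-of-the-envelope sketch; every input is hypothetical, and the page-cache "hot tail" rule of thumb is my assumption, not a formula from the conversation.

```python
# Back-of-the-envelope Kafka capacity estimate (all inputs hypothetical).
ingress_mb_s = 50          # average producer traffic into the cluster
retention_hours = 72       # how long data must be kept
replication = 3
headroom = 1.4             # ~40% slack for bursts and imbalance
brokers = 6

# Disk: retained bytes times replication, spread over the brokers.
retained_gb = ingress_mb_s * 3600 * retention_hours / 1024
disk_per_broker_gb = retained_gb * replication * headroom / brokers

# RAM: size the page cache for the hot tail that consumers actually read
# (say, the last 30 minutes of traffic), not for the Kafka heap itself.
page_cache_gb = ingress_mb_s * 60 * 30 / 1024 / brokers

print(f"disk per broker: ~{disk_per_broker_gb:,.0f} GB")
print(f"page-cache target per broker: ~{page_cache_gb:,.1f} GB")
```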

28:31

And now, after you have this estimation

28:33

and you know how many

28:37

brokers you have and what's the size of the

28:39

broker, now comes

28:41

the decision whether it's cloud based

28:44

or not. If you are cloud

28:46

based, you will usually pick

28:49

cloud based. But

28:51

again, some companies decide, OK, I will put

28:54

my Kafka on prem because they know it

28:56

cannot be spot instances. But

28:59

most of the companies which are cloud

29:01

based will deploy it on the cloud.

29:04

And they will deploy it probably

29:06

on on demand instances if they're on

29:08

AWS. Some,

29:10

if they have a strong

29:12

DevOps team, they

29:15

will deploy it on

29:17

Kubernetes maybe. But

29:22

I don't have an experience with Kafka on

29:24

Kubernetes so I can't say anything about it.

29:26

I also didn't write about it, of course.

29:30

But so, and

29:33

if you're on prem, you will deploy it on prem,

29:35

of course. Now, the question of

29:38

whether to deploy it on,

29:40

like purchase a license for

29:43

Confluent or for Amazon MSK or

29:48

for Aiven, these are some

29:50

of the managed services, or

29:52

manage it yourself. It

29:56

really depends on the, how

30:00

your ops team, whether it's DevOps

30:02

or DataOps, knows

30:05

how to handle production

30:07

issues. And let's

30:10

say, you know, good ops teams,

30:13

really good ops teams, I

30:15

mean, are rare. And

30:20

I think that

30:22

you need

30:24

good, an excellent monitoring of

30:27

Kafka. And when I say excellent, I don't

30:29

mean hundreds of metrics.

30:32

Okay, I mean, a

30:34

minimal subset, a minimal set of

30:36

metrics that will show you where

30:39

production issues can occur. And

30:41

in my book, I have two chapters,

30:43

one on producer metrics, one

30:45

on consumer metrics, and three

30:50

chapters on CPU,

30:52

RAM and disks for the broker

30:54

themselves. So I think like half

30:56

of the book talks about how

31:00

to diagnose issues and, and

31:03

from this, the reader can

31:05

understand what to monitor. So

31:07

you need an excellent monitoring, but not

31:09

too large. You need like

31:12

specific monitoring in

31:16

order to deploy it yourself, because if

31:18

you're blind in Kafka, you

31:20

will pay a big price, you will have

31:22

downtime, and you will just lose

31:25

data. So you must have

31:27

an excellent, excellent monitoring,

31:30

and a team who knows how to,

31:33

how to manage the Kafka cluster. And

31:37

otherwise, choose the

31:40

cloud-based option,

31:42

because, again, the

31:45

knowledge of handling Kafka clusters

31:47

is pretty rare. What

31:49

happens is that some places,

31:52

some customers, and I

31:54

saw it on-prem,

31:58

sometimes Kafka halts and they lose their data,

32:00

because there's not enough knowledge.

32:04

That's not, like, that's not the

32:06

kind of situation at my current

32:08

company. We have an excellent DevOps

32:11

team. And its team leader

32:13

was also the technical editor for my

32:15

book, and he had a

32:17

big effect on the content

32:20

and also on, like,

32:23

some parts of it,

32:25

what to focus on. His name is Oaronon, and

32:28

I am very

32:30

grateful for him to,

32:33

that he invested the time

32:37

to read and to edit the book. But

32:40

again, in some companies

32:42

you have an excellent DevOps team

32:44

and in some companies you don't.

32:47

And if you're willing to lose data at some

32:49

time, then use the

32:51

open source. And if not, then go

32:54

to, then

32:56

pay the license for Managed Kafka.

33:01

Data lakes are notoriously complex. For

33:04

data engineers who battle to build and

33:07

scale high quality data workflows on the

33:09

data lake, Starburst powers petabyte scale SQL

33:11

analytics fast at a fraction of the

33:13

cost of traditional methods so that you

33:16

can meet all of your data needs,

33:18

ranging from AI to data applications to

33:20

complete analytics. Trusted by teams of all

33:22

sizes, including Comcast and DoorDash, Starburst is

33:24

a data lake analytics platform that delivers

33:27

the adaptability and flexibility a lakehouse

33:29

ecosystem promises. And

33:31

Starburst does all of this on an

33:34

open architecture with first-class support for Apache

33:36

Iceberg, Delta Lake, and Hudi, so

33:38

you always maintain ownership of your data. Want

33:41

to see Starburst in action? Go

33:43

to dataengineeringpodcast.com/Starburst and get $500

33:46

in credits to try Starburst

33:48

Galaxy today, the easiest and

33:50

fastest way to get started

33:52

using Trino. And on that note of

33:54

data loss and the... Cluster

34:00

uptime and stability. What

34:03

are some of the failure conditions

34:05

that cluster operators need to be

34:07

thinking about that might lead to

34:09

data loss and some of the

34:12

ways to mitigate that and plan

34:14

for it in order to be able to

34:16

reduce the time to resolution? In

34:19

terms of preventing a data

34:22

loss, let's start with prevention and then

34:24

go to what

34:27

can cause a data loss.

34:29

First of all, you have replication.

34:32

Now replication goes side by side with the

34:35

size of the cluster because the higher

34:37

the replication factor, then each

34:40

segment will have more copies on different

34:42

brokers. However, you will

34:44

need more storage and more

34:46

storage means,

34:49

like more storage means

34:52

sometimes more brokers. And

34:54

even if not more brokers, it means

34:56

more, more, more disks. So

34:59

it will cost you more, a

35:02

higher replication factor. Also, you

35:04

can add another level of

35:07

assurance and deploy

35:09

RAID, let's say RAID 10, RAID 1

35:11

plus RAID 0, and this

35:13

will double the amount of storage

35:15

that you need. So

35:18

replication is one thing. Making

35:21

sure that your producer,

35:24

they receive the

35:26

acks acknowledgement from the

35:28

broker is another thing, although it

35:30

may affect latency.
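
A minimal sketch of that producer-side acknowledgement setting, using the confluent-kafka Python client; the broker address and topic are illustrative, and durability also depends on the topic-level min.insync.replicas setting.

```python
from confluent_kafka import Producer

# A send only counts as successful once all in-sync replicas have the record.
producer = Producer({
    "bootstrap.servers": "broker1:9092",   # hypothetical address
    "acks": "all",                         # wait for the full ISR, not just the leader
    "enable.idempotence": True,            # avoid duplicates on retries
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")   # surface un-acked sends instead of losing them

producer.produce("events", value=b"payload", callback=on_delivery)
producer.flush()
```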

35:34

Retention policy. Retention policy, it's

35:36

a pretty simple policy, and you

35:38

wouldn't believe the number of

35:41

times that consumers lose data

35:43

because of it, because you

35:46

can define it by time or by

35:48

size or by both, and then the

35:50

threshold is the minimum between both, and

35:53

if you

35:55

define it by size and

35:58

suddenly the topic size increases

36:01

even by a small percentage

36:03

and your consumer lags, then you will

36:05

lose data or some consumers will lose

36:08

data. So if

36:13

the audience will take one thing regarding

36:16

retention is that they

36:18

highly consider configuring retention

36:20

by time and

36:23

not by size because you really

36:26

don't know the traffic at any

36:28

point of time and

36:31

some producer can increase the traffic by

36:33

a multiply of

36:35

10 because there

36:37

was some filter and the developer

36:39

just removed that filter and

36:42

suddenly you get 10 times the traffic.

36:45

However, if again

36:47

nothing is simple in Kafka, if you

36:50

configure retention by time then

36:52

you might get into

36:54

100% storage. So

36:57

again this equilibrium between retention

37:00

size, retention time and

37:03

storage. Again we reach storage.
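
A small sketch of that equilibrium: given a size-based retention limit, the current write rate, and a consumer's lag, estimate how long until unread data is deleted. All numbers are hypothetical.

```python
# Will size-based retention delete records before a lagging consumer reads them?
retention_bytes = 500 * 1024**3       # retention.bytes per partition (500 GiB)
ingress_bytes_s = 40 * 1024**2        # current write rate into the partition
consumer_lag_bytes = 350 * 1024**3    # how far behind the slowest consumer is

# Time until the oldest unread byte falls off the tail of the log.
hours = (retention_bytes - consumer_lag_bytes) / ingress_bytes_s / 3600
print(f"oldest unread data is deleted in ~{hours:.1f} h at the current rate")
# If a producer's rate jumps 10x (a removed filter, say), this window shrinks
# 10x too, which is why time-based retention is easier to reason about.
```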

37:07

Hardware failures: monitor hardware failures.

37:09

Now on-prem you

37:11

can just check the status of

37:13

the disks. I repeat again, disk,

37:16

disk, disk, this is the most

37:18

critical, the cheapest part of Kafka

37:21

but the most critical part of Kafka

37:23

and the cause for

37:25

many, many problems. So like

37:27

on on-prem you can, there are tools that

37:30

you can monitor the state of the disk

37:34

and just a very like

37:36

a recommendation that can save

37:39

several clusters, I assume. If

37:42

a disk becomes read-only, so

37:45

you can just read and not write, usually

37:48

it means you need to replace the disk because

37:52

even running FSCK on the disk

37:55

will help maybe for several minutes, hours or

37:57

days, but after that it will become

37:59

read-only again,

38:01

and producers cannot write to it and it

38:03

will create a whole mess. So

38:06

monitor the disks. On-prem

38:08

you have the tools to

38:10

check the disk itself,

38:13

and on the cloud, you can check

38:18

the disk utilization. So the

38:20

iostat tool, the Linux tool

38:22

called iostat, running iostat -x,

38:25

printing every

38:27

one second can show you

38:29

that if the disk utilization is 100% all the

38:31

time, then

38:33

something is wrong with your disk.
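
As a sketch of that check, the following script tails iostat -x 1 and flags a disk that stays pegged near 100% utilization for a sustained stretch (a short burst is normal; constant saturation is not). Column layout varies between sysstat versions, so treat the parsing as illustrative.

```python
import subprocess

streak = {}
proc = subprocess.Popen(["iostat", "-x", "1"], stdout=subprocess.PIPE, text=True)

for line in proc.stdout:
    fields = line.split()
    # Skip blank, header, and CPU lines; device rows start with a device name.
    if not fields or not fields[0][0].isalpha() or \
       fields[0] in ("Linux", "avg-cpu:") or fields[0].startswith("Device"):
        continue
    try:
        util = float(fields[-1])          # %util is the last column of `iostat -x`
    except ValueError:
        continue
    dev = fields[0]
    streak[dev] = streak.get(dev, 0) + 1 if util >= 99.0 else 0
    if streak[dev] == 30:                 # ~30 consecutive seconds pegged
        print(f"ALERT: {dev} at ~100% utilization for 30s, suspect the disk")
```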

38:36

Another metric is, well, there are lots

38:38

of metrics, but another one that

38:41

happens to be around every production

38:44

issue is the system time, the CPU system

38:46

time. There are four main

38:48

CPU metrics. You have the user time,

38:50

system time, iowait

38:53

time, which is either wait

38:55

time for disk or wait time for network and

38:57

context switches. If the system time goes,

39:00

let's say, above 10%, then

39:03

you should suspect something is not good

39:05

in your cluster. If your

39:07

context switches time reaches,

39:10

let's say more than 3-4%, then

39:14

check what happens there.
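
Here is a minimal sketch of those rules of thumb using the psutil library (an assumption; the same numbers come from /proc/stat). psutil exposes context switches as a raw counter, so the sketch prints a rate rather than the percentage quoted above.

```python
import time
import psutil

# Sample CPU time percentages over one second.
cpu = psutil.cpu_times_percent(interval=1.0)
if cpu.system > 10.0:
    print(f"suspect: system time {cpu.system:.1f}% (rule of thumb: stay under 10%)")
if getattr(cpu, "iowait", 0.0) > 10.0:    # iowait is reported on Linux
    print(f"suspect: iowait {cpu.iowait:.1f}%, the disks may be the bottleneck")

# Context switches are a cumulative counter, so diff two samples for a rate.
before = psutil.cpu_stats().ctx_switches
time.sleep(1.0)
print(f"context switches/sec: {psutil.cpu_stats().ctx_switches - before}")
```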

39:18

A common cause, by the way,

39:20

to a high context

39:22

switch time percentage

39:25

could be that you just

39:28

have too many disk-accessing threads compared to

39:30

the number of disks that you have. And

39:32

then, if you want to be

39:34

very cautious, you

39:40

can back up regularly

39:42

your data, either by

39:44

using the tiered storage feature

39:46

of Kafka, which I admit

39:48

I didn't use until now,

39:51

or you can just add another

39:54

consumer and just take

39:57

the data from the topic and persist it to

39:59

some... cheaper storage

40:01

like HDFS or

40:04

S3 where the storage is

40:06

separate from the compute.
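
A sketch of that extra-consumer-as-backup idea with the confluent-kafka client and boto3; broker, topic, group, and bucket names are all hypothetical.

```python
import boto3
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "s3-archiver",            # a dedicated group, so offsets are independent
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
s3 = boto3.client("s3")

batch, batch_no = [], 0
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    batch.append(msg.value())
    if len(batch) >= 10_000:              # flush to cheap storage in coarse chunks
        s3.put_object(Bucket="kafka-archive",
                      Key=f"events/batch-{batch_no:08d}",
                      Body=b"\n".join(batch))
        batch, batch_no = [], batch_no + 1
```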

40:09

So this is regarding some ways

40:11

to prevent data loss, but

40:14

these

40:17

are, like, what happens

40:19

after you already have a problem

40:23

but the question is how you prevent

40:25

the problem and from

40:27

my experience with Kafka in

40:29

production the last six years

40:32

is that Kafka clusters talk before,

40:35

they hint you before they cripple

40:37

down. They tell you stuff, and they

40:40

tell you this through the disk utilization,

40:43

which you can

40:45

see using the iostat tool. They tell you this

40:47

using top command check the

40:49

system time the CPU system time

40:52

check the CPU context switch CPU

40:54

wait time if you

40:56

have GC issues in your Kafka

40:59

process then it will tell you

41:01

this through spikes

41:03

in the user time during a full GC

41:05

you can use the JSTA tool

41:08

which is part of the JDK tools in order

41:10

to check for a full

41:12

GC, the frequency of full GCs.
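
A small sketch of that check: poll jstat -gcutil (part of the JDK) against the broker's JVM and watch the cumulative full-GC counter. The pid is hypothetical; find it with jps or pgrep -f kafka.Kafka.

```python
import subprocess
import time

KAFKA_PID = "12345"                        # hypothetical broker pid

last_fgc = None
while True:
    out = subprocess.run(["jstat", "-gcutil", KAFKA_PID],
                         capture_output=True, text=True).stdout.splitlines()
    header, values = out[0].split(), out[1].split()
    fgc = int(float(values[header.index("FGC")]))   # cumulative full-GC count
    if last_fgc is not None and fgc > last_fgc:
        print(f"full GC occurred: {fgc - last_fgc} since the last check")
    last_fgc = fgc
    time.sleep(10)
```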

41:16

Again, it's like this is why half

41:18

the book talks about half the chapters almost

41:20

talk about how to

41:22

like various cases that

41:24

I stumbled upon regarding

41:28

what can lead you to data

41:31

loss I had a

41:33

colleague asking me I

41:37

gave him a copy of the book

41:39

and he asked me okay well but

41:41

you didn't talk about the issue of

41:43

under-replicated partitions, and

41:45

I replied to him well every

41:49

problem in Kafka can result

41:51

in under-replicated partitions or

41:54

in partitions that

41:56

don't appear in the list of in-

41:59

sync replicas, the ISR.
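
For reference, the stock CLI that ships with Kafka can list exactly those partitions; a minimal sketch (install path and bootstrap address are assumptions for your environment):

```python
import subprocess

out = subprocess.run(
    ["/opt/kafka/bin/kafka-topics.sh",     # path depends on your install
     "--bootstrap-server", "broker1:9092",
     "--describe", "--under-replicated-partitions"],
    capture_output=True, text=True,
).stdout

if out.strip():
    print("under-replicated partitions found:")
    print(out)    # each line names a topic/partition whose ISR is short
else:
    print("all partitions fully replicated")
```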

42:03

So, there are tens of problems which

42:05

might not even be on the Kafka

42:07

itself. Like it could be in the

42:10

operating system because something is like

42:14

sometimes it's even not in the Kafka again.

42:16

Like if

42:18

you deploy another service, if you

42:20

deploy anti-virus, anti-virus

42:23

or a firewall, and

42:26

they scan the segments

42:28

all the time and you will have

42:30

high disk utilization, this is not the fault

42:32

of Kafka but

42:34

it will cripple down your Kafka. So,

42:38

and I saw some on-prem

42:40

clusters where you shut down the

42:42

firewall and then it comes up again

42:44

because of some policy. So,

42:47

you need to remember that not always

42:49

you run alone on the

42:51

broker but asking

42:53

like, what can cause data

42:56

loss, is

43:00

like asking what should I monitor in

43:02

Kafka and how to deal with

43:04

it and that's the

43:06

topic of the book in

43:08

general. And given your experience

43:11

both working as a back-end

43:13

engineer and operating Kafka clusters

43:15

at scale, I'm wondering

43:17

if you were to be in

43:19

the room today redesigning Kafka from

43:21

the ground up, what are some

43:23

of the aspects of the system

43:25

design that you might choose to

43:28

revisit or revitalize? I

43:34

think the design is

43:37

makes a lot of sense, the

43:39

log-based approach. I

43:43

don't have many changes, I

43:46

don't think of any change other than

43:48

one change which is for me it's

43:50

a bug, I don't know if they declare it as a feature

43:54

that they

43:56

spread if you have more than one disc

43:58

then the partitions are being

44:00

spread among

44:03

the disks according to the number of partitions

44:05

per disk and not according to the

44:08

amount of storage per disk, which

44:11

I don't think that the Kafka

44:13

community understands how it affects

44:16

some Kafka cluster owners to

44:18

decide whether they go RAID or JBOD.

44:22

It's like something that if it will

44:25

be fixed, the amount of cost that

44:27

will be saved for

44:29

disks and maybe for clusters will be pretty

44:32

big because

44:35

there are other

44:37

teams that say, okay, I don't want to

44:39

go JBOD because of this. I

44:42

don't want to handle the spread of the data among

44:44

the disk. But other

44:47

than this feature, feature slash

44:49

bug of not spreading the data

44:51

evenly according to

44:53

storage but by the number

44:55

of partitions, I

44:59

must admit that I'm not

45:02

in the applicative side

45:05

anymore for a long time because to be

45:08

honest, it interests me much less

45:11

than the ops side. So

45:13

my focus is on the ops side and

45:17

then it comes down

45:20

to like, this question comes

45:22

down to like, it

45:24

could be asked regarding every big

45:27

data cluster, but in Kafka,

45:31

it's really hard to

45:33

detect, to understand the root cause of

45:35

production issues. To be honest,

45:37

I don't know why fixing

45:40

a problem in Druid or Spark

45:42

or Spark Streaming takes much

45:45

less time than understanding why

45:49

data was

45:52

lost on Kafka or

45:54

why the disk utilization is so high. I

45:57

don't fully understand it. Like

46:00

if I take the amount of time that I

46:02

invested in every Kafka production issue

46:04

compared to other clusters, it will be, I'm

46:06

not kidding, like 10 times, between

46:09

five times and 10 times more. So

46:11

I think that, and

46:13

this brings me back to the motivation of writing

46:15

the book, that Kafka

46:17

uses the operating system in a

46:20

way that no other

46:22

open source that I know of

46:24

uses it, especially,

46:26

not the CPU by the

46:28

way, it's really like on the CPU,

46:30

it's like very simple, user time

46:32

and that's it, if you do it correct.

46:36

But the usage of the page cache, the thrashing

46:38

of the page cache in cases that you have

46:40

lags, and the

46:42

amount of stress that

46:46

comes down on the disks is

46:48

something that I don't

46:50

think that those

46:53

who created Kafka or those

46:55

who develop it or

46:58

contribute to it fully appreciate. I

47:00

think there's some split between those

47:03

who develop and those who maintain

47:06

and there's not enough

47:09

connection between the two

47:11

sides. And

47:13

maybe this is another one, I think of it,

47:17

I'm thinking out loud now, this

47:19

is another reason why the book

47:21

is important both for developers and

47:23

for ops team in order to

47:25

not only understand Kafka but also

47:27

understand for developers to understand the

47:29

ops team, because

47:32

there are so many problems that

47:34

are being caused by one

47:37

disk that goes wrong, one

47:40

disk, you

47:42

can have a cluster with tens of

47:44

disks and one disk goes bad and

47:47

this can cause the whole

47:49

cluster to halt and

47:52

a healthy cluster

47:54

should not suffer from such

47:56

an issue. I remember getting a

47:58

call about five years ago

48:01

from the customer side that had

48:03

three brokers, six disks each,

48:05

HDD disks, they configured

48:08

in RAID 10 and

48:11

one disk went bad. So you

48:13

see, but in RAID, you

48:16

see only one logical disk. So

48:19

and the utilization in iostat

48:21

shows the utilization of the

48:24

highest utilization among all the disks in the RAID and

48:26

then you see 100%. So

48:30

I had to, I guess that

48:33

it was one machine, one disk

48:35

that was, that got screwed and

48:38

I told the support engineer go to the room

48:40

and check if

48:42

the light blinks on

48:45

one of the disks and he told me yes, the

48:47

light blinks on one of them. But

48:51

imagine what you need to do. I

48:53

needed to guess this

48:55

because of the lack

48:57

of monitoring. So when

49:00

I think of it, then

49:02

something needs to change,

49:06

not in how Kafka works

49:08

the disks, because this is what Kafka does. It writes

49:11

massive amounts of data to the disk and

49:13

it reads, it tries not to

49:15

read from the disk, but at

49:17

many customer sites or cluster

49:21

it reads from disk and

49:23

first of all, often it reads from

49:25

disk. Secondly, like

49:29

many clusters have deployments of RAID.

49:31

So how can you

49:34

assist them in understanding that one

49:36

disk got screwed? They

49:39

don't have a, and in the

49:41

cloud also, by the way, why when

49:43

one disk gets bad, the

49:48

cloud provider won't tell you that because they

49:51

can't really tell you, okay, the disk

49:53

is on 80% utilization for 30 minutes.

49:59

We don't know if it's good or not so we will

50:01

not tell you anything, and then

50:03

you get into high iowait and you replace

50:05

the broker so the

50:08

mitigation for these issues in

50:10

Kafka is

50:12

brute

50:15

force and in figuring

50:17

out that you have this issue you

50:20

need to be a magician in order to

50:23

know this and

50:26

and developers of Kafka just

50:28

assume that okay someone

50:31

will handle it, it's not us, we just develop.

50:34

and I think for me

50:36

I did a career

50:38

shift, which is not common, going

50:40

from developer to ops so

50:43

I understand the frustration from both

50:45

sides and

50:47

I think that there should be more people that

50:50

know both paradigms

50:54

and because a lot of production

50:56

issues originate

50:58

from the lack of understanding, the

51:01

lack of cooperation or knowledge

51:03

sharing, between the two.

51:05

if if the Kafka community had

51:07

better communication between the developers and

51:09

the ops team then I

51:13

think it would be much easier to detect

51:15

the disk issues which cause I

51:19

bet it causes like at

51:22

least a third of the problems in

51:24

Kafka yeah it's definitely

51:27

always a challenge balancing the developer

51:29

of I just want to get

51:31

something shipped and do something cool

51:33

with some fancy new feature and

51:35

the operations team of I just

51:37

want you to stop crashing my

51:39

machine so that I can sleep

51:41

at night. Yeah, yes, but

51:43

it's like, in

51:46

the Kafka community, I

51:48

think there is a lack of an ops team that will

51:50

check these

51:53

developments.

51:56

And it's not even about the development itself,

51:59

I'm telling, I'm asking the

52:01

Kafka community: like, go to deployments

52:04

of Kafka and

52:07

check, ask customers what's

52:09

the percentage of production issues

52:12

which is caused by disks? And

52:15

then try to, I

52:17

don't know, maybe every Kafka

52:19

tool needs a better monitoring of

52:22

disks. And

52:24

also if there is monitoring, how do

52:27

you read, like Brendan

52:29

Gregg has an excellent explanation of how

52:31

to read iostat, the

52:33

output of iostat. So you

52:35

have the utilization, like the service time

52:38

is obsolete, no one needs

52:40

to look at it. So you have utilization, you

52:42

have throughput, which is the read megabyte per

52:44

sec and write megabyte per sec and you

52:46

have IOPS, which is the read per sec

52:48

and write per sec. And

52:54

this is a fact that not many people

52:57

know: the saturation of disks is 60%,

52:59

only 60%, which

53:02

means that with every increase in the

53:05

disk utilization beyond that,

53:07

the situation,

53:10

the level of iowait, will become worse

53:12

and worse and worse. And

53:14

it's not like in CPU, in CPU it could be,

53:18

like the recommendation on cloud is like 75%

53:20

CPU saturation. Above

53:23

that your load average increases

53:26

in a nonlinear way. So

53:30

how do you read, how should Kafka

53:32

users

53:35

read the output of iostat? Well, it's

53:37

simple, look at the disk utilization.

53:40

The disk utilization: I saw

53:42

fellows telling me, okay, we reached

53:44

100%, this is not good. No, this is

53:46

okay, because Kafka works in burst.

53:49

So it writes a lot of data

53:51

for a small amount of time. So

53:53

we'll have high disk utilization caused by

53:56

write megabytes per sec and

53:58

writes per sec. This

54:00

is good. And then you should see zero. And

54:03

then again, you have a burst of writes. And

54:07

but if

54:09

you have 100% utilization because of reads, then

54:12

it means that you read from the disk, which

54:15

means that you have a problem somewhere.

54:17

Maybe you have a consumer lag, maybe

54:19

a replica, some broker

54:22

that replicates the data is

54:24

lagging behind,

54:27

which means that for this partition,

54:30

this broker is not

54:32

in the ISR list. So

54:35

just look at the output of the

54:37

iostat -x, which

54:39

prints every one second. And if

54:41

you have several seconds of

54:43

100% utilization from a

54:46

write, then this is OK. But if you

54:48

have the same 100% utilization

54:52

from reads, then this is not

54:54

OK.
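
That heuristic is mechanical enough to sketch in a few lines: bursty 100% utilization is the normal write pattern, while sustained 100% utilization dominated by reads points at lagging consumers or replicas. The sample values below are made up.

```python
def classify(samples):
    """samples: one dict per second with rMB_s, wMB_s, and util from iostat -x."""
    pegged = [s for s in samples if s["util"] >= 99.0]
    if len(pegged) < 0.8 * len(samples):
        return "ok: utilization is bursty (expected Kafka write pattern)"
    reads = sum(s["rMB_s"] for s in pegged)
    writes = sum(s["wMB_s"] for s in pegged)
    if reads > writes:
        return "problem: sustained 100% util dominated by reads (consumer/replica lag?)"
    return "watch: sustained 100% util from writes"

print(classify([{"rMB_s": 180.0, "wMB_s": 20.0, "util": 100.0}] * 10))
```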

54:56

Now, if the community will take this and make it an

54:59

alert or some monitoring tool,

55:02

so that

55:04

ops will know how to read this, then

55:07

it will

55:10

reduce the frequency of production issues.

55:14

It is more than 10 percent for sure.

55:16

Just this feature of

55:18

detecting lags

55:22

by looking at iostat. And

55:26

in your experience of working

55:29

with Kafka clusters and helping

55:31

customers and end users manage

55:33

and ensure their uptime, what

55:35

are some of the most

55:37

interesting or unexpected or challenging

55:39

production problems that you've had

55:42

the opportunity to diagnose?

55:45

OK, I will tell

55:50

the

55:52

most bizarre. I

55:55

will tell the most bizarre production issue and the

55:57

most interesting one. So the most

55:59

bizarre one, I already

56:03

mentioned it before was a

56:05

cluster of three brokers, 18 disks, 6

56:07

per broker, configured in

56:11

RAID 10, where one faulty disk crippled

56:13

down the cluster because it

56:16

was on RAID 10. So

56:24

in fact, only two disks

56:26

were down, and the disk utilization was

56:29

already pretty high. So it

56:31

just added this problem to the party. And

56:35

then I had to guess that one disk

56:37

was faulty. And

56:39

the support engineer really saw that one disk

56:42

was faulty after he went to the server

56:44

room. And of course,

56:46

the other, its twin disk, because

56:48

it's RAID 10, also

56:51

didn't function. So

56:54

on date and the disk utilization

56:56

was already very high. So it

56:58

brought the disk utilization to 100% on

57:00

the broker. So third

57:03

of the leaders, partition leaders were

57:05

on that broker and then producers

57:08

stopped, couldn't write to them.

57:10

Consumers didn't read from them. And

57:13

once consumers don't read

57:15

even from one partition, then they

57:18

just cannot function. It

57:20

depends, of course, on the nature of your consumers. But

57:22

this was a streaming

57:25

application that had to read from all

57:27

partitions. So it got stuck. Even

57:30

if it wasn't the application that needs to

57:32

read all partitions, then you would have

57:34

partial data. So that

57:36

was, but that was a very

57:38

bizarre problem that combined

57:41

on-prem and high disk utilization and

57:43

guessing that one disk is faulty.

57:45

The most interesting problem

57:49

that I ran into,

57:52

it took several

57:55

weeks, I think, and it

57:58

involved a... a

58:01

broker. From a certain point

58:03

in time, every broker that was

58:05

added to the cluster, due to

58:08

some failure of

58:10

another broker, every

58:12

broker got

58:16

at some point to 100% disk utilization, and

58:23

nobody managed to write and read

58:25

from it. Nobody managed to write

58:27

so there was nothing to consume.

58:31

Every time we replaced this broker

58:33

with another broker and again the

58:35

same phenomena happened and like

58:40

and then, again looking at the iostat,

58:43

after a long, long time of trying

58:46

to understand what is going on here. It was

58:49

like Voodoo and we

58:51

noticed that at

58:55

some brokers the

58:57

disks reached 100% utilization,

59:00

and stayed there:

59:03

all the time they were at 100% utilization.

59:07

But

59:10

when we looked at the throughput we

59:12

saw that they had half the

59:15

throughput

59:17

that caused other disks on other

59:19

brokers to be at 100% utilization. So

59:22

the healthy brokers reached

59:24

100% utilization only for a

59:27

few seconds and then the

59:29

utilization went down. But these

59:31

brokers with the

59:34

faulty disks reached 100%

59:36

disk utilization and kept 100%

59:39

utilization while the throughput was

59:42

half. So just

59:45

correlating the disk utilization with

59:47

the throughput and the

59:50

amount of time the disc utilization was

59:52

100% let

59:55

us understand that these are just

59:58

faulty disks. But

1:00:00

this took a lot of Excel sheets and

1:00:02

experience and just checking, trying

1:00:04

to correlate every

1:00:07

small asymmetry, until

1:00:09

we found it out. And

1:00:11

that was by far the most

1:00:13

interesting production issue

1:00:15

that I stumbled into. In

1:00:18

Kafka. And in

1:00:20

your work of writing the book

1:00:23

and consolidating all of the information

1:00:25

and experience that you've had working

1:00:27

with Kafka, I'm wondering if there

1:00:30

are any insights that that helped

1:00:32

you gain or any new knowledge that you

1:00:34

were able to obtain in the process.

1:00:38

Of course. Yeah, mainly

1:00:43

two

1:00:46

issues. Like, it

1:00:49

helped me formulate

1:00:51

the three legs that

1:00:53

Kafka stands on, that one needs

1:00:55

to understand, let's say: the

1:00:57

data part, the OS part and the

1:00:59

Kafka part. I was

1:01:01

surprised to see that the Kafka part

1:01:04

is only third of the book,

1:01:07

which shows how much the data part is

1:01:09

important. How,

1:01:12

how

1:01:14

the way that the data is

1:01:16

spread among the partitions

1:01:19

is so important for the

1:01:21

health of the cluster. And

1:01:24

also, I was

1:01:26

surprised of how many issues

1:01:29

production issues originate from

1:01:31

a problem with

1:01:33

storage. And,

1:01:37

and also I found

1:01:40

out several producer and

1:01:42

consumer metrics that, that were

1:01:44

new to me because

1:01:47

I thought that

1:01:50

many issues can be

1:01:52

fixed with tuning the linger.ms

1:01:55

and the batch size on the producer, and I

1:01:57

found out several very

1:01:59

important metrics in the consumer and producer.
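
For context, these are the two producer knobs being referred to; a minimal sketch with the confluent-kafka client (librdkafka option names, illustrative values):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",   # hypothetical address
    "linger.ms": 50,                       # wait up to 50 ms to fill a batch
    "batch.size": 256 * 1024,              # bigger batches, fewer requests per broker
    "compression.type": "lz4",             # cheaper network and disk per record
})
```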

1:02:02

So this was also new to

1:02:05

me. I ran across them

1:02:07

during the time, like

1:02:10

from production issues that I dealt

1:02:12

with during the time that

1:02:14

I wrote the book. And

1:02:17

I must

1:02:19

say that a lot of things written in the book

1:02:22

were not things that only

1:02:24

I discovered. So I

1:02:26

worked with several people that

1:02:28

we worked together and

1:02:33

found the issues together. So in

1:02:36

part it was just documenting what

1:02:39

a team of ops people

1:02:41

found out, including

1:02:43

me but also other

1:02:46

people. I

1:02:48

have like, in

1:02:51

order to make this more specific, I

1:02:54

have for example the issue

1:02:57

of the storage, just to emphasize,

1:03:01

when I wrote the part of the storage usage,

1:03:03

turns out that

1:03:07

it's more vast than what

1:03:09

I thought of. So if

1:03:11

we just take this issue,

1:03:14

the storage usage issue, so

1:03:17

for example, running out of

1:03:19

disk space, due

1:03:23

to retention configuration, what

1:03:25

happens when you configure

1:03:27

both time-based and size-based,

1:03:31

the way the options interact, there is

1:03:33

an option to lose data. And

1:03:35

I was surprised to see

1:03:37

how two simple configurations like this

1:03:39

can cause data loss.

1:03:42

And so retention policy and

1:03:44

its effect on the data loss, the

1:03:48

explaining like how to add storage to

1:03:50

a cluster, how it differs between on-prem

1:03:52

and on the cloud, the

1:03:54

fact that when you're

1:03:57

on-prem, this is not only a technical

1:03:59

decision, it's a managerial or

1:04:01

financial decision because you can't tell the

1:04:03

owner of the data center. Okay, I

1:04:05

made a mistake. I don't need the

1:04:07

two terabyte disks. I need four terabyte

1:04:09

disk. So I need to throw

1:04:11

away all the two terabyte disks and buy four terabyte disks.

1:04:15

So it will not go smoothly. So

1:04:18

understanding these aspects, like

1:04:21

this chapter became partially technical

1:04:23

and partially how

1:04:25

as a provider for

1:04:28

an on-prem customer, how

1:04:30

you manage this issue. Again,

1:04:34

in DIMS, it's the same issue.

1:04:36

Okay, it was a mistake. I

1:04:38

bought 16 gigabyte DIMMs, and I need 32

1:04:41

or 64. How

1:04:43

do you pass this decision? How do you

1:04:45

mix DIMMs on-prem? And

1:04:49

also the effect of

1:04:51

the retention on data replays. Sometimes

1:04:53

you need to replay the data

1:04:55

because you did some wrong transformations.

1:04:58

So what's the effect? Like, customers,

1:05:01

ops teams, need to understand that

1:05:04

they need storage also for

1:05:07

replay. The

1:05:09

data skew, how data skew can affect data

1:05:11

loss. Even if you have a lot of

1:05:13

storage, if you don't partition the

1:05:16

data correctly, then you will

1:05:18

get data loss at some point, even in one

1:05:20

partition. And for a certain

1:05:22

consumer, this is like data loss in all

1:05:24

partitions. You need to replay the data again.

1:05:27

So the data aspects were also something

1:05:29

that I learned along the way.

1:05:34

So this is only an example of

1:05:37

one chapter, of the issues discussed in the

1:05:39

chapter on storage usage. But

1:05:42

not only that I learned during

1:05:44

the writing the book, I

1:05:46

think that if I wouldn't write the

1:05:49

book, I would forget almost everything. So

1:05:53

for me, the

1:05:55

personal benefit for me

1:05:58

is that I

1:06:00

remember stuff, really Kafka

1:06:03

related stuff that I knew and

1:06:06

I didn't forget them, but also that I

1:06:08

learned along the way. So

1:06:12

I think that investing 10 months

1:06:16

during the weekends, every

1:06:19

weekend for that period, to

1:06:21

write the book was beneficial

1:06:25

for my technical knowledge. Absolutely.

1:06:29

Are there any other aspects of the

1:06:31

work that you've done with Kafka, your

1:06:33

work on the book, the overall Kafka

1:06:36

operations ecosystem that we didn't discuss yet that you'd

1:06:38

like to cover before we close out the show?

1:06:42

I think we covered a lot

1:06:45

of technical stuff. We

1:06:47

can discuss the cost

1:06:49

reduction, but

1:06:52

in very short, like I would like to mention

1:06:54

that there

1:06:57

is a chapter on cost reduction

1:06:59

in Kafka, but it

1:07:01

relates to like I brought

1:07:04

six examples, six real world

1:07:06

examples of cluster that

1:07:08

I stumbled upon, where per

1:07:12

each example, I specify how

1:07:15

much

1:07:17

CPU, RAM, and disk each cluster

1:07:19

has and

1:07:22

also the usage of each of

1:07:24

these resources. And then

1:07:26

I ask whether the cluster can be scaled

1:07:28

down or scaled in. And

1:07:31

then I discuss other metrics,

1:07:35

monitoring metrics and

1:07:37

by correlating these Kafka

1:07:39

monitoring metrics and the operating

1:07:42

system metrics usage, I

1:07:45

give recommendation regarding whether you

1:07:47

can scale in or scale

1:07:49

down the cluster. And

1:07:52

I think this

1:07:55

is like the

1:07:57

cost of the Kafka cluster. It's

1:08:00

not big, I think, compared to

1:08:02

other clusters in an organization. But

1:08:05

because for cloud-based,

1:08:10

I assume that most of the

1:08:12

deployments are on demand. So

1:08:16

even if you have reservation, again, it's on

1:08:18

demand. It's not spot. So it's important, especially

1:08:21

in today's market, to

1:08:24

squeeze every penny that you

1:08:26

can save. So

1:08:30

the cost reduction part is

1:08:32

something that can help

1:08:35

to reduce costs on

1:08:37

Kafka. But

1:08:40

there is a part that I didn't talk

1:08:42

about, which might be a

1:08:45

bigger part, even than the machines

1:08:47

themselves, which is

1:08:49

the data transfer between consumers

1:08:51

and the brokers. Because

1:08:54

if there is no rack

1:08:56

awareness in the cluster, then

1:08:59

consumers will read data only

1:09:02

from leaders. And these

1:09:04

leaders can be... Most

1:09:08

of the leaders statistically won't be in

1:09:10

the same AZ. And

1:09:15

for some companies, this can save hundreds

1:09:18

of thousands of dollars per year

1:09:20

configuring rack awareness. But

1:09:22

since I don't have any experience with rack

1:09:25

awareness, I didn't

1:09:27

discuss it thoroughly. But

1:09:31

for those who

1:09:34

listen, checking if you can

1:09:36

configure rack awareness in

1:09:38

your consumers and

1:09:42

brokers, sorry, this can

1:09:44

be beneficial. You need to check your

1:09:46

data transfer cost. And

1:09:49

maybe it's worthwhile for you to

1:09:51

invest in deploying

1:09:55

and testing and

1:09:57

validating the rack awareness. And

1:10:00

then you will read not from the leaders,

1:10:02

but from the closest replica and

1:10:05

save on data transfer cost.
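
A sketch of what that looks like in practice; this is KIP-392 follower fetching, and the addresses and AZ name are hypothetical. The brokers must set broker.rack and replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector in server.properties; the consumer then declares its own rack:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "billing",
    "client.rack": "us-east-1a",   # match the consumer's own AZ so it can
})                                 # fetch from the closest replica, not the leader
consumer.subscribe(["events"])
```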

1:10:08

Yeah, that can definitely be a substantial cost when running

1:10:10

in the cloud and that's always one of those

1:10:13

surprise gotchas when you're first getting up

1:10:15

and running in a cloud environment. Not

1:10:20

only when you're getting started, but also

1:10:22

after

1:10:25

years, then you see the

1:10:27

data transfer. This brings

1:10:29

us back to what we started with

1:10:31

regarding cloud versus on-prem. There

1:10:34

are a number of reasons

1:10:36

why it becomes,

1:10:39

I don't know,

1:10:41

like, for some clusters it

1:10:43

might be a good idea to check

1:10:45

it. Like, I'm

1:10:49

saying it because I came from on-prem, so

1:10:52

it's not foreign to me. And,

1:10:55

okay, so you don't have managed

1:10:57

services, but it

1:11:00

might be, from

1:11:02

the financial

1:11:05

perspective a valid

1:11:07

choice for some clusters to

1:11:09

be deployed on-prem.

1:11:13

All right, well for anybody who wants to get

1:11:16

in touch with you and follow along with the

1:11:18

work that you're doing I'll have you add your

1:11:20

preferred contact information to the show notes and as

1:11:22

the final question I'd just like to get your

1:11:24

perspective on what you see as being the biggest

1:11:26

gap in the tooling or technology that's available for

1:11:29

data management today. Also,

1:11:32

I usually work with analytics

1:11:37

clusters and

1:11:39

I think that if

1:11:44

there was a tool, the

1:11:46

tool would show at

1:11:48

any given point of time correlation between

1:11:50

the traffic whether

1:11:53

it's the incoming traffic or

1:11:55

a query or query load.

1:11:58

So correlating with between the load

1:12:01

on the cluster and

1:12:04

the usage of the cluster

1:12:07

in terms of CPU,

1:12:10

RAM, disk, or

1:12:12

even internal usage. For example, let's

1:12:14

say a Druid cluster that uses,

1:12:18

sometimes the bottleneck is the number of workers

1:12:21

or, for Trino clusters, the number

1:12:23

of query splits. So if

1:12:25

there was something that some tool that would

1:12:27

show correlation between the load on

1:12:30

the cluster and the real usage

1:12:32

and cost, it

1:12:35

would allow ops team to

1:12:39

better understand whether they can save

1:12:42

cost on the cluster, whether they can scale

1:12:44

it down or maybe

1:12:47

replace on-demands with spots or

1:12:49

maybe replace on-demand reservation with

1:12:51

on-demands without reservation and then

1:12:53

auto-scaling them. So something

1:12:55

that we show, some tool that will show

1:12:57

correlation between usage,

1:13:00

applicative usage and resource

1:13:02

usage and will enable to save

1:13:05

cost because

1:13:07

especially today, in today's

1:13:10

economy, it

1:13:14

becomes pretty important to save cost.

1:13:17

All right, well, thank you very much

1:13:19

for taking the time today to join

1:13:21

me and share your experiences of running

1:13:24

and operating Kafka clusters and the work

1:13:26

that you've done on the book to

1:13:28

make that easier for everybody else to

1:13:30

do as well. It's definitely a very

1:13:32

challenging and necessary task. And

1:13:35

as you said, Kafka is very widely deployed.

1:13:37

So I appreciate the time and energy that

1:13:39

you put into sharing your hard-won knowledge with

1:13:41

everyone else. And I hope you enjoy the

1:13:43

rest of your day. Cool,

1:13:45

thank you very much. Again, thank you for hosting

1:13:48

me and

1:13:51

I hope that the audience will

1:13:53

gain something from this podcast.

1:13:57

Thank you. Thank

1:14:02

you for listening. Don't forget to check

1:14:05

out our other shows, Podcast.__init__, which covers

1:14:07

the Python language, its community, and the

1:14:09

innovative ways it is being used, and

1:14:11

the Machine Learning Podcast, which helps you

1:14:13

go from idea to production with machine

1:14:16

learning. Visit the site at dataengineeringpodcast.com, subscribe

1:14:18

to the show, sign up for the

1:14:20

mailing list and read the show notes.

1:14:23

And if you've learned something or tried out a product from the

1:14:25

show, then tell us about it. Email

1:14:27

hosts at dataengineeringpodcast.com with your

1:14:30

story. And to help other people

1:14:32

find the show, please leave a review on Apple

1:14:34

Podcasts.
