Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

Released Monday, 11th December 2023

Episode Transcript

Transcripts are displayed as originally observed. Some content, including advertisements may have changed.


0:11

Hello, and welcome to the Data Engineering

0:13

Podcast, the show about modern data management. Introducing

0:17

RudderStack Profiles. RudderStack

0:19

Profiles takes the SaaS guesswork and SQL

0:21

grunt work out of building complete customer

0:24

profiles so you can quickly ship actionable,

0:26

enriched data to every downstream team. You

0:29

specify the customer traits, then Profiles runs

0:32

the joins and computations for you to

0:34

create complete customer profiles. Get

0:36

all of the details and try the

0:39

new product today at dataengineeringpodcast.com/rudderstack.

0:42

You shouldn't have to throw away the database

0:44

to build with fast-changing data. You

0:46

should be able to keep the familiarity of

0:48

SQL and the proven architecture of cloud warehouses

0:51

but swap the decades-old batch computation model for

0:53

an efficient incremental engine to get complex queries

0:55

that are always up to date. With

0:58

Materialize, you can. It's the only

1:00

true SQL streaming database built from the ground up

1:02

to meet the needs of modern data products. Whether

1:06

it's real-time dashboarding and analytics, personalization

1:08

and segmentation, or automation and alerting,

1:10

Materialize gives you the ability to

1:12

work with fresh, correct, and scalable

1:14

results, all in a familiar SQL

1:16

interface. Go to dataengineeringpodcast.com/Materialize

1:19

today to get two weeks

1:21

free. And

1:25

now bringing us to the Anomstack project, you

1:27

said that some of its origin comes from

1:29

the work that you're doing at NetData. But

1:31

I'm wondering if you can just give an

1:33

overview about what it is that you've built,

1:35

some of the story behind how it came

1:37

to be, and why you decided that you

1:40

wanted to make it as accessible and approachable

1:42

as possible. Yeah, so probably

1:44

primarily it's because I've had to build

1:46

versions of this in every job I've

1:48

been in for the last 10 years.

1:51

It's always been kind of custom every time,

1:53

and a little bit, you know, not very,

1:55

very custom and specific to whatever infrastructure or

1:58

data stack you're using. Nowadays,

2:00

there's a lot of open source projects and tools that

2:02

we can build on. And I just felt

2:04

like the time is right now to actually save myself

2:07

from building it the next time for the next five

2:09

years, I should just build a project that I can

2:11

open source and see if I can get some contributions

2:13

around. And so the

2:15

idea there is this is focusing on smaller

2:18

teams, smaller data

2:20

operations, give

2:22

them a simple way to just bring their

2:24

metrics and get really decent

2:27

anomaly detection out of the box, basically. And

2:29

in terms of the

2:31

term metrics, given

2:33

your background at NetData, that

2:36

makes me think about metrics from

2:38

an operations and infrastructure standpoint about

2:41

what is the CPU load, what

2:43

is the available memory. But

2:45

the term metrics in the data ecosystem has

2:48

also become overloaded with this idea of the

2:50

semantic layer and business metrics. And what does

2:52

it mean for somebody to be a customer?

2:54

And I'm wondering if you can maybe give

2:57

some sense about how you're thinking about metrics

2:59

in the context of Anomstack and

3:01

the ways that it can be applied. Yeah,

3:03

so actually, metric trees is another

3:06

thing I've seen recently. There's a lot of talk

3:08

around metric trees and building these relationships on the

3:10

metrics. The main goal is

3:12

simplicity. And so there is

3:14

lots of different metric concepts in the

3:16

observability space. But we're

3:19

not using that here necessarily. So the

3:21

definition of a metric, basically, is a

3:24

row on your data frame or a

3:26

row on your database in the metrics

3:28

table, where it's literally just a metric

3:30

name, timestamp, and value. And that's it.

3:33

So that's kind of the idea there is this

3:35

makes it really easy for users. That's all a

3:37

user has to produce in these three fields.
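
As a rough illustration of that shape, a metric is just a row like the ones below (made-up names and values, following the three-field convention described here):

    import pandas as pd

    # Hypothetical rows in a long-format metrics table: one metric
    # observation per row, nothing more than these three fields.
    metrics_df = pd.DataFrame(
        {
            "metric_name": ["daily_sales", "daily_sales", "signups"],
            "metric_timestamp": pd.to_datetime(
                ["2023-12-09", "2023-12-10", "2023-12-10"]
            ),
            "metric_value": [1250.0, 1311.5, 42.0],
        }
    )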

3:41

And so we're not going too fancy

3:43

in terms of complex metric definition, because

3:46

that just adds kind of a little

3:48

bit more of a ramp for people

3:50

to actually use the system. So

3:53

there's pros and cons to each, of course,

3:55

like in observability. And you have all these

3:57

concepts in tools like Prometheus. different

4:00

types of metrics and how you work dimensions in and

4:02

stuff like that. But for our case, for the Anomstack

4:04

idea, it's just keep it as simple as possible, basically,

4:06

to begin with. And that also makes

4:09

it very flexible because if you

4:12

don't necessarily have a constrained

4:14

definition of what that metric can be and

4:16

what it's supposed to mean, then that means

4:18

that everybody can map it to whatever semantic

4:21

attribute they want it to in order to

4:23

determine what are the anomalies and how does

4:25

that impact whatever it is that I'm trying

4:27

to measure. Yeah, and this is kind

4:29

of actually something that I have got on the

4:31

roadmap for the project is to extend

4:34

a little bit so that when you're defining the metric,

4:36

you also define some metadata. Obviously,

4:38

the first thing being like a metric description,

4:40

say, and because the idea there is actually

4:43

if we could do that, even if you

4:45

had a useful description, that

4:47

would help a lot more with then like the

4:50

saliency of the anomalies because an anomaly is an

4:52

anomaly, but whether it's something you care about or

4:54

not is a different question. And so if we

4:56

can get some of this metadata, like maybe things

4:59

like priority P1, P2, or whatever different tags you

5:01

want, you could

5:03

obviously then do different kinds of routing. You can

5:05

route the alerts differently. But actually, like longer term,

5:07

I'm thinking that there could be something where this

5:09

could be something that large language models

5:11

could obviously use as well. So if they had this

5:14

kind of rich metadata that they could make sense of,

5:16

that could also be useful in terms of, you

5:18

might say, oh, how's my, what are my

5:20

anomalies in sales today? And the fact that you

5:22

have all this stuff in the descriptions would make that

5:25

a lot easier. So previously with the semantic, all the

5:27

semantic stuff was good, but there's a lot of overhead

5:29

to maintain it, you have to agree

5:31

on your structure up front and implement it. Whereas

5:33

if we just allow some kind of free

5:35

texty, more higher level stuff, there's

5:37

definitely roles where I think language models could help

5:40

make sense of it as well in terms of sorting

5:42

through the metrics. Yeah. And giving you

5:44

some human level understanding about, is this

5:47

something that you actually care about? Yeah,

5:49

that's always the problem because oftentimes

5:52

systems like this, you end up with thousands of

5:54

metrics. And the idea is, we want metrics to just

5:56

be like cattle, you don't have to think about them.

5:59

They're not special. just produce

6:01

your metrics, metrics, metrics. And then that's great because then you

6:03

have all these metrics but then the problem can be how

6:05

do you make sense of it when you maybe have a

6:08

hundred alerts today and maybe 50

6:10

of those alerts are on metrics that, they're

6:12

nice to know but they're not that important.

6:15

And so it's things like that where if you could

6:17

have each of these alerts be like a little insight

6:19

snippet, you could actually maybe have a language model make

6:21

sense of it or ultimately longer term, if you had

6:24

a sort of a feedback loop on top of the

6:26

system, like an on stack where you could give thumbs

6:28

up, thumbs down to sort of try and start measuring

6:30

saliency of like, okay, what do people care about more

6:32

than average that then kind of could become a whole

6:35

different layer on top of it, but that's an open

6:37

problem. I don't think anyone's really solved that yet, to

6:39

be honest. Absolutely. And also even

6:41

if some single metric is anomalous,

6:43

it maybe doesn't matter unless it's

6:46

correlated with another anomaly in

6:48

a different metric. And it's

6:50

that conjunction of anomalies across

6:52

different metrics series, or maybe even across different

6:54

service boundaries that will let you know, oh,

6:57

hey, there's actually something really wrong here. You

7:00

need to do something about it. Yeah, yeah. And that's it.

7:02

That's like another, that's something that I've seen

7:04

some people do really well. So Anodot

7:06

is another tool I've used in the past

9:08

for anomaly detection. And they do a really

7:10

good job of this where they stack all the

7:12

alerts together into sort of, so each alert becomes

7:14

like a stack of alerts. And then you can

7:16

kind of quickly, really quickly see based on the

7:19

map of like, okay, what's making up this batch

7:21

of alerts basically. That's something

7:23

that I would like to add in the future as

7:26

well actually could be really interesting. And

7:28

you mentioned that one of your objectives of building

7:30

this project and releasing it as open source is

7:32

so that you don't have to build it again

7:35

in whatever future role you have. I'm wondering if

7:37

you can just give an overview about what are

7:39

the core objectives that you have and what are

7:41

the things that you would like to see come

7:44

out of this project and some of the direction

7:46

that you'd like to see it taken in. Yeah,

7:49

so the main objective is just have

7:51

a nice, easy open source solution for

7:53

people to get good anomaly detection on.

7:55

Typically business metrics is what I have

7:58

in my head here, and

8:01

low overhead. And then, so

8:03

if you're like someone that's kind of, you

8:06

don't necessarily have to be an infrastructure engineer, just

8:08

technical enough to maybe, you bring your own SQL

8:11

to define the metrics or you can define custom

8:13

Python functions to define the metrics as well. But

8:15

the idea is like, you could be a business

8:17

analyst who can actually just bring your metrics and

8:19

then actually stand this up yourself. And

8:22

it's just a Docker container. So that's

8:24

the main idea is like, keep it

8:26

as easy as possible for like smaller

8:28

teams that either can't afford bigger

8:30

expensive SaaS solutions, or

8:33

they don't necessarily have the

8:35

time or expertise to like build their own custom

8:38

solution. They can just use a tool like this and

8:40

get decent enough anomaly detection on all your metrics

8:42

out of the box. That's the

8:44

main aim. And for people

8:46

who are interested in being able to get

8:48

these alerts and understand, okay, I've got lots

8:50

of metrics. I don't wanna have to care

8:52

about them and keep a close eye on

8:55

them. I just want something to let me

8:57

know when there are things going wrong. What

8:59

are some of the other tools or products that

9:01

they might be evaluating when they

9:04

come across Anomstack and what are the aspects

9:06

of Anomstack that might sway them in its

9:08

favor? Yeah, so there's lots

9:10

of, there's kind of, there's

9:12

a couple of different solutions here, a couple

9:14

of different approaches. There's like vendors who I've

9:16

actually used in the past. Anodot is probably

9:18

the biggest and the oldest player here. Like

9:21

they really go and anomaly detection across all

9:23

types of metrics. And I'm

9:25

not up to date on their latest stuff. It was a few

9:27

years ago that I used them and they've done a

9:29

lot since as well. And so these are like services

9:31

that you pay for in an enterprise setting. They're very

9:34

expensive and there's a bit of configuration involved, but once

9:37

they're up and running, they're good. And

9:39

then there's also lots of like

9:41

newer SaaS type startups in

9:44

the kind of modern data stack space and era that we're

9:46

in. So Chaos Genius

9:48

is another one there that's actually, I've been looking at

9:50

recently that's pretty good and pretty cool. But

9:52

there's also then the other approach there. A

9:55

lot of the data warehouses now are starting to build

9:57

some of these ML features into their, into

10:00

their stacks themselves. So like Snowflake, BigQuery,

10:03

they all actually now typically have their

10:05

own anomaly detection functions

10:07

and ML functions that you

10:09

can train models and save models just

10:11

within your SQL. That's another option as

10:14

well. If you're using a platform like this, you can always,

10:16

of course, try and... It's a little bit easier now to

10:18

try and roll your own because

10:21

you can do a lot of it now in SQL

10:23

itself. And then the other vendors, like Metaplane is actually

10:25

one I've used as well. Metaplane is pretty cool. It's

10:27

a little bit more focused on the data

10:30

engineer and data ops side of

10:32

the metrics. But you can tweak some of

10:34

these things to also cover business metrics as

10:36

well. And digging

10:39

more into that concept of the

10:41

business metrics and being able to

10:43

generate alerts and detect when there

10:45

are anomalies, I guess that's

10:47

another vague term that might

10:49

be worth digging further into is that idea

10:52

of anomalies and what makes something actually anomalous.

10:54

Is it just because it is

10:56

two standard deviations away from the mean? Is

10:58

it because there's something, some specific

11:00

rule that you have that this value

11:02

can never exceed this threshold? I'm wondering

11:05

what are some of the specific types

11:07

of anomalies that you're looking to address

11:09

and alert on and some of the

11:11

ways that people need to be thinking

11:13

about how to understand when something is

11:15

actually anomalous versus just a little bit

11:17

weird. Yeah, yeah, that's a good point. And

11:19

this is kind of, I'm a little bit obsessed with

11:21

anomaly detection, to be honest, because it's one of those

11:23

areas of machine learning and data

11:25

science that still has, there's another kind of

11:28

art and science involved in

11:30

it. So there's a lot

11:33

of subjective decisions as to like, well, does this

11:35

look anomalous to you? It does to me. And

11:37

it's not as easy as just doing something like

11:42

regression or classification, where you have a simple

11:44

metric like accuracy. In anomaly detection, you don't

11:46

have any metrics like this that you can

11:49

use as a source of truth. So it's

11:51

a little bit subjective. And so

11:54

that's one of the reasons why we use good

11:57

defaults, basically. So we're using PyOD, which is

11:59

an old open source project around

12:01

anomaly detection. And basically

12:03

we have defaults there to use

12:05

like as flexible a model by

12:07

default as possible. So it's using,

12:09

you know, best practice standard sensible

12:11

things around feature pre-processing. And then

12:14

it's using like a PCA based

12:16

anomaly detection model, which is more

12:18

flexible and it'll cover more types

12:20

of anomalies as opposed to say,

12:22

if you just have single spikes,

12:24

you know, okay, they're the obvious ones that people

12:26

always think of, but sometimes it's basically like instead

12:28

of a single spike, is it a strange little

12:30

squiggle that's changed recently? Or is it an increase

12:32

in trend and different, a different wider kind of

12:35

cast the net as wide as possible? And

12:37

so that's why we're using PYOD with like

12:39

a general, you know, flexible model underneath, but

12:42

there's also then of course, like if you're

12:44

a user, you can define your own pre-processing

12:46

functions or you can define your own

12:48

model as well. So you can, if

12:50

you wanted to, you can extend it to be like,

12:53

maybe you know, for instance, if it's say, well,

12:55

okay, this metric here is daily sales

12:58

and you actually know that there's a big impact on

13:00

whether it's the weekend or whether

13:02

it's the, you know, the weekday say, or even

13:04

time of day. So you could actually build your

13:06

own, your own pre-processing function to say, okay, I

13:09

wanna like, when it's the weekend, I want it

13:11

to be, you know, is weekend equals one. And

13:14

then when it's during the week, is weekend equals zero.
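
A minimal sketch of what such a custom pre-processing function might look like (hypothetical, assuming it receives and returns a pandas DataFrame of one metric's history):

    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        # df is assumed to carry metric_timestamp and metric_value columns.
        df = df.copy()
        ts = pd.to_datetime(df["metric_timestamp"])
        # Add an is_weekend flag: 1 on Saturday or Sunday, 0 during the week.
        df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
        return df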

13:16

And you can then pass that through to the model

13:18

to use that as a feature. So it can

13:21

get quite sort of, it can

13:23

depend a lot on exactly how you

13:25

want to do it, but the idea here

13:27

with, you know, the Anomstack approach is to

13:29

like use as general and sensible a default

13:31

as possible that, you know, will cover

13:33

all metrics reasonably well. And then if you want to

13:35

kind of go more complex, you can, but

13:38

yeah, it can get quite sort of subjective and

13:40

complicated in terms of, you know, what

13:42

is an anomaly or not.
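
As a hedged sketch of that kind of default, here is PyOD's PCA detector fitted on historical feature vectors and used to score a new observation (the exact defaults and pre-processing in Anomstack may differ):

    import numpy as np
    from pyod.models.pca import PCA

    # X_train: historical feature vectors for one metric; X_recent: the
    # latest observation(s) to score. Values here are synthetic.
    X_train = np.random.default_rng(0).normal(size=(500, 3))
    X_recent = np.array([[0.1, 0.2, 5.0]])

    model = PCA()                               # flexible, general-purpose detector
    model.fit(X_train)
    scores = model.decision_function(X_recent)  # higher means more anomalous
    labels = model.predict(X_recent)            # 1 means flagged as an anomaly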

13:46

Data projects are notoriously complex with

13:49

multiple stakeholders to manage across varying

13:51

backgrounds and tool chains. Even simple

13:53

reports can become unwieldy to maintain.

13:56

Miro is your single pane of glass where

13:58

everyone can discover, try, and collaborate on

14:01

your organization's data. I especially

14:03

like the ability to combine your

14:05

technical diagrams with data documentation and

14:07

dependency mapping, allowing your data engineers

14:10

and data consumers to communicate seamlessly

14:12

about your projects. Find

14:14

simplicity in your most complex projects with Miro.

14:17

Your first three Miro boards are free

14:19

when you sign up today at dataengineeringpodcast.com

14:22

slash Miro. That's

14:25

three free boards

14:27

at dataengineeringpodcast.com/M-I-R-O. Digging

14:33

now into the idea of

14:35

metrics definition and identifying what are the metrics

14:37

that you should care about, what are the

14:39

metrics that are useful to be alerted on,

14:41

what are some of the

14:43

ways that data teams or operations

14:46

teams should be approaching that question

14:48

and thinking about how do I

14:50

decide what are the metrics that are actually

14:53

going to matter, what are the ones that

14:55

will give me a useful signal of something

14:57

needs to be addressed and it's going to

14:59

have some sort of business impact versus just,

15:01

hey, it might be neat to know about this

15:03

thing. Yeah, yeah. Typically,

15:07

what metrics are you reporting to your senior

15:10

management basically? Start with them. So

15:12

there's typically like there's business metrics where you

15:14

start with that. Typically

15:16

they're obviously headline business metrics like

15:19

users, payments, sign-ins.

15:22

Depending on your business, they're usually pretty

15:24

obvious, the main bright stars. And

15:27

then there's also technical metrics as well. So we

15:29

use sometimes a lot of technical metrics for underneath

15:32

the stuff for the health of the app

15:34

itself and things like that. But

15:36

generally, it should

15:39

be obvious and if it's not obvious, then

15:41

it's probably a question for, okay, well, maybe

15:43

this isn't a metric I should

15:45

use. The way I think

15:47

about things though is that

15:49

metrics are increasing, everything has become a time

15:51

series as you have more and more data

15:54

and metrics are becoming just more and

15:56

more commonplace. So it's okay to

15:58

have lots and lots of metrics. It's just...

16:00

that you want to have like priority one

16:02

level of metrics, priority two level of metrics.

16:04

So you can kind of embrace the messiness

16:06

of like, okay, we've got loads of metrics

16:08

across all these other types of business objectives,

16:11

secondary objectives, we'll put them in a different bucket

16:14

than you put your main kind of executive level

16:16

metrics. And they obviously

16:18

then would, they get a special route

16:20

when they go off versus

16:22

when all the other metrics go off. Because

16:24

then it's like, okay, well, if the

16:26

P1 metric alerts, I want

16:29

that to go straight into the Slack or I want that

16:31

to email me straight away. But then I also want like

16:33

all the other metrics that are like lower priority

16:35

or lower interest, maybe every now and then I

16:37

want to just open up that inbox and

16:40

browse through those kind of, read the newspaper

16:42

as such to see. And

16:44

that's very useful as well then because, if

16:47

you have a good anomaly detection system, it almost becomes

16:49

like a BI tool in that sense as well. And

16:51

that it's actually uncovering insights and you can quickly, it's

16:53

then more just about the UI UX, like can you

16:55

quickly scan 50 alerts and

16:59

see, oh, there's one thing there that it's actually might be interesting.

17:03

That's like that's gold dust if you can get that

17:05

in terms of an insight. Because otherwise you would have

17:07

had to pre-configure a dashboard and maybe it's in, it's

17:09

in some dashboard in the second tab and the third,

17:12

then the second quarter of the page. And you have

17:14

to get so lucky that your eyeball happens to land

17:16

on that chart. That's just, it's not

17:18

really a scalable approach to analytics, especially in this

17:20

day and age when there's just so much more

17:23

data. So that's the other flip side of

17:25

it as well. It's like, it's more about

17:27

sort of how you make, how you route the insights that

17:29

you get from these two. And

17:32

before we dig too much further into

17:34

the implementation of a NOM

17:36

stack, another thing that I noticed

17:38

as I was reviewing the project is that you

17:40

put in a lot of effort to make it

17:42

as easy to get up and running and get

17:44

started with and evaluate as possible with including

17:47

out of the box pipelines for

17:49

Dagster, having a GitHub Codespaces available.

17:52

I forget what the other options

17:54

were, but it was just very

17:56

much a, I really want you to

17:58

use this thing. And I'm wondering, what

18:01

was the impetus for putting in all

18:03

of that effort? And what

18:05

are some of the ways that that focus

18:07

of making it easy to adopt, making it

18:09

easy to test out influenced the overall design

18:11

of the project and the ways that you

18:14

were thinking about how to architect it so

18:16

that it was easy to adopt and implement.

18:19

Yeah, so main kind of consideration there is to

18:21

try and keep it like as easy as possible

18:23

in terms of like, it's not over engineered at

18:25

all. Basically under the hood, when you look into

18:27

it, everything is like a pandas data frame that's

18:30

moving around. So I kind of wanted it, basically

18:32

build it for a version of myself maybe 10 years ago,

18:35

who was like, instead of back then, I

18:37

had to like stand up my own airflow VM and

18:40

come up with all the data engineering part

18:42

of it. If I can actually just Docker

18:44

compose up and then just focus on the

18:46

SQL and the metrics, then I'd be really

18:48

happy. And that's kind of what the aim

18:51

is here, is that you can easily run

18:53

through Docker or even serverless, DAGS or cloud

18:55

is really cool as well, the way they

18:57

have an integration on GitHub and it'll just

18:59

automatically deploy to Dagster Cloud. So you

19:01

don't even have any sort of operations. Then

19:04

you can just focus on a PR to

19:06

add new metrics or as your metrics evolve. It's

19:08

all kind of a GitHub-type approach. And

19:11

the idea was there like, ideally I'd love to have,

19:13

it's still quite early on in the project. So I've

19:15

only been working on it kind of a month or

19:17

two. And the plan is kind

19:19

of have users that actually use it, could

19:21

also then become contributors as well, and

19:24

so lower the barrier to contribution as

19:26

much as possible as well. So that's

19:28

why we're kind of, all the concepts

19:30

are very straightforward and very simple. And

19:33

that's the idea, like is to actually have users that

19:35

can use it. And also like if they wanna make

19:37

an improvement, for sure, like, yeah, get

19:39

involved, make a PR, it'd be great, you know? So

19:41

that's the idea is to actually have users and contributors.

19:45

In terms of the implementation and

19:47

as you were defining the

19:49

scope of the project and thinking through, okay,

19:51

I want to have this open source anomaly

19:53

detection stack so that I don't have to

19:56

rebuild it over and over again. What

19:58

are the core capabilities? and

20:00

constraints that you were focused on

20:02

that informed the final implementation of

20:05

what you have built so far.

20:08

Yeah, so I actually originally started

20:10

with an anomaly

20:12

detection provider in Airflow. So

20:14

we use Airflow and I

20:18

built an anomaly

20:20

detection Airflow provider package. That's

20:22

also in the Airflow registry with

20:24

the astronomer folks. And that

20:27

works. So if you're using Airflow, that's

20:30

one approach. But I was

20:32

thinking as I was doing it, I was kind of thinking, well,

20:35

this kind of depends on Airflow. And it's a

20:37

bit silly for people to have to then stand

20:39

up Airflow to do anomaly detection. So I

20:41

wanted it to be more standalone. And so I also

20:43

was aware, like at the time, a lot of

20:45

these data orchestration tools, there's so many options and

20:47

they're all great now. So

20:50

the approach there was actually

20:52

OK. I want to have a flexible

20:56

enough general simple orchestration

20:58

tool and then also use,

21:00

you know, PyOD to do all the ML stuff.

21:02

So it's basically putting all the ingredients together

21:04

into this little app approach that's kind of

21:07

fairly easy to stand up, fairly easy to reason about.

21:09

And that's

21:12

the main aim is to actually have as

21:14

few moving parts as possible and just

21:17

get what we need for decent enough, you

21:19

know, anomaly detection alerts into your inbox. That's

21:22

the north star. And

21:24

now as far as the actual

21:27

implementation, the architecture, wondering if you

21:29

can describe how you implemented

21:32

Anomstack and some of the ways

21:34

that you optimized for these particular

21:36

design constraints that you mentioned. Yeah,

21:40

so I had a look at

21:42

a few different orchestration platforms, basically.

21:44

And it was a good

21:46

excuse. I'd been aware of Dagster, but

21:48

I hadn't really used it that much. I'd

21:51

mostly been used to Airflow and, you

21:53

know, other things like serverless options in

21:56

GCP and AWS. And so I

21:58

had a look at Dagster and actually Dagster seems

22:00

almost perfect because, well,

22:02

they have an approach called software-defined

22:05

assets. That's like a really interesting approach that they

22:07

have. But actually a step underneath that is

22:09

basically just jobs. And a

22:11

job is the core kind

22:13

of building block here. So when the user

22:15

defines their metrics, a metric batch, basically,

22:18

then Anomstack will just trigger the jobs

22:20

for it, the four main jobs: there's a job

22:22

to ingest, a job to train, a job

22:24

to score, and a job to alert.
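
Purely as an illustration of that flow, here is a toy sketch of four such steps handing a pandas DataFrame along (the real jobs are orchestrated by Dagster and the real model is a PyOD detector, so this is only a stand-in):

    import pandas as pd

    def ingest() -> pd.DataFrame:
        # Stand-in for running the metric batch's SQL or Python function.
        return pd.DataFrame({
            "metric_name": ["daily_sales"] * 3,
            "metric_timestamp": pd.to_datetime(["2023-12-08", "2023-12-09", "2023-12-10"]),
            "metric_value": [1200.0, 1250.0, 5000.0],
        })

    def train(history: pd.DataFrame):
        # Toy "model": just the mean and standard deviation of the history.
        return history["metric_value"].mean(), history["metric_value"].std()

    def score(latest: pd.DataFrame, model) -> pd.DataFrame:
        mean, std = model
        latest = latest.copy()
        latest["anomaly_score"] = (latest["metric_value"] - mean).abs() / std
        return latest

    def alert(scored: pd.DataFrame, threshold: float = 2.0) -> pd.DataFrame:
        # Rows above the threshold would be routed out (email, Slack, and so on).
        return scored[scored["anomaly_score"] > threshold]

    metrics = ingest()
    fitted = train(metrics.iloc[:-1])
    flagged = alert(score(metrics.tail(1), fitted))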

22:26

And so the main kind of

22:28

concept here is you bring your configuration

22:30

and then the tool itself will

22:32

do the orchestration. And then

22:34

also use, you know, PyOD for the ML stuff

22:36

as well. So like it kind of, it's more

22:39

mainly like putting together these recipes of

22:41

different ingredients that are already out there in the

22:43

ecosystem. And that's, that's kind of what

22:45

the culmination is. From the time

22:47

that you first started building this project

22:49

to where you are now, I'm wondering

22:52

what are some of the ways that

22:54

the overall goals and implementation have evolved

22:56

and maybe some of the dead ends

22:58

that you explored and ultimately discarded. Yeah,

23:01

actually, one of the dead ends that

23:03

I was almost, like, I was

23:05

kind of joking about, but we have

23:07

implemented an LLM alert

23:09

job itself. So instead of the

23:12

PyOD ML models for the anomaly

23:14

detection, we actually have an

23:16

LLM alert job that you can enable,

23:18

which basically just sends the data

23:21

to GPT and asks, does it look anomalous?
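
As a rough sketch of that idea (a hypothetical prompt and model name, using the OpenAI Python client; not the exact wiring of the LLM alert job):

    from openai import OpenAI

    def llm_alert(metric_name: str, values: list[float]) -> str:
        # Hand the recent time series to the model and ask for a judgement.
        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
        prompt = (
            f"Here are recent values for the metric '{metric_name}', oldest to "
            f"newest: {values}. Does the latest value look anomalous? "
            "Answer yes or no and explain briefly."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content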

23:23

And it's kind of more of a curiosity, because

23:25

it's a good, it's a

23:27

good example of like where the limits are in

23:29

terms of language models, because I wanted to see

23:31

like, how actually useful can it, can it be,

23:34

you know, getting sense back from the language model

23:36

and time series, time series data is still a

23:38

bit sort of at the edge of what LLMs

23:40

are really able to do really well. And

23:42

so it was kind of fun playing around with that. There's

23:44

a lot of iterations of doing it, like I started

23:46

with as minimal an approach as possible, send the data to the

23:48

LLM, see what it gets back. And it was kind of,

23:51

it wasn't even kind of understanding the time series,

23:53

like it couldn't even get the order of the data

23:55

itself. And so there's been a few iterations of that,

23:58

like playing around with prompt engineering and giving

24:00

it all the hints it needs to do it. And

24:03

it's actually, it's kind of surprisingly working

24:06

but it works technically, but like when you, and

24:08

it works and it makes sense, but when you look

24:10

at it then and take a higher-level picture as a

24:12

human, it's actually not that useful at all because the

24:14

anomalies that the LLM comes up with are technically often

24:16

they are anomalies, but they're not anomalies that you would

24:18

care as a human if you were eyeballing the data.

24:20

And so it's tricky. That was like, it was fun

24:23

to do all that. And I kind of mainly did

24:25

that just as a sort of

24:27

a joke almost, but that was something that

24:29

I think it's kind

24:31

of interesting to see, but it's being

24:33

definitely, I've just turned them all off today by default.

24:36

It's an optional kind of job that you can turn

24:38

on. It's a little bit of a dead end. I

24:40

don't think it's as useful as you

24:42

might think it is. For people who

24:44

are interested in testing it out, getting

24:46

it deployed, as we already discussed, there

24:49

is a very easy on-ramp, but for

24:52

people who want to then go from, okay,

24:54

I've tested it out, it seems interesting. Now

24:56

I want to run it in production. What

24:58

does that journey look like? And what are

25:00

some of the considerations and potential sharp edges

25:02

that people need to be thinking about as

25:04

they go from proof of concept to this

25:06

is business critical now. So there's a couple

25:08

of ways to use it. You can, the

25:10

repository itself is a GitHub template. So you

25:12

can kind of actually, you can clone, you

25:14

can obviously of course clone the repository, but you can

25:17

use the GitHub template to make a copy of it.

25:19

And then once you have that GitHub template repository, you

25:21

can then use that for your metrics and deploy

25:23

it wherever you want through Dagster

25:25

Cloud or just using your own kind

25:27

of CI CD and Docker Compose. So

25:31

some of the sharp edges are probably like this. There's

25:33

still, it's still very immature project. It's still

25:35

very, very young. I just finished like the

25:37

first set of proper tests today. So there's

25:40

always like, this is something that comes with

25:43

these open source projects as well is that

25:45

it's, especially when they're young like this, that

25:48

definitely is, I would take it with a pinch of salt

25:50

in terms of like, you're better off

25:52

like dogfooding gently on

25:55

stuff that's not production. And then once you're

25:57

comfortable with that, then you go from there.

25:59

So like, that's what I'm doing

26:01

at the moment. I'm kind of dogfooding

26:03

as we go. And so there's a little bit

26:05

like it's still, there's a small little bit of

26:07

infrastructure in terms of, okay, how

26:09

are you gonna run these Docker containers? How

26:12

are you gonna monitor them? How are you

26:14

gonna have availability, things like that. These are typical

26:16

enough kind of considerations with tools like

26:18

this. So there's still, there is still

26:21

a couple of kind of, it's not completely

26:23

hands-free. It's not completely painless,

26:25

not yet, but the aim is to be

26:27

as painless as possible, basically. And

26:29

so there's definitely some typical kind of sharp

26:31

edges there in terms of like, we

26:34

don't have necessarily a standard deployment

26:36

or standard installation yet we've

26:38

given as many as possible. So you can use

26:40

Docker or you can use a local Python environment

26:43

yourself, or you can then use the serverless options

26:45

as well. And so we're kind of

26:47

waiting to see which approaches people

26:49

are most comfortable with as well. Data

26:53

lakes are notoriously complex. For

26:56

data engineers who battle to build and

26:58

scale high quality data workflows on the

27:00

data lake, Starburst powers petabyte scale SQL

27:03

analytics fast at a fraction of the

27:05

cost of traditional methods so that you

27:07

can meet all of your data needs,

27:09

ranging from AI to data applications to

27:12

complete analytics. Trusted by teams of all

27:14

sizes, including Comcast and DoorDash, Starburst is

27:16

a data lake analytics platform that

27:18

delivers the adaptability and flexibility a

27:20

lakehouse ecosystem promises. And

27:23

Starburst does all of this on an

27:25

open architecture with first-class support for Apache

27:27

Iceberg, Delta Lake and Hudi. So

27:30

you always maintain ownership of your data. Want

27:33

to see Starburst in action? Go

27:35

to dataengineeringpodcast.com/Starburst and get $500

27:37

in credits to try Starburst

27:39

Galaxy today. The easiest and

27:41

fastest way to get started

27:43

using Trino. There

27:46

are multiple different flavors of open source

27:49

projects where sometimes people just want to

27:51

produce something out in the open, but

27:53

they don't really care about getting contributions.

27:56

There's the corporate open source where we're going to

27:58

release this because it furthers our business.

28:00

And if you happen to get use

28:03

out of it, that's great. And then

28:05

there are the open source projects that

28:07

are intended to be maintained and grown

28:09

by community. And I'm wondering what your

28:11

thoughts are on how you're approaching this

28:14

this particular project? Are you looking for

28:16

contributions? Are you just looking for feedback?

28:19

I'm wondering what types of engagement and

28:21

community you're looking to build around in

28:23

ways that folks can contribute and help

28:25

you out with this? Yeah, you

28:27

know, I'm always looking for contributions.

28:30

I would love some contributions. And kind

28:32

of I don't necessarily have like a

28:34

software engineering background myself. So that's always

28:37

been sort of a fear I've had

28:39

around the imposter syndrome and stuff like that.

28:41

So I would love if somebody came with

28:43

a contribution that completely showed me, Oh, you

28:45

know, your tests are all wrong, or you

28:47

can do something better. Or like, here's more,

28:49

here's better abstractions we can use. There's definitely

28:51

like room for improvements across the board. And

28:53

so I would love contributions. And that's been

28:55

the aim of like keeping it as simple

28:57

as possible, where, you know,

28:59

everything is basically all the main concepts

29:01

are you have like a metric batch,

29:04

which is just the definition of

29:06

your metrics. And then you have jobs

29:08

which are like, you know, ingest, train, score, alert,

29:11

and then under the hood when

29:13

you're looking at the code, really, it relies heavily

29:15

on pandas data frames, and every job basically,

29:17

you know, produces a pandas data frame, or it

29:19

takes in a pandas data frame and produces a

29:21

pandas data frame. So it's quite easy to reason

29:23

about. And so that's the idea is that

29:25

like, if you're someone that's comfortable enough

29:28

as a Python developer, like it's a perfect

29:30

project to do, you know, your first open source

29:32

contributions on as well, which should be really

29:34

fun, like, and for people

29:37

who are looking to get engaged

29:39

with the project, and maybe they

29:41

don't necessarily want to modify the

29:44

core of what you're building, but

29:46

they are interested in extending or

29:48

augmenting its capabilities, what are some

29:50

of the interfaces that you've built

29:52

in to make it open for

29:54

extension and customization and adapting to

29:56

a particular customer or operating

29:59

environment? Yeah, so that was a good

30:01

example of where I haven't tried to be too complicated

30:03

from the start. So obviously we support,

30:05

you know, BigQuery, Snowflake, DuckDB, a couple

30:07

of other databases. And I didn't, originally

30:10

I was thinking like, okay, do I

30:12

need to build some fancy plugin architecture,

30:14

a plugin system where somebody could bring

30:16

their own plugin? And I

30:18

said, I decided not to do that because probably

30:20

it's at the edge of my capability, but also

30:22

it makes it harder to contribute on

30:25

as well. So the way the approach would be

30:27

at the moment, for example, I'm working on Redshift

30:29

at the moment and adding Azure Blob

30:31

Storage. And, you know,

30:33

just make a fork, make

30:35

a PR, and everything's kind of

30:37

easily testable. And so that's where

30:40

we haven't gone. It's not as complicated yet in terms

30:42

of like taking, say, something like the

30:44

Airflow approach where you have plugins that you can provide,

30:46

you can install separately dependencies and stuff like that. We

30:48

haven't kind of, we're not, it's

30:50

not sort of taking that approach yet,

30:53

mainly for that goal to have like the, as

30:55

low a barrier as possible to contribution. And, but

30:57

definitely at some stage, if, you know, if the

30:59

project does become more mature and stuff like that,

31:02

then yeah, like that would be something that I

31:04

would imagine would be refactored at some stage. Digging

31:07

more into the, I'm using

31:09

this, I'm running it. I want to

31:11

feed in these different metrics. You mentioned

31:13

that it has support for pulling from

31:15

databases, running Python scripts. I'm wondering if

31:17

you can talk a little bit more

31:19

about the process of producing the metrics

31:21

that Anomstack is going to

31:23

work from and the

31:26

overall flow of data

31:28

in evaluation, alert out, or,

31:30

you know, ignore because there's

31:32

nothing to alert on. Yeah.

31:34

Yeah. So like the main approach there, the

31:37

inputs are, there's a metrics folder basically in

31:39

the root of the project. And even in

31:41

the metrics folder, then you have, you

31:44

can, you can have a folder for, you know, each

31:46

subject area or each metric batch, or you can kind

31:48

of do, you can, you can organize the metrics however

31:50

you want, as long as they're in the metrics folder.

31:52

And then all a metric batch is,

31:55

is some ingest SQL. So

31:57

there's a template that you just define

31:59

an ingest SQL file, which is

32:01

basically just whatever

32:03

SQL you want to use

32:05

to generate your metrics. And

32:08

so basically, this is SQL that generates a

32:10

table which just has a metric name, a

32:12

metric value, and a metric timestamp. That's all

32:14

that's required. So once you have that, then

32:18

that's the basis for the ingestion. And then

32:20

there's also then a YAML configuration file. And

32:22

the YAML configuration file has all the other

32:24

things like schedules and parameters for the models.

32:26

And again, you don't have to fill any

32:28

of them. You can kind of just leave

32:30

that file pretty much empty and it'll use

32:32

the defaults. There's also like a default YAML

32:34

that you can edit your defaults as well.
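
For instance, a metric batch's ingest SQL might boil down to something like this (made-up table and metric names, shown here as a Python string for convenience; the accompanying YAML config can stay nearly empty and fall back to the defaults):

    # Roughly what might live in a file like metrics/sales/sales.sql
    # (all names here are hypothetical).
    ingest_sql = """
    select
        'daily_sales'       as metric_name,
        current_timestamp   as metric_timestamp,
        sum(order_total)    as metric_value
    from analytics.orders
    where order_date = current_date
    """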

32:37

So the idea is you just bring your ingest logic,

32:41

basically. And you

32:43

can use an ingest

32:46

SQL function, or you can actually, if you want,

32:48

you can also use your own, you

32:50

can make a custom Python function. So all you have to

32:53

define if you're doing something that maybe say you're

32:55

scraping metrics from a website or from some public

32:57

metrics, or even it doesn't, it could be anywhere.

32:59

But if it's a Python function, you can then

33:01

also just use it. You can just bring your

33:03

own Python function as long as that Python function

33:05

generates a kind of data frame that then has

33:07

those same three columns, metric name, metric value,

33:09

metric timestamp, that works as well.
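
A hedged sketch of such a custom Python ingest function (the API endpoint and names are made up; the only real requirement is the three-column DataFrame it returns):

    import pandas as pd
    import requests

    def ingest() -> pd.DataFrame:
        # Hypothetical example: pull one number from a public API and wrap it
        # in the metric_name / metric_timestamp / metric_value shape.
        resp = requests.get("https://api.example.com/top-story-score", timeout=10)
        score = float(resp.json()["score"])
        return pd.DataFrame(
            {
                "metric_name": ["top_story_score"],
                "metric_timestamp": [pd.Timestamp.now(tz="UTC")],
                "metric_value": [score],
            }
        )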

33:12

So we have all the examples in the repository

33:14

that do that. Like there's examples that pull metrics

33:16

from Hacker News and Weather Metrics and Yahoo Finance

33:18

and all that sort of stuff. And once you

33:20

have that, then you can obviously, you can customize

33:23

anything under the hood. There's default templates. So like

33:25

there's a default template for the pre-processing function that

33:27

the ML uses. You don't ever have to worry

33:29

about that, but if you want to, you can

33:32

bring your own for each individual metric batch. Likewise,

33:34

for the

33:37

alert logic, you can also define your

33:39

own alert SQL template if you want, or

33:41

you can edit the default one there.

33:43

So the idea is once you bring your

33:45

ingest logic and your configuration, then

33:47

that will trigger off everything. So the ingest

33:50

jobs, the train jobs, the score jobs, and

33:52

then all that's happening behind the scenes is

33:54

it's going to kind of run that ingest script, save

33:57

the results onto a metrics table, and as it does

34:01

the scoring, it also saves the

34:03

scores on the metrics table, and as it does the

34:05

alerting, it also saves the alerts on

34:05

the metrics table. So this all then just

34:07

becomes kind of orchestration that's reading to and

34:09

writing from this metrics table in your warehouse

34:12

basically, which could be Snowflake, BigQuery, whatever. And

34:14

this is like a long format metrics table where each

34:16

row is basically a new metric. So it's kind of

34:18

easy to think about as well because as you add

34:21

new metric batches, you're just appending on to the end

34:23

of that table. Or you can of course also have

34:25

like, if you want you can have different metric batches

34:27

go into different metric tables, that's all flexible. But it's

34:29

easiest to think about just starting with

34:31

one single metrics table that Anomstack is reading

34:33

from and writing to. And that kind

34:35

of becomes then the, that's the

34:38

actual heart of what's going on here basically. And

34:40

you can plug that into your own tools as well. So if you have your

34:42

own BI tools, or your own alert

34:44

tools, or anything like that, that

34:47

then it's just another table in your data

34:49

warehouse. So you can kind of use it like anything else basically.

34:52

And recognizing that it's still a very

34:54

early project that you are still working

34:56

on gaining visibility and getting feedback. I'm

34:58

wondering what are some of the most

35:00

interesting or innovative or unexpected ways that

35:02

you've seen Anomstack used so

35:05

far? Well, it was a couple

35:07

of weeks back, funnily enough. So

35:09

one of the examples we use in the

35:11

examples, yeah, out of the box examples is

35:13

like Hacker News, it scrapes the top,

35:16

the scores from the Hacker News top

35:18

stories, you know, I mean, and I was like,

35:21

as soon as all the Sam Altman drama with

35:23

open AI kicked off, I was kind of crossing

35:25

my fingers thinking, oh my God, this has to,

35:27

they'll get picked up. If this isn't picked up

35:30

by the example job, I'll kind of

35:32

have egg on my face. And funnily enough, it

35:35

was, as soon as all that kicked off, Hacker

35:37

News exploded. And, you know, I had anomalies

35:39

straight away from the Hacker News jobs. And

35:41

I've put them into the gallery. There's a little

35:44

gallery folder in the repository as well that has

35:46

examples of like real anomalies that I've been

35:48

using it on real data. And

35:50

there's a Sam Altman fired HN

35:52

explodes.png in there as well. I

35:55

was happy to have, but yeah, that was like, it's

35:58

interesting as well just recently with sort of. We're

36:00

also doing, looking at stock prices and stuff as well,

36:02

like just trying to get a wide range of as

36:04

many examples as possible to get like realistic data. And

36:06

just the other day, I noticed all of the tech

36:08

stocks were down a couple of points based

36:11

on the Yahoo Finance job. And I actually Googled it

36:14

and was like, yeah, actually they were all down. I

36:16

thought it was a problem. I thought something was going

36:18

wrong somewhere, but actually, you know, it was valid. That's

36:21

an interesting use case as well, where maybe

36:23

it's not business metrics that you care about.

36:25

Maybe it's just personal curiosities and you can

36:27

build your own sort of Google trends style

36:30

of, hey, I want to know if something changes

36:33

in this particular ecosystem, as long as there's some

36:35

sort of API you can hit, then you can

36:37

build your own personal anomaly dashboard about what are

36:39

the anomalous things happening in the world today? Yeah,

36:42

yeah, no, and that's actually Google trends is another

36:44

example. We have a Google trends example as well.

36:47

So I'm kind of constantly building out this example

36:49

folder within the metrics folder, so that you can,

36:51

and you can turn them off as well, like

36:53

so you can, but they're just, they're useful to

36:56

kind of be realistic types of examples

36:58

that people can look at as well. Yeah,

37:00

it's definitely a very cool project in that way, where as you

37:03

mentioned, there are anomaly detection tools. A

37:06

lot of times though, they're very coupled

37:08

to the product that they're trying to

37:10

generate the alerts from. So Datadog has

37:13

some anomaly detection. I

37:15

know that the Grafana cloud product has

37:17

some ML capabilities for alerting on anomalies,

37:19

but again, all of those are very

37:21

tightly coupled to the ecosystem that they're

37:23

built for, whereas this is a little

37:25

bit more open-ended of, as long as

37:27

you can get data somewhere, we can

37:29

let you know if something is weird. Yeah,

37:31

and that was almost as well. One of

37:33

the kind of design principles here was to

37:35

have no UI and have like, it's all

37:37

basically config based and GitHub based so

37:39

that, it's what we're used to

37:41

working in as like data engineers. And it's lower

37:44

overhead. We don't have some crazy management UI

37:46

and admin console that you have to go

37:48

and click around and configure stuff. It's all

37:50

kind of your metrics as code basically, and

37:52

everything as code, and that kind of helps

37:55

make it easier to, if you want to add

37:57

new metrics, it's just a PR, and then no problem, you know.

38:00

Absolutely. And in your experience of building

38:02

this project, publishing it to the community,

38:04

looking for feedback, what are some of

38:06

the most interesting or unexpected or challenging

38:09

lessons that you've learned in the process?

38:11

So it's been fun actually, I had

38:13

to learn quite a lot about Dagster.

38:16

Dagster is really at the heart of it doing

38:18

all the orchestration, so I had to go quite

38:21

deep in terms of getting familiar with even

38:23

some edge cases and stuff around how

38:25

Dagster works and all the different configurations to

38:27

be able to support like running locally

38:30

in your own Docker versus Dagster Cloud

38:32

versus a Python environment. There's a few

38:34

different kinds of considerations there. That's

38:36

kind of been fun and been interesting

38:40

to start from new, and new technology is

38:42

always fun, especially all these modern data stacks

38:45

technologies. It's overwhelming, there's so many of them

38:47

that it's almost too much sometimes and you

38:49

kind of just put the blinkers on. But

38:51

it's been good to have an excuse to

38:53

actually then take one, just pick one and

38:56

use it, and that's

38:58

been useful. And yeah, also as well, just

39:00

my own capabilities. I would say actually I

39:02

should preface, like, probably another part is that

39:04

projects like this are now actually easy to

39:06

do because we have all these tools that

39:08

we can use. And once you kind of

39:10

know enough to put the ingredients together, I've

39:13

also been using, you know, Copilot and Chat

39:15

GPT to help a lot with the code

39:17

as well. Like it's crazy how much more

39:19

productive you can be these days, especially with

39:21

an open source project like this, where it's

39:24

like you can develop fully in the open.

39:26

You don't have to be worried about anything

39:28

confidential or anything like that. You're just unconstrained

39:30

actually use these tools. And yeah, it's

39:32

been like I'd say probably 30% of

39:35

the code in parts has been at least

39:37

inspired by Copilot and Chat

39:39

GPT. So that's been really interesting because if you, it's like,

39:41

you know, when you used to ask for help on Stack

39:43

Overflow, you have to spend a lot of time reproducible

39:46

examples and ask the question in the right way

39:48

and show your work and things like that. Same

39:50

thing applies for, you know, the language models. And

39:52

once you do that, they can actually be ridiculously

39:54

useful. So it actually, it hasn't been half as

39:56

much work as I thought it would be because,

39:58

you know, we have, all the tools that

40:01

we're using are quite easy to work with.

40:03

And then like this assist of, you know,

40:05

copilot type approach, it just means, you

40:07

know, if I have an idea, I can make an idea and then spec

40:10

the idea out and actually get it done, probably, you know,

40:12

in half the time that it would have taken originally. So

40:14

that just means you've got more time, you get more done

40:16

with it, you know, if, you know, you have the time to

40:18

focus on a project like this, you can just get

40:20

so much more done for it, you know.

40:23

And for people who are interested in

40:25

Anomstack, they want to start to incorporate

40:27

some measure of anomaly detection on their business

40:29

metrics. What are the cases where it's the

40:31

wrong choice? Yeah, so I think probably the

40:33

main cases there would be if it's like

40:35

low latency, you know, per second, like that's

40:37

some of the stuff that we've done with

40:39

NetData, it's all infrastructure per second metrics, you

40:41

know, thousands of metrics a second. That's a

40:43

completely different domain where you have like just

40:45

different design challenges. And so Anomstack

40:47

wouldn't be right for anything like

40:49

that. And it's more typically like, you know,

40:52

hourly metrics. I do have like

40:54

10 minute metrics and things like that.

40:57

But anything below, anything too near real time, it

40:59

wouldn't make sense. In a situation like that, you're

41:01

in more of a data observability situation where

41:03

things like Prometheus and that sort of thing

41:05

would be more useful. But the other

41:09

use case would be, I guess, if you have

41:11

scale, like if you've got thousands and thousands and

41:13

thousands of metrics, I'm not sure

41:15

how well that would work. You

41:17

know, how well, say, Dagster running in a container,

41:19

how well that would scale to if we had like hundreds

41:21

and hundreds of metric batches, I reckon that'd

41:23

be a nice problem to have, if we ever get that

41:26

far, we have that problem. But I would say that's probably

41:28

another issue, where I would say it's not right

41:30

for you. And then also, like, if you're not sort of, if

41:32

you're not comfortable enough with

41:34

sort of running a Docker app, basically, then it's

41:36

a good excuse to learn, it's a good chance

41:38

to kind of get your hands dirty. And it's

41:40

not as painful as like things

41:43

used to be. But also, that's something that like, it's,

41:45

there's a little bit of consideration there in terms

41:47

of like, are you comfortable enough running this yourself?

41:49

Or obviously, like you can use Dagster

41:51

Cloud, you know, if you have a Dagster

41:53

Cloud account, that works as well. But yeah, in

41:56

a situation like that, probably not quite the right

41:58

option. Also, if you're using Airflow, if you already have

42:00

an Airflow, you should probably look at the Airflow

42:02

anomaly detection provider, which is a different project that

42:04

I maintain. That would be really cool to get

42:06

some get some love in there at the moment

42:09

as well. Because that one only has, I've only

42:11

really set it up for for BigQuery. But you

42:13

know, obviously there's all the different types of operators

42:15

and all this stuff already exists in Airflow. So

42:17

it's not that hard to actually use them. It's

42:19

just if somebody is motivated to, you know, come

42:21

and use it, then they might as well

42:23

actually use the Airflow that they already

42:25

have, you know. And as you

42:27

continue to build and iterate on the Anomstack

42:29

project, as you work to onboard more

42:31

contributors, what are some of the things you

42:34

have planned for the near to medium term

42:36

or any particular projects you're excited to dig

42:38

into? Yeah, so there's a

42:40

couple of open issues in the repository with ideas. And

42:42

I'm just kind of throwing issues in all the time.

42:44

And one thing I want to do, I have

42:47

a feature request open for TimeGPT.

42:49

These LLM approaches are still kind of shaking

42:51

out. There's TimeGPT,

42:54

which is a new sort of time

42:56

series friendly large language model. And

42:58

I'm hoping to see if I can start to use it. It's still

43:00

sort of in a closed beta. So I'm hoping to get access

43:02

to that and actually see if we can use it, so that

43:05

might actually be more useful. And

43:07

also there's a few things around wanting

43:09

to let the user

43:12

run multiple models. So like at the moment,

43:14

for each metric, you define one model.

43:17

And the default model is this PCA-based model. But

43:19

actually, really, maybe you want to define like three or

43:21

four different models and actually just let

43:23

them run for a week or two. And

43:25

then you can actually see, okay, as the metric comes in,

43:28

how do the anomaly scores behave and which ones work best

43:30

for this metric. So there's definitely a whole load of stuff

43:32

where we could make the ML part of this easier as

43:34

well, I think. So if you could run multiple models, and

43:36

then over time pick the best one, that would be good. Or if

43:38

we could do some sort of way

43:40

where you could benchmark and simulate your metrics on different models

43:42

that could help with the ML part, I think that could

43:44

be really useful as well. Because that's always the challenge,

43:46

it's very hard, there is no one size fits all

43:48

model, and it can sometimes take a bit of iteration as

43:50

well. So if we could take that pain away for someone

43:52

else, that could be really useful as well. It'd be kind of

43:54

fun to work on as well.
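
As a rough illustration of that multiple-models idea, here is a minimal sketch, assuming the PyOD library and some made-up lag features, of fitting a few candidate detectors on the same metric and comparing how strongly each one scores an injected spike; it is illustrative only, not Anomstack's actual implementation.

```python
# Hedged sketch: compare several candidate anomaly detectors on one metric.
# PyOD and the lag-feature engineering below are assumptions for illustration,
# not Anomstack's real code.
import numpy as np
import pandas as pd
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pyod.models.knn import KNN

# Toy hourly metric with one injected spike we hope each model flags.
rng = np.random.default_rng(42)
y = rng.normal(100, 5, size=500)
y[450] = 160

# Turn the series into simple lag features (hypothetical feature engineering).
df = pd.DataFrame({"y": y})
for lag in range(1, 6):
    df[f"lag_{lag}"] = df["y"].shift(lag)
df = df.dropna()
X = df.values

candidates = {"pca": PCA(), "iforest": IForest(random_state=42), "knn": KNN()}
spike_idx = 450 - 5  # offset by the rows dropped when building lags
for name, model in candidates.items():
    model.fit(X)
    scores = model.decision_scores_  # raw anomaly score per training row
    z = (scores[spike_idx] - scores.mean()) / scores.std()
    print(f"{name}: spike scored {z:.1f} standard deviations above its mean")
```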

43:56

given the time series nature of the data, it

43:58

might also be interesting to bring in some

44:01

sort of time series predictive capability, whether that's

44:03

using the Prophet library or, I think there's

44:05

another one, Greykite, there are a number

44:07

of them out there now to say, this

44:09

is the current trend line. If this continues,

44:12

then this will maybe trigger an anomaly

44:14

and so here's some kind of preemptive alerting

44:16

of something to keep an eye out for.
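
A minimal sketch of that preemptive-alerting idea, assuming the Prophet library and a made-up hourly metric and threshold (none of this is something Anomstack ships today), could look like this:

```python
# Hedged sketch: forecast the next day of an hourly metric with Prophet and
# alert if the projected trend is likely to cross a hypothetical limit.
import pandas as pd
from prophet import Prophet

# Prophet expects a history with columns "ds" (timestamp) and "y" (value).
history = pd.DataFrame({
    "ds": pd.date_range("2023-11-01", periods=24 * 30, freq="H"),
    "y": [float(i) for i in range(24 * 30)],  # stand-in for a trending metric
})

m = Prophet()
m.fit(history)

# Forecast 24 hours ahead and look at the upper uncertainty band.
future = m.make_future_dataframe(periods=24, freq="H")
forecast = m.predict(future)[["ds", "yhat", "yhat_upper"]].tail(24)

THRESHOLD = 730.0  # hypothetical business limit
breaches = forecast[forecast["yhat_upper"] > THRESHOLD]
if not breaches.empty:
    print(f"Projected to breach {THRESHOLD} around {breaches['ds'].iloc[0]}")
```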

44:18

Yeah, yeah, and there's

44:21

lots of other kinds of ML context that we could bring

44:23

into this. Forecasting is an obvious

44:25

one as well, but then there's also like change detection

44:27

is another one where sometimes what you're interested in is

44:30

a sudden change, even if it's not an

44:32

anomaly. Like maybe sudden changes happen, they happen all the

44:34

time, but you know, they're not gonna be flagged

44:37

as anomalous because the ML is gonna look at

44:39

those shifts as like, oh well, steps happen every

44:41

now and then, but actually if you have a

44:43

real focused area where you're interested

44:45

in, okay, what happened last night, something went

44:47

wrong, what you really wanna ask

44:49

a lot of times there is, okay, change detection, show me

44:52

the metrics that had a sudden change, and

44:54

that's like a different use case where it's like a

44:56

subset of anomaly detection, it's not quite the same, a little

44:58

bit different. So there's all these other kind of little

45:00

ML, time series based, you know,

45:02

ML use cases that we could for sure build

45:05

in, like, over time. That would be interesting.
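
To make the distinction concrete, here is a small change-point-detection sketch using the ruptures library, which is assumed here purely for illustration rather than being part of Anomstack: it surfaces the point where a metric steps to a new level even though no single observation looks anomalous.

```python
# Hedged sketch: change-point detection on a metric that shifts level halfway
# through. The ruptures library and the toy data are illustrative assumptions.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# A metric that steps up at index 200: exactly the kind of sudden shift you
# want surfaced after an incident, even if no single point is an outlier.
signal = np.concatenate([rng.normal(10, 1, 200), rng.normal(14, 1, 200)])

# PELT with an RBF cost finds an unknown number of change points.
algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=10)  # penalty trades sensitivity vs. noise
print("Detected change points at indices:", breakpoints[:-1])  # last entry is len(signal)
```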

45:07

Are there any other aspects of the Anomstack

45:09

project or this overall space of business metrics

45:11

and anomaly detection that we didn't discuss yet

45:14

that you'd like to cover before we close

45:16

out the show? No, no,

45:18

just I definitely think it's an interesting time,

45:20

especially like, you know, there's

45:22

a lot of modern data stack stuff, there's lots

45:25

of stuff going on, it's crazy, but

45:27

I do think technology is catching up, you

45:30

know, in terms of actually the metadata and making

45:32

sense of, you know, what's

45:34

going on in your data, that's like the

45:37

hard part, we have all the plumbing, we have all the

45:39

flows, we have all the details, it's just

45:41

how do you actually make sense of like what things

45:43

matter the most, that's still sort of an open problem

45:45

that I think, now like a

45:47

lot of these kind of AI,

45:50

and that's the first time I think I've said AI,

45:53

I cringe every time I say AI, but actually

45:55

this is one case where like, it actually will, I

45:57

think really be useful over the next couple of years and like making

45:59

sense of all of the crazy business

46:01

metrics and data that companies have. All

46:04

right. Well, for anybody who wants to get in

46:06

touch with you and follow along with the work

46:08

that you're doing or contribute to the project, I'll

46:10

have you add your preferred contact information to the

46:12

show notes. And as the final question, I'd like

46:14

to get your perspective on what you see as

46:17

being the biggest gap in the tooling or technology

46:19

that's available for data management today. I

46:22

think possibly the biggest gap

46:24

is just the complexity of the space. I'm

46:26

still not sure where I sit

46:28

on this as well. So there's point solutions

46:31

that kind of focus on one thing and

46:33

do one thing well. And then there's all

46:35

these platform options. And I

46:37

think that's the biggest complication now is just

46:39

navigating the space in terms of how do

46:41

you compose things together? There's

46:45

still work going on around standards and stuff

46:47

like Open Lineage and all these kind of

46:49

standards that are trying to become a glue for

46:51

all these different solutions. But I think that's the

46:54

biggest challenge is actually how do you just

46:57

put things together, or do you just try and go with

47:00

a big cloud provider

47:02

and use whatever they have. That's

47:04

probably the biggest gap I see. Absolutely.

47:07

All right. Well, thank you very much for taking the

47:10

time today to join me and share

47:12

the work that you've been doing on the Anomstack

47:14

project and for building it in

47:16

the first place. It's definitely a very cool

47:18

project. Definitely excited to try that out for

47:20

my own data platform and explore the possibilities

47:23

that that opens up. So I appreciate all the

47:26

time and energy you've put into that and

47:28

for taking the time today. And I hope you enjoy the rest of your day.

47:31

Thanks. Thanks. Thanks a lot for having me on.

47:33

I'm a big fan of the show and anyone

47:35

else who's interested, just come check out the repo

47:37

and make some issues, make some discussions. I will

47:39

be delighted to have people come along and say hi.

49:10

Thank you for listening. Don't forget to

49:12

check out our other shows, Podcast.__init__, which

49:14

covers the Python language, its community and

49:16

the innovative ways it is being used.

49:18

And the Machine Learning Podcast, which helps

49:21

you go from idea to production with

49:23

machine learning. Visit the site at dataengineeringpodcast.com

49:25

to subscribe to the show, sign up

49:27

for the mailing list and read the

49:29

show notes. And if you've learned something

49:31

or tried out a product from the show, then tell us about

49:33

it. Email hosts at

49:36

dataengineeringpodcast.com with your story. And

49:38

to help other people find the show, please leave

49:41

a review on Apple Podcasts or tell your

49:43

friends and followers.
